| Crates.io | oaxaca_blinder |
| lib.rs | oaxaca_blinder |
| version | 0.2.2 |
| created_at | 2025-09-16 01:05:49.008425+00 |
| updated_at | 2025-12-18 02:06:21.111483+00 |
| description | A Rust library for performing Oaxaca-Blinder decomposition on Polars DataFrames, with support for categorical variables and bootstrapped standard errors. |
| homepage | |
| repository | https://github.com/dot-comma-hyphen/oaxaca-blinder-rs |
| max_upload_size | |
| id | 1840751 |
| size | 372,601 |
A high-performance Rust library for performing Oaxaca-Blinder decomposition, designed for economists, data scientists, and HR analysts. It decomposes the gap in an outcome variable (like wage) between two groups into "explained" (characteristics) and "unexplained" (discrimination/coefficients) components.
Beyond standard decomposition, it supports Quantile Decomposition (RIF & Machado-Mata), AKM (Abowd-Kramarz-Margolis) Models, Propensity Score Matching, DFL Reweighting, and Budget Optimization for policy simulation.
| Feature | Support |
|---|---|
| OLS Mean Decomposition | Yes |
| Quantile Decomposition (Machado-Mata) | Yes |
| Quantile Decomposition (RIF Regression) | Yes |
| Categorical Normalization (Yun) | Yes |
| Bootstrapped Standard Errors | Yes |
| Budget Optimization Solver | Yes |
| JMP Decomposition (Time Series) | Yes |
| DFL Reweighting (Counterfactuals) | Yes |
| Sample Weights | Yes |
| Heckman Correction (Selection Bias) | Yes |
| AKM (Worker-Firm Fixed Effects) | Yes |
| Matching (Euclidean, Mahalanobis, PSM) | Yes |
Most economists rely on the oaxaca R package or statsmodels in Python. While excellent, they have limitations that this library addresses:
- In R, the same workflow takes several packages: oaxaca for decomposition, rifreg for quantiles, MatchIt for matching, and lfe for AKM. In Python, statsmodels lacks built-in RIF, Matching, and AKM. This library unifies all of them into a single, consistent API.
- […] NaN propagation) that can plague dynamic languages.

Don't want to write Rust code? You can use the oaxaca-cli tool directly from your terminal to analyze CSV files.
cargo install oaxaca_blinder --features cli
Basic Decomposition:
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
--predictors education experience --categorical sector
Using R-style Formula:
oaxaca-cli --data wage.csv --group gender --reference F \
--formula "wage ~ education + experience + C(sector)"
With Sample Weights (WLS):
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
--predictors education experience \
--weights sampling_weight
With Heckman Correction (Selection Bias):
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
--predictors education experience \
--selection-outcome employed \
--selection-predictors education experience age marital_status
Export Results:
oaxaca-cli --data wage.csv ... --output-json results.json --output-markdown report.md
The CLI supports --analysis-type mean (default), --analysis-type quantile, and --analysis-type match.
Add to Cargo.toml:
[dependencies]
oaxaca_blinder = "0.2.2"
polars = { version = "0.38", features = ["lazy", "csv"] }
use polars::prelude::*;
use oaxaca_blinder::{OaxacaBuilder, ReferenceCoefficients};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let df = df!(
"wage" => &[25.0, 30.0, 35.0, 40.0, 45.0, 20.0, 22.0, 28.0, 32.0, 38.0],
"education" => &[16.0, 18.0, 14.0, 20.0, 16.0, 12.0, 14.0, 16.0, 12.0, 18.0],
"gender" => &["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
)?;
let results = OaxacaBuilder::new(df, "wage", "gender", "F")
.predictors(&["education"])
.reference_coefficients(ReferenceCoefficients::Pooled)
.run()?;
results.summary();
Ok(())
}
import oaxaca_blinder
results = oaxaca_blinder.decompose_from_csv(
"wage.csv",
outcome="wage",
predictors=["education", "experience"],
categorical_predictors=["sector"],
group="gender",
reference_group="F",
bootstrap_reps=100
)
print(f"Total Gap: {results.total_gap}")
print(f"Unexplained: {results.unexplained}")
"The Cheapest Fix"
This unique feature is designed for HR analytics. It answers: "Given a limited budget, how can we reduce the pay gap as much as possible?"
It identifies individuals in the disadvantaged group with the largest negative unexplained residuals (i.e., the most "underpaid" relative to their qualifications) and calculates the optimal raises.
// Scenario: You have $200,000 to reduce the gap to 5%
let adjustments = results.optimize_budget(200_000.0, 0.05);
for adj in adjustments {
println!("Give ${:.2} raise to employee #{}", adj.adjustment, adj.index);
}
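To make the idea concrete, here is a minimal greedy sketch. It is not the library's solver: the `Adjustment` struct only mirrors the two fields used above, `greedy_raises` is a hypothetical helper, and the rule "close the most negative residual first" is an assumption for illustration. It also ignores the target-gap argument.

```rust
/// Hypothetical record of a raise: row index and amount.
#[derive(Debug, PartialEq)]
struct Adjustment {
    index: usize,
    adjustment: f64,
}

/// Greedy illustration: `residuals[i]` is actual pay minus predicted pay
/// for employee `i`, so negative values mean "underpaid".
fn greedy_raises(residuals: &[f64], budget: f64) -> Vec<Adjustment> {
    // Keep only underpaid employees and sort them, most underpaid first.
    let mut underpaid: Vec<(usize, f64)> = residuals
        .iter()
        .copied()
        .enumerate()
        .filter(|&(_, r)| r < 0.0)
        .collect();
    underpaid.sort_by(|a, b| a.1.total_cmp(&b.1));

    let mut remaining = budget;
    let mut raises = Vec::new();
    for (index, residual) in underpaid {
        if remaining <= 0.0 {
            break;
        }
        // Close the residual fully if the budget allows, else spend what is left.
        let adjustment = (-residual).min(remaining);
        remaining -= adjustment;
        raises.push(Adjustment { index, adjustment });
    }
    raises
}
```

With residuals `[-3000.0, 500.0, -1000.0, -2000.0]` and a budget of `4500.0`, this sketch gives employee 0 a raise of 3000 and employee 3 the remaining 1500.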
The library supports two robust methods for decomposing the wage gap across the distribution:
| Method | Best For... | Builder |
|---|---|---|
| Machado-Mata (Simulation) | Constructing full counterfactual distributions and "glass ceiling" analysis. | QuantileDecompositionBuilder |
| RIF Regression (Analytical) | Fast, detailed decomposition of specific quantiles (e.g., "Why is the 90th percentile gap so large?"). | OaxacaBuilder::decompose_quantile(0.9) |
// Fast decomposition of the 90th percentile gap
let results = OaxacaBuilder::new(df, "wage", "gender", "F")
.predictors(&["education", "experience"])
.decompose_quantile(0.9)?;
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
--predictors education experience \
--analysis-type quantile --quantiles 0.1,0.5,0.9
Note: Python bindings for quantile decomposition are coming soon.
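RIF regression replaces the outcome with its recentered influence function, RIF(y; q_τ) = q_τ + (τ − 1{y ≤ q_τ}) / f(q_τ), and then decomposes that transformed outcome like a mean. The sketch below computes the transformed outcome only. The helpers `kde` and `rif_quantile`, the Gaussian kernel, and the fixed bandwidth `h` are assumptions for illustration, not the crate's API.

```rust
/// Gaussian kernel density estimate of `ys` at point `x` with bandwidth `h`.
fn kde(ys: &[f64], x: f64, h: f64) -> f64 {
    let norm = (2.0 * std::f64::consts::PI).sqrt();
    let sum: f64 = ys
        .iter()
        .map(|y| (-0.5 * ((x - y) / h).powi(2)).exp())
        .sum();
    sum / (ys.len() as f64 * h * norm)
}

/// Returns the sample quantile q_tau and RIF(y_i; q_tau) for every observation.
/// Assumes `ys` is non-empty and `0 < tau < 1`.
fn rif_quantile(ys: &[f64], tau: f64, h: f64) -> (f64, Vec<f64>) {
    let mut sorted = ys.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Order-statistic quantile: the ceil(tau * n)-th smallest value.
    let rank = ((tau * ys.len() as f64).ceil() as usize).clamp(1, ys.len());
    let q = sorted[rank - 1];
    let density = kde(ys, q, h);
    let rif: Vec<f64> = ys
        .iter()
        .map(|&y| {
            let below = if y <= q { 1.0 } else { 0.0 };
            q + (tau - below) / density
        })
        .collect();
    (q, rif)
}
```

By construction the RIF values average to (approximately) the quantile itself, which is why regressing them on the predictors yields a mean-style decomposition of a quantile gap.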
DiNardo-Fortin-Lemieux (DFL) reweighting (Rust only) is a non-parametric alternative that shows what the wage distribution of Group B would look like if its members had the characteristics of Group A.
The run_dfl function returns density vectors that can be plotted directly in Python (matplotlib) or Rust (plotters).
use oaxaca_blinder::run_dfl;
let dfl = run_dfl(&df, "wage", "gender", "F", &["education", "experience"])?;
// dfl.grid <- X-axis (Wage levels)
// dfl.density_a <- Actual Group A Density
// dfl.density_b <- Actual Group B Density
// dfl.density_b_counterfactual <- "What B would earn with A's characteristics"
Tip: Plot density_b vs density_b_counterfactual to visualize the "explained" gap.
The library is designed for performance: it uses Rust's speed and Rayon-based parallelism for bootstrapping.
Performance vs Python (statsmodels) vs R (oaxaca)
Dataset: 100k rows, 10 predictors
| Reps | Rust (oaxaca_blinder) | Python (statsmodels) | R (oaxaca) |
|---|---|---|---|
| 1 (Raw) | 0.14s | 0.15s | ? |
| 100 | 0.76s | N/A | ? |
| 500 | 3.11s | N/A | ~119.4s |
In these runs, Rust's raw decomposition is on par with statsmodels (0.14s vs 0.15s), and the 500-replication bootstrap is roughly 38 times faster than R (3.11s vs ~119.4s).
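For readers unfamiliar with bootstrapped standard errors, the sketch below shows the basic resampling loop for the raw mean gap. It is a sequential, std-only illustration and not the library's implementation: `XorShift64`, `resample`, and `bootstrap_gap_se` are hypothetical names, and the small hand-rolled generator only stands in for a proper RNG crate. Because each replication is independent of the others, loops of this shape parallelize well.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crate.
struct XorShift64(u64);

impl XorShift64 {
    /// Returns a pseudo-random index in `0..n`.
    fn next_index(&mut self, n: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 % n as u64) as usize
    }
}

fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Draws `xs.len()` observations from `xs` with replacement.
fn resample(xs: &[f64], rng: &mut XorShift64) -> Vec<f64> {
    (0..xs.len()).map(|_| xs[rng.next_index(xs.len())]).collect()
}

/// Bootstrap standard error of the raw mean gap `mean(a) - mean(b)`.
/// Assumes non-empty groups and `reps >= 2`.
fn bootstrap_gap_se(a: &[f64], b: &[f64], reps: usize, seed: u64) -> f64 {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let gaps: Vec<f64> = (0..reps)
        .map(|_| mean(&resample(a, &mut rng)) - mean(&resample(b, &mut rng)))
        .collect();
    // Sample standard deviation of the replicated gaps.
    let center = mean(&gaps);
    let variance =
        gaps.iter().map(|g| (g - center).powi(2)).sum::<f64>() / (reps as f64 - 1.0);
    variance.sqrt()
}
```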
The library includes a high-performance Matching Engine for causal inference, supporting Euclidean, Mahalanobis, and Propensity Score Matching (PSM).
use oaxaca_blinder::MatchingEngine;
use polars::prelude::*;
// Load data...
let engine = MatchingEngine::new(df, "treatment", "outcome", &["age", "education"]);
// 1-Nearest Neighbor Matching with Mahalanobis distance
let weights = engine.run_matching(1, true)?;
import oaxaca_blinder
# Match units
weights = oaxaca_blinder.match_units(
"data.csv",
treatment="treatment",
outcome="wage",
covariates=["education", "experience"],
k=1,
method="mahalanobis" # or "euclidean", "psm"
)
oaxaca-cli --data wage.csv --outcome wage --group treatment --reference 0 \
--predictors education,experience \
--analysis-type match --matching-method mahalanobis --k-neighbors 1
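As a rough picture of what nearest-neighbor matching does, the sketch below pairs each treated unit with its closest control (with replacement) under squared Euclidean distance and averages the outcome differences. The functions `sq_dist` and `att_nearest_neighbor` are hypothetical and unrelated to MatchingEngine's internals. Mahalanobis matching and PSM follow the same pattern with a different distance (covariance-scaled, or the gap in estimated propensity scores).

```rust
/// Squared Euclidean distance between two covariate vectors.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Each unit is `(covariates, outcome)`. Matches every treated unit to its
/// nearest control and returns the mean outcome difference, or `None` if
/// either group is empty.
fn att_nearest_neighbor(
    treated: &[(Vec<f64>, f64)],
    controls: &[(Vec<f64>, f64)],
) -> Option<f64> {
    if treated.is_empty() {
        return None;
    }
    let mut total = 0.0;
    for (x_t, y_t) in treated {
        // `min_by` yields `None` when there are no controls.
        let nearest = controls
            .iter()
            .min_by(|a, b| sq_dist(x_t, &a.0).total_cmp(&sq_dist(x_t, &b.0)))?;
        total += y_t - nearest.1;
    }
    Some(total / treated.len() as f64)
}
```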
The decomposition depends on the choice of the non-discriminatory coefficient vector $\beta^*$. The general decomposition equation is:
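In the standard twofold notation from the literature (assumed here, since the library's own symbols are not shown), with $\bar{X}_g$ and $\hat{\beta}_g$ the mean characteristics and estimated coefficients of group $g \in \{A, B\}$, the identity is usually written as:

$$
\bar{Y}_A - \bar{Y}_B
= \underbrace{(\bar{X}_A - \bar{X}_B)'\beta^*}_{\text{explained}}
+ \underbrace{\bar{X}_A'(\hat{\beta}_A - \beta^*) + \bar{X}_B'(\beta^* - \hat{\beta}_B)}_{\text{unexplained}}
$$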
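This library lets you choose $\beta^*$ through the ReferenceCoefficients option, for example ReferenceCoefficients::Pooled for pooled coefficients.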
Standard detailed decomposition is sensitive to the choice of the omitted base category for dummy variables. This library implements Yun's normalization, which transforms coefficients to be invariant to the base category choice:
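In the usual textbook form (an assumption about notation: $\beta_{kj}$ is the coefficient of category $j = 1, \dots, J_k$ of categorical variable $k$, with the omitted base category entering as zero, and $\beta_0$ the intercept), the transformation is:

$$
\tilde{\beta}_{kj} = \beta_{kj} - \bar{\beta}_k,
\qquad
\tilde{\beta}_0 = \beta_0 + \sum_k \bar{\beta}_k,
\qquad
\bar{\beta}_k = \frac{1}{J_k} \sum_{j=1}^{J_k} \beta_{kj}
$$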
Where $\bar{\beta}_k$ is the mean of the coefficients for the categorical variable $k$. As a result, the detailed contributions no longer depend on which category was omitted.
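A minimal sketch of this normalization for a single categorical variable, assuming the omitted base category has coefficient zero. `yun_normalize` is a hypothetical helper, not a function exported by the library.

```rust
/// Normalizes the dummy coefficients of one categorical variable.
///
/// `coefs` holds the estimated coefficients of the non-base categories;
/// the base category is implicitly zero. Returns the adjusted intercept and
/// the normalized coefficients, with the base category first.
fn yun_normalize(intercept: f64, coefs: &[f64]) -> (f64, Vec<f64>) {
    let n_categories = coefs.len() + 1; // include the omitted base
    let mean = coefs.iter().sum::<f64>() / n_categories as f64;

    let mut normalized = Vec::with_capacity(n_categories);
    normalized.push(0.0 - mean); // base category
    normalized.extend(coefs.iter().map(|b| b - mean));

    // The mean removed from the dummies is absorbed by the intercept.
    (intercept + mean, normalized)
}
```

For example, with three categories where B and C have coefficients 2 and 4 relative to base A and the intercept is 10, `yun_normalize(10.0, &[2.0, 4.0])` returns an intercept of 12 and coefficients -2, 0, 2. Re-estimating with B as the base (intercept 12, coefficients -2 for A and 2 for C) gives the same normalized values, which is the invariance property described above.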
The Juhn-Murphy-Pierce (JMP) method decomposes the change in the gap over time (or between distributions) into three components:
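In one common textbook formulation (the symbols and the choice of reference period are assumptions here, and the library's exact weighting may differ), the gap in period $t$ is $D_t = \Delta\bar{X}_t \beta_t + \sigma_t \Delta\bar{\theta}_t$, where $\Delta\bar{X}_t$ is the group difference in mean characteristics, $\sigma_t$ the residual standard deviation, and $\Delta\bar{\theta}_t$ the group difference in mean standardized residuals. The change between periods 0 and 1 then splits into a quantity effect, a price effect, and an unobservables effect:

$$
D_1 - D_0
= \underbrace{(\Delta\bar{X}_1 - \Delta\bar{X}_0)\,\beta_1}_{\text{quantity}}
+ \underbrace{\Delta\bar{X}_0\,(\beta_1 - \beta_0)}_{\text{price}}
+ \underbrace{\sigma_1 \Delta\bar{\theta}_1 - \sigma_0 \Delta\bar{\theta}_0}_{\text{unobservables}}
$$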
The AKM model decomposes wage variation into individual and firm-specific components:
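In the usual notation (assumed here), the log wage of worker $i$ in period $t$ is the sum of a worker effect $\alpha_i$, the effect $\psi_{J(i,t)}$ of the firm $J(i,t)$ employing the worker in that period, time-varying controls, and an error term:

$$
y_{it} = \alpha_i + \psi_{J(i,t)} + x_{it}'\beta + \varepsilon_{it}
$$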
Identification: The model is identified only within the Largest Connected Set (LCS) of workers and firms linked by mobility. This library automatically extracts the LCS using a graph-based approach (BFS) before estimation.
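The extraction step can be sketched as a breadth-first search over the bipartite worker-firm graph. This is an illustration under stated assumptions rather than the library's code: `largest_connected_set` is a hypothetical function, IDs are plain `u32` values, the input is assumed to be a list of `(worker, firm)` employment spells, and "largest" is measured by the number of workers.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Returns the workers and firms of the largest connected set, where two
/// firms are connected if some worker has been employed at both.
fn largest_connected_set(spells: &[(u32, u32)]) -> (HashSet<u32>, HashSet<u32>) {
    // Adjacency lists for both sides of the bipartite graph.
    let mut firms_of: HashMap<u32, Vec<u32>> = HashMap::new();
    let mut workers_at: HashMap<u32, Vec<u32>> = HashMap::new();
    for &(worker, firm) in spells {
        firms_of.entry(worker).or_default().push(firm);
        workers_at.entry(firm).or_default().push(worker);
    }

    let mut seen: HashSet<u32> = HashSet::new();
    let mut best = (HashSet::new(), HashSet::new());
    for &(start, _) in spells {
        if !seen.insert(start) {
            continue; // worker already assigned to a component
        }
        let mut workers = HashSet::new();
        let mut firms = HashSet::new();
        let mut queue = VecDeque::from([start]);
        while let Some(worker) = queue.pop_front() {
            workers.insert(worker);
            for &firm in &firms_of[&worker] {
                // Expand each firm only the first time it is reached.
                if firms.insert(firm) {
                    for &other in &workers_at[&firm] {
                        if seen.insert(other) {
                            queue.push_back(other);
                        }
                    }
                }
            }
        }
        if workers.len() > best.0.len() {
            best = (workers, firms);
        }
    }
    best
}
```

Observations outside the returned sets would be dropped before estimating the worker and firm effects.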
PSM estimates the Average Treatment Effect on the Treated (ATT) by matching treated units to control units with similar probabilities of treatment:
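In standard notation (assumed here), with treatment indicator $D_i$, propensity score $e(x) = P(D = 1 \mid X = x)$, $N_1$ treated units, and $M_k(i)$ the set of $k$ control units whose estimated scores are closest to that of treated unit $i$, the matching estimator is:

$$
\widehat{\text{ATT}} = \frac{1}{N_1} \sum_{i:\,D_i = 1} \left( Y_i - \frac{1}{k} \sum_{j \in M_k(i)} Y_j \right)
$$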
DiNardo, Fortin, and Lemieux (1996) proposed a non-parametric method to decompose the entire distribution of wages. It constructs a counterfactual density for Group B (e.g., women) as if they had the characteristics of Group A (e.g., men) by applying a reweighting factor $\Psi(x)$:
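In the usual notation (assumed here, with $g$ denoting group membership), the reweighting factor and the resulting counterfactual density are:

$$
\Psi(x) = \frac{P(g = A \mid x)}{P(g = B \mid x)} \cdot \frac{P(g = B)}{P(g = A)},
\qquad
f_B^{C}(y) = \int f_B(y \mid x)\, \Psi(x)\, dF_B(x)
$$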
This allows for visual comparison of the "explained" gap across the entire distribution (e.g., via Kernel Density Estimation).
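For intuition, the reweighting factor has a simple closed form when the covariates are discrete: it is the share of Group A in a covariate cell divided by the share of Group B in that cell. The sketch below uses that special case. `cell_shares`, `dfl_weights`, and `weighted_mean` are hypothetical helpers, and the integer cell IDs stand in for the covariates; with continuous covariates the shares would instead come from an estimated probability model, a step not shown here.

```rust
use std::collections::HashMap;

/// Share of observations falling in each covariate cell.
fn cell_shares(cells: &[u32]) -> HashMap<u32, f64> {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &cell in cells {
        *counts.entry(cell).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .map(|(cell, n)| (cell, n as f64 / cells.len() as f64))
        .collect()
}

/// Reweighting factor for every Group B observation:
/// share of A in the observation's cell divided by share of B in that cell.
fn dfl_weights(cells_a: &[u32], cells_b: &[u32]) -> Vec<f64> {
    let p_a = cell_shares(cells_a);
    let p_b = cell_shares(cells_b);
    cells_b
        .iter()
        .map(|cell| p_a.get(cell).copied().unwrap_or(0.0) / p_b[cell])
        .collect()
}

/// Weighted mean of wages, used here as a one-number summary of the
/// counterfactual distribution.
fn weighted_mean(wages: &[f64], weights: &[f64]) -> f64 {
    let total: f64 = weights.iter().sum();
    wages.iter().zip(weights).map(|(y, w)| y * w).sum::<f64>() / total
}
```

The same weights can be passed to a weighted kernel density estimator to obtain the counterfactual density rather than just its mean.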
This project is licensed under the MIT License.