---
license: mit
library_name: scikit-learn
tags:
- regression
- salary-prediction
- north-america
- tabular
metrics:
- mae
- mape
- r2
model-index:
- name: na-tech-jobs-salary-v1
  results:
  - task:
      type: regression
      name: Salary Regression
    dataset:
      name: arjun10g/na-tech-jobs (curated/jobs.parquet)
      type: arjun10g/na-tech-jobs
    metrics:
    - type: mae
      name: MAE (USD/year)
      value: 29091
    - type: mape
      name: MAPE (%)
      value: 14.73
    - type: r2
      name: R² (log scale)
      value: 0.7300
---
# na-tech-jobs salary regressor — v1
Predicts the **maximum disclosed salary** of a North American senior tech
job posting, in USD per year, given tabular features from the
[`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs)
weekly snapshot.
## 1. Headline metrics
| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
|---|---|---|---|---|
| MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 |
| MAPE | 14.73% | — | 15.62% | — |
| R² (on log10 target) | 0.7300 | — | 0.7051 | — |
| n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |
The project target (CLAUDE.md §7) is **MAE < $25k USD/year**.
v1 does not yet meet it: the test MAE is $29,091, about $4.1k above target.
The two column groups answer different questions: **test-MAE** measures
generalization to the frozen 20% holdout (a single draw); **CV-MAE** is the
average of 5 out-of-fold MAEs on the training set and captures **split
variance**. When the two agree, we have evidence that the test draw was
representative. A test-MAE well below CV-MAE would suggest a lucky test draw;
one well above it would suggest overfitting. Here the two 95% intervals
overlap ($27,095 – $31,157 vs. $29,409 – $31,537), so there is no strong sign
of either.
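
The metrics and intervals in the table can be sketched in a few lines. The model card does not state which bootstrap variant was used, so the percentile bootstrap below is an assumption, as are the function names and the `n_boot` default; treat it as an illustration of the idea rather than the project's evaluation code.

```python
import math
import random
import statistics


def mae(y_true, y_pred):
    """Mean absolute error in the target's units (USD/year here)."""
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def mape_pct(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    return 100.0 * statistics.fmean(
        abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)
    )


def r2_log10(y_true, y_pred):
    """R^2 computed on log10 of the salaries (inputs must be positive)."""
    log_t = [math.log10(t) for t in y_true]
    log_p = [math.log10(p) for p in y_pred]
    mean_t = statistics.fmean(log_t)
    ss_res = sum((t - p) ** 2 for t, p in zip(log_t, log_p))
    ss_tot = sum((t - mean_t) ** 2 for t in log_t)
    return 1.0 - ss_res / ss_tot


def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for MAE (assumed variant, not confirmed).

    Rows are resampled with replacement; the CI is read off the sorted
    bootstrap MAEs at the alpha/2 and 1 - alpha/2 positions.
    """
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    rng = random.Random(seed)
    n = len(abs_err)
    boot = sorted(
        statistics.fmean(rng.choices(abs_err, k=n)) for _ in range(n_boot)
    )
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Synthetic salaries, only to show the call pattern.
    demo_rng = random.Random(0)
    truth = [demo_rng.uniform(80_000, 300_000) for _ in range(200)]
    preds = [t * demo_rng.uniform(0.8, 1.2) for t in truth]
    print(mae(truth, preds), bootstrap_mae_ci(truth, preds, n_boot=500))
```

The same function applies to pooled out-of-fold predictions to get the CV interval, assuming the CV CI was computed that way.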
## 2. The ladder
We trained six tiers from a constant baseline up to XGBoost, all on the
**same frozen test set**. The selected model is `tier5_xgboost_optuna`.
| tier | mae | mae_ci | cv_mae | cv_mae_ci | mape_pct | r2_log | n_test |
|:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
| tier0_constant | $60,509 | $57,405-$63,536 | $62,279 | $60,528-$63,759 | 33.65 | -0.0001 | 1226 |
| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623 | $59,939-$63,068 | 32.89 | 0.0447 | 1226 |
| tier2_mincer_ols | $51,322 | $48,366-$54,199 | $52,041 | $50,416-$53,389 | 27.36 | 0.2827 | 1226 |
| tier3_ridge_full | $43,199 | $40,907-$45,464 | $42,179 | $40,951-$43,236 | 23.25 | 0.462 | 1226 |
| tier4_random_forest | $35,935 | $33,906-$38,129 | $37,016 | $35,799-$38,046 | 19.02 | 0.6151 | 1226 |
| tier5_xgboost_optuna | $29,091 | $27,095-$31,157 | $30,533 | $29,409-$31,537 | 14.73 | 0.73 | 1226 |
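The leaderboard answers the question "is the gain from XGBoost worth its
complexity?" Reading down the test-set bootstrap CIs, only tier0 and tier1
overlap; every later tier's interval lies entirely below its predecessor's,
including tier5 ($27,095-$31,157) versus tier4 ($33,906-$38,129).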
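
The overlap check can be done mechanically. The sketch below copies the test-set `mae_ci` bounds from the leaderboard and compares each tier with the one above it; the helper names are illustrative. Non-overlapping intervals are a conservative heuristic, and a paired bootstrap on per-row error differences would be a more direct comparison, though the model card does not say whether one was run.

```python
# Test-set MAE 95% CIs (USD/year), copied from the leaderboard table.
ladder = [
    ("tier0_constant", 57405, 63536),
    ("tier1_stratified_mean", 56469, 62505),
    ("tier2_mincer_ols", 48366, 54199),
    ("tier3_ridge_full", 40907, 45464),
    ("tier4_random_forest", 33906, 38129),
    ("tier5_xgboost_optuna", 27095, 31157),
]


def intervals_overlap(lo_a, hi_a, lo_b, hi_b):
    """True when the closed intervals [lo_a, hi_a] and [lo_b, hi_b] intersect."""
    return lo_a <= hi_b and lo_b <= hi_a


def adjacent_overlaps(rows):
    """Compare each tier's CI with the next tier's CI."""
    result = []
    for (name_a, lo_a, hi_a), (name_b, lo_b, hi_b) in zip(rows, rows[1:]):
        result.append((name_a, name_b, intervals_overlap(lo_a, hi_a, lo_b, hi_b)))
    return result


if __name__ == "__main__":
    for name_a, name_b, overlap in adjacent_overlaps(ladder):
        print(f"{name_a} vs {name_b}: overlap={overlap}")
```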
## 3. Honest framing — selection bias & MNAR
The model is trained on the **disclosed-salary subset only** (~50% of NA
tech postings). Disclosure is **not random**:
- **Mandated jurisdictions** (CA / NY / WA / CO / CT / MD / IL / HI;
ON / BC / PEI) have higher disclosure rates by law.
- **Voluntary disclosure** is concentrated at transparency-leaning employers
(Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
- **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means the
unobserved component depends on the latent salary itself — i.e. the
process is MNAR.
**Therefore the model predicts "salary as priced by disclosing employers
in our corpus", not ground truth.** A 2-stage Heckman correction is
flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.
## 4. Inputs
Tabular features (no text yet — `description_md` enters via bge-m3
embeddings in Phase 5). See
[`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md)
for full definitions and
[`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md)
for encoding choices.
## 5. Stratified evaluation
Per-stratum MAE on the test set (1,226 rows total):
| stratum | n | mae | mae_ci_low | mae_ci_high | mape_pct | r2_log |
|:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
| US/greenhouse | 1117 | $29,306 | $27,393 | $31,211 | 14.47 | 0.7273 |
| CA/greenhouse | 56 | $28,610 | $20,222 | $39,174 | 21.96 | 0.673 |
| US/lever | 32 | $10,276 | $7,051 | $13,835 | 7.92 | 0.7615 |
| US/ashby | 21 | $47,632 | $27,030 | $74,180 | 19.47 | -0.0792 |
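CA strata are small (a few hundred rows overall, 56 in this test set), and
US/lever and US/ashby are smaller still (32 and 21 test rows). Point
estimates for these strata should be read alongside their wide CIs; US/ashby
even has a negative R². The choice to **emphasise uncertainty directly** via
bootstrap CIs (rather than a Cohen-style power analysis) is documented in
`LITERATURE_REVIEW.md` §15.3 #15.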
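
A per-stratum breakdown like the table above amounts to grouping absolute errors by `country/source`. The sketch below assumes each test row is available as a dict with `country`, `source`, `y_true`, and `y_pred` keys; those key names and the function name are hypothetical, not taken from the repository.

```python
from collections import defaultdict
from statistics import fmean


def per_stratum_mae(records):
    """Group absolute errors by 'country/source' and report n and MAE.

    `records` is an iterable of dicts with the (assumed) keys
    'country', 'source', 'y_true', 'y_pred'.
    """
    errors = defaultdict(list)
    for row in records:
        stratum = f"{row['country']}/{row['source']}"
        errors[stratum].append(abs(row["y_true"] - row["y_pred"]))
    return {
        stratum: {"n": len(errs), "mae": fmean(errs)}
        for stratum, errs in sorted(errors.items())
    }


if __name__ == "__main__":
    # Toy rows, only to show the call pattern.
    toy = [
        {"country": "US", "source": "greenhouse", "y_true": 200_000, "y_pred": 190_000},
        {"country": "US", "source": "greenhouse", "y_true": 150_000, "y_pred": 170_000},
        {"country": "CA", "source": "greenhouse", "y_true": 120_000, "y_pred": 125_000},
    ]
    print(per_stratum_mae(toy))
```

Each stratum's error list can then be resampled on its own to get a per-stratum interval; with only a few dozen rows, those intervals will be wide.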
## 6. Intended use
- **Recruiters and candidates**: rough salary anchoring for a given role,
location, and seniority bucket.
- **The builder**: their own NA senior-DS job search.
- **Researchers**: a transparent baseline for compensation prediction
using only public ATS data.
## 7. Out-of-scope
- Non-NA job markets (the dataset is US/CA only).
- Non-tech sectors (banks, healthcare, retail are largely on Workday
and not yet in the dataset; Phase 4 of the project plan).
- Total compensation (the target is base salary max; equity / bonus
are mentioned only as boolean features).
- Individual-offer negotiation (the model predicts a posting, not an
offer).
## 8. Training data
- Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
- Snapshot: 2026-05-08.
- Total active rows: 12,334. Disclosed-salary subset: 6,143 (4,917 train + 1,226 test).
- Train / test split: deterministic, hash-keyed by `id` and seed=42,
stratified by `(country, source)`. 80/20 split. Test row IDs frozen
at `data/eda/test_split_ids.json`.
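
One way to get a deterministic, hash-keyed, stratified split is sketched below. The exact hashing scheme in the repository is not described in this card, so the SHA-256 of `"{seed}:{id}"`, the per-stratum rounding, and the function names are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def hash_fraction(row_id, seed=42):
    """Map (seed, id) to a stable float in [0, 1), independent of row order."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16) / 16**16


def stratified_hash_split(rows, test_frac=0.20, seed=42):
    """Split rows into train/test IDs, stratified by (country, source).

    `rows` is an iterable of (id, country, source) tuples. Within each
    stratum, IDs are ordered by their hash and the first `test_frac`
    share goes to the test set, so the result does not depend on the
    order in which rows arrive.
    """
    by_stratum = defaultdict(list)
    for row_id, country, source in rows:
        by_stratum[(country, source)].append(row_id)

    train_ids, test_ids = [], []
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum], key=lambda i: hash_fraction(i, seed))
        n_test = round(test_frac * len(ids))
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return train_ids, test_ids


if __name__ == "__main__":
    demo = [(f"job-{i}", "US", "greenhouse") for i in range(100)]
    demo += [(f"job-ca-{i}", "CA", "greenhouse") for i in range(20)]
    train, test = stratified_hash_split(demo)
    print(len(train), len(test))
```

Because membership depends only on the ID, the seed, and the stratum contents, re-running on the same snapshot reproduces the same test set. Freezing the IDs to a file, as the project does, additionally protects the test set when later snapshots add rows.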
## 9. Limitations
- **Workday gap** — major employers (Snowflake, Coinbase, Shopify,
Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't
yet support. Their salaries are missing from training data.
- **No total-comp signal** — the regressor sees `offers_equity`
(boolean) and `bonus_mentioned` (boolean) but cannot distinguish a
$200k base + $100k equity package from $300k cash.
- **Title-derived seniority/role family is regex-noisy** — ~70% of
rows label as `Other` for role family. Phase 4 will replace these
with DeBERTa classifiers.
- **Description text not yet used** — bge-m3 dense embedding lands in
Phase 5. The current model is purely tabular.
## 10. Reproducibility
```bash
git clone https://github.com/Arjun10g/na-tech-jobs
cd na-tech-jobs
uv sync --extra ml --extra eda --group dev
uv run python -m models.salary.train
```
This will rebuild the dataset, re-run the ladder, refit the winning
tier on the same frozen split, and reproduce the metrics above. Random
seeds are fixed throughout (`42` for split, Optuna sampler, RF, XGB).
## Citation
> Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
> https://huggingface.co/arjun10g/na-tech-jobs-salary-v1