license: mit
library_name: scikit-learn
tags:
- regression
- salary-prediction
- north-america
- tabular
metrics:
- mae
- mape
- r2
model-index:
- name: na-tech-jobs-salary-v1
  results:
  - task:
      type: regression
      name: Salary Regression
    dataset:
      name: arjun10g/na-tech-jobs (curated/jobs.parquet)
      type: arjun10g/na-tech-jobs
    metrics:
    - type: mae
      name: MAE (USD/year)
      value: 29091
    - type: mape
      name: MAPE (%)
      value: 14.73
    - type: r2
      name: R² (log scale)
      value: 0.73
na-tech-jobs salary regressor — v1
Predicts the maximum disclosed salary of a North American senior tech
job posting, in USD per year, given tabular features from the
arjun10g/na-tech-jobs
weekly snapshot.
1. Headline metrics
| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
|---|---|---|---|---|
| MAE (USD / year) | $29,091 | $27,095 – $31,157 | $30,533 | $29,409 – $31,537 |
| MAPE | 14.73% | — | 15.62% | — |
| R² (on log10 target) | 0.7300 | — | 0.7051 | — |
| n | 1,226 | (frozen test, stratified by country × source) | 4,917 | (5-fold OOF) |
The CLAUDE.md §7 target was MAE < $25k USD/year. Not yet hit (current: $29,091).
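As a reference for how the three headline metrics are usually defined, here is a minimal, dependency-free sketch. It assumes MAE and MAPE are computed on raw USD values and R² on log10-transformed salaries, as the table labels suggest; the function names and the four example salaries are illustrative and are not taken from the training code.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the target's own units (USD/year here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape_pct(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_log10(y_true, y_pred):
    """R² computed after taking log10 of both targets and predictions."""
    log_t = [math.log10(t) for t in y_true]
    log_p = [math.log10(p) for p in y_pred]
    mean_t = sum(log_t) / len(log_t)
    ss_res = sum((t - p) ** 2 for t, p in zip(log_t, log_p))
    ss_tot = sum((t - mean_t) ** 2 for t in log_t)
    return 1.0 - ss_res / ss_tot

# Made-up salaries, for illustration only.
y_true = [150_000, 200_000, 250_000, 300_000]
y_pred = [165_000, 190_000, 240_000, 330_000]

example_mae = mae(y_true, y_pred)
example_mape = mape_pct(y_true, y_pred)
example_r2 = r2_log10(y_true, y_pred)
meets_target = example_mae < 25_000  # the MAE < $25k target quoted above

print(f"MAE=${example_mae:,.0f}  MAPE={example_mape:.2f}%  R2(log10)={example_r2:.3f}")
```

Because R² is computed in log space while MAE is in dollars, the two can rank models differently: a large dollar miss on a very high salary weighs heavily on MAE but much less on the log-scale R².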
The two columns answer different questions. Test MAE measures generalization to the frozen 20% holdout (a single draw); CV MAE is the average of 5 out-of-fold MAEs on the training set, which captures split-to-split variance. When the two agree, we have evidence the test draw was representative; if test MAE differed materially from CV MAE, we would suspect an unrepresentative test draw or overfitting. Here the two are close ($29,091 vs $30,533), and the CV estimate falls inside the test set's bootstrap CI.
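The distinction between the two estimates can be sketched with the simplest rung of the ladder. The example below assumes a constant (median) predictor and synthetic log-normal salaries; `kfold_indices` is a hypothetical helper, and none of this is the project's actual training code.

```python
import random
import statistics

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def kfold_indices(n, k, seed):
    """Shuffle row indices once, then deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Synthetic salaries (assumption: log-normal, median around $160k).
rng = random.Random(42)
salaries = [rng.lognormvariate(12.0, 0.35) for _ in range(1000)]
train, test = salaries[:800], salaries[800:]  # 80/20 split

# Out-of-fold MAE: fit on 4 folds, score on the held-out fold, repeat 5 times.
folds = kfold_indices(len(train), k=5, seed=42)
fold_maes = []
for held_out in folds:
    held = set(held_out)
    fit_rows = [y for i, y in enumerate(train) if i not in held]
    constant = statistics.median(fit_rows)  # the "model" is a single number
    val_rows = [train[i] for i in held_out]
    fold_maes.append(mae(val_rows, [constant] * len(val_rows)))
cv_mae = statistics.mean(fold_maes)

# Holdout MAE: refit on all training rows, score once on the frozen test rows.
constant = statistics.median(train)
test_mae = mae(test, [constant] * len(test))

print(f"CV MAE   = ${cv_mae:,.0f} (fold spread: ${min(fold_maes):,.0f} to ${max(fold_maes):,.0f})")
print(f"Test MAE = ${test_mae:,.0f}")
```

The spread across the five fold MAEs gives a rough sense of how much a single holdout number can move from split noise alone.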
2. The ladder
We trained six tiers, from a constant baseline up to XGBoost, and
evaluated all of them on the same frozen test set. The selected model
is tier5_xgboost_optuna.
| tier | mae | mae_ci | cv_mae | cv_mae_ci | mape_pct | r2_log | n_test |
|---|---|---|---|---|---|---|---|
| tier0_constant | $60,509 | $57,405-$63,536 | $62,279 | $60,528-$63,759 | 33.65 | -0.0001 | 1226 |
| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623 | $59,939-$63,068 | 32.89 | 0.0447 | 1226 |
| tier2_mincer_ols | $51,322 | $48,366-$54,199 | $52,041 | $50,416-$53,389 | 27.36 | 0.2827 | 1226 |
| tier3_ridge_full | $43,199 | $40,907-$45,464 | $42,179 | $40,951-$43,236 | 23.25 | 0.462 | 1226 |
| tier4_random_forest | $35,935 | $33,906-$38,129 | $37,016 | $35,799-$38,046 | 19.02 | 0.6151 | 1226 |
| tier5_xgboost_optuna | $29,091 | $27,095-$31,157 | $30,533 | $29,409-$31,537 | 14.73 | 0.73 | 1226 |
The leaderboard answers the question "is the gain from XGBoost worth its complexity?" — read down the bootstrap CIs to see where they overlap.
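Reading down the CIs can be made mechanical. The sketch below copies the test-set `mae_ci` bounds from the table and checks each adjacent pair of tiers for overlap; the `overlaps` helper is illustrative.

```python
# Test-set 95% bootstrap CIs for MAE (USD/year), copied from the table above.
ladder = {
    "tier0_constant": (57_405, 63_536),
    "tier1_stratified_mean": (56_469, 62_505),
    "tier2_mincer_ols": (48_366, 54_199),
    "tier3_ridge_full": (40_907, 45_464),
    "tier4_random_forest": (33_906, 38_129),
    "tier5_xgboost_optuna": (27_095, 31_157),
}

def overlaps(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

tiers = list(ladder)
adjacent_overlap = {}
for simpler, richer in zip(tiers, tiers[1:]):
    adjacent_overlap[(simpler, richer)] = overlaps(ladder[simpler], ladder[richer])
    verdict = "overlap" if adjacent_overlap[(simpler, richer)] else "separated"
    print(f"{simpler} -> {richer}: {verdict}")
```

Only the first step (constant to stratified mean) has overlapping intervals; every later step, including random forest to XGBoost, is separated. Interval overlap is a conservative check, so a paired bootstrap on the per-row error difference would be a sharper comparison for close tiers.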
3. Honest framing — selection bias & MNAR
The model is trained on the disclosed-salary subset only (~50% of NA tech postings). Disclosure is not random:
- Mandated jurisdictions (CA / NY / WA / CO / CT / MD / IL / HI; ON / BC / PEI) have higher disclosure rates by law.
- Voluntary disclosure is concentrated at transparency-leaning employers (Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
- Strategic non-disclosure (Cullen & Pakzad-Hurson 2023) means that whether a salary is observed depends on the latent salary itself, i.e. the process is missing not at random (MNAR).
Therefore the model predicts "salary as priced by disclosing employers
in our corpus", not ground truth. A 2-stage Heckman correction is
flagged in LITERATURE_REVIEW.md §1.2 as a v2 deliverable.
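A small simulation shows why MNAR disclosure matters. It is purely illustrative: the log-normal salary distribution, the logistic disclosure rule, and in particular the direction of the effect (higher-paying postings disclosing less often) are assumptions made for the sketch, not findings from this dataset.

```python
import math
import random
import statistics

rng = random.Random(42)

# Latent salaries for every posting, disclosed or not (synthetic).
latent = [rng.lognormvariate(12.0, 0.35) for _ in range(20_000)]
median = statistics.median(latent)

def disclose_prob(salary):
    """Hypothetical MNAR rule: disclosure probability falls as salary rises."""
    z = math.log(salary / median)
    return 1.0 / (1.0 + math.exp(3.0 * z))

# MNAR: disclosure depends on the salary itself.
mnar_sample = [s for s in latent if rng.random() < disclose_prob(s)]
# MCAR reference: a coin flip with the same overall rate of about 50%.
mcar_sample = [s for s in latent if rng.random() < 0.5]

true_mean = statistics.mean(latent)
mnar_bias = statistics.mean(mnar_sample) - true_mean
mcar_bias = statistics.mean(mcar_sample) - true_mean
mnar_rate = len(mnar_sample) / len(latent)

print(f"disclosure rate under MNAR rule: {mnar_rate:.1%}")
print(f"bias of disclosed mean, MNAR: ${mnar_bias:,.0f}")
print(f"bias of disclosed mean, MCAR: ${mcar_bias:,.0f}")
```

Both samples cover roughly half the postings, but only the random one recovers the population mean. A model fitted to the MNAR sample inherits that shift, which is the gap a Heckman-style correction tries to close.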
4. Inputs
Tabular features (no text yet; description_md enters via bge-m3
embeddings in Phase 5). See DATA_DICTIONARY.md for full definitions and
LITERATURE_REVIEW.md §14 for encoding choices.
5. Stratified evaluation
Per-stratum MAE on the test set (1,226 rows total):
| stratum | n | mae | mae_ci_low | mae_ci_high | mape_pct | r2_log |
|---|---|---|---|---|---|---|
| US/greenhouse | 1117 | $29,306 | $27,393 | $31,211 | 14.47 | 0.7273 |
| CA/greenhouse | 56 | $28,610 | $20,222 | $39,174 | 21.96 | 0.673 |
| US/lever | 32 | $10,276 | $7,051 | $13,835 | 7.92 | 0.7615 |
| US/ashby | 21 | $47,632 | $27,030 | $74,180 | 19.47 | -0.0792 |
CA strata are small (a few hundred rows overall; only 56 in the test
set), so any precise claim about CA performance should be read with that
caveat, and the same applies to US/lever (n = 32) and US/ashby (n = 21).
The choice to emphasise uncertainty directly via bootstrap CIs (rather
than a Cohen-style power analysis) is documented in
LITERATURE_REVIEW.md §15.3 #15.
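The effect of stratum size on interval width can be sketched with a percentile bootstrap. This assumes the CIs are percentile intervals over row resamples, which the card does not state; the rows are synthetic, the two stratum labels are reused only as group keys, and `bootstrap_mae_ci` is a hypothetical helper.

```python
import random
from collections import defaultdict

def bootstrap_mae_ci(abs_errors, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean absolute error."""
    rng = random.Random(seed)
    n = len(abs_errors)
    # Resample rows with replacement and recompute the MAE each time.
    boot = sorted(sum(rng.choices(abs_errors, k=n)) / n for _ in range(n_boot))
    low = boot[round(n_boot * alpha / 2)]
    high = boot[round(n_boot * (1 - alpha / 2)) - 1]
    return low, high

# Synthetic (stratum, y_true, y_pred) rows with the same error noise in both strata.
data_rng = random.Random(0)
rows = []
for stratum, n in [("US/greenhouse", 400), ("US/ashby", 20)]:
    for _ in range(n):
        y_true = data_rng.uniform(120_000, 320_000)
        y_pred = y_true + data_rng.gauss(0, 40_000)
        rows.append((stratum, y_true, y_pred))

# Group absolute errors by stratum, then bootstrap within each group.
errors = defaultdict(list)
for stratum, y_true, y_pred in rows:
    errors[stratum].append(abs(y_true - y_pred))

report = {}
for stratum, errs in errors.items():
    low, high = bootstrap_mae_ci(errs)
    report[stratum] = {"n": len(errs), "mae": sum(errs) / len(errs),
                       "ci_low": low, "ci_high": high}
    print(f"{stratum}: n={len(errs)}  MAE=${report[stratum]['mae']:,.0f}  "
          f"CI=${low:,.0f} to ${high:,.0f}")
```

With identical error noise, the 20-row stratum gets a far wider interval than the 400-row one, so a wide CI on a small stratum says little about whether the model is actually worse there.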
6. Intended use
- Recruiters and candidates: rough salary anchoring for a given role + location + seniority bucket.
- The builder: their own NA senior-DS job search.
- Researchers: a transparent baseline for compensation prediction using only public ATS data.
7. Out-of-scope
- Non-NA job markets (the dataset is US/CA only).
- Non-tech sectors (banks, healthcare, retail are largely on Workday and not yet in the dataset; Phase 4 of the project plan).
- Total compensation (the target is base salary max; equity / bonus are mentioned only as boolean features).
- Individual-offer negotiation (the model predicts a posting, not an offer).
8. Training data
- Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
- Snapshot: 2026-05-08.
- Total active rows: 12,334. Disclosed-salary subset: 6,143 (4,917 train + 1,226 test).
- Train / test split: deterministic, hash-keyed by id and seed=42, stratified by (country, source). 80/20 split. Test row IDs frozen at data/eda/test_split_ids.json.
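One way to get a split that is deterministic, hash-keyed, and stratified is sketched below. This is an assumption about the mechanism, not the repository's implementation: `split_key` and `stratified_split` are hypothetical names, SHA-256 is an arbitrary choice of hash, and the rows are toy records.

```python
import hashlib
from collections import defaultdict

def split_key(row_id, seed=42):
    """Stable pseudo-random integer derived from the row id and the seed."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16)

def stratified_split(rows, test_frac=0.2, seed=42):
    """Return the set of test ids, taking test_frac of each (country, source) stratum."""
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[(row["country"], row["source"])].append(row["id"])
    test_ids = set()
    for ids in by_stratum.values():
        # Ranking by hash makes the choice independent of input row order.
        ranked = sorted(ids, key=lambda row_id: split_key(row_id, seed))
        test_ids.update(ranked[: round(test_frac * len(ranked))])
    return test_ids

# Toy rows: 100 US/greenhouse, 20 CA/greenhouse, 10 US/lever.
rows = (
    [{"id": f"us-gh-{i}", "country": "US", "source": "greenhouse"} for i in range(100)]
    + [{"id": f"ca-gh-{i}", "country": "CA", "source": "greenhouse"} for i in range(20)]
    + [{"id": f"us-lv-{i}", "country": "US", "source": "lever"} for i in range(10)]
)

test_ids = stratified_split(rows)
same_after_reorder = stratified_split(list(reversed(rows))) == test_ids
print(len(test_ids), "test rows; stable under reordering:", same_after_reorder)
```

Because membership depends only on the id, the seed, and the stratum contents, rerunning the split on the same snapshot yields the same test set, and writing the ids to a file freezes it against later changes in the data.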
9. Limitations
- Workday gap: major employers (Snowflake, Coinbase, Shopify, Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't yet support. Their salaries are missing from training data.
- No total-comp signal: the regressor sees offers_equity (boolean) and bonus_mentioned (boolean) but cannot distinguish a $200k base + $100k equity package from $300k cash.
- Title-derived seniority/role family is regex-noisy: ~70% of rows label as Other for role family. Phase 4 will replace these with DeBERTa classifiers.
- Description text not yet used: bge-m3 dense embedding lands in Phase 5. The current model is purely tabular.
10. Reproducibility
git clone https://github.com/Arjun10g/na-tech-jobs
cd na-tech-jobs
uv sync --extra ml --extra eda --group dev
uv run python -m models.salary.train
This will rebuild the dataset, re-run the ladder, refit the winning
tier on the same frozen split, and reproduce the metrics above. Random
seeds are fixed throughout (42 for split, Optuna sampler, RF, XGB).
Citation
Ghumman, A. (2026). na-tech-jobs salary regressor v1. https://huggingface.co/arjun10g/na-tech-jobs-salary-v1