| --- |
| license: mit |
| library_name: scikit-learn |
| tags: |
| - regression |
| - salary-prediction |
| - north-america |
| - tabular |
| metrics: |
| - mae |
| - mape |
| - r2 |
| model-index: |
| - name: na-tech-jobs-salary-v1 |
| results: |
| - task: |
| type: regression |
| name: Salary Regression |
| dataset: |
| name: arjun10g/na-tech-jobs (curated/jobs.parquet) |
| type: arjun10g/na-tech-jobs |
| metrics: |
| - type: mae |
| name: MAE (USD/year) |
| value: 29091 |
| - type: mape |
| name: MAPE (%) |
| value: 14.73 |
| - type: r2 |
| name: R² (log scale) |
| value: 0.7300 |
| --- |
| |
| # na-tech-jobs salary regressor — v1 |
|
|
| Predicts the **maximum disclosed salary** of a North American senior tech |
| job posting, in USD per year, given tabular features from the |
| [`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs) |
| weekly snapshot. |
|
|
| ## 1. Headline metrics |
|
|
| | Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI | |
| |---|---|---|---|---| |
| | MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 | |
| | MAPE | 14.73% | — | 15.62% | — | |
| | R² (on log10 target) | 0.7300 | — | 0.7051 | — | |
| | n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) | |
|
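The three headline metrics can be sketched as follows. This is a minimal illustration, assuming targets and predictions are in USD/year and that R² is computed after a log10 transform of both, as the table header states. The function names and toy numbers are hypothetical, not taken from the repository.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error in the original units (USD/year)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape_pct(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_log10(y_true, y_pred):
    """R^2 computed on log10-transformed targets and predictions."""
    lt = [math.log10(t) for t in y_true]
    lp = [math.log10(p) for p in y_pred]
    mean_lt = sum(lt) / len(lt)
    ss_res = sum((a - b) ** 2 for a, b in zip(lt, lp))
    ss_tot = sum((a - mean_lt) ** 2 for a in lt)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only (not real postings).
y_true = [150_000, 200_000, 250_000, 180_000]
y_pred = [160_000, 190_000, 230_000, 200_000]
print(mae(y_true, y_pred), mape_pct(y_true, y_pred), r2_log10(y_true, y_pred))
```

Reporting MAE in dollars and R² on the log scale means the two metrics weight errors differently: MAE is dominated by high-salary rows, while log-scale R² treats proportional errors equally.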
|
| The CLAUDE.md §7 target was **MAE < $25k USD/year**. It has not been met: |
| the current test MAE is $29,091, and the lower bound of its 95% CI |
| ($27,095) is also above the target. |
|
|
| The test and CV columns answer different questions: **test-MAE** measures |
| generalization to the frozen 20% holdout (a single draw); **CV-MAE** is the |
| average of 5 out-of-fold MAEs on the training set and captures **split |
| variance**. When the two agree, we have evidence that the test draw was |
| representative; if they differ materially, we would suspect either an |
| unrepresentative (lucky or unlucky) test draw or overfitting. Here the test |
| CI ($27,095 – $31,157) overlaps the CV CI ($29,409 – $31,537). |
|
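The bootstrap intervals reported above can be approximated with a percentile bootstrap over per-row absolute errors. The sketch below assumes that variant; the repository may use a different resampling scheme or number of replicates. `bootstrap_mae_ci` and the synthetic errors are hypothetical.

```python
import random

def bootstrap_mae_ci(abs_errors, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of per-row absolute errors."""
    rng = random.Random(seed)
    n = len(abs_errors)
    means = []
    for _ in range(n_boot):
        sample = rng.choices(abs_errors, k=n)  # resample rows with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic absolute errors in USD/year, for illustration only.
rng = random.Random(0)
abs_errors = [abs(rng.gauss(0, 35_000)) for _ in range(500)]
point = sum(abs_errors) / len(abs_errors)
lo, hi = bootstrap_mae_ci(abs_errors)
print(f"MAE = {point:,.0f}  95% CI = ({lo:,.0f}, {hi:,.0f})")
```

Resampling rows, rather than refitting the model, measures only the sampling variability of the test set for a fixed model.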
|
| ## 2. The ladder |
|
|
| We trained six tiers from a constant baseline up to XGBoost, all on the |
| **same frozen test set**. The selected model is `tier5_xgboost_optuna`. |
|
|
| | tier | mae | mae_ci | cv_mae | cv_mae_ci | mape_pct | r2_log | n_test | |
| |:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:| |
| | tier0_constant | $60,509 | $57,405-$63,536 | $62,279 | $60,528-$63,759 | 33.65 | -0.0001 | 1226 | |
| | tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623 | $59,939-$63,068 | 32.89 | 0.0447 | 1226 | |
| | tier2_mincer_ols | $51,322 | $48,366-$54,199 | $52,041 | $50,416-$53,389 | 27.36 | 0.2827 | 1226 | |
| | tier3_ridge_full | $43,199 | $40,907-$45,464 | $42,179 | $40,951-$43,236 | 23.25 | 0.462 | 1226 | |
| | tier4_random_forest | $35,935 | $33,906-$38,129 | $37,016 | $35,799-$38,046 | 19.02 | 0.6151 | 1226 | |
| | tier5_xgboost_optuna | $29,091 | $27,095-$31,157 | $30,533 | $29,409-$31,537 | 14.73 | 0.73 | 1226 | |
|
|
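| The leaderboard answers the question "is the gain from XGBoost worth its |
| complexity?" Read down the bootstrap CIs to see where they overlap: only |
| `tier0_constant` and `tier1_stratified_mean` overlap, and each later tier's |
| test-MAE interval lies entirely below the previous one. |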
|
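The overlap reading can be checked mechanically. The sketch below copies the test-MAE intervals from the ladder table and compares adjacent tiers; the helper `overlaps` is illustrative.

```python
# 95% bootstrap CIs for test MAE, copied from the ladder table (USD/year).
ladder = [
    ("tier0_constant", 57_405, 63_536),
    ("tier1_stratified_mean", 56_469, 62_505),
    ("tier2_mincer_ols", 48_366, 54_199),
    ("tier3_ridge_full", 40_907, 45_464),
    ("tier4_random_forest", 33_906, 38_129),
    ("tier5_xgboost_optuna", 27_095, 31_157),
]

def overlaps(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

# Compare each tier with the one directly below it in the ladder.
overlapping = [
    (prev[0], curr[0])
    for prev, curr in zip(ladder, ladder[1:])
    if overlaps(prev[1:], curr[1:])
]
print(overlapping)  # [('tier0_constant', 'tier1_stratified_mean')]
```

Non-overlapping intervals are a conservative heuristic. Because all tiers share the same test rows, a paired bootstrap on the per-row error difference would be a more direct comparison.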
|
| ## 3. Honest framing — selection bias & MNAR |
|
|
| The model is trained on the **disclosed-salary subset only** (~50% of the |
| postings in the corpus: ~6,143 of 12,334 active rows). Disclosure is **not |
| random**: |
|
|
| - **Mandated jurisdictions** (CA / NY / WA / CO / CT / MD / IL / HI; |
| ON / BC / PEI) have higher disclosure rates by law. |
| - **Voluntary disclosure** is concentrated at transparency-leaning employers |
| (Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample). |
| - **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means the |
| unobserved component depends on the latent salary itself — i.e. the |
| process is MNAR. |
|
|
| **Therefore the model predicts "salary as priced by disclosing employers |
| in our corpus", not ground truth.** A 2-stage Heckman correction is |
| flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable. |
|
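A small simulation illustrates why MNAR disclosure matters. It is entirely synthetic: the salary distribution and the disclosure rule (higher salaries disclosed less often) are assumptions chosen for illustration, and the real direction and size of the bias are unknown.

```python
import math
import random

rng = random.Random(42)

# Synthetic latent salaries (USD/year); the distribution is an assumption.
salaries = [10 ** rng.gauss(5.25, 0.15) for _ in range(20_000)]

def p_disclose(salary):
    """Hypothetical rule: disclosure probability falls as salary rises."""
    z = (math.log10(salary) - 5.25) / 0.15
    return 1.0 / (1.0 + math.exp(1.5 * z))

# Keep only the postings that "choose" to disclose.
disclosed = [s for s in salaries if rng.random() < p_disclose(s)]

rate = len(disclosed) / len(salaries)
mean_all = sum(salaries) / len(salaries)
mean_disclosed = sum(disclosed) / len(disclosed)
print(f"disclosure rate: {rate:.2f}")
print(f"population mean: {mean_all:,.0f}  disclosed mean: {mean_disclosed:,.0f}")
```

Under this assumed rule about half the postings disclose, yet the disclosed mean understates the population mean. A model fit only on disclosed rows inherits that gap, and collecting more disclosed rows does not remove it.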
|
| ## 4. Inputs |
|
|
| Tabular features (no text yet — `description_md` enters via bge-m3 |
| embeddings in Phase 5). See |
| [`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md) |
| for full definitions and |
| [`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md) |
| for encoding choices. |
|
|
| ## 5. Stratified evaluation |
|
|
| Per-stratum MAE on the test set (1,226 rows total): |
|
|
| | stratum | n | mae | mae_ci_low | mae_ci_high | mape_pct | r2_log | |
| |:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:| |
| | US/greenhouse | 1117 | $29,306 | $27,393 | $31,211 | 14.47 | 0.7273 | |
| | CA/greenhouse | 56 | $28,610 | $20,222 | $39,174 | 21.96 | 0.673 | |
| | US/lever | 32 | $10,276 | $7,051 | $13,835 | 7.92 | 0.7615 | |
| | US/ashby | 21 | $47,632 | $27,030 | $74,180 | 19.47 | -0.0792 | |
|
|
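| Every stratum except US/greenhouse is small: CA/greenhouse has 56 test rows |
| (a few hundred rows overall), US/lever 32, and US/ashby 21. Their per-stratum |
| metrics, including the negative R² on US/ashby and the low MAE on US/lever, |
| rest on few rows and should be read with that caveat. The choice to |
| **emphasise uncertainty directly** via bootstrap CIs (rather than a |
| Cohen-style power analysis) is documented in `LITERATURE_REVIEW.md` §15.3 #15. |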
|
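Because MAE is a per-row average, the overall test MAE should equal the row-weighted mean of the stratum MAEs. The check below uses only the numbers from the table above.

```python
# (n, MAE in USD/year) per stratum, copied from the stratified table.
strata = {
    "US/greenhouse": (1117, 29_306),
    "CA/greenhouse": (56, 28_610),
    "US/lever": (32, 10_276),
    "US/ashby": (21, 47_632),
}

n_total = sum(n for n, _ in strata.values())
weighted_mae = sum(n * m for n, m in strata.values()) / n_total
print(n_total, round(weighted_mae))  # 1226 29091
```

The result matches the headline test MAE of $29,091. It also shows how strongly that figure is driven by US/greenhouse, which contributes 1,117 of the 1,226 test rows.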
|
| ## 6. Intended use |
|
|
| - **Recruiters and candidates**: rough salary anchoring for a given role, |
| location, and seniority bucket. |
| - **The builder**: their own NA senior-DS job search. |
| - **Researchers**: a transparent baseline for compensation prediction |
| using only public ATS data. |
|
|
| ## 7. Out-of-scope |
|
|
| - Non-NA job markets (the dataset is US/CA only). |
| - Non-tech sectors (banks, healthcare, retail are largely on Workday |
| and not yet in the dataset; Phase 4 of the project plan). |
| - Total compensation (the target is base salary max; equity / bonus |
| are mentioned only as boolean features). |
| - Individual-offer negotiation (the model predicts a posting, not an |
| offer). |
|
|
| ## 8. Training data |
|
|
| - Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs. |
| - Snapshot: 2026-05-08. |
| - Total active rows: 12,334. Disclosed-salary subset: ~6,143 (4,917 train + 1,226 test). |
| - Train / test split: deterministic, hash-keyed by `id` and seed=42, |
| stratified by `(country, source)`. 80/20 split. Test row IDs frozen |
| at `data/eda/test_split_ids.json`. |
|
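One way to build a deterministic, hash-keyed, stratified split is sketched below. It is an assumption about the general technique, not the repository's actual code: the hash function, the key format, and the helper names `hash_key` and `stratified_hash_split` are all hypothetical.

```python
import hashlib
from collections import defaultdict

def hash_key(row_id, seed=42):
    """Stable sort key derived from the seed and the row id."""
    return hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()

def stratified_hash_split(rows, test_frac=0.20, seed=42):
    """rows: iterable of (id, country, source). Returns the set of test ids."""
    by_stratum = defaultdict(list)
    for row_id, country, source in rows:
        by_stratum[(country, source)].append(row_id)
    test_ids = set()
    for ids in by_stratum.values():
        # Order each stratum by hash, then take the first test_frac of it.
        ordered = sorted(ids, key=lambda i: hash_key(i, seed))
        n_test = round(test_frac * len(ordered))
        test_ids.update(ordered[:n_test])
    return test_ids

# Hypothetical rows: 900 US and 100 CA postings from one source.
rows = [(f"job-{i}", "US" if i % 10 else "CA", "greenhouse") for i in range(1000)]
test_ids = stratified_hash_split(rows)
print(len(test_ids))  # 200
```

Keying on a hash of the id, rather than on row order or a random generator's state, keeps a row's position in the ordering stable across re-runs and re-orderings. Freezing the resulting ids to a file, as the card describes, then guards against drift as new rows arrive.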
|
| ## 9. Limitations |
|
|
| - **Workday gap** — major employers (Snowflake, Coinbase, Shopify, |
| Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't |
| yet support. Their salaries are missing from training data. |
| - **No total-comp signal** — the regressor sees `offers_equity` |
| (boolean) and `bonus_mentioned` (boolean) but cannot distinguish a |
| $200k base + $100k equity package from $300k cash. |
| - **Title-derived seniority/role family is regex-noisy** — ~70% of |
| rows label as `Other` for role family. Phase 4 will replace these |
| with DeBERTa classifiers. |
| - **Description text not yet used** — bge-m3 dense embedding lands in |
| Phase 5. The current model is purely tabular. |
|
|
| ## 10. Reproducibility |
|
|
| ```bash |
| git clone https://github.com/Arjun10g/na-tech-jobs |
| cd na-tech-jobs |
| uv sync --extra ml --extra eda --group dev |
| uv run python -m models.salary.train |
| ``` |
|
|
| This will rebuild the dataset, re-run the ladder, refit the winning |
| tier on the same frozen split, and reproduce the metrics above. Random |
| seeds are fixed throughout (`42` for split, Optuna sampler, RF, XGB). |
|
|
| ## Citation |
|
|
| > Ghumman, A. (2026). _na-tech-jobs salary regressor v1._ |
| > https://huggingface.co/arjun10g/na-tech-jobs-salary-v1 |
|
|