---
license: mit
library_name: scikit-learn
tags:
- regression
- salary-prediction
- north-america
- tabular
metrics:
- mae
- mape
- r2
model-index:
- name: na-tech-jobs-salary-v1
  results:
  - task:
      type: regression
      name: Salary Regression
    dataset:
      name: arjun10g/na-tech-jobs (curated/jobs.parquet)
      type: arjun10g/na-tech-jobs
    metrics:
    - type: mae
      name: MAE (USD/year)
      value: 29091
    - type: mape
      name: MAPE (%)
      value: 14.73
    - type: r2
      name: R² (log scale)
      value: 0.7300
---

# na-tech-jobs salary regressor v1

Predicts the **maximum disclosed salary** of a North American senior tech job posting, in USD per year, given tabular features from the [`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs) weekly snapshot.

## 1. Headline metrics

| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
|---|---|---|---|---|
| MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 |
| MAPE | 14.73% | n/a | 15.62% | n/a |
| R² (on log10 target) | 0.7300 | n/a | 0.7051 | n/a |
| n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |

The CLAUDE.md §7 target was **MAE < $25k USD/year**. It has not yet been hit (current: $29,091).

The two columns answer different questions: **test-MAE** measures generalization to the frozen 20% holdout (a single draw); **CV-MAE** is the average of 5 out-of-fold MAEs on the training set, which captures **split variance**. When the two agree, we have evidence that the test draw was representative; if test-MAE differs materially from CV-MAE, we would suspect either a lucky test draw or overfitting. Here the test MAE sits $1,442 below the CV MAE and the two bootstrap intervals overlap.

## 2. The ladder

We trained six tiers from a constant baseline up to XGBoost, all evaluated on the **same frozen test set**. The selected model is `tier5_xgboost_optuna`.

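The MAE figures and 95% bootstrap CIs quoted in this card can be read as a point estimate plus an interval obtained by resampling test rows. The following is a minimal sketch of one common way to do that, a percentile bootstrap; the function name `bootstrap_mae_ci`, the 1,000 resamples, and the synthetic data are illustrative assumptions and are not taken from the project's training code.

```python
import random
from statistics import fmean


def bootstrap_mae_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=42):
    """Return (mae, ci_low, ci_high) from a percentile bootstrap over rows.

    Hypothetical helper for illustration; the resample count and the
    percentile method are assumptions, not the project's documented choice.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    rng = random.Random(seed)
    n = len(abs_err)
    # Each bootstrap replicate resamples n absolute errors with replacement
    # and records their mean; sorting lets us read off order statistics.
    boot = sorted(fmean(rng.choices(abs_err, k=n)) for _ in range(n_resamples))
    # Percentile interval: the alpha/2 and 1 - alpha/2 order statistics.
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return fmean(abs_err), boot[lo_idx], boot[hi_idx]


# Synthetic stand-in data in USD/year (NOT the real test set).
_rng = random.Random(0)
y_true = [_rng.uniform(80_000, 300_000) for _ in range(500)]
y_pred = [t + _rng.gauss(0, 30_000) for t in y_true]

mae, ci_low, ci_high = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE ${mae:,.0f} (95% CI ${ci_low:,.0f}-${ci_high:,.0f})")
```

Resampling the per-row absolute errors, rather than refitting the model, keeps the interval a statement about test-set sampling noise only, which is why every tier can be compared on the same frozen rows.
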
| tier                  | mae     | mae_ci          | cv_mae   | cv_mae_ci       |   mape_pct |   r2_log |   n_test |
|:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
| tier0_constant        | $60,509 | $57,405-$63,536 | $62,279  | $60,528-$63,759 |      33.65 |  -0.0001 |     1226 |
| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623  | $59,939-$63,068 |      32.89 |   0.0447 |     1226 |
| tier2_mincer_ols      | $51,322 | $48,366-$54,199 | $52,041  | $50,416-$53,389 |      27.36 |   0.2827 |     1226 |
| tier3_ridge_full      | $43,199 | $40,907-$45,464 | $42,179  | $40,951-$43,236 |      23.25 |   0.462  |     1226 |
| tier4_random_forest   | $35,935 | $33,906-$38,129 | $37,016  | $35,799-$38,046 |      19.02 |   0.6151 |     1226 |
| tier5_xgboost_optuna  | $29,091 | $27,095-$31,157 | $30,533  | $29,409-$31,537 |      14.73 |   0.73   |     1226 |

The leaderboard answers the question "is the gain from XGBoost worth its complexity?" Read down the bootstrap CIs to see where they overlap: on the test set only tier0 and tier1 overlap, and from tier2 onward each tier's interval sits entirely below the previous one.

## 3. Honest framing: selection bias & MNAR

The model is trained on the **disclosed-salary subset only** (~50% of NA tech postings). Disclosure is **not random**:

- **Mandated jurisdictions** (CA / NY / WA / CO / CT / MD / IL / HI; ON / BC / PEI) have higher disclosure rates by law.
- **Voluntary disclosure** is concentrated at transparency-leaning employers (Stripe, Anthropic, Anduril, and Cohere are over-represented in the disclosed sample).
- **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means the unobserved component depends on the latent salary itself, i.e. the process is MNAR (missing not at random).

**Therefore the model predicts "salary as priced by disclosing employers in our corpus", not ground truth.** A 2-stage Heckman correction is flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.

## 4. Inputs

Tabular features only (no text yet; `description_md` enters via bge-m3 embeddings in Phase 5).
See [`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md) for full definitions and [`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md) for encoding choices.

## 5. Stratified evaluation

Per-stratum MAE on the test set (1,226 rows total):

| stratum       |    n | mae     | mae_ci_low   | mae_ci_high   |   mape_pct |   r2_log |
|:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
| US/greenhouse | 1117 | $29,306 | $27,393      | $31,211       |      14.47 |   0.7273 |
| CA/greenhouse |   56 | $28,610 | $20,222      | $39,174       |      21.96 |   0.673  |
| US/lever      |   32 | $10,276 | $7,051       | $13,835       |       7.92 |   0.7615 |
| US/ashby      |   21 | $47,632 | $27,030      | $74,180       |      19.47 |  -0.0792 |

CA strata are small (a few hundred rows overall, 56 in the test set), so any claim about CA performance should be read with that caveat: the CA/greenhouse interval spans $20,222 to $39,174. The choice to **emphasise uncertainty directly** via bootstrap CIs (rather than a Cohen-style power analysis) is documented in `LITERATURE_REVIEW.md` §15.3 #15.

## 6. Intended use

- **Recruiters and candidates**: rough salary anchoring for a given role + location + seniority bucket.
- **The builder**: their own NA senior-DS job search.
- **Researchers**: a transparent baseline for compensation prediction using only public ATS data.

## 7. Out-of-scope

- Non-NA job markets (the dataset is US/CA only).
- Non-tech sectors (banks, healthcare, and retail are largely on Workday and not yet in the dataset; they are planned for Phase 4 of the project plan).
- Total compensation (the target is the base salary maximum; equity and bonus appear only as boolean features).
- Individual-offer negotiation (the model predicts a posting, not an offer).

## 8. Training data

- Source: weekly ingest from the Greenhouse, Lever, and Ashby ATS APIs.
- Snapshot: 2026-05-08.
- Total active rows: 12,334. Disclosed-salary subset: 6,143 (4,917 train + 1,226 test).
- Train / test split: 80/20, deterministic, hash-keyed by `id` with seed=42, stratified by `(country, source)`.
  Test row IDs are frozen at `data/eda/test_split_ids.json`.

## 9. Limitations

- **Workday gap**: major employers (Snowflake, Coinbase, Shopify, Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't yet support. Their salaries are missing from the training data.
- **No total-comp signal**: the regressor sees `offers_equity` (boolean) and `bonus_mentioned` (boolean) but cannot distinguish a $200k base + $100k equity package from $300k cash.
- **Title-derived seniority/role family is regex-noisy**: about 70% of rows are labeled `Other` for role family. Phase 4 will replace these with DeBERTa classifiers.
- **Description text not yet used**: the bge-m3 dense embedding lands in Phase 5. The current model is purely tabular.

## 10. Reproducibility

```bash
git clone https://github.com/Arjun10g/na-tech-jobs
cd na-tech-jobs
uv sync --extra ml --extra eda --group dev
uv run python -m models.salary.train
```

This rebuilds the dataset, re-runs the ladder, refits the winning tier on the same frozen split, and reproduces the metrics above. Random seeds are fixed throughout (`42` for the split, Optuna sampler, RF, and XGB).

## Citation

> Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
> https://huggingface.co/arjun10g/na-tech-jobs-salary-v1