license: mit
library_name: scikit-learn
tags:
  - regression
  - salary-prediction
  - north-america
  - tabular
metrics:
  - mae
  - mape
  - r2
model-index:
  - name: na-tech-jobs-salary-v1
    results:
      - task:
          type: regression
          name: Salary Regression
        dataset:
          name: arjun10g/na-tech-jobs (curated/jobs.parquet)
          type: arjun10g/na-tech-jobs
        metrics:
          - type: mae
            name: MAE (USD/year)
            value: 29091
          - type: mape
            name: MAPE (%)
            value: 14.73
          - type: r2
            name: R² (log scale)
            value: 0.73

na-tech-jobs salary regressor — v1

Predicts the maximum disclosed salary of a North American senior tech job posting, in USD per year, given tabular features from the arjun10g/na-tech-jobs weekly snapshot.

1. Headline metrics

Metric	Test-set value	95% bootstrap CI	5-fold CV-OOF on train	CV 95% bootstrap CI
MAE (USD / year)	$29,091	$27,095 – $31,157	$30,533	$29,409 – $31,537
MAPE	14.73%	—	15.62%	—
R² (on log10 target)	0.7300	—	0.7051	—
n	1,226	(frozen test, stratified by `country × source`)	4,917	(5-fold OOF)
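As a rough sketch of how the quantities in this table could be computed: MAE and MAPE are taken on the USD scale, R² on the log10 of the target, and the interval comes from a percentile bootstrap over test rows. The function names, the 2,000-resample count, the percentile method, and the synthetic data are all assumptions for illustration, not taken from the training code.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error on the original (USD/year) scale."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape_pct(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / y_true))

def r2_log10(y_true, y_pred):
    """Coefficient of determination computed on log10 of target and prediction."""
    t, p = np.log10(y_true), np.log10(y_pred)
    ss_res = np.sum((t - p) ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for MAE: resample rows with replacement."""
    rng = np.random.default_rng(seed)
    abs_err = np.abs(y_true - y_pred)
    n = abs_err.size
    idx = rng.integers(0, n, size=(n_boot, n))   # one row of indices per resample
    boot_maes = abs_err[idx].mean(axis=1)
    lo, hi = np.quantile(boot_maes, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic stand-in for test-set targets and predictions (NOT the real data).
rng = np.random.default_rng(0)
y_true = 10 ** rng.normal(5.25, 0.15, size=1226)
y_pred = y_true * 10 ** rng.normal(0.0, 0.06, size=1226)

point = mae(y_true, y_pred)
ci_low, ci_high = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE ${point:,.0f} (95% CI ${ci_low:,.0f} to ${ci_high:,.0f})")
print(f"MAPE {mape_pct(y_true, y_pred):.2f}%  R2(log10) {r2_log10(y_true, y_pred):.4f}")
```

Resampling the absolute errors directly is equivalent to resampling (target, prediction) pairs for MAE and avoids recomputing predictions inside the loop.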

The target set in CLAUDE.md §7 was MAE < $25k USD/year. The current model has not yet reached it (test MAE: $29,091).

The test and CV columns answer different questions. Test MAE measures generalization to the frozen 20% holdout, which is a single draw. CV MAE is the average of 5 out-of-fold MAEs on the training set, so it also reflects split-to-split variance. When the two agree, we have evidence that the test draw was representative; if test MAE differed materially from CV MAE, we would suspect either an unrepresentative test draw or overfitting. Here the two 95% CIs overlap ($27,095 – $31,157 for test vs. $29,409 – $31,537 for CV).
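The cross-validated figure can be illustrated with a minimal sketch: split the training rows into 5 folds, fit on 4, score on the held-out fold, and average the 5 fold MAEs. The least-squares model, the helper name `oof_fold_maes`, and the synthetic data below are stand-ins chosen for this example; any tier of the ladder would slot into the same loop.

```python
import numpy as np

def oof_fold_maes(X, y, n_folds=5, seed=42):
    """Return one out-of-fold MAE per fold (fit on k-1 folds, score on the rest)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    maes = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Stand-in model: ordinary least squares with an intercept column.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        maes.append(float(np.mean(np.abs(y[val_idx] - A_val @ coef))))
    return maes

# Synthetic training set with the same row count as the card's train split.
rng = np.random.default_rng(0)
X = rng.normal(size=(4917, 3))
y = 150_000 + X @ np.array([30_000.0, -10_000.0, 5_000.0]) + rng.normal(0, 25_000, size=4917)

fold_maes = oof_fold_maes(X, y)
cv_mae = float(np.mean(fold_maes))
print([round(m) for m in fold_maes], round(cv_mae))
```

The spread of `fold_maes` around `cv_mae` is the split variance that a single frozen holdout cannot show.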

2. The ladder

We trained six tiers, from a constant baseline up to XGBoost, and evaluated all of them on the same frozen test set. The selected model is tier5_xgboost_optuna.

tier	mae	mae_ci	cv_mae	cv_mae_ci	mape_pct	r2_log	n_test
tier0_constant	$60,509	$57,405-$63,536	$62,279	$60,528-$63,759	33.65	-0.0001	1226
tier1_stratified_mean	$59,589	$56,469-$62,505	$61,623	$59,939-$63,068	32.89	0.0447	1226
tier2_mincer_ols	$51,322	$48,366-$54,199	$52,041	$50,416-$53,389	27.36	0.2827	1226
tier3_ridge_full	$43,199	$40,907-$45,464	$42,179	$40,951-$43,236	23.25	0.462	1226
tier4_random_forest	$35,935	$33,906-$38,129	$37,016	$35,799-$38,046	19.02	0.6151	1226
tier5_xgboost_optuna	$29,091	$27,095-$31,157	$30,533	$29,409-$31,537	14.73	0.73	1226

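The leaderboard answers the question "is the gain from XGBoost worth its complexity?" Read down the bootstrap CIs to see where they overlap: on the test set only tier0 and tier1 do, and every later step, including random forest to XGBoost, moves the interval clear of the previous one.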
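Reading the intervals by eye is easy to get wrong, so here is a small sketch that checks adjacent tiers for overlap. The CI bounds are copied from the mae_ci column of the table above; the helper `intervals_overlap` is a name introduced for this example.

```python
# Test-set MAE 95% bootstrap CIs (USD/year), copied from the ladder table.
ladder = [
    ("tier0_constant", 57405, 63536),
    ("tier1_stratified_mean", 56469, 62505),
    ("tier2_mincer_ols", 48366, 54199),
    ("tier3_ridge_full", 40907, 45464),
    ("tier4_random_forest", 33906, 38129),
    ("tier5_xgboost_optuna", 27095, 31157),
]

def intervals_overlap(a_low, a_high, b_low, b_high):
    """True if the closed intervals [a_low, a_high] and [b_low, b_high] intersect."""
    return a_low <= b_high and b_low <= a_high

overlaps = {}
for (prev, p_lo, p_hi), (curr, c_lo, c_hi) in zip(ladder, ladder[1:]):
    overlaps[(prev, curr)] = intervals_overlap(p_lo, p_hi, c_lo, c_hi)
    status = "overlap" if overlaps[(prev, curr)] else "separated"
    print(f"{prev} -> {curr}: {status}")
```

Comparing marginal intervals is a conservative check. Because every tier is scored on the same test rows, a paired bootstrap of the per-row difference in absolute error would give a sharper comparison.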

3. Honest framing — selection bias & MNAR

The model is trained on the disclosed-salary subset only (~50% of NA tech postings). Disclosure is not random:

Mandated jurisdictions (CA / NY / WA / CO / CT / MD / IL / HI; ON / BC / PEI) have higher disclosure rates by law.
Voluntary disclosure is concentrated at transparency-leaning employers (Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
Strategic non-disclosure (Cullen & Pakzad-Hurson 2023) means that whether a salary is observed depends on the latent salary itself, i.e. the data are missing not at random (MNAR).

Therefore the model predicts "salary as priced by disclosing employers in our corpus", not ground truth. A 2-stage Heckman correction is flagged in LITERATURE_REVIEW.md §1.2 as a v2 deliverable.
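To make the MNAR point concrete, the following simulation shows how a disclosed-only sample can drift away from the full population when disclosure depends on the latent salary. Everything in it is synthetic: the salary distribution, the logistic disclosure rule, and in particular the direction of the dependence (higher salaries disclosed more often) are assumptions picked for illustration, and the card makes no claim about the true direction.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Latent log10 salary for every posting (synthetic; not fitted to the corpus).
log_salary = rng.normal(loc=5.2, scale=0.15, size=n)

# ASSUMPTION: the probability of disclosure rises with the latent salary.
# The sign and strength of this link are invented for the example.
z = (log_salary - log_salary.mean()) / log_salary.std()
p_disclose = 1.0 / (1.0 + np.exp(-z))
disclosed = rng.random(n) < p_disclose

# Geometric means, i.e. means taken on the log10 scale and mapped back to USD.
population_mean = 10 ** log_salary.mean()
disclosed_mean = 10 ** log_salary[disclosed].mean()

print(f"share disclosed:    {disclosed.mean():.1%}")
print(f"population (all):   ${population_mean:,.0f}")
print(f"disclosed only:     ${disclosed_mean:,.0f}")
```

A model fitted to the `disclosed` rows alone learns the second number, not the first, which is the gap a selection correction such as the Heckman approach is meant to address.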

4. Inputs

Tabular features (no text yet — description_md enters via bge-m3 embeddings in Phase 5). See DATA_DICTIONARY.md for full definitions and LITERATURE_REVIEW.md §14 for encoding choices.

5. Stratified evaluation

Per-stratum MAE on the test set (1,226 rows total):

stratum	n	mae	mae_ci_low	mae_ci_high	mape_pct	r2_log
US/greenhouse	1117	$29,306	$27,393	$31,211	14.47	0.7273
CA/greenhouse	56	$28,610	$20,222	$39,174	21.96	0.673
US/lever	32	$10,276	$7,051	$13,835	7.92	0.7615
US/ashby	21	$47,632	$27,030	$74,180	19.47	-0.0792

CA strata are small (a few hundred rows overall, 56 in the test set), so the CA interval is wide and claims about CA performance should be read with that caveat. The same applies to US/lever (n = 32) and US/ashby (n = 21). The choice to emphasise uncertainty directly via bootstrap CIs (rather than a Cohen-style power analysis) is documented in LITERATURE_REVIEW.md §15.3 #15.
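As a consistency check on the table, MAE decomposes across disjoint strata: the n-weighted average of the per-stratum MAEs should reproduce the overall test MAE, up to rounding of the published stratum values. The numbers below are copied from the table above.

```python
# (stratum, n_test, MAE in USD/year), copied from the stratified table.
strata = [
    ("US/greenhouse", 1117, 29306),
    ("CA/greenhouse", 56, 28610),
    ("US/lever", 32, 10276),
    ("US/ashby", 21, 47632),
]

n_total = sum(n for _, n, _ in strata)
weighted_mae = sum(n * mae for _, n, mae in strata) / n_total
us_greenhouse_share = strata[0][1] / n_total

print(n_total)                        # 1226
print(round(weighted_mae))            # 29091
print(f"{us_greenhouse_share:.1%}")   # 91.1%
```

Since US/greenhouse supplies about 91% of the test rows, the headline MAE is close to a US/greenhouse MAE, and the smaller strata have little influence on it.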

6. Intended use

Recruiters and candidates: rough salary anchoring for a given role + location + seniority bucket.
The builder: their own NA senior-DS job search.
Researchers: a transparent baseline for compensation prediction using only public ATS data.

7. Out-of-scope

Non-NA job markets (the dataset is US/CA only).
Non-tech sectors (banks, healthcare, retail are largely on Workday and not yet in the dataset; Phase 4 of the project plan).
Total compensation (the target is the base salary maximum; equity and bonus enter only as boolean features).
Individual-offer negotiation (the model predicts a posting, not an offer).

8. Training data

Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
Snapshot: 2026-05-08.
Total active rows: 12,334. Disclosed-salary subset: ~6,143.
Train / test split: deterministic, hash-keyed by id and seed=42, stratified by (country, source). 80/20 split. Test row IDs frozen at data/eda/test_split_ids.json.
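One way a deterministic, hash-keyed, stratified 80/20 split can be built is sketched below. This is not the project's actual splitter: the helper names, the SHA-256 choice, the `seed:id` key format, and the toy rows are assumptions for illustration. Within each (country, source) stratum, rows are ranked by a hash of their id and the lowest-ranked 20% go to the test set.

```python
import hashlib
from collections import defaultdict

def hash_unit(row_id: str, seed: int = 42) -> float:
    """Map (seed, id) to a reproducible pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16) / 16**16

def stratified_hash_split(rows, test_frac=0.2, seed=42):
    """Return the set of test ids, taking test_frac of each (country, source) stratum."""
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[(row["country"], row["source"])].append(row["id"])
    test_ids = set()
    for ids in by_stratum.values():
        # Rank by hash so the choice depends only on ids and seed, not row order.
        ranked = sorted(ids, key=lambda i: hash_unit(i, seed))
        test_ids.update(ranked[: round(test_frac * len(ranked))])
    return test_ids

# Toy rows (hypothetical ids and strata, not the real dataset).
rows = [
    {
        "id": f"job-{i}",
        "country": "US" if i % 5 else "CA",
        "source": "greenhouse" if i % 3 else "lever",
    }
    for i in range(1000)
]
test_ids = stratified_hash_split(rows)
print(len(test_ids), len(rows) - len(test_ids))
```

Ranking within strata gives exact proportions but lets membership shift when new rows arrive, which is one reason to freeze the resulting ids in a file. A plain threshold such as `hash_unit(row_id) < 0.2` keeps each row's assignment stable as data grows, at the cost of only approximate stratum proportions.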

9. Limitations

Workday gap — major employers (Snowflake, Coinbase, Shopify, Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't yet support. Their salaries are missing from training data.
No total-comp signal — the regressor sees offers_equity (boolean) and bonus_mentioned (boolean) but cannot distinguish a $200k base + $100k equity package from $300k cash.
Title-derived seniority and role family are regex-noisy: ~70% of rows are labelled Other for role family. Phase 4 will replace the regexes with DeBERTa classifiers.
Description text not yet used — bge-m3 dense embedding lands in Phase 5. The current model is purely tabular.

10. Reproducibility

git clone https://github.com/Arjun10g/na-tech-jobs
cd na-tech-jobs
uv sync --extra ml --extra eda --group dev
uv run python -m models.salary.train

This will rebuild the dataset, re-run the ladder, refit the winning tier on the same frozen split, and reproduce the metrics above. Random seeds are fixed throughout (42 for split, Optuna sampler, RF, XGB).

Citation

Ghumman, A. (2026). na-tech-jobs salary regressor v1. https://huggingface.co/arjun10g/na-tech-jobs-salary-v1