	---
	license: mit
	library_name: scikit-learn
	tags:
	- regression
	- salary-prediction
	- north-america
	- tabular
	metrics:
	- mae
	- mape
	- r2
	model-index:
	- name: na-tech-jobs-salary-v1
	  results:
	  - task:
	      type: regression
	      name: Salary Regression
	    dataset:
	      name: arjun10g/na-tech-jobs (curated/jobs.parquet)
	      type: arjun10g/na-tech-jobs
	    metrics:
	    - type: mae
	      name: MAE (USD/year)
	      value: 29091
	    - type: mape
	      name: MAPE (%)
	      value: 14.73
	    - type: r2
	      name: R² (log scale)
	      value: 0.7300
	---

	# na-tech-jobs salary regressor — v1

	Predicts the upper bound of the disclosed base-salary range for a North
	American senior tech job posting, in USD per year, from the tabular
	features in the
	[`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs)
	weekly snapshot.

	## 1. Headline metrics

	| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
	|---|---|---|---|---|
	| MAE (USD / year) | $29,091 | $27,095 – $31,157 | $30,533 | $29,409 – $31,537 |
	| MAPE | 14.73% | — | 15.62% | — |
	| R² (on log10 target) | 0.7300 | — | 0.7051 | — |
	| n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |

	The target set in `CLAUDE.md` §7 is MAE < $25k USD/year. v1 does not
	meet it yet (current: $29,091).
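
As a rough sketch of how the three headline metrics relate, the snippet below computes MAE and MAPE on the dollar scale and R² on log10 salaries. It assumes predictions have already been converted back to USD/year; the function name `regression_metrics` and the toy numbers are illustrative and are not taken from the project code.

```python
import numpy as np

def regression_metrics(y_true_usd, y_pred_usd):
    """Return MAE (USD), MAPE (%), and R² computed on log10 salaries."""
    y_true = np.asarray(y_true_usd, dtype=float)
    y_pred = np.asarray(y_pred_usd, dtype=float)

    abs_err = np.abs(y_true - y_pred)
    mae = abs_err.mean()
    mape_pct = 100.0 * (abs_err / y_true).mean()

    # R² on the log10 scale: 1 - SS_res / SS_tot.
    log_true, log_pred = np.log10(y_true), np.log10(y_pred)
    ss_res = np.sum((log_true - log_pred) ** 2)
    ss_tot = np.sum((log_true - log_true.mean()) ** 2)
    r2_log = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mape_pct": mape_pct, "r2_log": r2_log}

# Toy salaries in USD/year, invented for illustration only.
y_true = [100_000, 150_000, 200_000, 250_000]
y_pred = [110_000, 140_000, 210_000, 240_000]
m = regression_metrics(y_true, y_pred)
print({k: round(float(v), 4) for k, v in m.items()})
```

Reporting R² on the log scale while reporting MAE in dollars means the two numbers weight errors differently: a $10k miss on a $100k posting costs more in log terms than the same miss on a $250k posting.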

	The two columns answer different questions. Test MAE measures
	generalization to the frozen 20% holdout, which is a single draw. CV MAE
	is the average of 5 out-of-fold MAEs on the training set and so captures
	split variance. When the two agree we have evidence the test draw was
	representative; if test MAE materially differs from CV MAE we'd suspect
	either a lucky test draw or overfitting. Here test MAE ($29,091) sits
	$1,442 below CV MAE ($30,533) and the two 95% intervals overlap, which
	gives no strong sign of either.
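
The bootstrap intervals quoted above can be approximated with a percentile bootstrap over rows. The sketch below is one common way to do it and is not necessarily the procedure used for this card: the replicate count, the percentile method, and the synthetic data are all assumptions.

```python
import numpy as np

def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for MAE: resample rows with replacement."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    rng = np.random.default_rng(seed)
    n = abs_err.size
    # Each bootstrap replicate draws n row indices with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_maes = abs_err[idx].mean(axis=1)
    low, high = np.quantile(boot_maes, [alpha / 2, 1 - alpha / 2])
    return abs_err.mean(), low, high

# Synthetic stand-in for a 1,226-row test set (not project data).
rng = np.random.default_rng(0)
y_true = rng.uniform(80_000, 300_000, size=1_226)
y_pred = y_true + rng.normal(0, 35_000, size=1_226)

mae, low, high = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE {mae:,.0f} (95% CI {low:,.0f} to {high:,.0f})")
```

Resampling the per-row absolute errors, rather than refitting the model, measures only the sampling noise of the test draw; it says nothing about training-set variance, which is what the CV column is for.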

	## 2. The ladder

	We trained six tiers from a constant baseline up to XGBoost, all on the
	same frozen test set. The selected model is `tier5_xgboost_optuna`.

	| Tier | Test MAE | Test MAE 95% CI | CV MAE | CV MAE 95% CI | MAPE (%) | R² (log10) | n (test) |
	|:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
	| tier0_constant | $60,509 | $57,405-$63,536 | $62,279 | $60,528-$63,759 | 33.65 | -0.0001 | 1226 |
	| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623 | $59,939-$63,068 | 32.89 | 0.0447 | 1226 |
	| tier2_mincer_ols | $51,322 | $48,366-$54,199 | $52,041 | $50,416-$53,389 | 27.36 | 0.2827 | 1226 |
	| tier3_ridge_full | $43,199 | $40,907-$45,464 | $42,179 | $40,951-$43,236 | 23.25 | 0.4620 | 1226 |
	| tier4_random_forest | $35,935 | $33,906-$38,129 | $37,016 | $35,799-$38,046 | 19.02 | 0.6151 | 1226 |
	| tier5_xgboost_optuna | $29,091 | $27,095-$31,157 | $30,533 | $29,409-$31,537 | 14.73 | 0.7300 | 1226 |

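	The leaderboard answers the question "is the gain from XGBoost worth its
	complexity?" Reading down the bootstrap CIs shows where they overlap:
	only tier0 and tier1 do, and every later tier's test-MAE interval lies
	entirely below its predecessor's.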
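
The overlap reading can be checked mechanically. The snippet below copies the test-set MAE intervals from the ladder table and compares each tier with the one before it; the helper name `intervals_overlap` is illustrative.

```python
# (tier, CI low, CI high) for test-set MAE in USD/year, copied from the ladder table.
ladder = [
    ("tier0_constant", 57_405, 63_536),
    ("tier1_stratified_mean", 56_469, 62_505),
    ("tier2_mincer_ols", 48_366, 54_199),
    ("tier3_ridge_full", 40_907, 45_464),
    ("tier4_random_forest", 33_906, 38_129),
    ("tier5_xgboost_optuna", 27_095, 31_157),
]

def intervals_overlap(a_low, a_high, b_low, b_high):
    """True when the closed intervals [a_low, a_high] and [b_low, b_high] intersect."""
    return a_low <= b_high and b_low <= a_high

overlaps = {}
for (prev, p_lo, p_hi), (curr, c_lo, c_hi) in zip(ladder, ladder[1:]):
    overlaps[(prev, curr)] = intervals_overlap(p_lo, p_hi, c_lo, c_hi)
    status = "overlap" if overlaps[(prev, curr)] else "separated"
    print(f"{prev} -> {curr}: {status}")
```

Non-overlapping intervals are a conservative heuristic. Because every tier is scored on the same frozen test rows, a paired bootstrap on the per-row difference in absolute error would be a sharper comparison.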

	## 3. Honest framing — selection bias & MNAR

	The model is trained on the disclosed-salary subset only (about 50% of
	the postings in the corpus: ~6,143 of 12,334 active rows). Disclosure is
	not random:

	- Mandated jurisdictions (CA / NY / WA / CO / CT / MD / IL / HI;
	ON / BC / PEI) have higher disclosure rates by law.
	- Voluntary disclosure is concentrated at transparency-leaning employers
	(Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
	- Strategic non-disclosure (Cullen & Pakzad-Hurson 2023) means that
	whether a salary is observed depends on the latent salary itself, i.e.
	the data are missing not at random (MNAR).

	**Therefore the model predicts "salary as priced by disclosing employers
	in our corpus", not ground truth.** A 2-stage Heckman correction is
	flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.
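
To see why MNAR disclosure matters, the toy simulation below draws synthetic log10 salaries and lets the probability of disclosure depend on the salary itself. Everything in it is assumed for illustration: the salary distribution, the logistic disclosure rule, and in particular the direction of the effect (higher pay, less disclosure). It is not an estimate of the bias in this dataset and it is not a Heckman correction.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Synthetic latent salaries on the log10 scale (illustrative, not project data).
log_salary = rng.normal(loc=5.2, scale=0.15, size=n)

# MNAR: disclosure probability depends on the latent salary itself.
# The negative slope (higher pay -> less disclosure) is an assumed direction.
z = (log_salary - 5.2) / 0.15
p_disclose = 1.0 / (1.0 + np.exp(z))
disclosed = rng.random(n) < p_disclose

# Geometric means in USD/year, for the full population and the disclosed subset.
population_mean = 10 ** log_salary.mean()
disclosed_mean = 10 ** log_salary[disclosed].mean()

print(f"share disclosed:          {disclosed.mean():.2%}")
print(f"population geometric mean: {population_mean:,.0f}")
print(f"disclosed geometric mean:  {disclosed_mean:,.0f}")
```

Under this assumed rule about half the postings disclose, yet the disclosed subset is shifted relative to the population. A model fit only to disclosed rows learns that shifted distribution, which is the sense in which it predicts salary as priced by disclosing employers.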

	## 4. Inputs

	Tabular features (no text yet — `description_md` enters via bge-m3
	embeddings in Phase 5). See
	[`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md)
	for full definitions and
	[`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md)
	for encoding choices.

	## 5. Stratified evaluation

	Per-stratum MAE on the test set (1,226 rows total):

	| Stratum (country/source) | n | MAE | MAE 95% CI low | MAE 95% CI high | MAPE (%) | R² (log10) |
	|:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
	| US/greenhouse | 1117 | $29,306 | $27,393 | $31,211 | 14.47 | 0.7273 |
	| CA/greenhouse | 56 | $28,610 | $20,222 | $39,174 | 21.96 | 0.6730 |
	| US/lever | 32 | $10,276 | $7,051 | $13,835 | 7.92 | 0.7615 |
	| US/ashby | 21 | $47,632 | $27,030 | $74,180 | 19.47 | -0.0792 |

	CA strata are small (a few hundred rows overall, 56 in the test set), as
	are US/lever (32) and US/ashby (21), so per-stratum point estimates there
	are unstable and their CIs are wide. The choice to **emphasise
	uncertainty directly** via bootstrap CIs (rather than a Cohen-style power
	analysis) is documented in `LITERATURE_REVIEW.md` §15.3 #15.
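
A per-stratum table like the one above can be produced with a group-by over the test predictions. The sketch below uses synthetic rows with the same stratum sizes as the table; the column names (`country`, `source`, `y_true`, `y_pred`) and the error distribution are assumptions for illustration, not the project's schema or results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stratum sizes match the table; the salaries and errors are synthetic.
sizes = {("US", "greenhouse"): 1117, ("CA", "greenhouse"): 56,
         ("US", "lever"): 32, ("US", "ashby"): 21}

frames = []
for (country, source), n in sizes.items():
    y_true = rng.uniform(80_000, 300_000, size=n)
    frames.append(pd.DataFrame({
        "country": country,
        "source": source,
        "y_true": y_true,
        "y_pred": y_true + rng.normal(0, 35_000, size=n),
    }))
test_df = pd.concat(frames, ignore_index=True)

# Absolute error per row, then row count and MAE per (country, source) stratum.
test_df["abs_err"] = (test_df["y_true"] - test_df["y_pred"]).abs()
per_stratum = (
    test_df.groupby(["country", "source"])["abs_err"]
    .agg(n="count", mae="mean")
    .sort_values("n", ascending=False)
)
print(per_stratum.round(0))
```

Even though every synthetic stratum shares one error distribution here, the small strata will show MAEs that wander further from the large-stratum value, which is the instability the caveat above describes.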

	## 6. Intended use

	- Recruiters and candidates: rough salary anchoring for a given role,
	location, and seniority bucket.
	- The author: their own North American senior data science job search.
	- Researchers: a transparent baseline for compensation prediction
	using only public ATS data.

	## 7. Out-of-scope

	- Non-NA job markets (the dataset is US/CA only).
	- Non-tech sectors (banks, healthcare, retail are largely on Workday
	and not yet in the dataset; Phase 4 of the project plan).
	- Total compensation (the target is base salary max; equity / bonus
	are mentioned only as boolean features).
	- Individual-offer negotiation (the model predicts a posting, not an
	offer).

	## 8. Training data

	- Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
	- Snapshot: 2026-05-08.
	- Total active rows: 12,334. Disclosed-salary subset: ~6,143.
	- Train / test split: deterministic, hash-keyed by `id` and seed=42,
	stratified by `(country, source)`. 80/20 split. Test row IDs frozen
	at `data/eda/test_split_ids.json`.
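
One way to get a deterministic, hash-keyed, stratified split is to rank rows within each stratum by a hash of the seed and the row `id`, then send the first 20% of each stratum to the test set. The sketch below illustrates that idea only; the hash function, the key format, and the function names are assumptions, and the repository's actual implementation may differ.

```python
import hashlib
from collections import defaultdict

def hash_key(row_id, seed=42):
    """Map (seed, id) to a stable float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16) / 16**16

def stratified_hash_split(rows, test_frac=0.20, seed=42):
    """rows: iterable of (id, country, source). Returns (train_ids, test_ids)."""
    by_stratum = defaultdict(list)
    for row_id, country, source in rows:
        by_stratum[(country, source)].append(row_id)

    train_ids, test_ids = [], []
    for ids in by_stratum.values():
        # Rank ids by hash so the result does not depend on input order.
        ranked = sorted(ids, key=lambda i: hash_key(i, seed))
        n_test = round(test_frac * len(ranked))
        test_ids.extend(ranked[:n_test])
        train_ids.extend(ranked[n_test:])
    return train_ids, test_ids

# Hypothetical ids: 900 US rows and 100 CA rows, all from one source.
rows = [(f"job-{i}", "US" if i % 10 else "CA", "greenhouse") for i in range(1000)]
train_ids, test_ids = stratified_hash_split(rows)
_, test_ids_reversed = stratified_hash_split(list(reversed(rows)))
print(len(train_ids), len(test_ids))
```

Because membership depends only on the hash ranking within a stratum, shuffling the input does not change the split. Adding new rows to a stratum can change it, which is one reason to freeze the test ids in a file as described above.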

	## 9. Limitations

	- Workday gap — major employers (Snowflake, Coinbase, Shopify,
	Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't
	yet support. Their salaries are missing from training data.
	- No total-comp signal — the regressor sees `offers_equity`
	(boolean) and `bonus_mentioned` (boolean) but cannot distinguish a
	$200k base + $100k equity package from $300k cash.
	- Title-derived seniority/role family is regex-noisy — ~70% of
	rows label as `Other` for role family. Phase 4 will replace these
	with DeBERTa classifiers.
	- Description text not yet used — bge-m3 dense embedding lands in
	Phase 5. The current model is purely tabular.
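
The regex limitation in the list above is easy to reproduce in miniature. The patterns below are hypothetical and are not the project's rules; they only show how a first-match regex classifier falls back to `Other` whenever a title uses wording the patterns did not anticipate.

```python
import re

# Hypothetical patterns; the project's real rules are not shown in this card.
ROLE_PATTERNS = [
    ("Data Science", re.compile(r"\bdata scien(ce|tist)\b", re.I)),
    ("ML Engineering", re.compile(r"\b(machine learning|ml) engineer\b", re.I)),
    ("Software Engineering", re.compile(r"\b(software|backend|frontend) engineer\b", re.I)),
]

def role_family(title):
    """Return the first matching role family, or 'Other' when nothing matches."""
    for family, pattern in ROLE_PATTERNS:
        if pattern.search(title):
            return family
    return "Other"

titles = [
    "Senior Data Scientist",
    "Staff Software Engineer, Payments",
    "Member of Technical Staff",
    "Sr. SWE II",
]
labels = {t: role_family(t) for t in titles}
for title, label in labels.items():
    print(f"{title!r} -> {label}")
```

The last two titles are ordinary engineering postings, yet both land in `Other` because neither contains a phrase the patterns look for.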

	## 10. Reproducibility

	```bash
	git clone https://github.com/Arjun10g/na-tech-jobs
	cd na-tech-jobs
	uv sync --extra ml --extra eda --group dev
	uv run python -m models.salary.train
	```

	This will rebuild the dataset, re-run the ladder, refit the winning
	tier on the same frozen split, and reproduce the metrics above. Random
	seeds are fixed throughout (`42` for split, Optuna sampler, RF, XGB).

	## 11. Citation

	> Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
	> https://huggingface.co/arjun10g/na-tech-jobs-salary-v1