---
license: mit
library_name: scikit-learn
tags:
- regression
- salary-prediction
- north-america
- tabular
metrics:
- mae
- mape
- r2
model-index:
- name: na-tech-jobs-salary-v1
  results:
  - task:
      type: regression
      name: Salary Regression
    dataset:
      name: arjun10g/na-tech-jobs (curated/jobs.parquet)
      type: arjun10g/na-tech-jobs
    metrics:
    - type: mae
      name: MAE (USD/year)
      value: 29091
    - type: mape
      name: MAPE (%)
      value: 14.73
    - type: r2
      name: R² (log scale)
      value: 0.7300
---
# na-tech-jobs salary regressor — v1
Predicts the **maximum disclosed salary** of a North American senior tech
job posting, in USD per year, given tabular features from the
[`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs)
weekly snapshot.
## 1. Headline metrics
| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
|---|---|---|---|---|
| MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 |
| MAPE | 14.73% | — | 15.62% | — |
| R² (on log10 target) | 0.7300 | — | 0.7051 | — |
| n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |

The target set in `CLAUDE.md` §7 is **MAE < $25k USD/year**. The current
model has not reached it: test MAE is $29,091, about $4,100 above target.
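
For readers who want to recompute the headline numbers, the three metrics can be sketched as below. This is a minimal illustration, not the repository's evaluation code: the function names are hypothetical, the toy salaries are made up, and it assumes R² is computed after taking log10 of both the true and predicted salaries.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target (USD/year here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape_pct(y_true, y_pred):
    """Mean absolute percentage error, expressed as a percentage."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_log10(y_true, y_pred):
    """R² computed on log10 of the targets and predictions (assumed convention)."""
    log_true = [math.log10(t) for t in y_true]
    log_pred = [math.log10(p) for p in y_pred]
    mean_log_true = sum(log_true) / len(log_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(log_true, log_pred))
    ss_tot = sum((a - mean_log_true) ** 2 for a in log_true)
    return 1.0 - ss_res / ss_tot

# Toy salaries (hypothetical, not taken from the dataset).
y_true = [100_000, 200_000, 400_000]
y_pred = [110_000, 180_000, 400_000]

print(f"MAE  = ${mae(y_true, y_pred):,.0f}")
print(f"MAPE = {mape_pct(y_true, y_pred):.2f}%")
print(f"R2 (log10) = {r2_log10(y_true, y_pred):.4f}")
```

Under this definition, a constant prediction equal to the geometric mean of the true salaries gives R² of approximately 0, which is the reference point for reading the `r2_log` column.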

The test and CV columns answer different questions: **test MAE** measures
generalization to the frozen 20% holdout (a single draw); **CV MAE** is the
average of 5 out-of-fold MAEs on the training set, which captures **split
variance**. When the two agree, we have evidence that the test draw was
representative; if test MAE differed materially from CV MAE, we would
suspect either a lucky test draw or overfitting. Here the CV MAE ($30,533)
lies inside the test CI ($27,095 – $31,157), so the two estimates are
consistent.
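
The bootstrap CIs in the table can be understood through a sketch like the one below. It is an assumption that the project uses a percentile bootstrap over test rows; the function name, the number of resamples, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import random

def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for MAE: resample per-row errors with replacement."""
    rng = random.Random(seed)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_err)
    boot_maes = []
    for _ in range(n_boot):
        sample = rng.choices(abs_err, k=n)  # n draws with replacement
        boot_maes.append(sum(sample) / n)
    boot_maes.sort()
    low = boot_maes[int((alpha / 2) * n_boot)]
    high = boot_maes[int((1 - alpha / 2) * n_boot) - 1]
    point = sum(abs_err) / n
    return point, low, high

# Synthetic salaries and noisy predictions (hypothetical data).
data_rng = random.Random(0)
y_true = [data_rng.uniform(80_000, 400_000) for _ in range(300)]
y_pred = [t + data_rng.gauss(0, 30_000) for t in y_true]

point, low, high = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE = ${point:,.0f}  (95% CI ${low:,.0f} to ${high:,.0f})")
```

The interval reflects sampling variability of the rows being scored, not variability from refitting the model, which is what the separate CV column is meant to capture.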
## 2. The ladder
We trained six tiers from a constant baseline up to XGBoost, all on the
**same frozen test set**. The selected model is `tier5_xgboost_optuna`.
| tier | mae | mae_ci | cv_mae | cv_mae_ci | mape_pct | r2_log | n_test |
|:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
| tier0_constant | $60,509 | $57,405-$63,536 | $62,279 | $60,528-$63,759 | 33.65 | -0.0001 | 1226 |
| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623 | $59,939-$63,068 | 32.89 | 0.0447 | 1226 |
| tier2_mincer_ols | $51,322 | $48,366-$54,199 | $52,041 | $50,416-$53,389 | 27.36 | 0.2827 | 1226 |
| tier3_ridge_full | $43,199 | $40,907-$45,464 | $42,179 | $40,951-$43,236 | 23.25 | 0.462 | 1226 |
| tier4_random_forest | $35,935 | $33,906-$38,129 | $37,016 | $35,799-$38,046 | 19.02 | 0.6151 | 1226 |
| tier5_xgboost_optuna | $29,091 | $27,095-$31,157 | $30,533 | $29,409-$31,537 | 14.73 | 0.73 | 1226 |
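
The leaderboard answers the question "is the gain from XGBoost worth its
complexity?": read down the bootstrap CIs to see where they overlap. Only
`tier0_constant` and `tier1_stratified_mean` have overlapping test CIs.
Every later tier's interval lies entirely below its predecessor's,
including `tier5_xgboost_optuna` ($27,095-$31,157) against
`tier4_random_forest` ($33,906-$38,129).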
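
The overlap check described above can be done mechanically. The sketch below copies the test-set MAE intervals from the ladder table and compares each tier with the one before it; the helper name `intervals_overlap` is hypothetical.

```python
# (tier, test MAE CI low, test MAE CI high), copied from the ladder table.
ladder = [
    ("tier0_constant", 57_405, 63_536),
    ("tier1_stratified_mean", 56_469, 62_505),
    ("tier2_mincer_ols", 48_366, 54_199),
    ("tier3_ridge_full", 40_907, 45_464),
    ("tier4_random_forest", 33_906, 38_129),
    ("tier5_xgboost_optuna", 27_095, 31_157),
]

def intervals_overlap(a_low, a_high, b_low, b_high):
    """True when the closed intervals [a_low, a_high] and [b_low, b_high] intersect."""
    return a_low <= b_high and b_low <= a_high

# Compare each tier with its immediate predecessor.
overlaps = {}
for (name_a, lo_a, hi_a), (name_b, lo_b, hi_b) in zip(ladder, ladder[1:]):
    overlaps[(name_a, name_b)] = intervals_overlap(lo_a, hi_a, lo_b, hi_b)
    verdict = "overlap" if overlaps[(name_a, name_b)] else "separated"
    print(f"{name_a} vs {name_b}: {verdict}")
```

Non-overlapping intervals are a conservative criterion. Because every tier is scored on the same frozen test rows, a paired bootstrap on per-row error differences would likely be a sharper comparison, though that is not something the card reports.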
## 3. Honest framing — selection bias & MNAR
The model is trained on the **disclosed-salary subset only** (~50% of NA
tech postings). Disclosure is **not random**:
- **Mandated jurisdictions** (CA / NY / WA / CO / CT / MD / IL / HI;
ON / BC / PEI) have higher disclosure rates by law.
- **Voluntary disclosure** is concentrated at transparency-leaning employers
(Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
- **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means the
unobserved component depends on the latent salary itself — i.e. the
process is MNAR.
**Therefore the model predicts "salary as priced by disclosing employers
in our corpus", not ground truth.** A 2-stage Heckman correction is
flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.
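
A small simulation can make the MNAR concern concrete. Everything in this sketch is hypothetical: the log-normal salary distribution, the logistic disclosure rule, and in particular the direction of the dependence (here, higher-paying postings are assumed more likely to disclose). The source only states that disclosure depends on the latent salary, not in which direction.

```python
import math
import random

rng = random.Random(42)

# Hypothetical latent salaries: log-normal centred near $180k (illustrative only).
latent = [math.exp(rng.gauss(math.log(180_000), 0.35)) for _ in range(50_000)]

def p_disclose(salary, midpoint=180_000, slope=3.0):
    """Assumed disclosure probability that rises with the latent salary (MNAR)."""
    z = slope * math.log(salary / midpoint)
    return 1.0 / (1.0 + math.exp(-z))

# Keep a posting only if it "discloses" under the assumed rule.
disclosed = [s for s in latent if rng.random() < p_disclose(s)]

share = len(disclosed) / len(latent)
mean_all = sum(latent) / len(latent)
mean_disclosed = sum(disclosed) / len(disclosed)

print(f"disclosure rate:      {share:.1%}")
print(f"mean, all postings:   ${mean_all:,.0f}")
print(f"mean, disclosed only: ${mean_disclosed:,.0f}")
```

Under this assumed rule about half of postings disclose, yet the disclosed mean sits above the population mean. A model fitted only to disclosed rows inherits that shift, and flipping the sign of `slope` would reverse it, which is why the direction of the bias cannot be read off the disclosed sample alone.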
## 4. Inputs
Tabular features (no text yet — `description_md` enters via bge-m3
embeddings in Phase 5). See
[`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md)
for full definitions and
[`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md)
for encoding choices.
## 5. Stratified evaluation
Per-stratum MAE on the test set (1,226 rows total):
| stratum | n | mae | mae_ci_low | mae_ci_high | mape_pct | r2_log |
|:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
| US/greenhouse | 1117 | $29,306 | $27,393 | $31,211 | 14.47 | 0.7273 |
| CA/greenhouse | 56 | $28,610 | $20,222 | $39,174 | 21.96 | 0.673 |
| US/lever | 32 | $10,276 | $7,051 | $13,835 | 7.92 | 0.7615 |
| US/ashby | 21 | $47,632 | $27,030 | $74,180 | 19.47 | -0.0792 |
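
All strata other than US/greenhouse are small: CA/greenhouse has 56 test
rows (a few hundred rows overall), US/lever 32, and US/ashby 21. Their CIs
are correspondingly wide (CA/greenhouse: $20,222 to $39,174), and the
negative R² for US/ashby means the model does worse there than predicting
the stratum's mean log salary, so per-stratum claims should be read with
that caveat. The choice to **emphasise uncertainty directly** via bootstrap
CIs (rather than a Cohen-style power analysis) is documented in
`LITERATURE_REVIEW.md` §15.3 #15.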
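
The per-stratum breakdown above amounts to grouping test rows by `country/source` and scoring each group separately. A minimal sketch, assuming rows are dicts with hypothetical keys `country`, `source`, `y_true`, and `y_pred` (the actual column names for predictions are not given in this card):

```python
from collections import defaultdict

def stratified_mae(rows):
    """Group rows by 'country/source' and return n and MAE for each stratum."""
    abs_errors = defaultdict(list)
    for row in rows:
        stratum = f"{row['country']}/{row['source']}"
        abs_errors[stratum].append(abs(row["y_true"] - row["y_pred"]))
    return {
        stratum: {"n": len(errs), "mae": sum(errs) / len(errs)}
        for stratum, errs in abs_errors.items()
    }

# Toy rows (hypothetical values, not taken from the dataset).
rows = [
    {"country": "US", "source": "greenhouse", "y_true": 200_000, "y_pred": 190_000},
    {"country": "US", "source": "greenhouse", "y_true": 150_000, "y_pred": 170_000},
    {"country": "CA", "source": "greenhouse", "y_true": 120_000, "y_pred": 125_000},
]

per_stratum = stratified_mae(rows)
for stratum, stats in per_stratum.items():
    print(f"{stratum}: n={stats['n']}, MAE=${stats['mae']:,.0f}")
```

Reporting `n` next to each MAE keeps the sample size visible, which matters most for the strata with only a few dozen rows.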
## 6. Intended use
- **Recruiters and candidates**: rough salary anchoring for a given role,
location, and seniority bucket.
- **The builder**: their own NA senior-DS job search.
- **Researchers**: a transparent baseline for compensation prediction
using only public ATS data.
## 7. Out-of-scope
- Non-NA job markets (the dataset is US/CA only).
- Non-tech sectors (banks, healthcare, retail are largely on Workday
and not yet in the dataset; Phase 4 of the project plan).
- Total compensation (the target is base salary max; equity / bonus
are mentioned only as boolean features).
- Individual-offer negotiation (the model predicts a posting, not an
offer).
## 8. Training data
- Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
- Snapshot: 2026-05-08.
- Total active rows: 12,334. Disclosed-salary subset: 6,143 (4,917 train + 1,226 test).
- Train / test split: deterministic, hash-keyed by `id` and seed=42,
stratified by `(country, source)`. 80/20 split. Test row IDs frozen
at `data/eda/test_split_ids.json`.
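
One way to get a deterministic, hash-keyed, stratified split like the one described above is sketched below. This is not the repository's implementation: the SHA-256 keying scheme, the function names, and the toy IDs are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def hash_key(row_id, seed=42):
    """Stable pseudo-random sort key derived from the seed and the row id."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest, 16)

def stratified_hash_split(rows, test_frac=0.2, seed=42):
    """Return the set of test ids: the lowest-hash fraction of each stratum."""
    ids_by_stratum = defaultdict(list)
    for row in rows:
        ids_by_stratum[(row["country"], row["source"])].append(row["id"])
    test_ids = set()
    for ids in ids_by_stratum.values():
        ranked = sorted(ids, key=lambda i: hash_key(i, seed))
        n_test = round(test_frac * len(ranked))
        test_ids.update(ranked[:n_test])
    return test_ids

# Toy rows with hypothetical ids: 100 US and 10 CA postings.
rows = [{"id": f"us-{i}", "country": "US", "source": "greenhouse"} for i in range(100)]
rows += [{"id": f"ca-{i}", "country": "CA", "source": "greenhouse"} for i in range(10)]

test_ids = stratified_hash_split(rows)
print(f"{len(test_ids)} of {len(rows)} rows assigned to test")
```

Because membership depends only on the id, the seed, and the stratum contents, the result does not change with row order. Adding new rows to a stratum can still shift which ids fall in the lowest-hash fraction, which is one reason to freeze the test IDs in a file as the project does.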
## 9. Limitations
- **Workday gap** — major employers (Snowflake, Coinbase, Shopify,
Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't
yet support. Their salaries are missing from training data.
- **No total-comp signal** — the regressor sees `offers_equity`
(boolean) and `bonus_mentioned` (boolean) but cannot distinguish a
$200k base + $100k equity package from $300k cash.
- **Title-derived seniority/role family is regex-noisy** — ~70% of
rows are labelled `Other` for role family. Phase 4 will replace these
with DeBERTa classifiers.
- **Description text not yet used** — bge-m3 dense embedding lands in
Phase 5. The current model is purely tabular.
## 10. Reproducibility
```bash
git clone https://github.com/Arjun10g/na-tech-jobs
cd na-tech-jobs
uv sync --extra ml --extra eda --group dev
uv run python -m models.salary.train
```
This will rebuild the dataset, re-run the ladder, refit the winning
tier on the same frozen split, and reproduce the metrics above. Random
seeds are fixed throughout (`42` for split, Optuna sampler, RF, XGB).
## Citation
> Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
> https://huggingface.co/arjun10g/na-tech-jobs-salary-v1