---
license: mit
library_name: scikit-learn
tags:
- regression
- salary-prediction
- north-america
- tabular
metrics:
- mae
- mape
- r2
model-index:
- name: na-tech-jobs-salary-v1
  results:
  - task:
      type: regression
      name: Salary Regression
    dataset:
      name: arjun10g/na-tech-jobs (curated/jobs.parquet)
      type: arjun10g/na-tech-jobs
    metrics:
    - type: mae
      name: MAE (USD/year)
      value: 29091
    - type: mape
      name: MAPE (%)
      value: 14.73
    - type: r2
      name: R² (log scale)
      value: 0.7300
---

# na-tech-jobs salary regressor — v1

Predicts the **maximum disclosed salary** of a North American senior tech
job posting, in USD per year, given tabular features from the
[`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs)
weekly snapshot.

## 1. Headline metrics

| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
|---|---|---|---|---|
| MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 |
| MAPE | 14.73% | — | 15.62% | — |
| R² (on log10 target) | 0.7300 | — | 0.7051 | — |
| n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |

The target set in CLAUDE.md §7 was **MAE < $25k USD/year**.
The current model does not yet meet it: test MAE is $29,091, about $4,091 above the target.
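
For readers who want to recompute the three headline metrics on their own predictions, the following is a minimal stdlib-only sketch. It assumes R² is taken after mapping both the true and the predicted salaries to log10, as the table row label suggests. The function names and the toy numbers are illustrative and are not taken from the repository.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error in the target's own units (USD/year here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape_pct(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_log10(y_true, y_pred):
    """R^2 computed after mapping both vectors to log10 space (assumed convention)."""
    log_true = [math.log10(t) for t in y_true]
    log_pred = [math.log10(p) for p in y_pred]
    mean_true = sum(log_true) / len(log_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(log_true, log_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in log_true)
    return 1.0 - ss_res / ss_tot

# Toy salaries (made-up numbers, not from the dataset).
y_true = [100_000, 200_000, 400_000]
y_pred = [110_000, 180_000, 400_000]

print(f"MAE:  ${mae(y_true, y_pred):,.0f}")
print(f"MAPE: {mape_pct(y_true, y_pred):.2f}%")
print(f"R2 (log10): {r2_log10(y_true, y_pred):.4f}")
```

Under this convention a constant prediction scores an R² near zero, which is consistent with the `tier0_constant` row of the ladder.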

The test and CV columns answer different questions: **test-MAE** measures
generalization to the frozen 20% holdout (a single draw); **CV-MAE** is the
average of 5 out-of-fold MAEs on the training set, capturing **split
variance**. When the two agree, we have evidence the test draw was
representative; if test-MAE differs materially from CV-MAE, we'd suspect
either a lucky test draw or overfitting.
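
One way to judge whether the two estimates "agree" is to compare their bootstrap intervals. The sketch below shows a percentile bootstrap over per-row absolute errors. The percentile variant, the resample count, and the synthetic errors are all assumptions made for illustration; the card does not state which bootstrap flavour produced the reported CIs.

```python
import random

def bootstrap_mae_ci(abs_errors, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of per-row absolute errors."""
    rng = random.Random(seed)
    n = len(abs_errors)
    means = []
    for _ in range(n_boot):
        sample = rng.choices(abs_errors, k=n)  # resample rows with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic absolute errors (illustrative only, not model output).
data_rng = random.Random(0)
abs_errors = [abs(data_rng.gauss(0, 30_000)) for _ in range(500)]

point = sum(abs_errors) / len(abs_errors)
lo, hi = bootstrap_mae_ci(abs_errors)
print(f"MAE ${point:,.0f}, 95% CI ${lo:,.0f} to ${hi:,.0f}")
```

Applied to the headline table, the test interval ($27,095 to $31,157) and the CV interval ($29,409 to $31,537) overlap, which is the kind of agreement the paragraph above describes.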

## 2. The ladder

We trained six tiers from a constant baseline up to XGBoost, all on the
**same frozen test set**. The selected model is `tier5_xgboost_optuna`.

| tier                  | mae     | mae_ci          | cv_mae   | cv_mae_ci       |   mape_pct |   r2_log |   n_test |
|:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
| tier0_constant        | $60,509 | $57,405-$63,536 | $62,279  | $60,528-$63,759 |      33.65 |  -0.0001 |     1226 |
| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623  | $59,939-$63,068 |      32.89 |   0.0447 |     1226 |
| tier2_mincer_ols      | $51,322 | $48,366-$54,199 | $52,041  | $50,416-$53,389 |      27.36 |   0.2827 |     1226 |
| tier3_ridge_full      | $43,199 | $40,907-$45,464 | $42,179  | $40,951-$43,236 |      23.25 |   0.462  |     1226 |
| tier4_random_forest   | $35,935 | $33,906-$38,129 | $37,016  | $35,799-$38,046 |      19.02 |   0.6151 |     1226 |
| tier5_xgboost_optuna  | $29,091 | $27,095-$31,157 | $30,533  | $29,409-$31,537 |      14.73 |   0.73   |     1226 |

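The leaderboard answers the question "is the gain from XGBoost worth its
complexity?" Read down the bootstrap CIs to see where they overlap: on the
test set only tier0 and tier1 do, and every later tier's interval sits
entirely below its predecessor's.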
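
The overlap check can be made mechanical. The sketch below copies the test-set MAE intervals from the ladder table and compares each tier with the next one. Non-overlap of two marginal CIs is only a rough heuristic; because all tiers share one test set, a paired bootstrap of the MAE difference would be a sharper comparison, but that is outside what the card reports.

```python
# Test-set MAE CIs (USD/year) copied from the ladder table.
ladder = [
    ("tier0_constant",        57_405, 63_536),
    ("tier1_stratified_mean", 56_469, 62_505),
    ("tier2_mincer_ols",      48_366, 54_199),
    ("tier3_ridge_full",      40_907, 45_464),
    ("tier4_random_forest",   33_906, 38_129),
    ("tier5_xgboost_optuna",  27_095, 31_157),
]

def overlaps(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

verdicts = {}
for (name_a, lo_a, hi_a), (name_b, lo_b, hi_b) in zip(ladder, ladder[1:]):
    verdicts[(name_a, name_b)] = overlaps((lo_a, hi_a), (lo_b, hi_b))
    label = "overlap" if verdicts[(name_a, name_b)] else "separated"
    print(f"{name_a} vs {name_b}: {label}")
```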

## 3. Honest framing — selection bias & MNAR

The model is trained on the **disclosed-salary subset only** (~50% of NA
tech postings). Disclosure is **not random**:

- **Mandated jurisdictions** (CA / NY / WA / CO / CT / MD / IL / HI;
  ON / BC / PEI) have higher disclosure rates by law.
- **Voluntary disclosure** is concentrated at transparency-leaning employers
  (Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
- **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means the
  unobserved component depends on the latent salary itself — i.e. the
  process is MNAR.

**Therefore the model predicts "salary as priced by disclosing employers
in our corpus", not ground truth.** A 2-stage Heckman correction is
flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.
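
To see why MNAR disclosure matters, consider a toy simulation in which the probability of disclosing depends on the latent salary itself. Everything here is synthetic and hypothetical: the salary distribution, the logistic disclosure rule, and in particular the direction of the effect (higher-paying postings disclosing less often) are assumptions chosen only to illustrate the mechanism. They are not estimates from the corpus.

```python
import math
import random

rng = random.Random(42)

# Hypothetical latent log10 salaries for ALL postings (synthetic).
latent = [rng.gauss(5.2, 0.15) for _ in range(20_000)]

def is_disclosed(log_salary):
    """Toy MNAR rule: disclosure probability falls as latent salary rises."""
    p = 1.0 / (1.0 + math.exp(10.0 * (log_salary - 5.2)))
    return rng.random() < p

observed = [s for s in latent if is_disclosed(s)]

rate = len(observed) / len(latent)
mean_all = sum(latent) / len(latent)
mean_obs = sum(observed) / len(observed)

print(f"disclosure rate: {rate:.2f}")
print(f"geometric-mean salary, all postings:   ${10 ** mean_all:,.0f}")
print(f"geometric-mean salary, disclosed only: ${10 ** mean_obs:,.0f}")
```

A model fitted only to `observed` would learn the shifted distribution, which is why the predictions are framed as salaries priced by disclosing employers.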

## 4. Inputs

Tabular features (no text yet — `description_md` enters via bge-m3
embeddings in Phase 5). See
[`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md)
for full definitions and
[`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md)
for encoding choices.

## 5. Stratified evaluation

Per-stratum MAE on the test set (1,226 rows total):

| stratum       |    n | mae     | mae_ci_low   | mae_ci_high   |   mape_pct |   r2_log |
|:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
| US/greenhouse | 1117 | $29,306 | $27,393      | $31,211       |      14.47 |   0.7273 |
| CA/greenhouse |   56 | $28,610 | $20,222      | $39,174       |      21.96 |   0.673  |
| US/lever      |   32 | $10,276 | $7,051       | $13,835       |       7.92 |   0.7615 |
| US/ashby      |   21 | $47,632 | $27,030      | $74,180       |      19.47 |  -0.0792 |

CA strata are small (56 rows in the test set, a few hundred overall), so
claims about CA performance should be read with the wide CI in mind. The
same caveat applies to US/lever (n = 32) and US/ashby (n = 21). The choice
to **emphasise uncertainty directly** via bootstrap CIs (rather than a
Cohen-style power analysis) is documented in `LITERATURE_REVIEW.md` §15.3 #15.
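
As a sanity check on the table, the n-weighted mean of the per-stratum MAEs should recover the headline test MAE. The sketch below groups toy rows by `country/source` and then redoes the pooling with the reported numbers. The helper name and the row layout are hypothetical.

```python
from collections import defaultdict

def mae_by_stratum(rows):
    """Group absolute errors by 'country/source' and return {stratum: (n, mae)}."""
    errors = defaultdict(list)
    for r in rows:
        key = f"{r['country']}/{r['source']}"
        errors[key].append(abs(r["y_true"] - r["y_pred"]))
    return {k: (len(v), sum(v) / len(v)) for k, v in errors.items()}

# Toy rows (made-up values) to show the grouping.
toy = [
    {"country": "US", "source": "greenhouse", "y_true": 200_000, "y_pred": 180_000},
    {"country": "US", "source": "greenhouse", "y_true": 150_000, "y_pred": 160_000},
    {"country": "CA", "source": "greenhouse", "y_true": 120_000, "y_pred": 150_000},
]
toy_result = mae_by_stratum(toy)

# (n, MAE) pairs copied from the stratified table.
reported = {
    "US/greenhouse": (1117, 29_306),
    "CA/greenhouse": (56, 28_610),
    "US/lever": (32, 10_276),
    "US/ashby": (21, 47_632),
}
n_total = sum(n for n, _ in reported.values())
pooled_mae = sum(n * m for n, m in reported.values()) / n_total
print(n_total, round(pooled_mae))  # 1226 29091
```

The pooled figure is dominated by US/greenhouse, which holds 1,117 of the 1,226 test rows.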

## 6. Intended use

- **Recruiters and candidates**: rough salary anchoring for a given role,
  location, and seniority bucket.
- **The builder**: their own NA senior-DS job search.
- **Researchers**: a transparent baseline for compensation prediction
  using only public ATS data.

## 7. Out-of-scope

- Non-NA job markets (the dataset is US/CA only).
- Non-tech sectors (banks, healthcare, retail are largely on Workday
  and not yet in the dataset; Phase 4 of the project plan).
- Total compensation (the target is base salary max; equity / bonus
  are mentioned only as boolean features).
- Individual-offer negotiation (the model predicts a posting, not an
  offer).

## 8. Training data

- Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
- Snapshot: 2026-05-08.
- Total active rows: 12,334. Disclosed-salary subset: 6,143 (4,917 train + 1,226 test).
- Train / test split: deterministic, hash-keyed by `id` and seed=42,
  stratified by `(country, source)`. 80/20 split. Test row IDs frozen
  at `data/eda/test_split_ids.json`.
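
A deterministic, hash-keyed stratified split of the kind described in the list above could look like the sketch below. It is not the repository's implementation: the use of SHA-256, the `seed:id` key format, the rounding rule, and the function names are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def hash_key(row_id, seed=42):
    """Stable pseudo-random sort key derived from the seed and the row id."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16)

def stratified_hash_split(rows, test_frac=0.2, seed=42):
    """Within each (country, source) stratum, send the lowest-hash rows to test."""
    strata = defaultdict(list)
    for r in rows:
        strata[(r["country"], r["source"])].append(r["id"])
    train_ids, test_ids = [], []
    for ids in strata.values():
        ordered = sorted(ids, key=lambda i: hash_key(i, seed))
        n_test = round(len(ordered) * test_frac)
        test_ids.extend(ordered[:n_test])
        train_ids.extend(ordered[n_test:])
    return train_ids, test_ids

# Toy rows: 80 US and 20 CA postings, all from one source (synthetic ids).
rows = [
    {"id": f"job-{i}", "country": "US" if i % 5 else "CA", "source": "greenhouse"}
    for i in range(100)
]
train_ids, test_ids = stratified_hash_split(rows)
print(len(train_ids), len(test_ids))  # 80 20
```

Because the assignment depends only on the id, the seed, and the stratum membership, re-running the split on the same snapshot yields the same test ids, which is what makes freezing them in a JSON file meaningful.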

## 9. Limitations

- **Workday gap** — major employers (Snowflake, Coinbase, Shopify,
  Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't
  yet support. Their salaries are missing from training data.
- **No total-comp signal** — the regressor sees `offers_equity`
  (boolean) and `bonus_mentioned` (boolean) but cannot distinguish a
  $200k base + $100k equity package from $300k cash.
- **Title-derived seniority/role family is regex-noisy** — ~70% of
  rows label as `Other` for role family. Phase 4 will replace these
  with DeBERTa classifiers.
- **Description text not yet used** — bge-m3 dense embedding lands in
  Phase 5. The current model is purely tabular.

## 10. Reproducibility

```bash
git clone https://github.com/Arjun10g/na-tech-jobs
cd na-tech-jobs
uv sync --extra ml --extra eda --group dev
uv run python -m models.salary.train
```

This will rebuild the dataset, re-run the ladder, refit the winning
tier on the same frozen split, and reproduce the metrics above. Random
seeds are fixed throughout (`42` for split, Optuna sampler, RF, XGB).

## Citation

> Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
> https://huggingface.co/arjun10g/na-tech-jobs-salary-v1