---
license: mit
library_name: scikit-learn
tags:
- regression
- salary-prediction
- north-america
- tabular
metrics:
- mae
- mape
- r2
model-index:
- name: na-tech-jobs-salary-v1
  results:
  - task:
      type: regression
      name: Salary Regression
    dataset:
      name: arjun10g/na-tech-jobs (curated/jobs.parquet)
      type: arjun10g/na-tech-jobs
    metrics:
    - type: mae
      name: MAE (USD/year)
      value: 29091
    - type: mape
      name: MAPE (%)
      value: 14.73
    - type: r2
      name: R² (log scale)
      value: 0.7300
---

# na-tech-jobs salary regressor — v1

Predicts the **maximum disclosed salary** of a North American senior tech
job posting, in USD per year, given tabular features from the
[`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs)
weekly snapshot.

## 1. Headline metrics

| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
|---|---|---|---|---|
| MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 |
| MAPE | 14.73% | — | 15.62% | — |
| R² (on log10 target) | 0.7300 | — | 0.7051 | — |
| n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |

The CLAUDE.md §7 target was **MAE < $25k USD/year**.
This target has not yet been met: the current test MAE is $29,091, roughly $4,100 above it.
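
As a rough illustration of how the three headline metrics relate, the sketch below computes MAE and MAPE in dollar space and R² on the log10 scale. It assumes predictions have already been converted back to USD per year; the function name `regression_metrics` and the toy numbers are hypothetical and are not taken from the repository.

```python
import numpy as np


def regression_metrics(y_true_usd, y_pred_usd):
    """MAE and MAPE in dollar space, R^2 in log10 space (illustrative only)."""
    y_true = np.asarray(y_true_usd, dtype=float)
    y_pred = np.asarray(y_pred_usd, dtype=float)

    abs_err = np.abs(y_true - y_pred)
    mae = abs_err.mean()                           # USD / year
    mape_pct = 100.0 * (abs_err / y_true).mean()   # percent of true salary

    # R^2 is computed on log10(salary), matching the "log scale" label.
    log_true = np.log10(y_true)
    log_pred = np.log10(y_pred)
    ss_res = np.sum((log_true - log_pred) ** 2)
    ss_tot = np.sum((log_true - log_true.mean()) ** 2)
    r2_log = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mape_pct": mape_pct, "r2_log": r2_log}


# Toy example with made-up salaries (not project data).
metrics = regression_metrics(
    y_true_usd=[100_000, 200_000, 400_000],
    y_pred_usd=[110_000, 180_000, 400_000],
)
print(metrics)
```

Reporting MAE in dollars keeps the error interpretable, while R² on the log scale is less dominated by the highest-paying postings.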

The two columns answer different questions: **test-MAE** measures generalization to
the frozen 20% holdout (a single draw); **CV-MAE** is the average of 5
out-of-fold MAEs on the training set, which also reflects **split variance**. When the
two agree, we have evidence that the test draw was representative; if test-MAE
materially differs from CV-MAE, we would suspect either a lucky test draw or
overfitting. Here the CV-MAE ($30,533) falls inside the test-set CI
($27,095 – $31,157), so the two are broadly consistent.
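
The bootstrap CIs in the table can be approximated with a percentile bootstrap over per-row absolute errors. The following is a minimal sketch under that assumption; the function name `bootstrap_mae_ci`, the number of resamples, and the synthetic data are illustrative choices, not the project's actual settings.

```python
import numpy as np


def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap CI for MAE (resamples rows with replacement)."""
    rng = np.random.default_rng(seed)
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    n = abs_err.size

    # Each row of `idx` is one bootstrap resample of the n test rows.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_maes = abs_err[idx].mean(axis=1)

    lo, hi = np.quantile(boot_maes, [alpha / 2, 1 - alpha / 2])
    return abs_err.mean(), lo, hi


# Synthetic stand-in for a test set (values are made up).
rng = np.random.default_rng(0)
y_true = rng.uniform(100_000, 300_000, size=500)
y_pred = y_true + rng.normal(0, 30_000, size=500)

mae, ci_low, ci_high = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE ${mae:,.0f} (95% CI ${ci_low:,.0f} - ${ci_high:,.0f})")
```

Resampling rows rather than refitting the model captures uncertainty from the finite test set only; it does not account for variance across training runs.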

## 2. The ladder

We trained six tiers from a constant baseline up to XGBoost, all on the
**same frozen test set**. The selected model is `tier5_xgboost_optuna`.

| tier                  | mae     | mae_ci          | cv_mae   | cv_mae_ci       |   mape_pct |   r2_log |   n_test |
|:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
| tier0_constant        | $60,509 | $57,405-$63,536 | $62,279  | $60,528-$63,759 |      33.65 |  -0.0001 |     1226 |
| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623  | $59,939-$63,068 |      32.89 |   0.0447 |     1226 |
| tier2_mincer_ols      | $51,322 | $48,366-$54,199 | $52,041  | $50,416-$53,389 |      27.36 |   0.2827 |     1226 |
| tier3_ridge_full      | $43,199 | $40,907-$45,464 | $42,179  | $40,951-$43,236 |      23.25 |   0.4620 |     1226 |
| tier4_random_forest   | $35,935 | $33,906-$38,129 | $37,016  | $35,799-$38,046 |      19.02 |   0.6151 |     1226 |
| tier5_xgboost_optuna  | $29,091 | $27,095-$31,157 | $30,533  | $29,409-$31,537 |      14.73 |   0.7300 |     1226 |

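The leaderboard answers the question "is the gain from XGBoost worth its
complexity?": read down the bootstrap CIs to see where they overlap. On the
test set, only `tier0_constant` and `tier1_stratified_mean` have overlapping
intervals; from `tier2_mincer_ols` onward, each tier's CI sits entirely below
the previous tier's.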
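
The overlap check can be done mechanically. The sketch below copies the test-set MAE intervals from the table and flags adjacent tiers whose CIs overlap; the helper `intervals_overlap` is a hypothetical name. Non-overlap of marginal CIs is a conservative heuristic, not a formal test; since all tiers share one test set, a paired bootstrap on MAE differences would be the sharper comparison.

```python
# (tier, ci_low, ci_high) for test-set MAE in USD, copied from the ladder table.
test_mae_ci = [
    ("tier0_constant", 57_405, 63_536),
    ("tier1_stratified_mean", 56_469, 62_505),
    ("tier2_mincer_ols", 48_366, 54_199),
    ("tier3_ridge_full", 40_907, 45_464),
    ("tier4_random_forest", 33_906, 38_129),
    ("tier5_xgboost_optuna", 27_095, 31_157),
]


def intervals_overlap(a_low, a_high, b_low, b_high):
    """True if the closed intervals [a_low, a_high] and [b_low, b_high] intersect."""
    return a_low <= b_high and b_low <= a_high


# Compare each tier with the next one up the ladder.
overlapping = []
for (name_a, lo_a, hi_a), (name_b, lo_b, hi_b) in zip(test_mae_ci, test_mae_ci[1:]):
    if intervals_overlap(lo_a, hi_a, lo_b, hi_b):
        overlapping.append((name_a, name_b))

print(overlapping)  # → [('tier0_constant', 'tier1_stratified_mean')]
```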

## 3. Honest framing — selection bias & MNAR

The model is trained on the **disclosed-salary subset only** (~50% of NA
tech postings). Disclosure is **not random**:

- **Mandated jurisdictions** (CA / NY / WA / CO / CT / MD / IL / HI;
  ON / BC / PEI) have higher disclosure rates by law.
- **Voluntary disclosure** is concentrated at transparency-leaning employers
  (Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
- **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means the
  unobserved component depends on the latent salary itself — i.e. the
  process is MNAR.

**Therefore the model predicts "salary as priced by disclosing employers
in our corpus", not ground truth.** A 2-stage Heckman correction is
flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.
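
To make the MNAR concern concrete, the simulation below contrasts disclosure that is independent of salary with disclosure whose probability depends on the latent salary itself. Everything here is synthetic: the salary distribution, the logistic disclosure rule, and its direction and strength are assumptions chosen for illustration, and none of it is estimated from the dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical latent log10 salaries (mean ~ $178k); not fitted to real data.
log_salary = rng.normal(loc=5.25, scale=0.15, size=n)

# Case 1: disclosure is a coin flip, independent of salary.
disclosed_random = rng.random(n) < 0.5

# Case 2 (MNAR): disclosure probability depends on the latent salary itself.
# The positive sign is arbitrary; the bias appears for either direction.
z = (log_salary - log_salary.mean()) / log_salary.std()
p_disclose = 1.0 / (1.0 + np.exp(-z))
disclosed_mnar = rng.random(n) < p_disclose

# Gap between the disclosed-subset mean and the full-population mean.
gap_random = log_salary[disclosed_random].mean() - log_salary.mean()
gap_mnar = log_salary[disclosed_mnar].mean() - log_salary.mean()

print(f"random disclosure gap (log10): {gap_random:+.4f}")
print(f"MNAR disclosure gap   (log10): {gap_mnar:+.4f}")
```

Under random disclosure the subset mean tracks the population mean; under the MNAR rule it drifts, and no amount of additional disclosed rows removes that drift. This is the gap a selection correction such as Heckman's would aim to address.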

## 4. Inputs

Tabular features (no text yet — `description_md` enters via bge-m3
embeddings in Phase 5). See
[`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md)
for full definitions and
[`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md)
for encoding choices.

## 5. Stratified evaluation

Per-stratum MAE on the test set (1,226 rows total):

| stratum       |    n | mae     | mae_ci_low   | mae_ci_high   |   mape_pct |   r2_log |
|:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
| US/greenhouse | 1117 | $29,306 | $27,393      | $31,211       |      14.47 |   0.7273 |
| CA/greenhouse |   56 | $28,610 | $20,222      | $39,174       |      21.96 |   0.6730 |
| US/lever      |   32 | $10,276 | $7,051       | $13,835       |       7.92 |   0.7615 |
| US/ashby      |   21 | $47,632 | $27,030      | $74,180       |      19.47 |  -0.0792 |

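CA strata are small (~few hundred rows; only 56 CA rows appear in the test
set), as are US/lever (32 test rows) and US/ashby (21). Point estimates for
these strata, including the low US/lever MAE and the negative US/ashby R²,
should be read alongside their wide CIs. The choice to **emphasise
uncertainty directly** via bootstrap CIs (rather than a Cohen-style power
analysis) is documented in `LITERATURE_REVIEW.md` §15.3 #15.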
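
A per-stratum breakdown like the table above amounts to grouping absolute errors by `country/source`. The sketch below shows one way to do that with the standard library only; the row schema (`country`, `source`, `y_true`, `y_pred`) and the function name `mae_by_stratum` are assumptions for illustration, and the rows are made up.

```python
from collections import defaultdict


def mae_by_stratum(rows):
    """Group absolute errors by 'country/source' and return n and MAE per group."""
    abs_errors = defaultdict(list)
    for row in rows:
        key = f"{row['country']}/{row['source']}"
        abs_errors[key].append(abs(row["y_true"] - row["y_pred"]))
    return {
        key: {"n": len(errs), "mae": sum(errs) / len(errs)}
        for key, errs in abs_errors.items()
    }


# Toy rows (not project data).
rows = [
    {"country": "US", "source": "greenhouse", "y_true": 200_000, "y_pred": 190_000},
    {"country": "US", "source": "greenhouse", "y_true": 150_000, "y_pred": 170_000},
    {"country": "CA", "source": "greenhouse", "y_true": 120_000, "y_pred": 125_000},
]

per_stratum = mae_by_stratum(rows)
for key, stats in per_stratum.items():
    print(f"{key}: n={stats['n']}, MAE=${stats['mae']:,.0f}")
```

Reporting `n` next to each MAE makes it harder to over-read a stratum with only a few dozen rows.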

## 6. Intended use

- **Recruiters and candidates**: rough salary anchoring for a given role,
  location, and seniority bucket.
- **The builder**: their own NA senior-DS job search.
- **Researchers**: a transparent baseline for compensation prediction
  using only public ATS data.

## 7. Out-of-scope

- Non-NA job markets (the dataset is US/CA only).
- Non-tech sectors (banks, healthcare, retail are largely on Workday
  and not yet in the dataset; Phase 4 of the project plan).
- Total compensation (the target is base salary max; equity / bonus
  are mentioned only as boolean features).
- Individual-offer negotiation (the model predicts a posting, not an
  offer).

## 8. Training data

- Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
- Snapshot: 2026-05-08.
- Total active rows: 12,334. Disclosed-salary subset: 6,143 (4,917 train + 1,226 test).
- Train / test split: deterministic, hash-keyed by `id` and seed=42,
  stratified by `(country, source)`. 80/20 split. Test row IDs frozen
  at `data/eda/test_split_ids.json`.
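
One way to get a deterministic, hash-keyed, stratified split is to hash each `id` together with the seed, rank rows within each `(country, source)` stratum by that hash, and send the lowest-ranked 20% to test. The sketch below is an assumption about how such a split could work, not the repository's implementation; `hash_unit`, `stratified_hash_split`, the SHA-256 choice, and the toy IDs are all hypothetical.

```python
import hashlib
from collections import defaultdict


def hash_unit(row_id, seed=42):
    """Map (seed, id) to a reproducible float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16) / 16**16


def stratified_hash_split(rows, test_frac=0.20, seed=42):
    """Return the set of test IDs, taking ~test_frac of each (country, source) stratum."""
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[(row["country"], row["source"])].append(row["id"])

    test_ids = set()
    for ids in by_stratum.values():
        # Ranking by hash makes the result independent of input row order.
        ranked = sorted(ids, key=lambda i: hash_unit(i, seed))
        n_test = round(test_frac * len(ranked))
        test_ids.update(ranked[:n_test])
    return test_ids


# Toy rows with made-up IDs: 100 US/greenhouse and 10 CA/greenhouse.
rows = [{"id": f"us-{i}", "country": "US", "source": "greenhouse"} for i in range(100)]
rows += [{"id": f"ca-{i}", "country": "CA", "source": "greenhouse"} for i in range(10)]

test_ids = stratified_hash_split(rows)
print(len(test_ids))  # 22 = 20 US + 2 CA
```

Because membership depends only on the ID, the seed, and the stratum's contents, reordering the input leaves the split unchanged. Freezing the resulting IDs to a file, as the project does, additionally protects the test set when new rows arrive in later snapshots.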

## 9. Limitations

- **Workday gap** — major employers (Snowflake, Coinbase, Shopify,
  Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't
  yet support. Their salaries are missing from training data.
- **No total-comp signal** — the regressor sees `offers_equity`
  (boolean) and `bonus_mentioned` (boolean) but cannot distinguish a
  $200k base + $100k equity package from $300k cash.
- **Title-derived seniority/role family is regex-noisy** — ~70% of
  rows label as `Other` for role family. Phase 4 will replace these
  with DeBERTa classifiers.
- **Description text not yet used** — bge-m3 dense embedding lands in
  Phase 5. The current model is purely tabular.

## 10. Reproducibility

```bash
git clone https://github.com/Arjun10g/na-tech-jobs
cd na-tech-jobs
uv sync --extra ml --extra eda --group dev
uv run python -m models.salary.train
```

This will rebuild the dataset, re-run the ladder, refit the winning
tier on the same frozen split, and reproduce the metrics above. Random
seeds are fixed throughout (`42` for split, Optuna sampler, RF, XGB).

## Citation

> Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
> https://huggingface.co/arjun10g/na-tech-jobs-salary-v1