license: mit
library_name: scikit-learn
tags:
  - regression
  - salary-prediction
  - north-america
  - tabular
metrics:
  - mae
  - mape
  - r2
model-index:
  - name: na-tech-jobs-salary-v1
    results:
      - task:
          type: regression
          name: Salary Regression
        dataset:
          name: arjun10g/na-tech-jobs (curated/jobs.parquet)
          type: arjun10g/na-tech-jobs
        metrics:
          - type: mae
            name: MAE (USD/year)
            value: 29091
          - type: mape
            name: MAPE (%)
            value: 14.73
          - type: r2
            name: R² (log scale)
            value: 0.73

na-tech-jobs salary regressor — v1

Predicts the maximum disclosed salary of a North American senior tech job posting, in USD per year, given tabular features from the arjun10g/na-tech-jobs weekly snapshot.

1. Headline metrics

| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
|---|---|---|---|---|
| MAE (USD / year) | $29,091 | $27,095 – $31,157 | $30,533 | $29,409 – $31,537 |
| MAPE | 14.73% | | 15.62% | |
| R² (on log10 target) | 0.7300 | | 0.7051 | |
| n | 1,226 (frozen test, stratified by country × source) | | 4,917 (5-fold OOF) | |
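
For reference, the three metrics in the table can be computed as in the sketch below. It assumes that R² is the usual coefficient of determination evaluated on log10-transformed salaries, which the row label implies but the card does not spell out. The function names and the toy arrays are illustrative and are not taken from the repository.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error in the target's own units (USD/year here)."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape_pct(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / y_true))

def r2_log10(y_true, y_pred):
    """R² computed after a log10 transform (assumed definition)."""
    log_true, log_pred = np.log10(y_true), np.log10(y_pred)
    ss_res = np.sum((log_true - log_pred) ** 2)
    ss_tot = np.sum((log_true - log_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy salaries, not real data.
y_true = np.array([100_000.0, 200_000.0, 400_000.0])
y_pred = np.array([110_000.0, 180_000.0, 400_000.0])

print(mae(y_true, y_pred))
print(mape_pct(y_true, y_pred))
print(r2_log10(y_true, y_pred))
```

Working on the log scale makes R² less sensitive to a handful of very high salaries, which is one plausible reason to report it that way alongside a dollar-denominated MAE.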

The target set in CLAUDE.md §7 is MAE < $25k USD/year. v1 does not yet meet it: the test MAE of $29,091 is about $4,100 above the target.

The two columns answer different questions. Test MAE measures generalization to the frozen 20% holdout, which is a single draw; CV MAE is the average of the 5 out-of-fold MAEs on the training set, so it also reflects split-to-split variance. When the two agree, we have evidence that the test draw was representative; a material gap would point to either a lucky test draw or overfitting. Here the test MAE ($29,091) is about $1,400 below the CV MAE ($30,533), and the two 95% intervals overlap.
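
The intervals reported above are bootstrap CIs. The card does not say which bootstrap variant was used, so the sketch below assumes a plain percentile bootstrap over rows; `bootstrap_mae_ci`, `n_boot`, and the synthetic data are hypothetical and only illustrate the idea.

```python
import numpy as np

def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for MAE (assumed variant, not confirmed by the card)."""
    rng = np.random.default_rng(seed)
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    n = abs_err.size
    # Each bootstrap replicate resamples n rows with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_maes = abs_err[idx].mean(axis=1)
    lo, hi = np.quantile(boot_maes, [alpha / 2, 1 - alpha / 2])
    return float(abs_err.mean()), float(lo), float(hi)

# Synthetic salaries and predictions, sized like the frozen test set.
rng = np.random.default_rng(0)
y_true = rng.uniform(80_000, 300_000, size=1_226)
y_pred = y_true + rng.normal(0, 35_000, size=1_226)

point, lo, hi = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE = {point:,.0f}  95% CI = [{lo:,.0f}, {hi:,.0f}]")
```

Resampling the absolute errors rather than refitting the model captures only test-set sampling noise, which matches the "single draw" reading of the test column.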

2. The ladder

We trained six tiers, from a constant baseline up to XGBoost, and evaluated all of them on the same frozen test set. The selected model is tier5_xgboost_optuna.

| tier | mae | mae_ci | cv_mae | cv_mae_ci | mape_pct | r2_log | n_test |
|---|---|---|---|---|---|---|---|
| tier0_constant | $60,509 | $57,405-$63,536 | $62,279 | $60,528-$63,759 | 33.65 | -0.0001 | 1226 |
| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623 | $59,939-$63,068 | 32.89 | 0.0447 | 1226 |
| tier2_mincer_ols | $51,322 | $48,366-$54,199 | $52,041 | $50,416-$53,389 | 27.36 | 0.2827 | 1226 |
| tier3_ridge_full | $43,199 | $40,907-$45,464 | $42,179 | $40,951-$43,236 | 23.25 | 0.462 | 1226 |
| tier4_random_forest | $35,935 | $33,906-$38,129 | $37,016 | $35,799-$38,046 | 19.02 | 0.6151 | 1226 |
| tier5_xgboost_optuna | $29,091 | $27,095-$31,157 | $30,533 | $29,409-$31,537 | 14.73 | 0.73 | 1226 |

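The leaderboard answers the question "is the gain from XGBoost worth its complexity?". Read down the bootstrap CIs to see where they overlap: only tier0 and tier1 do, and every later tier's test-MAE interval lies entirely below that of the tier before it.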
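
The overlap check can be done mechanically from the mae_ci column. The snippet below is a small illustration using the interval bounds copied from the ladder; the helper name `intervals_overlap` is ours.

```python
# (tier, CI lower bound, CI upper bound) for test MAE, copied from the ladder.
ladder = [
    ("tier0_constant", 57_405, 63_536),
    ("tier1_stratified_mean", 56_469, 62_505),
    ("tier2_mincer_ols", 48_366, 54_199),
    ("tier3_ridge_full", 40_907, 45_464),
    ("tier4_random_forest", 33_906, 38_129),
    ("tier5_xgboost_optuna", 27_095, 31_157),
]

def intervals_overlap(a_lo, a_hi, b_lo, b_hi):
    """True when two closed intervals share at least one point."""
    return a_lo <= b_hi and b_lo <= a_hi

# Compare each tier with the one directly above it in the ladder.
overlapping = [
    (prev[0], cur[0])
    for prev, cur in zip(ladder, ladder[1:])
    if intervals_overlap(prev[1], prev[2], cur[1], cur[2])
]
print(overlapping)  # [('tier0_constant', 'tier1_stratified_mean')]
```

Non-overlapping intervals are a conservative signal of a real difference. A paired bootstrap on per-row error differences would be the sharper comparison, since all tiers are scored on the same test rows.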

3. Honest framing — selection bias & MNAR

The model is trained on the disclosed-salary subset only (~50% of NA tech postings). Disclosure is not random:

  • Mandated jurisdictions (CA / NY / WA / CO / CT / MD / IL / HI; ON / BC / PEI) have higher disclosure rates by law.
  • Voluntary disclosure is concentrated at transparency-leaning employers (Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
  • Strategic non-disclosure (Cullen & Pakzad-Hurson 2023) means that whether a salary is observed depends on the latent salary itself, i.e. the data are missing not at random (MNAR).

Therefore the model predicts "salary as priced by disclosing employers in our corpus", not ground truth. A 2-stage Heckman correction is flagged in LITERATURE_REVIEW.md §1.2 as a v2 deliverable.
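
For readers unfamiliar with the correction, the textbook two-step Heckman setup is sketched below. This is the standard formulation and not the project's v2 design; the symbols (z, x, γ, β, ρ, σ) are our notation and do not appear in LITERATURE_REVIEW.md as far as this card shows.

```latex
% Selection equation: posting i discloses a salary (d_i = 1) or not (d_i = 0)
d_i = \mathbf{1}\{\, z_i^\top \gamma + u_i > 0 \,\}

% Outcome equation: (log) salary, observed only when d_i = 1
y_i = x_i^\top \beta + \varepsilon_i

% Assumption: (u_i, \varepsilon_i) jointly normal, Var(u_i) = 1,
% Corr(u_i, \varepsilon_i) = \rho, sd(\varepsilon_i) = \sigma_\varepsilon
\mathbb{E}[\, y_i \mid x_i, z_i, d_i = 1 \,]
  = x_i^\top \beta + \rho\,\sigma_\varepsilon\,\lambda(z_i^\top \gamma),
\qquad
\lambda(c) = \frac{\phi(c)}{\Phi(c)}
```

Step 1 fits a probit of disclosure on z over all postings to estimate γ. Step 2 regresses y on x plus the estimated inverse Mills ratio λ over the disclosed subset. Credible identification usually needs at least one variable in z that is excluded from x; which variable could play that role here is not specified in this card.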

4. Inputs

Tabular features (no text yet — description_md enters via bge-m3 embeddings in Phase 5). See DATA_DICTIONARY.md for full definitions and LITERATURE_REVIEW.md §14 for encoding choices.

5. Stratified evaluation

Per-stratum MAE on the test set (1,226 rows total):

| stratum | n | mae | mae_ci_low | mae_ci_high | mape_pct | r2_log |
|---|---|---|---|---|---|---|
| US/greenhouse | 1117 | $29,306 | $27,393 | $31,211 | 14.47 | 0.7273 |
| CA/greenhouse | 56 | $28,610 | $20,222 | $39,174 | 21.96 | 0.673 |
| US/lever | 32 | $10,276 | $7,051 | $13,835 | 7.92 | 0.7615 |
| US/ashby | 21 | $47,632 | $27,030 | $74,180 | 19.47 | -0.0792 |

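CA strata are small (a few hundred rows overall, of which only 56 are in the test set), so point estimates of CA performance should be read alongside their wide CIs. The same applies to US/lever (n = 32) and US/ashby (n = 21); the negative R² for US/ashby rests on 21 rows. The choice to emphasise uncertainty directly via bootstrap CIs (rather than a Cohen-style power analysis) is documented in LITERATURE_REVIEW.md §15.3 #15.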
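
As a consistency check, the per-stratum MAEs should recombine into the headline test MAE when weighted by stratum size. The snippet below does that arithmetic with the n and mae values from the stratified table; nothing in it comes from the repository.

```python
# stratum -> (n test rows, MAE in USD/year), copied from the stratified table.
strata = {
    "US/greenhouse": (1117, 29_306),
    "CA/greenhouse": (56, 28_610),
    "US/lever": (32, 10_276),
    "US/ashby": (21, 47_632),
}

n_total = sum(n for n, _ in strata.values())
# MAE is a mean of per-row errors, so strata combine as a size-weighted average.
pooled_mae = sum(n * m for n, m in strata.values()) / n_total

print(n_total, round(pooled_mae))  # 1226 29091
```

US/greenhouse contributes 1,117 of the 1,226 test rows (about 91%), so the headline MAE is essentially the US/greenhouse MAE.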

6. Intended use

  • Recruiters and candidates: rough salary anchoring for a given role + location + seniority bucket.
  • The builder: their own NA senior-DS job search.
  • Researchers: a transparent baseline for compensation prediction using only public ATS data.

7. Out-of-scope

  • Non-NA job markets (the dataset is US/CA only).
  • Non-tech sectors (banks, healthcare, retail are largely on Workday and not yet in the dataset; Phase 4 of the project plan).
  • Total compensation (the target is base salary max; equity and bonus enter only as boolean features).
  • Individual-offer negotiation (the model predicts a posting, not an offer).

8. Training data

  • Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
  • Snapshot: 2026-05-08.
  • Total active rows: 12,334. Disclosed-salary subset: ~6,143.
  • Train / test split: deterministic, hash-keyed by id and seed=42, stratified by (country, source). 80/20 split. Test row IDs frozen at data/eda/test_split_ids.json.
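
One way to get a deterministic, hash-keyed, stratified split is sketched below. It is a guess at the general technique and not the repository's code: the SHA-256 key, the `"{seed}:{id}"` format, and the function names are all assumptions.

```python
import hashlib
from collections import defaultdict

def hash_key(row_id: str, seed: int = 42) -> int:
    """Stable integer derived from the seed and the row id (assumed key format)."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest, 16)

def stratified_hash_split(rows, test_frac=0.2, seed=42):
    """rows: iterable of (id, country, source). Returns (train_ids, test_ids)."""
    by_stratum = defaultdict(list)
    for row_id, country, source in rows:
        by_stratum[(country, source)].append(row_id)

    train_ids, test_ids = [], []
    for stratum in sorted(by_stratum):
        # Order ids by hash, then send the first test_frac of each stratum to test.
        ids = sorted(by_stratum[stratum], key=lambda i: hash_key(i, seed))
        n_test = round(test_frac * len(ids))
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return train_ids, test_ids

# Hypothetical rows: 900 US and 100 CA postings, all from one source.
rows = [(f"job-{i}", "US" if i % 10 else "CA", "greenhouse") for i in range(1000)]
train_ids, test_ids = stratified_hash_split(rows)
print(len(train_ids), len(test_ids))
```

Because the cut-off is taken within each stratum, adding rows in a later snapshot can move existing rows across the boundary. Freezing the test IDs to a file, as this project does, avoids that drift.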

9. Limitations

  • Workday gap — major employers (Snowflake, Coinbase, Shopify, Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't yet support. Their salaries are missing from training data.
  • No total-comp signal — the regressor sees offers_equity (boolean) and bonus_mentioned (boolean) but cannot distinguish a $200k base + $100k equity package from $300k cash.
  • Title-derived seniority and role family come from noisy regexes: ~70% of rows are labelled Other for role family. Phase 4 will replace the regexes with DeBERTa classifiers.
  • Description text not yet used — bge-m3 dense embedding lands in Phase 5. The current model is purely tabular.

10. Reproducibility

git clone https://github.com/Arjun10g/na-tech-jobs
cd na-tech-jobs
uv sync --extra ml --extra eda --group dev
uv run python -m models.salary.train

This will rebuild the dataset, re-run the ladder, refit the winning tier on the same frozen split, and reproduce the metrics above. Random seeds are fixed throughout (42 for split, Optuna sampler, RF, XGB).

Citation

Ghumman, A. (2026). na-tech-jobs salary regressor v1. https://huggingface.co/arjun10g/na-tech-jobs-salary-v1