v1 :: winning tier = tier5_xgboost_optuna, MAE ≈ 29091 USD/yr

Files changed (9)

README.md +176 -0
artifacts/tier2_mincer_coefficients.csv +8 -0
artifacts/tier3_ridge_top_coefs.csv +21 -0
artifacts/tier4_rf_importance.csv +21 -0
artifacts/tier5_xgb_best_params.json +10 -0
artifacts/tier5_xgb_importance.csv +21 -0
ladder_report.json +92 -0
leaderboard.csv +7 -0
salary_predictor.joblib +3 -0

README.md ADDED

	@@ -0,0 +1,176 @@

+---
+license: mit
+library_name: scikit-learn
+tags:
+- regression
+- salary-prediction
+- north-america
+- tabular
+metrics:
+- mae
+- mape
+- r2
+model-index:
+- name: na-tech-jobs-salary-v1
+  results:
+  - task:
+      type: regression
+      name: Salary Regression
+    dataset:
+      name: arjun10g/na-tech-jobs (curated/jobs.parquet)
+      type: arjun10g/na-tech-jobs
+    metrics:
+    - type: mae
+      name: MAE (USD/year)
+      value: 29091
+    - type: mape
+      name: MAPE (%)
+      value: 14.73
+    - type: r2
+      name: R² (log scale)
+      value: 0.7300
+---
+# na-tech-jobs salary regressor — v1
+Predicts the **maximum disclosed salary** of a North American senior tech
+job posting, in USD per year, given tabular features from the
+[`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs)
+weekly snapshot.
+## 1. Headline metrics
+| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
+|---|---|---|---|---|
+| MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 |
+| MAPE | 14.73% | — | 15.62% | — |
+| R² (on log10 target) | 0.7300 | — | 0.7051 | — |
+| n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |
+The CLAUDE.md §7 target was **MAE < $25k USD/year**; v1 misses it by
+about $4.1k (current: $29,091).
+The test and CV columns answer different questions: **test-MAE** measures
+generalization to the frozen 20% holdout (a single draw); **CV-MAE** is the
+average of 5 out-of-fold MAEs on the training set, capturing **split
+variance**. When the two agree we have evidence the test draw was
+representative; a material gap would point to either a lucky test draw or
+overfitting. Here the test MAE ($29,091) sits just below the CV interval
+($29,409 – $31,537), and the two 95% CIs overlap.
+## 2. The ladder
+We trained six tiers from a constant baseline up to XGBoost, all on the
+**same frozen test set**. The selected model is `tier5_xgboost_optuna`.
+| tier                  | mae     | mae_ci          | cv_mae   | cv_mae_ci       |   mape_pct |   r2_log |   n_test |
+|:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
+| tier0_constant        | $60,509 | $57,405-$63,536 | $62,279  | $60,528-$63,759 |      33.65 |  -0.0001 |     1226 |
+| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623  | $59,939-$63,068 |      32.89 |   0.0447 |     1226 |
+| tier2_mincer_ols      | $51,322 | $48,366-$54,199 | $52,041  | $50,416-$53,389 |      27.36 |   0.2827 |     1226 |
+| tier3_ridge_full      | $43,199 | $40,907-$45,464 | $42,179  | $40,951-$43,236 |      23.25 |   0.462  |     1226 |
+| tier4_random_forest   | $35,935 | $33,906-$38,129 | $37,016  | $35,799-$38,046 |      19.02 |   0.6151 |     1226 |
+| tier5_xgboost_optuna  | $29,091 | $27,095-$31,157 | $30,533  | $29,409-$31,537 |      14.73 |   0.73   |     1226 |
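+The leaderboard answers the question "is the gain from XGBoost worth its
+complexity?" Reading down the test-set bootstrap CIs, tier 0 and tier 1
+overlap, while from tier 2 onward each tier's interval lies entirely below
+the previous tier's.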
+## 3. Honest framing — selection bias & MNAR
+The model is trained on the **disclosed-salary subset only** (~50% of the
+postings in our corpus: 6,143 of 12,334 active rows). Disclosure is **not random**:
+- **Mandated jurisdictions** (CA / NY / WA / CO / CT / MD / IL / HI;
+  ON / BC / PEI) have higher disclosure rates by law.
+- **Voluntary disclosure** is concentrated at transparency-leaning employers
+  (Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
+- **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means the
+  unobserved component depends on the latent salary itself — i.e. the
+  process is MNAR.
+**Therefore the model predicts "salary as priced by disclosing employers
+in our corpus", not ground truth.** A 2-stage Heckman correction is
+flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.
+## 4. Inputs
+Tabular features (no text yet — `description_md` enters via bge-m3
+embeddings in Phase 5). See
+[`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md)
+for full definitions and
+[`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md)
+for encoding choices.
+## 5. Stratified evaluation
+Per-stratum MAE on the test set (1,226 rows total):
+| stratum       |    n | mae     | mae_ci_low   | mae_ci_high   |   mape_pct |   r2_log |
+|:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
+| US/greenhouse | 1117 | $29,306 | $27,393      | $31,211       |      14.47 |   0.7273 |
+| CA/greenhouse |   56 | $28,610 | $20,222      | $39,174       |      21.96 |   0.673  |
+| US/lever      |   32 | $10,276 | $7,051       | $13,835       |       7.92 |   0.7615 |
+| US/ashby      |   21 | $47,632 | $27,030      | $74,180       |      19.47 |  -0.0792 |
+CA strata are small (~few hundred rows, of which 56 are in the test set),
+and US/lever (n = 32) and US/ashby (n = 21, negative R²) are smaller still;
+point estimates for these strata should be read with their wide CIs in mind.
+The choice to **emphasise uncertainty directly** via bootstrap CIs (rather
+than a Cohen-style power analysis) is documented in `LITERATURE_REVIEW.md`
+§15.3 #15.
+## 6. Intended use
+- **Recruiters and candidates**: rough salary anchoring for a given role
+  + location + seniority bucket.
+- **The builder**: their own NA senior-DS job search.
+- **Researchers**: a transparent baseline for compensation prediction
+  using only public ATS data.
+## 7. Out-of-scope
+- Non-NA job markets (the dataset is US/CA only).
+- Non-tech sectors (banks, healthcare, retail are largely on Workday
+  and not yet in the dataset; Phase 4 of the project plan).
+- Total compensation (the target is base salary max; equity / bonus
+  are mentioned only as boolean features).
+- Individual-offer negotiation (the model predicts a posting, not an
+  offer).
+## 8. Training data
+- Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
+- Snapshot: 2026-05-08.
+- Total active rows: 12,334. Disclosed-salary subset: 6,143 (4,917 train + 1,226 test).
+- Train / test split: deterministic, hash-keyed by `id` and seed=42,
+  stratified by `(country, source)`. 80/20 split. Test row IDs frozen
+  at `data/eda/test_split_ids.json`.
+## 9. Limitations
+- **Workday gap** — major employers (Snowflake, Coinbase, Shopify,
+  Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't
+  yet support. Their salaries are missing from training data.
+- **No total-comp signal** — the regressor sees `offers_equity`
+  (boolean) and `bonus_mentioned` (boolean) but cannot distinguish a
+  $200k base + $100k equity package from $300k cash.
+- **Title-derived seniority/role family is regex-noisy** — ~70% of
+  rows are labelled `Other` for role family. Phase 4 will replace these
+  with DeBERTa classifiers.
+- **Description text not yet used** — bge-m3 dense embedding lands in
+  Phase 5. The current model is purely tabular.
+## 10. Reproducibility
+```bash
+git clone https://github.com/Arjun10g/na-tech-jobs
+cd na-tech-jobs
+uv sync --extra ml --extra eda --group dev
+uv run python -m models.salary.train
+```
+This will rebuild the dataset, re-run the ladder, refit the winning
+tier on the same frozen split, and reproduce the metrics above. Random
+seeds are fixed throughout (`42` for split, Optuna sampler, RF, XGB).
+## Citation
+> Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
+> https://huggingface.co/arjun10g/na-tech-jobs-salary-v1
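
Note on README §8 (deterministic split). The README says the train/test split is hash-keyed by `id` with seed 42, but the splitting code is not part of this diff. The block below is a minimal sketch of one common way to get such a split, assuming a SHA-256 hash of a hypothetical `"{seed}:{id}"` key bucketed into 100 bins. The function name `is_test_row`, the key format, and the example ids are all illustrative, and explicit stratification by `(country, source)` is not implemented here.

```python
import hashlib


def is_test_row(row_id: str, seed: int = 42, test_pct: int = 20) -> bool:
    """Assign a row to the test set from a stable hash of its id.

    Hypothetical scheme: the key format and bucket count are assumptions,
    not the project's documented implementation.
    """
    key = f"{seed}:{row_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, bucket into 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < test_pct


# Made-up ids, for illustration only.
ids = [f"job-{i}" for i in range(10_000)]
test_ids = [i for i in ids if is_test_row(i)]
test_share = len(test_ids) / len(ids)
```

Because the assignment depends only on the id and the seed, a row keeps its side of the split across weekly snapshots, which is the property a frozen test set needs. Hash buckets are independent of stratum, so each `(country, source)` group lands near 20% only in expectation.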

artifacts/tier2_mincer_coefficients.csv ADDED

	@@ -0,0 +1,8 @@

+,coef,ci_low,ci_high,p_value
+intercept,3.31502,3.30485,3.32518,0.0
+yoe,0.06436,0.06093,0.06779,0.0
+yoe_sq,-0.00258,-0.00278,-0.00238,0.0
+yoe_isna,0.0159,0.00399,0.02781,0.00888
+min_education,0.00055,-0.00198,0.00308,0.66951
+country_CA,1.59092,1.57775,1.60408,0.0
+country_US,1.7241,1.71602,1.73218,0.0
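
Reading these coefficients. The sketch below shows how the tier-2 coefficients could be turned into a point prediction. It assumes the target is log10 of USD/year (the README reports R² on a log10 target) and that `yoe` enters the regression unscaled; neither is confirmed by this file, and `mincer_predict` is a hypothetical helper, not a function from the repository.

```python
# Coefficients copied from artifacts/tier2_mincer_coefficients.csv.
COEF = {
    "intercept": 3.31502,
    "yoe": 0.06436,
    "yoe_sq": -0.00258,
    "yoe_isna": 0.0159,
    "min_education": 0.00055,
    "country_CA": 1.59092,
    "country_US": 1.7241,
}


def mincer_predict(yoe: float, country: str, min_education: float = 0.0) -> float:
    """Predicted max salary in USD/year.

    Assumes a log10 target and unscaled inputs (both are assumptions).
    """
    log10_salary = (
        COEF["intercept"]
        + COEF["yoe"] * yoe
        + COEF["yoe_sq"] * yoe ** 2
        + COEF["min_education"] * min_education
        + COEF[f"country_{country}"]
    )
    return 10 ** log10_salary


# Experience level at which the fitted quadratic peaks: -b1 / (2 * b2).
peak_yoe = -COEF["yoe"] / (2 * COEF["yoe_sq"])

us_5yr = mincer_predict(5, "US")
ca_5yr = mincer_predict(5, "CA")
```

Under these assumptions the fitted experience profile peaks at roughly 12.5 years, and the gap between `country_US` and `country_CA` translates into a multiplicative difference on the salary scale rather than a fixed dollar amount.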

artifacts/tier3_ridge_top_coefs.csv ADDED

	@@ -0,0 +1,21 @@

+,abs_coef,coef
+city,0.445022240951709,0.445022240951709
+role_family_extracted_RS,0.13563110815975582,0.13563110815975582
+tech__MongoDB,0.1180341601685041,-0.1180341601685041
+role_family_extracted_Manager,0.10783989492290307,-0.10783989492290307
+region,0.09960651641622026,0.09960651641622026
+tech__Looker,0.06835206094001053,-0.06835206094001053
+tech__Datadog,0.06779723895227639,-0.06779723895227639
+country_CA,0.06702324429803322,-0.06702324429803322
+country_US,0.06702324429803203,0.06702324429803203
+tech__Databricks,0.057431745440142803,-0.057431745440142803
+tech__Spark,0.04974707191249773,-0.04974707191249773
+contract_type_temporary,0.0484047919566009,0.0484047919566009
+role_family_extracted_DA,0.046431879796239514,-0.046431879796239514
+tech__Computer Vision,0.04256973759990993,-0.04256973759990993
+tech__SQL,0.04210123163036985,-0.04210123163036985
+bonus_type_performance,0.042026738140661185,0.042026738140661185
+tech__Python,0.041980070912409614,0.041980070912409614
+tech__TypeScript,0.040630982719931104,0.040630982719931104
+contract_type_full_time,0.03957939970767172,-0.03957939970767172
+source_lever,0.038847896627198036,-0.038847896627198036

artifacts/tier4_rf_importance.csv ADDED

	@@ -0,0 +1,21 @@

+,importance
+city,0.1782588190637426
+min_years_experience,0.12615172134755526
+seniority_extracted,0.10297994983591262
+region,0.046490130613550514
+manager_role,0.03200925212483024
+tech__count,0.030483690466229462
+posted__days_since,0.027520343655934115
+equity_form_other,0.025714269659314552
+tech__has_modern_ml,0.022602676892918033
+citizenship__US,0.02146552217281381
+citizenship__has_any,0.01974817136931594
+tech__LLMs,0.01931475511338892
+min_education,0.017627260773023328
+tech__Python,0.01685591833565055
+country_CA,0.01672705199145887
+role_family_extracted_Other,0.016102858931326843
+country_US,0.015580582798209765
+role_family_extracted_RS,0.014510162866055623
+remote_policy_None,0.012757884769025804
+offers_equity__isna,0.0124716967181598

artifacts/tier5_xgb_best_params.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "n_estimators": 900,
+  "max_depth": 9,
+  "learning_rate": 0.023181594273804906,
+  "subsample": 0.8685732269018456,
+  "colsample_bytree": 0.63526932252589,
+  "min_child_weight": 2,
+  "reg_alpha": 0.010143223174105025,
+  "reg_lambda": 0.5004444504302404
+}

artifacts/tier5_xgb_importance.csv ADDED

	@@ -0,0 +1,21 @@

+,importance
+role_family_extracted_RS,0.10528762
+city,0.039080948
+tech__has_modern_ml,0.030671984
+country_CA,0.03056622
+country_US,0.02591851
+tech__MongoDB,0.02283917
+tech__LLMs,0.02192291
+min_years_experience,0.021910045
+tech__Tableau,0.02137762
+seniority_extracted,0.018940147
+equity_form_None,0.018449869
+offers_equity__isna,0.016862338
+bonus_type_None,0.016850032
+tech__MLflow,0.016713984
+role_family_extracted_Other,0.01599092
+tech__Looker,0.01598068
+citizenship__has_any,0.015499068
+manager_role,0.0152803175
+bonus_type_annual,0.013840843
+source_greenhouse,0.013541409

ladder_report.json ADDED

	@@ -0,0 +1,92 @@

+{
+  "parent_run_id": "9a63f1b9110c4eddb356f18172607135",
+  "n_train": 4917,
+  "n_test": 1226,
+  "tiers": [
+    {
+      "tier": "tier0_constant",
+      "mae": 60509.0,
+      "mae_ci_low": 57405.0,
+      "mae_ci_high": 63536.0,
+      "mape_pct": 33.65,
+      "r2_log": -0.0001,
+      "n_test": 1226,
+      "cv_mae": 62279.0,
+      "cv_mae_ci_low": 60528.0,
+      "cv_mae_ci_high": 63759.0,
+      "cv_mape_pct": 34.2065,
+      "cv_r2_log": -0.0015
+    },
+    {
+      "tier": "tier1_stratified_mean",
+      "mae": 59589.0,
+      "mae_ci_low": 56469.0,
+      "mae_ci_high": 62505.0,
+      "mape_pct": 32.89,
+      "r2_log": 0.0447,
+      "n_test": 1226,
+      "cv_mae": 61623.0,
+      "cv_mae_ci_low": 59939.0,
+      "cv_mae_ci_high": 63068.0,
+      "cv_mape_pct": 33.8121,
+      "cv_r2_log": 0.025
+    },
+    {
+      "tier": "tier2_mincer_ols",
+      "mae": 51322.0,
+      "mae_ci_low": 48366.0,
+      "mae_ci_high": 54199.0,
+      "mape_pct": 27.36,
+      "r2_log": 0.2827,
+      "n_test": 1226,
+      "cv_mae": 52041.0,
+      "cv_mae_ci_low": 50416.0,
+      "cv_mae_ci_high": 53389.0,
+      "cv_mape_pct": 27.5026,
+      "cv_r2_log": 0.2866
+    },
+    {
+      "tier": "tier3_ridge_full",
+      "mae": 43199.0,
+      "mae_ci_low": 40907.0,
+      "mae_ci_high": 45464.0,
+      "mape_pct": 23.25,
+      "r2_log": 0.462,
+      "n_test": 1226,
+      "cv_mae": 42179.0,
+      "cv_mae_ci_low": 40951.0,
+      "cv_mae_ci_high": 43236.0,
+      "cv_mape_pct": 22.3104,
+      "cv_r2_log": 0.5092
+    },
+    {
+      "tier": "tier4_random_forest",
+      "mae": 35935.0,
+      "mae_ci_low": 33906.0,
+      "mae_ci_high": 38129.0,
+      "mape_pct": 19.02,
+      "r2_log": 0.6151,
+      "n_test": 1226,
+      "cv_mae": 37016.0,
+      "cv_mae_ci_low": 35799.0,
+      "cv_mae_ci_high": 38046.0,
+      "cv_mape_pct": 19.3399,
+      "cv_r2_log": 0.6099
+    },
+    {
+      "tier": "tier5_xgboost_optuna",
+      "mae": 29091.0,
+      "mae_ci_low": 27095.0,
+      "mae_ci_high": 31157.0,
+      "mape_pct": 14.73,
+      "r2_log": 0.73,
+      "n_test": 1226,
+      "cv_mae": 30533.0,
+      "cv_mae_ci_low": 29409.0,
+      "cv_mae_ci_high": 31537.0,
+      "cv_mape_pct": 15.6159,
+      "cv_r2_log": 0.7051
+    }
+  ],
+  "winning_tier": "tier5_xgboost_optuna"
+}
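
Note on the `mae_ci_low` / `mae_ci_high` fields. The README describes these as 95% bootstrap CIs but the diff does not show how they are computed. One plausible reading is a percentile bootstrap over rows, sketched below with the standard library only. The function name, the number of resamples, and the synthetic data are assumptions for illustration; the project's actual procedure may differ.

```python
import random


def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for MAE, resampling rows with replacement."""
    rng = random.Random(seed)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_err)
    point = sum(abs_err) / n
    boot = []
    for _ in range(n_boot):
        # Each replicate draws n absolute errors with replacement.
        sample = rng.choices(abs_err, k=n)
        boot.append(sum(sample) / n)
    boot.sort()
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Synthetic salaries and predictions, not project data.
data_rng = random.Random(0)
y_true = [data_rng.uniform(100_000, 300_000) for _ in range(500)]
y_pred = [t + data_rng.gauss(0, 30_000) for t in y_true]
mae, ci_low, ci_high = bootstrap_mae_ci(y_true, y_pred)
```

Resampling the per-row absolute errors, rather than refitting the model, measures uncertainty from the finite test set only; it says nothing about variance from retraining.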

leaderboard.csv ADDED

	@@ -0,0 +1,7 @@

+tier,mae,mae_ci_low,mae_ci_high,mape_pct,r2_log,n_test,cv_mae,cv_mae_ci_low,cv_mae_ci_high,cv_mape_pct,cv_r2_log
+tier5_xgboost_optuna,29091.0,27095.0,31157.0,14.73,0.73,1226,30533.0,29409.0,31537.0,15.6159,0.7051
+tier4_random_forest,35935.0,33906.0,38129.0,19.02,0.6151,1226,37016.0,35799.0,38046.0,19.3399,0.6099
+tier3_ridge_full,43199.0,40907.0,45464.0,23.25,0.462,1226,42179.0,40951.0,43236.0,22.3104,0.5092
+tier2_mincer_ols,51322.0,48366.0,54199.0,27.36,0.2827,1226,52041.0,50416.0,53389.0,27.5026,0.2866
+tier1_stratified_mean,59589.0,56469.0,62505.0,32.89,0.0447,1226,61623.0,59939.0,63068.0,33.8121,0.025
+tier0_constant,60509.0,57405.0,63536.0,33.65,-0.0001,1226,62279.0,60528.0,63759.0,34.2065,-0.0015
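
Checking CI overlap between adjacent tiers. The README suggests reading down the bootstrap CIs to judge whether each step up the ladder is worth it. The sketch below does that mechanically on the test-set columns of this file. The helper name `compare_adjacent` is hypothetical, and interval overlap is only a rough screen, not a paired significance test.

```python
import csv
import io

# Test-set MAE columns copied from leaderboard.csv.
LEADERBOARD = """tier,mae,mae_ci_low,mae_ci_high
tier5_xgboost_optuna,29091.0,27095.0,31157.0
tier4_random_forest,35935.0,33906.0,38129.0
tier3_ridge_full,43199.0,40907.0,45464.0
tier2_mincer_ols,51322.0,48366.0,54199.0
tier1_stratified_mean,59589.0,56469.0,62505.0
tier0_constant,60509.0,57405.0,63536.0
"""

# Tier names start with tier0..tier5, so a plain sort orders the ladder.
rows = sorted(csv.DictReader(io.StringIO(LEADERBOARD)), key=lambda r: r["tier"])


def compare_adjacent(rows):
    """For each consecutive pair, report CI overlap and relative MAE gain."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        lo = max(float(prev["mae_ci_low"]), float(cur["mae_ci_low"]))
        hi = min(float(prev["mae_ci_high"]), float(cur["mae_ci_high"]))
        overlap = lo <= hi
        gain_pct = 100 * (float(prev["mae"]) - float(cur["mae"])) / float(prev["mae"])
        out.append((prev["tier"], cur["tier"], overlap, round(gain_pct, 1)))
    return out


steps = compare_adjacent(rows)
overlaps = [s[2] for s in steps]
```

On these numbers only the tier 0 to tier 1 step has overlapping intervals; every later step, including random forest to XGBoost, has disjoint test-set CIs.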

salary_predictor.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d5b75bbf280949625604fd1b8a734f9b0ba104e03772a42e6c9ff5180191382e
+size 11332367
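
Verifying a download against this pointer. The three lines above are a Git LFS pointer, not the model itself: `oid` is the SHA-256 of the real file and `size` is its length in bytes. A minimal sketch for checking downloaded bytes against the pointer follows; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical.

```python
import hashlib

# Pointer contents copied from salary_predictor.joblib in this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:d5b75bbf280949625604fd1b8a734f9b0ba104e03772a42e6c9ff5180191382e
size 11332367
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """True if the bytes have the size and SHA-256 digest the pointer records."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["oid"]
    )


pointer = parse_lfs_pointer(POINTER)
```

In practice `data` would be the bytes of the downloaded `salary_predictor.joblib`; a mismatch usually means the pointer file itself was fetched instead of the LFS object.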