arjun10g committed · Commit dae4f6c · verified · 1 Parent(s): 4567fd6

v1 :: winning tier = tier5_xgboost_optuna, MAE ≈ 29091 USD/yr

README.md ADDED
@@ -0,0 +1,176 @@
+ ---
+ license: mit
+ library_name: scikit-learn
+ tags:
+ - regression
+ - salary-prediction
+ - north-america
+ - tabular
+ metrics:
+ - mae
+ - mape
+ - r2
+ model-index:
+ - name: na-tech-jobs-salary-v1
+   results:
+   - task:
+       type: regression
+       name: Salary Regression
+     dataset:
+       name: arjun10g/na-tech-jobs (curated/jobs.parquet)
+       type: arjun10g/na-tech-jobs
+     metrics:
+     - type: mae
+       name: MAE (USD/year)
+       value: 29091
+     - type: mape
+       name: MAPE (%)
+       value: 14.73
+     - type: r2
+       name: R² (log scale)
+       value: 0.7300
+ ---
+
+ # na-tech-jobs salary regressor — v1
+
+ Predicts the **maximum disclosed salary** of a North American senior tech
+ job posting, in USD per year, given tabular features from the
+ [`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs)
+ weekly snapshot.
+
+ ## 1. Headline metrics
+
+ | Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
+ |---|---|---|---|---|
+ | MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 |
+ | MAPE | 14.73% | — | 15.62% | — |
+ | R² (on log10 target) | 0.7300 | — | 0.7051 | — |
+ | n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |
+
+ The CLAUDE.md §7 target of **MAE < $25k USD/year** has not yet been met
+ (current: $29,091).
+
+ The two columns answer different questions: **test-MAE** measures
+ generalization to the frozen 20% holdout (a single draw); **CV-MAE** is the
+ average of 5 out-of-fold MAEs on the training set and captures **split
+ variance**. When the two agree, we have evidence that the test draw was
+ representative; if test-MAE differed materially from CV-MAE, we would
+ suspect either a lucky test draw or overfitting. Here the test-set CI
+ ($27,095 – $31,157) contains the CV-MAE ($30,533), so the two estimates
+ are consistent.
+
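The bootstrap intervals in the table can be illustrated with a percentile bootstrap over per-row absolute errors. The sketch below is a minimal standard-library version on synthetic data; the function names, the 2,000 resamples, and the toy salaries are assumptions made for illustration and are not taken from the project's evaluation code.

```python
import random
import statistics


def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def bootstrap_mae_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for MAE (rows resampled with replacement)."""
    rng = random.Random(seed)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_err)
    # MAE of each resample, sorted so percentiles can be read off by index.
    resampled = sorted(
        statistics.fmean(rng.choices(abs_err, k=n)) for _ in range(n_resamples)
    )
    lo = resampled[int(n_resamples * alpha / 2)]
    hi = resampled[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi


# Synthetic example (hypothetical numbers, not the real test set).
toy_rng = random.Random(0)
y_true = [toy_rng.uniform(100_000, 300_000) for _ in range(200)]
y_pred = [t + toy_rng.gauss(0, 30_000) for t in y_true]

point = mae(y_true, y_pred)
ci_low, ci_high = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE = {point:,.0f}  95% CI = [{ci_low:,.0f}, {ci_high:,.0f}]")
```

The interval width shrinks roughly with the square root of the row count, which is why the CV interval (4,917 rows) is narrower than the test interval (1,226 rows).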
+ ## 2. The ladder
+
+ We trained six tiers, from a constant baseline up to XGBoost, and evaluated
+ all of them on the **same frozen test set**. The selected model is
+ `tier5_xgboost_optuna`.
+
+ | tier | mae | mae_ci | cv_mae | cv_mae_ci | mape_pct | r2_log | n_test |
+ |:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
+ | tier0_constant | $60,509 | $57,405-$63,536 | $62,279 | $60,528-$63,759 | 33.65 | -0.0001 | 1226 |
+ | tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623 | $59,939-$63,068 | 32.89 | 0.0447 | 1226 |
+ | tier2_mincer_ols | $51,322 | $48,366-$54,199 | $52,041 | $50,416-$53,389 | 27.36 | 0.2827 | 1226 |
+ | tier3_ridge_full | $43,199 | $40,907-$45,464 | $42,179 | $40,951-$43,236 | 23.25 | 0.4620 | 1226 |
+ | tier4_random_forest | $35,935 | $33,906-$38,129 | $37,016 | $35,799-$38,046 | 19.02 | 0.6151 | 1226 |
+ | tier5_xgboost_optuna | $29,091 | $27,095-$31,157 | $30,533 | $29,409-$31,537 | 14.73 | 0.7300 | 1226 |
+
+ The leaderboard answers the question "is the gain from XGBoost worth its
+ complexity?" Read down the bootstrap CIs to see where they overlap.
+
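Reading down the intervals can be done mechanically. The sketch below copies the test-set `mae_ci` bounds from the leaderboard and checks each adjacent pair of tiers for overlap; the helper name `intervals_overlap` is ours, introduced only for this illustration.

```python
# (tier, mae_ci_low, mae_ci_high) copied from the leaderboard, in USD/year.
ladder = [
    ("tier0_constant", 57405, 63536),
    ("tier1_stratified_mean", 56469, 62505),
    ("tier2_mincer_ols", 48366, 54199),
    ("tier3_ridge_full", 40907, 45464),
    ("tier4_random_forest", 33906, 38129),
    ("tier5_xgboost_optuna", 27095, 31157),
]


def intervals_overlap(lo_a, hi_a, lo_b, hi_b):
    """True when two closed intervals share at least one point."""
    return lo_a <= hi_b and lo_b <= hi_a


overlaps = []
for (name_a, lo_a, hi_a), (name_b, lo_b, hi_b) in zip(ladder, ladder[1:]):
    shared = intervals_overlap(lo_a, hi_a, lo_b, hi_b)
    overlaps.append(shared)
    verdict = "overlap" if shared else "disjoint"
    print(f"{name_a} vs {name_b}: {verdict}")
```

Only the `tier0_constant` / `tier1_stratified_mean` pair overlaps; every later step, including random forest to XGBoost, has disjoint intervals. Comparing marginal intervals is a conservative check, since all tiers share the same test rows; a paired bootstrap of the per-row error difference would be a sharper comparison.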
+ ## 3. Honest framing — selection bias & MNAR
+
+ The model is trained on the **disclosed-salary subset only** (~50% of NA
+ tech postings). Disclosure is **not random**:
+
+ - **Mandated jurisdictions** (US states CA / NY / WA / CO / CT / MD / IL / HI;
+   Canadian provinces ON / BC / PEI) have higher disclosure rates by law.
+ - **Voluntary disclosure** is concentrated at transparency-leaning employers
+   (Stripe, Anthropic, Anduril, and Cohere are over-represented in the
+   disclosed sample).
+ - **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means that
+   whether a salary is disclosed depends on the latent salary itself, i.e.
+   the process is missing not at random (MNAR).
+
+ **Therefore the model predicts "salary as priced by disclosing employers
+ in our corpus", not ground truth.** A 2-stage Heckman correction is
+ flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.
+
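To see why MNAR matters, the toy simulation below draws latent log10 salaries and lets the disclosure probability depend on the latent value itself. Everything here is an assumption chosen for illustration: the normal distribution, its parameters, the logistic disclosure rule, and in particular the direction of the effect (higher salaries disclosed less often). It says nothing about the real corpus.

```python
import math
import random
import statistics


def simulate_mnar(n=20_000, mu=5.2, sigma=0.15, slope=-2.0, seed=42):
    """Return (population mean, disclosed-only mean, disclosure rate).

    Salaries are latent log10 values; disclosure probability is a logistic
    function of the standardized latent salary, so missingness depends on
    the unobserved quantity itself (MNAR).
    """
    rng = random.Random(seed)
    latent = [rng.gauss(mu, sigma) for _ in range(n)]
    disclosed = []
    for y in latent:
        z = (y - mu) / sigma
        p_disclose = 1.0 / (1.0 + math.exp(-slope * z))
        if rng.random() < p_disclose:
            disclosed.append(y)
    return statistics.fmean(latent), statistics.fmean(disclosed), len(disclosed) / n


pop_mean, disclosed_mean, rate = simulate_mnar()
print(f"disclosure rate     : {rate:.2f}")
print(f"population mean     : {pop_mean:.3f} (log10 USD)")
print(f"disclosed-only mean : {disclosed_mean:.3f} (log10 USD)")
```

With roughly half the rows disclosed, the disclosed-only mean is visibly shifted from the population mean, and no amount of extra disclosed data removes that shift. A model fitted to the disclosed rows inherits the same gap, which is what a selection correction such as Heckman's is meant to address.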
+ ## 4. Inputs
+
+ Tabular features only; text is not used yet (`description_md` enters via
+ bge-m3 embeddings in Phase 5). See
+ [`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md)
+ for full definitions and
+ [`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md)
+ for encoding choices.
+
+ ## 5. Stratified evaluation
+
+ Per-stratum metrics on the test set (1,226 rows total):
+
+ | stratum | n | mae | mae_ci_low | mae_ci_high | mape_pct | r2_log |
+ |:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
+ | US/greenhouse | 1117 | $29,306 | $27,393 | $31,211 | 14.47 | 0.7273 |
+ | CA/greenhouse | 56 | $28,610 | $20,222 | $39,174 | 21.96 | 0.6730 |
+ | US/lever | 32 | $10,276 | $7,051 | $13,835 | 7.92 | 0.7615 |
+ | US/ashby | 21 | $47,632 | $27,030 | $74,180 | 19.47 | -0.0792 |
+
+ CA strata are small (a few hundred rows in total, 56 of them in the test
+ set), so claims about CA performance should be read with that caveat: the
+ CA/greenhouse CI spans $20,222 to $39,174. US/lever (n = 32) and US/ashby
+ (n = 21) are smaller still. The choice to **emphasise uncertainty
+ directly** via bootstrap CIs (rather than a Cohen-style power analysis)
+ is documented in `LITERATURE_REVIEW.md` §15.3 #15.
+
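The per-stratum table amounts to grouping test rows by `country/source` and computing MAE within each group. A minimal sketch follows, assuming each row is a dict with `country`, `source`, `y_true`, and `y_pred` keys; that row layout and the toy values are hypothetical, not the project's schema.

```python
from collections import defaultdict
import statistics


def stratified_mae(rows):
    """Map 'country/source' to (n, MAE) over the rows in that stratum."""
    abs_errors = defaultdict(list)
    for row in rows:
        stratum = f"{row['country']}/{row['source']}"
        abs_errors[stratum].append(abs(row["y_true"] - row["y_pred"]))
    return {
        stratum: (len(errs), statistics.fmean(errs))
        for stratum, errs in abs_errors.items()
    }


# Toy rows (hypothetical salaries in USD/year).
toy_rows = [
    {"country": "US", "source": "greenhouse", "y_true": 200_000, "y_pred": 190_000},
    {"country": "US", "source": "greenhouse", "y_true": 150_000, "y_pred": 170_000},
    {"country": "CA", "source": "greenhouse", "y_true": 120_000, "y_pred": 125_000},
]

per_stratum = stratified_mae(toy_rows)
for stratum, (n, stratum_mae) in per_stratum.items():
    print(f"{stratum}: n={n}, MAE=${stratum_mae:,.0f}")
```

Reporting `n` next to each MAE keeps the small-stratum caveat visible in the output itself.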
+ ## 6. Intended use
+
+ - **Recruiters and candidates**: rough salary anchoring for a given role,
+   location, and seniority bucket.
+ - **The builder**: their own NA senior-DS job search.
+ - **Researchers**: a transparent baseline for compensation prediction
+   using only public ATS data.
+
+ ## 7. Out-of-scope
+
+ - Non-NA job markets (the dataset is US/CA only).
+ - Non-tech sectors (banks, healthcare, and retail are largely on Workday
+   and not yet in the dataset; see Phase 4 of the project plan).
+ - Total compensation (the target is the maximum base salary; equity and
+   bonus appear only as boolean features).
+ - Individual-offer negotiation (the model predicts a posting, not an
+   offer).
+
+ ## 8. Training data
+
+ - Source: weekly ingest from the Greenhouse, Lever, and Ashby ATS APIs.
+ - Snapshot: 2026-05-08.
+ - Total active rows: 12,334. Disclosed-salary subset: 6,143
+   (4,917 train + 1,226 test).
+ - Train / test split: deterministic, hash-keyed by `id` and seed=42,
+   stratified by `(country, source)`. 80/20 split. Test row IDs are frozen
+   at `data/eda/test_split_ids.json`.
+
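One way to build a deterministic, hash-keyed, stratified split is sketched below. The repository's actual scheme is not shown in this card, so the SHA-256 key format, the within-stratum ranking, and the rounding rule are all assumptions; only the ingredients (`id`, seed 42, `(country, source)` strata, 80/20) come from the description above.

```python
import hashlib
from collections import defaultdict


def hash_key(row_id, seed=42):
    """Stable integer derived from the seed and the row id (assumed scheme)."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16)


def stratified_hash_split(rows, test_frac=0.2, seed=42):
    """Return the set of test ids: the lowest-hash test_frac of each stratum."""
    ids_by_stratum = defaultdict(list)
    for row in rows:
        ids_by_stratum[(row["country"], row["source"])].append(row["id"])
    test_ids = set()
    for ids in ids_by_stratum.values():
        ranked = sorted(ids, key=lambda row_id: hash_key(row_id, seed))
        n_test = round(len(ranked) * test_frac)
        test_ids.update(ranked[:n_test])
    return test_ids


# Toy corpus (hypothetical ids): 100 US rows and 10 CA rows.
toy_rows = [{"id": f"us-{i}", "country": "US", "source": "greenhouse"} for i in range(100)]
toy_rows += [{"id": f"ca-{i}", "country": "CA", "source": "greenhouse"} for i in range(10)]

test_ids = stratified_hash_split(toy_rows)
print(len(test_ids), "test rows out of", len(toy_rows))
```

Because membership depends only on the id, the seed, and the other ids in the same stratum, the split is independent of row order and reproducible across runs, which is what makes freezing the test ids in a JSON file practical.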
+ ## 9. Limitations
+
+ - **Workday gap**: major employers (Snowflake, Coinbase, Shopify,
+   Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't
+   yet support. Their salaries are missing from the training data.
+ - **No total-comp signal**: the regressor sees `offers_equity`
+   (boolean) and `bonus_mentioned` (boolean) but cannot distinguish a
+   $200k base + $100k equity package from $300k cash.
+ - **Title-derived seniority/role family is regex-noisy**: ~70% of
+   rows are labeled `Other` for role family. Phase 4 will replace these
+   regex labels with DeBERTa classifiers.
+ - **Description text not yet used**: bge-m3 dense embeddings land in
+   Phase 5. The current model is purely tabular.
+
+ ## 10. Reproducibility
+
+ ```bash
+ git clone https://github.com/Arjun10g/na-tech-jobs
+ cd na-tech-jobs
+ uv sync --extra ml --extra eda --group dev
+ uv run python -m models.salary.train
+ ```
+
+ This rebuilds the dataset, re-runs the ladder, refits the winning
+ tier on the same frozen split, and reproduces the metrics above. Random
+ seeds are fixed throughout (`42` for the split, the Optuna sampler, RF,
+ and XGB).
+
+ ## Citation
+
+ > Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
+ > https://huggingface.co/arjun10g/na-tech-jobs-salary-v1
artifacts/tier2_mincer_coefficients.csv ADDED
@@ -0,0 +1,8 @@
+ ,coef,ci_low,ci_high,p_value
+ intercept,3.31502,3.30485,3.32518,0.0
+ yoe,0.06436,0.06093,0.06779,0.0
+ yoe_sq,-0.00258,-0.00278,-0.00238,0.0
+ yoe_isna,0.0159,0.00399,0.02781,0.00888
+ min_education,0.00055,-0.00198,0.00308,0.66951
+ country_CA,1.59092,1.57775,1.60408,0.0
+ country_US,1.7241,1.71602,1.73218,0.0
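The Mincer coefficients can be read with a little arithmetic. The sketch below assumes the OLS target is log10 of USD/year (the README reports R² on a log10 target), that `yoe` is in raw, unscaled years, and that `yoe_isna` and `min_education` are held at 0; none of these is confirmed by the CSV itself, so treat the results as illustrative.

```python
# Coefficients copied from tier2_mincer_coefficients.csv.
coef = {
    "intercept": 3.31502,
    "yoe": 0.06436,
    "yoe_sq": -0.00258,
    "country_CA": 1.59092,
    "country_US": 1.7241,
}


def predict_log10(yoe, country):
    """Predicted log10 salary; other features assumed to be 0."""
    return (
        coef["intercept"]
        + coef[f"country_{country}"]
        + coef["yoe"] * yoe
        + coef["yoe_sq"] * yoe ** 2
    )


# Vertex of the quadratic in yoe: d/d(yoe) = b1 + 2 * b2 * yoe = 0.
peak_yoe = -coef["yoe"] / (2 * coef["yoe_sq"])

# A log10 gap between the country dummies converts to a salary ratio.
us_ca_ratio = 10 ** (coef["country_US"] - coef["country_CA"])

print(f"experience turning point: {peak_yoe:.1f} years")
print(f"US / CA salary ratio    : {us_ca_ratio:.2f}")
print(f"US prediction at 5 yoe  : ${10 ** predict_log10(5, 'US'):,.0f}")
```

Under these assumptions the concave experience profile turns over at about 12.5 years, and the US dummy exceeds the CA dummy by about 0.133 log10 units, a ratio of roughly 1.36. If the features were standardized before fitting, the turning point would not be in years and this reading would not apply.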
artifacts/tier3_ridge_top_coefs.csv ADDED
@@ -0,0 +1,21 @@
+ ,abs_coef,coef
+ city,0.445022240951709,0.445022240951709
+ role_family_extracted_RS,0.13563110815975582,0.13563110815975582
+ tech__MongoDB,0.1180341601685041,-0.1180341601685041
+ role_family_extracted_Manager,0.10783989492290307,-0.10783989492290307
+ region,0.09960651641622026,0.09960651641622026
+ tech__Looker,0.06835206094001053,-0.06835206094001053
+ tech__Datadog,0.06779723895227639,-0.06779723895227639
+ country_CA,0.06702324429803322,-0.06702324429803322
+ country_US,0.06702324429803203,0.06702324429803203
+ tech__Databricks,0.057431745440142803,-0.057431745440142803
+ tech__Spark,0.04974707191249773,-0.04974707191249773
+ contract_type_temporary,0.0484047919566009,0.0484047919566009
+ role_family_extracted_DA,0.046431879796239514,-0.046431879796239514
+ tech__Computer Vision,0.04256973759990993,-0.04256973759990993
+ tech__SQL,0.04210123163036985,-0.04210123163036985
+ bonus_type_performance,0.042026738140661185,0.042026738140661185
+ tech__Python,0.041980070912409614,0.041980070912409614
+ tech__TypeScript,0.040630982719931104,0.040630982719931104
+ contract_type_full_time,0.03957939970767172,-0.03957939970767172
+ source_lever,0.038847896627198036,-0.038847896627198036
artifacts/tier4_rf_importance.csv ADDED
@@ -0,0 +1,21 @@
+ ,importance
+ city,0.1782588190637426
+ min_years_experience,0.12615172134755526
+ seniority_extracted,0.10297994983591262
+ region,0.046490130613550514
+ manager_role,0.03200925212483024
+ tech__count,0.030483690466229462
+ posted__days_since,0.027520343655934115
+ equity_form_other,0.025714269659314552
+ tech__has_modern_ml,0.022602676892918033
+ citizenship__US,0.02146552217281381
+ citizenship__has_any,0.01974817136931594
+ tech__LLMs,0.01931475511338892
+ min_education,0.017627260773023328
+ tech__Python,0.01685591833565055
+ country_CA,0.01672705199145887
+ role_family_extracted_Other,0.016102858931326843
+ country_US,0.015580582798209765
+ role_family_extracted_RS,0.014510162866055623
+ remote_policy_None,0.012757884769025804
+ offers_equity__isna,0.0124716967181598
artifacts/tier5_xgb_best_params.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "n_estimators": 900,
+   "max_depth": 9,
+   "learning_rate": 0.023181594273804906,
+   "subsample": 0.8685732269018456,
+   "colsample_bytree": 0.63526932252589,
+   "min_child_weight": 2,
+   "reg_alpha": 0.010143223174105025,
+   "reg_lambda": 0.5004444504302404
+ }
artifacts/tier5_xgb_importance.csv ADDED
@@ -0,0 +1,21 @@
+ ,importance
+ role_family_extracted_RS,0.10528762
+ city,0.039080948
+ tech__has_modern_ml,0.030671984
+ country_CA,0.03056622
+ country_US,0.02591851
+ tech__MongoDB,0.02283917
+ tech__LLMs,0.02192291
+ min_years_experience,0.021910045
+ tech__Tableau,0.02137762
+ seniority_extracted,0.018940147
+ equity_form_None,0.018449869
+ offers_equity__isna,0.016862338
+ bonus_type_None,0.016850032
+ tech__MLflow,0.016713984
+ role_family_extracted_Other,0.01599092
+ tech__Looker,0.01598068
+ citizenship__has_any,0.015499068
+ manager_role,0.0152803175
+ bonus_type_annual,0.013840843
+ source_greenhouse,0.013541409
ladder_report.json ADDED
@@ -0,0 +1,92 @@
+ {
+   "parent_run_id": "9a63f1b9110c4eddb356f18172607135",
+   "n_train": 4917,
+   "n_test": 1226,
+   "tiers": [
+     {
+       "tier": "tier0_constant",
+       "mae": 60509.0,
+       "mae_ci_low": 57405.0,
+       "mae_ci_high": 63536.0,
+       "mape_pct": 33.65,
+       "r2_log": -0.0001,
+       "n_test": 1226,
+       "cv_mae": 62279.0,
+       "cv_mae_ci_low": 60528.0,
+       "cv_mae_ci_high": 63759.0,
+       "cv_mape_pct": 34.2065,
+       "cv_r2_log": -0.0015
+     },
+     {
+       "tier": "tier1_stratified_mean",
+       "mae": 59589.0,
+       "mae_ci_low": 56469.0,
+       "mae_ci_high": 62505.0,
+       "mape_pct": 32.89,
+       "r2_log": 0.0447,
+       "n_test": 1226,
+       "cv_mae": 61623.0,
+       "cv_mae_ci_low": 59939.0,
+       "cv_mae_ci_high": 63068.0,
+       "cv_mape_pct": 33.8121,
+       "cv_r2_log": 0.025
+     },
+     {
+       "tier": "tier2_mincer_ols",
+       "mae": 51322.0,
+       "mae_ci_low": 48366.0,
+       "mae_ci_high": 54199.0,
+       "mape_pct": 27.36,
+       "r2_log": 0.2827,
+       "n_test": 1226,
+       "cv_mae": 52041.0,
+       "cv_mae_ci_low": 50416.0,
+       "cv_mae_ci_high": 53389.0,
+       "cv_mape_pct": 27.5026,
+       "cv_r2_log": 0.2866
+     },
+     {
+       "tier": "tier3_ridge_full",
+       "mae": 43199.0,
+       "mae_ci_low": 40907.0,
+       "mae_ci_high": 45464.0,
+       "mape_pct": 23.25,
+       "r2_log": 0.462,
+       "n_test": 1226,
+       "cv_mae": 42179.0,
+       "cv_mae_ci_low": 40951.0,
+       "cv_mae_ci_high": 43236.0,
+       "cv_mape_pct": 22.3104,
+       "cv_r2_log": 0.5092
+     },
+     {
+       "tier": "tier4_random_forest",
+       "mae": 35935.0,
+       "mae_ci_low": 33906.0,
+       "mae_ci_high": 38129.0,
+       "mape_pct": 19.02,
+       "r2_log": 0.6151,
+       "n_test": 1226,
+       "cv_mae": 37016.0,
+       "cv_mae_ci_low": 35799.0,
+       "cv_mae_ci_high": 38046.0,
+       "cv_mape_pct": 19.3399,
+       "cv_r2_log": 0.6099
+     },
+     {
+       "tier": "tier5_xgboost_optuna",
+       "mae": 29091.0,
+       "mae_ci_low": 27095.0,
+       "mae_ci_high": 31157.0,
+       "mape_pct": 14.73,
+       "r2_log": 0.73,
+       "n_test": 1226,
+       "cv_mae": 30533.0,
+       "cv_mae_ci_low": 29409.0,
+       "cv_mae_ci_high": 31537.0,
+       "cv_mape_pct": 15.6159,
+       "cv_r2_log": 0.7051
+     }
+   ],
+   "winning_tier": "tier5_xgboost_optuna"
+ }
leaderboard.csv ADDED
@@ -0,0 +1,7 @@
+ tier,mae,mae_ci_low,mae_ci_high,mape_pct,r2_log,n_test,cv_mae,cv_mae_ci_low,cv_mae_ci_high,cv_mape_pct,cv_r2_log
+ tier5_xgboost_optuna,29091.0,27095.0,31157.0,14.73,0.73,1226,30533.0,29409.0,31537.0,15.6159,0.7051
+ tier4_random_forest,35935.0,33906.0,38129.0,19.02,0.6151,1226,37016.0,35799.0,38046.0,19.3399,0.6099
+ tier3_ridge_full,43199.0,40907.0,45464.0,23.25,0.462,1226,42179.0,40951.0,43236.0,22.3104,0.5092
+ tier2_mincer_ols,51322.0,48366.0,54199.0,27.36,0.2827,1226,52041.0,50416.0,53389.0,27.5026,0.2866
+ tier1_stratified_mean,59589.0,56469.0,62505.0,32.89,0.0447,1226,61623.0,59939.0,63068.0,33.8121,0.025
+ tier0_constant,60509.0,57405.0,63536.0,33.65,-0.0001,1226,62279.0,60528.0,63759.0,34.2065,-0.0015
salary_predictor.joblib ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d5b75bbf280949625604fd1b8a734f9b0ba104e03772a42e6c9ff5180191382e
+ size 11332367