v1 :: winning tier = tier5_xgboost_optuna, MAE ≈ 29091 USD/yr
- README.md +176 -0
- artifacts/tier2_mincer_coefficients.csv +8 -0
- artifacts/tier3_ridge_top_coefs.csv +21 -0
- artifacts/tier4_rf_importance.csv +21 -0
- artifacts/tier5_xgb_best_params.json +10 -0
- artifacts/tier5_xgb_importance.csv +21 -0
- ladder_report.json +92 -0
- leaderboard.csv +7 -0
- salary_predictor.joblib +3 -0
README.md
ADDED
|
@@ -0,0 +1,176 @@
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
library_name: scikit-learn
|
| 4 |
+
tags:
|
| 5 |
+
- regression
|
| 6 |
+
- salary-prediction
|
| 7 |
+
- north-america
|
| 8 |
+
- tabular
|
| 9 |
+
metrics:
|
| 10 |
+
- mae
|
| 11 |
+
- mape
|
| 12 |
+
- r2
|
| 13 |
+
model-index:
|
| 14 |
+
- name: na-tech-jobs-salary-v1
|
| 15 |
+
results:
|
| 16 |
+
- task:
|
| 17 |
+
type: regression
|
| 18 |
+
name: Salary Regression
|
| 19 |
+
dataset:
|
| 20 |
+
name: arjun10g/na-tech-jobs (curated/jobs.parquet)
|
| 21 |
+
type: arjun10g/na-tech-jobs
|
| 22 |
+
metrics:
|
| 23 |
+
- type: mae
|
| 24 |
+
name: MAE (USD/year)
|
| 25 |
+
value: 29091
|
| 26 |
+
- type: mape
|
| 27 |
+
name: MAPE (%)
|
| 28 |
+
value: 14.73
|
| 29 |
+
- type: r2
|
| 30 |
+
name: R² (log scale)
|
| 31 |
+
value: 0.7300
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
# na-tech-jobs salary regressor — v1
|
| 35 |
+
|
| 36 |
+
Predicts the **maximum disclosed salary** of a North American senior tech
|
| 37 |
+
job posting, in USD per year, given tabular features from the
|
| 38 |
+
[`arjun10g/na-tech-jobs`](https://huggingface.co/datasets/arjun10g/na-tech-jobs)
|
| 39 |
+
weekly snapshot.
|
| 40 |
+
|
| 41 |
+
## 1. Headline metrics
|
| 42 |
+
|
| 43 |
+
| Metric | Test-set value | 95% bootstrap CI | 5-fold CV-OOF on train | CV 95% bootstrap CI |
|
| 44 |
+
|---|---|---|---|---|
|
| 45 |
+
| MAE (USD / year) | **$29,091** | $27,095 – $31,157 | **$30,533** | $29,409 – $31,537 |
|
| 46 |
+
| MAPE | 14.73% | — | 15.62% | — |
|
| 47 |
+
| R² (on log10 target) | 0.7300 | — | 0.7051 | — |
|
| 48 |
+
| n | 1,226 | (frozen test, stratified by `country × source`) | 4,917 | (5-fold OOF) |
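
As a rough illustration of how the three headline metrics relate, here is a minimal sketch using only the standard library. It assumes predictions have already been converted back to USD per year; the function name `headline_metrics` and the toy numbers are hypothetical and are not taken from the training code.

```python
import math


def headline_metrics(y_true_usd, y_pred_usd):
    """Return MAE (USD), MAPE (%) and R^2 on the log10 scale."""
    n = len(y_true_usd)
    pairs = list(zip(y_true_usd, y_pred_usd))

    # MAE and MAPE are computed on the raw USD scale.
    mae = sum(abs(t - p) for t, p in pairs) / n
    mape_pct = 100.0 * sum(abs(t - p) / t for t, p in pairs) / n

    # R^2 is computed after a log10 transform of both target and prediction.
    log_true = [math.log10(t) for t in y_true_usd]
    log_pred = [math.log10(p) for p in y_pred_usd]
    mean_true = sum(log_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(log_true, log_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in log_true)
    r2_log = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mape_pct": mape_pct, "r2_log": r2_log}


# Toy data (illustrative only, not from the dataset).
metrics = headline_metrics(
    [100_000, 200_000, 400_000],
    [110_000, 180_000, 400_000],
)
print(metrics)
```

Under this definition a predictor that always outputs the same value scores an R² near zero, which is consistent with the `tier0_constant` row of the ladder.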
|
| 49 |
+
|
| 50 |
+
The CLAUDE.md §7 target was **MAE < $25k USD/year**.
|
| 51 |
+
Not yet hit (current: $29,091).
|
| 52 |
+
|
| 53 |
+
The test and CV columns answer different questions: **test-MAE** is generalization to
|
| 54 |
+
the frozen 20% holdout (a single draw); **CV-MAE** is the average of 5
|
| 55 |
+
out-of-fold MAEs on the training set, capturing **split variance**. When the
|
| 56 |
+
two agree we have evidence the test draw was representative; if test-MAE
|
| 57 |
+
materially differs from CV-MAE we'd suspect either a lucky test draw or
|
| 58 |
+
overfitting.
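
The bootstrap intervals reported above could be produced along the following lines. This is a sketch of a plain percentile bootstrap over per-row absolute errors, assuming 2,000 resamples and seed 42; the actual resampling scheme, resample count, and helper name `bootstrap_mae_ci` are assumptions rather than a description of the project's code.

```python
import random


def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for MAE (resample rows with replacement)."""
    rng = random.Random(seed)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_err)

    # Each bootstrap replicate is the MAE of n rows drawn with replacement.
    replicates = sorted(
        sum(rng.choices(abs_err, k=n)) / n for _ in range(n_boot)
    )

    low = replicates[int((alpha / 2) * n_boot)]
    high = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Toy data (illustrative only, not from the dataset).
y_true = [150_000, 210_000, 95_000, 180_000, 240_000, 130_000]
y_pred = [160_000, 190_000, 110_000, 175_000, 200_000, 140_000]

point = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
low, high = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE = {point:,.0f} USD, 95% CI = [{low:,.0f}, {high:,.0f}]")
```

With only a handful of rows the interval is very wide, which is the same effect that shows up in the small strata of the stratified evaluation.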
|
| 59 |
+
|
| 60 |
+
## 2. The ladder
|
| 61 |
+
|
| 62 |
+
We trained six tiers from a constant baseline up to XGBoost, all on the
|
| 63 |
+
**same frozen test set**. The selected model is `tier5_xgboost_optuna`.
|
| 64 |
+
|
| 65 |
+
| tier | mae | mae_ci | cv_mae | cv_mae_ci | mape_pct | r2_log | n_test |
|
| 66 |
+
|:----------------------|:--------|:----------------|:---------|:----------------|-----------:|---------:|---------:|
|
| 67 |
+
| tier0_constant | $60,509 | $57,405-$63,536 | $62,279 | $60,528-$63,759 | 33.65 | -0.0001 | 1226 |
|
| 68 |
+
| tier1_stratified_mean | $59,589 | $56,469-$62,505 | $61,623 | $59,939-$63,068 | 32.89 | 0.0447 | 1226 |
|
| 69 |
+
| tier2_mincer_ols | $51,322 | $48,366-$54,199 | $52,041 | $50,416-$53,389 | 27.36 | 0.2827 | 1226 |
|
| 70 |
+
| tier3_ridge_full | $43,199 | $40,907-$45,464 | $42,179 | $40,951-$43,236 | 23.25 | 0.462 | 1226 |
|
| 71 |
+
| tier4_random_forest | $35,935 | $33,906-$38,129 | $37,016 | $35,799-$38,046 | 19.02 | 0.6151 | 1226 |
|
| 72 |
+
| tier5_xgboost_optuna | $29,091 | $27,095-$31,157 | $30,533 | $29,409-$31,537 | 14.73 | 0.73 | 1226 |
|
| 73 |
+
|
| 74 |
+
The leaderboard answers the question "is the gain from XGBoost worth its
|
| 75 |
+
complexity?" — read down the bootstrap CIs to see where they overlap.
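
One way to read the CIs mechanically is to check each adjacent pair of tiers for interval overlap. The sketch below copies the test-set MAE bounds from the table above; the helper `overlaps` is hypothetical. Note that non-overlap of two marginal CIs is a conservative heuristic: because every tier is scored on the same test rows, a paired bootstrap of the MAE difference would be a sharper comparison.

```python
# (tier, mae_ci_low, mae_ci_high) copied from the leaderboard table.
tiers = [
    ("tier0_constant", 57405, 63536),
    ("tier1_stratified_mean", 56469, 62505),
    ("tier2_mincer_ols", 48366, 54199),
    ("tier3_ridge_full", 40907, 45464),
    ("tier4_random_forest", 33906, 38129),
    ("tier5_xgboost_optuna", 27095, 31157),
]


def overlaps(a, b):
    """True if the closed intervals of two (name, low, high) tuples intersect."""
    return a[1] <= b[2] and b[1] <= a[2]


# Compare each tier with the next one up the ladder.
adjacent = {
    (lower[0], upper[0]): overlaps(lower, upper)
    for lower, upper in zip(tiers, tiers[1:])
}

for (lower_name, upper_name), flag in adjacent.items():
    status = "overlap" if flag else "separated"
    print(f"{lower_name} vs {upper_name}: {status}")
```

On these numbers only `tier0_constant` and `tier1_stratified_mean` overlap; every later step up the ladder has a CI that sits entirely below the previous tier's.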
|
| 76 |
+
|
| 77 |
+
## 3. Honest framing — selection bias & MNAR
|
| 78 |
+
|
| 79 |
+
The model is trained on the **disclosed-salary subset only** (~50% of NA
|
| 80 |
+
tech postings). Disclosure is **not random**:
|
| 81 |
+
|
| 82 |
+
- **Mandated jurisdictions** (CA / NY / WA / CO / CT / MD / IL / HI;
|
| 83 |
+
ON / BC / PEI) have higher disclosure rates by law.
|
| 84 |
+
- **Voluntary disclosure** is concentrated at transparency-leaning employers
|
| 85 |
+
(Stripe, Anthropic, Anduril, Cohere over-represent the disclosed sample).
|
| 86 |
+
- **Strategic non-disclosure** (Cullen & Pakzad-Hurson 2023) means the
|
| 87 |
+
unobserved component depends on the latent salary itself — i.e. the
|
| 88 |
+
process is MNAR.
|
| 89 |
+
|
| 90 |
+
**Therefore the model predicts "salary as priced by disclosing employers
|
| 91 |
+
in our corpus", not ground truth.** A 2-stage Heckman correction is
|
| 92 |
+
flagged in `LITERATURE_REVIEW.md` §1.2 as a v2 deliverable.
|
| 93 |
+
|
| 94 |
+
## 4. Inputs
|
| 95 |
+
|
| 96 |
+
Tabular features (no text yet — `description_md` enters via bge-m3
|
| 97 |
+
embeddings in Phase 5). See
|
| 98 |
+
[`DATA_DICTIONARY.md`](https://github.com/Arjun10g/na-tech-jobs/blob/main/DATA_DICTIONARY.md)
|
| 99 |
+
for full definitions and
|
| 100 |
+
[`LITERATURE_REVIEW.md` §14](https://github.com/Arjun10g/na-tech-jobs/blob/main/LITERATURE_REVIEW.md)
|
| 101 |
+
for encoding choices.
|
| 102 |
+
|
| 103 |
+
## 5. Stratified evaluation
|
| 104 |
+
|
| 105 |
+
Per-stratum MAE on the test set (1,226 rows total):
|
| 106 |
+
|
| 107 |
+
| stratum | n | mae | mae_ci_low | mae_ci_high | mape_pct | r2_log |
|
| 108 |
+
|:--------------|-----:|:--------|:-------------|:--------------|-----------:|---------:|
|
| 109 |
+
| US/greenhouse | 1117 | $29,306 | $27,393 | $31,211 | 14.47 | 0.7273 |
|
| 110 |
+
| CA/greenhouse | 56 | $28,610 | $20,222 | $39,174 | 21.96 | 0.673 |
|
| 111 |
+
| US/lever | 32 | $10,276 | $7,051 | $13,835 | 7.92 | 0.7615 |
|
| 112 |
+
| US/ashby | 21 | $47,632 | $27,030 | $74,180 | 19.47 | -0.0792 |
|
| 113 |
+
|
| 114 |
+
CA strata are small (a few hundred rows overall, only 56 in the test set); any claims on CA
|
| 115 |
+
performance should be read with that caveat. The choice to **emphasise
|
| 116 |
+
uncertainty directly** via bootstrap CIs (rather than a Cohen-style power
|
| 117 |
+
analysis) is documented in `LITERATURE_REVIEW.md` §15.3 #15.
|
| 118 |
+
|
| 119 |
+
## 6. Intended use
|
| 120 |
+
|
| 121 |
+
- **Recruiters and candidates**: rough salary anchoring for a given role
|
| 122 |
+
plus location and seniority bucket.
|
| 123 |
+
- **The builder**: their own NA senior-DS job search.
|
| 124 |
+
- **Researchers**: a transparent baseline for compensation prediction
|
| 125 |
+
using only public ATS data.
|
| 126 |
+
|
| 127 |
+
## 7. Out-of-scope
|
| 128 |
+
|
| 129 |
+
- Non-NA job markets (the dataset is US/CA only).
|
| 130 |
+
- Non-tech sectors (banks, healthcare, retail are largely on Workday
|
| 131 |
+
and not yet in the dataset; Phase 4 of the project plan).
|
| 132 |
+
- Total compensation (the target is base salary max; equity / bonus
|
| 133 |
+
are mentioned only as boolean features).
|
| 134 |
+
- Individual-offer negotiation (the model predicts a posting, not an
|
| 135 |
+
offer).
|
| 136 |
+
|
| 137 |
+
## 8. Training data
|
| 138 |
+
|
| 139 |
+
- Source: weekly ingest from Greenhouse, Lever, Ashby ATS APIs.
|
| 140 |
+
- Snapshot: 2026-05-08.
|
| 141 |
+
- Total active rows: 12,334. Disclosed-salary subset: 6,143 (4,917 train + 1,226 test).
|
| 142 |
+
- Train / test split: deterministic, hash-keyed by `id` and seed=42,
|
| 143 |
+
stratified by `(country, source)`. 80/20 split. Test row IDs frozen
|
| 144 |
+
at `data/eda/test_split_ids.json`.
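
A deterministic, hash-keyed, stratified split of the kind described above might look like the sketch below. It is an assumption-laden illustration, not the project's implementation: the SHA-256 keying of `"{seed}:{id}"`, the rank-within-stratum rule, and the names `hash_fraction` and `stratified_hash_split` are all hypothetical.

```python
import hashlib
from collections import defaultdict


def hash_fraction(row_id, seed=42):
    """Map (seed, id) to a reproducible pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode("utf-8")).hexdigest()
    return int(digest[:16], 16) / 16**16


def stratified_hash_split(rows, test_frac=0.2, seed=42):
    """Return the set of test ids, taking test_frac of each (country, source)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["country"], row["source"])].append(row["id"])

    test_ids = set()
    for ids in groups.values():
        # Rank ids by their hash; the lowest-ranked share goes to the test set.
        ranked = sorted(ids, key=lambda i: hash_fraction(i, seed))
        test_ids.update(ranked[: round(len(ranked) * test_frac)])
    return test_ids


# Toy rows (illustrative only): 160 US and 40 CA postings over two sources.
rows = [
    {
        "id": f"job-{i}",
        "country": "US" if i % 5 else "CA",
        "source": "greenhouse" if i % 2 else "lever",
    }
    for i in range(200)
]
test_ids = stratified_hash_split(rows)
print(len(test_ids), "of", len(rows), "rows assigned to test")
```

Because membership depends only on the id, the seed, and the other ids in the same stratum, re-running on the same snapshot returns the same test set regardless of row order. Freezing the resulting ids to a file, as the bullet above describes, guards against membership drifting when later snapshots add rows.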
|
| 145 |
+
|
| 146 |
+
## 9. Limitations
|
| 147 |
+
|
| 148 |
+
- **Workday gap** — major employers (Snowflake, Coinbase, Shopify,
|
| 149 |
+
Etsy, Wayfair, DoorDash) use Workday, which our extractor doesn't
|
| 150 |
+
yet support. Their salaries are missing from training data.
|
| 151 |
+
- **No total-comp signal** — the regressor sees `offers_equity`
|
| 152 |
+
(boolean) and `bonus_mentioned` (boolean) but cannot distinguish a
|
| 153 |
+
$200k base + $100k equity package from $300k cash.
|
| 154 |
+
- **Title-derived seniority/role family is regex-noisy** — ~70% of
|
| 155 |
+
rows label as `Other` for role family. Phase 4 will replace these
|
| 156 |
+
with DeBERTa classifiers.
|
| 157 |
+
- **Description text not yet used** — bge-m3 dense embedding lands in
|
| 158 |
+
Phase 5. The current model is purely tabular.
|
| 159 |
+
|
| 160 |
+
## 10. Reproducibility
|
| 161 |
+
|
| 162 |
+
```bash
|
| 163 |
+
git clone https://github.com/Arjun10g/na-tech-jobs
|
| 164 |
+
cd na-tech-jobs
|
| 165 |
+
uv sync --extra ml --extra eda --group dev
|
| 166 |
+
uv run python -m models.salary.train
|
| 167 |
+
```
|
| 168 |
+
|
| 169 |
+
This will rebuild the dataset, re-run the ladder, refit the winning
|
| 170 |
+
tier on the same frozen split, and reproduce the metrics above. Random
|
| 171 |
+
seeds are fixed throughout (`42` for split, Optuna sampler, RF, XGB).
|
| 172 |
+
|
| 173 |
+
## Citation
|
| 174 |
+
|
| 175 |
+
> Ghumman, A. (2026). _na-tech-jobs salary regressor v1._
|
| 176 |
+
> https://huggingface.co/arjun10g/na-tech-jobs-salary-v1
|
artifacts/tier2_mincer_coefficients.csv
ADDED
|
@@ -0,0 +1,8 @@
| 1 |
+
,coef,ci_low,ci_high,p_value
|
| 2 |
+
intercept,3.31502,3.30485,3.32518,0.0
|
| 3 |
+
yoe,0.06436,0.06093,0.06779,0.0
|
| 4 |
+
yoe_sq,-0.00258,-0.00278,-0.00238,0.0
|
| 5 |
+
yoe_isna,0.0159,0.00399,0.02781,0.00888
|
| 6 |
+
min_education,0.00055,-0.00198,0.00308,0.66951
|
| 7 |
+
country_CA,1.59092,1.57775,1.60408,0.0
|
| 8 |
+
country_US,1.7241,1.71602,1.73218,0.0
|
artifacts/tier3_ridge_top_coefs.csv
ADDED
|
@@ -0,0 +1,21 @@
| 1 |
+
,abs_coef,coef
|
| 2 |
+
city,0.445022240951709,0.445022240951709
|
| 3 |
+
role_family_extracted_RS,0.13563110815975582,0.13563110815975582
|
| 4 |
+
tech__MongoDB,0.1180341601685041,-0.1180341601685041
|
| 5 |
+
role_family_extracted_Manager,0.10783989492290307,-0.10783989492290307
|
| 6 |
+
region,0.09960651641622026,0.09960651641622026
|
| 7 |
+
tech__Looker,0.06835206094001053,-0.06835206094001053
|
| 8 |
+
tech__Datadog,0.06779723895227639,-0.06779723895227639
|
| 9 |
+
country_CA,0.06702324429803322,-0.06702324429803322
|
| 10 |
+
country_US,0.06702324429803203,0.06702324429803203
|
| 11 |
+
tech__Databricks,0.057431745440142803,-0.057431745440142803
|
| 12 |
+
tech__Spark,0.04974707191249773,-0.04974707191249773
|
| 13 |
+
contract_type_temporary,0.0484047919566009,0.0484047919566009
|
| 14 |
+
role_family_extracted_DA,0.046431879796239514,-0.046431879796239514
|
| 15 |
+
tech__Computer Vision,0.04256973759990993,-0.04256973759990993
|
| 16 |
+
tech__SQL,0.04210123163036985,-0.04210123163036985
|
| 17 |
+
bonus_type_performance,0.042026738140661185,0.042026738140661185
|
| 18 |
+
tech__Python,0.041980070912409614,0.041980070912409614
|
| 19 |
+
tech__TypeScript,0.040630982719931104,0.040630982719931104
|
| 20 |
+
contract_type_full_time,0.03957939970767172,-0.03957939970767172
|
| 21 |
+
source_lever,0.038847896627198036,-0.038847896627198036
|
artifacts/tier4_rf_importance.csv
ADDED
|
@@ -0,0 +1,21 @@
| 1 |
+
,importance
|
| 2 |
+
city,0.1782588190637426
|
| 3 |
+
min_years_experience,0.12615172134755526
|
| 4 |
+
seniority_extracted,0.10297994983591262
|
| 5 |
+
region,0.046490130613550514
|
| 6 |
+
manager_role,0.03200925212483024
|
| 7 |
+
tech__count,0.030483690466229462
|
| 8 |
+
posted__days_since,0.027520343655934115
|
| 9 |
+
equity_form_other,0.025714269659314552
|
| 10 |
+
tech__has_modern_ml,0.022602676892918033
|
| 11 |
+
citizenship__US,0.02146552217281381
|
| 12 |
+
citizenship__has_any,0.01974817136931594
|
| 13 |
+
tech__LLMs,0.01931475511338892
|
| 14 |
+
min_education,0.017627260773023328
|
| 15 |
+
tech__Python,0.01685591833565055
|
| 16 |
+
country_CA,0.01672705199145887
|
| 17 |
+
role_family_extracted_Other,0.016102858931326843
|
| 18 |
+
country_US,0.015580582798209765
|
| 19 |
+
role_family_extracted_RS,0.014510162866055623
|
| 20 |
+
remote_policy_None,0.012757884769025804
|
| 21 |
+
offers_equity__isna,0.0124716967181598
|
artifacts/tier5_xgb_best_params.json
ADDED
|
@@ -0,0 +1,10 @@
| 1 |
+
{
|
| 2 |
+
"n_estimators": 900,
|
| 3 |
+
"max_depth": 9,
|
| 4 |
+
"learning_rate": 0.023181594273804906,
|
| 5 |
+
"subsample": 0.8685732269018456,
|
| 6 |
+
"colsample_bytree": 0.63526932252589,
|
| 7 |
+
"min_child_weight": 2,
|
| 8 |
+
"reg_alpha": 0.010143223174105025,
|
| 9 |
+
"reg_lambda": 0.5004444504302404
|
| 10 |
+
}
|
artifacts/tier5_xgb_importance.csv
ADDED
|
@@ -0,0 +1,21 @@
| 1 |
+
,importance
|
| 2 |
+
role_family_extracted_RS,0.10528762
|
| 3 |
+
city,0.039080948
|
| 4 |
+
tech__has_modern_ml,0.030671984
|
| 5 |
+
country_CA,0.03056622
|
| 6 |
+
country_US,0.02591851
|
| 7 |
+
tech__MongoDB,0.02283917
|
| 8 |
+
tech__LLMs,0.02192291
|
| 9 |
+
min_years_experience,0.021910045
|
| 10 |
+
tech__Tableau,0.02137762
|
| 11 |
+
seniority_extracted,0.018940147
|
| 12 |
+
equity_form_None,0.018449869
|
| 13 |
+
offers_equity__isna,0.016862338
|
| 14 |
+
bonus_type_None,0.016850032
|
| 15 |
+
tech__MLflow,0.016713984
|
| 16 |
+
role_family_extracted_Other,0.01599092
|
| 17 |
+
tech__Looker,0.01598068
|
| 18 |
+
citizenship__has_any,0.015499068
|
| 19 |
+
manager_role,0.0152803175
|
| 20 |
+
bonus_type_annual,0.013840843
|
| 21 |
+
source_greenhouse,0.013541409
|
ladder_report.json
ADDED
|
@@ -0,0 +1,92 @@
| 1 |
+
{
|
| 2 |
+
"parent_run_id": "9a63f1b9110c4eddb356f18172607135",
|
| 3 |
+
"n_train": 4917,
|
| 4 |
+
"n_test": 1226,
|
| 5 |
+
"tiers": [
|
| 6 |
+
{
|
| 7 |
+
"tier": "tier0_constant",
|
| 8 |
+
"mae": 60509.0,
|
| 9 |
+
"mae_ci_low": 57405.0,
|
| 10 |
+
"mae_ci_high": 63536.0,
|
| 11 |
+
"mape_pct": 33.65,
|
| 12 |
+
"r2_log": -0.0001,
|
| 13 |
+
"n_test": 1226,
|
| 14 |
+
"cv_mae": 62279.0,
|
| 15 |
+
"cv_mae_ci_low": 60528.0,
|
| 16 |
+
"cv_mae_ci_high": 63759.0,
|
| 17 |
+
"cv_mape_pct": 34.2065,
|
| 18 |
+
"cv_r2_log": -0.0015
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"tier": "tier1_stratified_mean",
|
| 22 |
+
"mae": 59589.0,
|
| 23 |
+
"mae_ci_low": 56469.0,
|
| 24 |
+
"mae_ci_high": 62505.0,
|
| 25 |
+
"mape_pct": 32.89,
|
| 26 |
+
"r2_log": 0.0447,
|
| 27 |
+
"n_test": 1226,
|
| 28 |
+
"cv_mae": 61623.0,
|
| 29 |
+
"cv_mae_ci_low": 59939.0,
|
| 30 |
+
"cv_mae_ci_high": 63068.0,
|
| 31 |
+
"cv_mape_pct": 33.8121,
|
| 32 |
+
"cv_r2_log": 0.025
|
| 33 |
+
},
|
| 34 |
+
{
|
| 35 |
+
"tier": "tier2_mincer_ols",
|
| 36 |
+
"mae": 51322.0,
|
| 37 |
+
"mae_ci_low": 48366.0,
|
| 38 |
+
"mae_ci_high": 54199.0,
|
| 39 |
+
"mape_pct": 27.36,
|
| 40 |
+
"r2_log": 0.2827,
|
| 41 |
+
"n_test": 1226,
|
| 42 |
+
"cv_mae": 52041.0,
|
| 43 |
+
"cv_mae_ci_low": 50416.0,
|
| 44 |
+
"cv_mae_ci_high": 53389.0,
|
| 45 |
+
"cv_mape_pct": 27.5026,
|
| 46 |
+
"cv_r2_log": 0.2866
|
| 47 |
+
},
|
| 48 |
+
{
|
| 49 |
+
"tier": "tier3_ridge_full",
|
| 50 |
+
"mae": 43199.0,
|
| 51 |
+
"mae_ci_low": 40907.0,
|
| 52 |
+
"mae_ci_high": 45464.0,
|
| 53 |
+
"mape_pct": 23.25,
|
| 54 |
+
"r2_log": 0.462,
|
| 55 |
+
"n_test": 1226,
|
| 56 |
+
"cv_mae": 42179.0,
|
| 57 |
+
"cv_mae_ci_low": 40951.0,
|
| 58 |
+
"cv_mae_ci_high": 43236.0,
|
| 59 |
+
"cv_mape_pct": 22.3104,
|
| 60 |
+
"cv_r2_log": 0.5092
|
| 61 |
+
},
|
| 62 |
+
{
|
| 63 |
+
"tier": "tier4_random_forest",
|
| 64 |
+
"mae": 35935.0,
|
| 65 |
+
"mae_ci_low": 33906.0,
|
| 66 |
+
"mae_ci_high": 38129.0,
|
| 67 |
+
"mape_pct": 19.02,
|
| 68 |
+
"r2_log": 0.6151,
|
| 69 |
+
"n_test": 1226,
|
| 70 |
+
"cv_mae": 37016.0,
|
| 71 |
+
"cv_mae_ci_low": 35799.0,
|
| 72 |
+
"cv_mae_ci_high": 38046.0,
|
| 73 |
+
"cv_mape_pct": 19.3399,
|
| 74 |
+
"cv_r2_log": 0.6099
|
| 75 |
+
},
|
| 76 |
+
{
|
| 77 |
+
"tier": "tier5_xgboost_optuna",
|
| 78 |
+
"mae": 29091.0,
|
| 79 |
+
"mae_ci_low": 27095.0,
|
| 80 |
+
"mae_ci_high": 31157.0,
|
| 81 |
+
"mape_pct": 14.73,
|
| 82 |
+
"r2_log": 0.73,
|
| 83 |
+
"n_test": 1226,
|
| 84 |
+
"cv_mae": 30533.0,
|
| 85 |
+
"cv_mae_ci_low": 29409.0,
|
| 86 |
+
"cv_mae_ci_high": 31537.0,
|
| 87 |
+
"cv_mape_pct": 15.6159,
|
| 88 |
+
"cv_r2_log": 0.7051
|
| 89 |
+
}
|
| 90 |
+
],
|
| 91 |
+
"winning_tier": "tier5_xgboost_optuna"
|
| 92 |
+
}
|
leaderboard.csv
ADDED
|
@@ -0,0 +1,7 @@
| 1 |
+
tier,mae,mae_ci_low,mae_ci_high,mape_pct,r2_log,n_test,cv_mae,cv_mae_ci_low,cv_mae_ci_high,cv_mape_pct,cv_r2_log
|
| 2 |
+
tier5_xgboost_optuna,29091.0,27095.0,31157.0,14.73,0.73,1226,30533.0,29409.0,31537.0,15.6159,0.7051
|
| 3 |
+
tier4_random_forest,35935.0,33906.0,38129.0,19.02,0.6151,1226,37016.0,35799.0,38046.0,19.3399,0.6099
|
| 4 |
+
tier3_ridge_full,43199.0,40907.0,45464.0,23.25,0.462,1226,42179.0,40951.0,43236.0,22.3104,0.5092
|
| 5 |
+
tier2_mincer_ols,51322.0,48366.0,54199.0,27.36,0.2827,1226,52041.0,50416.0,53389.0,27.5026,0.2866
|
| 6 |
+
tier1_stratified_mean,59589.0,56469.0,62505.0,32.89,0.0447,1226,61623.0,59939.0,63068.0,33.8121,0.025
|
| 7 |
+
tier0_constant,60509.0,57405.0,63536.0,33.65,-0.0001,1226,62279.0,60528.0,63759.0,34.2065,-0.0015
|
salary_predictor.joblib
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d5b75bbf280949625604fd1b8a734f9b0ba104e03772a42e6c9ff5180191382e
|
| 3 |
+
size 11332367
|