Commit 46e7f0b by BrejBala (verified) · 1 Parent(s): 84e5d24

Update README.md

Files changed (1): README.md (+196 -3)
@@ -1,3 +1,196 @@
- ---
- license: mit
- ---
+ ---
+ language: en
+ pipeline_tag: tabular-regression
+ library_name: autogluon
+ tags:
+ - autogluon
+ - tabular-regression
+ - regression
+ - automl
+ - aws-sagemaker
+ - udacity
+ - kaggle
+ - bike-sharing-demand
+ - time-series
+ - feature-engineering
+ metrics:
+ - rmse
+ - rmsle
+ model-index:
+ - name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
+   results:
+   - task:
+       type: tabular-regression
+       name: Tabular Regression
+     dataset:
+       name: Kaggle Bike Sharing Demand (train.csv / test.csv)
+       type: csv
+     metrics:
+     - name: Validation RMSE (best run, internal AutoGluon validation)
+       type: rmse
+       value: 39.953761
+     - name: Kaggle Public Score (RMSLE, best submission)
+       type: rmsle
+       value: 0.49145
+ ---
+
+ # 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)
+
+ This model predicts hourly bike rental demand (the target column `count`) from historical weather and time features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression). The workflow follows the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.
+
+ Repository:
+ https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon
+
+ ## Model Details
+
+ - Developed by: brej-29
+ - Model type: AutoGluon `TabularPredictor` (tabular regression)
+ - Target label: `count`
+ - Problem type: regression
+ - Core approach: AutoGluon trains multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may combine them in a weighted ensemble to get the best validation performance.
+ - Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)
+
+ ## Intended Use
+
+ - Educational / portfolio demonstration of:
+   - Kaggle-style regression workflow
+   - AutoML with AutoGluon
+   - Feature engineering from datetime fields
+   - Hyperparameter optimization (HPO) experiments
+ - Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset
+
+ Out of scope:
+ - Production forecasting without monitoring, a retraining strategy, and strong input validation
+ - High-stakes operational decision-making (e.g., staffing, pricing) without deeper evaluation and error analysis
+
+ ## Training Data
+
+ Dataset: Kaggle “Bike Sharing Demand”
+
+ Typical columns include:
+ - Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
+ - Leakage columns present in train but not in test: `casual`, `registered`
+ - Target: `count`
+
+ Note: The Kaggle competition evaluates submissions using RMSLE (root mean squared logarithmic error). The project tracks Kaggle submission scores alongside offline validation metrics.
+
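As a minimal sketch of the metric named in the note above, RMSLE can be computed as shown below. The function name `rmsle` and the clipping of negative predictions to zero are assumptions made for illustration; they are not taken from the project notebook.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    # Assumption: negative predictions are clipped to 0 so that log1p is defined.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    # log1p(x) = log(1 + x), which keeps zero counts well defined.
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


if __name__ == "__main__":
    # Hypothetical counts, only to show the call.
    print(rmsle([10, 200, 0], [12, 180, 1]))
```

Because the error is taken on a log scale, this score is not directly comparable to the RMSE that AutoGluon optimizes during training.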
+ ## Preprocessing and Feature Engineering
+
+ - `datetime` is parsed as a datetime type.
+ - Leakage prevention:
+   - The notebook sets `ignored_columns = ["casual", "registered"]` because these columns are absent from the Kaggle test set and sum to the target `count`, so using them would leak the label.
+ - Feature engineering experiment:
+   - Additional time-derived features were created from `datetime`:
+     - `year`, `month`, `day`, `hour`
+   - These were used in a follow-up training run to measure their impact on performance.
+ - AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
+
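The datetime-derived features listed above could be produced along these lines. This is a sketch using pandas; the helper name `add_datetime_features` is hypothetical and the notebook may build the columns differently.

```python
import pandas as pd


def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with year, month, day and hour derived from `datetime`."""
    out = df.copy()
    # Parse the column first so the .dt accessor is available.
    out["datetime"] = pd.to_datetime(out["datetime"])
    out["year"] = out["datetime"].dt.year
    out["month"] = out["datetime"].dt.month
    out["day"] = out["datetime"].dt.day
    out["hour"] = out["datetime"].dt.hour
    return out


if __name__ == "__main__":
    sample = pd.DataFrame({"datetime": ["2012-12-19 17:00:00"]})
    print(add_datetime_features(sample))
```

Applying the same helper to both the training and the test frame keeps the engineered columns consistent between fitting and prediction.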
+ ## Training Procedure
+
+ Base configuration used in the notebook:
+ - `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
+ - Preset: `best_quality`
+ - Time limit: 600 seconds (10 minutes)
+ - Bagging: enabled by the `best_quality` preset (the fit summary of the notebook run shows bagging with 8 folds)
+
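Put together, the base configuration above might look like the following sketch. The constant and function names are hypothetical, and passing the ignored columns through `learner_kwargs` is an assumption about how the notebook wires them in.

```python
# Settings taken from the base configuration listed above.
PREDICTOR_KWARGS = {
    "label": "count",
    "problem_type": "regression",
    "eval_metric": "root_mean_squared_error",
    # Assumption: the leakage columns are excluded via learner_kwargs.
    "learner_kwargs": {"ignored_columns": ["casual", "registered"]},
}
FIT_KWARGS = {"presets": "best_quality", "time_limit": 600}


def train_baseline(train_df):
    """Fit a predictor on a pandas DataFrame holding the Kaggle training data."""
    # Imported lazily so the settings can be inspected without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(**PREDICTOR_KWARGS)
    return predictor.fit(train_data=train_df, **FIT_KWARGS)
```

Calling `train_baseline(pd.read_csv("train.csv", parse_dates=["datetime"]))` would start a run bounded by the 10-minute limit; the file name here is only an example.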
+ Hyperparameter optimization (HPO) run:
+ - Search controlled via `hyperparameter_tune_kwargs`:
+   - `num_trials = 20`
+   - `searcher = "auto"`
+   - `scheduler = "local"`
+ - Hyperparameters were provided for:
+   - GBM (LightGBM, including extra-trees style trials and a larger preset config)
+   - XT (ExtraTrees)
+   - XGB (XGBoost)
+
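A sketch of how such an HPO run could be configured is shown below. The search-control values come from the list above; the search ranges, the function names, and the `autogluon.common.space` import path (which applies to recent AutoGluon releases) are assumptions, since the notebook's actual ranges are not given here.

```python
# Search-control values from the list above.
HYPERPARAMETER_TUNE_KWARGS = {
    "num_trials": 20,
    "searcher": "auto",
    "scheduler": "local",
}


def build_hyperparameters():
    """Return a hypothetical search space for GBM, XT and XGB."""
    # Lazy import; older AutoGluon releases expose the spaces elsewhere.
    from autogluon.common import space

    return {
        "GBM": [
            # Extra-trees style LightGBM trial with an illustrative range.
            {"extra_trees": True, "num_leaves": space.Int(lower=26, upper=66, default=36)},
            # Larger preset configuration known to AutoGluon by name.
            "GBMLarge",
        ],
        "XT": {},  # default ExtraTrees settings
        "XGB": {"learning_rate": space.Real(lower=0.01, upper=0.3, default=0.1, log=True)},
    }


def train_with_hpo(train_df):
    """Fit with the search settings above on a pandas DataFrame."""
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label="count",
        problem_type="regression",
        eval_metric="root_mean_squared_error",
    )
    return predictor.fit(
        train_data=train_df,
        hyperparameters=build_hyperparameters(),
        hyperparameter_tune_kwargs=HYPERPARAMETER_TUNE_KWARGS,
    )
```

Models without a search space, such as the `XT` entry here, are simply trained once with fixed settings.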
+ ## Evaluation
+
+ Important note about AutoGluon leaderboard scores:
+ - AutoGluon’s leaderboard displays metrics in “higher is better” format.
+ - For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so:
+   - Validation RMSE = `-score_val` (the absolute value of `score_val`)
+
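The sign convention above can be undone with a one-line conversion. The sketch below uses a hand-written stand-in for the table returned by `predictor.leaderboard()`; the model name is hypothetical, and the score is the best validation value reported in this model card.

```python
import pandas as pd

# Stand-in for predictor.leaderboard(); only the columns used here are included.
leaderboard = pd.DataFrame(
    {"model": ["WeightedEnsemble_L3"], "score_val": [-39.953761]}
)

# Flip the sign to recover the RMSE in its usual "lower is better" form.
leaderboard["rmse_val"] = -leaderboard["score_val"]

# The best model has the highest score_val, i.e. the lowest RMSE.
best = leaderboard.sort_values("score_val", ascending=False).iloc[0]
print(best["model"], best["rmse_val"])
```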
+ Offline validation (AutoGluon internal validation; best run from the notebook):
+ - Best validation `score_val`: -39.953761 (root_mean_squared_error)
+ - Interpreted validation RMSE: 39.953761
+
+ Kaggle public leaderboard (submissions generated from the notebook):
+ - Initial submission, RMSLE: 1.42139
+ - Submission with added features, RMSLE: 1.41560
+ - Submission with HPO, RMSLE: 0.49145
+
+ ## How to Use
+
+ Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels/<run_name>/`) to your Hugging Face model repo.
+
+ Example inference pattern:
+
+ import pandas as pd
+ from huggingface_hub import snapshot_download
+ from autogluon.tabular import TabularPredictor
+
+ repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"
+
+ # Download the whole repo snapshot (works well for AutoGluon folders)
+ local_dir = snapshot_download(repo_id=repo_id)
+
+ # Point this to the directory that contains the AutoGluon predictor artifacts
+ predictor = TabularPredictor.load(local_dir)
+
+ # Example input (use correct values and columns); datetime is parsed, as in training
+ X = pd.DataFrame([{
+     "datetime": pd.Timestamp("2012-12-19 17:00:00"),
+     "season": 4,
+     "holiday": 0,
+     "workingday": 1,
+     "weather": 1,
+     "temp": 10.0,
+     "atemp": 12.0,
+     "humidity": 60,
+     "windspeed": 15.0
+ }])
+
+ preds = predictor.predict(X)
+ print(float(preds.iloc[0]))
+
+ If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), create them exactly as in training before calling `predict()`.
+
+ ## Input Requirements
+
+ - Input must be a tabular dataframe (pandas DataFrame recommended).
+ - Required columns should match the Kaggle test schema used for training:
+   - `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
+ - Do not include the ignored leakage columns at inference:
+   - `casual`, `registered`
+ - If you used engineered datetime columns in your final training run, generate the same features at inference:
+   - `year`, `month`, `day`, `hour`
+ - Datatypes:
+   - Numeric columns should be valid numeric types (int/float).
+   - Missing values should be handled consistently (AutoGluon can handle missing values, but consistent preprocessing is recommended).
+
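One way to enforce the requirements above before calling `predict()` is a small validation helper. This is a sketch under assumptions: the helper name `prepare_inference_frame` and the choice to raise on missing columns are illustrative and not part of the project.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "datetime", "season", "holiday", "workingday", "weather",
    "temp", "atemp", "humidity", "windspeed",
]
LEAKAGE_COLUMNS = ["casual", "registered"]


def prepare_inference_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the schema, drop leakage columns and coerce column types."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Leakage columns are dropped if present and ignored otherwise.
    out = df.drop(columns=LEAKAGE_COLUMNS, errors="ignore").copy()
    out["datetime"] = pd.to_datetime(out["datetime"])
    numeric = [col for col in REQUIRED_COLUMNS if col != "datetime"]
    # pd.to_numeric raises on values that cannot be parsed as numbers.
    out[numeric] = out[numeric].apply(pd.to_numeric)
    return out
```

If the final model was trained with engineered columns, they would need to be added after this step in the same way as during training.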
+ ## Bias, Risks, and Limitations
+
+ - This model is trained on data from a specific city and time period; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
+ - Kaggle data can contain seasonal/holiday patterns that may not generalize.
+ - RMSLE penalizes under-prediction more heavily than over-prediction of the same size and measures relative rather than absolute error; depending on your application, you may need different objectives/metrics.
+ - If `datetime` parsing or feature generation differs from training, predictions may be unreliable.
+
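To illustrate the RMSLE behavior mentioned in the list above, the sketch below compares the log-scale error of an under-prediction and an over-prediction of the same absolute size. The counts are made-up values and `log_error` is a hypothetical helper.

```python
import math


def log_error(y_true: float, y_pred: float) -> float:
    """Absolute log-scale error for one prediction, as used inside RMSLE."""
    return abs(math.log1p(y_pred) - math.log1p(y_true))


# Same absolute miss of 50 rentals around a true count of 100.
under = log_error(100, 50)
over = log_error(100, 150)
print(under > over)  # True: the under-prediction costs more

# Same relative idea: a miss of 10 matters far more at low counts.
print(log_error(10, 20) > log_error(1000, 1010))  # True
```

An application that cares about absolute errors at peak demand may therefore prefer to evaluate with RMSE or MAE alongside RMSLE.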
+ ## Environmental Impact
+
+ AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). The compute footprint is modest compared to deep learning workloads, but the `best_quality` preset can still train multiple models and ensembles.
+
+ ## Technical Specifications
+
+ - Framework: AutoGluon Tabular (`TabularPredictor`)
+ - Task: Tabular regression
+ - Eval metric used in training: root mean squared error (RMSE)
+ - Ensembling: weighted ensemble over base learners may be used (AutoGluon `best_quality` preset)
+
+ ## Model Card Authors
+
+ - BrejBala
+
+ ## Contact
+
+ For questions/feedback, please open an issue on the GitHub repository:
+ https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon