+---
+language: en
+pipeline_tag: tabular-regression
+library_name: autogluon
+tags:
+  - autogluon
+  - tabular-regression
+  - regression
+  - automl
+  - aws-sagemaker
+  - udacity
+  - kaggle
+  - bike-sharing-demand
+  - time-series
+  - feature-engineering
+metrics:
+  - rmse
+  - rmsle
+model-index:
+  - name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
+    results:
+      - task:
+          type: tabular-regression
+          name: Tabular Regression
+        dataset:
+          name: Kaggle Bike Sharing Demand (train.csv / test.csv)
+          type: csv
+        metrics:
+          - name: Validation RMSE (best run, internal AutoGluon validation)
+            type: rmse
+            value: 39.953761
+          - name: Kaggle Public Score (RMSLE, best submission)
+            type: rmsle
+            value: 0.49145
+---
+# 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)
+This model predicts hourly bike rental demand (the target column `count`) from structured weather, calendar, and time-of-day features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression). The workflow follows the Udacity “Predict Bike Sharing Demand with AutoGluon” project and uses the Kaggle Bike Sharing Demand competition dataset.
+Repository:
+https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon
+## Model Details
+- Developed by: brej-29
+- Model type: AutoGluon `TabularPredictor` (tabular regression)
+- Target label: `count`
+- Problem type: regression
+- Core approach: AutoGluon trains multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and can combine them in a weighted ensemble, keeping whichever model scores best on validation.
+- Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)
+## Intended Use
+- Educational / portfolio demonstration of:
+  - Kaggle-style regression workflow
+  - AutoML with AutoGluon
+  - Feature engineering from datetime fields
+  - Hyperparameter optimization (HPO) experiments
+- Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset
+Out of scope:
+- Production forecasting without monitoring, retraining strategy, and strong input validation
+- High-stakes operational decisioning (e.g., staffing, pricing) without deeper evaluation and error analysis
+## Training Data
+Dataset: Kaggle “Bike Sharing Demand”
+Typical columns include:
+- Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
+- Leakage columns present in train but not in test: `casual`, `registered`
+- Target: `count`
+Note: The Kaggle competition evaluates submissions using RMSLE (root mean squared log error). The project tracks Kaggle submission scores alongside offline validation metrics.
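+As a rough illustration of the competition metric, the sketch below computes RMSLE using only the Python standard library. The function name `rmsle` and the clipping of negative predictions to zero are assumptions made for this example, not code taken from the notebook.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative targets."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_errors = [
        # Assumption: negative predictions are clipped to 0 so log1p stays defined.
        (math.log1p(max(pred, 0.0)) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Same absolute error (50 rentals) on a true count of 100:
under = rmsle([100], [50])   # under-prediction
over = rmsle([100], [150])   # over-prediction
print(under > over)
```

+Because the error is taken on a log scale, an under-prediction costs more than an over-prediction of the same absolute size.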
+## Preprocessing and Feature Engineering
+- `datetime` is parsed as a datetime type.
+- Leakage prevention:
+  - The notebook sets `ignored_columns = ["casual", "registered"]`. These columns are absent from the Kaggle test set, and because `count` is their sum, training on them would leak the target.
+- Feature engineering experiment:
+  - Additional time-derived features were created from `datetime`:
+    - `year`, `month`, `day`, `hour`
+  - These were used in a follow-up training run to measure impact on performance.
+- AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
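+A minimal sketch of the datetime feature step described above, using pandas. The helper name `add_datetime_features` and the two sample rows are hypothetical; the notebook may build these columns differently.

```python
import pandas as pd


def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with year/month/day/hour derived from `datetime`."""
    out = df.copy()
    out["datetime"] = pd.to_datetime(out["datetime"])
    out["year"] = out["datetime"].dt.year
    out["month"] = out["datetime"].dt.month
    out["day"] = out["datetime"].dt.day
    out["hour"] = out["datetime"].dt.hour
    return out


# Hypothetical sample rows, for illustration only.
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2012-12-19 17:00:00"]})
features = add_datetime_features(sample)
print(features[["year", "month", "day", "hour"]])
```

+Applying the same helper to both training and inference data keeps the engineered columns consistent.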
+## Training Procedure
+Base configuration used in the notebook:
+- `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
+- Preset: `best_quality`
+- Time limit: 600 seconds (10 minutes)
+- Bagging: enabled by the `best_quality` preset (the notebook's fit summary shows 8-fold bagging)
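+The base configuration above could be wired together roughly as follows. This is a sketch, not the notebook's exact code: the helper names are hypothetical, `train` is assumed to be the Kaggle `train.csv` loaded as a DataFrame, and passing `ignored_columns` through `learner_kwargs` is an assumption about how the notebook applies that setting.

```python
def base_predictor_kwargs():
    """Constructor arguments for TabularPredictor, mirroring the listed settings."""
    return {
        "label": "count",
        "problem_type": "regression",
        "eval_metric": "root_mean_squared_error",
        # Assumption: leakage columns are excluded via learner_kwargs.
        "learner_kwargs": {"ignored_columns": ["casual", "registered"]},
    }


def base_fit_kwargs():
    """Arguments for predictor.fit(): best_quality preset, 10-minute budget."""
    return {"presets": "best_quality", "time_limit": 600}


def train_base_model(train):
    """Fit a predictor on `train` (a DataFrame holding the Kaggle train.csv)."""
    # Imported lazily so the config helpers work without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(**base_predictor_kwargs())
    return predictor.fit(train_data=train, **base_fit_kwargs())
```

+Keeping the settings in small helper functions makes it easy to reuse the same configuration across the baseline and feature-engineering runs.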
+Hyperparameter optimization (HPO) run:
+- Search controlled via `hyperparameter_tune_kwargs`:
+  - `num_trials = 20`
+  - `searcher = "auto"`
+  - `scheduler = "local"`
+- Hyperparameters were provided for:
+  - GBM (LightGBM, including an extra-trees variant and a larger preset configuration)
+  - XT (ExtraTrees)
+  - XGB (XGBoost)
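+The HPO settings map onto `fit()` arguments along these lines. The keys `GBM`, `XT`, and `XGB` are AutoGluon's identifiers for the listed learners. The per-model dictionaries are placeholders, since the notebook's actual search spaces are not reproduced here, and the helper name `hpo_fit_kwargs` is hypothetical.

```python
def hpo_fit_kwargs():
    """Build fit() arguments for the HPO run (search spaces left as placeholders)."""
    hyperparameter_tune_kwargs = {
        "num_trials": 20,
        "searcher": "auto",
        "scheduler": "local",
    }
    # Placeholder configs: an empty dict means "use the model's defaults".
    # A real run would put tunable ranges (search-space objects) in these dicts.
    hyperparameters = {
        "GBM": [{"extra_trees": True}, {}],  # extra-trees variant + plain LightGBM
        "XT": {},
        "XGB": {},
    }
    return {
        "hyperparameters": hyperparameters,
        "hyperparameter_tune_kwargs": hyperparameter_tune_kwargs,
    }


kwargs = hpo_fit_kwargs()
print(kwargs["hyperparameter_tune_kwargs"])
```

+These would be passed as `predictor.fit(train_data=train, **hpo_fit_kwargs())`, together with whatever preset and time limit the run uses.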
+## Evaluation
+Important note about AutoGluon leaderboard scores:
+- AutoGluon’s leaderboard displays metrics in “higher is better” format.
+- For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so you can interpret:
+  - Validation RMSE ≈ absolute value of `score_val`
+Offline validation (AutoGluon internal validation; best run from the notebook):
+- Best validation `score_val`: -39.953761 (root_mean_squared_error)
+- Interpreted validation RMSE: 39.953761
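+To make the sign convention concrete, the sketch below converts a leaderboard-style table to positive RMSE values. The DataFrame is a hand-built stand-in for the output of `predictor.leaderboard()`: the first row uses the best `score_val` reported here, while the model names and the second row are made up for illustration.

```python
import pandas as pd

# Stand-in for predictor.leaderboard(); only -39.953761 comes from this card.
leaderboard = pd.DataFrame(
    {
        "model": ["best_model (hypothetical name)", "other_model (hypothetical)"],
        "score_val": [-39.953761, -45.0],
    }
)

# score_val is the negated RMSE, so flipping the sign recovers the error itself.
leaderboard["rmse"] = -leaderboard["score_val"]

# "Higher is better" on score_val corresponds to the lowest RMSE.
best = leaderboard.sort_values("score_val", ascending=False).iloc[0]
print(best["model"], best["rmse"])
```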
+Kaggle public leaderboard (submissions generated from notebook):
+- Initial submission: RMSLE 1.42139
+- Submission with added features: RMSLE 1.41560
+- Submission with HPO: RMSLE 0.49145
+## How to Use
+Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels/<run_name>/`) to your Hugging Face model repo.
+Example inference pattern:
+    import pandas as pd
+    from huggingface_hub import snapshot_download
+    from autogluon.tabular import TabularPredictor
+    repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"
+    # Download the whole repo snapshot (works well for AutoGluon folders)
+    local_dir = snapshot_download(repo_id=repo_id)
+    # Point this to the directory that contains the AutoGluon predictor artifacts
+    predictor = TabularPredictor.load(local_dir)
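+    # Example input; replace the values with real observations
+    X = pd.DataFrame([{
+        "datetime": "2012-12-19 17:00:00",
+        "season": 4,
+        "holiday": 0,
+        "workingday": 1,
+        "weather": 1,
+        "temp": 10.0,
+        "atemp": 12.0,
+        "humidity": 60,
+        "windspeed": 15.0
+    }])
+    # Parse `datetime` as a datetime type, matching the training preprocessing
+    X["datetime"] = pd.to_datetime(X["datetime"])
+    # predict() returns a pandas Series with one prediction per input row
+    preds = predictor.predict(X)
+    print(float(preds.iloc[0]))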
+If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), ensure you create them exactly the same way before calling `predict()`.
+## Input Requirements
+- Input must be a tabular dataframe (pandas DataFrame recommended).
+- Required columns should match the Kaggle test schema used for training:
+  - `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
+- Do not include the ignored leakage columns at inference:
+  - `casual`, `registered`
+- If using engineered datetime columns in your final training run, ensure consistent feature generation:
+  - `year`, `month`, `day`, `hour`
+- Datatypes:
+  - numeric columns should be valid numeric types (int/float)
+  - missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended)
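+One way to enforce these column rules before calling `predict()` is sketched below. The function `select_model_columns` is hypothetical and works on column names only, so it can be applied as `X = X[select_model_columns(X.columns)]`.

```python
REQUIRED_COLUMNS = [
    "datetime", "season", "holiday", "workingday", "weather",
    "temp", "atemp", "humidity", "windspeed",
]
LEAKAGE_COLUMNS = {"casual", "registered"}


def select_model_columns(columns):
    """Return the columns to keep for inference, or raise if any are missing."""
    columns = list(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Drop leakage columns; keep everything else (e.g. engineered year/month/day/hour).
    return [c for c in columns if c not in LEAKAGE_COLUMNS]


kept = select_model_columns(REQUIRED_COLUMNS + ["casual", "registered", "hour"])
print(kept)
```

+Failing fast on missing columns gives a clearer error than letting the predictor reject the frame later.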
+## Bias, Risks, and Limitations
+- This model is trained on a specific city/time period dataset; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
+- Kaggle data can contain seasonal/holiday patterns that may not generalize.
+- RMSLE measures relative rather than absolute error and penalizes under-prediction more than over-prediction of the same absolute size; depending on your application, you may need different objectives/metrics.
+- If `datetime` parsing or feature generation differs from training, predictions may be unreliable.
+## Environmental Impact
+AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). Compute footprint is modest compared to deep learning workloads, but best-quality presets can still train multiple models and ensembles.
+## Technical Specifications
+- Framework: AutoGluon Tabular (`TabularPredictor`)
+- Task: Tabular regression
+- Eval metric used in training: root mean squared error (RMSE)
+- Ensembling: weighted ensemble over base learners may be used (AutoGluon best-quality preset)
+## Model Card Authors
+- BrejBala
+## Contact
+For questions/feedback, please open an issue on the GitHub repository:
+https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon