| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - tabular-regression |
| - tabular-classification |
| - scikit-learn |
| - random-forest |
| - linear-regression |
| - kmeans-clustering |
| - feature-engineering |
| - supervised-learning |
| - predictive-modeling |
| - data-science |
| - real-estate |
| - property-prices |
| - south-america |
| pipeline_tag: tabular-regression |
| --- |
| |
| # South America Property — Price Prediction (Regression, Clustering & Classification) |
|
|
| <video src="https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/video.mp4" controls="controls" style="max-width: 720px;"></video> |
|
|
| ## Presentation |
|
|
| Watch the project overview: |
|
|
| <video src="https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/DS%20-%20assignment%202%20(Avihay%20Amor).mp4" controls="controls" style="max-width: 720px;"></video> |
|
|
| ## Quick Links |
|
|
| - **Analysis Notebook**: the `.ipynb` file in this repository |
| - **Regression Model**: `regression_model.pkl` (RandomForestRegressor) |
| - **Classification Model**: `classification_model.pkl` (RandomForestClassifier) |
| - **Dataset**: Property Listings for 5 South American Countries (Kaggle) |
| |
| ## Project Overview |
| |
| This project predicts property prices for Argentine listings, drawn from a South American property dataset, using each listing's size, room count, type, and location. It runs the full pipeline: cleaning and EDA, feature engineering with a K-Means cluster feature, a regression model for the price in USD, and a classification model for price bands. |
| |
| ## Research Question |
| |
| **How do a property's size, room count, type, and location determine its price (USD) in South America?** |
| |
| The work is split into three parts: EDA to see which features matter most, regression to predict the actual price, and classification to sort properties into Cheap / Medium / Expensive bands. |
| |
| ## Project Workflow |
| |
| ``` |
| Raw Dataset (Argentina listings, 193,173 rows after filtering × 25 features) |
| ↓ |
| Part 1: Cleaning & EDA |
| - Keep only properties for sale (Venta), priced in USD |
| - Keep only properties located in Argentina (drop stray non-Argentine rows, e.g. Rio de Janeiro) |
| - Drop missing prices, remove duplicates, cap price to 10k–1M USD |
| - Correlation analysis + 3 research questions |
| ↓ |
| Part 2: Baseline Regression (Linear Regression, R² ≈ 0.38) |
| ↓ |
| Part 3: Feature Engineering + K-Means cluster feature (k = 4) |
| ↓ |
| Part 4: Model Competition → Random Forest wins (R² ≈ 0.58) |
| ↓ |
| Part 5: Price binned into 3 equal bands (Cheap / Medium / Expensive) |
| ↓ |
| Part 6: Classification → Random Forest wins (accuracy ≈ 0.693) |
| ↓ |
| Two saved models: regression + classification |
| ``` |
|
|
| ## Dataset |
|
|
| The data is the **Property Listings for 5 South American Countries** dataset from Kaggle. I used the Argentina file (the largest, ~1M rows) and took the first 200,000 rows to keep Colab fast. The target is `price` in USD; the predictors are a mix of numeric and categorical features. |
|
|
| ### Key Features Used |
|
|
| | Feature | Description | |
| |---|---| |
| | `surface_total` | Total surface area (m²) | |
| | `surface_covered` | Covered surface area (m²) | |
| | `rooms` | Number of rooms | |
| | `bathrooms` | Number of bathrooms | |
| | `property_type` | Type of property (house, apartment, etc.) | |
| | `l2` | Province / region (location) | |
| | `price` | **Target** — sale price in USD | |
|
|
| To keep prices comparable I kept only properties for sale priced in USD, kept only listings located in Argentina (dropping a few thousand stray non-Argentine rows such as Rio de Janeiro), dropped rows without a price, removed duplicates, and capped price to the 10,000–1,000,000 USD range. Missing numeric values were filled with the column median. |
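The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `operation_type`, `currency`, and `l1` (country) are assumptions about the raw file, "capped" is read here as dropping rows outside the range, and the median fill is assumed to happen after filtering.

```python
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the described cleaning steps to a raw listings frame."""
    # Assumed column names: operation_type, currency, l1 (country).
    mask = (
        (df["operation_type"] == "Venta")
        & (df["currency"] == "USD")
        & (df["l1"] == "Argentina")
    )
    out = df.loc[mask].dropna(subset=["price"]).drop_duplicates()
    # Assumption: "capping" means keeping only rows inside the price range.
    out = out[out["price"].between(10_000, 1_000_000)].copy()
    # Fill remaining numeric gaps with each column's median.
    numeric_cols = ["surface_total", "surface_covered", "rooms", "bathrooms"]
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out.reset_index(drop=True)
```

Calling `clean_listings(raw_df)` on the sampled rows would return a frame ready for EDA, provided the raw columns match the assumed names.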
|
|
| ## Exploratory Data Analysis |
|
|
| **Price after outlier removal.** After capping to 10k–1M USD, the price range is clean enough to model — but the boxplot shows it stays heavily skewed toward the lower end. |
|
|
|  |
|
|
| **Price distribution.** Most properties are cheap, with a long tail of expensive ones. This skew is exactly why the price bands later use equal terciles rather than a simple cutoff. |
|
|
|  |
|
|
| **Correlation heatmap.** In the raw data, bathrooms and rooms have the strongest link to price (0.64 and 0.47), while the surface columns barely correlate (around 0.05). That low number is a warning sign that the surface columns still hold extreme outliers — so size looks weak here even though it should matter. The tree models later pick surface back up once they handle the non-linear patterns. |
|
|
|  |
|
|
| ### Research Questions |
|
|
| **Q1 — Does property type affect price?** Yes. Houses and larger residential types sell for clearly more than apartments, so property type is worth keeping as a feature. |
|
|
|  |
|
|
| **Q2 — Which provinces are most expensive?** A handful of provinces sit far above the rest, confirming that location is a major price driver. |
|
|
|  |
|
|
| **Q3 — Do bigger properties cost more?** Generally yes — price rises with covered surface — but the wide scatter shows size alone doesn't tell the whole story. |
|
|
|  |
|
|
| ## Baseline Regression |
|
|
| A plain Linear Regression on all features (80/20 split, seed 42) sets the benchmark. |
|
|
| | Metric | Score | |
| |---|---| |
| | MAE | 88,316.89 | |
| | RMSE | 135,240.01 | |
| | R² | 0.381 | |
|
|
| **Takeaway:** the model only explains ~38% of the variance — there's a real signal, but it's clearly non-linear, so a straight line isn't enough. |
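To make the three metrics concrete, here is a small pure-Python sketch of how MAE, RMSE, and R² are conventionally defined. The prices are made up for illustration and are not values from the project; the helper name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R² is the share of the target's variance the predictions account for.
    return {"mae": mae, "rmse": math.sqrt(ss_res / n), "r2": 1.0 - ss_res / ss_tot}


# Made-up prices in USD, not values from the project.
actual = [100_000, 200_000, 300_000, 400_000]
predicted = [150_000, 200_000, 250_000, 350_000]
scores = regression_metrics(actual, predicted)
```

Read this way, an R² of 0.381 means the residual sum of squares is still about 62% of the total, which is why the baseline leaves so much room for improvement.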
|
|
|  |
|
|
| The predicted-vs-actual plot shows the model regressing toward the middle and missing the expensive properties. |
|
|
|  |
|
|
| ## Feature Engineering & Clustering |
|
|
| I added three features — `surface_per_room`, `extra_surface`, and `total_facilities` (rooms + bathrooms) — to give the models more to work with. |
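A possible construction of these features is sketched below. Only `total_facilities` is defined in the text; the formulas for `surface_per_room` and `extra_surface` are assumptions inferred from the names, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three engineered columns added."""
    out = df.copy()
    # Stated in the text: rooms + bathrooms.
    out["total_facilities"] = out["rooms"] + out["bathrooms"]
    # Assumed formula: total surface per room (0 rooms becomes NaN, not inf).
    out["surface_per_room"] = out["surface_total"] / out["rooms"].replace(0, np.nan)
    # Assumed formula: uncovered area = total surface minus covered surface.
    out["extra_surface"] = out["surface_total"] - out["surface_covered"]
    return out
```

Guarding against zero rooms keeps infinite values out of the linear baseline; tree models would tolerate them less gracefully than one might expect, too.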
|
|
| I also ran K-Means on the scaled numeric features (excluding price, to avoid leakage). The elbow curve pointed to **k = 4**. |
|
|
|  |
|
|
| Viewed in 2D with PCA, the four clusters separate cleanly by size and value — from small/cheap up to large/premium. The cluster label was added as a new feature. |
|
|
|  |
|
|
| ## Model Competition (Regression) |
|
|
| Three models on the engineered dataset, same split: |
|
|
| | Model | MAE | RMSE | R² | |
| |---|---|---|---| |
| | Linear Regression (Engineered) | 88,273 | 134,673 | 0.386 | |
| | Decision Tree | 78,380 | 131,767 | 0.412 | |
| | **Random Forest** | **67,144** | **110,818** | **0.584** | |
|
|
|  |
|
|
| **Winner: Random Forest.** The engineered features barely moved Linear Regression, but the tree models handled the non-linearity well, and Random Forest jumped to R² ≈ 0.58. Its feature importances confirm what EDA hinted at: surface, facilities, type, and province drive price. |
|
|
|  |
|
|
| ```python |
| import pickle |
| |
| # Load the trained RandomForestRegressor (only unpickle files you trust). |
| with open("regression_model.pkl", "rb") as f: |
|     model = pickle.load(f) |
| ``` |
|
|
| ## Regression-to-Classification |
|
|
| I reframed price as a 3-class problem using **equal terciles** learned from the training set only (so nothing leaks from the test set): |
|
|
| - **Cheap**: bottom third |
| - **Medium**: middle third |
| - **Expensive**: top third |
| |
| Thresholds (USD): roughly **100,000** and **207,000**. Because the bands are terciles, the classes come out balanced (~33% each). |
| |
| ![Class balance](plots/class_balance.png) |
| |
| **Why this split:** equal bands give a clean, balanced low/mid/high question. In a pricing tool, over-labelling a property's band (a false positive) can lead a buyer to overpay, so the Expensive class's false positives get extra attention. |
| |
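A minimal sketch of tercile binning with thresholds learned from the training prices only. The prices are synthetic log-normal draws, not the real target, and the helper name `to_band` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic right-skewed prices in USD, standing in for the real target.
train_prices = rng.lognormal(mean=11.8, sigma=0.6, size=8000)
test_prices = rng.lognormal(mean=11.8, sigma=0.6, size=2000)

# Thresholds come from the training prices only, so the test set cannot leak in.
low_cut, high_cut = np.quantile(train_prices, [1 / 3, 2 / 3])


def to_band(prices):
    """Map prices to 0 = Cheap, 1 = Medium, 2 = Expensive."""
    return np.digitize(prices, [low_cut, high_cut])


train_bands = to_band(train_prices)
test_bands = to_band(test_prices)
# Share of training rows per band; close to one third each by construction.
train_share = np.bincount(train_bands, minlength=3) / len(train_bands)
```

The test set is binned with the same two cut points, so its class shares can drift slightly from one third, which is expected.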
| ## Classification & Final Evaluation |
|
|
| Three classifiers were trained on the same engineered features (Logistic Regression and KNN on scaled inputs, Random Forest on unscaled inputs): |
|
|
| | Model | Accuracy | Recall (macro) | F1 (macro) | |
| |---|---|---|---| |
| | Logistic Regression | 0.640 | 0.64 | 0.64 | |
| | KNN | 0.691 | 0.69 | 0.69 | |
| | **Random Forest (winner)** | **0.693** | **0.69** | **0.69** | |
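The setup behind this table can be sketched as follows: the two scale-sensitive models sit in a pipeline with `StandardScaler`, while the Random Forest takes the features as they are. The data is synthetic, and all hyperparameters (`max_iter=1000`, default KNN and forest settings) are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic features on very different scales, with a balanced 3-class target.
X = np.column_stack([rng.normal(100, 40, 1500), rng.normal(3, 1, 1500)])
score = X[:, 0] / 40 + X[:, 1] + rng.normal(scale=0.5, size=1500)
y = np.digitize(score, np.quantile(score, [1 / 3, 2 / 3]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

classifiers = {
    # Scaling lives inside the pipeline, so it is fitted on training data only.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    # Tree ensembles are insensitive to feature scale, so no scaler here.
    "Random Forest": RandomForestClassifier(random_state=42),
}
results = {}
for name, clf in classifiers.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall_macro": recall_score(y_test, pred, average="macro"),
        "f1_macro": f1_score(y_test, pred, average="macro"),
    }
```

With balanced classes, accuracy and macro recall land close together, which matches the near-identical columns in the table.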
|
|
| **Winner: Random Forest (accuracy ≈ 0.693).** The confusion matrix shows nearly all the mistakes happen between neighbouring bands (Cheap↔Medium, Medium↔Expensive) — Cheap and Expensive are rarely confused. In short, the model reliably separates the high and low ends, and only struggles with properties sitting right on a price threshold. |
|
|
|  |
|
|
| ```python |
| import pickle |
| |
| # Load the trained RandomForestClassifier (only unpickle files you trust). |
| with open("classification_model.pkl", "rb") as f: |
|     model = pickle.load(f) |
| ``` |
|
|
| ## Summary |
|
|
| A linear baseline got us to R² ≈ 0.38, but the real gains came from feature engineering plus a Random Forest, which reached R² ≈ 0.58 on the regression task. Reframing price into three balanced bands and training a Random Forest classifier hit ≈ 0.693 accuracy, with errors falling almost entirely between adjacent bands. Across both tasks the same drivers stand out: **surface area, facilities, property type, and province**. |
|
|
| ## Limitations |
|
|
| - Only the Argentina file was used (200,000 sampled rows, 193,173 after filtering to Argentina), so results may not generalize to all five countries. |
| - Many size/room values were missing and filled with medians, which flattens some variation. |
| - Prices are listing (asking) prices, not final sale prices. |
| - These are statistical associations, not proven causes of price. |
| |
| ## Notebook & Libraries |
| |
| The full cleaning, EDA, feature engineering, and modeling code is in the `.ipynb` file in this repo. Built with `numpy`, `pandas`, `seaborn`, `matplotlib`, and `scikit-learn` (Linear/Logistic Regression, Decision Tree, Random Forest, KNN, K-Means, PCA, StandardScaler, metrics), with models saved via `pickle`. |
| |
| --- |
| |
| *Avihay Amor | 2026* |