---
license: mit
language:
- en
tags:
- tabular-regression
- tabular-classification
- scikit-learn
- random-forest
- linear-regression
- kmeans-clustering
- feature-engineering
- supervised-learning
- predictive-modeling
- data-science
- real-estate
- property-prices
- south-america
pipeline_tag: tabular-regression
---
# South America Property — Price Prediction (Regression, Clustering & Classification)
<video src="https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/video.mp4" controls="controls" style="max-width: 720px;"></video>
## Presentation
Watch the project overview:
<video src="https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/DS%20-%20assignment%202%20(Avihay%20Amor).mp4" controls="controls" style="max-width: 720px;"></video>
## Quick Links
- **Analysis Notebook**: the `.ipynb` file in this repository
- **Regression Model**: `regression_model.pkl` (RandomForestRegressor)
- **Classification Model**: `classification_model.pkl` (RandomForestClassifier)
- **Dataset**: Property Listings for 5 South American Countries (Kaggle)
## Project Overview
This project predicts property prices in South America (Argentina) from a listing's size, rooms, type, and location. It runs the full pipeline: cleaning and EDA, feature engineering with a K-Means cluster feature, a regression model for the price in USD, and a classification model for price bands.
## Research Question
**How do a property's size, room count, type, and location determine its price (USD) in South America?**

The work is split into three parts: EDA to see which features matter most, regression to predict the actual price, and classification to sort properties into Cheap / Medium / Expensive bands.
## Project Workflow
```
Raw Dataset (Argentina listings, 193,173 rows after filtering × 25 features) ↓
Part 1: Cleaning & EDA
- Keep only properties for sale (Venta), priced in USD
- Keep only properties located in Argentina (drop stray non-Argentine rows, e.g. Rio de Janeiro)
- Drop missing prices, remove duplicates, cap price to 10k–1M USD
- Correlation analysis + 3 research questions
Part 2: Baseline Regression (Linear Regression, R² ≈ 0.38)
Part 3: Feature Engineering + K-Means cluster feature (k = 4)
Part 4: Model Competition → Random Forest wins (R² ≈ 0.585)
Part 5: Price binned into 3 equal bands (Cheap / Medium / Expensive)
Part 6: Classification → Random Forest wins (accuracy ≈ 0.693)
Two saved models: regression + classification
```
## Dataset
The data is the **Property Listings for 5 South American Countries** dataset from Kaggle. I used the Argentina file (the largest, ~1M rows) and took only the first 200,000 rows so the notebook runs quickly in Colab. The target is `price` in USD; the features are a mix of numeric and categorical columns.
### Key Features Used
| Feature | Description |
|---|---|
| `surface_total` | Total surface area (m²) |
| `surface_covered` | Covered surface area (m²) |
| `rooms` | Number of rooms |
| `bathrooms` | Number of bathrooms |
| `property_type` | Type of property (house, apartment, etc.) |
| `l2` | Province / region (location) |
| `price` | **Target** — sale price in USD |
To keep prices comparable I kept only properties for sale priced in USD, kept only listings located in Argentina (dropping a few thousand stray non-Argentine rows such as Rio de Janeiro), dropped rows without a price, removed duplicates, and capped price to the 10,000–1,000,000 USD range. Missing numeric values were filled with the column median.
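
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `operation_type`, `currency`, and `l1` are assumptions about the Kaggle file, and "capped" is read here as dropping rows outside the price range rather than clipping them.

```python
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Filter to USD sale listings in Argentina and tidy the price column."""
    # Assumed column names: `operation_type`, `currency`, `l1` (country).
    mask = (
        (df["operation_type"] == "Venta")
        & (df["currency"] == "USD")
        & (df["l1"] == "Argentina")
    )
    out = df.loc[mask].dropna(subset=["price"]).drop_duplicates()
    # Assumption: rows outside 10,000 to 1,000,000 USD are dropped, not clipped.
    out = out[out["price"].between(10_000, 1_000_000)].copy()
    # Fill remaining numeric gaps with each column's median.
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out
```

A call might look like `clean_listings(pd.read_csv("ar_properties.csv", nrows=200_000))`, where the file name is hypothetical.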
## Exploratory Data Analysis
**Price after outlier removal.** After capping to 10k–1M USD, the price range is clean enough to model — but the boxplot shows it stays heavily skewed toward the lower end.
![Boxplot of Property Price](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/price_boxplot.png)
**Price distribution.** Most properties are cheap, with a long tail of expensive ones. This skew is exactly why the price bands later use equal terciles rather than a simple cutoff.
![Distribution of Property Prices](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/price_distribution.png)
**Correlation heatmap.** In the raw data, bathrooms and rooms have the strongest link to price (0.64 and 0.47), while the surface columns barely correlate (around 0.05). That low number is a warning sign that the surface columns still hold extreme outliers — so size looks weak here even though it should matter. The tree models later pick surface back up once they handle the non-linear patterns.
![Correlation Heatmap](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/correlation_heatmap.png)
### Research Questions
**Q1 — Does property type affect price?** Yes. Houses and larger residential types sell for clearly more than apartments, so property type is worth keeping as a feature.
![Average Price by Property Type](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q1_price_by_type.png)
**Q2 — Which provinces are most expensive?** A handful of provinces sit far above the rest, confirming that location is a major price driver.
![Top 10 Most Expensive Provinces](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q2_top10_provinces.png)
**Q3 — Do bigger properties cost more?** Generally yes — price rises with covered surface — but the wide scatter shows size alone doesn't tell the whole story.
![Price vs Covered Surface](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q3_price_vs_surface.png)
## Baseline Regression
A plain Linear Regression on all features (80/20 split, seed 42) sets the benchmark.
| Metric | Score |
|---|---|
| MAE | 88,316.89 |
| RMSE | 135,240.01 |
| R² | 0.381 |
**Takeaway:** the model only explains ~38% of the variance — there's a real signal, but it's clearly non-linear, so a straight line isn't enough.
![Baseline Feature Coefficients](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/baseline_coefficients.png)
The predicted-vs-actual plot shows the model regressing toward the middle and missing the expensive properties.
![Baseline Predicted vs Actual Price](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/baseline_pred_vs_actual.png)
## Feature Engineering & Clustering
I added three features — `surface_per_room`, `extra_surface`, and `total_facilities` (rooms + bathrooms) — to give the models more to work with.
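
A possible construction of these three columns is sketched below. Only `total_facilities` (rooms + bathrooms) is defined in the text; the formulas for `surface_per_room` and `extra_surface` are guesses based on their names and may differ from the notebook.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three engineered columns added."""
    out = df.copy()
    # Assumed formula: total surface per room (zero rooms becomes NaN, not inf).
    out["surface_per_room"] = out["surface_total"] / out["rooms"].replace(0, np.nan)
    # Assumed formula: uncovered area = total surface minus covered surface.
    out["extra_surface"] = out["surface_total"] - out["surface_covered"]
    # Stated in the text: rooms + bathrooms.
    out["total_facilities"] = out["rooms"] + out["bathrooms"]
    return out
```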
I also ran K-Means on the scaled numeric features (excluding price, to avoid leakage). The elbow curve pointed to **k = 4**.
![Elbow Method](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/kmeans_elbow.png)
Viewed in 2D with PCA, the four clusters separate cleanly by size and value — from small/cheap up to large/premium. The cluster label was added as a new feature.
![KMeans Clusters via PCA](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/kmeans_pca_clusters.png)
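
The clustering step described above (scale, inspect the elbow, fit k = 4, project with PCA) might look roughly like this. It is a sketch on synthetic data: the function name, the candidate range of k, and `n_init=10` are illustrative choices, not values taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_features(X_num, k=4, k_range=range(1, 9), seed=42):
    """Scale features, collect elbow inertias, and return labels plus a 2D PCA view."""
    # X_num should hold numeric features only, with price excluded to avoid leakage.
    X_scaled = StandardScaler().fit_transform(X_num)
    # Inertia per candidate k; plotting these values gives the elbow curve.
    inertias = {
        n: KMeans(n_clusters=n, n_init=10, random_state=seed).fit(X_scaled).inertia_
        for n in k_range
    }
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_scaled)
    coords_2d = PCA(n_components=2).fit_transform(X_scaled)
    return inertias, labels, coords_2d


# Synthetic stand-in for the numeric listing features: four loose blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [6, 6, 0], [0, 6, 6], [6, 0, 6]], dtype=float)
X_demo = np.vstack([c + rng.normal(size=(50, 3)) for c in centers])

inertias, labels, coords_2d = cluster_features(X_demo)
```

The returned `labels` array is what would be appended to the feature table as the cluster feature.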
## Model Competition (Regression)
Three models on the engineered dataset, same split:
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Linear Regression (Engineered) | 88,273 | 134,673 | 0.386 |
| Decision Tree | 78,380 | 131,767 | 0.412 |
| **Random Forest** | **67,144** | **110,818** | **0.584** |
![Model Comparison — R² Score](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/model_comparison_r2.png)
**Winner: Random Forest.** The engineered features barely moved Linear Regression, but the tree models handled the non-linearity well, and Random Forest jumped to R² ≈ 0.585. Its feature importances confirm what EDA hinted at: surface, facilities, type, and province drive price.
![Regression Feature Importance](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/regression_feature_importance.png)
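
A comparison like the one in the table could be run along these lines. The sketch assumes the same 80/20 split with seed 42 and default hyperparameters, which may not match the notebook, and uses synthetic data in place of the engineered listing features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(X, y, seed=42):
    """Fit three regressors on one 80/20 split and return MAE, RMSE, and R2 for each."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree": DecisionTreeRegressor(random_state=seed),
        "Random Forest": RandomForestRegressor(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "MAE": mean_absolute_error(y_test, pred),
            "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
            "R2": r2_score(y_test, pred),
        }
    return results


# Synthetic non-linear target, standing in for price.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(400, 2))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=400)

scores = compare_regressors(X_demo, y_demo)
for name, s in scores.items():
    print(f"{name}: MAE={s['MAE']:.3f} RMSE={s['RMSE']:.3f} R2={s['R2']:.3f}")
```

On a curved target like this one, the linear model scores poorly while the tree models recover the shape, which mirrors the pattern reported above.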
```python
import pickle

# Load the saved RandomForestRegressor (only unpickle files you trust).
with open("regression_model.pkl", "rb") as f:
    model = pickle.load(f)
```
## Regression-to-Classification
I reframed price as a 3-class problem using **equal terciles** learned from the training set only (so nothing leaks from the test set):
- **Cheap**: bottom third
- **Medium**: middle third
- **Expensive**: top third

Thresholds (USD): roughly **100,000** and **207,000**. Because the bands are terciles, the classes come out balanced (~33% each).
![Class Distribution](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/class_distribution.png)
**Why this split:** equal bands give a clean, balanced low/mid/high question. In a pricing tool, over-labelling a property's band (a false positive) can lead a buyer to overpay, so the Expensive class's false positives get extra attention.
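
Learning the tercile thresholds on the training prices and then applying them to any split can be sketched as below. The helper names are hypothetical, and the toy prices are only there to show the mechanics; they are not the project's data.

```python
import numpy as np
import pandas as pd

LABELS = ["Cheap", "Medium", "Expensive"]


def fit_tercile_edges(train_prices):
    """Return the 1/3 and 2/3 quantiles of the training prices."""
    return np.quantile(train_prices, [1 / 3, 2 / 3])


def to_price_band(prices, edges):
    """Assign each price to a band using thresholds learned on the training set."""
    bins = [-np.inf, edges[0], edges[1], np.inf]
    return pd.cut(prices, bins=bins, labels=LABELS)


# Toy training prices; thresholds come from these only.
train_prices = np.arange(1, 10) * 10_000.0
edges = fit_tercile_edges(train_prices)

train_bands = to_price_band(pd.Series(train_prices), edges)
# Unseen prices reuse the same thresholds, so nothing leaks from the test set.
test_bands = to_price_band(np.array([15_000.0, 50_000.0, 500_000.0]), edges)
```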
## Classification & Final Evaluation
Three classifiers on the same engineered features (Logistic Regression and KNN on scaled inputs, Random Forest on unscaled inputs):
| Model | Accuracy | Recall (macro) | F1 (macro) |
|---|---|---|---|
| Logistic Regression | 0.640 | 0.64 | 0.64 |
| KNN | 0.691 | 0.69 | 0.69 |
| **Random Forest (winner)** | **0.693** | **0.69** | **0.69** |
**Winner: Random Forest (accuracy ≈ 0.693).** The confusion matrix shows nearly all the mistakes happen between neighbouring bands (Cheap↔Medium, Medium↔Expensive) — Cheap and Expensive are rarely confused. In short, the model reliably separates the high and low ends, and only struggles with properties sitting right on a price threshold.
![Classification Confusion Matrix](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/classification_confusion_matrix%282%29.png)
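
One way to set up this comparison is shown below: the two scale-sensitive models sit inside a pipeline with `StandardScaler`, while Random Forest sees the raw features. Default hyperparameters and `max_iter=1000` are assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_classifiers(X, y, seed=42):
    """Fit three classifiers on one 80/20 split and return accuracy, macro recall, macro F1."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        # Scaling lives inside the pipeline, so it is fitted on training rows only.
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        # Tree ensembles do not need scaled inputs.
        "Random Forest": RandomForestClassifier(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "recall_macro": recall_score(y_test, pred, average="macro"),
            "f1_macro": f1_score(y_test, pred, average="macro"),
        }
    return results


# Synthetic three-class problem standing in for Cheap / Medium / Expensive.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.digitize(X_demo[:, 0] + X_demo[:, 1], [-0.5, 0.5])

clf_scores = compare_classifiers(X_demo, y_demo)
```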
```python
import pickle

# Load the saved RandomForestClassifier (only unpickle files you trust).
with open("classification_model.pkl", "rb") as f:
    model = pickle.load(f)
```
## Summary
A linear baseline got us to R² ≈ 0.38, but the real gains came from feature engineering plus a Random Forest, which reached R² ≈ 0.585 on the regression task. Reframing price into three balanced bands and training a Random Forest classifier hit ≈ 0.693 accuracy, with nearly all errors falling between adjacent bands. Across both tasks the same drivers stand out: **surface area, facilities, property type, and province**.
## Limitations
- Only the Argentina file was used (200,000 sampled rows, 193,173 after filtering to Argentina), so results may not generalize to all five countries.
- Many size/room values were missing and filled with medians, which flattens some variation.
- Prices are listing (asking) prices, not final sale prices.
- These are statistical associations, not proven causes of price.
## Notebook & Libraries
The full cleaning, EDA, feature engineering, and modeling code is in the `.ipynb` file in this repo. Built with `numpy`, `pandas`, `seaborn`, `matplotlib`, and `scikit-learn` (Linear/Logistic Regression, Decision Tree, Random Forest, KNN, K-Means, PCA, StandardScaler, metrics), with models saved via `pickle`.

---
*Avihay Amor | 2026*