license: mit
language:
- en
tags:
- tabular-regression
- tabular-classification
- scikit-learn
- random-forest
- linear-regression
- kmeans-clustering
- feature-engineering
- supervised-learning
- predictive-modeling
- data-science
- real-estate
- property-prices
- south-america
pipeline_tag: tabular-regression
South America Property — Price Prediction (Regression, Clustering & Classification)
Quick Links
- Analysis Notebook: the .ipynb file in this repository
- Regression Model: regression_model.pkl (RandomForestRegressor)
- Classification Model: classification_model.pkl (RandomForestClassifier)
- Dataset: Property Listings for 5 South American Countries (Kaggle)
Project Overview
This project predicts property prices in South America from a listing's size, rooms, type, and location. It runs the full pipeline: cleaning and EDA, feature engineering with a K-Means cluster feature, a regression model for the price in USD, and a classification model for price bands. The two winning models are saved as pickle files.
Research Question
How do a property's size, room count, type, and location determine its price (USD) in South America?
The work is split into three parts: EDA to see which features matter most, regression to predict the actual price, and classification to sort properties into Cheap / Medium / Expensive bands.
Project Workflow
Raw Dataset (Argentina listings, 200,000 rows sampled × 25 features)
↓
Part 1: Cleaning & EDA
- Keep only properties for sale (Venta), priced in USD
- Drop missing prices, remove duplicates, cap price to 10k–1M USD
- Correlation analysis + 3 research questions
↓
Part 2: Baseline Regression (Linear Regression, R² ≈ 0.38)
↓
Part 3: Feature Engineering + K-Means cluster feature (k = 4)
↓
Part 4: Model Competition → Random Forest wins (R² ≈ 0.58)
↓
Part 5: Price binned into 3 equal-frequency bands, i.e. terciles (Cheap / Medium / Expensive)
↓
Part 6: Classification → Random Forest wins (accuracy ≈ 0.693)
↓
Two saved models: regression + classification
Dataset
The data is the Property Listings for 5 South American Countries dataset from Kaggle. I used the Argentina file (the largest, ~1M rows) and kept only its first 200,000 rows to keep Colab fast. The target is price in USD, and the features are a mix of numeric and categorical columns.
Key Features Used
| Feature | Description |
|---|---|
| surface_total | Total surface area (m²) |
| surface_covered | Covered surface area (m²) |
| rooms | Number of rooms |
| bathrooms | Number of bathrooms |
| property_type | Type of property (house, apartment, etc.) |
| l2 | Province / region (location) |
| price | Target: sale price in USD |
To keep prices comparable I kept only properties for sale priced in USD, dropped rows without a price, removed duplicates, and capped price to the 10,000–1,000,000 USD range. Missing numeric values were filled with the column median.
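A minimal sketch of these cleaning steps in pandas. The column names `operation_type` and `currency` are assumptions (this README does not name them), the helper `clean_listings` is hypothetical, and "capped" is read here as dropping rows outside the range; if the notebook clips values instead, `Series.clip` would replace the `between` filter.

```python
import numpy as np
import pandas as pd

def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw listings frame."""
    # Assumed column names: operation_type and currency.
    mask = (df["operation_type"] == "Venta") & (df["currency"] == "USD")
    out = df.loc[mask].dropna(subset=["price"]).drop_duplicates()
    # Keep prices inside the 10,000-1,000,000 USD range.
    out = out[out["price"].between(10_000, 1_000_000)].copy()
    # Fill missing numeric values with the column median.
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out.reset_index(drop=True)

# Tiny made-up frame, only to exercise the function.
raw = pd.DataFrame({
    "operation_type": ["Venta", "Venta", "Alquiler", "Venta", "Venta", "Venta"],
    "currency": ["USD", "USD", "USD", "ARS", "USD", "USD"],
    "price": [120_000, 120_000, 900, 5_000_000, np.nan, 250_000],
    "surface_total": [80.0, 80.0, 45.0, 300.0, 60.0, np.nan],
    "rooms": [3, 3, 2, 6, 2, 4],
})
clean = clean_listings(raw)
print(clean)
```

Of the six made-up rows, only two survive: the duplicate, the rental, the ARS listing, and the row without a price are dropped, and the remaining missing surface is filled with the median.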
Exploratory Data Analysis
Price after outlier removal. After capping to 10k–1M USD, the price range is clean enough to model — but the boxplot shows it stays heavily skewed toward the lower end.
Price distribution. Most properties are cheap, with a long tail of expensive ones. This skew is exactly why the price bands later use equal terciles rather than a simple cutoff.
Correlation heatmap. Size is king: surface_covered and surface_total correlate most strongly with price, with rooms and bathrooms close behind.
Research Questions
Q1 — Does property type affect price? Yes. Houses and larger residential types sell for clearly more than apartments, so property type is worth keeping as a feature.
Q2 — Which provinces are most expensive? A handful of provinces sit far above the rest, confirming that location is a major price driver.
Q3 — Do bigger properties cost more? Generally yes — price rises with covered surface — but the wide scatter shows size alone doesn't tell the whole story.
Baseline Regression
A plain Linear Regression on all features (80/20 split, seed 42) sets the benchmark.
| Metric | Score |
|---|---|
| MAE | 88,316.89 |
| RMSE | 135,240.01 |
| R² | 0.381 |
Takeaway: the model only explains ~38% of the variance — there's a real signal, but it's clearly non-linear, so a straight line isn't enough.
The predicted-vs-actual plot shows the model regressing toward the middle and missing the expensive properties.
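The baseline setup can be sketched as follows, assuming a standard scikit-learn workflow. The data here is synthetic and only stands in for the real feature matrix, so the printed scores will not match the table above; the 80/20 split and seed 42 are the only details taken from the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features and prices live in the notebook.
rng = np.random.default_rng(0)
X = rng.uniform(20, 300, size=(500, 2))
y = 1_000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 5_000, size=500)

# 80/20 split with seed 42, as stated above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))  # RMSE = sqrt(MSE)
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.2f}  RMSE={rmse:,.2f}  R2={r2:.3f}")
```

RMSE is always at least as large as MAE because squaring penalizes large misses more, which fits the gap between the two in the table: a few badly missed expensive listings inflate RMSE.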
Feature Engineering & Clustering
I added three features — surface_per_room, extra_surface, and total_facilities (rooms + bathrooms) — to give the models more to work with.
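A hedged sketch of how these three columns could be built. Only `total_facilities` is defined in the text; the formulas for `surface_per_room` (total surface divided by rooms) and `extra_surface` (total minus covered surface) are assumptions inferred from the names, and the helper `add_engineered_features` is hypothetical.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Assumed formula: total surface per room (zero rooms become NaN, not inf).
    out["surface_per_room"] = out["surface_total"] / out["rooms"].replace(0, np.nan)
    # Assumed formula: uncovered area, i.e. total minus covered surface.
    out["extra_surface"] = out["surface_total"] - out["surface_covered"]
    # Stated in the text: rooms + bathrooms.
    out["total_facilities"] = out["rooms"] + out["bathrooms"]
    return out

# Two made-up listings to show the resulting columns.
sample = pd.DataFrame({
    "surface_total": [100.0, 60.0],
    "surface_covered": [80.0, 60.0],
    "rooms": [4, 2],
    "bathrooms": [2, 1],
})
engineered = add_engineered_features(sample)
print(engineered)
```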
I also ran K-Means on the scaled numeric features (excluding price, to avoid leakage). The elbow curve pointed to k = 4.
Viewed in 2D with PCA, the four clusters separate cleanly by size and value — from small/cheap up to large/premium. The cluster label was added as a new feature.
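The clustering step can be sketched like this, assuming scikit-learn's `StandardScaler` and `KMeans`. The features are synthetic (four made-up groups of surface and room values), and the `n_init` and `random_state` settings are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features (price deliberately left out to avoid leakage).
rng = np.random.default_rng(42)
centers = np.array([[40, 1], [90, 3], [180, 5], [350, 8]], dtype=float)
features = np.vstack([c + rng.normal(0, [8, 0.4], size=(100, 2)) for c in centers])

# K-Means is distance-based, so scale the features first.
scaled = StandardScaler().fit_transform(features)

# Elbow curve: inertia (within-cluster sum of squares) for each candidate k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias[k] = km.inertia_

# Fit the chosen k = 4 and append the label as a new feature column.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)
cluster = kmeans.labels_
features_with_cluster = np.column_stack([features, cluster])
print({k: round(v, 1) for k, v in inertias.items()})
```

Plotting `inertias` against k gives the elbow curve: the point where the drop in inertia flattens is the candidate k.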
Model Competition (Regression)
Three models on the engineered dataset, same split:
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Linear Regression (Engineered) | 88,273 | 134,673 | 0.386 |
| Decision Tree | 78,380 | 131,767 | 0.412 |
| Random Forest | 67,144 | 110,818 | 0.584 |
Winner: Random Forest. The engineered features barely moved Linear Regression (R² 0.381 → 0.386), but the tree models handled the non-linearity better, and Random Forest reached R² ≈ 0.58. Its feature importances confirm what EDA hinted at: surface, facilities, type, and province drive price.
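As an illustration of how such importances are read, here is a small sketch using `RandomForestRegressor.feature_importances_`. The data is synthetic and the `noise` column is a made-up control, so the ranking below says nothing about the real model; only the mechanics carry over.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: price is driven mostly by surface, a little by facilities.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "surface_covered": rng.uniform(30, 300, 400),
    "total_facilities": rng.integers(2, 10, 400),
    "noise": rng.normal(size=400),  # made-up control column with no signal
})
y = (
    1_500 * X["surface_covered"]
    + 5_000 * X["total_facilities"]
    + rng.normal(0, 10_000, 400)
)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favour high-cardinality features, so `sklearn.inspection.permutation_importance` is a reasonable cross-check when province or type columns are one-hot encoded.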
import pickle

# Load the saved RandomForestRegressor from this repository.
with open("regression_model.pkl", "rb") as f:
    model = pickle.load(f)
Regression-to-Classification
I reframed price as a 3-class problem using equal terciles learned from the training set only (so nothing leaks from the test set):
- Cheap — bottom third
- Medium — middle third
- Expensive — top third
Thresholds (USD): roughly 100,000 and 207,000. Because the bands are terciles, the classes come out balanced (~33% each).
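The train-only tercile split can be sketched with `Series.quantile` and `pd.cut`. The prices below are synthetic log-normal draws, so the cut points will differ from the thresholds quoted above; what matters is that the edges come from the training prices and are then reused unchanged on the test prices.

```python
import numpy as np
import pandas as pd

labels = ["Cheap", "Medium", "Expensive"]

# Synthetic right-skewed prices standing in for the real train/test targets.
rng = np.random.default_rng(42)
y_train = pd.Series(rng.lognormal(mean=11.9, sigma=0.6, size=3_000))
y_test = pd.Series(rng.lognormal(mean=11.9, sigma=0.6, size=750))

# Learn the two cut points from the training prices only.
low, high = y_train.quantile([1 / 3, 2 / 3])

# Apply the same fixed edges to both splits, so the test set never
# influences where the band boundaries fall.
edges = [-np.inf, low, high, np.inf]
band_train = pd.cut(y_train, bins=edges, labels=labels)
band_test = pd.cut(y_test, bins=edges, labels=labels)

shares = band_train.value_counts(normalize=True)
print(shares.round(3))
```

The training bands come out at about one third each by construction; the test bands are only approximately balanced, since their edges were fixed beforehand.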
Why this split: equal bands give a clean, balanced low/mid/high question. In a pricing tool, over-labelling a property's band (a false positive) can lead a buyer to overpay, so the Expensive class's false positives get extra attention.
Classification & Final Evaluation
Three classifiers on the same engineered features (Logistic Regression and KNN on scaled inputs, Random Forest on unscaled inputs):
| Model | Accuracy | Recall (macro) | F1 (macro) |
|---|---|---|---|
| Logistic Regression | 0.640 | 0.64 | 0.64 |
| KNN | 0.691 | 0.69 | 0.69 |
| Random Forest (winner) | 0.693 | 0.69 | 0.69 |
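A comparison like the one in this table can be sketched as below, assuming scikit-learn pipelines so that scaling is fitted on the training split only. The data comes from `make_classification` and the hyperparameters are defaults or illustrative choices, so the scores will not reproduce the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the engineered features and bands.
X, y = make_classification(
    n_samples=900, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale-sensitive models get a StandardScaler; the forest uses raw features.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall_macro": recall_score(y_test, pred, average="macro"),
        "f1_macro": f1_score(y_test, pred, average="macro"),
    }

for name, s in scores.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```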
Winner: Random Forest (accuracy ≈ 0.693), only narrowly ahead of KNN (0.691). The confusion matrix shows nearly all the mistakes happen between neighbouring bands (Cheap↔Medium, Medium↔Expensive); Cheap and Expensive are rarely confused. In short, the model reliably separates the high and low ends and struggles mainly with properties near a price threshold.
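One way to quantify "mistakes between neighbouring bands" is to measure how far each error sits from the diagonal of the confusion matrix. The labels below are hand-written for illustration and are not the project's predictions; the same calculation applies to the real `y_test` and model output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["Cheap", "Medium", "Expensive"]  # ordered from low to high

# Hypothetical labels, hand-written for illustration only.
y_true = ["Cheap", "Cheap", "Cheap", "Medium", "Medium", "Medium",
          "Expensive", "Expensive", "Expensive", "Expensive"]
y_pred = ["Cheap", "Medium", "Cheap", "Medium", "Cheap", "Expensive",
          "Expensive", "Medium", "Expensive", "Cheap"]

# Rows are true bands, columns are predicted bands.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Distance between true and predicted band index for every cell.
idx = np.arange(len(labels))
distance = np.abs(idx[:, None] - idx[None, :])

errors = cm[distance > 0].sum()            # all off-diagonal cells
adjacent_errors = cm[distance == 1].sum()  # off by exactly one band
adjacent_share = adjacent_errors / errors

# False positives for Expensive: predicted Expensive, truly a lower band.
exp = labels.index("Expensive")
expensive_fp = cm[:, exp].sum() - cm[exp, exp]

print(cm)
print(f"adjacent share of errors: {adjacent_share:.2f}, Expensive FPs: {expensive_fp}")
```

In this toy example 4 of the 5 errors are between adjacent bands, and one listing is wrongly labelled Expensive.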
import pickle

# Load the saved RandomForestClassifier from this repository.
with open("classification_model.pkl", "rb") as f:
    model = pickle.load(f)
Summary
A linear baseline got us to R² ≈ 0.38, but the real gains came from feature engineering plus a Random Forest, which reached R² ≈ 0.58 on the regression task. Reframing price into three balanced bands and training a Random Forest classifier hit ≈ 0.693 accuracy, with errors falling almost entirely between adjacent bands. Across both tasks the same drivers stand out: surface area, facilities, property type, and province.
Limitations
- Only the Argentina file was used, and only its first 200,000 rows rather than a random sample, so results may not generalize to all five countries.
- Many size/room values were missing and filled with medians, which flattens some variation.
- Prices are listing (asking) prices, not final sale prices.
- These are statistical associations, not proven causes of price.
Notebook & Libraries
The full cleaning, EDA, feature engineering, and modeling code is in the .ipynb file in this repo. Built with numpy, pandas, seaborn, matplotlib, and scikit-learn (Linear/Logistic Regression, Decision Tree, Random Forest, KNN, K-Means, PCA, StandardScaler, metrics), with models saved via pickle.
Avihay Amor | 2026













