---
license: mit
language:
- en
tags:
- tabular-regression
- tabular-classification
- scikit-learn
- random-forest
- linear-regression
- kmeans-clustering
- feature-engineering
- supervised-learning
- predictive-modeling
- data-science
- real-estate
- property-prices
- south-america
pipeline_tag: tabular-regression
---

# South America Property — Price Prediction (Regression, Clustering & Classification)

## Presentation

Watch the project overview:

<video src="https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/video.mp4" controls="controls" style="max-width: 720px;"></video>

## Quick Links

- **Analysis Notebook** — the `.ipynb` file in this repository
- **Regression Model** — `regression_model.pkl` (RandomForestRegressor)
- **Classification Model** — `classification_model.pkl` (RandomForestClassifier)
- **Dataset** — Property Listings for 5 South American Countries (Kaggle)

## Project Overview

This project predicts property prices in South America from a listing's size, rooms, type, and location. It runs the full pipeline: cleaning and EDA, feature engineering with a K-Means cluster feature, a regression model for the price in USD, and a classification model for price bands. The two winning models are saved as pickle files.

## Research Question

**How do a property's size, room count, type, and location determine its price (USD) in South America?**

The work is split into three parts: EDA to see which features matter most, regression to predict the actual price, and classification to sort properties into Cheap / Medium / Expensive bands.

## Project Workflow

```
Raw Dataset (Argentina listings, 200,000 rows sampled × 25 features)
        ↓
Part 1: Cleaning & EDA
  - Keep only properties for sale (Venta), priced in USD
  - Drop missing prices, remove duplicates, cap price to 10k–1M USD
  - Correlation analysis + 3 research questions
        ↓
Part 2: Baseline Regression (Linear Regression, R² ≈ 0.38)
        ↓
Part 3: Feature Engineering + K-Means cluster feature (k = 4)
        ↓
Part 4: Model Competition → Random Forest wins (R² ≈ 0.58)
        ↓
Part 5: Price binned into 3 equal-sized bands, i.e. terciles (Cheap / Medium / Expensive)
        ↓
Part 6: Classification → Random Forest wins (accuracy ≈ 0.693)
        ↓
Two saved models: regression + classification
```

## Dataset

The data is the **Property Listings for 5 South American Countries** dataset from Kaggle. I used the Argentina file (the largest, ~1M rows) and kept only the first 200,000 rows to keep Colab fast. The target is `price` in USD, with a mix of numeric and categorical features.

### Key Features Used

| Feature | Description |
|---|---|
| `surface_total` | Total surface area (m²) |
| `surface_covered` | Covered surface area (m²) |
| `rooms` | Number of rooms |
| `bathrooms` | Number of bathrooms |
| `property_type` | Type of property (house, apartment, etc.) |
| `l2` | Province / region (location) |
| `price` | **Target** — sale price in USD |

To keep prices comparable I kept only properties for sale priced in USD, dropped rows without a price, removed duplicates, and capped price to the 10,000–1,000,000 USD range. Missing numeric values were filled with the column median.
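
A minimal sketch of these cleaning steps in pandas, run on a toy frame. The column names `operation_type` and `currency` are assumptions (the card only names the features in the table above), and "capped" is read here as dropping rows outside the range; clipping with `Series.clip` would be the alternative reading.

```python
import numpy as np
import pandas as pd

NUMERIC_COLS = ["surface_total", "surface_covered", "rooms", "bathrooms"]

def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw listings frame."""
    # Keep only sale listings priced in USD (column names are assumed).
    out = df[(df["operation_type"] == "Venta") & (df["currency"] == "USD")]
    # Drop rows without a price, then exact duplicate rows.
    out = out.dropna(subset=["price"]).drop_duplicates()
    # Keep prices inside the 10,000-1,000,000 USD range (assumed to mean filtering).
    out = out[out["price"].between(10_000, 1_000_000)].copy()
    # Fill missing numeric values with the column median.
    out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(out[NUMERIC_COLS].median())
    return out.reset_index(drop=True)

# Toy data: one duplicate, one rental, one ARS listing, one missing price,
# one out-of-range price, and two rows with missing numeric values.
raw = pd.DataFrame({
    "operation_type": ["Venta", "Venta", "Alquiler", "Venta", "Venta", "Venta", "Venta", "Venta"],
    "currency": ["USD", "USD", "USD", "ARS", "USD", "USD", "USD", "USD"],
    "price": [120_000, 120_000, 900, 5_000_000, np.nan, 2_500_000, 250_000, 60_000],
    "surface_total": [80.0, 80.0, 40.0, 60.0, 100.0, 900.0, np.nan, 40.0],
    "surface_covered": [70.0, 70.0, 38.0, 55.0, 90.0, 700.0, 95.0, 38.0],
    "rooms": [3.0, 3.0, 2.0, 2.0, 4.0, 9.0, 4.0, np.nan],
    "bathrooms": [1.0, 1.0, 1.0, 1.0, 2.0, 5.0, np.nan, 1.0],
})

clean = clean_listings(raw)
print(clean)
```

On the real data the median would ideally be computed on the training split only; the toy version computes it on the whole frame for brevity.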

## Exploratory Data Analysis

**Price after outlier removal.** After capping to 10k–1M USD, the price range is clean enough to model — but the boxplot shows it stays heavily skewed toward the lower end.

![Boxplot of Property Price](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/price_boxplot.png)

**Price distribution.** Most properties are cheap, with a long tail of expensive ones. This skew is why the price bands later use terciles (equal-count bands) rather than equal-width cutoffs.

![Distribution of Property Prices](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/price_distribution.png)

**Correlation heatmap.** Size is king: `surface_covered` and `surface_total` correlate most strongly with price, with rooms and bathrooms close behind.

![Correlation Heatmap](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/correlation_heatmap.png)

### Research Questions

**Q1 — Does property type affect price?** Yes. Houses and larger residential types are listed at clearly higher prices than apartments, so property type is worth keeping as a feature.

![Average Price by Property Type](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q1_price_by_type.png)

**Q2 — Which provinces are most expensive?** A handful of provinces sit far above the rest, confirming that location is a major price driver.

![Top 10 Most Expensive Provinces](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q2_top10_provinces.png)

**Q3 — Do bigger properties cost more?** Generally yes — price rises with covered surface — but the wide scatter shows size alone doesn't tell the whole story.

![Price vs Covered Surface](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q3_price_vs_surface.png)

## Baseline Regression

A plain Linear Regression on all features (80/20 split, seed 42) sets the benchmark.

| Metric | Score |
|---|---|
| MAE | 88,316.89 |
| RMSE | 135,240.01 |
| R² | 0.381 |

**Takeaway:** the model explains only ~38% of the variance. There is real signal, but a linear fit leaves most of it uncaptured, which suggests non-linear relationships or interactions between features.
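
For readers who want to reproduce the three metrics, here is a minimal sketch of the baseline setup (80/20 split, seed 42) on synthetic data. The feature values and the price formula are made up for illustration, so the printed scores will not match the table above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned listings (hypothetical values).
rng = np.random.default_rng(0)
n = 500
surface = rng.uniform(30, 300, n)
rooms = rng.integers(1, 7, n).astype(float)
price = 1_000 * surface + 5_000 * rooms + rng.normal(0, 20_000, n)
X = np.column_stack([surface, rooms])

# 80/20 split with a fixed seed, as in the baseline.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE is the square root of MSE
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.2f}  RMSE={rmse:,.2f}  R2={r2:.3f}")
```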

![Baseline Feature Coefficients](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/baseline_coefficients.png)

The predicted-vs-actual plot shows the model regressing toward the middle and missing the expensive properties.

![Baseline Predicted vs Actual Price](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/baseline_pred_vs_actual.png)

## Feature Engineering & Clustering

I added three features — `surface_per_room`, `extra_surface`, and `total_facilities` (rooms + bathrooms) — to give the models more to work with.
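
A sketch of how these three columns could be derived. Only `total_facilities` is defined in this card; the formulas for `surface_per_room` (total surface divided by rooms) and `extra_surface` (total minus covered surface) are assumptions based on the names, not confirmed definitions.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three engineered columns added."""
    out = df.copy()
    # Avoid division by zero for listings with rooms == 0.
    rooms = out["rooms"].replace(0, np.nan)
    # Assumed definition: total surface per room.
    out["surface_per_room"] = out["surface_total"] / rooms
    # Assumed definition: uncovered surface (total minus covered).
    out["extra_surface"] = out["surface_total"] - out["surface_covered"]
    # Defined in the card: rooms + bathrooms.
    out["total_facilities"] = out["rooms"] + out["bathrooms"]
    return out

sample = pd.DataFrame({
    "surface_total": [90.0, 120.0],
    "surface_covered": [75.0, 100.0],
    "rooms": [3.0, 4.0],
    "bathrooms": [1.0, 2.0],
})
engineered = add_engineered_features(sample)
print(engineered)
```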

I also ran K-Means on the scaled numeric features (excluding price, to avoid leakage). The elbow curve pointed to **k = 4**.

![Elbow Method](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/kmeans_elbow.png)

Viewed in 2D with PCA, the four clusters separate cleanly by size and value — from small/cheap up to large/premium. The cluster label was added as a new feature.

![KMeans Clusters via PCA](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/kmeans_pca_clusters.png)
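
The clustering step can be sketched as follows: scale the numeric features, record the K-Means inertia for a range of k (the elbow curve), fit the final model with k = 4, and project to 2D with PCA for plotting. The data below is four synthetic blobs, and `n_init=10` and `random_state=42` are assumed settings rather than values taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features (surface, rooms, bathrooms); price is excluded.
rng = np.random.default_rng(0)
centers = np.array([[40, 2, 1], [90, 3, 1], [180, 5, 2], [400, 7, 4]], dtype=float)
X = np.vstack([c + rng.normal(0, [5, 0.3, 0.2], size=(100, 3)) for c in centers])

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Elbow curve: inertia (within-cluster sum of squares) for each k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

# Final model with k = 4; the labels become a new feature column.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
cluster_label = kmeans.labels_

# 2D projection used only for visualization.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)
print(inertias)
```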

## Model Competition (Regression)

Three models on the engineered dataset, same split:

| Model | MAE | RMSE | R² |
|---|---|---|---|
| Linear Regression (Engineered) | 88,273 | 134,673 | 0.386 |
| Decision Tree | 78,380 | 131,767 | 0.412 |
| **Random Forest** | **67,144** | **110,818** | **0.584** |

![Model Comparison — R² Score](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/model_comparison_r2.png)

**Winner: Random Forest.** The engineered features barely moved Linear Regression (R² 0.381 → 0.386), but the tree models handled the non-linearity better, and Random Forest reached R² ≈ 0.58. Its feature importances confirm what EDA hinted at: surface, facilities, type, and province drive price.

![Regression Feature Importance](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/regression_feature_importance.png)
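
The comparison above can be sketched as a loop over the three regressors on one shared split. The data is synthetic with a deliberate interaction between size and a location code, and the hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions, so the scores are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: price depends on surface, scaled by a location code.
rng = np.random.default_rng(0)
n = 600
surface = rng.uniform(30, 300, n)
zone = rng.integers(0, 4, n)
price = 800 * surface * (1 + 0.5 * zone) + rng.normal(0, 15_000, n)
X = np.column_stack([surface, zone])

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the same training data and score it on the same test data.
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        "R2": r2_score(y_test, pred),
    }

for name, m in results.items():
    print(f"{name:18s} MAE={m['MAE']:,.0f} RMSE={m['RMSE']:,.0f} R2={m['R2']:.3f}")

# Impurity-based importances of the fitted forest, one value per feature.
importances = models["Random Forest"].feature_importances_
```

The saved winner can be loaded back as follows: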

```python
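import pickle

# Only unpickle files you trust; loading also requires scikit-learn to be installed.
with open("regression_model.pkl", "rb") as f:
    model = pickle.load(f)  # fitted RandomForestRegressor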
```

## Regression-to-Classification

I reframed price as a 3-class problem using **equal terciles** learned from the training set only (so nothing leaks from the test set):

- **Cheap** — bottom third
- **Medium** — middle third
- **Expensive** — top third

Thresholds (USD): roughly **100,000** and **207,000**. Because the bands are terciles, the classes come out balanced (~33% each).
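
A minimal sketch of the tercile binning, assuming the thresholds are the 1/3 and 2/3 quantiles of the training prices and are then applied unchanged to the test prices. The prices are drawn from a synthetic log-normal distribution, so the thresholds will differ from the ones quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (hypothetical), split 80/20.
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=11.8, sigma=0.6, size=1_000))
train, test = prices.iloc[:800], prices.iloc[800:]

# Thresholds come from the training set only, so nothing leaks from the test set.
low, high = np.quantile(train, [1 / 3, 2 / 3])
bins = [-np.inf, low, high, np.inf]
labels = ["Cheap", "Medium", "Expensive"]

# The same bin edges are applied to both splits.
y_train = pd.cut(train, bins=bins, labels=labels)
y_test = pd.cut(test, bins=bins, labels=labels)

shares = y_train.value_counts(normalize=True)
print(f"thresholds: {low:,.0f} / {high:,.0f}")
print(shares)
```

Because the edges are training-set quantiles, the training classes are balanced by construction, while the test classes are only approximately balanced.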

![Class Distribution](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/class_distribution.png)

**Why this split:** equal bands give a clean, balanced low/mid/high question. In a pricing tool, over-labelling a property's band (a false positive) can lead a buyer to overpay, so the Expensive class's false positives get extra attention.

## Classification & Final Evaluation

Three classifiers on the same engineered features (Logistic Regression and KNN on scaled inputs, Random Forest on unscaled inputs):

| Model | Accuracy | Recall (macro) | F1 (macro) |
|---|---|---|---|
| Logistic Regression | 0.640 | 0.64 | 0.64 |
| KNN | 0.691 | 0.69 | 0.69 |
| **Random Forest (winner)** | **0.693** | **0.69** | **0.69** |

**Winner: Random Forest (accuracy ≈ 0.693).** It edges out KNN (0.691) only narrowly. The confusion matrix shows nearly all the mistakes happen between neighbouring bands (Cheap↔Medium, Medium↔Expensive); Cheap and Expensive are rarely confused. In short, the model reliably separates the high and low ends and struggles mainly with properties priced near a band threshold.

![Classification Confusion Matrix](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/classification_confusion_matrix.png)
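
The classifier comparison and the "adjacent band" check can be sketched as below. Scaling is wrapped in a pipeline for Logistic Regression and KNN, while Random Forest sees unscaled inputs. The data, the integer class codes (0 = Cheap, 1 = Medium, 2 = Expensive), and the hyperparameters are all assumptions for illustration; the toy labels are generated directly rather than from training-set thresholds.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and three ordered classes (hypothetical data).
rng = np.random.default_rng(0)
n = 900
surface = rng.uniform(30, 300, n)
rooms = rng.integers(1, 7, n).astype(float)
latent = surface + 10 * rooms + rng.normal(0, 40, n)
y = np.digitize(latent, np.quantile(latent, [1 / 3, 2 / 3]))  # codes 0, 1, 2
X = np.column_stack([surface, rooms])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall_macro": recall_score(y_test, pred, average="macro"),
        "f1_macro": f1_score(y_test, pred, average="macro"),
    }

# Share of Random Forest errors that fall between neighbouring bands.
rf_pred = models["Random Forest"].predict(X_test)
cm = confusion_matrix(y_test, rf_pred, labels=[0, 1, 2])
errors = cm.sum() - np.trace(cm)
adjacent = cm[0, 1] + cm[1, 0] + cm[1, 2] + cm[2, 1]
adjacent_share = adjacent / errors
print(scores)
print(f"adjacent-band share of errors: {adjacent_share:.2f}")
```

The saved classifier can be loaded back as follows: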

```python
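import pickle

# Only unpickle files you trust; loading also requires scikit-learn to be installed.
with open("classification_model.pkl", "rb") as f:
    model = pickle.load(f)  # fitted RandomForestClassifier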
```

## Summary

A linear baseline got us to R² ≈ 0.38, but the real gains came from feature engineering plus a Random Forest, which reached R² ≈ 0.58 on the regression task. Reframing price into three balanced bands and training a Random Forest classifier hit ≈ 0.693 accuracy, with nearly all errors falling between adjacent bands. Across both tasks the same drivers stand out: **surface area, facilities, property type, and province**.

## Limitations

- Only the first 200,000 rows of the Argentina file were used, so results may not generalize to the rest of that file or to the other four countries.
- Many size/room values were missing and filled with medians, which flattens some variation.
- Prices are listing (asking) prices, not final sale prices.
- These are statistical associations, not proven causes of price.

## Notebook & Libraries

The full cleaning, EDA, feature engineering, and modeling code is in the `.ipynb` file in this repo. Built with `numpy`, `pandas`, `seaborn`, `matplotlib`, and `scikit-learn` (Linear/Logistic Regression, Decision Tree, Random Forest, KNN, K-Means, PCA, StandardScaler, metrics), with models saved via `pickle`.

---

*Avihay Amor | 2026*