South America Property — Price Prediction (Regression, Clustering & Classification)

Quick Links

  • Analysis Notebook: the .ipynb file in this repository
  • Regression Model: regression_model.pkl (RandomForestRegressor)
  • Classification Model: classification_model.pkl (RandomForestClassifier)
  • Dataset: Property Listings for 5 South American Countries (Kaggle)

Project Overview

This project predicts property prices in South America from a listing's size, rooms, type, and location. It runs the full pipeline: cleaning and EDA, feature engineering with a K-Means cluster feature, a regression model for the price in USD, and a classification model for price bands. The two winning models are saved as pickle files.

Research Question

How do a property's size, room count, type, and location determine its price (USD) in South America?

The work is split into three parts: EDA to see which features matter most, regression to predict the actual price, and classification to sort properties into Cheap / Medium / Expensive bands.

Project Workflow

Raw Dataset (Argentina listings, 200,000 rows sampled × 25 features)
        ↓
Part 1: Cleaning & EDA
  - Keep only properties for sale (Venta), priced in USD
  - Drop missing prices, remove duplicates, cap price to 10k–1M USD
  - Correlation analysis + 3 research questions
        ↓
Part 2: Baseline Regression (Linear Regression, R² ≈ 0.38)
        ↓
Part 3: Feature Engineering + K-Means cluster feature (k = 4)
        ↓
Part 4: Model Competition → Random Forest wins (R² ≈ 0.585)
        ↓
Part 5: Price binned into 3 equal bands (Cheap / Medium / Expensive)
        ↓
Part 6: Classification → Random Forest wins (accuracy ≈ 0.693)
        ↓
Two saved models: regression + classification

Dataset

The data is the Property Listings for 5 South American Countries dataset from Kaggle. I used the Argentina file (the largest, ~1M rows) and kept only the first 200,000 rows to keep Colab fast; this is a head slice rather than a random sample. The target is price in USD, with a mix of numeric and categorical features.

Key Features Used

Feature           Description
surface_total     Total surface area (m²)
surface_covered   Covered surface area (m²)
rooms             Number of rooms
bathrooms         Number of bathrooms
property_type     Type of property (house, apartment, etc.)
l2                Province / region (location)
price             Target: sale price in USD

To keep prices comparable I kept only properties for sale priced in USD, dropped rows without a price, removed duplicates, and capped price to the 10,000–1,000,000 USD range. Missing numeric values were filled with the column median.
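
The cleaning steps above can be sketched in pandas as follows. This is a minimal sketch, not the notebook's actual code: the column names `operation_type` and `currency` are assumptions (this README does not name them), "capped" is read here as dropping rows outside the range, and the small `raw` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw listings frame."""
    # Assumed column names: operation_type and currency are not named in this README.
    out = df[(df["operation_type"] == "Venta") & (df["currency"] == "USD")]
    # Drop rows without a price, then exact duplicate rows.
    out = out.dropna(subset=["price"]).drop_duplicates()
    # Assumption: "capped" means keeping only rows inside the 10k-1M USD range.
    out = out[out["price"].between(10_000, 1_000_000)].copy()
    # Fill missing numeric values with each column's median.
    num_cols = ["surface_total", "surface_covered", "rooms", "bathrooms"]
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out.reset_index(drop=True)

# Tiny made-up frame, for illustration only.
raw = pd.DataFrame({
    "operation_type": ["Venta", "Venta", "Alquiler", "Venta", "Venta", "Venta"],
    "currency": ["USD", "USD", "USD", "ARS", "USD", "USD"],
    "price": [120_000, 5_000, 90_000, 80_000, np.nan, 300_000],
    "surface_total": [80.0, 30.0, 60.0, 70.0, 50.0, np.nan],
    "surface_covered": [70.0, 25.0, 55.0, 65.0, 45.0, 150.0],
    "rooms": [3.0, 1.0, 2.0, 3.0, 2.0, np.nan],
    "bathrooms": [1.0, 1.0, 1.0, 2.0, 1.0, 2.0],
})
clean = clean_listings(raw)
print(clean)
```

If the notebook clips prices to the bounds instead of dropping rows, `Series.clip(10_000, 1_000_000)` would be the equivalent call.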

Exploratory Data Analysis

Price after outlier removal. After capping to 10k–1M USD, the price range is clean enough to model — but the boxplot shows it stays heavily skewed toward the lower end.

Boxplot of Property Price

Price distribution. Most properties are cheap, with a long tail of expensive ones. This skew is why the price bands later use equal terciles rather than fixed price cutoffs, which would leave the upper bands sparsely populated.

Distribution of Property Prices

Correlation heatmap. Size is king: surface_covered and surface_total correlate most strongly with price, with rooms and bathrooms close behind.

Correlation Heatmap
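
A ranking like the one read off the heatmap can be computed directly from the correlation matrix. The sketch below uses synthetic data (all numbers are made up) purely to show the call pattern; only the column names come from the feature table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
surface_covered = rng.uniform(30, 300, n)

# Synthetic stand-in for the cleaned listings; values are illustrative only.
demo = pd.DataFrame({
    "surface_covered": surface_covered,
    "surface_total": surface_covered + rng.uniform(0, 100, n),
    "rooms": np.round(surface_covered / 40 + rng.normal(0, 0.5, n)).clip(1),
    "bathrooms": np.round(surface_covered / 100 + rng.normal(0, 0.3, n)).clip(1),
})
demo["price"] = 1_500 * demo["surface_covered"] + rng.normal(0, 40_000, n)

# Pearson correlation of every numeric feature with price, strongest first.
corr_with_price = demo.corr()["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)
```

Passing the full `demo.corr()` matrix to `seaborn.heatmap` would produce a plot of the same kind as the figure.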

Research Questions

Q1 — Does property type affect price? Yes. Houses and larger residential types sell for clearly more than apartments, so property type is worth keeping as a feature.

Average Price by Property Type

Q2 — Which provinces are most expensive? A handful of provinces sit far above the rest, confirming that location is a major price driver.

Top 10 Most Expensive Provinces
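
Q1 and Q2 both come down to a grouped mean of price. A minimal sketch on made-up rows (the category and province names below are placeholders, not values from the dataset):

```python
import pandas as pd

# Made-up rows for illustration; real listings come from the Kaggle file.
demo = pd.DataFrame({
    "property_type": ["house", "house", "apartment", "apartment", "office"],
    "l2": ["Province A", "Province B", "Province A", "Province B", "Province A"],
    "price": [250_000, 150_000, 120_000, 80_000, 140_000],
})

# Q1: average price per property type, most expensive first.
avg_by_type = demo.groupby("property_type")["price"].mean().sort_values(ascending=False)

# Q2: the (up to) ten provinces with the highest average price.
top_provinces = demo.groupby("l2")["price"].mean().nlargest(10)

print(avg_by_type)
print(top_provinces)
```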

Q3 — Do bigger properties cost more? Generally yes — price rises with covered surface — but the wide scatter shows size alone doesn't tell the whole story.

Price vs Covered Surface

Baseline Regression

A plain Linear Regression on all features (80/20 split, seed 42) sets the benchmark.

Metric       Score
MAE (USD)    88,316.89
RMSE (USD)   135,240.01
R²           0.381

Takeaway: the model explains only ~38% of the variance. There is real signal, but a linear fit leaves most of it unexplained, which suggests the relationship is non-linear and a straight line isn't enough.
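
For reference, the evaluation recipe (80/20 split, seed 42, then MAE, RMSE and R²) looks roughly like this in scikit-learn. The data here is synthetic and the printed numbers will not match the table; the sketch only shows how the three metrics are obtained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a mildly non-linear target (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(20, 400, size=(1000, 2))
y = 50_000 + 40 * X[:, 0] ** 1.5 + 200 * X[:, 1] + rng.normal(0, 30_000, 1000)

# 80/20 split with a fixed seed, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE is the square root of MSE
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")
```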

Baseline Feature Coefficients

The predicted-vs-actual plot shows the model regressing toward the middle and missing the expensive properties.

Baseline Predicted vs Actual Price

Feature Engineering & Clustering

I added three features — surface_per_room, extra_surface, and total_facilities (rooms + bathrooms) — to give the models more to work with.
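
A possible construction of the three features is sketched below. Only `total_facilities` is defined in the text; the formulas for `surface_per_room` and `extra_surface` are assumptions inferred from their names, and the helper name `add_engineered_features` is hypothetical.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three engineered columns added."""
    out = df.copy()
    # Assumed formula: total surface per room (zero rooms treated as missing).
    out["surface_per_room"] = out["surface_total"] / out["rooms"].replace(0, np.nan)
    # Assumed formula: uncovered area, i.e. total minus covered surface.
    out["extra_surface"] = out["surface_total"] - out["surface_covered"]
    # Stated in the text: rooms + bathrooms.
    out["total_facilities"] = out["rooms"] + out["bathrooms"]
    return out

# Two made-up listings for illustration.
demo = pd.DataFrame({
    "surface_total": [100.0, 60.0],
    "surface_covered": [80.0, 60.0],
    "rooms": [4, 2],
    "bathrooms": [2, 1],
})
features = add_engineered_features(demo)
print(features)
```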

I also ran K-Means on the scaled numeric features (excluding price, to avoid leakage). The elbow curve pointed to k = 4.

Elbow Method

Viewed in 2D with PCA, the four clusters separate cleanly by size and value — from small/cheap up to large/premium. The cluster label was added as a new feature.

KMeans Clusters via PCA
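
The clustering step can be sketched as: scale the numeric features, scan k for the elbow, fit k = 4, and project to 2D with PCA. The data below is synthetic (four made-up groups), and `n_init=10` and `random_state=42` are assumptions, not settings confirmed by this README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic numeric features (surface, rooms, bathrooms); price is left out on purpose.
centers = np.array([[40, 1, 1], [90, 3, 1], [180, 4, 2], [350, 6, 3]], dtype=float)
X_num = np.vstack([c + rng.normal(0, [8, 0.4, 0.3], size=(150, 3)) for c in centers])

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X_num)

# Elbow scan: within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_

# Final model with k = 4; the labels become a new feature column.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
cluster_label = kmeans.labels_

# 2D projection used only for plotting the clusters.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)
print({k: round(v, 1) for k, v in inertias.items()})
```

Plotting `inertias` against k gives the elbow curve, and a scatter of `coords_2d` colored by `cluster_label` gives the PCA view.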

Model Competition (Regression)

Three models on the engineered dataset, same split:

Model                            MAE (USD)   RMSE (USD)   R²
Linear Regression (Engineered)   88,273      134,673      0.386
Decision Tree                    78,380      131,767      0.412
Random Forest                    67,144      110,818      0.584

Model Comparison — R² Score

Winner: Random Forest. The engineered features barely moved Linear Regression, but the tree models handled the non-linearity well, and Random Forest jumped to R² ≈ 0.585. Its feature importances confirm what EDA hinted at: surface, facilities, type, and province drive price.

Regression Feature Importance
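
A comparison of this kind is a loop over candidate models on one shared split. The sketch below runs on synthetic non-linear data, so its scores are unrelated to the table above; the hyperparameters (`n_estimators=50`, `random_state=42`) are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with non-linear structure (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(1500, 3))
y = 20 * np.sin(X[:, 0]) + X[:, 1] ** 2 + 5 * (X[:, 2] > 5) + rng.normal(0, 2, 1500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit every model on the same training split and score on the same test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores, "->", best)

# Impurity-based feature importances of the fitted forest (they sum to 1).
importances = models["Random Forest"].feature_importances_
```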

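import pickle

# Load the saved RandomForestRegressor (scikit-learn must be installed; only unpickle files you trust).
with open("regression_model.pkl", "rb") as f:
    model = pickle.load(f)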

Regression-to-Classification

I reframed price as a 3-class problem using equal terciles learned from the training set only (so nothing leaks from the test set):

  • Cheap — bottom third
  • Medium — middle third
  • Expensive — top third

Thresholds (USD): roughly 100,000 and 207,000. Because the bands are terciles, the classes come out balanced (~33% each).

Class Distribution
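
Learning the tercile thresholds on the training prices and reusing them for the test prices can be sketched like this. The prices are synthetic (a clipped log-normal draw), so the cut points will differ from the thresholds quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices standing in for the real targets.
rng = np.random.default_rng(42)
y_train = pd.Series(rng.lognormal(mean=11.9, sigma=0.7, size=3000)).clip(10_000, 1_000_000)
y_test = pd.Series(rng.lognormal(mean=11.9, sigma=0.7, size=750)).clip(10_000, 1_000_000)

# Thresholds come from the training set only, so the test set leaks nothing.
low_cut, high_cut = y_train.quantile([1 / 3, 2 / 3]).to_numpy()

bins = [-np.inf, low_cut, high_cut, np.inf]
labels = ["Cheap", "Medium", "Expensive"]

# The same bins are applied to both splits.
band_train = pd.cut(y_train, bins=bins, labels=labels)
band_test = pd.cut(y_test, bins=bins, labels=labels)

shares = band_train.value_counts(normalize=True)
print(f"thresholds: {low_cut:,.0f} / {high_cut:,.0f}")
print(shares)
```

The training shares come out at about one third each by construction; the test shares are only approximately balanced because the thresholds were not fitted on them.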

Why this split: equal bands give a clean, balanced low/mid/high question. In a pricing tool, over-labelling a property's band (a false positive) can lead a buyer to overpay, so the Expensive class's false positives get extra attention.

Classification & Final Evaluation

Three classifiers on the same engineered features (Logistic Regression and KNN on standardized inputs, Random Forest on unscaled inputs):

Model                    Accuracy   Recall (macro)   F1 (macro)
Logistic Regression      0.640      0.64             0.64
KNN                      0.691      0.69             0.69
Random Forest (winner)   0.693      0.69             0.69

Winner: Random Forest (accuracy ≈ 0.693). The confusion matrix shows nearly all the mistakes happen between neighbouring bands (Cheap↔Medium, Medium↔Expensive) — Cheap and Expensive are rarely confused. In short, the model reliably separates the high and low ends, and only struggles with properties sitting right on a price threshold.

Classification Confusion Matrix
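
The setup above (scaling inside a pipeline for Logistic Regression and KNN, raw inputs for Random Forest, then macro metrics and a confusion matrix) can be sketched as follows. Data and labels are synthetic and the hyperparameters are assumptions, so the scores are not comparable to the table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and three ordered classes: 0=Cheap, 1=Medium, 2=Expensive.
rng = np.random.default_rng(42)
X = rng.uniform(20, 400, size=(1500, 3))
latent = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 40, 1500)
y = np.digitize(latent, np.quantile(latent, [1 / 3, 2 / 3]))  # label generation for the demo only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

classifiers = {
    # Pipelines fit the scaler on the training split only.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    # Tree ensembles do not need scaling.
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, clf in classifiers.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall_macro": recall_score(y_test, pred, average="macro"),
        "f1_macro": f1_score(y_test, pred, average="macro"),
    }

# Confusion matrix for the forest: rows are true bands, columns are predicted bands.
cm = confusion_matrix(y_test, classifiers["Random Forest"].predict(X_test), labels=[0, 1, 2])
adjacent_errors = cm[0, 1] + cm[1, 0] + cm[1, 2] + cm[2, 1]  # neighbouring bands
far_errors = cm[0, 2] + cm[2, 0]                              # Cheap <-> Expensive
print(results)
print(cm, adjacent_errors, far_errors)
```

Column 2 of `cm` minus its diagonal entry counts the Expensive false positives, the error type singled out earlier as most costly for a buyer.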

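import pickle

# Load the saved RandomForestClassifier (scikit-learn must be installed; only unpickle files you trust).
with open("classification_model.pkl", "rb") as f:
    model = pickle.load(f)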

Summary

A linear baseline got us to R² ≈ 0.38, but the real gains came from feature engineering plus a Random Forest, which reached R² ≈ 0.585 on the regression task. Reframing price into three balanced bands and training a Random Forest classifier hit ≈ 0.693 accuracy, with nearly all errors falling between adjacent bands. Across both tasks the same drivers stand out: surface area, facilities, property type, and province.

Limitations

  • Only the Argentina file was used (200,000 sampled rows), so results may not generalize to all five countries.
  • Many size/room values were missing and filled with medians, which flattens some variation.
  • Prices are listing (asking) prices, not final sale prices.
  • These are statistical associations, not proven causes of price.

Notebook & Libraries

The full cleaning, EDA, feature engineering, and modeling code is in the .ipynb file in this repo. Built with numpy, pandas, seaborn, matplotlib, and scikit-learn (Linear/Logistic Regression, Decision Tree, Random Forest, KNN, K-Means, PCA, StandardScaler, metrics), with models saved via pickle.


Avihay Amor | 2026
