
	---
	license: mit
	language:
	- en
	tags:
	- tabular-regression
	- tabular-classification
	- scikit-learn
	- random-forest
	- linear-regression
	- kmeans-clustering
	- feature-engineering
	- supervised-learning
	- predictive-modeling
	- data-science
	- real-estate
	- property-prices
	- south-america
	pipeline_tag: tabular-regression
	---

	# South America Property — Price Prediction (Regression, Clustering & Classification)

	<video src="https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/video.mp4" controls="controls" style="max-width: 720px;"></video>

	## Presentation

	Watch the project overview:

	<video src="https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/DS%20-%20assignment%202%20(Avihay%20Amor).mp4" controls="controls" style="max-width: 720px;"></video>

	## Quick Links

	- Analysis Notebook: the `.ipynb` file in this repository
	- Regression Model: `regression_model.pkl` (RandomForestRegressor)
	- Classification Model: `classification_model.pkl` (RandomForestClassifier)
	- Dataset: Property Listings for 5 South American Countries (Kaggle)

	## Project Overview

	This project predicts property prices in South America (Argentina) from a listing's size, rooms, type, and location. It runs the full pipeline: cleaning and EDA, feature engineering with a K-Means cluster feature, a regression model for the price in USD, and a classification model for price bands.

	## Research Question

	How do a property's size, room count, type, and location determine its price (USD) in South America?

	The work is split into three parts: EDA to see which features matter most, regression to predict the actual price, and classification to sort properties into Cheap / Medium / Expensive bands.

	## Project Workflow

	```
	Raw Dataset (Argentina listings, 193,173 rows after filtering × 25 features)
	↓
	Part 1: Cleaning & EDA
	- Keep only properties for sale (Venta), priced in USD
	- Keep only properties located in Argentina (drop stray non-Argentine rows, e.g. Rio de Janeiro)
	- Drop missing prices, remove duplicates, cap price to 10k–1M USD
	- Correlation analysis + 3 research questions
	↓
	Part 2: Baseline Regression (Linear Regression, R² ≈ 0.38)
	↓
	Part 3: Feature Engineering + K-Means cluster feature (k = 4)
	↓
	Part 4: Model Competition → Random Forest wins (R² ≈ 0.58)
	↓
	Part 5: Price binned into 3 equal bands (Cheap / Medium / Expensive)
	↓
	Part 6: Classification → Random Forest wins (accuracy ≈ 0.693)
	↓
	Two saved models: regression + classification
	```

	## Dataset

	The data is the Property Listings for 5 South American Countries dataset from Kaggle. I used the Argentina file (the largest, ~1M rows) and kept only the first 200,000 rows to keep Colab fast; this is a head slice of the file, not a random sample. The target is `price` in USD, and the predictors are a mix of numeric and categorical features.

	### Key Features Used

	| Feature | Description |
	|---|---|
	| `surface_total` | Total surface area (m²) |
	| `surface_covered` | Covered surface area (m²) |
	| `rooms` | Number of rooms |
	| `bathrooms` | Number of bathrooms |
	| `property_type` | Type of property (house, apartment, etc.) |
	| `l2` | Province / region (location) |
	| `price` | Target: sale price in USD |

	To keep prices comparable I kept only properties for sale priced in USD, kept only listings located in Argentina (dropping a few thousand stray non-Argentine rows such as Rio de Janeiro), dropped rows without a price, removed duplicates, and capped price to the 10,000–1,000,000 USD range. Missing numeric values were filled with the column median.
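
The cleaning steps above can be sketched roughly as follows. This is a minimal sketch on a toy frame, not the notebook's actual code: the column names `operation_type`, `currency`, and `l1` are assumptions about the Kaggle file, and "capped" is read here as keeping only rows inside the price range (clipping values would be the other reading).

```python
import pandas as pd

NUMERIC_COLS = ["surface_total", "surface_covered", "rooms", "bathrooms"]

def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a raw listings frame."""
    keep = (
        (df["operation_type"] == "Venta")  # for-sale listings (assumed column name)
        & (df["currency"] == "USD")        # USD-priced listings (assumed column name)
        & (df["l1"] == "Argentina")        # country column (assumed column name)
    )
    out = df[keep].dropna(subset=["price"]).drop_duplicates()
    # "Capped" is read here as keeping rows inside the range, not clipping values.
    out = out[out["price"].between(10_000, 1_000_000)].copy()
    # Fill remaining numeric gaps with each column's median.
    out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(out[NUMERIC_COLS].median())
    return out.reset_index(drop=True)

# Toy rows: one rental, one ARS price, one non-Argentine row, one missing
# price, one price below the range, one missing room count, one duplicate.
raw = pd.DataFrame({
    "operation_type": ["Venta", "Alquiler", "Venta", "Venta", "Venta", "Venta", "Venta", "Venta"],
    "currency": ["USD", "USD", "ARS", "USD", "USD", "USD", "USD", "USD"],
    "l1": ["Argentina", "Argentina", "Argentina", "Brasil",
           "Argentina", "Argentina", "Argentina", "Argentina"],
    "price": [120_000, 900, 5_000_000, 200_000, None, 5_000, 250_000, 120_000],
    "surface_total": [80.0, 50.0, 60.0, 100.0, 70.0, 30.0, 120.0, 80.0],
    "surface_covered": [70.0, 45.0, 55.0, 90.0, 65.0, 28.0, 100.0, 70.0],
    "rooms": [3.0, 2.0, 2.0, 4.0, 3.0, 1.0, None, 3.0],
    "bathrooms": [1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0],
})

cleaned = clean_listings(raw)
print(cleaned[["price", "rooms"]])
```

Only two of the eight toy rows survive, and the missing room count is filled with the median of the surviving rows. Note that this sketch computes the medians on the full cleaned frame; computing them on the training split only would be the stricter choice.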

	## Exploratory Data Analysis

	Price after outlier removal. After capping to 10k–1M USD, the price range is clean enough to model — but the boxplot shows it stays heavily skewed toward the lower end.

	![Boxplot of Property Price](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/price_boxplot.png)

	Price distribution. Most properties are cheap, with a long tail of expensive ones. This skew is exactly why the price bands later use equal terciles rather than a simple cutoff.

	![Distribution of Property Prices](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/price_distribution.png)

	Correlation heatmap. In the raw data, bathrooms and rooms have the strongest link to price (0.64 and 0.47), while the surface columns barely correlate (around 0.05). That low number is a warning sign that the surface columns still hold extreme outliers — so size looks weak here even though it should matter. The tree models later pick surface back up once they handle the non-linear patterns.

	![Correlation Heatmap](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/correlation_heatmap.png)
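
To illustrate why a near-zero correlation can point to outliers rather than to a weak feature, here is a small sketch on synthetic data (not the project's data). The numbers, including the 1,000,000 m² "typo" values and the 10,000 m² plausibility cutoff, are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic listings where price really does rise with surface.
surface = rng.uniform(30, 300, size=500)
price = 1_000 * surface + rng.normal(0, 20_000, size=500)
r_clean = np.corrcoef(surface, price)[0, 1]

# Corrupt five rows with data-entry style extremes.
surface_dirty = surface.copy()
surface_dirty[:5] = 1_000_000
r_dirty = np.corrcoef(surface_dirty, price)[0, 1]

# Drop implausible sizes and the relationship reappears.
plausible = surface_dirty < 10_000
r_filtered = np.corrcoef(surface_dirty[plausible], price[plausible])[0, 1]

print(f"clean: {r_clean:.2f}  with outliers: {r_dirty:.2f}  filtered: {r_filtered:.2f}")
```

Five bad rows out of 500 are enough to flatten Pearson's r, which is consistent with the surface columns looking weak in the heatmap even though size matters.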

	### Research Questions

	Q1 — Does property type affect price? Yes. Houses and larger residential types sell for clearly more than apartments, so property type is worth keeping as a feature.

	![Average Price by Property Type](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q1_price_by_type.png)

	Q2 — Which provinces are most expensive? A handful of provinces sit far above the rest, confirming that location is a major price driver.

	![Top 10 Most Expensive Provinces](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q2_top10_provinces.png)

	Q3 — Do bigger properties cost more? Generally yes — price rises with covered surface — but the wide scatter shows size alone doesn't tell the whole story.

	![Price vs Covered Surface](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q3_price_vs_surface.png)

	## Baseline Regression

	A plain Linear Regression on all features (80/20 split, seed 42) sets the benchmark.

	| Metric | Score |
	|---|---|
	| MAE | 88,316.89 |
	| RMSE | 135,240.01 |
	| R² | 0.381 |

	Takeaway: the model explains only ~38% of the variance. There is real signal, but a linear fit leaves most of it unexplained, which suggests non-linear structure that the tree models pick up later.

	![Baseline Feature Coefficients](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/baseline_coefficients.png)

	The predicted-vs-actual plot shows the model regressing toward the middle and missing the expensive properties.

	![Baseline Predicted vs Actual Price](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/baseline_pred_vs_actual.png)
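
For reference, a baseline like the one above can be set up along these lines. This is a sketch on synthetic stand-in data, so its scores will not match the table; only the 80/20 split, the seed of 42, and the three metrics come from this README.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features (surface, rooms) and price.
rng = np.random.default_rng(42)
n = 2_000
X = np.column_stack([rng.uniform(30, 300, n), rng.integers(1, 7, n)])
y = 50_000 + 800 * X[:, 0] + 15_000 * X[:, 1] + rng.normal(0, 40_000, n)

# 80/20 split with a fixed seed, as in the baseline above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE is the root of MSE
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")
```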

	## Feature Engineering & Clustering

	I added three features — `surface_per_room`, `extra_surface`, and `total_facilities` (rooms + bathrooms) — to give the models more to work with.
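
A sketch of how these three columns might be built. Only `total_facilities` is defined in this README; the formulas for `surface_per_room` (total surface divided by rooms) and `extra_surface` (total minus covered surface) are my assumptions based on the names, and the helper name `add_engineered_features` is hypothetical.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three engineered columns added."""
    out = df.copy()
    # Avoid division by zero: a room count of 0 yields NaN instead of inf.
    rooms = out["rooms"].replace(0, np.nan)
    out["surface_per_room"] = out["surface_total"] / rooms                # assumed formula
    out["extra_surface"] = out["surface_total"] - out["surface_covered"]  # assumed formula
    out["total_facilities"] = out["rooms"] + out["bathrooms"]             # as stated above
    return out

toy = pd.DataFrame({
    "surface_total": [100.0, 60.0],
    "surface_covered": [80.0, 60.0],
    "rooms": [4, 0],
    "bathrooms": [2, 1],
})
features = add_engineered_features(toy)
print(features)
```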

	I also ran K-Means on the scaled numeric features (excluding price, to avoid leakage). The elbow curve pointed to k = 4.

	![Elbow Method](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/kmeans_elbow.png)

	Viewed in 2D with PCA, the four clusters separate cleanly by size and value — from small/cheap up to large/premium. The cluster label was added as a new feature.

	![KMeans Clusters via PCA](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/kmeans_pca_clusters.png)
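
The clustering step can be sketched like this, using synthetic data with four built-in groups. The data, the `n_init=10` setting, and the range of k values are illustrative choices, not the notebook's; the scaling, the k = 4 fit, and the 2D PCA view follow the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features (surface, rooms, bathrooms) in four groups.
# Price is deliberately left out, mirroring the leakage note above.
rng = np.random.default_rng(42)
centers = np.array([[40, 1, 1], [80, 3, 1], [150, 4, 2], [300, 6, 3]])
X = np.vstack([c + rng.normal(0, [5, 0.3, 0.2], size=(100, 3)) for c in centers])

# K-Means is distance-based, so the features are standardised first.
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: inertia (within-cluster sum of squares) for each k.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
    for k in range(1, 8)
}

# Fit the chosen k and keep the label as a new feature.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
cluster_feature = kmeans.labels_

# Project to 2D for plotting only; the clustering itself used all features.
coords_2d = PCA(n_components=2, random_state=42).fit_transform(X_scaled)
print(inertias)
```

Plotting `inertias` against k gives the elbow curve, and a scatter of `coords_2d` coloured by `cluster_feature` gives the PCA view.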

	## Model Competition (Regression)

	Three models on the engineered dataset, same split:

	| Model | MAE | RMSE | R² |
	|---|---|---|---|
	| Linear Regression (Engineered) | 88,273 | 134,673 | 0.386 |
	| Decision Tree | 78,380 | 131,767 | 0.412 |
	| Random Forest | 67,144 | 110,818 | 0.584 |

	![Model Comparison — R² Score](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/model_comparison_r2.png)

	Winner: Random Forest. The engineered features barely moved Linear Regression (R² 0.381 → 0.386), and a single Decision Tree helped only modestly (0.412), but Random Forest reached R² ≈ 0.58. Its feature importances confirm what EDA hinted at: surface, facilities, type, and province drive price.

	![Regression Feature Importance](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/regression_feature_importance.png)
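
A comparison like the one in the table can be run with a short loop. The sketch below uses a deliberately non-linear synthetic target, so its scores are not the project's; hyperparameters such as `n_estimators=50` are placeholders rather than the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, deliberately non-linear relationship between surface and price.
rng = np.random.default_rng(42)
n = 3_000
surface = rng.uniform(30, 300, n)
price = 100_000 + 200_000 * np.sin(surface / 40) + rng.normal(0, 20_000, n)
X = surface.reshape(-1, 1)

# Same split for every model, so the scores are comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        "R2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name:18s} R2={scores['R2']:.3f}")
```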

	```python
	import pickle

	# Load the saved RandomForestRegressor (only unpickle files you trust)
	with open("regression_model.pkl", "rb") as f:
	    model = pickle.load(f)
	```

	## Regression-to-Classification

	I reframed price as a 3-class problem using equal terciles learned from the training set only (so nothing leaks from the test set):

	- Cheap: bottom third
	- Medium: middle third
	- Expensive: top third

	Thresholds (USD): roughly 100,000 and 207,000. Because the bands are terciles, the classes come out balanced (~33% each).

	![Class Distribution](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/class_distribution.png)

	Why this split: equal bands give a clean, balanced low/mid/high question. In a pricing tool, over-labelling a property's band (a false positive) can lead a buyer to overpay, so the Expensive class's false positives get extra attention.

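
The tercile binning can be sketched as below: the two thresholds are learned from training prices only and then applied unchanged to any other prices. The helper names `fit_tercile_thresholds` and `to_band` are hypothetical, and the toy prices are made up, so the thresholds differ from the ones quoted above.

```python
import numpy as np

LABELS = np.array(["Cheap", "Medium", "Expensive"])

def fit_tercile_thresholds(train_prices):
    """Return the 1/3 and 2/3 quantiles of the training prices."""
    return np.quantile(train_prices, [1 / 3, 2 / 3])

def to_band(prices, thresholds):
    """Map prices to bands using thresholds learned on the training set."""
    # right=True: price <= t1 is Cheap, t1 < price <= t2 is Medium, else Expensive.
    idx = np.digitize(prices, thresholds, right=True)
    return LABELS[idx]

# Toy training prices: 10,000 to 300,000 in steps of 10,000.
train_prices = np.arange(10_000, 310_000, 10_000)
thresholds = fit_tercile_thresholds(train_prices)

train_bands = to_band(train_prices, thresholds)
test_bands = to_band(np.array([50_000, 150_000, 250_000]), thresholds)

print(thresholds)
print(test_bands)
```

Because the cut points are quantiles of the training prices, the training classes come out balanced; the test classes are only approximately balanced, since they reuse the training thresholds.
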
	## Classification & Final Evaluation

	Three classifiers on the same engineered features (Logistic Regression and KNN on scaled inputs, Random Forest on raw):

	| Model | Accuracy | Recall (macro) | F1 (macro) |
	|---|---|---|---|
	| Logistic Regression | 0.640 | 0.64 | 0.64 |
	| KNN | 0.691 | 0.69 | 0.69 |
	| Random Forest (winner) | 0.693 | 0.69 | 0.69 |

	Winner: Random Forest (accuracy ≈ 0.693). The confusion matrix shows nearly all the mistakes happen between neighbouring bands (Cheap↔Medium, Medium↔Expensive) — Cheap and Expensive are rarely confused. In short, the model reliably separates the high and low ends, and only struggles with properties sitting right on a price threshold.

	![Classification Confusion Matrix](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/classification_confusion_matrix%282%29.png)
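
The comparison and the confusion-matrix reading above can be reproduced in outline as follows. This is a sketch on synthetic data with placeholder hyperparameters, so the scores will differ from the table; it does follow the setup described here, with Logistic Regression and KNN on scaled inputs and Random Forest on raw inputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (surface, rooms) and prices.
rng = np.random.default_rng(42)
n = 1_500
X = np.column_stack([rng.uniform(30, 300, n), rng.integers(1, 7, n)])
price = 500 * X[:, 0] + 10_000 * X[:, 1] + rng.normal(0, 25_000, n)

X_train, X_test, p_train, p_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Tercile thresholds come from the training prices only.
thresholds = np.quantile(p_train, [1 / 3, 2 / 3])
y_train = np.digitize(p_train, thresholds, right=True)  # 0=Cheap, 1=Medium, 2=Expensive
y_test = np.digitize(p_test, thresholds, right=True)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores, preds = {}, {}
for name, model in models.items():
    preds[name] = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, preds[name]),
        "recall_macro": recall_score(y_test, preds[name], average="macro"),
        "f1_macro": f1_score(y_test, preds[name], average="macro"),
    }

# Rows are true bands, columns are predicted bands.
cm = confusion_matrix(y_test, preds["Random Forest"], labels=[0, 1, 2])
i, j = np.indices(cm.shape)
errors = cm.sum() - np.trace(cm)
adjacent = cm[np.abs(i - j) == 1].sum()  # Cheap<->Medium, Medium<->Expensive
far = cm[np.abs(i - j) == 2].sum()       # Cheap<->Expensive
expensive_false_positives = cm[:2, 2].sum()  # predicted Expensive, actually cheaper

print(scores)
print(f"errors={errors}  adjacent={adjacent}  far={far}")
```

Splitting the errors into `adjacent` and `far` is one way to put a number on the "neighbouring bands" observation, and `expensive_false_positives` isolates the over-labelling case called out earlier.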

	```python
	import pickle

	# Load the saved RandomForestClassifier (only unpickle files you trust)
	with open("classification_model.pkl", "rb") as f:
	    model = pickle.load(f)
	```

	## Summary

	A linear baseline reached R² ≈ 0.38. Feature engineering alone barely moved it; the real gain came from switching to a Random Forest, which reached R² ≈ 0.58 on the regression task. Reframing price into three balanced bands and training a Random Forest classifier gave ≈ 0.693 accuracy, with nearly all errors falling between adjacent bands. Across both tasks the same drivers stand out: surface area, facilities, property type, and province.

	## Limitations

	- Only the Argentina file was used (200,000 sampled rows, 193,173 after filtering to Argentina), so results may not generalize to all five countries.
	- Many size/room values were missing and filled with medians, which flattens some variation.
	- Prices are listing (asking) prices, not final sale prices.
	- These are statistical associations, not proven causes of price.

	## Notebook & Libraries

	The full cleaning, EDA, feature engineering, and modeling code is in the `.ipynb` file in this repo. Built with `numpy`, `pandas`, `seaborn`, `matplotlib`, and `scikit-learn` (Linear/Logistic Regression, Decision Tree, Random Forest, KNN, K-Means, PCA, StandardScaler, metrics), with models saved via `pickle`.

	---

	Avihay Amor | 2026