---
license: mit
language:
- en
tags:
- tabular-regression
- tabular-classification
- scikit-learn
- random-forest
- linear-regression
- kmeans-clustering
- feature-engineering
- supervised-learning
- predictive-modeling
- data-science
- real-estate
- property-prices
- south-america
pipeline_tag: tabular-regression
---

# South America Property: Price Prediction (Regression, Clustering & Classification)

## Presentation

Watch the project overview:

## Quick Links

- **Analysis Notebook**: the `.ipynb` file in this repository
- **Regression Model**: `regression_model.pkl` (RandomForestRegressor)
- **Classification Model**: `classification_model.pkl` (RandomForestClassifier)
- **Dataset**: Property Listings for 5 South American Countries (Kaggle)

## Project Overview

This project predicts property prices in South America from a listing's size, rooms, type, and location. It runs the full pipeline: cleaning and EDA, feature engineering with a K-Means cluster feature, a regression model for the price in USD, and a classification model for price bands. The two winning models are saved as pickle files.

## Research Question

**How do a property's size, room count, type, and location determine its price (USD) in South America?**

The work is split into three parts: EDA to see which features matter most, regression to predict the actual price, and classification to sort properties into Cheap / Medium / Expensive bands.
## Project Workflow

```
Raw Dataset (Argentina listings, first 200,000 rows × 25 features)
        ↓
Part 1: Cleaning & EDA
   - Keep only properties for sale (Venta), priced in USD
   - Drop missing prices, remove duplicates, cap price to 10k–1M USD
   - Correlation analysis + 3 research questions
        ↓
Part 2: Baseline Regression (Linear Regression, R² ≈ 0.38)
        ↓
Part 3: Feature Engineering + K-Means cluster feature (k = 4)
        ↓
Part 4: Model Competition → Random Forest wins (R² ≈ 0.58)
        ↓
Part 5: Price binned into 3 equal-sized bands (terciles: Cheap / Medium / Expensive)
        ↓
Part 6: Classification → Random Forest wins (accuracy ≈ 0.693)
        ↓
Two saved models: regression + classification
```

## Dataset

The data is the **Property Listings for 5 South American Countries** dataset from Kaggle. I used the Argentina file (the largest, ~1M rows) and took the first 200,000 rows to keep Colab fast. The target is `price` in USD, with a mix of numeric and categorical features.

### Key Features Used

| Feature | Description |
|---|---|
| `surface_total` | Total surface area (m²) |
| `surface_covered` | Covered surface area (m²) |
| `rooms` | Number of rooms |
| `bathrooms` | Number of bathrooms |
| `property_type` | Type of property (house, apartment, etc.) |
| `l2` | Province / region (location) |
| `price` | **Target**: sale price in USD |

To keep prices comparable I kept only properties for sale priced in USD, dropped rows without a price, removed duplicates, and capped price to the 10,000–1,000,000 USD range. Missing numeric values were filled with the column median.

## Exploratory Data Analysis

**Price after outlier removal.** After capping to 10k–1M USD, the price range is clean enough to model, but the boxplot shows that prices remain heavily concentrated at the lower end.

![Boxplot of Property Price](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/price_boxplot.png)

**Price distribution.** Most properties are cheap, with a long tail of expensive ones.
This skew is exactly why the price bands later use terciles rather than a simple cutoff.

![Distribution of Property Prices](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/price_distribution.png)

**Correlation heatmap.** Size is king: `surface_covered` and `surface_total` correlate most strongly with price, with rooms and bathrooms close behind.

![Correlation Heatmap](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/correlation_heatmap.png)

### Research Questions

**Q1: Does property type affect price?** Yes. Houses and larger residential types sell for clearly more than apartments, so property type is worth keeping as a feature.

![Average Price by Property Type](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q1_price_by_type.png)

**Q2: Which provinces are most expensive?** A handful of provinces sit far above the rest, confirming that location is a major price driver.

![Top 10 Most Expensive Provinces](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q2_top10_provinces.png)

**Q3: Do bigger properties cost more?** Generally yes. Price rises with covered surface, but the wide scatter shows that size alone doesn't tell the whole story.

![Price vs Covered Surface](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/q3_price_vs_surface.png)

## Baseline Regression

A plain Linear Regression on all features (80/20 split, seed 42) sets the benchmark.

| Metric | Score |
|---|---|
| MAE (USD) | 88,316.89 |
| RMSE (USD) | 135,240.01 |
| R² | 0.381 |

**Takeaway:** the model explains only ~38% of the variance. There is real signal, but a linear fit leaves most of it unexplained, which suggests the relationship is non-linear and a straight line isn't enough.
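The baseline setup described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the data frame is synthetic stand-in data and the three feature columns are an assumption chosen to keep the example small. Only the 80/20 split, the seed of 42, and the three metrics come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned listings (NOT the real dataset).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "surface_covered": rng.uniform(30, 300, n),
    "rooms": rng.integers(1, 7, n),
    "bathrooms": rng.integers(1, 4, n),
})
df["price"] = (
    1_000 * df["surface_covered"]
    + 5_000 * df["rooms"]
    + rng.normal(0, 20_000, n)
)

X = df.drop(columns="price")
y = df["price"]

# 80/20 split with seed 42, as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

# The same three metrics reported in the baseline table.
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")
```

The numbers printed here describe the synthetic data only and are not comparable to the scores in the table.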
![Baseline Feature Coefficients](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/baseline_coefficients.png)

The predicted-vs-actual plot shows the model regressing toward the middle and missing the expensive properties.

![Baseline Predicted vs Actual Price](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/baseline_pred_vs_actual.png)

## Feature Engineering & Clustering

I added three features, `surface_per_room`, `extra_surface`, and `total_facilities` (rooms + bathrooms), to give the models more to work with.

I also ran K-Means on the scaled numeric features (excluding price, to avoid leakage). The elbow curve pointed to **k = 4**.

![Elbow Method](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/kmeans_elbow.png)

Viewed in 2D with PCA, the four clusters separate cleanly by size and value, from small/cheap up to large/premium. The cluster label was added as a new feature.

![KMeans Clusters via PCA](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/kmeans_pca_clusters.png)

## Model Competition (Regression)

Three models were compared on the engineered dataset, using the same split:

| Model | MAE (USD) | RMSE (USD) | R² |
|---|---|---|---|
| Linear Regression (Engineered) | 88,273 | 134,673 | 0.386 |
| Decision Tree | 78,380 | 131,767 | 0.412 |
| **Random Forest** | **67,144** | **110,818** | **0.584** |

![Model Comparison: R² Score](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/model_comparison_r2.png)

**Winner: Random Forest.** The engineered features barely moved Linear Regression (R² 0.381 to 0.386), while the tree models captured more of the non-linearity: the Decision Tree reached 0.412 and Random Forest jumped to R² ≈ 0.58. Its feature importances confirm what EDA hinted at: surface, facilities, type, and province drive price.
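The engineered features, the cluster feature, and the winning model could be put together along these lines. It is a sketch under stated assumptions: the data is synthetic, the formulas for `surface_per_room` and `extra_surface` are guesses (only `total_facilities` = rooms + bathrooms is defined in the text), the forest's hyperparameters are placeholders, and fitting the scaler and K-Means on the training rows only is this example's choice rather than something the text confirms.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real listings).
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "surface_total": rng.uniform(40, 400, n),
    "rooms": rng.integers(1, 7, n),
    "bathrooms": rng.integers(1, 4, n),
})
df["surface_covered"] = df["surface_total"] * rng.uniform(0.6, 1.0, n)
df["price"] = 900 * df["surface_covered"] + rng.normal(0, 15_000, n)

# Engineered features. The first two formulas are ASSUMED definitions.
df["surface_per_room"] = df["surface_covered"] / df["rooms"]
df["extra_surface"] = df["surface_total"] - df["surface_covered"]
df["total_facilities"] = df["rooms"] + df["bathrooms"]  # defined in the text

X = df.drop(columns="price")
y = df["price"]
num_cols = list(X.columns)  # price is excluded, so it cannot leak into clusters

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Scale, then cluster with k = 4 (the value the elbow curve pointed to).
scaler = StandardScaler().fit(X_train[num_cols])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(scaler.transform(X_train[num_cols]))

# Add the cluster label as an extra feature on both splits.
X_train["cluster"] = kmeans.predict(scaler.transform(X_train[num_cols]))
X_test["cluster"] = kmeans.predict(scaler.transform(X_test[num_cols]))

# Placeholder hyperparameters; the notebook's settings are not given here.
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
r2 = forest.score(X_test, y_test)
print(f"Random Forest R2 on synthetic data: {r2:.3f}")
```

Fitting the scaler and K-Means on the training split keeps test rows out of the cluster centroids, which is the same leakage concern the text raises for price.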
![Regression Feature Importance](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/regression_feature_importance.png)

```python
import pickle

# Load the saved RandomForestRegressor (only unpickle files you trust)
with open("regression_model.pkl", "rb") as f:
    model = pickle.load(f)
```

## Regression-to-Classification

I reframed price as a 3-class problem using **terciles** (three equal-sized groups) whose thresholds are learned from the training set only, so nothing leaks from the test set:

- **Cheap**: bottom third
- **Medium**: middle third
- **Expensive**: top third

Thresholds (USD): roughly **100,000** and **207,000**. Because the bands are terciles, the classes come out balanced (~33% each).

![Class Distribution](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/class_distribution.png)

**Why this split:** equal-sized bands give a clean, balanced low/mid/high question. In a pricing tool, placing a property in a higher band than it belongs to (a false positive for that band) can lead a buyer to overpay, so false positives for the Expensive class get extra attention.

## Classification & Final Evaluation

Three classifiers were trained on the same engineered features (Logistic Regression and KNN on scaled inputs, Random Forest on unscaled inputs):

| Model | Accuracy | Recall (macro) | F1 (macro) |
|---|---|---|---|
| Logistic Regression | 0.640 | 0.64 | 0.64 |
| KNN | 0.691 | 0.69 | 0.69 |
| **Random Forest (winner)** | **0.693** | **0.69** | **0.69** |

**Winner: Random Forest (accuracy ≈ 0.693)**, narrowly ahead of KNN (0.691). The confusion matrix shows nearly all the mistakes happen between neighbouring bands (Cheap↔Medium, Medium↔Expensive); Cheap and Expensive are rarely confused. In short, the model reliably separates the high and low ends and struggles mainly with properties close to a price threshold.
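The tercile banding and the adjacent-band error check can be illustrated with a short sketch. It assumes synthetic data, a two-column feature set, and placeholder forest hyperparameters, none of which come from the notebook. What it does take from the text is that the two thresholds are computed from training prices only and then applied unchanged to the test prices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real listings).
rng = np.random.default_rng(0)
n = 900
X = pd.DataFrame({
    "surface_covered": rng.uniform(30, 300, n),
    "rooms": rng.integers(1, 7, n),
})
price = 1_000 * X["surface_covered"] + rng.normal(0, 25_000, n)

X_train, X_test, p_train, p_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Tercile thresholds learned from the TRAINING prices only.
low, high = np.quantile(p_train, [1 / 3, 2 / 3])

def to_band(prices):
    """Map prices to 0 = Cheap, 1 = Medium, 2 = Expensive."""
    return np.digitize(np.asarray(prices), [low, high])

y_train, y_test = to_band(p_train), to_band(p_test)

# Placeholder hyperparameters; the notebook's settings are not given here.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred, labels=[0, 1, 2])

# How many errors fall between neighbouring bands?
errors = cm.sum() - np.trace(cm)
adjacent = cm[0, 1] + cm[1, 0] + cm[1, 2] + cm[2, 1]
print(f"accuracy={acc:.3f}, adjacent-band errors: {adjacent} of {errors}")
```

The off-diagonal corners `cm[0, 2]` and `cm[2, 0]` are the Cheap↔Expensive confusions; the remaining off-diagonal cells are the neighbouring-band mistakes discussed above.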
![Classification Confusion Matrix](https://huggingface.co/avihayamor/South_America_property_price_prediction/resolve/main/classification_confusion_matrix.png)

```python
import pickle

# Load the saved RandomForestClassifier (only unpickle files you trust)
with open("classification_model.pkl", "rb") as f:
    model = pickle.load(f)
```

## Summary

A linear baseline reached R² ≈ 0.38, but the real gains came from feature engineering plus a Random Forest, which reached R² ≈ 0.58 on the regression task. Reframing price into three balanced bands and training a Random Forest classifier reached ≈ 0.693 accuracy, with errors almost entirely between adjacent bands. Across both tasks the same drivers stand out: **surface area, facilities, property type, and province**.

## Limitations

- Only the Argentina file was used (its first 200,000 rows, not a random sample), so results may not generalize to all five countries.
- Many size/room values were missing and filled with medians, which flattens some variation.
- Prices are listing (asking) prices, not final sale prices.
- These are statistical associations, not proven causes of price.

## Notebook & Libraries

The full cleaning, EDA, feature engineering, and modeling code is in the `.ipynb` file in this repo. Built with `numpy`, `pandas`, `seaborn`, `matplotlib`, and `scikit-learn` (Linear/Logistic Regression, Decision Tree, Random Forest, KNN, K-Means, PCA, StandardScaler, metrics), with models saved via `pickle`.

---

*Avihay Amor | 2026*