--- language: - vi license: mit tags: - sentence-transformers - feature-extraction - sentence-similarity - transformers - embeddings - vietnamese - rental - real-estate library_name: transformers pipeline_tag: feature-extraction --- # BGE-M3 Vietnamese Rental Property Search Fine-tuned projection head for [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for **Vietnamese rental property search** (Phòng trọ). This model adds a lightweight trainable projection head (128 dimensions) on top of the frozen BGE-M3 encoder, trained with **weighted hard negatives** using contrastive learning (InfoNCE loss). ## 🎯 Model Description - **Base Model**: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (frozen) - **Task**: Semantic search for Vietnamese rental properties - **Training Strategy**: Weighted contrastive learning with hard negatives - **Output Dimension**: 128 (projected from 1024) - **Training Data**: 10,384 Vietnamese rental property query-document pairs ## 📊 Performance Evaluated on 96 test examples: | Metric | Score | |--------|-------| | **MRR** | **98.44%** | | **Recall@1** | **96.88%** | | **Recall@5** | **100.00%** | | **Recall@10** | **100.00%** | | **Recall@50** | **100.00%** | ### Interpretation - **98.44% MRR**: On average, the correct match appears at position ~1.02 (nearly always rank 1!) - **96.88% Recall@1**: 93 out of 96 queries find the correct match at the top position - **100% Recall@5+**: All queries find their correct match within top-5 results ## 🚀 Quick Start ### Installation ```bash pip install transformers torch safetensors ``` ### Usage ```python from transformers import AutoModel, AutoTokenizer import torch # Load model model = AutoModel.from_pretrained( "your-username/bge-m3-vietnamese-rental-projection", trust_remote_code=True ) tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3") # Move to GPU if available device = "cuda" if torch.cuda.is_available() else "cpu" model = model.to(device) model.eval() # Encode texts texts = [ "Phòng trọ Quận 10, 25m², giá 5 triệu, WC riêng, máy lạnh", "Cho thuê phòng Bình Thạnh, 20m², 4 triệu/tháng" ] # Method 1: Using encode (recommended) embeddings = model.encode(texts, device=device) # [2, 128] # Method 2: Using forward inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512) inputs = {k: v.to(device) for k, v in inputs.items()} with torch.no_grad(): outputs = model(**inputs) embeddings = outputs.last_hidden_state # [2, 128], L2-normalized print(embeddings.shape) # torch.Size([2, 128]) # Compute similarity (cosine) similarity = embeddings[0] @ embeddings[1] print(f"Similarity: {similarity:.4f}") ``` ### Search Example ```python # Build a search engine class RentalSearchEngine: def __init__(self, model, tokenizer, device="cuda"): self.model = model self.tokenizer = tokenizer self.device = device self.database_embeddings = None self.database_texts = None def index(self, property_descriptions): """Index a database of property descriptions""" self.database_texts = property_descriptions self.database_embeddings = self.model.encode( property_descriptions, device=self.device ) def search(self, query, top_k=5): """Search for most similar properties""" query_emb = self.model.encode([query], device=self.device)[0] # Compute similarities similarities = query_emb @ self.database_embeddings.T # Get top-k top_k = min(top_k, len(similarities)) scores, indices = torch.topk(similarities, k=top_k) results = [] for idx, score in zip(indices.tolist(), scores.tolist()): results.append({ "text": self.database_texts[idx], "score": score }) return results # Example usage engine = RentalSearchEngine(model, tokenizer, device) # Index properties properties = [ "Phòng trọ 25m² Quận 10, WC riêng, máy lạnh, giá 5.5tr/tháng", "Cho thuê phòng 30m² Quận 1, full nội thất, giá 8tr/tháng", "Phòng 20m² Thủ Đức, WC chung, giá 3.5tr/tháng", "Studio 35m² Quận 3, ban công, bếp riêng, giá 9tr/tháng", ] engine.index(properties) # Search results = engine.search("phòng trọ q10 25m2 wc riêng 5tr5", top_k=3) for i, result in enumerate(results, 1): print(f"{i}. [{result['score']:.4f}] {result['text']}") ``` ## 🎓 Training Details ### Dataset - **Size**: 10,384 examples - **Split**: 9,345 train / 1,039 validation - **Format**: Query-positive-hard negatives triplets - **Hard Negatives**: 3 per example, weighted by feature type ### Weighted Hard Negatives Strategy The model uses feature-based weighting for hard negatives: | Feature Type | Weight | Importance | |--------------|--------|------------| | Location (Quận) | 2.5 | Highest | | Price | 2.0 | High | | Area (m²) | 1.8 | Medium | | Amenities | 1.5 | Lower | This teaches the model that location mismatches are more critical than amenity differences. ### Training Configuration ```json { "base_model": "BAAI/bge-m3", "d_out": 128, "freeze_encoder": true, "epochs": 17, "batch_size": 128, "learning_rate": 0.0002, "optimizer": "AdamW", "weight_decay": 0.01, "loss": "Weighted InfoNCE (symmetric)", "temperature": 0.07, "device": "Tesla T4 (Google Colab)", "training_time": "~2.5 hours" } ``` ### Training Progress | Epoch | Train Loss | Val Loss | Status | |-------|------------|----------|--------| | 1 | 2.9054 | 2.4529 | ⭐ Best | | 5 | 2.1609 | 2.0078 | ⭐ Best | | 9 | 2.0237 | 1.8906 | ⭐ Best | | 12 | 1.9722 | 1.8760 | ⭐ Best | | **16** | **1.9297** | **1.8215** | ⭐ **Best** | | 17 | 1.9191 | 1.8276 | Final | **Improvement**: -34% train loss, -26% validation loss ### Model Architecture ``` BAAI/bge-m3 (frozen) ↓ [1024-dim] ProjectionHead ├─ Linear(1024 → 128, bias=False) └─ L2 Normalization ↓ [128-dim, L2-normalized] Output Embeddings ``` **Parameters**: - Trainable: 131,072 (0.02%) - Total: 567,885,824 - Strategy: Only projection head is trainable ## 🎯 Use Cases This model is optimized for: ✅ **Vietnamese rental property search** - Matching user queries to property listings - Finding similar properties - Semantic search for rental accommodations ✅ **Supported features**: - Location (districts, neighborhoods) - Price range - Area/size (m²) - Amenities (WC, máy lạnh, ban công, bếp, etc.) - Room type (phòng trọ, studio, etc.) ## ⚠️ Limitations - **Domain-specific**: Optimized for Vietnamese rental properties only - **Geographic focus**: Primarily trained on properties in Ho Chi Minh City and Hanoi - **Language**: Vietnamese only (not multilingual like base BGE-M3) - **Frozen encoder**: Base BGE-M3 encoder is not fine-tuned, only projection head - **Not for**: General-purpose Vietnamese embeddings or other domains ## 🔍 Example Predictions ### Example 1: Location Sensitivity ``` Query: "phòng trọ Gò Vấp 18m² 3tr5 có wc riêng" Positive (0.947): Gò Vấp 18m² 3tr5 wc riêng ✅ Negative 1 (0.366): Quận 12 18m² 3tr5 wc riêng (wrong district!) Negative 2 (0.411): Gò Vấp 18m² 4tr2 wc riêng (wrong price) Negative 3 (0.828): Gò Vấp 18m² 3tr5 wc chung (wrong amenity) → Model correctly penalizes location mismatch most heavily ``` ### Example 2: Feature Understanding ``` Query: "phòng trọ q10 4tr 20m² có máy lạnh wc riêng gần chợ" Positive (0.904): Q10 20m² 4tr máy lạnh wc riêng ✅ Negative 1 (0.542): Q3 20m² 4tr máy lạnh wc riêng (wrong district) Negative 2 (0.418): Q10 20m² 5.5tr máy lạnh wc riêng (wrong price) Negative 3 (0.257): Q10 15m² 4tr máy lạnh wc chung (multiple diffs) → Strong margin (+0.36) between positive and top negative ``` ## 📖 Citation If you use this model, please cite: ```bibtex @misc{bge-m3-vietnamese-rental, author = {Your Name}, title = {BGE-M3 Vietnamese Rental Property Search}, year = {2025}, publisher = {Hugging Face}, howpublished = {\url{https://huggingface.co/your-username/bge-m3-vietnamese-rental-projection}}, } ``` ## 📜 License MIT License - Free to use for commercial and non-commercial purposes. ## 🙏 Acknowledgments - Base model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) - Framework: [Hugging Face Transformers](https://github.com/huggingface/transformers) - Training: Google Colab (Tesla T4) ## 📧 Contact For questions or feedback, please open an issue on the model repository. --- **Last updated**: October 2025