---
language:
- vi
license: mit
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embeddings
- vietnamese
- rental
- real-estate
library_name: transformers
pipeline_tag: feature-extraction
---

# BGE-M3 Vietnamese Rental Property Search

Fine-tuned projection head for [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), optimized for **Vietnamese rental property search** (*phòng trọ*, rented rooms).

This model adds a lightweight trainable projection head (128 dimensions) on top of the frozen BGE-M3 encoder, trained with **weighted hard negatives** using contrastive learning (InfoNCE loss).

## 🎯 Model Description

- **Base Model**: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (frozen)
- **Task**: Semantic search for Vietnamese rental properties
- **Training Strategy**: Weighted contrastive learning with hard negatives
- **Output Dimension**: 128 (projected from 1024)
- **Training Data**: 10,384 Vietnamese rental property query-document pairs

## 📊 Performance

Evaluated on 96 test examples:

| Metric | Score |
|--------|-------|
| **MRR** | **98.44%** |
| **Recall@1** | **96.88%** |
| **Recall@5** | **100.00%** |
| **Recall@10** | **100.00%** |
| **Recall@50** | **100.00%** |

### Interpretation

- **98.44% MRR**: The reciprocal rank averages 0.9844, an effective rank of about 1.02 (1 / 0.9844). With 96 queries, this corresponds to 93 correct matches at rank 1 and 3 at rank 2.
- **96.88% Recall@1**: 93 of 96 queries return the correct match at the top position.
- **100% Recall@5 and above**: Every query finds its correct match within the top 5 results.
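
The figures above can be checked with a few lines of Python. This is a minimal sketch, not the evaluation script used for this model: the helper names are hypothetical, and the rank list (93 queries at rank 1, 3 at rank 2) is the distribution implied by the reported MRR and Recall@1 rather than the raw evaluation output.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct match is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Assumed rank distribution: 93 queries at rank 1, 3 queries at rank 2.
ranks = [1] * 93 + [2] * 3

mrr = mean_reciprocal_rank(ranks)
print(f"MRR: {mrr:.2%}")
for k in (1, 5, 10, 50):
    print(f"Recall@{k}: {recall_at_k(ranks, k):.2%}")
```

Note that the plain average rank under this distribution is 99 / 96 ≈ 1.03, slightly higher than the 1.02 obtained by inverting the MRR, because MRR averages reciprocals rather than ranks.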

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch safetensors
```

### Usage

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model
model = AutoModel.from_pretrained(
    "your-username/bge-m3-vietnamese-rental-projection",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Encode texts
texts = [
    "Phòng trọ Quận 10, 25m², giá 5 triệu, WC riêng, máy lạnh",
    "Cho thuê phòng Bình Thạnh, 20m², 4 triệu/tháng"
]

# Method 1: Using encode (recommended)
embeddings = model.encode(texts, device=device)  # [2, 128]

# Method 2: Using forward
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # [2, 128], L2-normalized

print(embeddings.shape)  # torch.Size([2, 128])

# Compute similarity (cosine)
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```

### Search Example

```python
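# Build a simple in-memory search engine.
# Reuses `model`, `tokenizer`, and `device` from the Usage example above.
import torch

class RentalSearchEngine:
    def __init__(self, model, tokenizer, device="cpu"):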
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.database_embeddings = None
        self.database_texts = None
    
    def index(self, property_descriptions):
        """Index a database of property descriptions"""
        self.database_texts = property_descriptions
        self.database_embeddings = self.model.encode(
            property_descriptions,
            device=self.device
        )
    
    def search(self, query, top_k=5):
        """Search for most similar properties"""
        query_emb = self.model.encode([query], device=self.device)[0]
        
        # Compute similarities
        similarities = query_emb @ self.database_embeddings.T
        
        # Get top-k
        top_k = min(top_k, len(similarities))
        scores, indices = torch.topk(similarities, k=top_k)
        
        results = []
        for idx, score in zip(indices.tolist(), scores.tolist()):
            results.append({
                "text": self.database_texts[idx],
                "score": score
            })
        
        return results

# Example usage
engine = RentalSearchEngine(model, tokenizer, device)

# Index properties
properties = [
    "Phòng trọ 25m² Quận 10, WC riêng, máy lạnh, giá 5.5tr/tháng",
    "Cho thuê phòng 30m² Quận 1, full nội thất, giá 8tr/tháng",
    "Phòng 20m² Thủ Đức, WC chung, giá 3.5tr/tháng",
    "Studio 35m² Quận 3, ban công, bếp riêng, giá 9tr/tháng",
]
engine.index(properties)

# Search
results = engine.search("phòng trọ q10 25m2 wc riêng 5tr5", top_k=3)

for i, result in enumerate(results, 1):
    print(f"{i}. [{result['score']:.4f}] {result['text']}")
```

## 🎓 Training Details

### Dataset

- **Size**: 10,384 examples
- **Split**: 9,345 train / 1,039 validation
- **Format**: (query, positive, hard negatives) tuples
- **Hard Negatives**: 3 per example, weighted by feature type

### Weighted Hard Negatives Strategy

The model uses feature-based weighting for hard negatives:

| Feature Type | Weight | Importance |
|--------------|--------|------------|
| Location (Quận) | 2.5 | Highest |
| Price | 2.0 | High |
| Area (m²) | 1.8 | Medium |
| Amenities | 1.5 | Lower |

This teaches the model that location mismatches are more critical than amenity differences.
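
This card does not spell out how the weights enter the loss, so the following is only one plausible sketch. It assumes each hard negative's term in the softmax denominator is multiplied by its feature weight, which is equivalent to adding `log(weight)` to that negative's logit. The function name, tensor layout, and the omission of in-batch negatives are all assumptions, and only the query-to-document direction is shown, whereas the training configuration lists a symmetric loss.

```python
import torch
import torch.nn.functional as F

# Weights from the table above; how they enter the loss is an assumption.
FEATURE_WEIGHTS = {"location": 2.5, "price": 2.0, "area": 1.8, "amenities": 1.5}


def weighted_info_nce(q, pos, negs, neg_weights, temperature=0.07):
    """q, pos: [B, D]; negs: [B, K, D]; neg_weights: [B, K].

    All embeddings are expected to be L2-normalized.
    """
    pos_logit = (q * pos).sum(dim=-1, keepdim=True) / temperature   # [B, 1]
    neg_logits = torch.einsum("bd,bkd->bk", q, negs) / temperature  # [B, K]
    # Adding log(w) to a logit multiplies its exp() term by w in the denominator.
    neg_logits = neg_logits + torch.log(neg_weights)
    logits = torch.cat([pos_logit, neg_logits], dim=1)              # [B, 1 + K]
    # The positive always sits at column 0.
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)


# Random stand-in embeddings: batch of 4, 3 hard negatives each, 128 dims.
torch.manual_seed(0)
q = F.normalize(torch.randn(4, 128), dim=-1)
pos = F.normalize(torch.randn(4, 128), dim=-1)
negs = F.normalize(torch.randn(4, 3, 128), dim=-1)
kinds = ["location", "price", "amenities"]
weights = torch.tensor([[FEATURE_WEIGHTS[k] for k in kinds]]).repeat(4, 1)

loss_weighted = weighted_info_nce(q, pos, negs, weights)
loss_uniform = weighted_info_nce(q, pos, negs, torch.ones_like(weights))
print(loss_weighted.item(), loss_uniform.item())
```

Under this formulation, a negative with a larger weight contributes more to the denominator, so the gradient pushes it further from the query than an equally similar negative with a smaller weight.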

### Training Configuration

```json
{
  "base_model": "BAAI/bge-m3",
  "d_out": 128,
  "freeze_encoder": true,
  "epochs": 17,
  "batch_size": 128,
  "learning_rate": 0.0002,
  "optimizer": "AdamW",
  "weight_decay": 0.01,
  "loss": "Weighted InfoNCE (symmetric)",
  "temperature": 0.07,
  "device": "Tesla T4 (Google Colab)",
  "training_time": "~2.5 hours"
}
```

### Training Progress

| Epoch | Train Loss | Val Loss | Status |
|-------|------------|----------|--------|
| 1 | 2.9054 | 2.4529 | ⭐ Best |
| 5 | 2.1609 | 2.0078 | ⭐ Best |
| 9 | 2.0237 | 1.8906 | ⭐ Best |
| 12 | 1.9722 | 1.8760 | ⭐ Best |
| **16** | **1.9297** | **1.8215** | ⭐ **Best** |
| 17 | 1.9191 | 1.8276 | Final |

**Improvement** (epoch 1 to best epoch 16): train loss down 34%, validation loss down 26%. Validation loss rose slightly at epoch 17 (1.8276 vs. 1.8215).
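
The improvement percentages follow directly from the table. A small sketch of the arithmetic, using only the epoch 1, 16, and 17 values listed above (the helper name is hypothetical):

```python
def relative_change(start, end):
    """Relative change from start to end as a fraction (negative = decrease)."""
    return (end - start) / start


# Values copied from the training-progress table.
train = {1: 2.9054, 16: 1.9297, 17: 1.9191}
val = {1: 2.4529, 16: 1.8215, 17: 1.8276}

for name, losses in (("train", train), ("val", val)):
    for epoch in (16, 17):
        change = relative_change(losses[1], losses[epoch])
        print(f"{name} loss, epoch 1 -> {epoch}: {change:+.1%}")
```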

### Model Architecture

```
BAAI/bge-m3 (frozen)
    ↓ [1024-dim]
ProjectionHead
    ├─ Linear(1024 → 128, bias=False)
    └─ L2 Normalization
    ↓ [128-dim, L2-normalized]
Output Embeddings
```

**Parameters**:
- Trainable: 131,072 (1024 × 128 projection weights, about 0.02% of the total)
- Total: 567,885,824
- Strategy: Only projection head is trainable
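
The head in the diagram can be sketched in PyTorch as follows. This is a minimal illustration based on the diagram, not the code shipped with the model: the attribute name `proj` and the random input tensor are assumptions, and the frozen encoder that produces the 1024-dimensional input is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Bias-free linear projection followed by L2 normalization."""

    def __init__(self, d_in: int = 1024, d_out: int = 128):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), p=2, dim=-1)


head = ProjectionHead()
n_trainable = sum(p.numel() for p in head.parameters())
print(n_trainable)  # 1024 * 128 weights, no bias

# Stand-in for encoder outputs (random values, for shape checking only).
dummy = torch.randn(2, 1024)
out = head(dummy)
print(out.shape, out.norm(dim=-1))
```

Because the outputs are unit-length, a dot product between two embeddings equals their cosine similarity, which is why the usage examples compare embeddings with `@` directly.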

## 🎯 Use Cases

This model is optimized for:

✅ **Vietnamese rental property search**
- Matching user queries to property listings
- Finding similar properties
- Semantic search for rental accommodations

✅ **Supported features**:
- Location (districts, neighborhoods)
- Price range
- Area/size (m²)
- Amenities (WC, máy lạnh, ban công, bếp, etc.)
- Room type (phòng trọ, studio, etc.)

## ⚠️ Limitations

- **Domain-specific**: Optimized for Vietnamese rental properties only
- **Geographic focus**: Primarily trained on properties in Ho Chi Minh City and Hanoi
- **Language**: Vietnamese only (unlike the multilingual base BGE-M3)
- **Frozen encoder**: Base BGE-M3 encoder is not fine-tuned, only projection head
- **Not for**: General-purpose Vietnamese embeddings or other domains

## 🔍 Example Predictions

### Example 1: Location Sensitivity

```
Query: "phòng trọ Gò Vấp 18m² 3tr5 có wc riêng"

Positive (0.947):  Gò Vấp 18m² 3tr5 wc riêng ✅
Negative 1 (0.366): Quận 12 18m² 3tr5 wc riêng (wrong district!)
Negative 2 (0.411): Gò Vấp 18m² 4tr2 wc riêng (wrong price)
Negative 3 (0.828): Gò Vấp 18m² 3tr5 wc chung (wrong amenity)

→ Model correctly penalizes location mismatch most heavily
```

### Example 2: Feature Understanding

```
Query: "phòng trọ q10 4tr 20m² có máy lạnh wc riêng gần chợ"

Positive (0.904):  Q10 20m² 4tr máy lạnh wc riêng ✅
Negative 1 (0.542): Q3 20m² 4tr máy lạnh wc riêng (wrong district)
Negative 2 (0.418): Q10 20m² 5.5tr máy lạnh wc riêng (wrong price)
Negative 3 (0.257): Q10 15m² 4tr máy lạnh wc chung (multiple diffs)

→ Strong margin (+0.36) between positive and top negative
```

## 📖 Citation

If you use this model, please cite:

```bibtex
@misc{bge-m3-vietnamese-rental,
  author = {Your Name},
  title = {BGE-M3 Vietnamese Rental Property Search},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/your-username/bge-m3-vietnamese-rental-projection}},
}
```

## 📜 License

MIT License - Free to use for commercial and non-commercial purposes.

## 🙏 Acknowledgments

- Base model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- Framework: [Hugging Face Transformers](https://github.com/huggingface/transformers)
- Training: Google Colab (Tesla T4)

## 📧 Contact

For questions or feedback, please open an issue on the model repository.

---

**Last updated**: October 2025