Upload folder using huggingface_hub
- README.md +313 -0
- UPLOAD_INSTRUCTIONS.md +193 -0
- __pycache__/modeling_bgem3_projection.cpython-310.pyc +0 -0
- config.json +14 -0
- model.safetensors +3 -0
- modeling_bgem3_projection.py +309 -0
- training_info.json +56 -0
README.md
ADDED
@@ -0,0 +1,313 @@
---
language:
- vi
license: mit
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- embeddings
- vietnamese
- rental
- real-estate
library_name: transformers
pipeline_tag: feature-extraction
---

# BGE-M3 Vietnamese Rental Property Search

Fine-tuned projection head for [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for **Vietnamese rental property search** (Phòng trọ).

This model adds a lightweight trainable projection head (128 dimensions) on top of the frozen BGE-M3 encoder, trained with **weighted hard negatives** using contrastive learning (InfoNCE loss).

## 🎯 Model Description

- **Base Model**: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (frozen)
- **Task**: Semantic search for Vietnamese rental properties
- **Training Strategy**: Weighted contrastive learning with hard negatives
- **Output Dimension**: 128 (projected from 1024)
- **Training Data**: 10,384 Vietnamese rental property query-document pairs

## 📊 Performance

Evaluated on 96 test examples:

| Metric | Score |
|--------|-------|
| **MRR** | **98.44%** |
| **Recall@1** | **96.88%** |
| **Recall@5** | **100.00%** |
| **Recall@10** | **100.00%** |
| **Recall@50** | **100.00%** |

### Interpretation

- **98.44% MRR**: The mean reciprocal rank is 0.9844 (1/0.9844 ≈ 1.02), so the correct match is nearly always at rank 1
- **96.88% Recall@1**: 93 out of 96 queries find the correct match at the top position
- **100% Recall@5+**: All queries find their correct match within top-5 results
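
As a sanity check on the figures above, the following minimal sketch recomputes the metrics from a rank list. The per-query ranks are not published here, so the list is an assumption: 93 hits at rank 1 and the 3 remaining queries at rank 2, which is the only top-5 assignment that reproduces the reported 98.44% MRR. The function names are illustrative.

```python
from fractions import Fraction


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (ranks are 1-based)."""
    return sum(Fraction(1, r) for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Share of queries whose correct match is ranked k or better."""
    return Fraction(sum(1 for r in ranks if r <= k), len(ranks))


# Assumed rank list: 93 queries at rank 1, 3 queries at rank 2.
ranks = [1] * 93 + [2] * 3

print(f"MRR      = {float(mean_reciprocal_rank(ranks)):.4%}")  # 98.4375%
print(f"Recall@1 = {float(recall_at_k(ranks, 1)):.4%}")        # 96.8750%
print(f"Recall@5 = {float(recall_at_k(ranks, 5)):.4%}")        # 100.0000%
```

Under this assumption the mean rank is 99/96 ≈ 1.03, slightly higher than the 1/MRR approximation, because MRR averages reciprocal ranks rather than ranks.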

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch safetensors
```

### Usage

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model
model = AutoModel.from_pretrained(
    "your-username/bge-m3-vietnamese-rental-projection",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Encode texts
texts = [
    "Phòng trọ Quận 10, 25m², giá 5 triệu, WC riêng, máy lạnh",
    "Cho thuê phòng Bình Thạnh, 20m², 4 triệu/tháng"
]

# Method 1: Using encode (recommended)
embeddings = model.encode(texts, device=device)  # [2, 128]

# Method 2: Using forward
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # [2, 128], L2-normalized

print(embeddings.shape)  # torch.Size([2, 128])

# Compute similarity (cosine)
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```
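
The usage snippet computes cosine similarity with a plain dot product, which works because the embeddings are L2-normalized. The sketch below illustrates that identity with small hand-picked vectors in pure Python; the helper names and the vectors are illustrative and are not part of the model code.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating-point error (11/15 for these vectors).
print(dot(a_hat, b_hat), cosine(a, b))
```

With unit-length vectors, a matrix product of queries against the indexed embeddings therefore yields cosine scores directly.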

### Search Example

```python
# Build a search engine
class RentalSearchEngine:
    def __init__(self, model, tokenizer, device="cuda"):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.database_embeddings = None
        self.database_texts = None

    def index(self, property_descriptions):
        """Index a database of property descriptions"""
        self.database_texts = property_descriptions
        self.database_embeddings = self.model.encode(
            property_descriptions,
            device=self.device
        )

    def search(self, query, top_k=5):
        """Search for most similar properties"""
        query_emb = self.model.encode([query], device=self.device)[0]

        # Compute similarities
        similarities = query_emb @ self.database_embeddings.T

        # Get top-k
        top_k = min(top_k, len(similarities))
        scores, indices = torch.topk(similarities, k=top_k)

        results = []
        for idx, score in zip(indices.tolist(), scores.tolist()):
            results.append({
                "text": self.database_texts[idx],
                "score": score
            })

        return results

# Example usage
engine = RentalSearchEngine(model, tokenizer, device)

# Index properties
properties = [
    "Phòng trọ 25m² Quận 10, WC riêng, máy lạnh, giá 5.5tr/tháng",
    "Cho thuê phòng 30m² Quận 1, full nội thất, giá 8tr/tháng",
    "Phòng 20m² Thủ Đức, WC chung, giá 3.5tr/tháng",
    "Studio 35m² Quận 3, ban công, bếp riêng, giá 9tr/tháng",
]
engine.index(properties)

# Search
results = engine.search("phòng trọ q10 25m2 wc riêng 5tr5", top_k=3)

for i, result in enumerate(results, 1):
    print(f"{i}. [{result['score']:.4f}] {result['text']}")
```

## 🎓 Training Details

### Dataset

- **Size**: 10,384 examples
- **Split**: 9,345 train / 1,039 validation
- **Format**: Query-positive-hard negatives triplets
- **Hard Negatives**: 3 per example, weighted by feature type

### Weighted Hard Negatives Strategy

The model uses feature-based weighting for hard negatives:

| Feature Type | Weight | Importance |
|--------------|--------|------------|
| Location (Quận) | 2.5 | Highest |
| Price | 2.0 | High |
| Area (m²) | 1.8 | Medium |
| Amenities | 1.5 | Lower |

This teaches the model that location mismatches are more critical than amenity differences.
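
The exact loss formula is not given here, so the following is only a sketch of one plausible weighting scheme, not the training code: each hard negative's term in the InfoNCE denominator is multiplied by its feature weight. It assumes a single query, one direction only (the configuration mentions a symmetric loss), no in-batch negatives, and a temperature of 0.07. The function name, dictionary keys, and scores are hypothetical.

```python
import math

# Weights copied from the table above; the keys are illustrative labels.
FEATURE_WEIGHTS = {"location": 2.5, "price": 2.0, "area": 1.8, "amenity": 1.5}


def weighted_info_nce(pos_score, negatives, temperature=0.07):
    """One-directional weighted InfoNCE for a single query (assumed form).

    pos_score: cosine similarity between the query and its positive.
    negatives: list of (cosine similarity, feature label) pairs.
    """
    pos_term = math.exp(pos_score / temperature)
    neg_term = sum(
        FEATURE_WEIGHTS[label] * math.exp(score / temperature)
        for score, label in negatives
    )
    return -math.log(pos_term / (pos_term + neg_term))


# Made-up scores: the same negative similarity, labeled two different ways.
loss_location = weighted_info_nce(0.9, [(0.6, "location")])
loss_amenity = weighted_info_nce(0.9, [(0.6, "amenity")])
print(f"location negative: {loss_location:.4f}")
print(f"amenity negative:  {loss_amenity:.4f}")
```

Under this form, a negative that differs only in location contributes more loss than an equally similar negative that differs only in amenities, which pushes location mismatches further from the query.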

### Training Configuration

```json
{
  "base_model": "BAAI/bge-m3",
  "d_out": 128,
  "freeze_encoder": true,
  "epochs": 17,
  "batch_size": 128,
  "learning_rate": 0.0002,
  "optimizer": "AdamW",
  "weight_decay": 0.01,
  "loss": "Weighted InfoNCE (symmetric)",
  "temperature": 0.07,
  "device": "Tesla T4 (Google Colab)",
  "training_time": "~2.5 hours"
}
```

### Training Progress

| Epoch | Train Loss | Val Loss | Status |
|-------|------------|----------|--------|
| 1 | 2.9054 | 2.4529 | ⭐ Best |
| 5 | 2.1609 | 2.0078 | ⭐ Best |
| 9 | 2.0237 | 1.8906 | ⭐ Best |
| 12 | 1.9722 | 1.8760 | ⭐ Best |
| **16** | **1.9297** | **1.8215** | ⭐ **Best** |
| 17 | 1.9191 | 1.8276 | Final |

**Improvement**: -34% train loss, -26% validation loss
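
The improvement figures can be rechecked from the table with a short calculation. This sketch assumes the percentages are relative changes from epoch 1; the helper name is illustrative.

```python
def relative_change(start, end):
    """Relative change from start to end (negative means a reduction)."""
    return (end - start) / start


train = relative_change(2.9054, 1.9191)      # epoch 1 -> epoch 17 (final)
val_best = relative_change(2.4529, 1.8215)   # epoch 1 -> epoch 16 (best)
val_final = relative_change(2.4529, 1.8276)  # epoch 1 -> epoch 17 (final)

print(f"train: {train:.1%}")          # -33.9%
print(f"val (best): {val_best:.1%}")  # -25.7%
print(f"val (final): {val_final:.1%}")  # -25.5%
```

On these numbers, the -26% validation figure corresponds to the best checkpoint at epoch 16 rather than the final epoch.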

### Model Architecture

```
BAAI/bge-m3 (frozen)
    ↓ [1024-dim]
ProjectionHead
    ├─ Linear(1024 → 128, bias=False)
    └─ L2 Normalization
    ↓ [128-dim, L2-normalized]
Output Embeddings
```

**Parameters**:
- Trainable: 131,072 (0.02%)
- Total: 567,885,824
- Strategy: Only projection head is trainable
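
The parameter figures follow directly from the layer shape, as this small check shows. It assumes the head is the single bias-free linear layer in the diagram and that weights are stored as 4-byte float32 values; the variable names are illustrative.

```python
d_in, d_out = 1024, 128

# A bias-free Linear(d_in -> d_out) layer holds one d_out x d_in weight matrix.
trainable = d_in * d_out
total = 567_885_824  # total parameter count reported above

print(trainable)                                 # 131072
print(f"{trainable / total:.4%}")                # 0.0231%
print(trainable * 4 / 1024, "KiB as float32")    # 512.0 KiB as float32
```

The 512 KiB result is in line with a weights file that stores only the projection head.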

## 🎯 Use Cases

This model is optimized for:

✅ **Vietnamese rental property search**
- Matching user queries to property listings
- Finding similar properties
- Semantic search for rental accommodations

✅ **Supported features**:
- Location (districts, neighborhoods)
- Price range
- Area/size (m²)
- Amenities (WC, máy lạnh, ban công, bếp, etc.)
- Room type (phòng trọ, studio, etc.)

## ⚠️ Limitations

- **Domain-specific**: Optimized for Vietnamese rental properties only
- **Geographic focus**: Primarily trained on properties in Ho Chi Minh City and Hanoi
- **Language**: Vietnamese only (not multilingual like base BGE-M3)
- **Frozen encoder**: Base BGE-M3 encoder is not fine-tuned, only projection head
- **Not for**: General-purpose Vietnamese embeddings or other domains

## 🔍 Example Predictions

### Example 1: Location Sensitivity

```
Query: "phòng trọ Gò Vấp 18m² 3tr5 có wc riêng"

Positive (0.947):   Gò Vấp 18m² 3tr5 wc riêng ✅
Negative 1 (0.366): Quận 12 18m² 3tr5 wc riêng (wrong district!)
Negative 2 (0.411): Gò Vấp 18m² 4tr2 wc riêng (wrong price)
Negative 3 (0.828): Gò Vấp 18m² 3tr5 wc chung (wrong amenity)

→ Model correctly penalizes location mismatch most heavily
```

### Example 2: Feature Understanding

```
Query: "phòng trọ q10 4tr 20m² có máy lạnh wc riêng gần chợ"

Positive (0.904):   Q10 20m² 4tr máy lạnh wc riêng ✅
Negative 1 (0.542): Q3 20m² 4tr máy lạnh wc riêng (wrong district)
Negative 2 (0.418): Q10 20m² 5.5tr máy lạnh wc riêng (wrong price)
Negative 3 (0.257): Q10 15m² 4tr máy lạnh wc chung (multiple diffs)

→ Strong margin (+0.36) between positive and top negative
```

## 📖 Citation

If you use this model, please cite:

```bibtex
@misc{bge-m3-vietnamese-rental,
  author = {Your Name},
  title = {BGE-M3 Vietnamese Rental Property Search},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/your-username/bge-m3-vietnamese-rental-projection}},
}
```

## 📜 License

MIT License - Free to use for commercial and non-commercial purposes.

## 🙏 Acknowledgments

- Base model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- Framework: [Hugging Face Transformers](https://github.com/huggingface/transformers)
- Training: Google Colab (Tesla T4)

## 📧 Contact

For questions or feedback, please open an issue on the model repository.

---

**Last updated**: October 2025

UPLOAD_INSTRUCTIONS.md
ADDED
@@ -0,0 +1,193 @@
# Hugging Face Hub Upload Instructions

## Files Ready for Upload

All files are in the `hf_upload/` directory:

```
hf_upload/
├── model.safetensors              # Projection head weights (512 KB)
├── config.json                    # Model configuration
├── modeling_bgem3_projection.py   # Model class definition
├── training_info.json             # Training metrics and details
└── README.md                      # Model Card
```

## Step-by-Step Upload Process

### 1. Install Hugging Face CLI (if not already installed)

```bash
pip install huggingface_hub
```

### 2. Login to Hugging Face

```bash
huggingface-cli login
```

Enter your Hugging Face token when prompted. Get your token from: https://huggingface.co/settings/tokens

### 3. Create Repository

```bash
huggingface-cli repo create bge-m3-vietnamese-rental-projection --type model
```

This creates a new model repository: `https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection`

### 4. Upload Files

#### Option A: Using huggingface-cli (Recommended)

```bash
cd hf_upload

# Upload all files at once
huggingface-cli upload YOUR_USERNAME/bge-m3-vietnamese-rental-projection . . --repo-type model
```

#### Option B: Using Git

```bash
cd hf_upload

# Clone the empty repo
git clone https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection
cd bge-m3-vietnamese-rental-projection

# Copy files
cp ../model.safetensors .
cp ../config.json .
cp ../modeling_bgem3_projection.py .
cp ../training_info.json .
cp ../README.md .

# Commit and push
git add .
git commit -m "Initial upload: BGE-M3 Vietnamese rental projection head"
git push
```

#### Option C: Using Python

```python
from huggingface_hub import HfApi

api = HfApi()

# Upload each file
api.upload_file(
    path_or_fileobj="model.safetensors",
    path_in_repo="model.safetensors",
    repo_id="YOUR_USERNAME/bge-m3-vietnamese-rental-projection",
    repo_type="model",
)

# Repeat for other files...
```
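
Instead of repeating `upload_file` for every file, the whole directory can be sent in one commit with `HfApi.upload_folder`. The sketch below is one possible way to do that while skipping bytecode caches; the helper names, the ignore patterns, and the commit message are assumptions, and the local preview function only approximates the Hub's own pattern filtering.

```python
from fnmatch import fnmatch
from pathlib import Path

# Patterns for files that should stay out of the repository (assumed choice).
IGNORE_PATTERNS = ("__pycache__/*", "*.pyc")


def files_to_upload(folder, ignore_patterns=IGNORE_PATTERNS):
    """Preview the relative paths that would be uploaded from `folder`."""
    root = Path(folder)
    relative = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    return [
        path for path in relative
        if not any(fnmatch(path, pattern) for pattern in ignore_patterns)
    ]


def upload(folder, repo_id):
    """Upload the folder in a single commit, skipping ignored files."""
    from huggingface_hub import HfApi  # imported lazily; needs network and a token

    HfApi().upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        repo_type="model",
        ignore_patterns=list(IGNORE_PATTERNS),
        commit_message="Upload folder using huggingface_hub",
    )
```

A typical call would be `upload("hf_upload", "YOUR_USERNAME/bge-m3-vietnamese-rental-projection")`, after checking the output of `files_to_upload("hf_upload")`.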

### 5. Update README.md

Before uploading (that is, before step 4), update `README.md` with your Hugging Face username:

1. Replace `your-username` with your actual username (appears 2 times)
2. Update the citation section with your name
3. Add your contact information if desired

### 6. Verify Upload

After uploading, visit:
```
https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection
```

You should see:
- ✅ Model Card (README.md) displayed
- ✅ Files tab shows all 5 files
- ✅ Model can be loaded with `from_pretrained()`

### 7. Test Download (Important!)

```python
from transformers import AutoTokenizer
import sys
sys.path.insert(0, "path/to/hf_upload")  # Make the local modeling module importable

# Import model class
from modeling_bgem3_projection import BGEM3ProjectionModel, BGEM3ProjectionConfig

# Load from Hub
config = BGEM3ProjectionConfig.from_pretrained(
    "YOUR_USERNAME/bge-m3-vietnamese-rental-projection"
)
model = BGEM3ProjectionModel.from_pretrained(
    "YOUR_USERNAME/bge-m3-vietnamese-rental-projection",
    config=config,
    trust_remote_code=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

# Test encoding
texts = ["Phòng trọ Quận 10, 25m², giá 5tr"]
embeddings = model.encode(texts)
print(f"Embeddings shape: {embeddings.shape}")  # Should be [1, 128]
```

## Troubleshooting

### Issue: "trust_remote_code" error

**Solution**: Make sure to use `trust_remote_code=True` when loading the model.

### Issue: Weight loading warnings

The warnings about encoder weights not being initialized are **expected**. We only upload projection head weights; the encoder is loaded from BAAI/bge-m3 separately.

### Issue: NumPy version error

**Solution**: Use `pip install "numpy<2.0"` if you encounter TensorFlow compatibility issues.

## Additional Configuration

### Add Model Tags

You can add tags to your model page for better discoverability. In the README.md front matter:

```yaml
---
language:
- vi
tags:
- sentence-transformers
- vietnamese
- rental
- real-estate
- bge-m3
---
```

### Add to a Collection

Consider adding your model to Vietnamese NLP or real estate collections on Hugging Face.

## License

The model is released under MIT License. Make sure this is acceptable for your use case.

## Support

For issues or questions:
- Open an issue on the model repository
- Contact Hugging Face support
- Check Hugging Face documentation: https://huggingface.co/docs

---

**Ready to upload!** 🚀

Follow the steps above and your model will be publicly available for the community to use.

__pycache__/modeling_bgem3_projection.cpython-310.pyc
ADDED
Binary file (8.69 kB)
config.json
ADDED
@@ -0,0 +1,14 @@
{
  "model_type": "bgem3_projection",
  "base_model": "BAAI/bge-m3",
  "d_in": 1024,
  "d_out": 128,
  "use_layernorm": false,
  "freeze_encoder": true,
  "max_length": 512,
  "architectures": [
    "BGEM3ProjectionModel"
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.36.0"
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a6dbefbe72459fd1c0592887da680af20cce946e7b2cdb7f00f891e624420e53
size 524464
modeling_bgem3_projection.py
ADDED
|
@@ -0,0 +1,309 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
BGE-M3 Projection Model for Hugging Face Transformers
|
| 3 |
+
|
| 4 |
+
A lightweight projection head trained on top of frozen BGE-M3 encoder
|
| 5 |
+
for Vietnamese rental property search.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
from typing import List, Optional, Union
|
| 9 |
+
import torch
|
| 10 |
+
import torch.nn as nn
|
| 11 |
+
import torch.nn.functional as F
|
| 12 |
+
from transformers import AutoModel, AutoTokenizer
|
| 13 |
+
from transformers import PretrainedConfig, PreTrainedModel
|
| 14 |
+
from transformers.modeling_outputs import BaseModelOutput
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
class BGEM3ProjectionConfig(PretrainedConfig):
|
| 18 |
+
"""
|
| 19 |
+
Configuration class for BGEM3ProjectionModel
|
| 20 |
+
|
| 21 |
+
Args:
|
| 22 |
+
base_model (str): Base model identifier (default: "BAAI/bge-m3")
|
| 23 |
+
d_in (int): Input dimension from base encoder (default: 1024)
|
| 24 |
+
d_out (int): Output dimension after projection (default: 128)
|
| 25 |
+
use_layernorm (bool): Whether to use LayerNorm in projection head
|
| 26 |
+
freeze_encoder (bool): Whether to freeze the base encoder
|
| 27 |
+
max_length (int): Maximum sequence length for tokenization
|
| 28 |
+
"""
|
| 29 |
+
|
| 30 |
+
model_type = "bgem3_projection"
|
| 31 |
+
|
| 32 |
+
def __init__(
|
| 33 |
+
self,
|
| 34 |
+
base_model: str = "BAAI/bge-m3",
|
| 35 |
+
d_in: int = 1024,
|
| 36 |
+
d_out: int = 128,
|
| 37 |
+
use_layernorm: bool = False,
|
| 38 |
+
freeze_encoder: bool = True,
|
| 39 |
+
max_length: int = 512,
|
| 40 |
+
**kwargs
|
| 41 |
+
):
|
| 42 |
+
super().__init__(**kwargs)
|
| 43 |
+
self.base_model = base_model
|
| 44 |
+
self.d_in = d_in
|
| 45 |
+
self.d_out = d_out
|
| 46 |
+
self.use_layernorm = use_layernorm
|
| 47 |
+
self.freeze_encoder = freeze_encoder
|
| 48 |
+
self.max_length = max_length
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
|
| 52 |
+
"""
|
| 53 |
+
Mean pooling with attention mask
|
| 54 |
+
|
| 55 |
+
Args:
|
| 56 |
+
last_hidden_state: [batch_size, seq_len, hidden_size]
|
| 57 |
+
attention_mask: [batch_size, seq_len]
|
| 58 |
+
|
| 59 |
+
Returns:
|
| 60 |
+
pooled: [batch_size, hidden_size]
|
| 61 |
+
"""
|
| 62 |
+
mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state) # [B, T, 1]
|
| 63 |
+
summed = (last_hidden_state * mask).sum(dim=1) # [B, H]
|
| 64 |
+
counts = mask.sum(dim=1).clamp(min=1e-6) # [B, 1]
|
| 65 |
+
return summed / counts
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
class ProjectionHead(nn.Module):
|
| 69 |
+
"""
|
| 70 |
+
Projection head: Linear + Optional LayerNorm + L2 Normalization
|
| 71 |
+
"""
|
| 72 |
+
|
| 73 |
+
def __init__(self, d_in: int, d_out: int, use_layernorm: bool = False):
|
| 74 |
+
super().__init__()
|
| 75 |
+
self.linear = nn.Linear(d_in, d_out, bias=False)
|
| 76 |
+
self.ln = nn.LayerNorm(d_out) if use_layernorm else None
|
| 77 |
+
|
| 78 |
+
def forward(self, x: torch.Tensor) -> torch.Tensor:
|
| 79 |
+
"""
|
| 80 |
+
Args:
|
| 81 |
+
x: [batch_size, d_in]
|
| 82 |
+
|
| 83 |
+
Returns:
|
| 84 |
+
[batch_size, d_out] L2-normalized
|
| 85 |
+
"""
|
| 86 |
+
x = self.linear(x)
|
| 87 |
+
if self.ln is not None:
|
| 88 |
+
x = self.ln(x)
|
| 89 |
+
# L2 normalize for cosine similarity
|
| 90 |
+
x = F.normalize(x, p=2, dim=-1)
|
| 91 |
+
return x
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
class BGEM3ProjectionModel(PreTrainedModel):
|
| 95 |
+
"""
|
| 96 |
+
BGE-M3 with trainable projection head
|
| 97 |
+
|
| 98 |
+
This model combines:
|
| 99 |
+
1. Frozen BGE-M3 encoder (1024-dim embeddings)
|
| 100 |
+
2. Trainable projection head (1024 -> d_out, default 128)
|
| 101 |
+
|
| 102 |
+
Usage:
|
| 103 |
+
>>> from transformers import AutoModel, AutoTokenizer
|
| 104 |
+
>>>
|
| 105 |
+
>>> model = AutoModel.from_pretrained("your-username/bge-m3-vietnamese-rental-projection", trust_remote_code=True)
|
| 106 |
+
>>> tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
|
| 107 |
+
>>>
|
| 108 |
+
>>> # Encode texts
|
| 109 |
+
>>> texts = ["Phòng trọ Quận 10, 25m2, giá 5tr"]
|
| 110 |
+
>>> inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
|
| 111 |
+
>>> embeddings = model(**inputs).last_hidden_state
|
| 112 |
+
>>>
|
| 113 |
+
>>> # Or use the encode method
|
| 114 |
+
>>> embeddings = model.encode(texts)
|
| 115 |
+
"""
|
| 116 |
+
|
| 117 |
+
config_class = BGEM3ProjectionConfig
|
| 118 |
+
base_model_prefix = "bgem3_projection"
|
| 119 |
+
supports_gradient_checkpointing = False
|
| 120 |
+
|
| 121 |
+
def __init__(self, config: BGEM3ProjectionConfig):
|
| 122 |
+
super().__init__(config)
|
| 123 |
+
|
| 124 |
+
self.config = config
|
| 125 |
+
|
| 126 |
+
# Load base encoder
|
| 127 |
+
self.encoder = AutoModel.from_pretrained(config.base_model)
|
| 128 |
+
|
| 129 |
+
# Freeze encoder if specified
|
| 130 |
+
if config.freeze_encoder:
|
| 131 |
+
for param in self.encoder.parameters():
|
| 132 |
+
param.requires_grad = False
|
| 133 |
+
|
| 134 |
+
# Projection head (trainable)
|
| 135 |
+
self.head = ProjectionHead(
|
| 136 |
+
d_in=config.d_in,
|
| 137 |
+
d_out=config.d_out,
|
| 138 |
+
use_layernorm=config.use_layernorm
|
| 139 |
+
)
|
| 140 |
+
|
| 141 |
+
# Initialize tokenizer (for convenience)
|
| 142 |
+
self._tokenizer = None
|
| 143 |
+
|
| 144 |
+
@property
|
| 145 |
+
def tokenizer(self):
|
| 146 |
+
"""Lazy load tokenizer"""
|
| 147 |
+
if self._tokenizer is None:
|
| 148 |
+
self._tokenizer = AutoTokenizer.from_pretrained(
|
| 149 |
+
self.config.base_model,
|
| 150 |
+
use_fast=True
|
| 151 |
+
)
|
| 152 |
+
return self._tokenizer
|
| 153 |
+
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, BaseModelOutput]:
        """
        Forward pass through encoder and projection head

        Returns:
            BaseModelOutput whose last_hidden_state holds the projected,
            sentence-level embeddings of shape [batch_size, d_out]
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Encode with base model. Gradients flow through the encoder only when it
        # is not frozen and the caller has not already disabled them (encode()
        # runs under torch.no_grad(), which must not be re-enabled here).
        with torch.set_grad_enabled(torch.is_grad_enabled() and not self.config.freeze_encoder):
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                position_ids=position_ids,
                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=True,
            )

        # Mean pooling
        pooled = mean_pool(
            encoder_outputs.last_hidden_state,
            attention_mask
        )  # [batch_size, 1024]

        # Project to d_out
        projected = self.head(pooled)  # [batch_size, d_out], L2-normalized

        if not return_dict:
            return (projected,)

        return BaseModelOutput(
            last_hidden_state=projected,
            hidden_states=encoder_outputs.hidden_states if output_hidden_states else None,
            attentions=encoder_outputs.attentions if output_attentions else None,
        )

    @torch.no_grad()
    def encode(
        self,
        texts: Union[str, List[str]],
        batch_size: int = 32,
        max_length: Optional[int] = None,
        show_progress: bool = False,
        device: Optional[torch.device] = None,
    ) -> torch.Tensor:
        """
        Encode texts to embeddings (convenience method)

        Args:
            texts: Single text or list of texts
            batch_size: Batch size for encoding
            max_length: Maximum sequence length (default: config.max_length)
            show_progress: Show progress bar
            device: Target device (default: model device)

        Returns:
            Tensor of shape [num_texts, d_out], L2-normalized
        """
        if isinstance(texts, str):
            texts = [texts]

        if device is None:
            device = next(self.parameters()).device

        if max_length is None:
            max_length = self.config.max_length

        self.eval()
        all_embeddings = []

        # Optional progress bar
        iterator = range(0, len(texts), batch_size)
        if show_progress:
            try:
                from tqdm import tqdm
                iterator = tqdm(iterator, desc="Encoding")
            except ImportError:
                pass

        for i in iterator:
            batch_texts = texts[i:i + batch_size]

            # Tokenize
            inputs = self.tokenizer(
                batch_texts,
                padding=True,
                truncation=True,
                max_length=max_length,
                return_tensors="pt"
            )

            # Move to device
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # Forward pass
            outputs = self.forward(**inputs)
            embeddings = outputs.last_hidden_state

            all_embeddings.append(embeddings.cpu())

        return torch.cat(all_embeddings, dim=0)

    def compute_similarity(
        self,
        text1: Union[str, List[str]],
        text2: Union[str, List[str]],
    ) -> torch.Tensor:
        """
        Compute cosine similarity between texts

        Args:
            text1: Single text or list of texts
            text2: Single text or list of texts

        Returns:
            Similarity scores (cosine similarity)
        """
        emb1 = self.encode(text1)
        emb2 = self.encode(text2)

        # Cosine similarity (already L2-normalized, so just dot product)
        if emb1.dim() == 1:
            emb1 = emb1.unsqueeze(0)
        if emb2.dim() == 1:
            emb2 = emb2.unsqueeze(0)

        similarity = emb1 @ emb2.T

        return similarity.squeeze()


# Register model for AutoModel
try:
    from transformers import AutoModel, AutoConfig
    AutoConfig.register("bgem3_projection", BGEM3ProjectionConfig)
    AutoModel.register(BGEM3ProjectionConfig, BGEM3ProjectionModel)
except Exception:
    # Registration may fail if models are already registered
    pass

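The forward pass above relies on a `mean_pool` helper and a projection head whose definitions fall outside this part of the diff. As a rough, framework-free sketch of what attention-masked mean pooling followed by L2 normalization usually computes, the snippet below uses plain Python lists in place of tensors. The names `masked_mean_pool`, `l2_normalize`, and `dot` are hypothetical and are not taken from the repository; the real helper may differ in details such as how it guards against empty masks.

```python
import math
from typing import List


def masked_mean_pool(token_states: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1; padding is ignored."""
    dim = len(token_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_states, mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    count = max(count, 1)  # assumption: avoid division by zero on all-padding rows
    return [x / count for x in total]


def l2_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Two real tokens followed by one padding token that must not affect the mean.
states = [[1.0, 0.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(states, mask)  # [2.0, 2.0]
unit = l2_normalize(pooled)
self_similarity = dot(unit, unit)        # approximately 1.0
```

Because the vectors are unit length after normalization, a plain dot product is enough to obtain cosine similarity, which is the shortcut `compute_similarity` takes with `emb1 @ emb2.T`.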
training_info.json
ADDED
|
@@ -0,0 +1,56 @@
{
  "training": {
    "dataset_size": 10384,
    "train_examples": 9345,
    "val_examples": 1039,
    "epochs": 17,
    "best_epoch": 16,
    "batch_size": 128,
    "learning_rate": 0.0002,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "device": "Tesla T4 (Google Colab)",
    "training_time": "~2.5 hours"
  },
  "loss": {
    "final_train_loss": 1.9191,
    "best_val_loss": 1.8215122487809923,
    "initial_train_loss": 2.9054,
    "initial_val_loss": 2.4529,
    "improvement": {
      "train": "-34%",
      "val": "-26%"
    }
  },
  "evaluation": {
    "test_examples": 96,
    "metrics": {
      "MRR": 0.9844,
      "Recall@1": 0.9688,
      "Recall@5": 1.0,
      "Recall@10": 1.0,
      "Recall@50": 1.0
    },
    "interpretation": "Excellent retrieval performance. 96.88% of queries find correct match at rank 1."
  },
  "model_details": {
    "base_model": "BAAI/bge-m3",
    "projection_dim": 128,
    "trainable_params": 131072,
    "total_params": 567885824,
    "trainable_ratio": "0.02%",
    "training_strategy": "Frozen encoder + trainable projection head"
  },
  "loss_function": {
    "type": "Weighted InfoNCE",
    "temperature": 0.07,
    "symmetric": true,
    "weighted_hard_negatives": true,
    "feature_weights": {
      "location": 2.5,
      "price": 2.0,
      "area": 1.8,
      "amenity": 1.5
    }
  }
}
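The evaluation block reports MRR and Recall@k over 96 test queries but does not include the per-query ranks. The sketch below shows how these metrics are conventionally computed from the 1-based rank of the correct match for each query. The rank list is hypothetical: 93 queries at rank 1 and 3 at rank 2 is one distribution that reproduces the reported figures after rounding, not data taken from the repository. The function names are likewise illustrative.

```python
from typing import List


def mean_reciprocal_rank(ranks: List[int]) -> float:
    """Mean of 1/rank, where rank is the 1-based position of the correct match."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: List[int], k: int) -> float:
    """Fraction of queries whose correct match appears in the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for 96 queries (assumption, not from training_info.json).
ranks = [1] * 93 + [2] * 3

mrr = mean_reciprocal_rank(ranks)  # 0.984375
recall_1 = recall_at_k(ranks, 1)   # 0.96875
recall_5 = recall_at_k(ranks, 5)   # 1.0
```

Under this assumed distribution, every miss at rank 1 is recovered by rank 2, which would explain why Recall@5, Recall@10, and Recall@50 are all 1.0 in the reported metrics.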