lamdx4 committed on
Commit
62e0350
·
verified ·
1 Parent(s): cc3c90c

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,313 @@
1
+ ---
2
+ language:
3
+ - vi
4
+ license: mit
5
+ tags:
6
+ - sentence-transformers
7
+ - feature-extraction
8
+ - sentence-similarity
9
+ - transformers
10
+ - embeddings
11
+ - vietnamese
12
+ - rental
13
+ - real-estate
14
+ library_name: transformers
15
+ pipeline_tag: feature-extraction
16
+ ---
17
+
18
+ # BGE-M3 Vietnamese Rental Property Search
19
+
20
+ Fine-tuned projection head for [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for **Vietnamese rental property search** (Phòng trọ).
21
+
22
+ This model adds a lightweight trainable projection head (128 dimensions) on top of the frozen BGE-M3 encoder, trained with **weighted hard negatives** using contrastive learning (InfoNCE loss).
23
+
24
+ ## 🎯 Model Description
25
+
26
+ - **Base Model**: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (frozen)
27
+ - **Task**: Semantic search for Vietnamese rental properties
28
+ - **Training Strategy**: Weighted contrastive learning with hard negatives
29
+ - **Output Dimension**: 128 (projected from 1024)
30
+ - **Training Data**: 10,384 Vietnamese rental property query-document pairs
31
+
32
+ ## 📊 Performance
33
+
34
+ Evaluated on 96 test examples:
35
+
36
+ | Metric | Score |
37
+ |--------|-------|
38
+ | **MRR** | **98.44%** |
39
+ | **Recall@1** | **96.88%** |
40
+ | **Recall@5** | **100.00%** |
41
+ | **Recall@10** | **100.00%** |
42
+ | **Recall@50** | **100.00%** |
43
+
44
+ ### Interpretation
45
+
46
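+ - **98.44% MRR**: The mean reciprocal rank is 0.9844, so the correct match is almost always at rank 1; combined with Recall@1 and Recall@5, this is consistent with the 3 remaining queries ranking it second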
47
+ - **96.88% Recall@1**: 93 out of 96 queries find the correct match at the top position
48
+ - **100% Recall@5+**: All queries find their correct match within top-5 results
49
+
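To make the relationship between these metrics concrete, here is a minimal sketch that computes MRR and Recall@k from 1-based ranks. The rank distribution (93 queries at rank 1, 3 at rank 2) is an assumption chosen because it reproduces the reported aggregates; the source does not list per-query ranks, and `reciprocal_rank_metrics` is a hypothetical helper name.

```python
def reciprocal_rank_metrics(ranks, ks=(1, 5, 10)):
    """Compute MRR and Recall@k from the 1-based rank of each correct match."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    recall = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, recall


# Assumed distribution (not stated in the model card): 93 hits at rank 1, 3 at rank 2.
ranks = [1] * 93 + [2] * 3
mrr, recall = reciprocal_rank_metrics(ranks)
print(f"MRR={mrr:.4f}  Recall@1={recall[1]:.4f}  Recall@5={recall[5]:.4f}")
```

Note that MRR averages reciprocal ranks, so it is not the same quantity as the average rank; under the assumed distribution the average rank would be 99/96, slightly above 1.03.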
50
+ ## 🚀 Quick Start
51
+
52
+ ### Installation
53
+
54
+ ```bash
55
+ pip install transformers torch safetensors
56
+ ```
57
+
58
+ ### Usage
59
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+ import torch
+
+ # Load model
+ model = AutoModel.from_pretrained(
+     "your-username/bge-m3-vietnamese-rental-projection",
+     trust_remote_code=True
+ )
+ tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
+
+ # Move to GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device)
+ model.eval()
+
+ # Encode texts
+ texts = [
+     "Phòng trọ Quận 10, 25m², giá 5 triệu, WC riêng, máy lạnh",
+     "Cho thuê phòng Bình Thạnh, 20m², 4 triệu/tháng"
+ ]
+
+ # Method 1: Using encode (recommended)
+ embeddings = model.encode(texts, device=device)  # [2, 128]
+
+ # Method 2: Using forward
+ inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
+ inputs = {k: v.to(device) for k, v in inputs.items()}
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     embeddings = outputs.last_hidden_state  # [2, 128], L2-normalized
+
+ print(embeddings.shape)  # torch.Size([2, 128])
+
+ # Compute similarity (cosine)
+ similarity = embeddings[0] @ embeddings[1]
+ print(f"Similarity: {similarity:.4f}")
+ ```
99
+
100
+ ### Search Example
101
+
+ ```python
+ # Build a search engine
+ class RentalSearchEngine:
+     def __init__(self, model, tokenizer, device="cuda"):
+         self.model = model
+         self.tokenizer = tokenizer
+         self.device = device
+         self.database_embeddings = None
+         self.database_texts = None
+
+     def index(self, property_descriptions):
+         """Index a database of property descriptions"""
+         self.database_texts = property_descriptions
+         self.database_embeddings = self.model.encode(
+             property_descriptions,
+             device=self.device
+         )
+
+     def search(self, query, top_k=5):
+         """Search for most similar properties"""
+         query_emb = self.model.encode([query], device=self.device)[0]
+
+         # Compute similarities
+         similarities = query_emb @ self.database_embeddings.T
+
+         # Get top-k
+         top_k = min(top_k, len(similarities))
+         scores, indices = torch.topk(similarities, k=top_k)
+
+         results = []
+         for idx, score in zip(indices.tolist(), scores.tolist()):
+             results.append({
+                 "text": self.database_texts[idx],
+                 "score": score
+             })
+
+         return results
+
+ # Example usage
+ engine = RentalSearchEngine(model, tokenizer, device)
+
+ # Index properties
+ properties = [
+     "Phòng trọ 25m² Quận 10, WC riêng, máy lạnh, giá 5.5tr/tháng",
+     "Cho thuê phòng 30m² Quận 1, full nội thất, giá 8tr/tháng",
+     "Phòng 20m² Thủ Đức, WC chung, giá 3.5tr/tháng",
+     "Studio 35m² Quận 3, ban công, bếp riêng, giá 9tr/tháng",
+ ]
+ engine.index(properties)
+
+ # Search
+ results = engine.search("phòng trọ q10 25m2 wc riêng 5tr5", top_k=3)
+
+ for i, result in enumerate(results, 1):
+     print(f"{i}. [{result['score']:.4f}] {result['text']}")
+ ```
158
+
159
+ ## 🎓 Training Details
160
+
161
+ ### Dataset
162
+
163
+ - **Size**: 10,384 examples
164
+ - **Split**: 9,345 train / 1,039 validation
165
+ - **Format**: Each example contains a query, one positive, and its hard negatives
166
+ - **Hard Negatives**: 3 per example, weighted by feature type
167
+
168
+ ### Weighted Hard Negatives Strategy
169
+
170
+ The model uses feature-based weighting for hard negatives:
171
+
172
+ | Feature Type | Weight | Importance |
173
+ |--------------|--------|------------|
174
+ | Location (Quận) | 2.5 | Highest |
175
+ | Price | 2.0 | High |
176
+ | Area (m²) | 1.8 | Medium |
177
+ | Amenities | 1.5 | Lower |
178
+
179
+ This teaches the model that location mismatches are more critical than amenity differences.
180
+
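The training code is not part of this commit, so the exact way the weights enter the loss is unknown. As a rough sketch under one common assumption (each hard negative's exponentiated similarity is multiplied by its feature weight in the InfoNCE denominator), a single-query, one-directional version could look like this. `weighted_info_nce` is a hypothetical name, and the symmetric term mentioned in the configuration is omitted.

```python
import math

# Feature weights copied from the table above; how they enter the loss is an assumption.
FEATURE_WEIGHTS = {"location": 2.5, "price": 2.0, "area": 1.8, "amenity": 1.5}


def weighted_info_nce(pos_sim, neg_sims, neg_features, temperature=0.07):
    """Illustrative weighted InfoNCE for one query (query-to-document direction only)."""
    pos_term = math.exp(pos_sim / temperature)
    # Each hard negative is scaled by the weight of the feature it violates.
    neg_term = sum(
        FEATURE_WEIGHTS[feature] * math.exp(sim / temperature)
        for sim, feature in zip(neg_sims, neg_features)
    )
    return -math.log(pos_term / (pos_term + neg_term))


# Same similarity scores, different violated feature (hypothetical values).
loss_location = weighted_info_nce(0.90, [0.80], ["location"])
loss_amenity = weighted_info_nce(0.90, [0.80], ["amenity"])
print(loss_location > loss_amenity)
```

Under this formulation, a negative with the wrong district contributes more to the loss than an equally similar negative with the wrong amenity, which pushes the location mismatch further from the query during training.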
181
+ ### Training Configuration
182
+
183
+ ```json
184
+ {
185
+ "base_model": "BAAI/bge-m3",
186
+ "d_out": 128,
187
+ "freeze_encoder": true,
188
+ "epochs": 17,
189
+ "batch_size": 128,
190
+ "learning_rate": 0.0002,
191
+ "optimizer": "AdamW",
192
+ "weight_decay": 0.01,
193
+ "loss": "Weighted InfoNCE (symmetric)",
194
+ "temperature": 0.07,
195
+ "device": "Tesla T4 (Google Colab)",
196
+ "training_time": "~2.5 hours"
197
+ }
198
+ ```
199
+
200
+ ### Training Progress
201
+
202
+ | Epoch | Train Loss | Val Loss | Status |
203
+ |-------|------------|----------|--------|
204
+ | 1 | 2.9054 | 2.4529 | ⭐ Best |
205
+ | 5 | 2.1609 | 2.0078 | ⭐ Best |
206
+ | 9 | 2.0237 | 1.8906 | ⭐ Best |
207
+ | 12 | 1.9722 | 1.8760 | ⭐ Best |
208
+ | **16** | **1.9297** | **1.8215** | ⭐ **Best** |
209
+ | 17 | 1.9191 | 1.8276 | Final |
210
+
211
+ **Improvement**: -34% train loss, -26% validation loss
212
+
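The percentages can be checked directly from the table. This short sketch assumes the comparison is epoch 1 against the final train loss (epoch 17) and the best validation loss (epoch 16); `relative_change` is a hypothetical helper.

```python
def relative_change(initial, final):
    """Relative change from initial to final, in percent (negative means a decrease)."""
    return (final - initial) / initial * 100.0


train_change = relative_change(2.9054, 1.9191)  # epoch 1 -> epoch 17 train loss
val_change = relative_change(2.4529, 1.8215)    # epoch 1 -> epoch 16 (best) val loss
print(f"train: {train_change:.1f}%  val: {val_change:.1f}%")
```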
213
+ ### Model Architecture
214
+
215
+ ```
216
+ BAAI/bge-m3 (frozen)
217
+ ↓ [1024-dim]
218
+ ProjectionHead
219
+ ├─ Linear(1024 → 128, bias=False)
220
+ └─ L2 Normalization
221
+ ↓ [128-dim, L2-normalized]
222
+ Output Embeddings
223
+ ```
224
+
225
+ **Parameters**:
226
+ - Trainable: 131,072 (0.02%)
227
+ - Total: 567,885,824
228
+ - Strategy: Only projection head is trainable
229
+
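The trainable parameter count follows from the architecture: a bias-free linear layer from 1024 to 128 dimensions. The sketch below reproduces the figures; the float32 storage size is an inference from `torch_dtype` in `config.json`, and the assumption that the rest of the 524,464-byte `model.safetensors` file is header metadata is a guess.

```python
d_in, d_out = 1024, 128

trainable = d_in * d_out        # Linear(1024 -> 128, bias=False) has only a weight matrix
total = 567_885_824             # total parameter count reported in the model card
ratio_percent = trainable / total * 100
weight_bytes = trainable * 4    # float32 uses 4 bytes per parameter

print(trainable, f"{ratio_percent:.3f}%", weight_bytes)
```

The weight matrix alone accounts for exactly 512 KiB, which matches the "512 KB" noted for `model.safetensors` in the upload instructions.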
230
+ ## 🎯 Use Cases
231
+
232
+ This model is optimized for:
233
+
234
+ ✅ **Vietnamese rental property search**
235
+ - Matching user queries to property listings
236
+ - Finding similar properties
237
+ - Semantic search for rental accommodations
238
+
239
+ ✅ **Supported features**:
240
+ - Location (districts, neighborhoods)
241
+ - Price range
242
+ - Area/size (m²)
243
+ - Amenities (WC, máy lạnh, ban công, bếp, etc.)
244
+ - Room type (phòng trọ, studio, etc.)
245
+
246
+ ## ⚠️ Limitations
247
+
248
+ - **Domain-specific**: Optimized for Vietnamese rental properties only
249
+ - **Geographic focus**: Primarily trained on properties in Ho Chi Minh City and Hanoi
250
+ - **Language**: Vietnamese only (not multilingual like base BGE-M3)
251
+ - **Frozen encoder**: Base BGE-M3 encoder is not fine-tuned, only projection head
252
+ - **Not for**: General-purpose Vietnamese embeddings or other domains
253
+
254
+ ## 🔍 Example Predictions
255
+
256
+ ### Example 1: Location Sensitivity
257
+
258
+ ```
259
+ Query: "phòng trọ Gò Vấp 18m² 3tr5 có wc riêng"
260
+
261
+ Positive (0.947): Gò Vấp 18m² 3tr5 wc riêng ✅
262
+ Negative 1 (0.366): Quận 12 18m² 3tr5 wc riêng (wrong district!)
263
+ Negative 2 (0.411): Gò Vấp 18m² 4tr2 wc riêng (wrong price)
264
+ Negative 3 (0.828): Gò Vấp 18m² 3tr5 wc chung (wrong amenity)
265
+
266
+ → Model correctly penalizes location mismatch most heavily
267
+ ```
268
+
269
+ ### Example 2: Feature Understanding
270
+
271
+ ```
272
+ Query: "phòng trọ q10 4tr 20m² có máy lạnh wc riêng gần chợ"
273
+
274
+ Positive (0.904): Q10 20m² 4tr máy lạnh wc riêng ✅
275
+ Negative 1 (0.542): Q3 20m² 4tr máy lạnh wc riêng (wrong district)
276
+ Negative 2 (0.418): Q10 20m² 5.5tr máy lạnh wc riêng (wrong price)
277
+ Negative 3 (0.257): Q10 15m² 4tr máy lạnh wc chung (multiple diffs)
278
+
279
+ → Strong margin (+0.36) between positive and top negative
280
+ ```
281
+
282
+ ## 📖 Citation
283
+
284
+ If you use this model, please cite:
285
+
286
+ ```bibtex
287
+ @misc{bge-m3-vietnamese-rental,
288
+ author = {Your Name},
289
+ title = {BGE-M3 Vietnamese Rental Property Search},
290
+ year = {2025},
291
+ publisher = {Hugging Face},
292
+ howpublished = {\url{https://huggingface.co/your-username/bge-m3-vietnamese-rental-projection}},
293
+ }
294
+ ```
295
+
296
+ ## 📜 License
297
+
298
+ MIT License - Free to use for commercial and non-commercial purposes.
299
+
300
+ ## 🙏 Acknowledgments
301
+
302
+ - Base model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
303
+ - Framework: [Hugging Face Transformers](https://github.com/huggingface/transformers)
304
+ - Training: Google Colab (Tesla T4)
305
+
306
+ ## 📧 Contact
307
+
308
+ For questions or feedback, please open an issue on the model repository.
309
+
310
+ ---
311
+
312
+ **Last updated**: October 2025
313
+
UPLOAD_INSTRUCTIONS.md ADDED
@@ -0,0 +1,193 @@
1
+ # Hugging Face Hub Upload Instructions
2
+
3
+ ## Files Ready for Upload
4
+
5
+ All files are in the `hf_upload/` directory:
6
+
7
+ ```
8
+ hf_upload/
9
+ ├── model.safetensors # Projection head weights (512 KB)
10
+ ├── config.json # Model configuration
11
+ ├── modeling_bgem3_projection.py # Model class definition
12
+ ├── training_info.json # Training metrics and details
13
+ └── README.md # Model Card
14
+ ```
15
+
16
+ ## Step-by-Step Upload Process
17
+
18
+ ### 1. Install Hugging Face CLI (if not already installed)
19
+
20
+ ```bash
21
+ pip install huggingface_hub
22
+ ```
23
+
24
+ ### 2. Login to Hugging Face
25
+
26
+ ```bash
27
+ huggingface-cli login
28
+ ```
29
+
30
+ Enter your Hugging Face token when prompted. Get your token from: https://huggingface.co/settings/tokens
31
+
32
+ ### 3. Create Repository
33
+
34
+ ```bash
35
+ huggingface-cli repo create bge-m3-vietnamese-rental-projection --type model
36
+ ```
37
+
38
+ This creates a new model repository: `https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection`
39
+
40
+ ### 4. Upload Files
41
+
42
+ #### Option A: Using huggingface-cli (Recommended)
43
+
44
+ ```bash
45
+ cd hf_upload
46
+
47
+ # Upload all files at once
48
+ huggingface-cli upload YOUR_USERNAME/bge-m3-vietnamese-rental-projection . . --repo-type model
49
+ ```
50
+
51
+ #### Option B: Using Git
52
+
53
+ ```bash
54
+ cd hf_upload
55
+
56
+ # Clone the empty repo
57
+ git clone https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection
58
+ cd bge-m3-vietnamese-rental-projection
59
+
60
+ # Copy files
61
+ cp ../model.safetensors .
62
+ cp ../config.json .
63
+ cp ../modeling_bgem3_projection.py .
64
+ cp ../training_info.json .
65
+ cp ../README.md .
66
+
67
+ # Commit and push
68
+ git add .
69
+ git commit -m "Initial upload: BGE-M3 Vietnamese rental projection head"
70
+ git push
71
+ ```
72
+
73
+ #### Option C: Using Python
74
+
+ ```python
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+
+ # Upload each file
+ api.upload_file(
+     path_or_fileobj="model.safetensors",
+     path_in_repo="model.safetensors",
+     repo_id="YOUR_USERNAME/bge-m3-vietnamese-rental-projection",
+     repo_type="model",
+ )
+
+ # Repeat for other files...
+ ```
90
+
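Instead of repeating `upload_file` for each file, `HfApi.upload_folder` can push the whole directory in one commit. The sketch below is one way to do that while keeping bytecode caches such as `__pycache__/` out of the repository. The helper names (`upload_kwargs`, `is_ignored`, `upload`) and the ignore patterns are assumptions for illustration, and `is_ignored` only approximates the Hub's pattern filtering locally.

```python
from fnmatch import fnmatch

# Patterns for files that should stay out of the repository (an assumption).
IGNORE_PATTERNS = ["__pycache__/*", "*.pyc"]


def upload_kwargs(repo_id, folder="hf_upload"):
    """Build keyword arguments for HfApi.upload_folder."""
    return {
        "folder_path": folder,
        "repo_id": repo_id,
        "repo_type": "model",
        "ignore_patterns": IGNORE_PATTERNS,
    }


def is_ignored(path):
    """Local preview of which relative paths the patterns would skip."""
    return any(fnmatch(path, pattern) for pattern in IGNORE_PATTERNS)


def upload(repo_id, folder="hf_upload"):
    """Upload the folder; needs `pip install huggingface_hub` and a prior login."""
    from huggingface_hub import HfApi  # imported lazily so the helpers work without it

    return HfApi().upload_folder(**upload_kwargs(repo_id, folder))
```

After logging in, calling `upload("YOUR_USERNAME/bge-m3-vietnamese-rental-projection")` from the directory that contains `hf_upload/` would perform the upload.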
91
+ ### 5. Update README.md
92
+
93
+ Complete this step before uploading in step 4. Update `README.md` with your Hugging Face username:
94
+
95
+ 1. Replace `your-username` with your actual username (appears 2 times)
96
+ 2. Update the citation section with your name
97
+ 3. Add your contact information if desired
98
+
99
+ ### 6. Verify Upload
100
+
101
+ After uploading, visit:
102
+ ```
103
+ https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection
104
+ ```
105
+
106
+ You should see:
107
+ - ✅ Model Card (README.md) displayed
108
+ - ✅ Files tab shows all 5 files
109
+ - ✅ Model can be loaded with `from_pretrained()`
110
+
111
+ ### 7. Test Download (Important!)
112
+
+ ```python
+ from transformers import AutoTokenizer
+ import sys
+ sys.path.insert(0, "path/to/hf_upload")  # Make the local modeling_bgem3_projection.py importable
+
+ # Import model class
+ from modeling_bgem3_projection import BGEM3ProjectionModel, BGEM3ProjectionConfig
+
+ # Load from Hub
+ config = BGEM3ProjectionConfig.from_pretrained(
+     "YOUR_USERNAME/bge-m3-vietnamese-rental-projection"
+ )
+ model = BGEM3ProjectionModel.from_pretrained(
+     "YOUR_USERNAME/bge-m3-vietnamese-rental-projection",
+     config=config,
+     trust_remote_code=True
+ )
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
+
+ # Test encoding
+ texts = ["Phòng trọ Quận 10, 25m², giá 5tr"]
+ embeddings = model.encode(texts)
+ print(f"Embeddings shape: {embeddings.shape}")  # Should be [1, 128]
+ ```
139
+
140
+ ## Troubleshooting
141
+
142
+ ### Issue: "trust_remote_code" error
143
+
144
+ **Solution**: Make sure to use `trust_remote_code=True` when loading the model.
145
+
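If `AutoModel.from_pretrained(..., trust_remote_code=True)` still fails with an unrecognized model type, one possible cause is that `config.json` has no `auto_map` entry; Transformers uses that entry to locate custom classes stored in the repository. The sketch below shows what such an entry could look like for this repository. It assumes the class and module names from `modeling_bgem3_projection.py`; `add_auto_map` is a hypothetical helper, and the input dictionary is abbreviated.

```python
import json


def add_auto_map(config, module="modeling_bgem3_projection"):
    """Return a copy of the config dict with an auto_map entry for the custom classes."""
    updated = dict(config)
    updated["auto_map"] = {
        "AutoConfig": f"{module}.BGEM3ProjectionConfig",
        "AutoModel": f"{module}.BGEM3ProjectionModel",
    }
    return updated


# Abbreviated copy of config.json, for illustration only.
config = {
    "model_type": "bgem3_projection",
    "base_model": "BAAI/bge-m3",
    "d_in": 1024,
    "d_out": 128,
}
print(json.dumps(add_auto_map(config), indent=2))
```

The printed `auto_map` block would be merged into the uploaded `config.json`; whether this is needed depends on how the repository was registered, so treat it as a troubleshooting step to try.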
146
+ ### Issue: Weight loading warnings
147
+
148
+ The warnings about encoder weights not being initialized are **expected**. We only upload projection head weights; the encoder is loaded from BAAI/bge-m3 separately.
149
+
150
+ ### Issue: NumPy version error
151
+
152
+ **Solution**: Use `pip install "numpy<2.0"` if you encounter TensorFlow compatibility issues.
153
+
154
+ ## Additional Configuration
155
+
156
+ ### Add Model Tags
157
+
158
+ You can add tags to your model page for better discoverability. In the README.md front matter:
159
+
160
+ ```yaml
161
+ ---
162
+ language:
163
+ - vi
164
+ tags:
165
+ - sentence-transformers
166
+ - vietnamese
167
+ - rental
168
+ - real-estate
169
+ - bge-m3
170
+ ---
171
+ ```
172
+
173
+ ### Add to a Collection
174
+
175
+ Consider adding your model to Vietnamese NLP or real estate collections on Hugging Face.
176
+
177
+ ## License
178
+
179
+ The model is released under MIT License. Make sure this is acceptable for your use case.
180
+
181
+ ## Support
182
+
183
+ For issues or questions:
184
+ - Open an issue on the model repository
185
+ - Contact Hugging Face support
186
+ - Check Hugging Face documentation: https://huggingface.co/docs
187
+
188
+ ---
189
+
190
+ **Ready to upload!** 🚀
191
+
192
+ Follow the steps above and your model will be publicly available for the community to use.
193
+
__pycache__/modeling_bgem3_projection.cpython-310.pyc ADDED
Binary file (8.69 kB).
 
config.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "model_type": "bgem3_projection",
3
+ "base_model": "BAAI/bge-m3",
4
+ "d_in": 1024,
5
+ "d_out": 128,
6
+ "use_layernorm": false,
7
+ "freeze_encoder": true,
8
+ "max_length": 512,
9
+ "architectures": [
10
+ "BGEM3ProjectionModel"
11
+ ],
12
+ "torch_dtype": "float32",
13
+ "transformers_version": "4.36.0"
14
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a6dbefbe72459fd1c0592887da680af20cce946e7b2cdb7f00f891e624420e53
3
+ size 524464
modeling_bgem3_projection.py ADDED
@@ -0,0 +1,309 @@
+ """
+ BGE-M3 Projection Model for Hugging Face Transformers
+
+ A lightweight projection head trained on top of frozen BGE-M3 encoder
+ for Vietnamese rental property search.
+ """
+
+ from typing import List, Optional, Union
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+ from transformers import PretrainedConfig, PreTrainedModel
+ from transformers.modeling_outputs import BaseModelOutput
+
+
+ class BGEM3ProjectionConfig(PretrainedConfig):
+     """
+     Configuration class for BGEM3ProjectionModel
+
+     Args:
+         base_model (str): Base model identifier (default: "BAAI/bge-m3")
+         d_in (int): Input dimension from base encoder (default: 1024)
+         d_out (int): Output dimension after projection (default: 128)
+         use_layernorm (bool): Whether to use LayerNorm in projection head
+         freeze_encoder (bool): Whether to freeze the base encoder
+         max_length (int): Maximum sequence length for tokenization
+     """
+
+     model_type = "bgem3_projection"
+
+     def __init__(
+         self,
+         base_model: str = "BAAI/bge-m3",
+         d_in: int = 1024,
+         d_out: int = 128,
+         use_layernorm: bool = False,
+         freeze_encoder: bool = True,
+         max_length: int = 512,
+         **kwargs
+     ):
+         super().__init__(**kwargs)
+         self.base_model = base_model
+         self.d_in = d_in
+         self.d_out = d_out
+         self.use_layernorm = use_layernorm
+         self.freeze_encoder = freeze_encoder
+         self.max_length = max_length
+
+
+ def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+     """
+     Mean pooling with attention mask
+
+     Args:
+         last_hidden_state: [batch_size, seq_len, hidden_size]
+         attention_mask: [batch_size, seq_len]
+
+     Returns:
+         pooled: [batch_size, hidden_size]
+     """
+     mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # [B, T, 1]
+     summed = (last_hidden_state * mask).sum(dim=1)  # [B, H]
+     counts = mask.sum(dim=1).clamp(min=1e-6)  # [B, 1]
+     return summed / counts
+
+
+ class ProjectionHead(nn.Module):
+     """
+     Projection head: Linear + Optional LayerNorm + L2 Normalization
+     """
+
+     def __init__(self, d_in: int, d_out: int, use_layernorm: bool = False):
+         super().__init__()
+         self.linear = nn.Linear(d_in, d_out, bias=False)
+         self.ln = nn.LayerNorm(d_out) if use_layernorm else None
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         """
+         Args:
+             x: [batch_size, d_in]
+
+         Returns:
+             [batch_size, d_out] L2-normalized
+         """
+         x = self.linear(x)
+         if self.ln is not None:
+             x = self.ln(x)
+         # L2 normalize for cosine similarity
+         x = F.normalize(x, p=2, dim=-1)
+         return x
+
+
+ class BGEM3ProjectionModel(PreTrainedModel):
+     """
+     BGE-M3 with trainable projection head
+
+     This model combines:
+     1. Frozen BGE-M3 encoder (1024-dim embeddings)
+     2. Trainable projection head (1024 -> d_out, default 128)
+
+     Usage:
+         >>> from transformers import AutoModel, AutoTokenizer
+         >>>
+         >>> model = AutoModel.from_pretrained("your-username/bge-m3-vietnamese-rental-projection", trust_remote_code=True)
+         >>> tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
+         >>>
+         >>> # Encode texts
+         >>> texts = ["Phòng trọ Quận 10, 25m2, giá 5tr"]
+         >>> inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
+         >>> embeddings = model(**inputs).last_hidden_state
+         >>>
+         >>> # Or use the encode method
+         >>> embeddings = model.encode(texts)
+     """
+
+     config_class = BGEM3ProjectionConfig
+     base_model_prefix = "bgem3_projection"
+     supports_gradient_checkpointing = False
+
+     def __init__(self, config: BGEM3ProjectionConfig):
+         super().__init__(config)
+
+         self.config = config
+
+         # Load base encoder
+         self.encoder = AutoModel.from_pretrained(config.base_model)
+
+         # Freeze encoder if specified
+         if config.freeze_encoder:
+             for param in self.encoder.parameters():
+                 param.requires_grad = False
+
+         # Projection head (trainable)
+         self.head = ProjectionHead(
+             d_in=config.d_in,
+             d_out=config.d_out,
+             use_layernorm=config.use_layernorm
+         )
+
+         # Initialize tokenizer (for convenience)
+         self._tokenizer = None
+
+     @property
+     def tokenizer(self):
+         """Lazy load tokenizer"""
+         if self._tokenizer is None:
+             self._tokenizer = AutoTokenizer.from_pretrained(
+                 self.config.base_model,
+                 use_fast=True
+             )
+         return self._tokenizer
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[tuple, BaseModelOutput]:
+         """
+         Forward pass through encoder and projection head
+
+         Returns:
+             BaseModelOutput with last_hidden_state = projected embeddings [batch_size, d_out]
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         # Encode with base model
+         with torch.set_grad_enabled(not self.config.freeze_encoder):
+             encoder_outputs = self.encoder(
+                 input_ids=input_ids,
+                 attention_mask=attention_mask,
+                 token_type_ids=token_type_ids,
+                 position_ids=position_ids,
+                 head_mask=head_mask,
+                 inputs_embeds=inputs_embeds,
+                 output_attentions=output_attentions,
+                 output_hidden_states=output_hidden_states,
+                 return_dict=True,
+             )
+
+         # mean_pool needs a mask; fall back to all ones when none was passed
+         if attention_mask is None:
+             attention_mask = torch.ones(
+                 encoder_outputs.last_hidden_state.shape[:2],
+                 device=encoder_outputs.last_hidden_state.device,
+             )
+
+         # Mean pooling
+         pooled = mean_pool(
+             encoder_outputs.last_hidden_state,
+             attention_mask
+         )  # [batch_size, 1024]
+
+         # Project to d_out
+         projected = self.head(pooled)  # [batch_size, d_out], L2-normalized
+
+         if not return_dict:
+             return (projected,)
+
+         return BaseModelOutput(
+             last_hidden_state=projected,
+             hidden_states=encoder_outputs.hidden_states if output_hidden_states else None,
+             attentions=encoder_outputs.attentions if output_attentions else None,
+         )
+
+     @torch.no_grad()
+     def encode(
+         self,
+         texts: Union[str, List[str]],
+         batch_size: int = 32,
+         max_length: Optional[int] = None,
+         show_progress: bool = False,
+         device: Optional[Union[str, torch.device]] = None,
+     ) -> torch.Tensor:
+         """
+         Encode texts to embeddings (convenience method)
+
+         Args:
+             texts: Single text or list of texts
+             batch_size: Batch size for encoding
+             max_length: Maximum sequence length (default: config.max_length)
+             show_progress: Show progress bar
+             device: Target device as a string or torch.device (default: model device)
+
+         Returns:
+             CPU tensor of shape [num_texts, d_out], L2-normalized
+         """
+         if isinstance(texts, str):
+             texts = [texts]
+
+         if device is None:
+             device = next(self.parameters()).device
+
+         if max_length is None:
+             max_length = self.config.max_length
+
+         self.eval()
+         all_embeddings = []
+
+         # Optional progress bar
+         iterator = range(0, len(texts), batch_size)
+         if show_progress:
+             try:
+                 from tqdm import tqdm
+                 iterator = tqdm(iterator, desc="Encoding")
+             except ImportError:
+                 pass
+
+         for i in iterator:
+             batch_texts = texts[i:i + batch_size]
+
+             # Tokenize
+             inputs = self.tokenizer(
+                 batch_texts,
+                 padding=True,
+                 truncation=True,
+                 max_length=max_length,
+                 return_tensors="pt"
+             )
+
+             # Move to device
+             inputs = {k: v.to(device) for k, v in inputs.items()}
+
+             # Forward pass
+             outputs = self.forward(**inputs)
+             embeddings = outputs.last_hidden_state
+
+             all_embeddings.append(embeddings.cpu())
+
+         return torch.cat(all_embeddings, dim=0)
+
+     def compute_similarity(
+         self,
+         text1: Union[str, List[str]],
+         text2: Union[str, List[str]],
+     ) -> torch.Tensor:
+         """
+         Compute cosine similarity between texts
+
+         Args:
+             text1: Single text or list of texts
+             text2: Single text or list of texts
+
+         Returns:
+             Similarity scores (cosine similarity)
+         """
+         emb1 = self.encode(text1)
+         emb2 = self.encode(text2)
+
+         # Cosine similarity (already L2-normalized, so just dot product)
+         if emb1.dim() == 1:
+             emb1 = emb1.unsqueeze(0)
+         if emb2.dim() == 1:
+             emb2 = emb2.unsqueeze(0)
+
+         similarity = emb1 @ emb2.T
+
+         return similarity.squeeze()
+
+
+ # Register model for AutoModel
+ try:
+     from transformers import AutoModel, AutoConfig
+     AutoConfig.register("bgem3_projection", BGEM3ProjectionConfig)
+     AutoModel.register(BGEM3ProjectionConfig, BGEM3ProjectionModel)
+ except Exception:
+     # Registration may fail if models are already registered
+     pass
+
training_info.json ADDED
@@ -0,0 +1,56 @@
1
+ {
2
+ "training": {
3
+ "dataset_size": 10384,
4
+ "train_examples": 9345,
5
+ "val_examples": 1039,
6
+ "epochs": 17,
7
+ "best_epoch": 16,
8
+ "batch_size": 128,
9
+ "learning_rate": 0.0002,
10
+ "optimizer": "AdamW",
11
+ "weight_decay": 0.01,
12
+ "device": "Tesla T4 (Google Colab)",
13
+ "training_time": "~2.5 hours"
14
+ },
15
+ "loss": {
16
+ "final_train_loss": 1.9191,
17
+ "best_val_loss": 1.8215122487809923,
18
+ "initial_train_loss": 2.9054,
19
+ "initial_val_loss": 2.4529,
20
+ "improvement": {
21
+ "train": "-34%",
22
+ "val": "-26%"
23
+ }
24
+ },
25
+ "evaluation": {
26
+ "test_examples": 96,
27
+ "metrics": {
28
+ "MRR": 0.9844,
29
+ "Recall@1": 0.9688,
30
+ "Recall@5": 1.0,
31
+ "Recall@10": 1.0,
32
+ "Recall@50": 1.0
33
+ },
34
+ "interpretation": "Excellent retrieval performance. 96.88% of queries find correct match at rank 1."
35
+ },
36
+ "model_details": {
37
+ "base_model": "BAAI/bge-m3",
38
+ "projection_dim": 128,
39
+ "trainable_params": 131072,
40
+ "total_params": 567885824,
41
+ "trainable_ratio": "0.02%",
42
+ "training_strategy": "Frozen encoder + trainable projection head"
43
+ },
44
+ "loss_function": {
45
+ "type": "Weighted InfoNCE",
46
+ "temperature": 0.07,
47
+ "symmetric": true,
48
+ "weighted_hard_negatives": true,
49
+ "feature_weights": {
50
+ "location": 2.5,
51
+ "price": 2.0,
52
+ "area": 1.8,
53
+ "amenity": 1.5
54
+ }
55
+ }
56
+ }