Upload folder using huggingface_hub

Files changed (7)

README.md +313 -0
UPLOAD_INSTRUCTIONS.md +193 -0
__pycache__/modeling_bgem3_projection.cpython-310.pyc +0 -0
config.json +14 -0
model.safetensors +3 -0
modeling_bgem3_projection.py +309 -0
training_info.json +56 -0

README.md ADDED

	@@ -0,0 +1,313 @@

+---
+language:
+- vi
+license: mit
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+- transformers
+- embeddings
+- vietnamese
+- rental
+- real-estate
+library_name: transformers
+pipeline_tag: feature-extraction
+---
+# BGE-M3 Vietnamese Rental Property Search
+Fine-tuned projection head for [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for **Vietnamese rental property search** (*phòng trọ*, i.e. rented rooms).
+This model adds a lightweight trainable projection head (128 dimensions) on top of the frozen BGE-M3 encoder, trained with **weighted hard negatives** using contrastive learning (InfoNCE loss).
+## 🎯 Model Description
+- **Base Model**: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) (frozen)
+- **Task**: Semantic search for Vietnamese rental properties
+- **Training Strategy**: Weighted contrastive learning with hard negatives
+- **Output Dimension**: 128 (projected from 1024)
+- **Training Data**: 10,384 Vietnamese rental property query-document pairs
+## 📊 Performance
+Evaluated on 96 test examples:
+| Metric | Score |
+|--------|-------|
+| **MRR** | **98.44%** |
+| **Recall@1** | **96.88%** |
+| **Recall@5** | **100.00%** |
+| **Recall@10** | **100.00%** |
+| **Recall@50** | **100.00%** |
+### Interpretation
+- **98.44% MRR**: The correct match is almost always at rank 1; combined with Recall@1, this MRR implies that the 3 remaining queries find their match at rank 2 (mean rank ≈ 1.03)
+- **96.88% Recall@1**: 93 out of 96 queries find the correct match at the top position
+- **100% Recall@5+**: All queries find their correct match within top-5 results
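The reported figures fit a single rank distribution: with 96 queries, Recall@1 = 96.88% means 93 hits at rank 1, and an MRR of 98.44% then requires the other 3 to sit at rank 2. The sketch below recomputes the metrics from such a rank list. The list is reconstructed from the reported numbers, not taken from the evaluation logs, and the function names are illustrative.

```python
def mrr(ranks):
    """Mean reciprocal rank for 1-based ranks of the correct document."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct document is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Assumed rank list: 93 queries at rank 1, 3 queries at rank 2 (96 in total).
ranks = [1] * 93 + [2] * 3

print(f"MRR       = {mrr(ranks):.4f}")
print(f"Recall@1  = {recall_at_k(ranks, 1):.4f}")
print(f"Recall@5  = {recall_at_k(ranks, 5):.4f}")
print(f"Mean rank = {sum(ranks) / len(ranks):.4f}")
```

Note that the reciprocal of the MRR (about 1.02) is a harmonic-style average, which is why it differs slightly from the arithmetic mean rank.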
+## 🚀 Quick Start
+### Installation
+```bash
+pip install transformers torch safetensors
+```
+### Usage
+```python
+from transformers import AutoModel, AutoTokenizer
+import torch
+# Load model
+model = AutoModel.from_pretrained(
+    "your-username/bge-m3-vietnamese-rental-projection",
+    trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
+# Move to GPU if available
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = model.to(device)
+model.eval()
+# Encode texts
+texts = [
+    "Phòng trọ Quận 10, 25m², giá 5 triệu, WC riêng, máy lạnh",
+    "Cho thuê phòng Bình Thạnh, 20m², 4 triệu/tháng"
+]
+# Method 1: Using encode (recommended)
+embeddings = model.encode(texts, device=device)  # [2, 128]
+# Method 2: Using forward
+inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
+inputs = {k: v.to(device) for k, v in inputs.items()}
+with torch.no_grad():
+    outputs = model(**inputs)
+    embeddings = outputs.last_hidden_state  # [2, 128], L2-normalized
+print(embeddings.shape)  # torch.Size([2, 128])
+# Compute similarity (cosine)
+similarity = embeddings[0] @ embeddings[1]
+print(f"Similarity: {similarity:.4f}")
+```
+### Search Example
+```python
+# Build a search engine
+class RentalSearchEngine:
+    def __init__(self, model, tokenizer, device="cuda"):
+        self.model = model
+        self.tokenizer = tokenizer
+        self.device = device
+        self.database_embeddings = None
+        self.database_texts = None
+    def index(self, property_descriptions):
+        """Index a database of property descriptions"""
+        self.database_texts = property_descriptions
+        self.database_embeddings = self.model.encode(
+            property_descriptions,
+            device=self.device
+        )
+    def search(self, query, top_k=5):
+        """Search for most similar properties"""
+        query_emb = self.model.encode([query], device=self.device)[0]
+        # Compute similarities
+        similarities = query_emb @ self.database_embeddings.T
+        # Get top-k
+        top_k = min(top_k, len(similarities))
+        scores, indices = torch.topk(similarities, k=top_k)
+        results = []
+        for idx, score in zip(indices.tolist(), scores.tolist()):
+            results.append({
+                "text": self.database_texts[idx],
+                "score": score
+            })
+        return results
+# Example usage
+engine = RentalSearchEngine(model, tokenizer, device)
+# Index properties
+properties = [
+    "Phòng trọ 25m² Quận 10, WC riêng, máy lạnh, giá 5.5tr/tháng",
+    "Cho thuê phòng 30m² Quận 1, full nội thất, giá 8tr/tháng",
+    "Phòng 20m² Thủ Đức, WC chung, giá 3.5tr/tháng",
+    "Studio 35m² Quận 3, ban công, bếp riêng, giá 9tr/tháng",
+]
+engine.index(properties)
+# Search
+results = engine.search("phòng trọ q10 25m2 wc riêng 5tr5", top_k=3)
+for i, result in enumerate(results, 1):
+    print(f"{i}. [{result['score']:.4f}] {result['text']}")
+```
+## 🎓 Training Details
+### Dataset
+- **Size**: 10,384 examples
+- **Split**: 9,345 train / 1,039 validation
+- **Format**: (query, positive, hard negatives) tuples
+- **Hard Negatives**: 3 per example, weighted by feature type
+### Weighted Hard Negatives Strategy
+The model uses feature-based weighting for hard negatives:
+| Feature Type | Weight | Importance |
+|--------------|--------|------------|
+| Location (Quận) | 2.5 | Highest |
+| Price | 2.0 | High |
+| Area (m²) | 1.8 | Medium |
+| Amenities | 1.5 | Lower |
+This teaches the model that location mismatches are more critical than amenity differences.
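One way such weights could enter the loss is as multipliers on the negatives' terms in the InfoNCE denominator. The README does not spell out the exact formulation, so the function below is an assumption, and `weighted_info_nce` and its arguments are hypothetical names. It covers a single query with explicit hard negatives only (no in-batch negatives, no symmetric term), uses the temperature from the training configuration, and borrows the similarities of Example 1 under "Example Predictions" purely as sample inputs.

```python
import math


def weighted_info_nce(pos_sim, neg_sims, neg_weights, temperature=0.07):
    """Single-query InfoNCE loss with per-negative weights (assumed formulation).

    loss = -log( exp(s_pos/t) / (exp(s_pos/t) + sum_i w_i * exp(s_i/t)) )
    """
    pos = math.exp(pos_sim / temperature)
    neg = sum(w * math.exp(s / temperature) for s, w in zip(neg_sims, neg_weights))
    return -math.log(pos / (pos + neg))


# Feature weights from the table above.
WEIGHTS = {"location": 2.5, "price": 2.0, "area": 1.8, "amenity": 1.5}

pos_sim = 0.947
neg_sims = [0.366, 0.411, 0.828]  # wrong district, wrong price, wrong amenity
neg_types = ["location", "price", "amenity"]

weighted = weighted_info_nce(pos_sim, neg_sims, [WEIGHTS[t] for t in neg_types])
unweighted = weighted_info_nce(pos_sim, neg_sims, [1.0] * len(neg_sims))
print(f"weighted loss:   {weighted:.4f}")
print(f"unweighted loss: {unweighted:.4f}")
```

Under this formulation a weight above 1 makes a negative count as if it were more similar to the query than it is, so the gradient pushes it away harder; a location mismatch (2.5) is penalized more than an amenity mismatch (1.5) at the same similarity.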
+### Training Configuration
+```json
+{
+  "base_model": "BAAI/bge-m3",
+  "d_out": 128,
+  "freeze_encoder": true,
+  "epochs": 17,
+  "batch_size": 128,
+  "learning_rate": 0.0002,
+  "optimizer": "AdamW",
+  "weight_decay": 0.01,
+  "loss": "Weighted InfoNCE (symmetric)",
+  "temperature": 0.07,
+  "device": "Tesla T4 (Google Colab)",
+  "training_time": "~2.5 hours"
+}
+```
+### Training Progress
+| Epoch | Train Loss | Val Loss | Status |
+|-------|------------|----------|--------|
+| 1 | 2.9054 | 2.4529 | ⭐ Best |
+| 5 | 2.1609 | 2.0078 | ⭐ Best |
+| 9 | 2.0237 | 1.8906 | ⭐ Best |
+| 12 | 1.9722 | 1.8760 | ⭐ Best |
+| **16** | **1.9297** | **1.8215** | ⭐ **Best** |
+| 17 | 1.9191 | 1.8276 | Final |
+**Improvement** (relative to epoch 1): -34% train loss at epoch 17, -26% validation loss at the best epoch (16)
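The percentages can be rechecked from the table rows. A quick sketch; which end point was used for validation is inferred from the numbers (the best validation loss at epoch 16 rounds to -26%, the final one at epoch 17 would round to -25%).

```python
def relative_change(start, end):
    """Relative change from start to end, as a percentage."""
    return (end - start) / start * 100.0


train_change = relative_change(2.9054, 1.9191)  # epoch 1 -> epoch 17 (final)
val_change = relative_change(2.4529, 1.8215)    # epoch 1 -> epoch 16 (best)

print(f"train: {train_change:.1f}%")
print(f"val:   {val_change:.1f}%")
```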
+### Model Architecture
+```
+BAAI/bge-m3 (frozen)
+    ↓ [1024-dim]
+ProjectionHead
+    ├─ Linear(1024 → 128, bias=False)
+    └─ L2 Normalization
+    ↓ [128-dim, L2-normalized]
+Output Embeddings
+```
+**Parameters**:
+- Trainable: 131,072 (0.02%)
+- Total: 567,885,824
+- Strategy: Only projection head is trainable
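The trainable count follows directly from the shape of the projection layer. A short arithmetic check, using only the dimensions and the total reported above:

```python
d_in, d_out = 1024, 128
total_params = 567_885_824  # total reported above

# Linear(1024 -> 128, bias=False) has a weight matrix and no bias vector.
trainable = d_in * d_out
encoder_params = total_params - trainable
ratio = trainable / total_params * 100

print(f"trainable: {trainable:,}")
print(f"encoder:   {encoder_params:,}")
print(f"ratio:     {ratio:.3f}%")
```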
+## 🎯 Use Cases
+This model is optimized for:
+✅ **Vietnamese rental property search**
+- Matching user queries to property listings
+- Finding similar properties
+- Semantic search for rental accommodations
+✅ **Supported features**:
+- Location (districts, neighborhoods)
+- Price range
+- Area/size (m²)
+- Amenities (WC, máy lạnh, ban công, bếp, etc.)
+- Room type (phòng trọ, studio, etc.)
+## ⚠️ Limitations
+- **Domain-specific**: Optimized for Vietnamese rental properties only
+- **Geographic focus**: Primarily trained on properties in Ho Chi Minh City and Hanoi
+- **Language**: Vietnamese only (not multilingual like base BGE-M3)
+- **Frozen encoder**: Base BGE-M3 encoder is not fine-tuned, only projection head
+- **Not for**: General-purpose Vietnamese embeddings or other domains
+## 🔍 Example Predictions
+### Example 1: Location Sensitivity
+```
+Query: "phòng trọ Gò Vấp 18m² 3tr5 có wc riêng"
+Positive (0.947):  Gò Vấp 18m² 3tr5 wc riêng ✅
+Negative 1 (0.366): Quận 12 18m² 3tr5 wc riêng (wrong district!)
+Negative 2 (0.411): Gò Vấp 18m² 4tr2 wc riêng (wrong price)
+Negative 3 (0.828): Gò Vấp 18m² 3tr5 wc chung (wrong amenity)
+→ Model correctly penalizes location mismatch most heavily
+```
+### Example 2: Feature Understanding
+```
+Query: "phòng trọ q10 4tr 20m² có máy lạnh wc riêng gần chợ"
+Positive (0.904):  Q10 20m² 4tr máy lạnh wc riêng ✅
+Negative 1 (0.542): Q3 20m² 4tr máy lạnh wc riêng (wrong district)
+Negative 2 (0.418): Q10 20m² 5.5tr máy lạnh wc riêng (wrong price)
+Negative 3 (0.257): Q10 15m² 4tr máy lạnh wc chung (multiple diffs)
+→ Strong margin (+0.36) between positive and top negative
+```
+## 📖 Citation
+If you use this model, please cite:
+```bibtex
+@misc{bge-m3-vietnamese-rental,
+  author = {Your Name},
+  title = {BGE-M3 Vietnamese Rental Property Search},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/your-username/bge-m3-vietnamese-rental-projection}},
+}
+```
+## 📜 License
+MIT License - Free to use for commercial and non-commercial purposes.
+## 🙏 Acknowledgments
+- Base model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
+- Framework: [Hugging Face Transformers](https://github.com/huggingface/transformers)
+- Training: Google Colab (Tesla T4)
+## 📧 Contact
+For questions or feedback, please open an issue on the model repository.
+---
+**Last updated**: October 2025

UPLOAD_INSTRUCTIONS.md ADDED

	@@ -0,0 +1,193 @@

+# Hugging Face Hub Upload Instructions
+## Files Ready for Upload
+All files are in the `hf_upload/` directory:
+```
+hf_upload/
+├── model.safetensors          # Projection head weights (512 KB)
+├── config.json                 # Model configuration
+├── modeling_bgem3_projection.py # Model class definition
+├── training_info.json          # Training metrics and details
+└── README.md                   # Model Card
+```
+## Step-by-Step Upload Process
+### 1. Install Hugging Face CLI (if not already installed)
+```bash
+pip install huggingface_hub
+```
+### 2. Login to Hugging Face
+```bash
+huggingface-cli login
+```
+Enter your Hugging Face token when prompted. Get your token from: https://huggingface.co/settings/tokens
+### 3. Create Repository
+```bash
+huggingface-cli repo create bge-m3-vietnamese-rental-projection --type model
+```
+This creates a new model repository: `https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection`
+### 4. Upload Files
+#### Option A: Using huggingface-cli (Recommended)
+```bash
+cd hf_upload
+# Upload all files at once, skipping Python bytecode caches
+huggingface-cli upload YOUR_USERNAME/bge-m3-vietnamese-rental-projection . . --repo-type model --exclude "__pycache__/*"
+```
+#### Option B: Using Git
+```bash
+cd hf_upload
+# Clone the empty repo
+git clone https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection
+cd bge-m3-vietnamese-rental-projection
+# Copy files
+cp ../model.safetensors .
+cp ../config.json .
+cp ../modeling_bgem3_projection.py .
+cp ../training_info.json .
+cp ../README.md .
+# Commit and push
+git add .
+git commit -m "Initial upload: BGE-M3 Vietnamese rental projection head"
+git push
+```
+#### Option C: Using Python
+```python
+from huggingface_hub import HfApi
+api = HfApi()
+# Upload each file
+api.upload_file(
+    path_or_fileobj="model.safetensors",
+    path_in_repo="model.safetensors",
+    repo_id="YOUR_USERNAME/bge-m3-vietnamese-rental-projection",
+    repo_type="model",
+)
+# Repeat for other files...
+```
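Rather than repeating `upload_file` for each file, `HfApi.upload_folder` can send the whole directory in one commit. A minimal sketch, assuming the files live in `hf_upload/`; the helper name `upload_model_folder` and the ignore list are illustrative choices, not part of the original instructions.

```python
# Files that exist locally but should not end up in the model repository.
IGNORE_PATTERNS = ["__pycache__/*", "*.pyc", "UPLOAD_INSTRUCTIONS.md"]


def upload_model_folder(repo_id, folder_path="hf_upload"):
    """Upload every file in folder_path in one commit, skipping local-only files."""
    # Imported lazily so the module loads without the dependency installed;
    # requires `pip install huggingface_hub`.
    from huggingface_hub import HfApi

    api = HfApi()
    return api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type="model",
        ignore_patterns=IGNORE_PATTERNS,
    )


# Example call (needs a prior `huggingface-cli login`):
# upload_model_folder("YOUR_USERNAME/bge-m3-vietnamese-rental-projection")
```

The ignore list keeps bytecode caches and this instruction file out of the repository, so the Files tab ends up with only the five files listed at the top.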
+### 5. Update README.md
+Complete this step before uploading in step 4. Update `README.md` with your Hugging Face username:
+1. Replace `your-username` with your actual username (appears 2 times)
+2. Update the citation section with your name
+3. Add your contact information if desired
+### 6. Verify Upload
+After uploading, visit:
+```
+https://huggingface.co/YOUR_USERNAME/bge-m3-vietnamese-rental-projection
+```
+You should see:
+- ✅ Model Card (README.md) displayed
+- ✅ Files tab shows all 5 files
+- ✅ Model can be loaded with `from_pretrained()`
+### 7. Test Download (Important!)
+```python
+from transformers import AutoTokenizer
+import sys
+sys.path.insert(0, "path/to/hf_upload")  # Make the local modeling file importable
+# Import model class
+from modeling_bgem3_projection import BGEM3ProjectionModel, BGEM3ProjectionConfig
+# Load from Hub
+config = BGEM3ProjectionConfig.from_pretrained(
+    "YOUR_USERNAME/bge-m3-vietnamese-rental-projection"
+)
+model = BGEM3ProjectionModel.from_pretrained(
+    "YOUR_USERNAME/bge-m3-vietnamese-rental-projection",
+    config=config,
+    trust_remote_code=True
+)
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
+# Test encoding
+texts = ["Phòng trọ Quận 10, 25m², giá 5tr"]
+embeddings = model.encode(texts)
+print(f"Embeddings shape: {embeddings.shape}")  # Should be [1, 128]
+```
+## Troubleshooting
+### Issue: "trust_remote_code" error
+**Solution**: Make sure to use `trust_remote_code=True` when loading the model.
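A further possible cause, beyond a missing flag: `AutoModel` locates custom code on the Hub through an `auto_map` entry in `config.json`, and the `config.json` in this upload does not contain one. The sketch below adds such an entry to a config dictionary. The helper name `add_auto_map` is hypothetical; the module and class names are taken from `modeling_bgem3_projection.py`.

```python
import json


def add_auto_map(config, module="modeling_bgem3_projection"):
    """Return a copy of the config dict with an auto_map entry for the custom classes."""
    patched = dict(config)
    patched["auto_map"] = {
        "AutoConfig": f"{module}.BGEM3ProjectionConfig",
        "AutoModel": f"{module}.BGEM3ProjectionModel",
    }
    return patched


# Abbreviated copy of the uploaded config.json, for illustration only.
config = {
    "model_type": "bgem3_projection",
    "base_model": "BAAI/bge-m3",
    "d_in": 1024,
    "d_out": 128,
}
print(json.dumps(add_auto_map(config), indent=2))
```

The Transformers documentation describes an equivalent route: calling `register_for_auto_class()` on the config class and `register_for_auto_class("AutoModel")` on the model class before `save_pretrained`, which writes the entry automatically. Loading through the class directly, as in step 7, avoids the issue altogether.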
+### Issue: Weight loading warnings
+The warnings about encoder weights not being initialized are **expected**. We only upload projection head weights; the encoder is loaded from BAAI/bge-m3 separately.
+### Issue: NumPy version error
+**Solution**: Use `pip install "numpy<2.0"` if you encounter TensorFlow compatibility issues.
+## Additional Configuration
+### Add Model Tags
+You can add tags to your model page for better discoverability. In the README.md front matter:
+```yaml
+---
+language:
+- vi
+tags:
+- sentence-transformers
+- vietnamese
+- rental
+- real-estate
+- bge-m3
+---
+```
+### Add to a Collection
+Consider adding your model to Vietnamese NLP or real estate collections on Hugging Face.
+## License
+The model is released under the MIT License. Make sure this is acceptable for your use case.
+## Support
+For issues or questions:
+- Open an issue on the model repository
+- Contact Hugging Face support
+- Check Hugging Face documentation: https://huggingface.co/docs
+---
+**Ready to upload!** 🚀
+Follow the steps above and your model will be publicly available for the community to use.

__pycache__/modeling_bgem3_projection.cpython-310.pyc ADDED

Binary file (8.69 kB).

config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "model_type": "bgem3_projection",
+  "base_model": "BAAI/bge-m3",
+  "d_in": 1024,
+  "d_out": 128,
+  "use_layernorm": false,
+  "freeze_encoder": true,
+  "max_length": 512,
+  "architectures": [
+    "BGEM3ProjectionModel"
+  ],
+  "torch_dtype": "float32",
+  "transformers_version": "4.36.0"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a6dbefbe72459fd1c0592887da680af20cce946e7b2cdb7f00f891e624420e53
+size 524464
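The pointer's size is consistent with a file that holds only the projection weights. A quick check; attributing the remainder to the safetensors header (an 8-byte length prefix plus JSON metadata) is an inference from the file format, not something stated in the commit.

```python
trainable_params = 1024 * 128  # projection weight matrix, no bias
bytes_per_param = 4            # torch_dtype is float32 in config.json
file_size = 524_464            # "size" field of the LFS pointer

tensor_bytes = trainable_params * bytes_per_param
overhead = file_size - tensor_bytes

print(f"tensor data: {tensor_bytes} bytes")
print(f"remainder:   {overhead} bytes")
```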

modeling_bgem3_projection.py ADDED

	@@ -0,0 +1,309 @@

+"""
+BGE-M3 Projection Model for Hugging Face Transformers
+A lightweight projection head trained on top of frozen BGE-M3 encoder
+for Vietnamese rental property search.
+"""
+from typing import List, Optional, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import AutoModel, AutoTokenizer
+from transformers import PretrainedConfig, PreTrainedModel
+from transformers.modeling_outputs import BaseModelOutput
+class BGEM3ProjectionConfig(PretrainedConfig):
+    """
+    Configuration class for BGEM3ProjectionModel
+    Args:
+        base_model (str): Base model identifier (default: "BAAI/bge-m3")
+        d_in (int): Input dimension from base encoder (default: 1024)
+        d_out (int): Output dimension after projection (default: 128)
+        use_layernorm (bool): Whether to use LayerNorm in projection head
+        freeze_encoder (bool): Whether to freeze the base encoder
+        max_length (int): Maximum sequence length for tokenization
+    """
+    model_type = "bgem3_projection"
+    def __init__(
+        self,
+        base_model: str = "BAAI/bge-m3",
+        d_in: int = 1024,
+        d_out: int = 128,
+        use_layernorm: bool = False,
+        freeze_encoder: bool = True,
+        max_length: int = 512,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.base_model = base_model
+        self.d_in = d_in
+        self.d_out = d_out
+        self.use_layernorm = use_layernorm
+        self.freeze_encoder = freeze_encoder
+        self.max_length = max_length
+def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+    """
+    Mean pooling with attention mask
+    Args:
+        last_hidden_state: [batch_size, seq_len, hidden_size]
+        attention_mask: [batch_size, seq_len]
+    Returns:
+        pooled: [batch_size, hidden_size]
+    """
+    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # [B, T, 1]
+    summed = (last_hidden_state * mask).sum(dim=1)  # [B, H]
+    counts = mask.sum(dim=1).clamp(min=1e-6)  # [B, 1]
+    return summed / counts
+class ProjectionHead(nn.Module):
+    """
+    Projection head: Linear + Optional LayerNorm + L2 Normalization
+    """
+    def __init__(self, d_in: int, d_out: int, use_layernorm: bool = False):
+        super().__init__()
+        self.linear = nn.Linear(d_in, d_out, bias=False)
+        self.ln = nn.LayerNorm(d_out) if use_layernorm else None
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            x: [batch_size, d_in]
+        Returns:
+            [batch_size, d_out] L2-normalized
+        """
+        x = self.linear(x)
+        if self.ln is not None:
+            x = self.ln(x)
+        # L2 normalize for cosine similarity
+        x = F.normalize(x, p=2, dim=-1)
+        return x
+class BGEM3ProjectionModel(PreTrainedModel):
+    """
+    BGE-M3 with trainable projection head
+    This model combines:
+    1. Frozen BGE-M3 encoder (1024-dim embeddings)
+    2. Trainable projection head (1024 -> d_out, default 128)
+    Usage:
+        >>> from transformers import AutoModel, AutoTokenizer
+        >>>
+        >>> model = AutoModel.from_pretrained("your-username/bge-m3-vietnamese-rental-projection", trust_remote_code=True)
+        >>> tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
+        >>>
+        >>> # Encode texts
+        >>> texts = ["Phòng trọ Quận 10, 25m2, giá 5tr"]
+        >>> inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
+        >>> embeddings = model(**inputs).last_hidden_state
+        >>>
+        >>> # Or use the encode method
+        >>> embeddings = model.encode(texts)
+    """
+    config_class = BGEM3ProjectionConfig
+    base_model_prefix = "bgem3_projection"
+    supports_gradient_checkpointing = False
+    def __init__(self, config: BGEM3ProjectionConfig):
+        super().__init__(config)
+        self.config = config
+        # Load base encoder
+        self.encoder = AutoModel.from_pretrained(config.base_model)
+        # Freeze encoder if specified
+        if config.freeze_encoder:
+            for param in self.encoder.parameters():
+                param.requires_grad = False
+        # Projection head (trainable)
+        self.head = ProjectionHead(
+            d_in=config.d_in,
+            d_out=config.d_out,
+            use_layernorm=config.use_layernorm
+        )
+        # Initialize tokenizer (for convenience)
+        self._tokenizer = None
+    @property
+    def tokenizer(self):
+        """Lazy load tokenizer"""
+        if self._tokenizer is None:
+            self._tokenizer = AutoTokenizer.from_pretrained(
+                self.config.base_model,
+                use_fast=True
+            )
+        return self._tokenizer
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[tuple, BaseModelOutput]:
+        """
+        Forward pass through encoder and projection head
+        Returns:
+            BaseModelOutput with last_hidden_state = projected embeddings [batch_size, d_out]
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        # Encode with base model
+        with torch.set_grad_enabled(torch.is_grad_enabled() and not self.config.freeze_encoder):
+            encoder_outputs = self.encoder(
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+                token_type_ids=token_type_ids,
+                position_ids=position_ids,
+                head_mask=head_mask,
+                inputs_embeds=inputs_embeds,
+                output_attentions=output_attentions,
+                output_hidden_states=output_hidden_states,
+                return_dict=True,
+            )
+        # Mean pooling
+        pooled = mean_pool(
+            encoder_outputs.last_hidden_state,
+            attention_mask
+        )  # [batch_size, 1024]
+        # Project to d_out
+        projected = self.head(pooled)  # [batch_size, d_out], L2-normalized
+        if not return_dict:
+            return (projected,)
+        return BaseModelOutput(
+            last_hidden_state=projected,
+            hidden_states=encoder_outputs.hidden_states if output_hidden_states else None,
+            attentions=encoder_outputs.attentions if output_attentions else None,
+        )
+    @torch.no_grad()
+    def encode(
+        self,
+        texts: Union[str, List[str]],
+        batch_size: int = 32,
+        max_length: Optional[int] = None,
+        show_progress: bool = False,
+        device: Optional[torch.device] = None,
+    ) -> torch.Tensor:
+        """
+        Encode texts to embeddings (convenience method)
+        Args:
+            texts: Single text or list of texts
+            batch_size: Batch size for encoding
+            max_length: Maximum sequence length (default: config.max_length)
+            show_progress: Show progress bar
+            device: Target device (default: model device)
+        Returns:
+            Tensor of shape [num_texts, d_out], L2-normalized
+        """
+        if isinstance(texts, str):
+            texts = [texts]
+        if device is None:
+            device = next(self.parameters()).device
+        if max_length is None:
+            max_length = self.config.max_length
+        self.eval()
+        all_embeddings = []
+        # Optional progress bar
+        iterator = range(0, len(texts), batch_size)
+        if show_progress:
+            try:
+                from tqdm import tqdm
+                iterator = tqdm(iterator, desc="Encoding")
+            except ImportError:
+                pass
+        for i in iterator:
+            batch_texts = texts[i:i + batch_size]
+            # Tokenize
+            inputs = self.tokenizer(
+                batch_texts,
+                padding=True,
+                truncation=True,
+                max_length=max_length,
+                return_tensors="pt"
+            )
+            # Move to device
+            inputs = {k: v.to(device) for k, v in inputs.items()}
+            # Forward pass
+            outputs = self.forward(**inputs)
+            embeddings = outputs.last_hidden_state
+            all_embeddings.append(embeddings.cpu())
+        return torch.cat(all_embeddings, dim=0)
+    def compute_similarity(
+        self,
+        text1: Union[str, List[str]],
+        text2: Union[str, List[str]],
+    ) -> torch.Tensor:
+        """
+        Compute cosine similarity between texts
+        Args:
+            text1: Single text or list of texts
+            text2: Single text or list of texts
+        Returns:
+            Similarity scores (cosine similarity)
+        """
+        emb1 = self.encode(text1)
+        emb2 = self.encode(text2)
+        # Cosine similarity (already L2-normalized, so just dot product)
+        if emb1.dim() == 1:
+            emb1 = emb1.unsqueeze(0)
+        if emb2.dim() == 1:
+            emb2 = emb2.unsqueeze(0)
+        similarity = emb1 @ emb2.T
+        return similarity.squeeze()
+# Register the classes with AutoConfig/AutoModel; effective once this module is imported
+try:
+    from transformers import AutoConfig
+    AutoConfig.register("bgem3_projection", BGEM3ProjectionConfig)
+    AutoModel.register(BGEM3ProjectionConfig, BGEM3ProjectionModel)
+except Exception:
+    # Registration raises if the classes are already registered (e.g. on re-import)
+    pass
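The pooling and normalization steps in `mean_pool` and `ProjectionHead.forward` can be traced by hand on a tiny input. The sketch below mirrors them with plain Python lists instead of tensors and leaves out the linear projection; the function names `mean_pool_lists` and `l2_normalize` are illustrative and do not exist in the module.

```python
import math


def mean_pool_lists(hidden, mask):
    """Average token vectors where mask == 1; hidden is [seq_len][dim], mask is [seq_len]."""
    dim = len(hidden[0])
    count = max(sum(mask), 1e-6)  # same guard against division by zero as the tensor version
    return [sum(h[d] * m for h, m in zip(hidden, mask)) / count for d in range(dim)]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length, as F.normalize(x, p=2, dim=-1) does."""
    norm = max(math.sqrt(sum(v * v for v in vec)), eps)
    return [v / norm for v in vec]


# One sequence of three tokens with 2-dim states; the last token is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool_lists(hidden, mask)  # the padded token does not contribute
unit = l2_normalize(pooled)
print(pooled, unit)
```

Because the outputs have unit length, the dot product used in `compute_similarity` equals cosine similarity.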

training_info.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "training": {
+    "dataset_size": 10384,
+    "train_examples": 9345,
+    "val_examples": 1039,
+    "epochs": 17,
+    "best_epoch": 16,
+    "batch_size": 128,
+    "learning_rate": 0.0002,
+    "optimizer": "AdamW",
+    "weight_decay": 0.01,
+    "device": "Tesla T4 (Google Colab)",
+    "training_time": "~2.5 hours"
+  },
+  "loss": {
+    "final_train_loss": 1.9191,
+    "best_val_loss": 1.8215122487809923,
+    "initial_train_loss": 2.9054,
+    "initial_val_loss": 2.4529,
+    "improvement": {
+      "train": "-34%",
+      "val": "-26%"
+    }
+  },
+  "evaluation": {
+    "test_examples": 96,
+    "metrics": {
+      "MRR": 0.9844,
+      "Recall@1": 0.9688,
+      "Recall@5": 1.0,
+      "Recall@10": 1.0,
+      "Recall@50": 1.0
+    },
+    "interpretation": "Excellent retrieval performance. 96.88% of queries find correct match at rank 1."
+  },
+  "model_details": {
+    "base_model": "BAAI/bge-m3",
+    "projection_dim": 128,
+    "trainable_params": 131072,
+    "total_params": 567885824,
+    "trainable_ratio": "0.02%",
+    "training_strategy": "Frozen encoder + trainable projection head"
+  },
+  "loss_function": {
+    "type": "Weighted InfoNCE",
+    "temperature": 0.07,
+    "symmetric": true,
+    "weighted_hard_negatives": true,
+    "feature_weights": {
+      "location": 2.5,
+      "price": 2.0,
+      "area": 1.8,
+      "amenity": 1.5
+    }
+  }
+}