---
tags:
- reranker
- code search
- cross-encoder
- MiniLM
- staqc
- information retrieval
- MRR
- code understanding
- python
- stack-overflow
library_name: sentence-transformers
pipeline_tag: text-classification
license: apache-2.0
model-index:
- name: code-reranker-miniLM-staqc
  results:
  - task:
      type: text-classification
      name: Code Reranking
    dataset:
      name: StaQC (Stack Overflow Question-Code)
      type: custom
    metrics:
    - name: MRR
      type: mean_reciprocal_rank
      value: 0.9380
    - name: Top-1 Accuracy
      type: accuracy
      value: 0.9100
---

# code-reranker-miniLM-staqc

**A cross-encoder fine-tuned from `cross-encoder/ms-marco-MiniLM-L-6-v2` to rerank Python code snippets against natural language queries from Stack Overflow.**

## Model Description

This model is a cross-encoder trained on the StaQC dataset (Stack Overflow Question-Code pairs). Given a programming question or natural language intent and a set of candidate Python code snippets, it scores each question-snippet pair so the candidates can be reordered by relevance. It is fine-tuned specifically for Python code search and retrieval.

* **Architecture**: Cross-Encoder based on MiniLM-L6
* **Base model**: `cross-encoder/ms-marco-MiniLM-L-6-v2`
* **Fine-tuned on**: StaQC (Stack Overflow Question-Code) dataset, SCA (single-code answer) subset
* **Task**: Python code snippet reranking for natural language queries
* **Language**: Python code snippets

## Use Cases

* Python code search engines
* Developer assistants for Python programming
* AI coding agents with natural language interfaces
* Evaluation modules in RAG pipelines for Python programming use cases
* Code recommendation systems

## Evaluation Results

The model was evaluated on 500 query-code candidates from the curated CoNaLa dataset.

| Metric         | Value |
| -------------- | ----- |
| MRR            | 0.938 |
| Top-1 Accuracy | 0.910 |
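
For readers unfamiliar with the two metrics, the following is a minimal sketch of how MRR and Top-1 accuracy are conventionally computed from ranked candidate lists. It is not the evaluation script used for this model; the function names and the toy relevance flags are illustrative assumptions.

```python
def reciprocal_rank(ranked_labels):
    """Return 1/rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mrr_and_top1(all_ranked_labels):
    """Compute MRR and Top-1 accuracy over per-query ranked relevance lists."""
    n = len(all_ranked_labels)
    mrr = sum(reciprocal_rank(labels) for labels in all_ranked_labels) / n
    top1 = sum(1 for labels in all_ranked_labels if labels and labels[0]) / n
    return mrr, top1


# Hypothetical toy data: one list per query, holding relevance flags for the
# candidates after they have been sorted by model score (best first).
toy_rankings = [
    [1, 0, 0],     # relevant snippet ranked 1st
    [0, 1, 0],     # relevant snippet ranked 2nd
    [0, 0, 0, 1],  # relevant snippet ranked 4th
    [1, 0],        # relevant snippet ranked 1st
]

mrr, top1 = mrr_and_top1(toy_rankings)
print(f"MRR: {mrr:.4f}, Top-1 Accuracy: {top1:.4f}")
```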

## How to Use

### Using sentence-transformers

```python
from sentence_transformers import CrossEncoder

# Load the model
model = CrossEncoder("NamanAgnih0tri/code-reranker-miniLM-staqc")

# Sample input
query = "How to convert a string to int in Python?"
code_snippet = "int_value = int('123')"

# Get relevance score
score = model.predict([query, code_snippet])
print(f"Relevance Score: {score:.4f}")
```

### Using transformers directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("NamanAgnih0tri/code-reranker-miniLM-staqc")
model = AutoModelForSequenceClassification.from_pretrained("NamanAgnih0tri/code-reranker-miniLM-staqc")

# Sample input
query = "How to reverse a string in Python?"
code_snippet = "def reverse_string(s):\n    return s[::-1]"

# Tokenize and predict relevance
inputs = tokenizer(query, code_snippet, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
    score = logits[0].item()

print(f"Relevance Score: {score:.4f}")
```
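
The `transformers` path above returns a raw logit, which is unbounded (assuming a single-logit classification head, as in the base model). If a score in the 0 to 1 range is more convenient, a sigmoid can be applied afterwards; because the sigmoid is monotonic, the ranking of candidates is unchanged. The helper below is an illustrative sketch, and this model card does not state that the resulting values are calibrated probabilities.

```python
import math


def sigmoid(logit: float) -> float:
    """Map an unbounded logit to the open interval (0, 1), numerically stably."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


# Hypothetical logits, only to show the mapping.
for raw in (-4.0, 0.0, 4.0):
    print(f"logit {raw:+.1f} -> {sigmoid(raw):.4f}")
```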

### Code Ranking Example

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("NamanAgnih0tri/code-reranker-miniLM-staqc")

def rank_code_snippets(query, candidates):
    """Rank code snippets by relevance to the query."""
    pairs = [[query, code] for code in candidates]
    scores = model.predict(pairs)
    ranked_results = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked_results

# Example usage
query = "How to reverse a string in Python?"
candidates = [
    "def reverse_string(s):\n    return s[::-1]",
    "print('hello'[::-1])",
    "def add(a,b):\n    return a + b",
    "list = [1,2,3,4]"
]

ranked_results = rank_code_snippets(query, candidates)
for rank, (code, score) in enumerate(ranked_results, 1):
    print(f"{rank}. Score: {score:.4f}\n{code}\n")
```
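
In a retrieval pipeline the reranker typically sees only a shortlist from a cheaper first stage and returns the best `k` items. The sketch below shows that pattern with the scorer passed in as a callable, so `model.predict` can be supplied in production and a stub in tests. The names `rerank_top_k`, `score_fn`, and `overlap_scorer` are hypothetical and not part of any library.

```python
def rerank_top_k(query, candidates, score_fn, k=3):
    """Score each (query, candidate) pair and return the k best as (code, score)."""
    pairs = [[query, code] for code in candidates]
    scores = [float(s) for s in score_fn(pairs)]
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]


def overlap_scorer(pairs):
    """Hypothetical stand-in for model.predict: counts shared lowercase tokens."""
    scores = []
    for query, code in pairs:
        query_tokens = set(query.lower().split())
        code_tokens = set(code.lower().split())
        scores.append(float(len(query_tokens & code_tokens)))
    return scores


shortlist = ["def reverse a string", "add numbers", "a string"]
top = rerank_top_k("reverse a string", shortlist, overlap_scorer, k=2)
for code, score in top:
    print(f"{score:.1f}  {code}")
```

With the real model, `score_fn=model.predict` should work in place of the stub, since `predict` accepts a list of pairs and returns one score per pair.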

## Dataset

* **StaQC SCA**: the single-code answer (SCA) subset of StaQC (Stack Overflow Question-Code pairs)
* Each pair consists of a natural language programming question and a corresponding Python code snippet
* Positive and negative pairs were used for contrastive fine-tuning
* Dataset contains 85,294 training examples
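
The card does not say how the negative pairs were built. One common approach, shown here purely as an assumption, is to pair each question with a code snippet drawn at random from a different question, which doubles the number of samples. The function name and toy data are illustrative.

```python
import random


def build_labeled_pairs(positives, seed=0):
    """Turn (question, code) positives into labeled positive and negative samples.

    Assumption: one random negative per positive, taken from another question.
    """
    if len(positives) < 2:
        raise ValueError("Need at least two positives to sample a negative.")
    rng = random.Random(seed)
    n = len(positives)
    samples = []
    for i, (question, code) in enumerate(positives):
        samples.append((question, code, 1.0))
        # Draw an index in [0, n-1) and shift it past i so that j != i.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        samples.append((question, positives[j][1], 0.0))
    return samples


# Hypothetical toy positives.
toy_positives = [
    ("How to reverse a string?", "s[::-1]"),
    ("How to convert a string to int?", "int('123')"),
    ("How to add two numbers?", "a + b"),
]
samples = build_labeled_pairs(toy_positives)
print(len(samples), "samples from", len(toy_positives), "positives")
```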

## Training Details

* **Base Model**: `cross-encoder/ms-marco-MiniLM-L-6-v2`
* **Optimizer**: AdamW
* **Epochs**: 3
* **Batch size**: 8
* **Learning rate**: 2e-5
* **Loss**: Cosine Similarity Loss
* **Training samples**: 170,588 (the 85,294 positive pairs plus an equal number of negative samples)
* **Warmup steps**: 10% of total training steps
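
As a worked example, the hyperparameters above imply the following step counts. This assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these is stated in the card.

```python
import math

train_samples = 170_588
batch_size = 8
epochs = 3
warmup_fraction = 0.10

# Optimizer steps per epoch, counting the last partial batch.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_fraction)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup steps:    {warmup_steps}")
```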

## Model Performance Comparison

| Model | MRR | Top-1 Accuracy |
|-------|-----|----------------|
| **code-reranker-miniLM-staqc** | **0.938** | **0.910** |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.895 | 0.844 |
| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 0.823 | 0.756 |

## Limitations

* Trained specifically on Python code snippets; may not generalize well to other programming languages
* Model is relatively small; performance may lag behind larger rerankers on complex queries
* Fine-tuned on Stack Overflow-like questions; may not generalize to code from other domains
* Limited to text-based code snippets; does not handle complex code structures or dependencies

## Citation

If you use this model in your work, please cite it as:

```bibtex
@misc{code-reranker-miniLM-staqc,
  title={Code Reranker using MiniLM and StaQC for Python Code Search},
  author={Naman Agnihotri},
  year={2025},
  howpublished={\url{https://huggingface.co/NamanAgnih0tri/code-reranker-miniLM-staqc}}
}
```

## Author

* **Name**: Naman Agnihotri
* **Contact**: [LinkedIn](https://www.linkedin.com/in/namanagnihotri)
* **GitHub**: [NamanAgnih0tri](https://github.com/NamanAgnih0tri)

## License

This model is licensed under the Apache 2.0 License.