---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- skill-extraction
- job-description
- skill-matching
- workforce-analytics
- hr-tech
- talent-management
- semantic-search
- text-embedding
- skills-taxonomy
- skillsfuture
- singapore
- dense
- generated_from_trainer
- dataset_size:21958
- loss:CosineSimilarityLoss
- custom_code
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- imocha-ai-org/ssf-skill-extraction-pairs
model-index:
- name: ssf-miniLM-finetuned-v2
  results:
  - task:
      type: semantic-similarity
      name: Skill-to-Sentence Matching
    metrics:
    - type: AUC
      value: 0.995
      name: AUC (Held-Out 10%)
    - type: accuracy
      value: 0.971
      name: Best Accuracy
    - type: accuracy
      value: 0.968
      name: Accuracy @ 0.5
widget:
- source_sentence: Analyze tax liabilities, identify applicable rates, and apply corrections to ensure proper calculation and reporting.
  sentences:
  - Tax Computation
  - Cloud Infrastructure Management
  - Asian Cold Dish and Dessert Preparation
- source_sentence: Perform regular preventive maintenance on communication backbone systems, ensuring reliability and minimizing downtime.
  sentences:
  - Automatic Fare Collection Auxiliary Systems Maintenance
  - Clinical Supervision
  - Blog and Vlog Deployment
- source_sentence: Establish key performance indicators (KPIs) to measure the effectiveness of the total rewards program.
  sentences:
  - Product Advisory
  - Rigging for Animation
  - Social Policy Implementation
- source_sentence: Inspects and maintains 22KV switchgear systems, ensuring proper operation and safety compliance.
  sentences:
  - 22KV Switchgear Systems Maintenance
  - Contract Drafting
  - Animal Husbandry and Nutrition
- source_sentence: Design and implement machine learning pipelines for production systems with monitoring and automated retraining.
  sentences:
  - Machine Learning Engineering
  - Cargo Handling and Stowage
  - Non-sterile Compounding
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
- en
license: apache-2.0
---

# SSF-MiniLM Finetuned v2 — Skill Extraction Embedding Model

A [sentence-transformers](https://www.SBERT.net) model fine-tuned from [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for **matching job description sentences to standardized skills** from Singapore's SkillsFuture Framework (SSF).

The model maps sentences and skill names into a **384-dimensional dense vector space** where job description text lands close to its corresponding skill, enabling accurate semantic skill extraction, tagging, and retrieval.

## Highlights

- **AUC 0.995** on held-out validation (up from 0.978 baseline)
- **97.1% best accuracy** on skill-sentence matching (up from 92.8% baseline)
- Covers **2,196 unique skills** across all SSF sectors
- Fast inference: 22M params, runs efficiently on CPU and GPU
- Drop-in replacement for `all-MiniLM-L6-v2` — same API, better skill matching

## Model Details

| Property | Value |
|:---|:---|
| **Model Type** | Sentence Transformer (Bi-Encoder) |
| **Base Model** | [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) |
| **Architecture** | BERT (6 layers, 12 heads, 384 hidden) |
| **Parameters** | ~22M |
| **Max Sequence Length** | 256 tokens |
| **Output Dimensionality** | 384 |
| **Similarity Function** | Cosine Similarity |
| **Pooling** | Mean Pooling + L2 Normalization |
| **Language** | English |
| **License** | Apache 2.0 |

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
  (2): Normalize()
)
```

## Intended Use

### Primary Use Cases
- **Skill Extraction from Job Descriptions** — identify which standardized skills a JD sentence refers to
- **Skill Tagging / Auto-labeling** — tag resumes, courses, or learning content with SSF skills
- **Semantic Skill Search** — find relevant skills for a given text query
- **Skill Gap Analysis** — compare job requirements against employee skill profiles
- **HR Tech / Workforce Analytics** — power matching engines, recommendation systems, and talent platforms

### Suitable Applications
- Resume parsing and skill extraction pipelines
- Job-to-candidate matching engines
- Learning & development recommendation systems
- Skills taxonomy mapping and alignment
- Workforce planning and analytics dashboards

### Out-of-Scope Uses
- General-purpose sentence similarity (use the base model instead)
- Non-English text
- Tasks requiring generative output (this is an embedding model)
- Medical, legal, or safety-critical classification without human review

## Training Details

### Dataset

| Property | Value |
|:---|:---|
| **Name** | SSF Skill Extraction Pairs |
| **Domain** | Workforce Skills / HR / Job Descriptions |
| **Source Skills** | 2,196 unique skills from Singapore SkillsFuture Framework |
| **Synthetic Sentences** | 5 JD-style sentences per skill, generated via Qwen3-1.7B (Ollama) |
| **Total Training Pairs** | 21,958 (one positive and one random negative per sentence) |
| **Format** | `(sentence, skill_name, label)` — label 1.0 for correct skill, 0.0 for random incorrect skill |
| **Validation Split** | 10% held-out (2,195 pairs) |

**Sample training pairs:**

| Sentence | Skill | Label |
|:---|:---|:---:|
| Analyzes tax liabilities, identifies applicable rates, and applies corrections to ensure proper calculation and reporting. | Tax Computation | 1.0 |
| Monitor plant health by assessing symptoms and identifying disease risks. | Plant Health Management and Disease Control | 1.0 |
| Analyzes cross-cultural communication challenges in medical and legal contexts, optimizing translation strategies for diverse stakeholders. | Audience Segmentation | 0.0 |
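
The pair format above can be illustrated with a small sketch. The generation script is not included in this card, so `build_pairs`, the toy `sentences_by_skill` mapping, the seed, and the uniform negative sampling are all assumptions made for illustration.

```python
import random

def build_pairs(sentences_by_skill, seed=42):
    """Return (sentence, skill_name, label) tuples: one positive and one
    randomly sampled negative per sentence. Hypothetical helper."""
    rng = random.Random(seed)
    skills = sorted(sentences_by_skill)
    pairs = []
    for skill in skills:
        for sentence in sentences_by_skill[skill]:
            # Positive pair: the sentence with the skill it was written for.
            pairs.append((sentence, skill, 1.0))
            # Negative pair: the same sentence with a different, random skill.
            wrong = rng.choice([s for s in skills if s != skill])
            pairs.append((sentence, wrong, 0.0))
    return pairs

# Toy input (illustrative sentences, not taken from the dataset).
toy = {
    "Tax Computation": ["Analyzes tax liabilities and applies corrections."],
    "Contract Drafting": ["Drafts commercial agreements for review."],
    "Clinical Supervision": ["Supervises junior clinicians during rounds."],
}
pairs = build_pairs(toy)
for sentence, skill, label in pairs:
    print(f"{label:.1f}  {skill:<22} {sentence}")
```

Under this scheme, 2,196 skills with 5 sentences each would give roughly 2 × 5 × 2,196 = 21,960 pairs, which is close to the 21,958 reported above.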

### Training Objective

**Loss Function:** [CosineSimilarityLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) with MSE

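The model is trained to push the cosine similarity between a JD sentence and a skill name toward the pair's label: 1.0 for the correct skill and 0.0 for a randomly sampled incorrect skill. The loss is the mean squared error between the predicted similarity and the label, which pulls correct pairs together and pushes incorrect pairs apart.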
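
As a rough illustration of this objective, the sketch below computes the mean squared error between cosine similarity and the label on hand-written 2-D vectors. It is plain Python rather than the library's implementation, and the helper names `cosine` and `cosine_similarity_mse` are made up for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_mse(batch):
    """Mean squared error between cosine similarity and the target label.
    batch: list of (sentence_vector, skill_vector, label)."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in batch]
    return sum(errors) / len(errors)

batch = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # same direction, label 1.0: no error
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal, label 0.0: no error
    ([1.0, 0.0], [1.0, 1.0], 0.0),  # cosine ~0.707, label 0.0: error ~0.5
]
loss = cosine_similarity_mse(batch)
print(round(loss, 4))  # 0.1667
```

Only the third pair contributes to the loss here: it is a negative pair whose vectors are still fairly similar, which is the situation fine-tuning is meant to reduce.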

### Training Hyperparameters

| Parameter | Value |
|:---|:---|
| Epochs | 5 |
| Batch Size | 64 |
| Learning Rate | 5e-05 |
| Optimizer | AdamW (fused) |
| Warmup Steps | 10% of total steps |
| Scheduler | Linear decay |
| Seed | 42 |
| Precision | FP32 |
| Deterministic | Yes (`CUBLAS_WORKSPACE_CONFIG=:4096:8`) |
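
To make the warmup and scheduler rows concrete, here is a minimal sketch of a linear-warmup, linear-decay learning-rate curve. It assumes the common formulation (ramp from 0 to the peak rate, then decay linearly to 0); the function name `linear_warmup_decay` is hypothetical, and the total of 1,720 steps is an estimate inferred from the training log (step 500 at epoch 1.45), not a reported value.

```python
def linear_warmup_decay(step, total_steps, base_lr=5e-5, warmup_ratio=0.10):
    """Learning rate at a given optimizer step (hypothetical helper)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

total_steps = 1720  # assumed: about 344 steps per epoch over 5 epochs
for step in (0, 86, 172, 946, 1720):
    print(step, f"{linear_warmup_decay(step, total_steps):.2e}")
```

With these assumptions the rate peaks at 5e-05 around step 172 and is back to half the peak by step 946.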

### Training Logs

| Epoch | Step | Training Loss |
|:---:|:---:|:---:|
| 1.45 | 500 | 0.0822 |
| 2.91 | 1,000 | 0.0567 |
| 4.36 | 1,500 | 0.0493 |

## Evaluation

### Benchmark: Held-Out Skill Matching (10% split, 2,195 pairs)

Embeddings encoded with `normalize_embeddings=True`. Cosine similarity computed as dot product of normalized vectors.

| Model | AUC | Acc @ 0.5 | Best Accuracy | Pos Mean Sim | Neg Mean Sim |
|:---|:---:|:---:|:---:|:---:|:---:|
| all-MiniLM-L6-v2 (baseline) | 0.978 | 0.810 | 0.928 | 0.530 | 0.133 |
| SSF-MiniLM v1 (1 epoch) | 0.989 | 0.949 | 0.952 | 0.799 | 0.131 |
| **SSF-MiniLM v2 (5 epochs)** | **0.995** | **0.968** | **0.971** | **0.845** | **0.088** |

### Key Observations

- **AUC improved from 0.978 to 0.995** — the model almost perfectly ranks correct skills above incorrect ones
- **Positive similarity increased from 0.530 to 0.845** — correct pairs are now strongly matched
- **Negative similarity dropped from 0.133 to 0.088** — incorrect pairs are pushed further apart
- **Best accuracy improved from 92.8% to 97.1%**, a gain of 4.3 percentage points over the baseline
- **Accuracy @ 0.5 jumped from 81.0% to 96.8%** — the default threshold works well out of the box

### Metrics Explained

- **AUC**: Measures ranking quality — how often the model scores positive pairs above negative pairs (1.0 = perfect ranking)
- **Accuracy @ 0.5**: Classification accuracy using cosine similarity threshold of 0.5
- **Best Accuracy**: Best accuracy found by scanning thresholds from 1st–99th percentile of scores
- **Pos/Neg Mean Similarity**: Average cosine similarity for correct vs incorrect skill pairs
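
The metrics above can be sketched in plain Python as follows. This is not the evaluation script used for the table; the function names are illustrative, the scores are made up, and `best_accuracy` scans every observed score as a threshold instead of the 1st to 99th percentile range described above.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at(scores, labels, threshold):
    """Accuracy when predicting 'match' for scores at or above the threshold."""
    correct = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(scores)

def best_accuracy(scores, labels):
    """Highest accuracy over candidate thresholds (here: every observed score)."""
    return max(accuracy_at(scores, labels, t) for t in sorted(set(scores)))

# Made-up cosine similarities and labels for six pairs.
scores = [0.91, 0.78, 0.45, 0.62, 0.10, 0.05]
labels = [1, 1, 1, 0, 0, 0]

print(f"AUC:           {auc(scores, labels):.3f}")
print(f"Acc @ 0.5:     {accuracy_at(scores, labels, 0.5):.3f}")
print(f"Best accuracy: {best_accuracy(scores, labels):.3f}")
```

The pairwise definition of AUC is quadratic in the number of pairs, so for large evaluation sets a rank-based implementation from a standard library is the more practical choice.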

## Performance Summary

### Strengths
- Excellent skill discrimination (AUC 0.995) across 2,196 diverse skills
- Strong positive/negative separation (0.845 vs 0.088 mean similarity)
- Works well with a 0.5 threshold on the held-out set (96.8% accuracy vs. 97.1% at the best threshold), so little tuning should be needed on similar data
- Small model footprint (~87MB) enables fast CPU inference
- Covers a comprehensive range of workforce skills: IT, healthcare, engineering, finance, creative, trades, and more

### Weaknesses
- Optimized for SkillsFuture Framework skills — may underperform on skills not in the SSF taxonomy
- Trained on synthetic JD sentences — real-world JDs with unusual formatting or jargon may need additional fine-tuning
- Short text bias — best with single sentences or phrases; long paragraphs should be split into sentences first
- English only

## Limitations

- **Domain specificity**: The model is fine-tuned on Singapore's SkillsFuture Framework. Skills from other taxonomies (O*NET, ESCO, ISCO) may not match as precisely without further adaptation.
- **Synthetic training data**: JD-style sentences were generated by an LLM (Qwen3-1.7B), which may not capture all real-world phrasing variations.
- **No cross-lingual support**: English only. Multilingual JDs will need translation first.
- **Short text focus**: Designed for sentence-level matching. For multi-paragraph JDs, split into sentences before encoding.
- **Skill taxonomy coverage**: Limited to the 2,196 skills in the SSF dataset. New or niche skills outside this taxonomy were not seen during fine-tuning, so match quality for them is untested and may be lower.
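
For the sentence-splitting step mentioned above, a minimal regex-based sketch is shown below. The `split_sentences` helper is hypothetical and deliberately naive: it will mis-split abbreviations such as "e.g.", so a dedicated sentence tokenizer may be preferable for production JDs.

```python
import re

def split_sentences(text):
    """Split a JD into sentence-like chunks (naive, hypothetical helper)."""
    # Break after ., ! or ? followed by whitespace, and on line breaks.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    # Drop surrounding whitespace and leading bullet markers.
    cleaned = [p.strip(" \t-•*") for p in parts]
    return [p for p in cleaned if p]

jd = "Build pipelines on AWS. Manage stakeholders!\n- Mentor junior engineers"
chunks = split_sentences(jd)
for chunk in chunks:
    print(chunk)
```

Each resulting chunk can then be encoded and scored against the skill embeddings on its own.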

## Ethical Considerations

- **Bias**: The SSF taxonomy reflects Singapore's workforce structure. Skills from underrepresented or emerging fields may have fewer training examples.
- **Fairness**: The model matches text to skills — it does not evaluate candidates. Applications should ensure skill matching does not introduce hiring bias.
- **Responsible use**: This model is a tool for structuring skill data, not for making automated hiring decisions. Always include human review in high-stakes HR workflows.
- **Data provenance**: Training data is synthetically generated. No personal or proprietary job description data was used in training.

## Usage

### Quick Start (Sentence Transformers)

```bash
pip install -U sentence-transformers
```

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Encode job description sentences and skills
sentences = [
    "Design and implement scalable data pipelines for real-time analytics.",
    "Manage patient records and ensure compliance with healthcare regulations.",
]
skills = [
    "Data Engineering",
    "Healthcare Records Management",
    "Polymer Processing",
]

sentence_embeddings = model.encode(sentences, normalize_embeddings=True)
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Compute similarity (dot product of normalized vectors = cosine similarity)
# Result has shape (len(sentences), len(skills)): rows are sentences, columns are skills
similarities = np.dot(sentence_embeddings, skill_embeddings.T)
print(similarities)
# sentence 0 -> "Data Engineering" = high score
# sentence 1 -> "Healthcare Records Management" = high score
```

### Skill Extraction Pipeline

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Your skill taxonomy (or load from SSF dataset)
skills = ["Data Engineering", "Machine Learning", "Project Management", "Cloud Computing"]
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Extract skills from a JD sentence
jd_sentence = "Build and deploy ML models on AWS with CI/CD pipelines."
jd_embedding = model.encode([jd_sentence], normalize_embeddings=True)

scores = np.dot(jd_embedding, skill_embeddings.T)[0]
threshold = 0.5

for skill, score in sorted(zip(skills, scores), key=lambda x: -x[1]):
    if score >= threshold:
        print(f"  {skill}: {score:.3f}")
```

### Using with Transformers (Direct)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")
model = AutoModel.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
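    # Mean pooling over non-padding tokens
    attention_mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * attention_mask).sum(1)
    # Clamp the token count to avoid division by zero on empty inputs
    embeddings = summed / attention_mask.sum(1).clamp(min=1e-9)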
    # L2 normalize
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

query = encode(["Build scalable APIs with microservice architecture"])
skills = encode(["API Development", "Microservice Architecture", "Gardening"])
similarities = torch.mm(query, skills.T)
print(similarities)
```

## Deployment Notes

| Property | Detail |
|:---|:---|
| **Model Size** | ~87 MB (safetensors) |
| **Inference Speed** | ~5,000 sentences/sec on GPU, ~500/sec on CPU (batch 64) |
| **Memory** | ~350 MB RAM loaded |
| **ONNX Compatible** | Yes (via `sentence-transformers` export) |
| **Quantization** | Compatible with INT8/FP16 for faster inference |
| **Recommended Hardware** | Works on CPU; GPU recommended for batch processing |
| **Serving** | Compatible with Triton, TorchServe, FastAPI, or any ONNX runtime |

## Training Data

The training dataset is available at [imocha-ai-org/ssf-skill-extraction-pairs](https://huggingface.co/datasets/imocha-ai-org/ssf-skill-extraction-pairs) and contains:

- `pairs.jsonl` — 21,958 training pairs (sentence, skill, label)
- `generated_sentences.json` — 5 synthetic JD sentences per skill (2,196 skills)
- `meta.json` — dataset metadata
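
A minimal sketch for reading `pairs.jsonl` with the standard library is shown below. The JSON key names (`sentence`, `skill`, `label`) are assumptions based on the tuple format described in this card, so check them against the actual file; the sample rows are illustrative.

```python
import json

def read_pairs(lines):
    """Parse JSONL lines into (sentence, skill, label) tuples.
    Key names are assumed, not confirmed by the dataset card."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        pairs.append((row["sentence"], row["skill"], float(row["label"])))
    return pairs

# Illustrative rows standing in for the real file contents.
sample = [
    '{"sentence": "Monitor plant health by assessing symptoms.", '
    '"skill": "Plant Health Management and Disease Control", "label": 1.0}',
    "",
    '{"sentence": "Drafts commercial agreements.", '
    '"skill": "Audience Segmentation", "label": 0}',
]
pairs = read_pairs(sample)
positives = sum(1 for _, _, label in pairs if label == 1.0)
print(len(pairs), positives)  # 2 1
```

For the real file, the same function accepts an open file handle, for example `read_pairs(open("pairs.jsonl", encoding="utf-8"))`.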

## Framework Versions

- Python: 3.10.19
- Sentence Transformers: 5.2.2
- Transformers: 4.57.3
- PyTorch: 2.9.1+cu128
- Accelerate: 1.12.0
- Datasets: 4.3.0
- Tokenizers: 0.22.2

## Citation

### BibTeX

```bibtex
@misc{imocha2026ssf-miniLM,
  title     = {SSF-MiniLM Finetuned v2: Skill Extraction Embedding Model},
  author    = {imocha AI},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/imocha-ai-org/ssf-miniLM-finetuned-v2}
}
```

### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author    = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month     = "11",
    year      = "2019",
    publisher = "Association for Computational Linguistics",
    url       = "https://arxiv.org/abs/1908.10084",
}
```

## Contact / Maintainer

- **Organization**: [imocha AI](https://huggingface.co/imocha-ai-org)
- **Maintainer**: Sarvadnya
- **Issues**: Open a discussion on the [model repository](https://huggingface.co/imocha-ai-org/ssf-miniLM-finetuned-v2/discussions)