---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- skill-extraction
- job-description
- skill-matching
- workforce-analytics
- hr-tech
- talent-management
- semantic-search
- text-embedding
- skills-taxonomy
- skillsfuture
- singapore
- dense
- generated_from_trainer
- dataset_size:21958
- loss:CosineSimilarityLoss
- custom_code
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- imocha-ai-org/ssf-skill-extraction-pairs
model-index:
- name: ssf-miniLM-finetuned-v2
  results:
  - task:
      type: semantic-similarity
      name: Skill-to-Sentence Matching
    metrics:
    - type: AUC
      value: 0.995
      name: AUC (Held-Out 10%)
    - type: accuracy
      value: 0.971
      name: Best Accuracy
    - type: accuracy
      value: 0.968
      name: Accuracy @ 0.5
widget:
- source_sentence: Analyze tax liabilities, identify applicable rates, and apply corrections to ensure proper calculation and reporting.
  sentences:
  - Tax Computation
  - Cloud Infrastructure Management
  - Asian Cold Dish and Dessert Preparation
- source_sentence: Perform regular preventive maintenance on communication backbone systems, ensuring reliability and minimizing downtime.
  sentences:
  - Automatic Fare Collection Auxiliary Systems Maintenance
  - Clinical Supervision
  - Blog and Vlog Deployment
- source_sentence: Establish key performance indicators (KPIs) to measure the effectiveness of the total rewards program.
  sentences:
  - Product Advisory
  - Rigging for Animation
  - Social Policy Implementation
- source_sentence: Inspects and maintains 22KV switchgear systems, ensuring proper operation and safety compliance.
  sentences:
  - 22KV Switchgear Systems Maintenance
  - Contract Drafting
  - Animal Husbandry and Nutrition
- source_sentence: Design and implement machine learning pipelines for production systems with monitoring and automated retraining.
  sentences:
  - Machine Learning Engineering
  - Cargo Handling and Stowage
  - Non-sterile Compounding
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
- en
license: apache-2.0
---

# SSF-MiniLM Finetuned v2: Skill Extraction Embedding Model

A [sentence-transformers](https://www.SBERT.net) model fine-tuned from [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) for **matching job description sentences to standardized skills** from Singapore's SkillsFuture Framework (SSF).

The model maps sentences and skill names into a **384-dimensional dense vector space** where job description (JD) text lands close to its corresponding skill, enabling semantic skill extraction, tagging, and retrieval.

## Highlights

- **AUC 0.995** on held-out validation (up from 0.978 for the baseline)
- **97.1% best accuracy** on skill-sentence matching (up from 92.8% for the baseline)
- Covers **2,196 unique skills** across all SSF sectors
- Fast inference: ~22M parameters, runs efficiently on CPU and GPU
- Drop-in replacement for `all-MiniLM-L6-v2`: same API, better skill matching

## Model Details

| Property | Value |
|:---|:---|
| **Model Type** | Sentence Transformer (Bi-Encoder) |
| **Base Model** | [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) |
| **Architecture** | BERT (6 layers, 12 heads, 384 hidden) |
| **Parameters** | ~22M |
| **Max Sequence Length** | 256 tokens |
| **Output Dimensionality** | 384 |
| **Similarity Function** | Cosine Similarity |
| **Pooling** | Mean Pooling + L2 Normalization |
| **Language** | English |
| **License** | Apache 2.0 |

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
  (2): Normalize()
)
```

## Intended Use

### Primary Use Cases

- **Skill Extraction from Job Descriptions**: identify which standardized skills a JD sentence refers to
- **Skill Tagging / Auto-labeling**: tag resumes, courses, or learning content with SSF skills
- **Semantic Skill Search**: find relevant skills for a given text query
- **Skill Gap Analysis**: compare job requirements against employee skill profiles
- **HR Tech / Workforce Analytics**: power matching engines, recommendation systems, and talent platforms

### Suitable Applications

- Resume parsing and skill extraction pipelines
- Job-to-candidate matching engines
- Learning & development recommendation systems
- Skills taxonomy mapping and alignment
- Workforce planning and analytics dashboards

### Out-of-Scope Uses

- General-purpose sentence similarity (use the base model instead)
- Non-English text
- Tasks requiring generative output (this is an embedding model)
- Medical, legal, or safety-critical classification without human review

## Training Details

### Dataset

| Property | Value |
|:---|:---|
| **Name** | SSF Skill Extraction Pairs |
| **Domain** | Workforce Skills / HR / Job Descriptions |
| **Source Skills** | 2,196 unique skills from Singapore SkillsFuture Framework |
| **Synthetic Sentences** | 5 JD-style sentences per skill, generated via Qwen3-1.7B (Ollama) |
| **Total Training Pairs** | 21,958 (one positive and one randomly sampled negative per sentence) |
| **Format** | `(sentence, skill_name, label)`: label 1.0 for the correct skill, 0.0 for a random incorrect skill |
| **Validation Split** | 10% held-out (2,195 pairs) |

**Sample training pairs:**

| Sentence | Skill | Label |
|:---|:---|:---:|
| Analyzes tax liabilities, identifies applicable rates, and applies corrections to ensure proper calculation and reporting. | Tax Computation | 1.0 |
| Monitor plant health by assessing symptoms and identifying disease risks. | Plant Health Management and Disease Control | 1.0 |
| Analyzes cross-cultural communication challenges in medical and legal contexts, optimizing translation strategies for diverse stakeholders. | Audience Segmentation | 0.0 |

### Training Objective

**Loss Function:** [CosineSimilarityLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss), which minimizes the mean squared error (MSE) between the cosine similarity of each pair and its label.

With labels of 1.0 and 0.0, the model learns to push the cosine similarity between a JD sentence and its correct skill toward 1.0, and the similarity to randomly sampled incorrect skills toward 0.0. Training on both kinds of pairs produces well-separated embeddings.

### Training Hyperparameters

| Parameter | Value |
|:---|:---|
| Epochs | 5 |
| Batch Size | 64 |
| Learning Rate | 5e-05 |
| Optimizer | AdamW (fused) |
| Warmup Steps | 10% of total steps |
| Scheduler | Linear decay |
| Seed | 42 |
| Precision | FP32 |
| Deterministic | Yes (`CUBLAS_WORKSPACE_CONFIG=:4096:8`) |

### Training Logs

| Epoch | Step | Training Loss |
|:---:|:---:|:---:|
| 1.45 | 500 | 0.0822 |
| 2.91 | 1,000 | 0.0567 |
| 4.36 | 1,500 | 0.0493 |

## Evaluation

### Benchmark: Held-Out Skill Matching (10% split, 2,195 pairs)

Embeddings were encoded with `normalize_embeddings=True`, so cosine similarity is computed as the dot product of the normalized vectors.
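
As a rough illustration of how the metrics reported in this section (AUC, accuracy at a 0.5 threshold, and best accuracy over a percentile scan) could be computed from cosine scores and binary labels, here is a minimal NumPy sketch. The function names and the toy `scores` and `labels` arrays are hypothetical and are not taken from the evaluation script behind the reported numbers.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def accuracy_at(scores, labels, threshold):
    """Accuracy when a pair is predicted positive if its score >= threshold."""
    preds = (scores >= threshold).astype(int)
    return float((preds == labels).mean())

def best_accuracy(scores, labels):
    """Scan thresholds at the 1st to 99th percentile of the scores."""
    thresholds = np.percentile(scores, np.arange(1, 100))
    accuracies = [accuracy_at(scores, labels, t) for t in thresholds]
    best = int(np.argmax(accuracies))
    return accuracies[best], float(thresholds[best])

# Hypothetical cosine similarities (dot products of normalized embeddings)
# and their labels: 1 = correct skill, 0 = incorrect skill.
scores = np.array([0.91, 0.84, 0.47, 0.62, 0.12, 0.05, 0.55, -0.03])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

auc = pairwise_auc(scores, labels)
acc_05 = accuracy_at(scores, labels, 0.5)
best_acc, best_thr = best_accuracy(scores, labels)
print(f"AUC={auc:.4f}  Acc@0.5={acc_05:.3f}  BestAcc={best_acc:.3f} at {best_thr:.3f}")
```

The pairwise AUC is quadratic in the number of pairs, which is fine for a few thousand validation pairs; for larger sets a rank-based implementation would be a more practical choice.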

| Model | AUC | Acc @ 0.5 | Best Accuracy | Pos Mean Sim | Neg Mean Sim |
|:---|:---:|:---:|:---:|:---:|:---:|
| all-MiniLM-L6-v2 (baseline) | 0.978 | 0.810 | 0.928 | 0.530 | 0.133 |
| SSF-MiniLM v1 (1 epoch) | 0.989 | 0.949 | 0.952 | 0.799 | 0.131 |
| **SSF-MiniLM v2 (5 epochs)** | **0.995** | **0.968** | **0.971** | **0.845** | **0.088** |

### Key Observations

- **AUC improved from 0.978 to 0.995**: the model almost perfectly ranks correct skills above incorrect ones
- **Positive similarity increased from 0.530 to 0.845**: correct pairs are now strongly matched
- **Negative similarity dropped from 0.133 to 0.088**: incorrect pairs are pushed further apart
- **Best accuracy improved from 92.8% to 97.1%**: a gain of 4.3 percentage points over the baseline
- **Accuracy @ 0.5 jumped from 81.0% to 96.8%**: a 0.5 threshold works well out of the box

### Metrics Explained

- **AUC**: Measures ranking quality, that is, how often the model scores positive pairs above negative pairs (1.0 = perfect ranking)
- **Accuracy @ 0.5**: Classification accuracy using a cosine similarity threshold of 0.5
- **Best Accuracy**: Highest accuracy found by scanning thresholds from the 1st to the 99th percentile of the scores
- **Pos/Neg Mean Similarity**: Average cosine similarity for correct vs. incorrect skill pairs

## Performance Summary

### Strengths

- Excellent skill discrimination (AUC 0.995) across 2,196 diverse skills
- Strong positive/negative separation (0.845 vs. 0.088 mean similarity)
- Works well with a 0.5 threshold on the held-out set (96.8% accuracy vs. 97.1% at the best threshold), so little tuning should be needed for similar data
- Small model footprint (~87 MB) enables fast CPU inference
- Covers a comprehensive range of workforce skills: IT, healthcare, engineering, finance, creative, trades, and more

### Weaknesses

- Optimized for SkillsFuture Framework skills; may underperform on skills not in the SSF taxonomy
- Trained on synthetic JD sentences; real-world JDs with unusual formatting or jargon may need additional fine-tuning
- Short text bias: best with single sentences or phrases; long paragraphs should be split into sentences first
- English only

## Limitations

- **Domain specificity**: The model is fine-tuned on Singapore's SkillsFuture Framework. Skills from other taxonomies (O*NET, ESCO, ISCO) may not match as precisely without further adaptation.
- **Synthetic training data**: JD-style sentences were generated by an LLM (Qwen3-1.7B), which may not capture all real-world phrasing variations.
- **No cross-lingual support**: English only. Non-English JDs will need translation first.
- **Short text focus**: Designed for sentence-level matching. For multi-paragraph JDs, split into sentences before encoding.
- **Skill taxonomy coverage**: Limited to the 2,196 skills in the SSF dataset. New or niche skills outside this taxonomy were not seen during fine-tuning, so matching quality for them may be closer to that of the base model.

## Ethical Considerations

- **Bias**: The SSF taxonomy reflects Singapore's workforce structure. Skills from underrepresented or emerging fields may have fewer training examples.
- **Fairness**: The model matches text to skills; it does not evaluate candidates. Applications should ensure skill matching does not introduce hiring bias.
- **Responsible use**: This model is a tool for structuring skill data, not for making automated hiring decisions. Always include human review in high-stakes HR workflows.
- **Data provenance**: Training data is synthetically generated. No personal or proprietary job description data was used in training.
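
For the multi-paragraph case mentioned under Limitations, one possible preprocessing step is sketched below: split the JD into sentences, score each sentence against the skill list, and keep each skill's best score. The splitter is a naive regex, and `split_into_sentences`, `aggregate_skills`, and the toy score matrix are hypothetical names and values used only for illustration; in practice the matrix would come from `np.dot` of the normalized sentence and skill embeddings.

```python
import re
import numpy as np

def split_into_sentences(text):
    """Naive splitter: break after sentence-ending punctuation or at line breaks."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    cleaned = [p.strip(" \t-*") for p in parts]  # drop bullet markers and padding
    return [p for p in cleaned if p]

def aggregate_skills(score_matrix, skills, threshold=0.5):
    """Keep each skill's best score over all sentences if it clears the threshold."""
    best = np.asarray(score_matrix).max(axis=0)
    return {skill: float(b) for skill, b in zip(skills, best) if b >= threshold}

jd_text = """Build and deploy ML models on AWS. Maintain CI/CD pipelines!
- Coordinate project timelines with stakeholders."""

sentences = split_into_sentences(jd_text)

# Hypothetical similarity matrix of shape (len(sentences), len(skills)).
# With the model loaded, this would be computed from the embeddings instead.
skills = ["Machine Learning", "Project Management", "Gardening"]
scores = np.array([
    [0.81, 0.10, 0.02],
    [0.56, 0.20, 0.01],
    [0.05, 0.77, 0.04],
])

extracted = aggregate_skills(scores, skills)
print(sentences)
print(extracted)
```

Taking the maximum over sentences is only one aggregation choice; averaging or counting sentences above the threshold may suit JDs where a skill should be mentioned repeatedly before it is tagged.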

## Usage

### Quick Start (Sentence Transformers)

```bash
pip install -U sentence-transformers
```

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Job description sentences and candidate skills
sentences = [
    "Design and implement scalable data pipelines for real-time analytics.",
    "Manage patient records and ensure compliance with healthcare regulations.",
]
skills = [
    "Data Engineering",
    "Healthcare Records Management",
    "Polymer Processing",
]

# Encode with L2 normalization so that dot product equals cosine similarity
sentence_embeddings = model.encode(sentences, normalize_embeddings=True)
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Similarity matrix of shape (len(sentences), len(skills))
similarities = np.dot(sentence_embeddings, skill_embeddings.T)
print(similarities)
# sentence 0 -> "Data Engineering" should score highest
# sentence 1 -> "Healthcare Records Management" should score highest
```

### Skill Extraction Pipeline

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Your skill taxonomy (or load from the SSF dataset)
skills = ["Data Engineering", "Machine Learning", "Project Management", "Cloud Computing"]
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Extract skills from a JD sentence
jd_sentence = "Build and deploy ML models on AWS with CI/CD pipelines."
jd_embedding = model.encode([jd_sentence], normalize_embeddings=True)

# One cosine score per skill
scores = np.dot(jd_embedding, skill_embeddings.T)[0]

# Print skills at or above the threshold, highest score first
threshold = 0.5
for skill, score in sorted(zip(skills, scores), key=lambda x: -x[1]):
    if score >= threshold:
        print(f"  {skill}: {score:.3f}")
```

### Using with Transformers (Direct)

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")
model = AutoModel.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # L2 normalize so that dot product equals cosine similarity
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

query = encode(["Build scalable APIs with microservice architecture"])
skills = encode(["API Development", "Microservice Architecture", "Gardening"])
similarities = torch.mm(query, skills.T)
print(similarities)
```

## Deployment Notes

| Property | Detail |
|:---|:---|
| **Model Size** | ~87 MB (safetensors) |
| **Inference Speed** | ~5,000 sentences/sec on GPU, ~500/sec on CPU (batch 64) |
| **Memory** | ~350 MB RAM loaded |
| **ONNX Compatible** | Yes (via `sentence-transformers` export) |
| **Quantization** | Compatible with INT8/FP16 for faster inference |
| **Recommended Hardware** | Works on CPU; GPU recommended for batch processing |
| **Serving** | Compatible with Triton, TorchServe, FastAPI, or any ONNX runtime |

## Training Data

The training dataset is available at [imocha-ai-org/ssf-skill-extraction-pairs](https://huggingface.co/datasets/imocha-ai-org/ssf-skill-extraction-pairs) and contains:

- `pairs.jsonl`: 21,958 training pairs (sentence, skill, label)
- `generated_sentences.json`: 5 synthetic JD sentences per skill (2,196 skills)
- `meta.json`: dataset metadata

## Framework Versions

- Python: 3.10.19
- Sentence Transformers: 5.2.2
- Transformers: 4.57.3
- PyTorch: 2.9.1+cu128
- Accelerate: 1.12.0
- Datasets: 4.3.0
- Tokenizers: 0.22.2

## Citation

### BibTeX

```bibtex
@misc{imocha2026ssf-miniLM,
    title = {SSF-MiniLM Finetuned v2: Skill Extraction Embedding Model},
    author = {imocha AI},
    year = {2026},
    publisher = {Hugging Face},
    url = {https://huggingface.co/imocha-ai-org/ssf-miniLM-finetuned-v2}
}
```

### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

## Contact / Maintainer

- **Organization**: [imocha AI](https://huggingface.co/imocha-ai-org)
- **Maintainer**: Sarvadnya
- **Issues**: Open a discussion on the [model repository](https://huggingface.co/imocha-ai-org/ssf-miniLM-finetuned-v2/discussions)