---
language:
- id
- en
library_name: custom
license: mit
tags:
- indonesian
- skill-extraction
- ner
- job-market
- nlp
- rule-based
- skills-taxonomy
- indonesian-nlp
pipeline_tag: token-classification
---

# Indonesian Skill Extractor v1.0

A production-ready, rule-based NER system for extracting and categorizing technical and soft skills from Indonesian job postings.

## 🎯 Model Description

This model identifies and categorizes skills in Indonesian job market texts. It relies on a predefined taxonomy of **200+ skills** across **7 categories**, combined with pattern matching and alias normalization.

### Key Features

- ✅ **Zero Dependencies**: Pure Python, no ML frameworks required
- ✅ **200+ Skills**: Comprehensive taxonomy across 7 categories  
- ✅ **Bilingual**: Handles English and Indonesian (including code-switching)
- ✅ **Skill Normalization**: Maps aliases to canonical forms (js→javascript, etc.)
- ✅ **Proficiency Detection**: Identifies beginner/intermediate/expert levels
- ✅ **Fast & Deterministic**: 1000+ docs/sec, reproducible results
- ✅ **Production Ready**: Lightweight (~20 KB), easy integration

### Skill Categories

| Category | Count | Examples |
|----------|-------|----------|
| **programming** | 30+ | Python, Java, JavaScript, TypeScript, PHP, C++, Go, Rust |
| **frontend** | 40+ | React, Vue, Angular, Next.js, HTML, CSS, Tailwind, Webpack |
| **backend** | 30+ | Node.js, Django, Laravel, Spring Boot, Express, FastAPI |
| **database** | 25+ | MySQL, PostgreSQL, MongoDB, Redis, Elasticsearch, Oracle |
| **cloud** | 35+ | AWS, Azure, GCP, Docker, Kubernetes, Jenkins, Terraform |
| **data_science** | 30+ | Pandas, TensorFlow, PyTorch, Tableau, Power BI, Spark |
| **soft_skills** | 20+ | Communication, Leadership, Teamwork, Problem Solving |

**Total**: 200+ skills with 40+ aliases and variations

## 🚀 Quick Start

### Installation

No installation required! Just download the single Python file:

```python
# Download skill_extractor.py from this repository
# Place it in your project directory
from skill_extractor import IndonesianSkillExtractor

# Or use convenience function
from skill_extractor import extract_skills
```

### Basic Usage

```python
from skill_extractor import IndonesianSkillExtractor

# Initialize
extractor = IndonesianSkillExtractor()

# Extract skills from text
text = "Menguasai Python, React, MySQL, dan komunikasi yang baik"
result = extractor.extract(text)

print(result)
# Output:
# {
#   'skills': [
#     {'original': 'Python', 'normalized': 'python', 'category': 'programming', 'proficiency': None},
#     {'original': 'React', 'normalized': 'react', 'category': 'frontend', 'proficiency': None},
#     {'original': 'MySQL', 'normalized': 'mysql', 'category': 'database', 'proficiency': None},
#     {'original': 'komunikasi', 'normalized': 'komunikasi', 'category': 'soft_skills', 'proficiency': None}
#   ],
#   'total_count': 4,
#   'unique_count': 4,
#   'by_category': {
#     'programming': [...],
#     'frontend': [...],
#     'database': [...],
#     'soft_skills': [...]
#   }
# }
```
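
The `skills` list in the result can be post-processed with plain Python. The sketch below groups normalized names by category; `group_by_category` is a hypothetical helper written for illustration (it is not part of `skill_extractor.py`), and the sample records simply copy the output shown above.

```python
from collections import defaultdict

# Sample records copied from the documented output of extract().
skills = [
    {"original": "Python", "normalized": "python", "category": "programming", "proficiency": None},
    {"original": "React", "normalized": "react", "category": "frontend", "proficiency": None},
    {"original": "MySQL", "normalized": "mysql", "category": "database", "proficiency": None},
    {"original": "komunikasi", "normalized": "komunikasi", "category": "soft_skills", "proficiency": None},
]


def group_by_category(skill_records):
    """Map each category to the normalized skill names found in it (hypothetical helper)."""
    grouped = defaultdict(list)
    for record in skill_records:
        grouped[record["category"]].append(record["normalized"])
    return dict(grouped)


summary = group_by_category(skills)
print(summary)
```

In real use, pass `result['skills']` from `extractor.extract(text)` instead of the hard-coded sample.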

### Simple Extraction

```python
from skill_extractor import extract_skills

# Quick extraction (returns list of skill names)
skills = extract_skills("Python, React, MySQL, AWS")
print(skills)
# Output: ['python', 'react', 'mysql', 'aws']
```

### Batch Processing

```python
from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

texts = [
    "Python, Django, PostgreSQL",
    "React, TypeScript, Node.js",
    "AWS, Docker, Kubernetes"
]

results = extractor.batch_extract(texts)

for i, result in enumerate(results):
    print(f"Text {i+1}: {result['total_count']} skills, {len(result['by_category'])} categories")
```

### Get Top Skills

```python
from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

job_descriptions = [
    "Python, Django, React...",
    "Java, Spring, MySQL...",
    "Python, FastAPI, PostgreSQL..."
]

top_skills = extractor.get_top_skills(job_descriptions, top_n=5)
print(top_skills)
# Output: [('python', 2), ('react', 1), ('django', 1), ...]
```

## 📊 Features

### 1. Skill Normalization

Handles variations and aliases:

```python
from skill_extractor import extract_skills

# These all normalize to the same skill
texts = ["JS", "js", "JavaScript", "javascript"]
for text in texts:
    skills = extract_skills(text)
    print(skills)  # All output: ['javascript']
```

**40+ Aliases Supported:**
- js → javascript
- ts → typescript
- py → python
- reactjs, react.js → react
- nodejs → node.js
- pg, postgres → postgresql
- mongo → mongodb
- k8s → kubernetes
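
Alias normalization of this kind can be approximated with a dictionary lookup. The following is a minimal sketch, not the code in `skill_extractor.py`: the `ALIASES` table and the `normalize_skill` function are hypothetical names, and the table only contains the aliases listed above.

```python
# Hypothetical alias table built from the aliases listed above;
# the real table in skill_extractor.py may be organized differently.
ALIASES = {
    "js": "javascript",
    "ts": "typescript",
    "py": "python",
    "reactjs": "react",
    "react.js": "react",
    "nodejs": "node.js",
    "pg": "postgresql",
    "postgres": "postgresql",
    "mongo": "mongodb",
    "k8s": "kubernetes",
}


def normalize_skill(raw: str) -> str:
    """Trim and lowercase the input, then map known aliases to canonical names."""
    key = raw.strip().lower()
    return ALIASES.get(key, key)


print([normalize_skill(s) for s in ["JS", "ReactJS", " K8s ", "Python"]])
# → ['javascript', 'react', 'kubernetes', 'python']
```

Names without an alias entry pass through unchanged apart from trimming and lowercasing, which is why `"Python"` becomes `"python"`.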

### 2. Proficiency Detection

Extracts skill levels from text:

```python
from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

text = "Expert in Python, Advanced React, Basic MySQL"
result = extractor.extract(text)

for skill in result['skills']:
    print(f"{skill['normalized']}: {skill['proficiency']}")

# Output:
# python: expert
# react: expert (advanced maps to expert)
# mysql: beginner (basic maps to beginner)
```

**Proficiency Keywords:**
- **Expert**: expert, advanced, mahir, ahli, mastery
- **Intermediate**: intermediate, menengah, competent
- **Beginner**: beginner, basic, pemula, dasar
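
A keyword lookup like the one described above could be sketched as follows. This is an illustration under assumptions: `PROFICIENCY_KEYWORDS` and `detect_proficiency` are hypothetical names, the keyword lists are copied from the list above, and the actual matching logic in `skill_extractor.py` may differ.

```python
import re
from typing import Optional

# Keyword lists copied from the documentation above; the dict itself is hypothetical.
PROFICIENCY_KEYWORDS = {
    "expert": ["expert", "advanced", "mahir", "ahli", "mastery"],
    "intermediate": ["intermediate", "menengah", "competent"],
    "beginner": ["beginner", "basic", "pemula", "dasar"],
}


def detect_proficiency(fragment: str) -> Optional[str]:
    """Return the first proficiency level whose keyword appears as a whole word."""
    lowered = fragment.lower()
    for level, keywords in PROFICIENCY_KEYWORDS.items():
        for keyword in keywords:
            # Word boundaries keep "ahli" from matching inside longer words.
            if re.search(rf"\b{re.escape(keyword)}\b", lowered):
                return level
    return None


for fragment in ["Expert in Python", "Advanced React", "Basic MySQL", "Docker"]:
    print(fragment, "->", detect_proficiency(fragment))
```

Because the lookup is purely keyword-based, a fragment such as "Visual Basic" would be tagged `beginner` by this sketch, which illustrates the keyword-only limitation noted later in this document.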

### 3. Indonesian Language Support

Handles Indonesian skill names and code-switching:

```python
from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

text = "Komunikasi yang baik, kerja sama tim, kepemimpinan, Python"
result = extractor.extract(text)

for skill in result['skills']:
    print(f"{skill['original']} → {skill['category']}")

# Output:
# Komunikasi → soft_skills
# kerja sama tim → soft_skills
# kepemimpinan → soft_skills (leadership)
# Python → programming
```

### 4. Comprehensive Parsing

Handles multiple formats:

```python
from skill_extractor import extract_skills

# Comma-separated
extract_skills("Python, React, MySQL")

# Semicolon-separated
extract_skills("Python; React; MySQL")

# Bullet points
extract_skills("• Python • React • MySQL")

# Newline-separated
extract_skills("Python\nReact\nMySQL")

# Mixed with parenthetical annotations (proficiency, years of experience)
extract_skills("Python (Expert), React (2 years), MySQL")
```
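
One way to handle these formats is to split on a set of delimiters before looking tokens up in the taxonomy. The sketch below assumes that approach; `split_skill_list` is a hypothetical helper, and unlike the real extractor it simply discards parenthetical notes instead of interpreting them.

```python
import re
from typing import List

# Delimiters taken from the formats shown above: comma, semicolon, bullet, newline.
_DELIMITERS = re.compile(r"[,;•\n]+")
# Parenthetical notes such as "(Expert)" or "(2 years)".
_PARENS = re.compile(r"\([^)]*\)")


def split_skill_list(text: str) -> List[str]:
    """Split a skill list into candidate tokens (hypothetical helper)."""
    without_notes = _PARENS.sub("", text)  # simplification: notes are dropped
    parts = _DELIMITERS.split(without_notes)
    return [part.strip() for part in parts if part.strip()]


print(split_skill_list("Python (Expert), React (2 years), MySQL"))
# → ['Python', 'React', 'MySQL']
```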

## 📈 Performance

| Metric | Value |
|--------|-------|
| **Speed** | 1000+ docs/second |
| **Model Size** | ~20 KB (pure Python) |
| **Dependencies** | None (stdlib only) |
| **Skills Covered** | 200+ |
| **Categories** | 7 |
| **Aliases** | 40+ |
| **Languages** | Indonesian + English |

### Comparison with ML Models

| Feature | Skill Extractor | BERT-based NER |
|---------|----------------|----------------|
| **Training Data** | Not required | Required (1000+ samples) |
| **Model Size** | 20 KB | 300+ MB |
| **Speed** | 1000+ docs/sec | 50 docs/sec |
| **Deterministic** | ✅ Yes | ⚠️ Can vary with hardware and library versions |
| **Explainable** | ✅ Yes | ❌ No |
| **Easy to Update** | ✅ Just edit dict | ❌ Requires retraining |

## 🎯 Use Cases

### 1. Job-Candidate Matching

```python
from skill_extractor import extract_skills

# job_description and resume_text are plain strings you supply
job_skills = set(extract_skills(job_description))
candidate_skills = set(extract_skills(resume_text))

# Share of the job's skills that the candidate covers (0 if the job lists none)
matching_skills = job_skills & candidate_skills
match_score = len(matching_skills) / len(job_skills) * 100 if job_skills else 0.0
```

### 2. Skills Gap Analysis

```python
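from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

# Get market demand (job_postings and resumes are lists of strings you supply)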
market_skills = extractor.get_top_skills(job_postings, top_n=20)

# Get candidate pool skills
candidate_skills = extractor.get_top_skills(resumes, top_n=20)

# Find gaps
in_demand = set(s[0] for s in market_skills)
available = set(s[0] for s in candidate_skills)
skill_gaps = in_demand - available
```

### 3. Trend Analysis

```python
from collections import Counter, defaultdict

from skill_extractor import extract_skills

# Group by time period; each job is a dict with 'month' and 'requirements' keys
skills_by_month = defaultdict(list)
for job in jobs:
    skills_by_month[job['month']].extend(extract_skills(job['requirements']))

# Analyze trends
for month, skills in skills_by_month.items():
    top_5 = Counter(skills).most_common(5)
    print(f"{month}: {top_5}")
```

### 4. Resume Screening

```python
from skill_extractor import extract_skills

required_skills = {'python', 'django', 'postgresql'}
nice_to_have = {'react', 'docker', 'aws'}

def score_resume(resume_text):
    candidate_skills = set(extract_skills(resume_text))

    # Required skills (2 points each)
    required_score = len(candidate_skills & required_skills) * 2

    # Nice to have (1 point each)
    bonus_score = len(candidate_skills & nice_to_have)

    return required_score + bonus_score

# Rank candidates
candidates = [...]  # placeholder: a list of dicts, each with a 'resume' key
ranked = sorted(candidates, key=lambda c: score_resume(c['resume']), reverse=True)
```

## 🔧 API Reference

### `IndonesianSkillExtractor`

Main class for skill extraction.

#### Methods:

**`extract(text: str) -> Dict`**
- Full extraction with metadata
- Returns: skills, counts, categories, proficiency

**`extract_simple(text: str) -> List[str]`**
- Simple extraction returning skill names
- Returns: List of normalized skill strings

**`batch_extract(texts: List[str]) -> List[Dict]`**
- Process multiple texts
- Returns: List of extraction results

**`get_top_skills(texts: List[str], top_n: int) -> List[Tuple]`**
- Get most frequent skills across texts
- Returns: List of (skill, count) tuples

**`get_stats() -> Dict`**
- Get model statistics
- Returns: version, total_skills, categories, etc.
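
To illustrate the `(skill, count)` return shape of `get_top_skills`, here is a rough sketch of frequency counting over several texts. It is not the library's implementation: `top_skills` and `toy_extract` are hypothetical stand-ins, and counting each skill at most once per document is an assumption made for this example.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def top_skills(
    texts: Iterable[str],
    extract_fn: Callable[[str], List[str]],
    top_n: int = 10,
) -> List[Tuple[str, int]]:
    """Count how many documents mention each skill and return the most common."""
    counts: Counter = Counter()
    for text in texts:
        # Assumption: a skill repeated within one document counts once.
        counts.update(set(extract_fn(text)))
    return counts.most_common(top_n)


def toy_extract(text: str) -> List[str]:
    """Stand-in extractor for this example: split on commas and lowercase."""
    return [part.strip().lower() for part in text.split(",") if part.strip()]


docs = ["Python, Django, React", "Java, Spring, MySQL", "Python, FastAPI, PostgreSQL"]
ranking = top_skills(docs, toy_extract, top_n=1)
print(ranking)
# → [('python', 2)]
```

With the real library, `extractor.extract_simple` could be passed as `extract_fn` in place of `toy_extract`.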

### Convenience Functions

**`extract_skills(text: str) -> List[str]`**
- Quick one-line extraction
- Creates extractor instance automatically

## 📄 License

This model is released under the **MIT License**.

**Citation:**
```bibtex
@software{indonesian_skill_extractor_2024,
  author = {Herlambang Haryo Putro},
  title = {Indonesian Skill Extractor v1.0},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/herlambangharyoputro/indonesian-skill-extractor-v1}
}
```

## 🤝 Contributions

Part of the **Job Market Intelligence Platform** project.

**Repository**: [GitHub - job-market-intelligence-platform](https://github.com/herlambangharyoputro/job-market-intelligence-platform)

**Related Datasets:**
- [Indonesian Job Market - Raw 2024](https://huggingface.co/datasets/herlambangharyoputro/indonesian-job-market-raw-2024)
- [Indonesian Job Market - Tokenized 2024](https://huggingface.co/datasets/herlambangharyoputro/indonesian-job-market-tokenized-2024)

**Contributions welcome!** Please open an issue or pull request on GitHub if you:
- Find missing skills or categories
- Have suggestions for improvements
- Want to add more language support
- Build interesting projects using this model

## 📧 Contact

- **Author**: Herlambang Haryo Putro
- **Email**: herlambangharyoputro@gmail.com
- **GitHub**: [@herlambangharyoputro](https://github.com/herlambangharyoputro)
- **Project**: Job Market Intelligence Platform

## 🔄 Version History

- **v1.0.0** (December 2024): Initial release
  - 200+ skills across 7 categories
  - 40+ aliases for normalization
  - Proficiency level detection
  - Indonesian language support
  - Zero dependencies

## ⚠️ Limitations

### Coverage
- Limited to predefined skill taxonomy (200+ skills)
- New/emerging skills may be categorized as 'other'
- Domain-specific skills may not be recognized

### Language
- Primarily optimized for Indonesian job market
- May not capture all regional variations
- English technical terms preferred over Indonesian equivalents

### Accuracy
- Rule-based approach may miss context-dependent skills
- Acronyms can be ambiguous (e.g., "AI" = Artificial Intelligence or Adobe Illustrator)
- Proficiency detection based on keywords only

### Recommendations
- Best for structured skill lists (bullets, commas)
- Review 'other' category for domain-specific additions
- Combine with manual review for critical applications
- Consider ML-based approach for unstructured text

## 🎯 Future Improvements

Planned features for v2.0:
- Expanded skill taxonomy (300+ skills)
- Industry-specific categories
- Skill clustering and relationships
- Confidence scoring
- Multi-language support (Javanese, Sundanese)
- Experience year extraction
- Certification detection

---

**Last Updated**: December 2024  
**Model Version**: 1.0.0  
**Status**: ✅ Production Ready  
**Type**: Rule-based NER

**For questions or collaboration, visit [GitHub](https://github.com/herlambangharyoputro/job-market-intelligence-platform).**