---
language:
- id
- en
library_name: custom
license: mit
tags:
- indonesian
- skill-extraction
- ner
- job-market
- nlp
- rule-based
- skills-taxonomy
- indonesian-nlp
pipeline_tag: token-classification
---

# Indonesian Skill Extractor v1.0

A production-ready, rule-based NER system for extracting and categorizing technical and soft skills from Indonesian job postings.

## 🎯 Model Description

This model specializes in identifying and categorizing skills from Indonesian job market texts. It uses a comprehensive skill taxonomy with **200+ predefined skills** across **7 categories**, combined with intelligent pattern matching and normalization.

### Key Features

- ✅ **Zero Dependencies**: Pure Python, no ML frameworks required
- ✅ **200+ Skills**: Comprehensive taxonomy across 7 categories
- ✅ **Bilingual**: Handles English and Indonesian (including code-switching)
- ✅ **Skill Normalization**: Maps aliases to canonical forms (js→javascript, etc.)
- ✅ **Proficiency Detection**: Identifies beginner/intermediate/expert levels
- ✅ **Fast & Deterministic**: 1000+ docs/sec, reproducible results
- ✅ **Production Ready**: Lightweight (~20 KB), easy integration

### Skill Categories

| Category | Count | Examples |
|----------|-------|----------|
| **programming** | 30+ | Python, Java, JavaScript, TypeScript, PHP, C++, Go, Rust |
| **frontend** | 40+ | React, Vue, Angular, Next.js, HTML, CSS, Tailwind, Webpack |
| **backend** | 30+ | Node.js, Django, Laravel, Spring Boot, Express, FastAPI |
| **database** | 25+ | MySQL, PostgreSQL, MongoDB, Redis, Elasticsearch, Oracle |
| **cloud** | 35+ | AWS, Azure, GCP, Docker, Kubernetes, Jenkins, Terraform |
| **data_science** | 30+ | Pandas, TensorFlow, PyTorch, Tableau, Power BI, Spark |
| **soft_skills** | 20+ | Communication, Leadership, Teamwork, Problem Solving |

**Total**: 200+ skills with 40+ aliases and variations

## 🚀 Quick Start

### Installation

No installation required!
Just download the single Python file:

```python
# Download skill_extractor.py from this repository
# Place it in your project directory
from skill_extractor import IndonesianSkillExtractor

# Or use the convenience function
from skill_extractor import extract_skills
```

### Basic Usage

```python
from skill_extractor import IndonesianSkillExtractor

# Initialize
extractor = IndonesianSkillExtractor()

# Extract skills from text
text = "Menguasai Python, React, MySQL, dan komunikasi yang baik"
result = extractor.extract(text)

print(result)
# Output:
# {
#     'skills': [
#         {'original': 'Python', 'normalized': 'python', 'category': 'programming', 'proficiency': None},
#         {'original': 'React', 'normalized': 'react', 'category': 'frontend', 'proficiency': None},
#         {'original': 'MySQL', 'normalized': 'mysql', 'category': 'database', 'proficiency': None},
#         {'original': 'komunikasi', 'normalized': 'komunikasi', 'category': 'soft_skills', 'proficiency': None}
#     ],
#     'total_count': 4,
#     'unique_count': 4,
#     'by_category': {
#         'programming': [...],
#         'frontend': [...],
#         'database': [...],
#         'soft_skills': [...]
#     }
# }
```

### Simple Extraction

```python
from skill_extractor import extract_skills

# Quick extraction (returns list of skill names)
skills = extract_skills("Python, React, MySQL, AWS")
print(skills)
# Output: ['python', 'react', 'mysql', 'aws']
```

### Batch Processing

```python
from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

texts = [
    "Python, Django, PostgreSQL",
    "React, TypeScript, Node.js",
    "AWS, Docker, Kubernetes"
]

results = extractor.batch_extract(texts)

for i, result in enumerate(results):
    print(f"Text {i+1}: {result['total_count']} skills, {len(result['by_category'])} categories")
```

### Get Top Skills

```python
from skill_extractor import IndonesianSkillExtractor

extractor = IndonesianSkillExtractor()

job_descriptions = [
    "Python, Django, React...",
    "Java, Spring, MySQL...",
    "Python, FastAPI, PostgreSQL..."
]

top_skills = extractor.get_top_skills(job_descriptions, top_n=5)
print(top_skills)
# Output: [('python', 2), ('react', 1), ('django', 1), ...]
```

## 📊 Features

### 1. Skill Normalization

Handles variations and aliases:

```python
from skill_extractor import extract_skills

# These all normalize to the same skill
texts = ["JS", "js", "JavaScript", "javascript"]

for text in texts:
    skills = extract_skills(text)
    print(skills)
# All output: ['javascript']
```

**40+ Aliases Supported:**

- js → javascript
- ts → typescript
- py → python
- reactjs, react.js → react
- nodejs → node.js
- pg, postgres → postgresql
- mongo → mongodb
- k8s → kubernetes

### 2. Proficiency Detection

Extracts skill levels from text:

```python
# Reuses the `extractor` instance created in the examples above
text = "Expert in Python, Advanced React, Basic MySQL"
result = extractor.extract(text)

for skill in result['skills']:
    print(f"{skill['normalized']}: {skill['proficiency']}")

# Output:
# python: expert
# react: expert (advanced maps to expert)
# mysql: beginner (basic maps to beginner)
```

**Proficiency Keywords:**

- **Expert**: expert, advanced, mahir, ahli, mastery
- **Intermediate**: intermediate, menengah, competent
- **Beginner**: beginner, basic, pemula, dasar

### 3. Indonesian Language Support

Handles Indonesian skill names and code-switching:

```python
# Reuses the `extractor` instance created in the examples above
text = "Komunikasi yang baik, kerja sama tim, kepemimpinan, Python"
result = extractor.extract(text)

for skill in result['skills']:
    print(f"{skill['original']} → {skill['category']}")

# Output:
# Komunikasi → soft_skills
# kerja sama tim → soft_skills
# kepemimpinan → soft_skills (leadership)
# Python → programming
```

### 4. Comprehensive Parsing

Handles multiple formats:

```python
# Comma-separated
extract_skills("Python, React, MySQL")

# Semicolon-separated
extract_skills("Python; React; MySQL")

# Bullet points
extract_skills("• Python • React • MySQL")

# Newline-separated
extract_skills("Python\nReact\nMySQL")

# Mixed with proficiency
extract_skills("Python (Expert), React (2 years), MySQL")
```

## 📈 Performance

| Metric | Value |
|--------|-------|
| **Speed** | 1000+ docs/second |
| **Model Size** | ~20 KB (pure Python) |
| **Dependencies** | None (stdlib only) |
| **Skills Covered** | 200+ |
| **Categories** | 7 |
| **Aliases** | 40+ |
| **Languages** | Indonesian + English |

### Comparison with ML Models

| Feature | Skill Extractor | BERT-based NER |
|---------|----------------|----------------|
| **Training Data** | Not required | Required (1000+ samples) |
| **Model Size** | 20 KB | 300+ MB |
| **Speed** | 1000+ docs/sec | 50 docs/sec |
| **Deterministic** | ✅ Yes | ❌ No |
| **Explainable** | ✅ Yes | ❌ No |
| **Easy to Update** | ✅ Just edit dict | ❌ Requires retraining |

## 🎯 Use Cases

### 1. Job-Candidate Matching

```python
# `job_description` and `resume_text` are your own input strings

# Extract skills from job posting
job_skills = set(extract_skills(job_description))

# Extract skills from resume
candidate_skills = set(extract_skills(resume_text))

# Calculate match percentage (0 when the posting yields no skills)
matching_skills = job_skills & candidate_skills
match_score = len(matching_skills) / len(job_skills) * 100 if job_skills else 0.0
```

### 2. Skills Gap Analysis

```python
# `job_postings` and `resumes` are lists of strings

# Get market demand
market_skills = extractor.get_top_skills(job_postings, top_n=20)

# Get candidate pool skills
candidate_skills = extractor.get_top_skills(resumes, top_n=20)

# Find gaps: skills in demand that the candidate pool does not cover
in_demand = {skill for skill, _ in market_skills}
available = {skill for skill, _ in candidate_skills}
skill_gaps = in_demand - available
```

### 3. Trend Analysis

```python
from collections import Counter

# `jobs` is a list of dicts with 'month' and 'requirements' keys

# Group by time period
skills_by_month = {}
for job in jobs:
    month = job['month']
    skills = extract_skills(job['requirements'])
    skills_by_month.setdefault(month, []).extend(skills)

# Analyze trends
for month, skills in skills_by_month.items():
    top_5 = Counter(skills).most_common(5)
    print(f"{month}: {top_5}")
```

### 4. Resume Screening

```python
required_skills = ['python', 'django', 'postgresql']
nice_to_have = ['react', 'docker', 'aws']

def score_resume(resume_text):
    candidate_skills = set(extract_skills(resume_text))

    # Required skills (2 points each)
    required_score = len(candidate_skills & set(required_skills)) * 2

    # Nice to have (1 point each)
    bonus_score = len(candidate_skills & set(nice_to_have)) * 1

    return required_score + bonus_score

# Rank candidates (each candidate is a dict with a 'resume' key)
candidates = [...]
ranked = sorted(candidates, key=lambda c: score_resume(c['resume']), reverse=True)
```

## 🔧 API Reference

### `IndonesianSkillExtractor`

Main class for skill extraction.

#### Methods:

**`extract(text: str) -> Dict`**
- Full extraction with metadata
- Returns: skills, counts, categories, proficiency

**`extract_simple(text: str) -> List[str]`**
- Simple extraction returning skill names
- Returns: List of normalized skill strings

**`batch_extract(texts: List[str]) -> List[Dict]`**
- Process multiple texts
- Returns: List of extraction results

**`get_top_skills(texts: List[str], top_n: int) -> List[Tuple]`**
- Get most frequent skills across texts
- Returns: List of (skill, count) tuples

**`get_stats() -> Dict`**
- Get model statistics
- Returns: version, total_skills, categories, etc.

### Convenience Functions

**`extract_skills(text: str) -> List[str]`**
- Quick one-line extraction
- Creates extractor instance automatically

## 📄 License

This model is released under the **MIT License**.

**Citation:**

```bibtex
@software{indonesian_skill_extractor_2024,
  author = {Herlambang Haryo Putro},
  title = {Indonesian Skill Extractor v1.0},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/herlambangharyoputro/indonesian-skill-extractor-v1}
}
```

## 🤝 Contributions

Part of the **Job Market Intelligence Platform** project.

**Repository**: [GitHub - job-market-intelligence-platform](https://github.com/herlambangharyoputro/job-market-intelligence-platform)

**Related Datasets:**

- [Indonesian Job Market - Raw 2024](https://huggingface.co/datasets/herlambangharyoputro/indonesian-job-market-raw-2024)
- [Indonesian Job Market - Tokenized 2024](https://huggingface.co/datasets/herlambangharyoputro/indonesian-job-market-tokenized-2024)

**Contributions welcome!** If you:

- Find missing skills or categories
- Have suggestions for improvements
- Want to add more language support
- Build interesting projects using this model

Please open an issue or pull request on GitHub.
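
When proposing a missing skill or alias, it can help to spell out the canonical name, its category, and the aliases that should map to it. The snippet below is only a sketch of that idea: `CATEGORIES`, `ALIASES`, and `normalize` are hypothetical names chosen for illustration, and the `svelte` entry is a made-up example. None of this is taken from `skill_extractor.py`, whose internal layout may differ.

```python
# Hypothetical illustration only; not the extractor's actual internals.
from typing import Optional, Tuple

# Canonical skill name -> category (a tiny subset, for illustration).
CATEGORIES = {
    "javascript": "programming",
    "kubernetes": "cloud",
    "postgresql": "database",
}

# Alias -> canonical skill name, using aliases mentioned in this README.
ALIASES = {
    "js": "javascript",
    "k8s": "kubernetes",
    "pg": "postgresql",
    "postgres": "postgresql",
}


def normalize(term: str) -> Optional[Tuple[str, str]]:
    """Return (canonical_name, category) for a term, or None if unknown."""
    key = term.strip().lower()
    canonical = ALIASES.get(key, key)
    category = CATEGORIES.get(canonical)
    return (canonical, category) if category is not None else None


# A proposed addition then amounts to one new entry in each mapping.
CATEGORIES["svelte"] = "frontend"
ALIASES["sveltejs"] = "svelte"

print(normalize("K8s"))       # ('kubernetes', 'cloud')
print(normalize("SvelteJS"))  # ('svelte', 'frontend')
print(normalize("cobol"))     # None
```

Stating a proposal in this shape makes it easy to see at a glance whether a new alias collides with an existing canonical name.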
## 📧 Contact

- **Author**: Herlambang Haryo Putro
- **Email**: herlambangharyoputro@gmail.com
- **GitHub**: [@herlambangharyoputro](https://github.com/herlambangharyoputro)
- **Project**: Job Market Intelligence Platform

## 🔄 Version History

- **v1.0.0** (December 2024): Initial release
  - 200+ skills across 7 categories
  - 40+ aliases for normalization
  - Proficiency level detection
  - Indonesian language support
  - Zero dependencies

## ⚠️ Limitations

### Coverage

- Limited to predefined skill taxonomy (200+ skills)
- New/emerging skills may be categorized as 'other'
- Domain-specific skills may not be recognized

### Language

- Primarily optimized for Indonesian job market
- May not capture all regional variations
- English technical terms preferred over Indonesian equivalents

### Accuracy

- Rule-based approach may miss context-dependent skills
- Acronyms can be ambiguous (e.g., "AI" = Artificial Intelligence or Adobe Illustrator)
- Proficiency detection based on keywords only

### Recommendations

- Best for structured skill lists (bullets, commas)
- Review 'other' category for domain-specific additions
- Combine with manual review for critical applications
- Consider ML-based approach for unstructured text

## 🎯 Future Improvements

Planned features for v2.0:

- Expanded skill taxonomy (300+ skills)
- Industry-specific categories
- Skill clustering and relationships
- Confidence scoring
- Multi-language support (Javanese, Sundanese)
- Experience year extraction
- Certification detection

---

**Last Updated**: December 2024
**Model Version**: 1.0.0
**Status**: ✅ Production Ready
**Type**: Rule-based NER

**For questions or collaboration, visit [GitHub](https://github.com/herlambangharyoputro/job-market-intelligence-platform).**