---
language:
- ru
license: apache-2.0
tags:
- medical
- icd-10
- multi-label-classification
- russian
- conditional-distillation
base_model: alexyalunin/RuBioBERT
pipeline_tag: text-classification
---

# ICD-10 subgroup classifier - group C (distilled specialist)

Multi-label classifier over 3-character ICD-10 subgroups inside chapter **C**.
This specialist was distilled from local BERT teacher models into `alexyalunin/RuBioBERT`.
Teacher weights are not uploaded to Hugging Face.

## Intended use

Decision-support signal for suggesting candidate 3-character ICD-10 subgroups from Russian clinical notes. **Not** a substitute for clinician judgment; not validated for autonomous diagnosis and not intended for autonomous clinical decisions.

## Training data

- Source CSV: `datasets/subgroups/group_C.csv`
- SHA-256: `9526d7bd571f6aa94d0e162b727474a36dc63f71f79f0d78f400195b786bec26`
- Splits: train=346 · val=75 · test=75
- Labels: 70; rare/interface-only ids are listed in `label_map.json`.

## Training route

- Approach: `direct_hard_training_no_distillation`
- Base model: `alexyalunin/RuBioBERT`
- Direct validation hit@3: `0.96`
- No-distillation threshold: `0.9`
- Teacher models (fallback KD only): `[]`
- Selected KD config (fallback only): temperature=`None`, hard_loss_weight=`None`

## Metrics (test split)

| metric | final specialist | teacher ensemble / fallback |
|---|---:|---:|
| macro_f1 | 0.5899 | |
| micro_f1 | 0.5926 | |
| weighted_f1 | 0.6300 | |
| subset_accuracy | 0.2133 | |
| hit@1 | 0.8667 | |
| hit@3 | 0.9200 | |
| recall@3 | 0.8813 | |
| mrr | 0.9023 | |

Full per-label breakdown is available in `metrics.json`.

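The card does not spell out how the ranking metrics in the table are defined. The sketch below shows one common set of definitions for multi-label ranking (hit@k, recall@k, and reciprocal rank of the first correct label), so the numbers are easier to interpret. The function names and the toy scores are hypothetical and are not taken from the training code; the actual evaluation may differ in details such as tie-breaking.

```python
from typing import List, Sequence, Set


def rank_labels(scores: Sequence[float]) -> List[int]:
    """Return label indices sorted by descending score (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def hit_at_k(scores: Sequence[float], gold: Set[int], k: int) -> float:
    """1.0 if at least one gold label is among the k top-scored labels."""
    return float(any(i in gold for i in rank_labels(scores)[:k]))


def recall_at_k(scores: Sequence[float], gold: Set[int], k: int) -> float:
    """Fraction of gold labels that appear among the k top-scored labels."""
    top = rank_labels(scores)[:k]
    return sum(i in gold for i in top) / len(gold)


def reciprocal_rank(scores: Sequence[float], gold: Set[int]) -> float:
    """1 / rank of the first gold label in the ranking (0.0 if none is found)."""
    for rank, i in enumerate(rank_labels(scores), start=1):
        if i in gold:
            return 1.0 / rank
    return 0.0


# Toy example (assumed data): two notes scored over four labels.
scores_a, gold_a = [0.1, 0.8, 0.6, 0.2], {2, 3}
scores_b, gold_b = [0.9, 0.05, 0.3, 0.2], {0}

examples = [(scores_a, gold_a), (scores_b, gold_b)]
mean_hit1 = sum(hit_at_k(s, g, 1) for s, g in examples) / len(examples)
mean_rr = sum(reciprocal_rank(s, g) for s, g in examples) / len(examples)
print(mean_hit1, mean_rr)
```

Under these definitions, a high hit@3 combined with a low subset accuracy means the correct subgroup is usually near the top of the ranking even when the exact thresholded label set is not reproduced.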
## Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "Dmitry43243242/icd10-ru-subgroup-c"
tok = AutoTokenizer.from_pretrained(repo)
mdl = AutoModelForSequenceClassification.from_pretrained(repo)
mdl.eval()  # disable dropout for inference

# Russian clinical note; the placeholder reads "patient complaints..."
text = "жалобы пациента..."
inp = tok(text, return_tensors="pt", truncation=True, max_length=512)

# Multi-label head: apply a sigmoid to each logit independently.
with torch.no_grad():
    probs = torch.sigmoid(mdl(**inp).logits)[0]

# Labels whose probability reaches the 0.5 threshold.
preds = [mdl.config.id2label[i] for i, p in enumerate(probs.tolist()) if p >= 0.5]

# Five highest-scoring labels with their probabilities.
top5 = sorted(
    [(mdl.config.id2label[i], p) for i, p in enumerate(probs.tolist())],
    key=lambda x: -x[1],
)[:5]
print(preds, top5)
```
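With a fixed 0.5 threshold, `preds` can come back empty for some notes. As a minimal sketch, assuming a fallback to the top-ranked candidates is acceptable for a decision-support setting, the helper below returns every label at or above the threshold and otherwise the `k` best-scoring labels. The function `suggest_codes`, its defaults, and the toy label map are hypothetical; in practice `probs.tolist()` and `mdl.config.id2label` from the snippet above would be passed in.

```python
from typing import Dict, List, Sequence, Tuple


def suggest_codes(
    probs: Sequence[float],
    id2label: Dict[int, str],
    threshold: float = 0.5,  # same cut-off as the snippet above
    k: int = 3,              # assumed fallback size, not from the card
) -> List[Tuple[str, float]]:
    """Return (label, probability) pairs, best first.

    Labels at or above `threshold` are returned; if none qualify,
    the `k` highest-scoring labels are returned instead.
    """
    ranked = sorted(
        ((id2label[i], p) for i, p in enumerate(probs)),
        key=lambda pair: -pair[1],
    )
    above = [(label, p) for label, p in ranked if p >= threshold]
    return above if above else ranked[:k]


# Toy label map for illustration only; the real mapping lives in the model config.
toy_labels = {0: "C16", 1: "C18", 2: "C34", 3: "C50"}

confident = suggest_codes([0.2, 0.7, 0.55, 0.1], toy_labels)
uncertain = suggest_codes([0.3, 0.1, 0.4, 0.2], toy_labels)
print(confident)
print(uncertain)
```

Whether a fallback list should be shown at all, and how large it should be, is a product decision for the reviewing clinician's workflow rather than something this card prescribes.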