---
license: apache-2.0
base_model: hfl/chinese-macbert-base
library_name: transformers
pipeline_tag: text-classification
language:
- zh
tags:
- biomedical
- pharmacovigilance
- adverse-drug-reaction
- meddra
- chinese
- macbert
model-index:
- name: mention2meddra-macbert-base
  results:
  - task:
      type: text-classification
      name: Mention-candidate pair classification
    dataset:
      type: other
      name: Expert-annotated Chinese ADR normalization corpus
    metrics:
    - type: roc_auc
      value: 0.9966351963325706
      name: Pair-level ROC-AUC
    - type: f1
      value: 0.9589433131535499
      name: Pair-level F1
    - type: exact_match
      value: 0.8957816377171216
      name: Mention-level exact set match
    - type: f1
      value: 0.9586813186813188
      name: Mention-level micro-F1
---

# mention2meddra-macbert-base

This repository contains the fine-tuned MacBERT cross-encoder used as the second-stage reranker in a Chinese adverse drug reaction (ADR) mention-to-MedDRA normalization workflow. The first stage retrieves candidate Preferred Terms (PTs); this model scores each mention-candidate pair and outputs the probability that the candidate PT is a correct mapping for the mention.

The model weights and tokenizer files in this repository are publicly released under the Apache License 2.0.

## Model Details

- Base model: `hfl/chinese-macbert-base`
- Architecture: `BertForSequenceClassification`
- Task: binary mention-candidate pair classification
- Positive label: `match`
- Negative label: `not_match`
- Maximum sequence length used in the study: 64 tokens
- Candidate context: PT name, LLT aliases, HLT, HLGT, and SOC
- Recommended mention-level decoding threshold: 0.3 after top-100 BM25 retrieval

## Training Data Boundary

The model was fine-tuned on an expert-annotated Chinese ADR normalization corpus derived from a regional pharmacovigilance system.
This public model repository does not contain real adverse-event records, expert annotation files, licensed MedDRA dictionary files, source database extracts, or downstream signal-detection datasets. Full study reproduction requires the original source data, licensed terminology resources, the candidate retrieval stage, and the public code package.

## Evaluation

Held-out test performance reported for the associated Journal of Biomedical Informatics submission:

| Metric | Value |
| --- | ---: |
| Pair-level accuracy | 0.983664 |
| Pair-level precision | 0.962865 |
| Pair-level recall | 0.955054 |
| Pair-level F1 | 0.958943 |
| Pair-level ROC-AUC | 0.996635 |
| Mention-level exact set match | 0.895782 |
| Mention-level micro-F1 | 0.958681 |
| Mention-level top-1 accuracy | 0.977978 |
| Mention-level Recall@3 | 0.983561 |
| Mention-level Recall@5 | 0.999690 |

These metrics were obtained within the study corpus and evaluation protocol. External validation is required before applying the model to other regions, institutions, drug classes, MedDRA versions, or operational workflows.

## Intended Use

The model is intended for research use as a reranking component in Chinese ADR terminology normalization. It should be used with a candidate generator and licensed terminology resources. It is not a standalone clinical decision system and should not be used for clinical, regulatory, or safety actions without independent validation and appropriate expert review.

## License

The model weights, configuration files, tokenizer files, and model card in this repository are released under the Apache License 2.0. The license applies to the released artifacts in this repository and does not grant rights to non-distributed source datasets or licensed terminology resources.

## Input Format

The model expects a text pair:

- sequence A: the Chinese ADR mention or raw report phrase
- sequence B: a rendered candidate PT context containing PT, LLT, HLT, HLGT, and SOC fields

The public code repository contains the template and evaluation utilities: https://github.com/xumingjun5208/mention2meddra
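
The pair format and the decoding threshold described in this card can be sketched as below. This is a minimal illustration under stated assumptions, not the study's implementation. The `render_candidate` template is hypothetical (the actual field order and separators are defined in the public code repository), `model_dir` is a placeholder for a local copy of this repository, and the use of softmax over two-class logits, default pair truncation, and a `>=` comparison against the threshold are assumptions as well.

```python
from typing import Dict, List, Sequence, Tuple


def render_candidate(pt: str, llts: Sequence[str], hlt: str, hlgt: str, soc: str) -> str:
    """Flatten one candidate PT and its MedDRA context into sequence B.

    Hypothetical template: the real one lives in the public code repository.
    """
    llt_text = "、".join(llts) if llts else "-"
    return f"PT: {pt}; LLT: {llt_text}; HLT: {hlt}; HLGT: {hlgt}; SOC: {soc}"


def score_pairs(mention: str, candidates: List[str], model_dir: str) -> List[float]:
    """Return the estimated P(match) for each (mention, candidate) pair."""
    if not candidates:
        return []
    # Imported lazily so the pure-Python helpers work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()

    # Sequence A is the mention, sequence B the rendered candidate context.
    batch = tokenizer(
        [mention] * len(candidates),
        candidates,
        truncation=True,
        max_length=64,  # maximum sequence length reported in this card
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits
    # Assumption: the config maps the label name "match" to a class index;
    # index 1 is used as a fallback if it does not.
    match_id = model.config.label2id.get("match", 1)
    return torch.softmax(logits, dim=-1)[:, match_id].tolist()


def decode_mention(scored: Dict[str, float], threshold: float = 0.3) -> List[Tuple[str, float]]:
    """Keep candidates whose score reaches the threshold, highest score first."""
    kept = [(pt, p) for pt, p in scored.items() if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

With toy scores (illustrative values, not model output), `decode_mention({"恶心": 0.92, "呕吐": 0.41, "头痛": 0.02})` keeps the first two candidates and drops the third. Because decoding is threshold-based rather than top-1, a single mention can map to several PTs, which is consistent with the set-level metrics reported in the Evaluation section.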