---
license: apache-2.0
base_model: hfl/chinese-macbert-base
library_name: transformers
pipeline_tag: text-classification
language:
  - zh
tags:
  - biomedical
  - pharmacovigilance
  - adverse-drug-reaction
  - meddra
  - chinese
  - macbert
model-index:
  - name: mention2meddra-macbert-base
    results:
      - task:
          type: text-classification
          name: Mention-candidate pair classification
        dataset:
          type: other
          name: Expert-annotated Chinese ADR normalization corpus
        metrics:
          - type: roc_auc
            value: 0.9966351963325706
            name: Pair-level ROC-AUC
          - type: f1
            value: 0.9589433131535499
            name: Pair-level F1
          - type: exact_match
            value: 0.8957816377171216
            name: Mention-level exact set match
          - type: f1
            value: 0.9586813186813188
            name: Mention-level micro-F1
---

# mention2meddra-macbert-base

This repository contains the fine-tuned MacBERT cross-encoder used as the second-stage reranker in a Chinese adverse drug reaction (ADR) mention-to-MedDRA normalization workflow. The first stage retrieves candidate Preferred Terms (PTs); this model scores each mention-candidate pair and outputs the probability that the candidate PT is a correct mapping for the mention.

The model weights and tokenizer files in this repository are publicly released under the Apache License 2.0.

## Model Details

- Base model: `hfl/chinese-macbert-base`
- Architecture: `BertForSequenceClassification`
- Task: binary mention-candidate pair classification
- Positive label: `match`
- Negative label: `not_match`
- Maximum sequence length used in the study: 64 tokens
- Candidate context: PT name, Lowest Level Term (LLT) aliases, High Level Term (HLT), High Level Group Term (HLGT), and System Organ Class (SOC)
- Recommended mention-level decoding: apply a match-probability threshold of 0.3 to the top-100 candidates retrieved by BM25
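
The decoding step listed above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function name `decode_mention`, the inclusive comparison (`>=`), the empty-list result when no candidate qualifies, and the example scores are all assumptions made here.

```python
from typing import Dict, List

THRESHOLD = 0.3  # recommended mention-level threshold from this model card


def decode_mention(scores: Dict[str, float], threshold: float = THRESHOLD) -> List[str]:
    """Return the candidate PTs whose match probability clears the threshold.

    `scores` maps each retrieved candidate PT to the reranker's match
    probability. Treating the threshold as inclusive and returning an empty
    list when nothing qualifies are assumptions of this sketch.
    """
    kept = [(pt, p) for pt, p in scores.items() if p >= threshold]
    # Highest-probability candidates first.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [pt for pt, _ in kept]


# Illustrative scores only; these are not model outputs.
example = {"头痛": 0.94, "偏头痛": 0.41, "头晕": 0.07}
print(decode_mention(example))
```

Because a mention may map to more than one PT, the sketch returns every candidate above the threshold rather than only the top-scoring one.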

## Training Data Boundary

The model was fine-tuned on an expert-annotated Chinese ADR normalization corpus derived from a regional pharmacovigilance system. This public model repository does not contain real adverse-event records, expert annotation files, licensed MedDRA dictionary files, source database extracts, or downstream signal-detection datasets.

Full study reproduction requires the original source data, licensed terminology resources, the candidate retrieval stage, and the public code package.

## Evaluation

Held-out test performance reported for the associated Journal of Biomedical Informatics submission:

| Metric | Value |
| --- | ---: |
| Pair-level accuracy | 0.983664 |
| Pair-level precision | 0.962865 |
| Pair-level recall | 0.955054 |
| Pair-level F1 | 0.958943 |
| Pair-level ROC-AUC | 0.996635 |
| Mention-level exact set match | 0.895782 |
| Mention-level micro-F1 | 0.958681 |
| Mention-level top-1 accuracy | 0.977978 |
| Mention-level Recall@3 | 0.983561 |
| Mention-level Recall@5 | 0.999690 |

These metrics were obtained on the study corpus under the study's evaluation protocol. External validation is required before applying the model to other regions, institutions, drug classes, MedDRA versions, or operational workflows.
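
For readers unfamiliar with the mention-level metrics, the sketch below shows one common way to compute exact set match and micro-averaged F1 over sets of PTs. It assumes the standard definitions (exact equality of predicted and gold sets; true positives, false positives, and false negatives pooled across mentions). The study's exact protocol may differ, and the function names and toy data are hypothetical.

```python
from typing import List, Set, Tuple


def exact_set_match(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of mentions whose predicted PT set equals the gold PT set."""
    hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return hits / len(gold)


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 pooled over all mentions."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correct PTs
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious PTs
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed PTs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: three mentions with placeholder PT identifiers.
gold = [{"A"}, {"B", "C"}, {"D"}]
pred = [{"A"}, {"B"}, {"D", "E"}]
print(exact_set_match(gold, pred))
print(micro_prf(gold, pred))
```

In the toy data only the first mention is an exact match, while the pooled counts are 3 true positives, 1 false positive, and 1 false negative, which illustrates why micro-F1 can sit well above exact set match.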

## Intended Use

The model is intended for research use as a reranking component in Chinese ADR terminology normalization. It should be used with a candidate generator and licensed terminology resources. It is not a standalone clinical decision system and should not be used for clinical, regulatory, or safety actions without independent validation and appropriate expert review.

## License

The model weights, configuration files, tokenizer files, and model card in this repository are released under the Apache License 2.0. The license applies to the released artifacts in this repository and does not grant rights to non-distributed source datasets or licensed terminology resources.

## Input Format

The model expects a text pair:

- sequence A: the Chinese ADR mention or raw report phrase
- sequence B: a rendered candidate PT context containing PT, LLT, HLT, HLGT, and SOC fields

The public code repository contains the template and evaluation utilities:

https://github.com/xumingjun5208/mention2meddra
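
As a rough illustration of the pair format described in this section, the sketch below renders a candidate context and scores mention-candidate pairs with `transformers`. Several parts are assumptions: `render_candidate` is a hypothetical stand-in for the study's template (which lives in the code repository), the field values are placeholders rather than MedDRA content, `model_path` must point to wherever these weights are stored, and the scoring code assumes a two-class softmax head whose config maps the label name `match` to a class index.

```python
from typing import List, Sequence

MAX_LENGTH = 64  # maximum sequence length reported in this model card


def render_candidate(pt: str, llts: Sequence[str], hlt: str, hlgt: str, soc: str) -> str:
    """Hypothetical rendering of sequence B; not the study's actual template."""
    return f"PT: {pt}; LLT: {', '.join(llts)}; HLT: {hlt}; HLGT: {hlgt}; SOC: {soc}"


def score_pairs(model_path: str, mention: str, candidates: List[str]) -> List[float]:
    """Return the match probability for each (mention, candidate) pair."""
    # Imported lazily so render_candidate works without torch/transformers.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()

    # Sequence A is the mention, sequence B is the rendered candidate context.
    enc = tokenizer(
        [mention] * len(candidates),
        candidates,
        truncation=True,
        max_length=MAX_LENGTH,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    # Assumption: the released config maps the label name "match" to an index.
    match_idx = model.config.label2id["match"]
    return probs[:, match_idx].tolist()


# Placeholder field values for illustration only.
context = render_candidate("头痛", ["头疼"], "HLT 示例", "HLGT 示例", "SOC 示例")
print(context)
# scores = score_pairs("path/to/mention2meddra-macbert-base", "头很痛", [context])
```

The `score_pairs` call is left commented out because it needs the model files and the `torch` and `transformers` packages; the returned probabilities can then be compared against the decoding threshold.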