DNivalis committed
Commit c6b1c19 · verified · 1 Parent(s): 5244cad

Update README.md

Files changed (1): README.md +132 -1
README.md CHANGED
@@ -4,4 +4,135 @@ language:
  - en
  base_model:
  - FacebookAI/roberta-large
- ---
+ tags:
+ - token-classification
+ - named-entity-recognition
+ - medical-nlp
+ - crf
+ - biomedical
+ - jargon-identification
+ ---
+
+ # Medical Jargon Identifier with CRF
+
+ A PyTorch model that performs **fine-grained medical jargon identification** using a **RoBERTa-large** backbone followed by a **Conditional Random Field (CRF)** layer.
+ It is fine-tuned on the **MedReadMe** dataset introduced by Jiang & Xu (2024).
+
+ ---
+
+ ## 🧠 Overview
+
+ * **Architecture**: RoBERTa-large → Linear classifier → CRF
+ * **Task**: Token-level classification into **7 medical jargon categories** + BIO tagging
+ * **Input**: Raw English text (sentences or paragraphs)
+ * **Output**: Word-level spans labeled with jargon type and boundaries
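
To make the last stage of the architecture listed above concrete, the sketch below shows Viterbi decoding, the standard way a linear-chain CRF picks the best tag sequence from per-token emission scores and tag-to-tag transition scores. It is a minimal pure-Python illustration and not the code in `modeling_jargon.py`; the function name `viterbi_decode` and the toy scores are assumptions made for this example.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][j]   : score of tag j at position t (e.g. linear-layer logits)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    # score[j] = best score of any path that ends in tag j at the current position
    score = list(emissions[0])
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at position t
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
        score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag to recover the path.
    best_last = max(range(num_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example with tags 0 = O, 1 = B, 2 = I (illustrative scores only).
# The large negative O -> I transition forbids an I tag directly after O.
transitions = [
    [0.0, 0.0, -1e4],  # from O
    [0.0, 0.0, 0.0],   # from B
    [0.0, 0.0, 0.0],   # from I
]
emissions = [
    [2.0, 0.0, 0.0],   # token 0 prefers O
    [0.0, 1.0, 1.5],   # token 1 slightly prefers I, but O -> I is forbidden
    [0.0, 0.0, 2.0],   # token 2 prefers I
]
best_path = viterbi_decode(emissions, transitions)
print(best_path)  # → [0, 1, 2], i.e. O, B, I
```

A per-token argmax over the same emissions would give O, I, I, which is not a valid BIO sequence; the transition scores are what let the CRF prefer O, B, I instead.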
+
+ ---
+
+ ## 🎯 Supported Jargon Categories
+
+ | Label (BIO)                  | Meaning                                  |
+ | ---------------------------- | ---------------------------------------- |
+ | `medical-jargon-google-easy` | Easily Google-able medical terms         |
+ | `medical-jargon-google-hard` | Complex, hard-to-Google medical terms    |
+ | `medical-name-entity`        | Named diseases, drugs, procedures        |
+ | `general-complex`            | Complex general vocabulary               |
+ | `abbr-medical`               | Medical abbreviations (e.g., ECG, CBC)   |
+ | `abbr-general`               | General abbreviations                    |
+ | `general-medical-multisense` | Words with both lay and medical meanings |
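
Assuming the usual BIO convention, in which each category gets a `B-` (begin) and an `I-` (inside) tag and a single `O` tag marks everything else, the seven categories above expand to 15 token-level labels. The sketch below builds such a label map. The exact label strings and the ID order are assumptions; the authoritative mapping is the label map stored in `config.json`.

```python
# The seven category names from the table above.
CATEGORIES = [
    "medical-jargon-google-easy",
    "medical-jargon-google-hard",
    "medical-name-entity",
    "general-complex",
    "abbr-medical",
    "abbr-general",
    "general-medical-multisense",
]


def build_bio_labels(categories):
    """Expand category names into O / B-<cat> / I-<cat> labels (assumed scheme)."""
    labels = ["O"]  # assumption: the outside tag comes first and gets ID 0
    for category in categories:
        labels.append(f"B-{category}")
        labels.append(f"I-{category}")
    return labels


labels = build_bio_labels(CATEGORIES)
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(labels))  # → 15
```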
+
+ ---
+
+ ## 📁 Files & Format
+
+ * `pytorch_model.bin` – model weights
+ * `config.json` – hyper-parameters & label map
+ * `tokenizer.json`, `vocab.json`, `merges.txt` – RoBERTa tokenizer assets
+ * `modeling_jargon.py` – custom `CRFTokenClassificationModel` class
+ * `requirements.txt` – runtime dependencies
+
+ ---
+
+ ## 🔧 Quick Start
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+ from modeling_jargon import CRFTokenClassificationModel  # custom class shipped in this repo
+
+ # Placeholder: replace with the actual repository ID
+ model_name = "your-username/medical-jargon-crf"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
+ model = CRFTokenClassificationModel.from_pretrained(model_name)
+ model.eval()
+
+ text = "The patient presented with elevated CRP and intermittent AF."
+ inputs = tokenizer(text, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs)["logits"]
+     tags = model.decode(logits, inputs["attention_mask"])[0]
+
+ # Convert IDs → labels, keeping (token index, label) pairs for non-zero tags
+ # (this assumes label ID 0 is the outside tag)
+ id2label = model.config.id2label
+ spans = [(i, id2label[t]) for i, t in enumerate(tags) if t != 0]
+ ```
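
The snippet above yields one label per subword token, while the Overview describes word-level spans as the output. The sketch below shows one way to merge per-word BIO labels into spans. The helper `bio_to_spans` and the example labels are hypothetical (they are not model output), and mapping subword tokens back to words, for instance with the fast tokenizer's `word_ids()`, is left out for brevity.

```python
from typing import List, Tuple


def bio_to_spans(words: List[str],
                 labels: List[str]) -> List[Tuple[str, str, int, int]]:
    """Merge per-word BIO labels into (text, category, start, end) spans.

    `end` is exclusive. A stray I- tag with no open span of the same
    category is treated as the start of a new span.
    """
    spans = []
    start, category = None, None
    for i, label in enumerate(labels):
        continues = label.startswith("I-") and category == label[2:]
        # Close the open span unless this word continues it.
        if start is not None and not continues:
            spans.append((" ".join(words[start:i]), category, start, i))
            start, category = None, None
        # Open a new span on B-, or on an I- that has nothing to continue.
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, category = i, label[2:]
    if start is not None:  # span that runs to the end of the sentence
        spans.append((" ".join(words[start:]), category, start, len(words)))
    return spans


# Hypothetical labels for illustration only.
words = ["Acute", "myocardial", "infarction", "was", "ruled", "out", "by", "ECG", "."]
labels = [
    "B-medical-jargon-google-easy",
    "I-medical-jargon-google-easy",
    "I-medical-jargon-google-easy",
    "O", "O", "O", "O",
    "B-abbr-medical",
    "O",
]
word_spans = bio_to_spans(words, labels)
for span in word_spans:
    print(span)
```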
+
+ ---
+
+ ## 🏥 Supported Tasks
+
+ * **Medical jargon detection** – binary, 3-class, or 7-category granularity
+ * **Named-entity recognition** – extract spans of medical interest
+ * **Readability analysis** – density of jargon per sentence or document
+ * **Downstream QA & summarization** – filter or simplify complex terms
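
As a sketch of the readability use case, jargon density can be computed as the fraction of words that carry any non-`O` tag, which is the same as collapsing the seven categories into the binary setting. The functions `to_binary` and `jargon_density` and the example labels are hypothetical; the 3-class grouping is not defined in this README, so it is not shown.

```python
from typing import List


def to_binary(labels: List[str]) -> List[str]:
    """Collapse fine-grained BIO labels into 'jargon' vs. 'O'."""
    return ["O" if label == "O" else "jargon" for label in labels]


def jargon_density(labels: List[str]) -> float:
    """Fraction of words tagged as any kind of jargon (0.0 for empty input)."""
    if not labels:
        return 0.0
    return sum(label != "O" for label in labels) / len(labels)


# Hypothetical per-word labels for a 10-word sentence (not model output).
sentence_labels = [
    "O", "O", "O", "O", "O",
    "B-abbr-medical", "O", "O", "B-abbr-medical", "O",
]
binary_labels = to_binary(sentence_labels)
density = jargon_density(sentence_labels)
print(f"{density:.2f}")  # → 0.20
```

Document-level density can be obtained the same way by concatenating the label lists of all sentences before calling `jargon_density`.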
+
+ ---
+
+ ## 🌍 Language
+
+ English only.
+
+ ---
+
+ ## 📚 Training Data
+
+ Fine-tuned on **MedReadMe**: 4,520 sentences with fine-grained jargon span annotations, including the novel *Google-Easy* and *Google-Hard* categories.
+
+ ---
+
+ ## 📖 Citation
+
+ If you use this model or the underlying dataset, please cite:
+
+ ```bibtex
+ @article{jiang2024medreadmesystematicstudyfinegrained,
+   title={MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain},
+   author={Chao Jiang and Wei Xu},
+   year={2024},
+   eprint={2405.02144},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2405.02144}
+ }
+ ```
+
+ ---
+
+ ## 📝 License & Usage
+
+ Licensed under **Apache 2.0**.
+
+ * ✅ Allowed: research, commercial use, derivative works
+ * Include license notice and attribution in any distribution
+
+ ---
+
+ ## ⚠️ Important Notes
+
+ * Model outputs are **not medical advice**; use for research/educational purposes only.
+ * Performance may vary on text that differs substantially from the MedReadMe training domain.
+ * Consider additional post-processing for production systems (e.g., confidence filtering).
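
The decode step in Quick Start returns tag IDs only, so confidence filtering needs some score per token. One rough option, assumed here rather than taken from the model code, is the softmax probability that the emission logits assign to the decoded tag. This ignores the CRF transition scores, so it is a heuristic and not a calibrated CRF marginal. The helper names, the toy logits, and the 0.6 threshold are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_low_confidence(token_logits: List[List[float]],
                          tags: List[int],
                          threshold: float = 0.6,
                          outside_id: int = 0) -> List[int]:
    """Reset a decoded tag to `outside_id` when its softmax probability is low."""
    filtered = []
    for logits, tag in zip(token_logits, tags):
        confidence = softmax(logits)[tag]
        filtered.append(tag if confidence >= threshold else outside_id)
    return filtered


# Toy emission logits for three tokens and three tags (0 = O); illustrative only.
token_logits = [
    [4.0, 0.0, 0.0],  # confident O
    [0.0, 3.0, 0.0],  # confident tag 1
    [0.0, 0.6, 0.5],  # tag 1 barely ahead, so low confidence
]
decoded_tags = [0, 1, 1]
kept_tags = filter_low_confidence(token_logits, decoded_tags)
print(kept_tags)  # → [0, 1, 0]
```

Resetting single tokens can split a multi-word span, so in practice the threshold might be applied per span (for example to the minimum or mean token confidence) instead of per token.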
+
+ ---
+
+ ## ☎️ Contact
+
+ For questions, issues, or licensing inquiries, open an issue on the [model repository](https://huggingface.co/DNivalis/med-jargon-crf).