 - en
 base_model:
 - FacebookAI/roberta-large
+tags:
+- token-classification
+- named-entity-recognition
+- medical-nlp
+- crf
+- biomedical
+- jargon-identification
+---
+# Medical Jargon Identifier with CRF
+A PyTorch model that performs **fine-grained medical jargon identification** using a **RoBERTa-large** backbone enhanced by a **Conditional Random Field (CRF)** layer.
+Fine-tuned on the **MedReadMe** dataset introduced by Jiang & Xu (2024).
+---
+## 🧠 Overview
+* **Architecture**: RoBERTa-large → Linear classifier → CRF
+* **Task**: Token-level classification into **7 jargon categories** with BIO tagging
+* **Input**: Raw English text (sentences or paragraphs)
+* **Output**: Word-level spans labeled with jargon type and boundaries
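To illustrate what a CRF layer contributes on top of a linear classifier, here is a minimal Viterbi decoding sketch in plain Python. It is a generic illustration, not the code in `modeling_jargon.py`: the function name `viterbi`, the three-tag toy setup (O, B, I), and all scores are assumptions, and start/end transition scores are left out for brevity.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t (e.g. classifier logits)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    score = list(emissions[0])          # best score of any path ending in each tag
    backpointers = []
    for emission in emissions[1:]:
        next_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at this position
            best_prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            next_score.append(score[best_prev] + transitions[best_prev][j] + emission[j])
            pointers.append(best_prev)
        score = next_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example with hypothetical tag IDs: 0 = O, 1 = B, 2 = I
emissions = [[1.0, 0.8, -1.0], [0.0, 0.2, 1.0]]
transitions = [
    [0.0, 0.0, -1e4],  # O -> I is heavily penalised (invalid in BIO)
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
decoded = viterbi(emissions, transitions)
print(greedy)   # [0, 2]  (O, I: an invalid BIO sequence)
print(decoded)  # [1, 2]  (B, I)
```

Per-token argmax picks an `I` tag directly after `O`, while the transition scores steer the decoder to a well-formed `B`, `I` sequence.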
+---
+## 🎯 Supported Jargon Categories
+| Label (BIO)                          | Meaning                                      |
+| ------------------------------------ | -------------------------------------------- |
+| `medical-jargon-google-easy`         | Medical terms understandable after a quick Google search |
+| `medical-jargon-google-hard`         | Medical terms that need extensive searching to understand |
+| `medical-name-entity`                | Named diseases, drugs, procedures            |
+| `general-complex`                    | Complex general vocabulary                   |
+| `abbr-medical`                       | Medical abbreviations (e.g., ECG, CBC)       |
+| `abbr-general`                       | General abbreviations                        |
+| `general-medical-multisense`         | Words with both lay and medical meanings     |
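With BIO tagging, each of the seven categories typically expands into a `B-` and an `I-` label, plus a shared `O` label, giving 15 labels in total. The sketch below shows one way such a label map could be built. The category names are copied from the table; the ordering, the `O`-first convention, and the helper name `build_bio_labels` are assumptions, so the label map in `config.json` remains the authoritative source.

```python
# Category names from the table above; their order here is an assumption.
CATEGORIES = [
    "medical-jargon-google-easy",
    "medical-jargon-google-hard",
    "medical-name-entity",
    "general-complex",
    "abbr-medical",
    "abbr-general",
    "general-medical-multisense",
]


def build_bio_labels(categories):
    """Return ["O", "B-<cat>", "I-<cat>", ...] for the given category names."""
    labels = ["O"]
    for name in categories:
        labels.append(f"B-{name}")  # first word of a span
        labels.append(f"I-{name}")  # continuation of a span
    return labels


labels = build_bio_labels(CATEGORIES)
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 15
```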
+---
+## 📁 Files & Format
+* `pytorch_model.bin` – model weights
+* `config.json` – hyper-parameters & label map
+* `tokenizer.json`, `vocab.json`, `merges.txt` – RoBERTa tokenizer assets
+* `modeling_jargon.py` – custom `CRFTokenClassificationModel` class
+* `requirements.txt` – runtime dependencies
+---
+## 🔧 Quick Start
+```python
+import torch
+from transformers import AutoTokenizer
+# modeling_jargon.py ships with this repository and must be importable locally
+from modeling_jargon import CRFTokenClassificationModel
+model_name = "DNivalis/med-jargon-crf"
+tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
+model = CRFTokenClassificationModel.from_pretrained(model_name)
+model.eval()
+text = "The patient presented with elevated CRP and intermittent AF."
+inputs = tokenizer(text, return_tensors="pt")
+with torch.no_grad():
+    logits = model(**inputs)["logits"]
+    # Decode with the CRF; [0] selects the tag IDs of the first (only) sequence
+    tags = model.decode(logits, inputs["attention_mask"])[0]
+# Convert IDs → labels, keeping only tagged tokens (assumes ID 0 is the "O" tag)
+id2label = model.config.id2label
+spans = [(i, id2label[t]) for i, t in enumerate(tags) if t != 0]
+```
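The Quick Start produces one tag per token, whereas the card describes word-level spans as the output. A minimal sketch of the grouping step follows, assuming the tags have already been aligned to words (for example by keeping the tag of each word's first subword). The helper name `bio_to_spans`, the lenient handling of stray `I-` tags, and the example labels are illustrative assumptions, not actual model output.

```python
def bio_to_spans(words, labels):
    """Group per-word BIO labels into (start, end, category, text) spans.

    `end` is exclusive. An I- tag that does not continue a span of the same
    category is treated as the start of a new span (a lenient convention).
    """
    spans = []
    start, category = None, None
    for i, label in enumerate(labels):
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == category
        if start is not None and not continues:
            # Close the span that was open before this word
            spans.append((start, i, category, " ".join(words[start:i])))
            start, category = None, None
        if prefix in ("B", "I") and start is None:
            start, category = i, name
    if start is not None:
        spans.append((start, len(labels), category, " ".join(words[start:])))
    return spans


# Hand-written labels for illustration only
words = ["Atrial", "fibrillation", "raised", "her", "CRP", "."]
labels = ["B-medical-name-entity", "I-medical-name-entity", "O", "O", "B-abbr-medical", "O"]
spans = bio_to_spans(words, labels)
for span in spans:
    print(span)
# (0, 2, 'medical-name-entity', 'Atrial fibrillation')
# (4, 5, 'abbr-medical', 'CRP')
```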
+---
+## 🏥 Supported Tasks
+* **Medical jargon detection** – binary, 3-class, or 7-category granularity
+* **Named-entity recognition** – extract spans of medical interest
+* **Readability analysis** – density of jargon per sentence or document
+* **Downstream QA & summarization** – filter or simplify complex terms
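For the readability-analysis use case, jargon density can be approximated as the share of words that fall inside any predicted span. The sketch below assumes per-word BIO labels are already available; the function names `jargon_density` and `spans_per_category` and the example labels are hypothetical.

```python
from collections import Counter


def jargon_density(labels):
    """Fraction of words tagged as part of any jargon span (0.0 for empty input)."""
    if not labels:
        return 0.0
    inside = sum(1 for label in labels if label != "O")
    return inside / len(labels)


def spans_per_category(labels):
    """Count spans per category by counting B- tags."""
    return Counter(label[2:] for label in labels if label.startswith("B-"))


# Hand-written labels for a six-word sentence, for illustration only
labels = ["B-medical-name-entity", "I-medical-name-entity", "O", "O", "B-abbr-medical", "O"]
density = jargon_density(labels)
counts = spans_per_category(labels)
print(density)        # 0.5
print(dict(counts))   # {'medical-name-entity': 1, 'abbr-medical': 1}
```

Document-level density can be obtained the same way by concatenating the label lists of all sentences, or by averaging the per-sentence values if each sentence should weigh equally.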
+---
+## 🌍 Language
+English only.
+---
+## 📚 Training Data
+Fine-tuned on **MedReadMe**: 4,520 sentences with fine-grained jargon span annotations, including the novel *Google-Easy* and *Google-Hard* categories.
+---
+## 📖 Citation
+If you use this model or the underlying dataset, please cite:
+```bibtex
+@article{jiang2024medreadmesystematicstudyfinegrained,
+  title={MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain},
+  author={Chao Jiang and Wei Xu},
+  year={2024},
+  eprint={2405.02144},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2405.02144}
+}
+```
+---
+## 📝 License & Usage
+Licensed under **Apache 2.0**.
+* ✅ Allowed: research, commercial use, derivative works
+* Include license notice and attribution in any distribution
+---
+## ⚠️ Important Notes
+* Model outputs are **not medical advice**; use for research/educational purposes only.
+* Performance may vary on text that differs substantially from the MedReadMe training domain.
+* Consider additional post-processing for production systems (e.g., confidence filtering).
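One possible form of confidence filtering is sketched below: each span is scored by the mean softmax probability of its decoded tags, and spans below a threshold are dropped. This is only a rough proxy, since softmax over the classifier logits ignores the CRF transition scores; CRF marginal probabilities would be more principled. The function names, the span tuple layout `(start, end, ...)` with exclusive `end`, and the threshold value are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def span_confidence(logits, tags, start, end):
    """Mean softmax probability of the decoded tag over positions [start, end)."""
    probs = [softmax(logits[i])[tags[i]] for i in range(start, end)]
    return sum(probs) / len(probs)


def filter_spans(spans, logits, tags, threshold=0.5):
    """Keep spans whose confidence is at least `threshold`."""
    return [s for s in spans if span_confidence(logits, tags, s[0], s[1]) >= threshold]


# Toy two-tag example: position 0 is confident, position 1 is borderline
logits = [[0.0, 4.0], [0.0, 0.1]]
tags = [1, 1]
spans = [(0, 1, "abbr-medical"), (1, 2, "abbr-medical")]
kept = filter_spans(spans, logits, tags, threshold=0.9)
print(kept)  # [(0, 1, 'abbr-medical')]
```

A suitable threshold depends on the application and would need tuning on held-out data.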
+---
+## ☎️ Contact
+For questions, issues, or licensing inquiries, open an issue on the [model repository](https://huggingface.co/DNivalis/med-jargon-crf).