---
license: llama3
language:
- en
library_name: peft
pipeline_tag: text-generation
base_model: vineetdaniels/NYXMed-V16-Model
tags:
- medical
- radiology
- medical-coding
- icd-10
- cpt
- llama-3
- llama-3-70b
- lora
- peft
- healthcare
- clinical
---
# NYXMed V17: Radiology Medical Coding LLM

<p align="center">
<em>Llama-3-70B fine-tune for autonomous CPT, ICD-10, and modifier coding from radiology reports.</em>
</p>

<p align="center">
<a href="https://huggingface.co/vineetdaniels/NYXMed-V16-Model">V16 (base)</a> ·
<a href="https://huggingface.co/vineetdaniels/NYXMed-V17-Model">V17 (this model)</a> ·
<a href="https://huggingface.co/vineetdaniels/NYXMed-V17-Epoch1">V17 Epoch-1 (frozen checkpoint)</a>
</p>

---
## TL;DR

V17 is a **LoRA adapter** trained on top of [`vineetdaniels/NYXMed-V16-Model`](https://huggingface.co/vineetdaniels/NYXMed-V16-Model) (a Llama-3-70B fine-tune). It was trained on **113,032 coder-reviewed radiology cases** with a focus on raising ICD-10 accuracy without regressing CPT or modifier performance.

| Metric | V16 (base) | **V17 (this)** | Δ |
|---|---|---|---|
| **CPT exact match** | ~85% | **90.6%** | **+5.6 pts** |
| **Modifier exact match** | ~95% | **97.0%** | +2.0 pts |
| **Mean ICD recall** | ~65% | **83.4%** | **+18.4 pts** |
| **Final eval_loss** | ~0.25 | **0.0824** | **−67%** |
| Train examples | ~67K | **113,032** | +69% |
| Adds Exam Description + Reason | ❌ | ✅ | n/a |

V17 was trained to push **ICD recall above 80%** without regressing CPT; both goals were achieved. The full metric breakdown is in **Evaluation** below.

---
## What V17 is for

V17 takes a radiology report and outputs the billing codes a human coder would assign:

```
Input:  Exam description, reason for exam, full report text, and
        (optionally) retrieval-augmented examples + candidate codes.
Output: CPT[, CPT2], MOD, ICD1, ICD2, ICD3, ...
        e.g. 93970, 26, M79.89, I83.93
```

It is designed to be the LLM core inside an **autonomous coding pipeline** with retrieval (RAG), post-processing rules, and audit feedback loops, not a standalone end-user model.

### Targets the model predicts

- **CPT-4** procedure codes (supports multi-code outputs)
- **Modifiers** (26 / TC / LT / RT / 50 / 59 / …)
- **ICD-10-CM** diagnosis codes (multi-label, ordered by clinical priority)

---
## Evaluation

Internal evaluation is performed against the live coder-reviewed Supabase dataset using a held-out validation split of **5,950 records**.

### Training-time eval_loss curve (held-out 250-sample slice)

| Epoch | Step | eval_loss |
|---|---|---|
| 0.03 | 100 | (≈ baseline) |
| 0.42 | 1,500 | 0.144 |
| 0.85 | 3,000 | 0.103 |
| 0.99 | 3,500 | 0.0875 |
| **1.02** | **3,600 ← best** | **0.0824** |
| 1.10 | 3,900 (stopped) | 0.0841 |

Early stopping triggered at step 3,900 (1.10 epochs) after eval_loss failed to improve on the step-3,600 value; `load_best_model_at_end=True` restored the step-3,600 checkpoint.
### Domain-specific accuracy

Measured on **n = 500** randomly sampled held-out radiology reports (greedy decoding, batch size 4, 4 × H200):

| Metric | V17 |
|---|---|
| **CPT exact match** | **90.60%** |
| Primary CPT match | 91.40% |
| **Modifier exact match** | **97.00%** |
| **ICD-10 exact match** (full set) | 69.60% |
| ICD-10 any-overlap | 90.40% |
| ICD-10 root-overlap (`A99.x`-level) | 92.20% |
| **Mean ICD recall** | **83.37%** |
| Mean ICD precision | 85.05% |
| All-three exact (CPT + MOD + full ICD set) | 64.00% |

V17's primary training objective, **raising mean ICD recall above 80%**, was met (83.37%), while CPT exact match (90.6%) and modifier exact match (97.0%) both came in above the V16 figures (~85% and ~95%) that served as the no-regression floor. The overlap metrics show that V17 identifies the correct *family* of ICD codes about 92% of the time (root-overlap 92.20%); most remaining errors are specificity refinements (e.g. predicting `M25.5` instead of `M25.511`) rather than wrong-diagnosis errors.

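
The card does not spell out how the set-based ICD metrics are computed. The sketch below shows one plausible reading, assuming a per-report set comparison in which recall is `|pred ∩ gold| / |gold|`, precision is `|pred ∩ gold| / |pred|`, and the "root" is the three-character category before the dot. The names `icd_root` and `icd_metrics` are hypothetical and are not taken from the NYXMed evaluation code. The example reproduces the specificity error described above: the right family, but one code at the wrong level of detail.

```python
def icd_root(code: str) -> str:
    """Return the three-character ICD-10 category, e.g. 'M25.511' -> 'M25'."""
    return code.split(".")[0][:3].upper()


def icd_metrics(pred: list[str], gold: list[str]) -> dict[str, float]:
    """Compare predicted and reference ICD-10 codes for a single report."""
    p = {c.strip().upper() for c in pred if c.strip()}
    g = {c.strip().upper() for c in gold if c.strip()}
    hits = p & g
    shared_roots = {icd_root(c) for c in p} & {icd_root(c) for c in g}
    return {
        "exact_match": float(p == g),
        "any_overlap": float(bool(hits)),
        "root_overlap": float(bool(shared_roots)),
        # Empty-set conventions below are an assumption of this sketch.
        "recall": len(hits) / len(g) if g else float(not p),
        "precision": len(hits) / len(p) if p else float(not g),
    }


# Predicted M25.5 where the coder assigned M25.511; the second code is correct.
m = icd_metrics(pred=["M25.5", "I83.93"], gold=["M25.511", "I83.93"])
print(m)
```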

---

## How to use

V17 is published as a **LoRA adapter**. You need the V16 base model alongside it.

### Option A: Transformers + PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "vineetdaniels/NYXMed-V16-Model"
ADAPTER = "vineetdaniels/NYXMed-V17-Model"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the V16 base in bfloat16, sharded across the available GPUs.
base = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)
# Attach the V17 LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": "You are an expert radiology coder specializing in ICD-10 and CPT coding for radiology reports.\n\nFollow the coding rules provided in each request carefully."},
    {"role": "user", "content": "<your prompt with few-shot examples + report>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# With device_map="auto", send inputs to the model's input device rather than a hard-coded "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### Option B: Merge & deploy with vLLM

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Loads on CPU by default, so the host needs enough RAM for the 70B bfloat16 weights.
base = AutoModelForCausalLM.from_pretrained("vineetdaniels/NYXMed-V16-Model", torch_dtype=torch.bfloat16)
# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = PeftModel.from_pretrained(base, "vineetdaniels/NYXMed-V17-Model").merge_and_unload()
merged.save_pretrained("./nyxmed-v17-merged")
# Save the tokenizer next to the merged weights so the directory is self-contained.
AutoTokenizer.from_pretrained("vineetdaniels/NYXMed-V17-Model").save_pretrained("./nyxmed-v17-merged")
```

Then serve with vLLM:

```bash
vllm serve ./nyxmed-v17-merged \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 4096
```
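
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the default address `http://localhost:8000` and a served model name equal to the path passed to `vllm serve`; both are assumptions about the deployment. `build_request` and `code_report` are hypothetical helper names, and the payload mirrors the greedy settings used in Option A.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are an expert radiology coder specializing in ICD-10 and CPT coding "
    "for radiology reports.\n\nFollow the coding rules provided in each request carefully."
)


def build_request(user_prompt: str, model: str = "./nyxmed-v17-merged") -> dict:
    """Build a chat-completions payload with greedy decoding."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.0,  # temperature 0 selects greedy decoding
        "max_tokens": 64,
    }


def code_report(user_prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the server and return the generated code line."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_request(user_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

Calling `code_report("<your prompt with few-shot examples + report>")` should return the single code line described under **Prompt format**, provided the server is reachable.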
### Generation settings (recommended)

| Param | Value |
|---|---|
| `do_sample` | `False` (greedy) |
| `max_new_tokens` | `64` |
| `temperature` | n/a (unused with greedy decoding) |
| `top_p` | n/a (unused with greedy decoding) |

Greedy decoding gives the most reproducible coding output. The model is robust enough that sampling rarely helps.

---
## Prompt format

V17 expects the Llama-3 chat template. The user message should contain (in order):

1. **Few-shot examples** retrieved by RAG (BM25 + FAISS + reranker)
2. **CPT candidate list** (top-K from RAG, ordered)
3. **ICD-10 candidate list** (top-K from RAG, ordered)
4. **Coding rules** (project-specific guardrails)
5. **The actual report**, in one of two formats (V17 was trained on both, ~70/30 split):
   - **Explicit**: separate `Exam Description:` and `Reason for Exam:` lines, then the body.
   - **Embedded**: report text only, with description/indication inline as in the source.
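
The ordering above can be sketched as a small assembly function. The section labels (`### Examples` and so on) and the name `build_user_message` are placeholders chosen for illustration; the card does not publish the exact template used in training, so a real deployment should match that template rather than this one.

```python
from typing import Optional, Sequence


def build_user_message(
    few_shot: Sequence[str],
    cpt_candidates: Sequence[str],
    icd_candidates: Sequence[str],
    rules: str,
    report: str,
    exam_description: Optional[str] = None,
    reason_for_exam: Optional[str] = None,
) -> str:
    """Assemble the user turn in the documented order (labels are hypothetical)."""
    parts = [
        "### Examples\n" + "\n\n".join(few_shot),
        "### CPT candidates\n" + "\n".join(cpt_candidates),
        "### ICD-10 candidates\n" + "\n".join(icd_candidates),
        "### Coding rules\n" + rules.strip(),
    ]
    if exam_description is not None and reason_for_exam is not None:
        # "Explicit" format: separate header lines, then the report body.
        report_block = (
            f"Exam Description: {exam_description}\n"
            f"Reason for Exam: {reason_for_exam}\n"
            f"{report.strip()}"
        )
    else:
        # "Embedded" format: report text only.
        report_block = report.strip()
    parts.append("### Report\n" + report_block)
    return "\n\n".join(parts)
```

The returned string would go into the `user` message before `apply_chat_template` is called.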

The expected assistant output is a single line:

```
<CPT>[ <CPT2> ...], <MOD>, <ICD1>, <ICD2>, ...
```

Empty modifier slot is allowed (e.g. `74176, , R10.84`).

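
A downstream pipeline has to split this line back into fields. The parser below is a sketch that classifies tokens by shape rather than by position, so it tolerates several CPT codes and an empty modifier slot. The regular expressions are simplified assumptions (five-character CPT codes, two-character modifiers, ICD-10-CM codes of the form letter, digit, alphanumeric, optional decimal part) and do not check that a code actually exists; `parse_output` is a hypothetical name.

```python
import re

CPT_RE = re.compile(r"^\d{4}[0-9A-Z]$")                      # e.g. 74176
ICD_RE = re.compile(r"^[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")  # e.g. R10.84
MOD_RE = re.compile(r"^[0-9A-Z]{2}$")                        # e.g. 26, TC, LT


def parse_output(line: str) -> dict[str, list[str]]:
    """Split a model output line into CPT, modifier, and ICD-10 lists."""
    result: dict[str, list[str]] = {"cpt": [], "mod": [], "icd": [], "unknown": []}
    # Commas separate fields; whitespace may separate several CPT codes.
    tokens = [t.upper() for part in line.split(",") for t in part.split()]
    for tok in tokens:
        if CPT_RE.match(tok):
            result["cpt"].append(tok)
        elif ICD_RE.match(tok):
            result["icd"].append(tok)
        elif MOD_RE.match(tok):
            result["mod"].append(tok)
        else:
            result["unknown"].append(tok)  # keep for review instead of dropping
    return result


print(parse_output("93970, 26, M79.89, I83.93"))
print(parse_output("74176, , R10.84"))
```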

---

## Training details

| Setting | Value |
|---|---|
| **Base model** | `vineetdaniels/NYXMed-V16-Model` (Llama-3-70B-Instruct fine-tune) |
| **Method** | LoRA with DeepSpeed ZeRO-3 |
| **LoRA rank (`r`)** | 64 |
| **LoRA alpha** | 128 |
| **LoRA dropout** | 0.05 |
| **Target modules** | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| **Trainable params** | ~417 M of ~70.9 B (0.59%) |
| **Train examples** | 113,032 |
| **Validation examples** | 250 (sampled from 5,950-record held-out pool) |
| **Sequence length** | 2,560 tokens |
| **Effective batch size** | 32 (per-device 1 × grad accum 8 × 4 GPUs) |
| **Optimizer** | AdamW (sharded with DeepSpeed ZeRO-3) |
| **Learning rate** | 1e-5 (cosine schedule, 3% warmup) |
| **Epochs** | 2 (early stopped at 1.10) |
| **Total steps** | 3,900 |
| **Best step** | 3,600 (loaded back via `load_best_model_at_end`) |
| **Attention impl.** | `sdpa` (PyTorch scaled dot-product attention, which can dispatch to a fused FlashAttention kernel) |
| **Precision** | bfloat16 |
| **Hardware** | 4 × NVIDIA H200 SXM (141 GB each) |
| **Wall-clock runtime** | 16.95 hours |
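
The batch-size and step figures in the table are mutually consistent, which a few lines of arithmetic confirm. The sketch uses only numbers from the table; the variable names are illustrative.

```python
per_device_batch = 1
grad_accum_steps = 8
num_gpus = 4
train_examples = 113_032

effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = train_examples / effective_batch


def epoch_at(step: int) -> float:
    """Convert an optimizer step count to a fractional epoch."""
    return step / steps_per_epoch


print(effective_batch)            # 32
print(round(steps_per_epoch))     # 3532
print(round(epoch_at(3_600), 2))  # 1.02 (best checkpoint)
print(round(epoch_at(3_900), 2))  # 1.1 (early stop)
```
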
### Data composition

| Source | Count | Notes |
|---|---|---|
| Supabase coder-reviewed cases | ~46,000 | Includes 30K+ new records collected after V16 |
| Specificity-correction pairs | ~5,000 | Unspecified → specific ICD upgrades, 3× weighted |
| Hard-case audit set | ~3,000 | Multi-code or modifier-heavy reports |
| V16-era retained set | ~59,000 | Filtered to exclude records V16 already trained on |

A 3-layer **self-leakage defense** (content hash + cosine similarity + metadata fingerprint) prevented any training record from retrieving itself as a few-shot example during prompt assembly. **108K candidate retrievals were blocked** by this filter during training-data preparation.

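
The three checks can be sketched as a single predicate applied to each retrieved candidate. Everything specific here is an assumption made for illustration: the record fields (`text`, `embedding`, `meta`), the metadata keys used for the fingerprint, and the 0.98 similarity threshold are not taken from the NYXMed pipeline.

```python
import hashlib
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    text: str
    embedding: list[float]
    meta: dict = field(default_factory=dict)


def content_hash(text: str) -> str:
    """Hash case- and whitespace-normalised text so trivial edits still match."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fingerprint(meta: dict) -> tuple:
    # Hypothetical metadata keys; use whatever identifies a source record.
    return tuple(meta.get(k) for k in ("accession", "exam_date", "site"))


def is_self_leak(query: Record, cand: Record, sim_threshold: float = 0.98) -> bool:
    """True if the candidate is, or nearly is, the record being prompted."""
    if content_hash(query.text) == content_hash(cand.text):
        return True  # layer 1: content hash
    if cosine(query.embedding, cand.embedding) >= sim_threshold:
        return True  # layer 2: cosine similarity
    fp = fingerprint(query.meta)  # layer 3: metadata fingerprint
    return any(v is not None for v in fp) and fp == fingerprint(cand.meta)


def filter_few_shot(query: Record, candidates: list[Record]) -> list[Record]:
    """Drop every retrieved candidate that would leak the query's own answer."""
    return [c for c in candidates if not is_self_leak(query, c)]
```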

---

## Intended use & limitations

### Intended use

- Augmenting human radiology coders in a **review-then-accept** workflow.
- Pre-coding reports for a downstream audit / verification pipeline.
- Research on LLM-based medical coding.

### Out of scope

- Direct billing without human review.
- Non-radiology specialties (cardiology, pathology, etc.). The training data is radiology-only.
- ICD-10 codes outside the radiology-relevant subset are under-represented.

### Known limitations

- **Long reports** (> 2,560 tokens) are truncated during inference; performance on extreme outliers may degrade.
- **Rare CPT/ICD combinations** appear infrequently in training and remain harder cases.
- The model is **English-only**.
- Outputs are deterministic with greedy decoding, but the model can still produce **hallucinated codes**, so production deployments must validate every predicted code against the current official ICD-10-CM and CPT code sets.
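
A minimal sketch of such a code-validity check follows. The tiny `VALID_CPT` and `VALID_ICD` sets stand in for tables loaded from the licensed CPT and ICD-10-CM releases; they and the function name `invalid_codes` are placeholders, not part of the model repository.

```python
# Placeholder allow-lists; in production, load the current official code tables.
VALID_CPT = {"93970", "74176"}
VALID_ICD = {"M79.89", "I83.93", "R10.84"}


def invalid_codes(cpt: list[str], icd: list[str]) -> list[str]:
    """Return every predicted code that is missing from the allow-lists."""
    bad = [c for c in cpt if c not in VALID_CPT]
    bad += [c for c in icd if c not in VALID_ICD]
    return bad


prediction = {"cpt": ["74176"], "icd": ["R10.84", "R10.999"]}
rejected = invalid_codes(prediction["cpt"], prediction["icd"])
if rejected:
    # Route the report to human review instead of accepting the output.
    print("needs review:", rejected)
```
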
### Bias & safety

This is a **clinical decision-support model**. It must not be used to make autonomous billing or treatment decisions without review by a credentialed coder or clinician. The training data is sourced from a single organization's coder-reviewed dataset and may carry institutional coding preferences.

---
## Recovery / Checkpoints

If a deployment ever needs to roll back, the following snapshots are available on the Hub:

| Checkpoint | Where | Notes |
|---|---|---|
| V17 final (best step 3,600) | this repo | `eval_loss = 0.0824` |
| V17 Epoch-1 (step 3,500) | [`vineetdaniels/NYXMed-V17-Epoch1`](https://huggingface.co/vineetdaniels/NYXMed-V17-Epoch1) | `eval_loss = 0.0875`, frozen for safety |
| V16 (base) | [`vineetdaniels/NYXMed-V16-Model`](https://huggingface.co/vineetdaniels/NYXMed-V16-Model) | Required to load this adapter |

Adapter weights (`adapter_model.safetensors`) are 1.66 GB. Full training history is available in `training_metrics.json` and TensorBoard logs in this repo under `logs/`.

---

## Acknowledgements

Built on Meta's [Llama-3](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) using Hugging Face's `transformers`, `peft`, and `accelerate` libraries together with Microsoft's DeepSpeed.