	---
	license: llama3
	language:
	- en
	library_name: peft
	pipeline_tag: text-generation
	base_model: vineetdaniels/NYXMed-V16-Model
	tags:
	- medical
	- radiology
	- medical-coding
	- icd-10
	- cpt
	- llama-3
	- llama-3-70b
	- lora
	- peft
	- healthcare
	- clinical
	---

	# NYXMed V17 — Radiology Medical Coding LLM

	<p align="center">
	<em>Llama-3-70B fine-tune for autonomous CPT, ICD-10, and modifier coding from radiology reports.</em>
	</p>

	<p align="center">
	<a href="https://huggingface.co/vineetdaniels/NYXMed-V16-Model">V16 (base)</a> ·
	<a href="https://huggingface.co/vineetdaniels/NYXMed-V17-Model">V17 (this model)</a> ·
	<a href="https://huggingface.co/vineetdaniels/NYXMed-V17-Epoch1">V17 Epoch-1 (frozen checkpoint)</a>
	</p>

	---

	## TL;DR

	V17 is a LoRA adapter trained on top of [`vineetdaniels/NYXMed-V16-Model`](https://huggingface.co/vineetdaniels/NYXMed-V16-Model) (a Llama-3-70B fine-tune). It was trained on 113,032 coder-reviewed radiology cases with a focus on raising ICD-10 accuracy without regressing CPT or modifier performance.

	| Metric | V16 (base) | V17 (this) | Δ |
	|---|---|---|---|
	| CPT exact match | ~85% | 90.6% | +5.6 pts |
	| Modifier exact match | ~95% | 97.0% | +2.0 pts |
	| Mean ICD recall | ~65% | 83.4% | +18.4 pts |
	| Final eval_loss | ~0.25 | 0.0824 | −67% |
	| Train examples | ~67K | 113,032 | +69% |
	| Adds Exam Description + Reason | ❌ | ✅ | n/a |

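	The training goal was to push mean ICD recall above 80% without regressing CPT or modifier accuracy, and both parts of that goal were met. The V16 figures are approximate. See the Evaluation section below for the full metric breakdown.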

	---

	## What V17 is for

	V17 takes a radiology report and outputs the billing codes a human coder would assign:

	```
	Input: Exam description, reason for exam, full report text, and
	(optionally) retrieval-augmented examples + candidate codes.

	Output: CPT[, CPT2], MOD, ICD1, ICD2, ICD3, ...
	e.g. 93970, 26, M79.89, I83.93
	```

	It is designed as the LLM core of an autonomous coding pipeline that adds retrieval (RAG), post-processing rules, and audit feedback loops; it is not a standalone end-user model.

	### Targets the model predicts

	- CPT-4 procedure codes (supports multi-code outputs)
	- Modifiers (26, TC, LT, RT, 50, 59, …)
	- ICD-10-CM diagnosis codes (multi-label, ordered by clinical priority)

	---

	## Evaluation

	Internal evaluation is performed against the live coder-reviewed Supabase dataset using a held-out validation split of 5,950 records.

	### Training-time eval_loss curve (held-out 250-sample slice)

	| Epoch | Step | eval_loss |
	|---|---|---|
	| 0.03 | 100 | (≈ baseline) |
	| 0.42 | 1,500 | 0.144 |
	| 0.85 | 3,000 | 0.103 |
	| 0.99 | 3,500 | 0.0875 |
	| 1.02 | 3,600 ← best | 0.0824 |
	| 1.10 | 3,900 (stopped) | 0.0841 |

	Early stopping triggered at step 3,900 (1.1 epochs); `load_best_model_at_end=True` reverted to the step-3,600 checkpoint.

	### Domain-specific accuracy

	Measured on n = 500 randomly sampled held-out radiology reports (greedy decoding, batch=4, 4×H200):

	| Metric | V17 |
	|---|---|
	| CPT exact match | 90.60% |
	| Primary CPT match | 91.40% |
	| Modifier exact match | 97.00% |
	| ICD-10 exact match (full set) | 69.60% |
	| ICD-10 any-overlap | 90.40% |
	| ICD-10 root-overlap (`A99.x`-level) | 92.20% |
	| Mean ICD recall | 83.37% |
	| Mean ICD precision | 85.05% |
	| All-three exact (CPT + MOD + full ICD set) | 64.00% |

	V17's primary training objective, raising mean ICD recall above 80%, was met (83.37%), and neither CPT (90.6%) nor modifier (97.0%) exact match regressed relative to the approximate V16 baselines (~85% and ~95%). The overlap metrics show that V17 identifies the correct ICD code family in about 92% of reports; most remaining errors are specificity misses (e.g. predicting `M25.5` instead of `M25.511`) rather than wrong diagnoses.
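The exact scoring script is not part of this card. As a rough sketch of how the ICD metrics above could be computed per report and then averaged, assuming codes are compared as unordered sets and that "root" means the three-character category before the dot (both are assumptions, as are the function names):

```python
from typing import Dict, List, Sequence


def icd_root(code: str) -> str:
    """Return the ICD-10 category, i.e. the part before the dot."""
    return code.split(".")[0].strip().upper()


def score_record(pred: Sequence[str], gold: Sequence[str]) -> Dict[str, float]:
    """Per-report ICD metrics; order is ignored, codes are compared as sets."""
    p, g = set(pred), set(gold)
    hits = len(p & g)
    shared_roots = {icd_root(c) for c in p} & {icd_root(c) for c in g}
    return {
        "exact": float(p == g),
        "any_overlap": float(hits > 0),
        "root_overlap": float(bool(shared_roots)),
        # Assumed conventions for empty sets: nothing to find counts as
        # full recall; no predictions are precise only if none were expected.
        "recall": hits / len(g) if g else 1.0,
        "precision": hits / len(p) if p else float(not g),
    }


def mean_scores(records: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all scored reports."""
    return {k: sum(r[k] for r in records) / len(records) for k in records[0]}


scores = score_record(pred=["M25.5", "R52"], gold=["M25.511", "R52"])
print(scores)
# {'exact': 0.0, 'any_overlap': 1.0, 'root_overlap': 1.0, 'recall': 0.5, 'precision': 0.5}
```

The example mirrors the specificity miss described above: `M25.5` fails the exact comparison against `M25.511` but still counts toward root-overlap.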

	---

	## How to use

	V17 is published as a LoRA adapter. You need the V16 base model alongside it.

	### Option A — Transformers + PEFT

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel
	import torch

	BASE = "vineetdaniels/NYXMed-V16-Model"
	ADAPTER = "vineetdaniels/NYXMed-V17-Model"

	# The tokenizer is loaded from the adapter repo; fall back to EOS for padding.
	tokenizer = AutoTokenizer.from_pretrained(ADAPTER, use_fast=True)
	if tokenizer.pad_token is None:
	    tokenizer.pad_token = tokenizer.eos_token

	# Load the V16 base in bf16, sharded across the available GPUs.
	base = AutoModelForCausalLM.from_pretrained(
	    BASE,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    attn_implementation="sdpa",
	)
	# Attach the V17 LoRA adapter on top of the base weights.
	model = PeftModel.from_pretrained(base, ADAPTER)
	model.eval()

	messages = [
	    {"role": "system", "content": "You are an expert radiology coder specializing in ICD-10 and CPT coding for radiology reports.\n\nFollow the coding rules provided in each request carefully."},
	    {"role": "user", "content": "<your prompt with few-shot examples + report>"},
	]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

	# Greedy decoding; print only the newly generated tokens.
	with torch.inference_mode():
	    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
	print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```

	### Option B — Merge & deploy with vLLM

	```python
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	base = AutoModelForCausalLM.from_pretrained("vineetdaniels/NYXMed-V16-Model", torch_dtype=torch.bfloat16)
	merged = PeftModel.from_pretrained(base, "vineetdaniels/NYXMed-V17-Model").merge_and_unload()
	merged.save_pretrained("./nyxmed-v17-merged")
	AutoTokenizer.from_pretrained("vineetdaniels/NYXMed-V17-Model").save_pretrained("./nyxmed-v17-merged")
	```

	Then serve with vLLM:

	```bash
	vllm serve ./nyxmed-v17-merged \
	    --dtype bfloat16 \
	    --tensor-parallel-size 4 \
	    --max-model-len 4096
	```

	### Generation settings (recommended)

	| Param | Value |
	|---|---|
	| `do_sample` | `False` (greedy) |
	| `max_new_tokens` | `64` |
	| `temperature` | n/a |
	| `top_p` | n/a |

	Greedy decoding gives the most reproducible coding output. The model is robust enough that sampling rarely helps.

	---

	## Prompt format

	V17 expects the Llama-3 chat template. The user message should contain (in order):

	1. Few-shot examples retrieved by RAG (BM25 + FAISS + reranker)
	2. CPT candidate list (top-K from RAG, ordered)
	3. ICD-10 candidate list (top-K from RAG, ordered)
	4. Coding rules (project-specific guardrails)
	5. The actual report, in one of two formats (V17 was trained on both, ~70/30 split):
	   - Explicit: separate `Exam Description:` and `Reason for Exam:` lines, then the body.
	   - Embedded: report text only, with description/indication inline as in the source.
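The ordering above can be sketched as a small assembly function. This is only an illustration: `build_user_message` and the section headers (`Examples:`, `CPT candidates:`, and so on) are hypothetical placeholders, not the labels used in V17's training prompts. Only the `Exam Description:` and `Reason for Exam:` line labels come from this card.

```python
from typing import List, Optional


def build_user_message(
    few_shot_examples: List[str],
    cpt_candidates: List[str],
    icd_candidates: List[str],
    coding_rules: str,
    report_text: str,
    exam_description: Optional[str] = None,
    reason_for_exam: Optional[str] = None,
) -> str:
    """Join the five prompt parts in the documented order."""
    if exam_description is not None and reason_for_exam is not None:
        # Explicit format: separate description and reason lines, then the body.
        report = (
            f"Exam Description: {exam_description}\n"
            f"Reason for Exam: {reason_for_exam}\n"
            f"{report_text}"
        )
    else:
        # Embedded format: report text only, as in the source.
        report = report_text

    # Section headers below are placeholders (assumption).
    sections = [
        "Examples:\n" + "\n\n".join(few_shot_examples),
        "CPT candidates:\n" + "\n".join(cpt_candidates),
        "ICD-10 candidates:\n" + "\n".join(icd_candidates),
        "Coding rules:\n" + coding_rules,
        "Report:\n" + report,
    ]
    return "\n\n".join(sections)


message = build_user_message(
    few_shot_examples=["<retrieved report 1>\n<its codes>"],
    cpt_candidates=["74176", "74177"],
    icd_candidates=["R10.84", "R10.9"],
    coding_rules="<project-specific rules>",
    report_text="<report body>",
    exam_description="CT abdomen and pelvis without contrast",
    reason_for_exam="Generalized abdominal pain",
)
print(message)
```

The returned string would go into the `user` message of the chat template; omitting `exam_description` and `reason_for_exam` yields the embedded variant.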

	The expected assistant output is a single line:

	```
	<CPT>[ <CPT2> ...], <MOD>, <ICD1>, <ICD2>, ...
	```

	Empty modifier slot is allowed (e.g. `74176, , R10.84`).
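A downstream pipeline needs to split this line back into fields. The following is a minimal parsing sketch, assuming the template above is followed literally: multiple CPT codes share the first comma-separated field and are separated by spaces, the second field is the (possibly empty) modifier, and every remaining field is an ICD-10 code. `CodedReport` and `parse_output` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodedReport:
    cpt: List[str]
    modifier: str  # empty string when the modifier slot is blank
    icd: List[str]


def parse_output(line: str) -> CodedReport:
    """Split '<CPT>[ <CPT2> ...], <MOD>, <ICD1>, ...' into its parts."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < 2 or not fields[0]:
        raise ValueError(f"Unexpected output format: {line!r}")
    return CodedReport(
        cpt=fields[0].split(),               # space-separated CPT codes
        modifier=fields[1],                  # may be ""
        icd=[f for f in fields[2:] if f],    # drop empty trailing fields
    )


print(parse_output("93970, 26, M79.89, I83.93"))
# CodedReport(cpt=['93970'], modifier='26', icd=['M79.89', 'I83.93'])
print(parse_output("74176, , R10.84"))
# CodedReport(cpt=['74176'], modifier='', icd=['R10.84'])
```

Raising on malformed lines, rather than guessing, keeps unparseable generations visible to the audit step.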

	---

	## Training details

	| Setting | Value |
	|---|---|
	| Base model | `vineetdaniels/NYXMed-V16-Model` (Llama-3-70B-Instruct fine-tune) |
	| Method | LoRA with DeepSpeed ZeRO-3 |
	| LoRA rank (`r`) | 64 |
	| LoRA alpha | 128 |
	| LoRA dropout | 0.05 |
	| Target modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
	| Trainable params | ~417 M of ~70.9 B (0.59%) |
	| Train examples | 113,032 |
	| Validation examples | 250 (sampled from 5,950-record held-out pool) |
	| Sequence length | 2,560 tokens |
	| Effective batch size | 32 (per-device 1 × grad accum 8 × 4 GPUs) |
	| Optimizer | AdamW (sharded with DeepSpeed ZeRO-3) |
	| Learning rate | 1e-5 (cosine schedule, 3% warmup) |
	| Epochs | 2 (early stopped at 1.10) |
	| Total steps | 3,900 |
	| Best step | 3,600 (loaded back via `load_best_model_at_end`) |
	| Attention impl. | `sdpa` (PyTorch scaled dot-product attention, which can dispatch to a FlashAttention kernel) |
	| Precision | bfloat16 |
	| Hardware | 4 × NVIDIA H200 SXM 141GB |
	| Wall-clock runtime | 16.95 hours |
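As a quick consistency check, the step counts and epoch fractions reported in this card follow from the batch settings in the table. The sketch below assumes every optimizer step consumes exactly one effective batch and that no examples are dropped:

```python
TRAIN_EXAMPLES = 113_032
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
NUM_GPUS = 4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
steps_per_epoch = TRAIN_EXAMPLES / effective_batch


def epoch_at(step: int) -> float:
    """Fraction of an epoch completed after `step` optimizer steps."""
    return step / steps_per_epoch


print(effective_batch)            # 32
print(steps_per_epoch)            # 3532.25
print(round(epoch_at(3_500), 2))  # 0.99
print(round(epoch_at(3_600), 2))  # 1.02 (best checkpoint)
print(round(epoch_at(3_900), 2))  # 1.1  (early stop)
```

So the best checkpoint at step 3,600 sits just past the end of the first pass over the training set.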

	### Data composition

	| Source | Count | Notes |
	|---|---|---|
	| Supabase coder-reviewed cases | ~46,000 | Includes 30K+ new records collected after V16 |
	| Specificity-correction pairs | ~5,000 | Unspecified → specific ICD upgrades, 3× weighted |
	| Hard-case audit set | ~3,000 | Multi-code or modifier-heavy reports |
	| V16-era retained set | ~59,000 | Filtered to exclude records V16 already trained on |

	A 3-layer self-leakage defense (content hash + cosine similarity + metadata fingerprint) prevented any training record from retrieving itself as a few-shot example during prompt assembly. 108K candidate retrievals were blocked by this filter during training-data preparation.
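The filter's implementation is not published here. A minimal sketch of a three-layer check of this kind might look like the following, where the record layout (`text`, `embedding`, `meta`), the metadata fields in `FINGERPRINT_KEYS`, and the 0.98 similarity threshold are all assumptions made for illustration:

```python
import hashlib
import math
from typing import Any, Dict, Sequence

# Hypothetical metadata fields; the real fingerprint fields are not documented.
FINGERPRINT_KEYS = ("accession", "exam_date", "site")


def content_hash(text: str) -> str:
    """SHA-256 of the lower-cased, whitespace-normalized report text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fingerprint(meta: Dict[str, Any]) -> tuple:
    return tuple(meta.get(k) for k in FINGERPRINT_KEYS)


def is_self_leak(query: Dict[str, Any], cand: Dict[str, Any],
                 sim_threshold: float = 0.98) -> bool:
    """True if `cand` looks like the same record as `query` on any layer."""
    if content_hash(query["text"]) == content_hash(cand["text"]):
        return True  # layer 1: identical content after normalization
    if cosine(query["embedding"], cand["embedding"]) >= sim_threshold:
        return True  # layer 2: near-duplicate embedding
    fp_q, fp_c = fingerprint(query["meta"]), fingerprint(cand["meta"])
    # layer 3: same metadata, ignoring incomplete fingerprints
    return None not in fp_q and fp_q == fp_c


def filter_retrievals(query: Dict[str, Any], candidates: list) -> list:
    """Keep only retrieved examples that are not the query record itself."""
    return [c for c in candidates if not is_self_leak(query, c)]
```

Checking the cheap hash first and the embedding comparison second keeps the common exact-duplicate case fast; any one layer firing is enough to block the candidate.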

	---

	## Intended use & limitations

	### Intended use
	- Augmenting human radiology coders in a review-then-accept workflow.
	- Pre-coding reports for a downstream audit / verification pipeline.
	- Research on LLM-based medical coding.

	### Out of scope
	- Direct billing without human review.
	- Non-radiology specialties (cardiology, pathology, etc.). The training data is radiology-only.
	- Coding diagnoses outside the radiology-relevant ICD-10 subset; such codes are under-represented in the training data.

	### Known limitations
	- Long reports (> 2,560 tokens) are truncated during inference; performance on extreme outliers may degrade.
	- Rare CPT/ICD combinations appear infrequently in training and remain harder cases.
	- The model is English-only.
	- Outputs are deterministic with greedy decoding, but the model can still produce hallucinated codes. Production deployments must check every predicted code against the current official code sets (AMA CPT and ICD-10-CM).
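A code-validity check can be as simple as a set lookup after parsing. The sketch below is illustrative only: `find_invalid_codes` is a hypothetical helper, the reference sets are tiny stand-ins for the licensed CPT file and the ICD-10-CM release a deployment would actually load, and `R10.999` is a made-up code used to show a rejection.

```python
from typing import Iterable, List, Set, Tuple


def find_invalid_codes(
    cpt: Iterable[str],
    modifier: str,
    icd: Iterable[str],
    valid_cpt: Set[str],
    valid_modifiers: Set[str],
    valid_icd: Set[str],
) -> List[Tuple[str, str]]:
    """Return (kind, code) pairs for every code missing from its reference set."""
    problems = [("CPT", c) for c in cpt if c not in valid_cpt]
    if modifier and modifier not in valid_modifiers:  # empty modifier is allowed
        problems.append(("MOD", modifier))
    problems += [("ICD", c) for c in icd if c not in valid_icd]
    return problems


# Tiny illustrative reference sets (assumption, not a complete code list).
VALID_CPT = {"74176", "93970"}
VALID_MODIFIERS = {"26", "TC", "LT", "RT", "50", "59"}
VALID_ICD = {"R10.84", "M79.89", "I83.93"}

problems = find_invalid_codes(
    cpt=["74176"], modifier="", icd=["R10.84", "R10.999"],
    valid_cpt=VALID_CPT, valid_modifiers=VALID_MODIFIERS, valid_icd=VALID_ICD,
)
print(problems)  # [('ICD', 'R10.999')]
```

A non-empty result would typically route the report to human review instead of silently dropping the offending code.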

	### Bias & safety
	This is a clinical decision-support model. It must not be used to make autonomous billing or treatment decisions without review by a credentialed coder or clinician. The training data is sourced from a single organization's coder-reviewed dataset and may carry institutional coding preferences.

	---

	## Recovery / Checkpoints

	If a deployment ever needs to roll back, the following snapshots are available on the Hub:

	| Checkpoint | Where | Notes |
	|---|---|---|
	| V17 final (best step 3,600) | this repo | `eval_loss = 0.0824` |
	| V17 Epoch-1 (step 3,500) | [`vineetdaniels/NYXMed-V17-Epoch1`](https://huggingface.co/vineetdaniels/NYXMed-V17-Epoch1) | `eval_loss = 0.0875`, frozen for safety |
	| V16 (base) | [`vineetdaniels/NYXMed-V16-Model`](https://huggingface.co/vineetdaniels/NYXMed-V16-Model) | Required to load this adapter |

	Adapter weights (`adapter_model.safetensors`) are 1.66 GB. Full training history is available in `training_metrics.json` and TensorBoard logs in this repo under `logs/`.

	---

	## Acknowledgements

	Built on Meta's [Llama-3](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) using Hugging Face's `transformers`, `peft`, and `accelerate` libraries, together with Microsoft's `deepspeed`.