
🕉️ Hindi Medical Reasoning

Qwen2.5-3B fine-tuned for Hindi medical reasoning


Part of the Medical AI Fine-tuned Model Suite: 16 specialist models, one per task


TL;DR

Answers medical questions with detailed chain-of-thought clinical reasoning for Hindi-language and Indic medical AI.

INPUT:  A 45-year-old presents with fatigue, weight gain, cold intolerance and bradycardia for 6 months.
OUTPUT: Chain-of-thought reasoning through hypothyroidism differential leading to TSH and free T4 workup, then diagnosis and management plan.
Base model: unsloth/Qwen2.5-3B-Instruct
Method: QLoRA, 4-bit NF4, rank 16
Training data: hindi-medical-sft (19,704 real-world rows)
Training compute: NVIDIA RTX A6000 (48 GB), ~2.0 h
License: Apache 2.0

Architecture

                  +-------------------------+
  user prompt --> |  Qwen2.5-3B-Instruct    | --> base weights (frozen, 4-bit NF4)
                  |  + LoRA adapter (r=16)  | --> hindi-medical-qwen25-3b
                  +-------------------------+
                              |
                              v
                      structured output
                  (code / JSON / classification)

This repo contains only the LoRA adapter (~60 MB), not the full merged weights. Load it on top of the base model as shown in the Quickstart; this keeps the download small and lets you swap adapters on one base model in memory.
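The ~60 MB figure can be sanity-checked from the rank and target modules listed under Hyperparameters. The sketch below is a back-of-the-envelope estimate, not the author's calculation: the layer dimensions are assumptions taken from the public Qwen2.5-3B configuration (they are not stated in this card), and it assumes the adapter is stored in 16-bit precision.

```python
# Assumed Qwen2.5-3B dimensions (from the public model config, not from this card).
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
KV_DIM = 256  # assumed: 2 key/value heads x 128 head dim (grouped-query attention)

RANK = 16  # LoRA rank stated in this card

# (in_features, out_features) for each target module listed under Hyperparameters.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int = RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank) for every adapted matrix."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


params = lora_param_count()
size_mb = params * 2 / 1e6  # 2 bytes per parameter if stored in 16-bit
print(f"{params:,} adapter parameters, about {size_mb:.0f} MB")
```

Under these assumptions the estimate lands close to the adapter size quoted above, and at roughly 1% of a 3B-parameter base model.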


Intended use

Fine-tuned for Hindi-language medical Q&A and for clinical assistants compatible with ABDM (Ayushman Bharat Digital Mission).

Direct use

Describe symptoms, get back step-by-step clinical reasoning toward a likely diagnosis.

Downstream use

Power a multilingual symptom-checker chatbot or an Indic-language clinical education tool.

Out of scope

Definitive diagnosis or treatment without physician oversight. The training data is primarily English-sourced reasoning chains; verify Hindi-specific clinical terminology accuracy before production use.

This model is not a substitute for a certified medical professional's judgment. Output should be reviewed by a qualified person before being used in a clinical or billing decision.


Quickstart

Option A: Transformers + PEFT

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = "unsloth/Qwen2.5-3B-Instruct"
adapter    = "AmareshHebbar/hindi-medical-qwen25-3b"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)

# System prompt (Hindi): "You are a medical assistant. Based on the patient's
# symptoms, provide the ICD-10 code and a probable diagnosis."
messages = [
    {"role": "system", "content": "आप एक चिकित्सा सहायक हैं। रोगी के लक्षणों के आधार पर ICD-10 कोड और संभावित निदान प्रदान करें।"},
    {"role": "user", "content": "A 45-year-old presents with fatigue, weight gain, cold intolerance and bradycardia for 6 months."},
]
# return_dict=True also returns the attention mask, so generate() does not have to guess it.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Expected output (a summary of the response, not verbatim text):

Chain-of-thought reasoning through hypothyroidism differential leading to TSH and free T4 workup, then diagnosis and management plan.
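Every Quickstart option sends the same Hindi system prompt, so it can help to build the message list in one place. The helper below is a minimal sketch; `build_messages` and `SYSTEM_PROMPT` are hypothetical names introduced here for illustration and are not part of the model repo or any library.

```python
# The Hindi system prompt used in the Quickstart snippets. In English:
# "You are a medical assistant. Based on the patient's symptoms, provide the
# ICD-10 code and a probable diagnosis."
SYSTEM_PROMPT = (
    "आप एक चिकित्सा सहायक हैं। "
    "रोगी के लक्षणों के आधार पर ICD-10 कोड और संभावित निदान प्रदान करें।"
)


def build_messages(user_text: str) -> list[dict[str, str]]:
    """Return a chat-format message list with the fixed system prompt (hypothetical helper)."""
    text = user_text.strip()
    if not text:
        raise ValueError("user_text must not be empty")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]


messages = build_messages("A patient reports sudden chest pain radiating to the left arm.")
print([m["role"] for m in messages])
```

The returned list can be passed to `tokenizer.apply_chat_template` or to an OpenAI-compatible client in place of the inline `messages` literal.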

Option B: Unsloth (2x faster load + inference)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="AmareshHebbar/hindi-medical-qwen25-3b",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": "आप एक चिकित्सा सहायक हैं। रोगी के लक्षणों के आधार पर ICD-10 कोड और संभावित निदान प्रदान करें।"},
    {"role": "user", "content": "A patient reports sudden chest pain radiating to the left arm."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Option C: vLLM (production serving, OpenAI-compatible)

vllm serve unsloth/Qwen2.5-3B-Instruct \
    --enable-lora \
    --lora-modules hindi-medical-qwen25-3b=AmareshHebbar/hindi-medical-qwen25-3b \
    --host 0.0.0.0 --port 8000 --dtype bfloat16

Then query the server with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="hindi-medical-qwen25-3b",
    messages=[
        {"role": "system", "content": "आप एक चिकित्सा सहायक हैं। रोगी के लक्षणों के आधार पर ICD-10 कोड और संभावित निदान प्रदान करें।"},
        {"role": "user", "content": "A child presents with high fever and a rash for five days."},
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)
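If the `openai` package is not available, the same endpoint can be reached with the standard library alone. This is a rough sketch of the request an OpenAI-compatible client would send to `/v1/chat/completions`; `build_chat_request` is a hypothetical helper, and the `max_tokens` default is an illustrative value not taken from this card.

```python
import json
import urllib.request


def build_chat_request(
    messages,
    base_url="http://localhost:8000/v1",
    model="hindi-medical-qwen25-3b",  # the LoRA name registered via --lora-modules
    max_tokens=256,  # assumed value, chosen only for illustration
):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": 0.1,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    [{"role": "user", "content": "A child presents with high fever and a rash for five days."}]
)
print(req.get_method(), req.full_url)

# Sending it requires the vLLM server to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```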

Option D: GGUF / llama.cpp (CPU / edge inference)

This repo ships LoRA adapter weights, not a pre-merged GGUF. To run on llama.cpp, merge first:

pip install unsloth
python -c "
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained('AmareshHebbar/hindi-medical-qwen25-3b', load_in_4bit=False)
model.save_pretrained_gguf('hindi-medical-qwen25-3b-gguf', tokenizer, quantization_method='q4_k_m')
"

Training details

Data

Trained on 19,704 examples extracted from FreedomIntelligence/medical-o1-reasoning-SFT, a dataset of medical chain-of-thought reasoning pairs. No synthetic or LLM-generated training data: every example pairs real-world input with its authoritative output.

| Split | Rows |
| --- | --- |
| Train | 15,763 |
| Validation | 1,970 |
| Test | 1,971 |

Full extraction pipeline documented on the dataset card.
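As a quick consistency check, the split sizes above should add up to the 19,704 rows quoted elsewhere in this card. The snippet is plain arithmetic on the published numbers and makes no assumption about how the split was produced.

```python
# Split sizes as published in the table above.
splits = {"train": 15_763, "validation": 1_970, "test": 1_971}

total = sum(splits.values())
assert total == 19_704, "split sizes should sum to the dataset size quoted in the card"

# Show each split's share of the dataset.
for name, rows in splits.items():
    print(f"{name:<10} {rows:>6,}  {rows / total:.1%}")
```

The shares come out at roughly 80/10/10 for train, validation, and test.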

Hyperparameters

| Parameter | Value |
| --- | --- |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit NF4 (QLoRA) |
| Max sequence length | 512 |
| Optimizer | paged_adamw_8bit |
| Learning rate / schedule | 2e-4, cosine |
| Gradient checkpointing | Unsloth (smart offload) |
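To make the "2e-4, cosine" entry concrete, here is a small sketch of a cosine learning-rate curve with a peak of 2e-4. The card does not state the warmup length, the total number of steps, or a minimum learning rate, so `warmup_steps`, `total_steps`, and `min_lr` are placeholders; the function mirrors the common cosine-with-linear-warmup shape rather than the exact scheduler used for this run.

```python
import math

PEAK_LR = 2e-4  # learning rate from the hyperparameter table


def cosine_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=0, min_lr=0.0):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak during warmup.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative run length only; the real step count is not given in the card.
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```

With no warmup the rate starts at the peak, passes through half the peak at the midpoint, and reaches zero at the final step.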

Training compute

GPU: NVIDIA RTX A6000 (48 GB)
Cloud provider: RunPod
Training time: ~2.0 h (incl. eval + hub push)
Tracking: W&B run
CO2 estimate: self-reported, not measured with a carbon tracker; treat as approximate

Fine-tuned with Unsloth for 2x faster training and reduced VRAM, using TRL's SFTTrainer. Full project: wandb.ai/amareshhebbar-/axiomapper.


Bias, risks & limitations

Data recency. Training data reflects a specific snapshot in time (CMS FY2026 / dataset publish date). Codes, rates, and rules referenced may become outdated as source authorities issue updates, so always cross-check against the live authoritative source before high-stakes use.

Failure mode. Like any LLM, this model can produce a plausible-sounding but incorrect output, especially on rare, ambiguous, or highly compound real-world cases that fall outside the training distribution. It does not know when it's wrong.

Language. The Quickstart uses a Hindi system prompt, but the underlying clinical reasoning data is largely English-sourced, so output quality on Hindi-language input should be verified before relying on it.

Not a regulated medical device. This model has not been validated, cleared, or approved by any regulatory body (FDA, CDSCO, or equivalent) as a medical device or clinical decision support tool. It is a research/engineering artifact.

Misapplication risk. Do not use this model as the sole basis for a clinical, billing, or compliance decision affecting a real patient or claim. Do not deploy in an emergency triage context without a human-in-the-loop and clear escalation paths.


FAQ

Q: Can I merge the adapter into the base model for faster inference? Yes. Use model.merge_and_unload() after loading with PEFT, or use Unsloth's save_pretrained_merged() method.

Q: Why QLoRA instead of full fine-tuning? The base model already has strong language and medical knowledge from pretraining. QLoRA adapts only ~0.5-1% of parameters, which is enough to specialize the output format and domain without the cost or overfitting risk of full fine-tuning.

Q: Can I fine-tune this further on my own data? Yes, this adapter can be used as a starting checkpoint for continued fine-tuning. Note this may require merging first depending on your training framework.

Q: Why is the output format so strict? Each task was trained on a fixed system prompt and consistent output structure. Following the documented system prompt closely (see Quickstart above) gives the most reliable results; deviating from it may produce inconsistent formatting.

Q: Does this model store or transmit my input data? No. Like any open-weight model, all inference happens locally on your own infrastructure (or wherever you deploy it); nothing is sent back to the model author.
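The merge mentioned in the first FAQ answer can be illustrated with a toy example. Conceptually, merging a standard LoRA adapter folds the low-rank update into the base weight, W' = W + (alpha / r) * B A, so the merged layer gives the same output as base plus adapter. The sketch below uses tiny hand-written matrices of toy rank 2 to stay readable; only the scaling factor alpha / r = 32 / 16 comes from this card, and the code is an illustration of the arithmetic, not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


SCALE = 32 / 16  # LoRA alpha / rank from the hyperparameter table

W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # toy frozen base weight (out x in)
A = [[1, 0, 1], [0, 2, 0]]             # toy LoRA down-projection (r x in)
B = [[1, 0], [0, 1], [1, 1]]           # toy LoRA up-projection (out x r)
x = [[1], [2], [3]]                    # toy input column vector

# Adapter kept separate: y = W x + scale * B (A x)
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), SCALE)

# Adapter merged into the weight: W' = W + scale * (B A), then y = W' x
W_merged = add_scaled(W, matmul(B, A), SCALE)
y_merged = matmul(W_merged, x)

assert y_adapter == y_merged
print(y_merged)
```

Merging removes the extra matrix multiplications at inference time, at the cost of no longer being able to swap adapters on a shared base.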


Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| ValueError: padding_token not set | Base tokenizer has no pad token | Set tokenizer.pad_token = tokenizer.eos_token before inference |
| Garbled / repeated output | Wrong chat template applied | Make sure you use tokenizer.apply_chat_template, not a raw string prompt |
| CUDA OOM on load | Insufficient VRAM | Use load_in_4bit=True (as in Option B) or reduce max_seq_length |
| Adapter loads but ignores fine-tuning | Base model mismatch | Confirm you loaded the exact base listed above; adapters are not portable across different base models or quantizations |
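For the out-of-memory row, a rough estimate of weight memory shows why 4-bit loading helps. The parameter count below (about 3.09 billion for Qwen2.5-3B) is an assumption taken from the public base-model card, not from this document, and the estimate covers weights only: the KV cache, activations, quantization constants, and framework overhead come on top.

```python
PARAMS_BILLION = 3.09  # assumed parameter count of the Qwen2.5-3B base model

# Approximate storage cost per parameter for common load formats.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (1e9 params x bytes per param / 1e9)."""
    return params_billion * BYTES_PER_PARAM[dtype]


for dtype in ("bf16", "nf4"):
    print(f"{dtype}: ~{weight_memory_gb(PARAMS_BILLION, dtype):.1f} GB for weights alone")
```

Under these assumptions, bf16 weights need roughly four times the memory of 4-bit NF4 weights.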

Related models in this suite

| Model | Task | Size |
| --- | --- | --- |
| icd10-coder-qwen25-7b | ICD-10-CM medical coding | 7B |
| snomed-mapper-qwen25-7b | Clinical concept mapping | 7B |
| icd10-to-drg-qwen25-1b | ICD-10 to DRG reimbursement | 1.5B |
| pmjay-classifier-qwen25-3b | India PM-JAY classification | 3B |

Full suite overview: AmareshHebbar/medical-ai-model-suite


Changelog

| Version | Date | Notes |
| --- | --- | --- |
| v1.0 | 2026 | Initial release: QLoRA fine-tune on 19,704 real-world rows |

Citation

@misc{medicalai2026,
  author    = {Hebbar, Amaresh},
  title     = {Medical AI Fine-tuning Suite},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/AmareshHebbar}
}

Contact

GitHub LinkedIn Hugging Face
