---
license: apache-2.0
base_model: Qwen/Qwen3.5-35B-A3B
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3_5
- moe
- clinical
- ehr
- medical
- tool-use
- agentic-evidence-seeking
- clinseekagent
- clinseek-bench
- sft
model-index:
- name: ClinSeek-35B-A3B
  results:
  - task:
      type: text-generation
      name: Agentic EHR evidence seeking
    dataset:
      type: AgentEHR-Bench
      name: AgentEHR-Bench
    metrics:
    - type: f1
      name: Average F1
      value: 34.0
---
# ClinSeek-35B-A3B
ClinSeek-35B-A3B is our open-source model for
[ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical
Reasoning](https://arxiv.org/abs/2605.20176). We trained it by supervised
fine-tuning (SFT) of `Qwen/Qwen3.5-35B-A3B` on ClinSeekAgent trajectories
generated by Claude Opus 4.6.
ClinSeekAgent studies a clinical reasoning setting where evidence is not handed
to the model in a pre-curated prompt. Instead, an agent must actively retrieve
patient-specific evidence from raw EHR tables, consult external medical
knowledge when needed, and synthesize the acquired evidence into a final
decision. ClinSeek-35B-A3B is trained to imitate this long-horizon
evidence-seeking behavior in native tool-call format.
## Release Information
| Item | Value |
| --- | --- |
| Model | `ClinSeek-35B-A3B` |
| Base model | `Qwen/Qwen3.5-35B-A3B` |
| Training method | Supervised fine-tuning |
| Teacher model | Claude Opus 4.6 |
| Training signal | ClinSeekAgent evidence-seeking trajectories |
| Primary target setting | Agentic EHR evidence seeking |
| Technical report | https://arxiv.org/abs/2605.20176 |
| Code | https://github.com/UCSC-VLAA/ClinSeekAgent |
| Benchmark metadata | https://huggingface.co/datasets/UCSC-VLAA/ClinSeek-Bench |
| Project page | https://ucsc-vlaa.github.io/ClinSeekAgent/ |
## Training Data And Objective
ClinSeek-35B-A3B validates ClinSeekAgent as a training-time pipeline. Claude
Opus 4.6 is used as the teacher model to generate ClinSeekAgent trajectories
from the training split of the text-based benchmark. The student model is then
fine-tuned with supervised learning on the resulting trajectories.
The trajectories are rendered in native tool-call format, with alternating
tool-call and tool-response turns, which teaches the model how to search the
EHR rather than only imitate final answers.
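
To make the trajectory structure concrete, the sketch below represents one short trajectory as OpenAI-style chat messages, the kind of structure a chat template can render into native tool-call turns. The tool name `run_sql`, the query, and all message contents are hypothetical placeholders and are not taken from the ClinSeekAgent training data. The helper only checks that every tool call is answered by a matching tool response.

```python
import json

# Hypothetical trajectory: tool name, arguments, and contents are illustrative only.
trajectory = [
    {"role": "user", "content": "Which lab tests were abnormal for this patient?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "run_sql",  # assumed tool name
                    "arguments": json.dumps({"query": "SELECT 1"}),
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_0", "content": "[[1]]"},
    {"role": "assistant", "content": "Final answer based on the retrieved rows."},
]


def unanswered_tool_calls(messages):
    """Return the ids of tool calls that never receive a tool-role response."""
    pending = set()
    for message in messages:
        for call in message.get("tool_calls", []):
            pending.add(call["id"])
        if message["role"] == "tool":
            pending.discard(message["tool_call_id"])
    return sorted(pending)


print(unanswered_tool_calls(trajectory))
```

A check like this is one simple way to filter out truncated trajectories before training; whether the actual pipeline applies such a filter is not stated here.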
Training configuration:
| Component | Configuration |
| --- | --- |
| Base model | Qwen3.5-35B-A3B |
| Training objective | SFT on ClinSeekAgent trajectories |
| Training / validation size | 7,204 / 147 examples |
| Maximum sequence length | 52,000 tokens |
| Training epochs | 3 |
| Global batch size | 32 |
| Micro batch size | 1 per GPU |
| Optimizer | Megatron optimizer with CPU offload |
| Learning rate | 2e-5 |
| Minimum learning rate | 2e-6 |
| Learning rate schedule | Cosine decay with 10 warmup steps |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 |
| Backend | Megatron + mbridge |
| Hardware | 8 H200 GPUs |
| Tensor / expert / pipeline parallelism | TP=2, EP=8, PP=1 |
| Random seed | 42 |
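
The schedule and batch settings in the table can be sketched numerically. This is a minimal illustration under stated assumptions, not the Megatron scheduler itself: it assumes linear warmup starting from a fraction of the peak rate, cosine decay over all remaining steps, a step count of `ceil(7204 / 32)` per epoch, and a data-parallel size of GPUs / (TP × PP). The actual training code may handle partial batches and parallel groups differently.

```python
import math

PEAK_LR = 2e-5
MIN_LR = 2e-6
WARMUP_STEPS = 10
TRAIN_EXAMPLES = 7204
GLOBAL_BATCH = 32
EPOCHS = 3

# Assumption: the last partial batch of each epoch counts as a full step.
TOTAL_STEPS = math.ceil(TRAIN_EXAMPLES / GLOBAL_BATCH) * EPOCHS


def learning_rate(step):
    """LR at a 0-indexed optimizer step under linear warmup + cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


# Assumption: data-parallel size = GPUs / (TP * PP) for the dense layers.
GPUS, TP, PP, MICRO_BATCH = 8, 2, 1, 1
DATA_PARALLEL = GPUS // (TP * PP)
GRAD_ACCUM = GLOBAL_BATCH // (MICRO_BATCH * DATA_PARALLEL)

print("total steps:", TOTAL_STEPS)
print("data parallel:", DATA_PARALLEL, "grad accumulation:", GRAD_ACCUM)
print("lr at end of warmup:", learning_rate(WARMUP_STEPS))
print("lr at final step:", learning_rate(TOTAL_STEPS))
```

Under these assumptions the 10 warmup steps are a small fraction of the run, so the learning rate spends almost the whole run on the cosine segment between the peak and minimum values.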
This release contains the model weights and tokenizer files. It does not
redistribute protected clinical source data, patient-level databases, private
trajectories, experiment logs, or raw MIMIC-derived records.
## Evaluation
We evaluate ClinSeek-35B-A3B on the five-task AgentEHR-Bench setting. SFT
raises average F1 from 22.1 for the Qwen3.5-35B-A3B base model to 34.0, a gain
of 11.9 points, and gives the strongest open-source performance among the
evaluated models. Four of the five tasks improve; Transfers drops slightly,
from 18.1 to 16.7.
| Model | Diagnoses | Labs | Microbiology | Procedures | Transfers | Avg. |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Qwen3.5-35B-A3B (base) | 36.6 | 17.7 | 16.2 | 21.9 | 18.1 | 22.1 |
| ClinSeek-35B-A3B | 55.4 | 38.5 | 27.6 | 31.7 | 16.7 | 34.0 |
| Delta | +18.8 | +20.8 | +11.4 | +9.8 | -1.4 | +11.9 |
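
As a quick consistency check, the snippet below recomputes the Delta row and the Avg. column from the per-task scores in the table. It assumes that Avg. is the unweighted mean of the five task F1 scores, which matches the reported numbers but is not stated explicitly.

```python
TASKS = ["Diagnoses", "Labs", "Microbiology", "Procedures", "Transfers"]
base = [36.6, 17.7, 16.2, 21.9, 18.1]
clinseek = [55.4, 38.5, 27.6, 31.7, 16.7]


def mean(values):
    """Unweighted mean of a list of scores."""
    return sum(values) / len(values)


# Per-task change after SFT, rounded to one decimal like the table.
deltas = {task: round(c - b, 1) for task, b, c in zip(TASKS, base, clinseek)}
base_avg = round(mean(base), 1)
clinseek_avg = round(mean(clinseek), 1)
avg_gain = round(clinseek_avg - base_avg, 1)

print(deltas)
print(base_avg, clinseek_avg, avg_gain)
```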
Our analysis shows that the distilled model learns a different tool-use policy,
not just a different final-answer prior. On the same 500 AgentEHR-Bench
questions, free-form SQL use rises from 649 calls for the base model to 3,932
calls after SFT, roughly a sixfold increase. This suggests that ClinSeekAgent
trajectories teach the student to treat the EHR as a programmable database.
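
To illustrate what treating the EHR as a programmable database can look like, here is a toy sketch using Python's built-in `sqlite3` module. The `lab_results` table, its columns, and all rows are synthetic and hypothetical; they do not reflect the schema exposed by the ClinSeekAgent tools. The point is only that a single free-form query can express a patient-specific question that fixed lookup tools may not cover.

```python
import sqlite3

# Toy, fully synthetic table; the schema is a hypothetical stand-in for an EHR lab table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lab_results "
    "(patient_id INTEGER, test_name TEXT, value REAL, taken_at TEXT)"
)
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?, ?)",
    [
        (1, "creatinine", 0.9, "2100-01-01"),
        (1, "creatinine", 1.8, "2100-01-03"),
        (1, "potassium", 4.1, "2100-01-03"),
        (2, "creatinine", 1.0, "2100-01-02"),
    ],
)


def latest_labs(connection, patient_id):
    """Return the most recent value of each test for one patient."""
    query = """
        SELECT r.test_name, r.value, r.taken_at
        FROM lab_results AS r
        WHERE r.patient_id = ?
          AND r.taken_at = (
              SELECT MAX(taken_at)
              FROM lab_results
              WHERE patient_id = r.patient_id AND test_name = r.test_name
          )
        ORDER BY r.test_name
    """
    return connection.execute(query, (patient_id,)).fetchall()


print(latest_labs(conn, 1))
```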
For full evaluation scripts and benchmark reconstruction instructions, see the
[ClinSeekAgent repository](https://github.com/UCSC-VLAA/ClinSeekAgent).
## Usage
Use the checkpoint with a recent `transformers` release that supports
Qwen3.5-MoE models. For the evaluation setting used in this work, serve the
model with an OpenAI-compatible backend such as vLLM and run the ClinSeekAgent
evaluation drivers.
Basic loading example:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCSC-VLAA/ClinSeek-35B-A3B"

# Load the tokenizer and the bfloat16 weights, placing them on available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": "You are a clinical evidence-seeking assistant.",
    },
    {
        "role": "user",
        "content": "Answer the clinical question using the available evidence.",
    },
]

# Render the conversation into a prompt string with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
For tool-using evaluation, use the ClinSeekAgent repository rather than a
single-turn text generation script. The repository provides the EHR MCP server,
tool schemas, prompts, and scoring code expected by this model.
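
For orientation only, the sketch below builds a chat-completions request that exposes one tool to a model served behind an OpenAI-compatible endpoint. The `run_sql` tool and its schema are hypothetical; the real tool schemas and prompts come from the ClinSeekAgent repository. The base URL assumes a local server on port 8000, and the server must be started with tool calling enabled for the model to return structured tool calls. The `send` helper is defined but not called here, since it needs a running server.

```python
import json
import urllib.request

# Assumed values: a local OpenAI-compatible server on port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "UCSC-VLAA/ClinSeek-35B-A3B"

# Hypothetical tool schema; the real schemas ship with the ClinSeekAgent repository.
RUN_SQL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a read-only SQL query against the EHR database.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}


def build_request(question, max_tokens=512):
    """Build a chat-completions payload that exposes one tool to the model."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a clinical evidence-seeking assistant."},
            {"role": "user", "content": question},
        ],
        "tools": [RUN_SQL_TOOL],
        "max_tokens": max_tokens,
    }


def send(payload, base_url=BASE_URL):
    """POST the payload to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_request("Which diagnoses are supported by this admission's records?")
print(json.dumps(payload, indent=2))
```

A full agent loop would execute each returned tool call, append the result as a `tool` message, and resend the conversation until the model produces a final answer; the evaluation drivers in the repository handle that loop.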
## Citation
Please cite our ClinSeekAgent technical report if you use this model:
```bibtex
@article{clinseekagent2026,
title = {ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning},
year = {2026},
url = {https://arxiv.org/abs/2605.20176}
}
```
Also cite the upstream datasets, benchmarks, and base models used in your
experiments, including MIMIC, AgentEHR-Bench, and Qwen3.5-35B-A3B where
applicable.