---
license: apache-2.0
base_model: Qwen/Qwen3.5-35B-A3B
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3_5
- moe
- clinical
- ehr
- medical
- tool-use
- agentic-evidence-seeking
- clinseekagent
- clinseek-bench
- sft
model-index:
- name: ClinSeek-35B-A3B
  results:
  - task:
      type: text-generation
      name: Agentic EHR evidence seeking
    dataset:
      type: AgentEHR-Bench
      name: AgentEHR-Bench
    metrics:
    - type: f1
      name: Average F1
      value: 34.0
---

# ClinSeek-35B-A3B

ClinSeek-35B-A3B is the open-source model released with
[ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical
Reasoning](https://arxiv.org/abs/2605.20176). We trained it by supervised
fine-tuning `Qwen/Qwen3.5-35B-A3B` on ClinSeekAgent trajectories generated
by Claude Opus 4.6.

ClinSeekAgent studies a clinical reasoning setting where evidence is not handed
to the model in a pre-curated prompt. Instead, an agent must actively retrieve
patient-specific evidence from raw EHR tables, consult external medical
knowledge when needed, and synthesize the acquired evidence into a final
decision. ClinSeek-35B-A3B is trained to imitate this long-horizon evidence
seeking behavior in native tool-call format.

<p align="center">
  <img src="assets/performance.png" alt="ClinSeek-35B-A3B performance on AgentEHR-Bench" width="92%">
</p>

## Release Information

| Item | Value |
| --- | --- |
| Model | `ClinSeek-35B-A3B` |
| Base model | `Qwen/Qwen3.5-35B-A3B` |
| Training method | Supervised fine-tuning |
| Teacher model | Claude Opus 4.6 |
| Training signal | ClinSeekAgent evidence-seeking trajectories |
| Primary target setting | Agentic EHR evidence seeking |
| Technical report | https://arxiv.org/abs/2605.20176 |
| Code | https://github.com/UCSC-VLAA/ClinSeekAgent |
| Benchmark metadata | https://huggingface.co/datasets/UCSC-VLAA/ClinSeek-Bench |
| Project page | https://ucsc-vlaa.github.io/ClinSeekAgent/ |

## Training Data And Objective

ClinSeek-35B-A3B validates ClinSeekAgent as a training-time pipeline. Claude
Opus 4.6 is used as the teacher model to generate ClinSeekAgent trajectories
from the training split of the text-based benchmark. The student model is then
fine-tuned with supervised learning on the resulting trajectories.

The trajectories are rendered in native tool-call format with
`<tool_call>` / `<tool_response>` turns, so the model learns how to search the
EHR rather than only imitating final answers.
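
To make the turn structure concrete, the following is a minimal sketch of pulling tool calls out of one assistant turn. It assumes each call is wrapped in `<tool_call>` tags with a JSON payload; the exact payload layout is defined by the chat template and is not specified in this card. The tool name `run_sql` and the helper `extract_tool_calls` are hypothetical.

```python
import json
import re

# Assumption: calls are wrapped in <tool_call>...</tool_call>. The JSON payload
# layout below is illustrative and not confirmed by this model card.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(assistant_text):
    """Return the tool-call payloads found in one assistant turn."""
    calls = []
    for raw in TOOL_CALL_RE.findall(assistant_text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            # Keep the raw text when the payload is not valid JSON.
            calls.append({"raw": raw})
    return calls


# Hypothetical assistant turn; `run_sql` is a made-up tool name.
example_turn = (
    "I need the admission labs first.\n"
    "<tool_call>\n"
    '{"name": "run_sql", "arguments": {"query": "SELECT 1"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(example_turn)
print(calls)
```

For real evaluation, the parser shipped with the serving backend or the ClinSeekAgent repository should be preferred over a regex like this one.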

Training configuration:

| Component | Configuration |
| --- | --- |
| Base model | Qwen3.5-35B-A3B |
| Training objective | SFT on ClinSeekAgent trajectories |
| Training / validation size | 7,204 / 147 examples |
| Maximum sequence length | 52,000 tokens |
| Training epochs | 3 |
| Global batch size | 32 |
| Micro batch size | 1 per GPU |
| Optimizer | Megatron optimizer with CPU offload |
| Learning rate | 2e-5 |
| Minimum learning rate | 2e-6 |
| Learning rate schedule | Cosine decay with 10 warmup steps |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 |
| Backend | Megatron + mbridge |
| Hardware | 8 H200 GPUs |
| Tensor / expert / pipeline parallelism | TP=2, EP=8, PP=1 |
| Random seed | 42 |
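
As a rough illustration of what these settings imply, the sketch below derives the approximate number of optimizer steps and a learning-rate curve from the table values. It assumes the last partial batch of each epoch is dropped and that the schedule is a linear warmup followed by cosine decay to the minimum learning rate over the whole run; the actual Megatron schedule may differ in these details.

```python
import math

# Values taken from the training configuration table.
TRAIN_EXAMPLES = 7204
GLOBAL_BATCH = 32
EPOCHS = 3
WARMUP_STEPS = 10
PEAK_LR = 2e-5
MIN_LR = 2e-6

# Assumption: the final partial batch of each epoch is dropped.
steps_per_epoch = TRAIN_EXAMPLES // GLOBAL_BATCH
total_steps = steps_per_epoch * EPOCHS


def lr_at(step):
    """Assumed shape: linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


print(steps_per_epoch, total_steps)
for step in (0, WARMUP_STEPS, total_steps // 2, total_steps):
    print(step, lr_at(step))
```

Under these assumptions the run is short (a few hundred optimizer steps), so the 10 warmup steps cover only a small fraction of training.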

This release contains the model weights and tokenizer files. It does not
redistribute protected clinical source data, patient-level databases, private
trajectories, experiment logs, or raw MIMIC-derived records.

## Evaluation

We evaluate ClinSeek-35B-A3B on the five-task AgentEHR-Bench setting. The model
improves the Qwen3.5-35B-A3B base model from 22.1 to 34.0 average F1, a +11.9
point gain, and achieves the strongest open-source performance among the
evaluated models. The gain comes from four of the five tasks; Transfers is the
exception, dropping slightly from 18.1 to 16.7.

| Model | Diagnoses | Labs | Microbiology | Procedures | Transfers | Avg. |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Qwen3.5-35B-A3B (base) | 36.6 | 17.7 | 16.2 | 21.9 | 18.1 | 22.1 |
| ClinSeek-35B-A3B | 55.4 | 38.5 | 27.6 | 31.7 | 16.7 | 34.0 |
| Delta | +18.8 | +20.8 | +11.4 | +9.8 | -1.4 | +11.9 |
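
The averages and deltas in the table can be rechecked with a few lines of arithmetic. This sketch assumes the "Avg." column is the unweighted mean of the five per-task F1 scores, which is consistent with the reported numbers but is not stated explicitly.

```python
TASKS = ["Diagnoses", "Labs", "Microbiology", "Procedures", "Transfers"]
base = [36.6, 17.7, 16.2, 21.9, 18.1]  # Qwen3.5-35B-A3B (base) row
sft = [55.4, 38.5, 27.6, 31.7, 16.7]   # ClinSeek-35B-A3B row


def macro_avg(scores):
    """Unweighted mean over tasks, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


# Per-task change after SFT, rounded to one decimal.
deltas = {task: round(s - b, 1) for task, b, s in zip(TASKS, base, sft)}

print(macro_avg(base), macro_avg(sft))
print(deltas)
```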

Our analysis shows that the distilled model learns a different tool-use policy,
not just a different final-answer prior. On the same 500 AgentEHR-Bench
questions, its free-form SQL use increases from 649 calls in the base model to
3,932 calls after SFT, roughly a sixfold increase. This suggests that
ClinSeekAgent trajectories teach the student to treat the EHR as a programmable
database.

For full evaluation scripts and benchmark reconstruction instructions, see the
[ClinSeekAgent repository](https://github.com/UCSC-VLAA/ClinSeekAgent).

## Usage

Use the checkpoint with a recent `transformers` release that supports
Qwen3.5-MoE models. For the evaluation setting used in this work, serve the
model with an OpenAI-compatible backend such as vLLM and run the ClinSeekAgent
evaluation drivers.

Basic loading example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCSC-VLAA/ClinSeek-35B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": "You are a clinical evidence-seeking assistant.",
    },
    {
        "role": "user",
        "content": "Answer the clinical question using the available evidence.",
    },
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
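inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))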
```

For tool-using evaluation, use the ClinSeekAgent repository rather than a
single-turn text generation script. The repository provides the EHR MCP server,
tool schemas, prompts, and scoring code expected by this model.
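
For orientation only, here is a minimal sketch of what a single tool-enabled request to an OpenAI-compatible endpoint could look like, using only the Python standard library. The server URL, the `run_sql` tool, and its schema are assumptions made for illustration; the real tool schemas and prompts are the ones provided in the ClinSeekAgent repository.

```python
import json
import urllib.request

# Assumptions: an OpenAI-compatible server (for example vLLM) is already
# running at BASE_URL, and `run_sql` is a hypothetical tool, not the real schema.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "UCSC-VLAA/ClinSeek-35B-A3B"

HYPOTHETICAL_TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a read-only SQL query against the EHR tables.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]


def build_request(question):
    """Build a chat-completions POST request that advertises the tool."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {"role": "system",
             "content": "You are a clinical evidence-seeking assistant."},
            {"role": "user", "content": question},
        ],
        "tools": HYPOTHETICAL_TOOLS,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON reply (needs a live server)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_request("Answer the clinical question using the available evidence.")
print(request.full_url)
```

Calling `send(request)` only works once a server is up. A full agent loop would also execute each returned tool call and append the result as a tool message before the next request, which is what the repository's evaluation drivers handle.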

## Citation

Please cite our ClinSeekAgent technical report if you use this model:

```bibtex
@article{clinseekagent2026,
  title = {ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning},
  year = {2026},
  url = {https://arxiv.org/abs/2605.20176}
}
```

Also cite the upstream datasets, benchmarks, and base models used in your
experiments, including MIMIC, AgentEHR-Bench, and Qwen3.5-35B-A3B where
applicable.