---
license: apache-2.0
base_model: Qwen/Qwen3.5-35B-A3B
library_name: transformers
pipeline_tag: text-generation
tags:
- qwen3_5
- moe
- clinical
- ehr
- medical
- tool-use
- agentic-evidence-seeking
- clinseekagent
- clinseek-bench
- sft
model-index:
- name: ClinSeek-35B-A3B
  results:
  - task:
      type: text-generation
      name: Agentic EHR evidence seeking
    dataset:
      type: AgentEHR-Bench
      name: AgentEHR-Bench
    metrics:
    - type: f1
      name: Average F1
      value: 34.0
---

# ClinSeek-35B-A3B

ClinSeek-35B-A3B is our open-source model for [ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning](https://arxiv.org/abs/2605.20176). We trained it by supervised fine-tuning from `Qwen/Qwen3.5-35B-A3B` on ClinSeekAgent trajectories generated by Claude Opus 4.6.

ClinSeekAgent studies a clinical reasoning setting where evidence is not handed to the model in a pre-curated prompt. Instead, an agent must actively retrieve patient-specific evidence from raw EHR tables, consult external medical knowledge when needed, and synthesize the acquired evidence into a final decision. ClinSeek-35B-A3B is trained to imitate this long-horizon evidence-seeking behavior in native tool-call format.

Figure: ClinSeek-35B-A3B performance on AgentEHR-Bench.

## Release Information

| Item | Value |
| --- | --- |
| Model | `ClinSeek-35B-A3B` |
| Base model | `Qwen/Qwen3.5-35B-A3B` |
| Training method | Supervised fine-tuning |
| Teacher model | Claude Opus 4.6 |
| Training signal | ClinSeekAgent evidence-seeking trajectories |
| Primary target setting | Agentic EHR evidence seeking |
| Technical report | https://arxiv.org/abs/2605.20176 |
| Code | https://github.com/UCSC-VLAA/ClinSeekAgent |
| Benchmark metadata | https://huggingface.co/datasets/UCSC-VLAA/ClinSeek-Bench |
| Project page | https://ucsc-vlaa.github.io/ClinSeekAgent/ |

## Training Data And Objective

ClinSeek-35B-A3B validates ClinSeekAgent as a training-time pipeline. Claude Opus 4.6 serves as the teacher model and generates ClinSeekAgent trajectories from the training split of the text-based benchmark. The student model is then fine-tuned with supervised learning on those trajectories. The trajectories are rendered in native tool-call format, with tool-call and tool-response turns, so the model learns how to search the EHR rather than only imitating final answers.

Training configuration:

| Component | Configuration |
| --- | --- |
| Base model | Qwen3.5-35B-A3B |
| Training objective | SFT on ClinSeekAgent trajectories |
| Training / validation size | 7,204 / 147 examples |
| Maximum sequence length | 52,000 tokens |
| Training epochs | 3 |
| Global batch size | 32 |
| Micro batch size | 1 per GPU |
| Optimizer | Megatron optimizer with CPU offload |
| Learning rate | 2e-5 |
| Minimum learning rate | 2e-6 |
| Learning rate schedule | Cosine decay with 10 warmup steps |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 |
| Backend | Megatron + mbridge |
| Hardware | 8 H200 GPUs |
| Tensor / expert / pipeline parallelism | TP=2, EP=8, PP=1 |
| Random seed | 42 |

This release contains the model weights and tokenizer files.
It does not redistribute protected clinical source data, patient-level databases, private trajectories, experiment logs, or raw MIMIC-derived records.

## Evaluation

We evaluate ClinSeek-35B-A3B on the five-task AgentEHR-Bench setting. The model raises average F1 from 22.1 for the Qwen3.5-35B-A3B base model to 34.0, a gain of 11.9 points, and achieves the strongest open-source performance among the evaluated models. Four of the five tasks improve; Transfers drops by 1.4 points.

| Model | Diagnoses | Labs | Microbiology | Procedures | Transfers | Avg. |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Qwen3.5-35B-A3B (base) | 36.6 | 17.7 | 16.2 | 21.9 | 18.1 | 22.1 |
| ClinSeek-35B-A3B | 55.4 | 38.5 | 27.6 | 31.7 | 16.7 | 34.0 |
| Delta | +18.8 | +20.8 | +11.4 | +9.8 | -1.4 | +11.9 |

Our analysis shows that the distilled model learns a different tool-use policy, not just a different final-answer prior. On the same 500 AgentEHR-Bench questions, free-form SQL use rises from 649 calls with the base model to 3,932 calls after SFT, roughly a sixfold increase. This suggests that ClinSeekAgent trajectories teach the student to treat the EHR as a programmable database.

For full evaluation scripts and benchmark reconstruction instructions, see: https://github.com/UCSC-VLAA/ClinSeekAgent.

## Usage

Use the checkpoint with a recent `transformers` release that supports Qwen3.5-MoE models. For the evaluation setting used in this work, serve the model with an OpenAI-compatible backend such as vLLM and run the ClinSeekAgent evaluation drivers.
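To make the OpenAI-compatible serving path more concrete, the sketch below builds the JSON body of a chat-completions request that declares one tool. It is illustrative only: the `run_sql` tool, its schema, the example question, and the helper name `build_chat_request` are hypothetical, and the served model name is assumed to match the repository id. The actual tool schemas and prompts come from the ClinSeekAgent repository.

```python
import json


def build_chat_request(question, model="UCSC-VLAA/ClinSeek-35B-A3B"):
    """Build a chat-completions request body that declares one tool."""
    # Hypothetical tool schema for illustration only; the real schemas
    # ship with the ClinSeekAgent repository.
    run_sql_tool = {
        "type": "function",
        "function": {
            "name": "run_sql",
            "description": "Run a read-only SQL query against the EHR tables.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
    return {
        "model": model,  # assumed to equal the name the server was started with
        "messages": [
            {
                "role": "system",
                "content": "You are a clinical evidence-seeking assistant.",
            },
            {"role": "user", "content": question},
        ],
        "tools": [run_sql_tool],
        "max_tokens": 512,
    }


payload = build_chat_request("Which diagnoses apply to this admission?")
print(json.dumps(payload, indent=2))
```

A client would POST this body to the server's `/v1/chat/completions` route, execute any tool calls returned in the assistant message, and append the results as `tool` role messages before requesting the next turn.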
Basic loading example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCSC-VLAA/ClinSeek-35B-A3B"

# Load the tokenizer and the bfloat16 weights, sharded across available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": "You are a clinical evidence-seeking assistant.",
    },
    {
        "role": "user",
        "content": "Answer the clinical question using the available evidence.",
    },
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# The decoded text includes the prompt followed by the generated continuation.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For tool-using evaluation, use the ClinSeekAgent repository rather than a single-turn text generation script. The repository provides the EHR MCP server, tool schemas, prompts, and scoring code expected by this model.

## Citation

Please cite our ClinSeekAgent technical report if you use this model:

```bibtex
@article{clinseekagent2026,
  title = {ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning},
  year = {2026},
  url = {https://arxiv.org/abs/2605.20176}
}
```

Also cite the upstream datasets, benchmarks, and base models used in your experiments, including MIMIC, AgentEHR-Bench, and Qwen3.5-35B-A3B where applicable.