Instructions for using tokyotech-llm/Medical-Qwen3-Swallow-32B with libraries, notebooks, and local apps.
- Libraries
- Transformers
How to use tokyotech-llm/Medical-Qwen3-Swallow-32B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tokyotech-llm/Medical-Qwen3-Swallow-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# AutoModelForCausalLM matches the "Causal language model" type listed under Model Details.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tokyotech-llm/Medical-Qwen3-Swallow-32B")
model = AutoModelForCausalLM.from_pretrained("tokyotech-llm/Medical-Qwen3-Swallow-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use tokyotech-llm/Medical-Qwen3-Swallow-32B with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tokyotech-llm/Medical-Qwen3-Swallow-32B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tokyotech-llm/Medical-Qwen3-Swallow-32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/tokyotech-llm/Medical-Qwen3-Swallow-32B
- SGLang
How to use tokyotech-llm/Medical-Qwen3-Swallow-32B with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tokyotech-llm/Medical-Qwen3-Swallow-32B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tokyotech-llm/Medical-Qwen3-Swallow-32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tokyotech-llm/Medical-Qwen3-Swallow-32B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tokyotech-llm/Medical-Qwen3-Swallow-32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use tokyotech-llm/Medical-Qwen3-Swallow-32B with Docker Model Runner:
docker model run hf.co/tokyotech-llm/Medical-Qwen3-Swallow-32B
Medical-Qwen3-Swallow-32B
Medical-Qwen3-Swallow-32B is a medical-domain language model based on tokyotech-llm/Qwen3-Swallow-32B-RL-v0.2. It is designed to support research and development toward safe and trustworthy AI for Japanese clinical settings.
The model follows the Qwen3-Swallow model family, which is a bilingual Japanese-English model family based on Qwen3 and developed through continual pre-training, supervised fine-tuning, and reinforcement learning with verifiable rewards.
Highlights
- Medical-domain adaptation of Qwen3-Swallow-32B-RL-v0.2
- Bilingual Japanese-English capability inherited from Qwen3-Swallow
- Evaluated on Japanese medical and healthcare-related benchmarks
- Intended for research use in medical AI safety and reliability evaluation
Model Details
- Model type: Causal language model
- Base model: tokyotech-llm/Qwen3-Swallow-32B-RL-v0.2
- Language(s): Japanese, English
- Tokenizer: Qwen3-Swallow tokenizer
- License: Apache License 2.0
Model Performance
The following results compare the base model and this medical-domain model on medical benchmarks. General benchmark results are intentionally omitted because this release focuses on medical-domain performance.
| Model | IgakuQA | JJSIMQA | JMMLU Medical | MMLU_Medical_JP | MedMCQA_JP | MedQA_JP | JUSMLEQA_JP | YakugakuQA |
|---|---|---|---|---|---|---|---|---|
| tokyotech-llm/Qwen3-Swallow-32B-RL-v0.2 | 0.763 | 0.736 | 0.763 | 0.789 | 0.589 | 0.621 | 0.674 | 0.676 |
| Medical-Qwen3-Swallow-32B | 0.812 | 0.809 | 0.799 | 0.817 | 0.649 | 0.684 | 0.719 | 0.735 |
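To make the comparison easier to read, the per-benchmark gains can be computed directly from the table. The snippet below is a small sketch that copies the scores from the table as-is; the helper names `score_deltas` and `macro_average` are illustrative and not part of any released evaluation code, and the unweighted average is only one possible way to summarize the results.

```python
# Scores copied from the table above (same column order).
BENCHMARKS = [
    "IgakuQA", "JJSIMQA", "JMMLU Medical", "MMLU_Medical_JP",
    "MedMCQA_JP", "MedQA_JP", "JUSMLEQA_JP", "YakugakuQA",
]
BASE = [0.763, 0.736, 0.763, 0.789, 0.589, 0.621, 0.674, 0.676]
MEDICAL = [0.812, 0.809, 0.799, 0.817, 0.649, 0.684, 0.719, 0.735]


def score_deltas(base, adapted):
    """Return the absolute gain per benchmark, rounded to three decimals."""
    return [round(a - b, 3) for b, a in zip(base, adapted)]


def macro_average(scores):
    """Return the unweighted mean over benchmarks, rounded to three decimals."""
    return round(sum(scores) / len(scores), 3)


DELTAS = dict(zip(BENCHMARKS, score_deltas(BASE, MEDICAL)))

for name, delta in DELTAS.items():
    print(f"{name:16s} {delta:+.3f}")
print(f"{'macro average':16s} {macro_average(BASE):.3f} -> {macro_average(MEDICAL):.3f}")
```

The medical-domain model scores higher on every benchmark in the table, and the script shows where the gain is largest and smallest.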
Usage
This model is expected to work with Hugging Face Transformers and vLLM-compatible inference stacks.
vLLM
vllm serve tokyotech-llm/Medical-Qwen3-Swallow-32B --reasoning-parser qwen3 --max-model-len 32768
Once the server is running, you can send requests using an OpenAI-compatible client.
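from openai import OpenAI

model_name = "tokyotech-llm/Medical-Qwen3-Swallow-32B"
# The local vLLM server does not check the API key, so a placeholder is enough.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
result = client.chat.completions.create(
    model=model_name,
    messages=[
        # Prompt in English: "Please explain, in Japanese, the points to keep in mind
        # when using generative AI in clinical settings."
        {"role": "user", "content": "日本語で、臨床現場における生成AI利用時の注意点を説明してください。"}
    ],
    max_tokens=2048,
    temperature=0.6,
    top_p=0.95,
    # top_k and min_p are not standard OpenAI parameters, so they go in extra_body.
    extra_body={
        "top_k": 20,
        "min_p": 0,
    },
)
print(result.choices[0].message.content)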
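With `--reasoning-parser qwen3`, vLLM is expected to separate the reasoning from the final answer on the server side. If the model is served without a reasoning parser, or run directly through Transformers, the raw output may contain the reasoning inline. The sketch below assumes the `<think>...</think>` tag format used by upstream Qwen3 models; whether this model emits the same tags has not been confirmed here, and `split_reasoning` is a hypothetical helper rather than part of any library.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>, as in upstream Qwen3 output.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    All <think> blocks are joined into the reasoning string; everything
    outside them is returned as the answer. Text without any <think>
    block comes back unchanged as the answer with empty reasoning.
    """
    reasoning = "\n".join(block.strip() for block in _THINK_BLOCK.findall(text))
    answer = _THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


raw = "<think>\nThe user asks for general precautions.\n</think>\n\nAlways have a clinician review the output."
reasoning, answer = split_reasoning(raw)
print(answer)
```

Keeping the reasoning separate makes it easier to log it for review while showing only the final answer to users.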
Best Practices
We recommend using the generation parameters specified in generation_config.json when available. For Qwen3-Swallow models, commonly used settings include temperature=0.6, top_p=0.95, top_k=20, and min_p=0.
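One way to apply this recommendation is to read the sampling keys from a local copy of `generation_config.json` and fall back to the values quoted above when the file or a key is missing. This is a minimal sketch: the helper names `load_sampling_params` and `to_openai_kwargs` are hypothetical, and the file path is an assumption about where the downloaded model files are stored.

```python
import json
from pathlib import Path

# Fallback values quoted in the text above.
DEFAULT_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0}


def load_sampling_params(config_path=None):
    """Return sampling parameters, preferring values from generation_config.json.

    Only the four keys in DEFAULT_SAMPLING are read from the file; any other
    keys in the config are ignored. A missing file leaves the defaults in place.
    """
    params = dict(DEFAULT_SAMPLING)
    if config_path is not None and Path(config_path).is_file():
        config = json.loads(Path(config_path).read_text(encoding="utf-8"))
        params.update({key: config[key] for key in DEFAULT_SAMPLING if key in config})
    return params


def to_openai_kwargs(params):
    """Arrange parameters for an OpenAI-compatible client talking to vLLM.

    temperature and top_p are standard arguments; top_k and min_p are passed
    through extra_body, as in the usage example in this card.
    """
    standard = {key: params[key] for key in ("temperature", "top_p")}
    extra = {key: params[key] for key in ("top_k", "min_p")}
    return {**standard, "extra_body": extra}


print(to_openai_kwargs(load_sampling_params()))
```

The resulting dictionary can be unpacked into `client.chat.completions.create(...)` alongside `model`, `messages`, and `max_tokens`.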
We also recommend specifying a maximum context length of 32,768 tokens or less for inference unless your serving stack has been validated with a longer context.
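Because the prompt and the generated tokens share the same context window, it can help to check the token budget before sending a request. The following sketch assumes the prompt length has already been counted with the model's tokenizer; `clamp_max_tokens` is a hypothetical helper, and the 32,768 limit is the value recommended above rather than a hard property of the model.

```python
MAX_CONTEXT_TOKENS = 32_768  # context length recommended in the text above


def clamp_max_tokens(prompt_tokens: int, requested: int,
                     context_limit: int = MAX_CONTEXT_TOKENS) -> int:
    """Return the largest generation budget <= requested that fits the context.

    Raises ValueError if the prompt alone leaves no room for generation.
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = context_limit - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"a prompt of {prompt_tokens} tokens leaves no room for generation "
            f"within the {context_limit}-token limit"
        )
    return min(requested, remaining)


# A short prompt keeps the requested budget; a long one is trimmed to fit.
print(clamp_max_tokens(prompt_tokens=1_000, requested=2_048))
print(clamp_max_tokens(prompt_tokens=32_000, requested=2_048))
```

The returned value can be passed as `max_tokens` so that a request does not exceed the `--max-model-len` configured on the server.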
Training Data
This model was adapted from Qwen3-Swallow-32B-RL-v0.2 using a mixture that emphasizes medical-domain text while retaining general-domain data. The medical-domain data includes resources such as biomedical literature, medical synthetic data, medical QA-style data, and clinical guideline-style text.
Risks and Limitations
This model is intended for research and development. It has not been validated as a medical device and must not be used as a substitute for professional medical judgment. Outputs may contain factual errors, unsafe recommendations, or unsupported clinical claims. Any clinical use requires careful human review, validation, and compliance with applicable laws, regulations, and institutional policies.
License
Apache License 2.0
How to Cite
If you find our work helpful, please feel free to cite these papers. The Qwen3-Swallow and GPT-OSS-Swallow Technical Paper (Training Details) will be released in March.
Continual Pre-Training
@inproceedings{
fujii2024continual,
title={Continual Pre-Training for Cross-Lingual {LLM} Adaptation: Enhancing Japanese Language Capabilities},
author={Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Hiroki Iida and Masanari Ohi and Kakeru Hattori and Hirai Shota and Sakae Mizuki and Rio Yokota and Naoaki Okazaki},
booktitle={First Conference on Language Modeling},
year={2024}
}
Supervised Fine-Tuning
@inproceedings{
ma2025building,
title={Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models},
author={Youmi Ma and Sakae Mizuki and Kazuki Fujii and Taishi Nakamura and Masanari Ohi and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Koki Maeda and Kakeru Hattori and Takumi Okamoto and Shigeki Ishida and Rio Yokota and Hiroya Takamura and Naoaki Okazaki},
booktitle={Second Conference on Language Modeling},
year={2025}
}
Acknowledgements
This work builds on Qwen3 and Qwen3-Swallow. We thank the Qwen team and the contributors to the Qwen3-Swallow project.
この成果は、国立研究開発法人新エネルギー・産業技術総合開発機構(NEDO)の助成事業(JPNP25006)の結果得られたものです。
This model is based on the results obtained from the project, JPNP25006, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).
Model tree for tokyotech-llm/Medical-Qwen3-Swallow-32B
Base model
Qwen/Qwen3-32B