---
license: apache-2.0
base_model:
- openai/gpt-oss-120b
---


# Model Overview

- **Model Architecture:** gpt-oss-120b
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built with gpt-oss-120b model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights are quantized MXFP4 and activations were quantized to FP8. 

**Quantization scripts:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*lm_head *self_attn* *router*"

python3 internal_scripts/quantize_quark.py \
    --model_dir openai/gpt-oss-120b \
    --quant_scheme w_mxfp4_a_fp8 \
    --exclude_layers $exclude_layers \
    --num_calib_data 512 \
    --output_dir amd/gpt-oss120b-w-mxfp4-a-fp8 \
    --model_export hf_format \
    --multi_gpu
```

# Deployment
### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

## Evaluation
The model was evaluated on AIME25 and GPQA Diamond benchmarks with `low` reasoning effort. 

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>gpt-oss-120b </strong>
   </td>
   <td><strong>gpt-oss120b-w-mxfp4-a-fp8(this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>AIME25 
   </td>
   <td>65.25
   </td>
   <td>67.12
   </td>
   <td>102.87%
   </td>
  </tr>
  <tr>
   <td>GPQA 
   </td>
   <td>51.67
   </td>
   <td>53.42
   </td>
   <td>103.39%
   </td>
  </tr>
</table>

### Reproduction

The results of AIME25 and GPQA Diamond were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with `low` effort setting, and vLLM docker `rocm/vllm-private:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_gptoss_wmxfp4_afp8_20251030`.

#### Launching server
```
vllm serve amd/gpt-oss120b-w-mxfp4-a-fp8 \
  --tensor_parallel_size 2 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 1024
```

#### Evaluating model in a new terminal
```
python -m gpt_oss.evals --model /shareddata/amd/gpt-oss120b-w-mxfp4-a-fp8 --eval aime25,gpqa --reasoning-effort low --n-threads 128
```

# License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.
Benchmark	gpt-oss-120b	gpt-oss120b-w-mxfp4-a-fp8(this model)	Recovery
AIME25	65.25	67.12	102.87%
GPQA	51.67	53.42	103.39%