
gpt-oss-puzzle-88B-GGUF

EXPERIMENTAL - REQUIRES CUSTOM BRANCH

These GGUF files will NOT work with mainline llama.cpp. You must use the branch linked below.

GGUF quantisations of nvidia/gpt-oss-puzzle-88B, an 88B-parameter MoE model derived from gpt-oss-120B using NVIDIA's Puzzle NAS framework.

Required Branch

This model requires a custom llama.cpp branch with gpt-oss-puzzle architecture support:

https://github.com/smpurkis/llama.cpp/tree/gpt-oss-puzzle-support

Tracking issue: ggml-org/llama.cpp#21028
PR: ggml-org/llama.cpp#21032

This will not work on mainline llama.cpp until the architecture is merged upstream.

How to Use

# Clone the required branch
git clone --branch gpt-oss-puzzle-support https://github.com/smpurkis/llama.cpp.git
cd llama.cpp

# Build (example with Vulkan)
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j$(nproc)

# Run
./build/bin/llama-cli -m gpt-oss-puzzle-88B.MXFP4_MOE.gguf -ngl 99 -fa 1 -p "Hello"
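
The build step above calls `nproc`, which is not available on every platform (stock macOS, for example). As a minimal sketch, assuming a POSIX shell and that you are inside the cloned branch directory, the helpers below pick a job count portably and set the build type explicitly for single-config CMake generators. The function names `job_count` and `build_llama` are hypothetical and are not part of llama.cpp.

```shell
# Print a parallel job count: nproc (Linux), sysctl (macOS/BSD), else 4.
job_count() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif sysctl -n hw.ncpu >/dev/null 2>&1; then
    sysctl -n hw.ncpu
  else
    echo 4
  fi
}

# Hypothetical wrapper: configure and build the checkout in the current
# directory. Extra arguments are passed through to the configure step.
build_llama() {
  cmake -B build -DCMAKE_BUILD_TYPE=Release "$@" &&
    cmake --build build --config Release -j "$(job_count)"
}
```

For the Vulkan example this would be invoked as `build_llama -DGGML_VULKAN=1`; other backend flags can be passed the same way.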

Available Quantisations

| File | Quant | Size | Description |
|---|---|---|---|
| gpt-oss-puzzle-88B.f16.gguf | F16 | 47.0 GiB | Full precision (for requantisation) |
| gpt-oss-puzzle-88B.MXFP4_MOE.gguf | MXFP4_MOE | 44.8 GiB | Native MXFP4 expert weights (matches original model precision) |
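
Since each file is well over 40 GiB, it may be worth downloading only the quantisation you need. The sketch below assumes the `huggingface-cli` tool from the `huggingface_hub` Python package is installed and that the files live in the `SamPurkis/gpt-oss-puzzle-88B-GGUF` repository. The helper names `quant_file` and `fetch_quant` are hypothetical.

```shell
REPO="SamPurkis/gpt-oss-puzzle-88B-GGUF"

# Map a quant label from the table above to its file name.
quant_file() {
  case "$1" in
    F16|f16)   echo "gpt-oss-puzzle-88B.f16.gguf" ;;
    MXFP4_MOE) echo "gpt-oss-puzzle-88B.MXFP4_MOE.gguf" ;;
    *)         echo "unknown quant: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: download one quant into a directory (default: .).
fetch_quant() {
  file="$(quant_file "$1")" || return 1
  huggingface-cli download "$REPO" "$file" --local-dir "${2:-.}"
}
```

Running `fetch_quant MXFP4_MOE` would place the file in the current directory, where the `llama-cli -m` command in How to Use expects it.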

Architecture Differences from gpt-oss-120B

The puzzle model differs from the standard gpt-oss architecture in ways that require dedicated support:

| Property | gpt-oss-120B | gpt-oss-puzzle-88B |
|---|---|---|
| Expert count | 128 per layer (uniform) | 128 or 64 per layer (heterogeneous) |
| Attention pattern | Interleaved global/SWA (single window) | Global + multiple SWA window sizes (128, 8192) |
| Total parameters | ~117B | ~88B |

Credits
