LLM-SAST v1 — Qwen3.6-35B-A3B fine-tune (FP8 block)
A 35B-parameter mixture-of-experts model fine-tuned to perform static application security testing (SAST) directly on a single source file — no scanner in the loop, no retrieval, just (system-prompt, file) → JSON-array of findings. Released as block-wise FP8 (e4m3, [128, 128] weight blocks, dynamic per-group=128 e4m3 activations), the same scheme Qwen/Qwen3.6-35B-A3B-FP8 ships with — so it loads in vLLM exactly like the base FP8 release does.
One file in, JSON out. Replaces (does not audit) Checkov / Trivy / KICS / Semgrep / Bearer for the file types it covers.
| Base model | Qwen/Qwen3.6-35B-A3B |
| Fine-tuning data | aioutfitters/llm-sast-v1 v2.0 (15,822 ShareGPT examples, 18,320 findings, 81 categories) |
| Method | Supervised fine-tuning (4-bit QLoRA, r=128, α=128) → merge to BF16 → block-wise FP8 quantize |
| Quantization | compressed-tensors FP8 (e4m3) — weights: block [128, 128]; activations: dynamic per-group=128. Mirrors Qwen3.6-35B-A3B-FP8's scheme. |
| Disk size | 35 GB (vs ~66 GB BF16 merged; vs ~70 GB base FP8) |
| F1 on held-out test split (1,583 ex), this FP8 model | 63.0 % (recall 63.7 %, precision 62.3 %, valid-JSON 94.2 %) |
| F1 of the un-fine-tuned base FP8 | 5.0 % |
| F1 of the BF16 pre-quant intermediate | 63.5 % (FP8 → −0.5 pp, within typical FP8 noise) |
| Recommended serving | vllm ≥ 0.10 with TP=2, BF16 KV cache or FP8 KV cache |
1. Quick start
The model is a drop-in replacement for Qwen/Qwen3.6-35B-A3B-FP8 in any vLLM-based deployment. It uses the same chat template, same tokenizer, and a single fixed system prompt (shipped at system_prompt.txt).
1.1. Serve with vLLM
docker run -d --name llm-sast --gpus all \
  --ipc=host --shm-size=32g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model aioutfitters/llm-sast-v1-qwen3.6-35b-a3b-fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --language-model-only \
  --enable-prefix-caching \
  --served-model-name sast
Two RTX PRO 6000 (96 GB) or H100 (80 GB) GPUs comfortably hold the model with TP=2 + 8 K context. A single H100/H200 with TP=1 also works (set --gpu-memory-utilization 0.92).
1.2. Call it like any chat model
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Read the fixed system prompt that ships with this model
with open("system_prompt.txt") as f:
    SYSTEM_PROMPT = f.read()

# Read the file to scan
with open("infra/main.tf") as f:
    source_code = f.read()

# Prefix every source line with its 1-indexed line number
user_msg = (
    "FILE: infra/main.tf\n\n"
    + "\n".join(f"{i+1:>4}: {line}" for i, line in enumerate(source_code.splitlines()))
)

resp = client.chat.completions.create(
    model="sast",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_msg},
    ],
    temperature=0.0,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
# <think> ...analysis... </think>
# [{"lines":[12,13,14],"category":"missing_encryption_at_rest","severity":"HIGH",
#   "reasoning":"...","fix":"..."}]
1.3. Output contract
The assistant turn is always:
<think>
<step-by-step expert security reasoning>
</think>
[ <JSON array of findings> ]
Each finding object:
| field | type | description |
|---|---|---|
| lines | list[int] | 1-indexed inclusive line numbers in the source file |
| category | string | One of 81 closed-taxonomy categories (defined in system_prompt.txt) |
| severity | enum | CRITICAL / HIGH / MEDIUM / LOW |
| reasoning | string | 1–2 sentences specific to the actual code |
| fix | string | 1–2 sentences with a concrete remediation |
If the file is clean, the assistant emits [] after the <think> block.
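A minimal sketch of checking one parsed finding against the contract above. The function name `validate_finding`, the error-message wording, and the requirement that `lines` be non-empty are assumptions made for illustration. The 81-category check is optional here because the taxonomy lives in system_prompt.txt and is not reproduced in this card.

```python
from typing import Any, List, Optional, Set

SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}
REQUIRED_STR_FIELDS = ("category", "reasoning", "fix")


def validate_finding(
    finding: Any, n_lines: int, categories: Optional[Set[str]] = None
) -> List[str]:
    """Return a list of contract violations (an empty list means valid)."""
    if not isinstance(finding, dict):
        return ["finding is not a JSON object"]
    errors = []
    lines = finding.get("lines")
    # bool is a subclass of int in Python, so exclude it explicitly
    if (
        not isinstance(lines, list)
        or not lines
        or not all(isinstance(n, int) and not isinstance(n, bool) for n in lines)
    ):
        errors.append("lines must be a non-empty list of integers")
    elif any(n < 1 or n > n_lines for n in lines):
        errors.append("lines must be 1-indexed and within the file")
    for field in REQUIRED_STR_FIELDS:
        if not isinstance(finding.get(field), str) or not finding[field].strip():
            errors.append(f"{field} must be a non-empty string")
    severity = finding.get("severity")
    if not isinstance(severity, str) or severity not in SEVERITIES:
        errors.append("severity must be CRITICAL, HIGH, MEDIUM or LOW")
    category = finding.get("category")
    if categories is not None and isinstance(category, str) and category not in categories:
        errors.append("category is not in the supplied taxonomy")
    return errors
```

Pass the number of lines in the scanned file as `n_lines`, and a set of category names parsed from system_prompt.txt as `categories` if the closed taxonomy should be enforced.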
1.4. Parse the response robustly
import json, re

def parse_findings(text: str):
    # Strip <think>...</think> if present
    body = re.sub(r"<think>.*?</think>\s*", "", text, count=1, flags=re.DOTALL)
    # Pull the first JSON array
    m = re.search(r"\[\s*(?:\{.*?\}\s*,?\s*)*\]", body, flags=re.DOTALL)
    if not m:
        return []
    try:
        arr = json.loads(m.group(0))
    except json.JSONDecodeError:
        return []
    return [f for f in arr if isinstance(f, dict)]
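The helper above returns [] both for a clean file and for an unparseable response. Where the two cases need to be told apart (for example when measuring valid-JSON rate), a variant along the following lines may help. It is a sketch, not the parser used for the published numbers: the name `parse_findings_strict` is hypothetical, and it swaps the regex for the standard library's `json.JSONDecoder.raw_decode`, which decodes one JSON value starting at a given offset.

```python
import json
import re
from typing import Optional

_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def parse_findings_strict(text: str) -> Optional[list]:
    """Return the findings list, or None when no JSON array can be decoded."""
    body = _THINK_RE.sub("", text, count=1)
    decoder = json.JSONDecoder()
    start = body.find("[")
    while start != -1:
        try:
            # Decode a single JSON value beginning at this "[" and ignore any trailing text
            value, _end = decoder.raw_decode(body, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return [f for f in value if isinstance(f, dict)]
        start = body.find("[", start + 1)
    return None
```

With this variant, `[]` means "parsed, no findings" and `None` means "nothing parseable", so a truncated response is not silently counted as a clean file.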
2. Methodology
2.1. Base model
Started from Qwen/Qwen3.6-35B-A3B — a 35B-parameter MoE+linear-attention hybrid: 60 transformer blocks, 256 routed experts per MoE layer, hybrid Mamba-style linear attention on alternating layers, ~3B active parameters per token.
We chose the 35B-A3B because it gave the best base-model SAST instruction-following on our pilot in the 27B-to-72B range, and because its activated-parameter count (~3B) makes it cheap to serve at production concurrency.
2.2. Dataset
Trained on aioutfitters/llm-sast-v1 v2.0: 15,822 ShareGPT examples split into 12,657 train / 1,582 val / 1,583 test, disjoint by repository (3,055 distinct source repos).
| Pipeline stage | Engine | Output |
|---|---|---|
| Repo discovery | GitHub Search API + curated tier-1 vulnerable list | 3,055 repos retained |
| File extraction | Tier-aware sweep (tier1 vulnerable, tier2 production, tier3 community) | ~40 K candidate files |
| Best-of-5 labeling | Qwen3.6-27B-FP8 teacher, T-schedule [0.0, 0.3, 0.5, 0.7, 1.0], ±2-line dedup, majority vote | candidate findings |
| Single-call judge (v9b) | Same teacher, default-KEEP unless clearly wrong prompt + JSON-mode | confirmed findings |
| Reasoning synthesis | Same teacher, post-hoc | <think> block per example |
| Distribution gates | max_class_frac=0.30, max_cat_frac=0.25, max_empty_frac=0.30 | 15,822 final examples |
Each assistant turn passes the gates: 100 % valid JSON, 100 % <think> reasoning, 100 % closed-taxonomy compliance. The v7 portion was manually audited at 200-sample scale (93.3 % strict per-finding precision); the v8-sweep records were validated by the downstream training F1 (this model). See the dataset card for the full audit + provenance trail.
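The best-of-5 labeling stage can be pictured with the sketch below. The card only names the ingredients (five teacher samples, ±2-line dedup, majority vote); the clustering rule used here (same category, first cited line within `tolerance` of the cluster's first member) and the strict-majority threshold are assumptions, and `majority_vote` is a hypothetical name, not the dataset pipeline's code.

```python
from collections import defaultdict


def majority_vote(samples, tolerance=2, min_votes=None):
    """Keep findings that recur in a majority of labeling samples.

    samples: list of runs, each a list of {"lines": [...], "category": str} dicts.
    """
    if min_votes is None:
        min_votes = len(samples) // 2 + 1  # strict majority, e.g. 3 of 5
    # category -> list of [anchor_line, set_of_run_ids, representative_finding]
    clusters = defaultdict(list)
    for run_id, findings in enumerate(samples):
        for finding in findings:
            anchor = min(finding["lines"])
            for cluster in clusters[finding["category"]]:
                if abs(cluster[0] - anchor) <= tolerance:
                    cluster[1].add(run_id)  # a run votes at most once per cluster
                    break
            else:
                clusters[finding["category"]].append([anchor, {run_id}, finding])
    return [c[2] for group in clusters.values() for c in group if len(c[1]) >= min_votes]
```

For example, a `hardcoded_credential` finding cited at line 10 in two samples and at line 12 in a third would be kept as one finding, while a finding that appears in a single sample would be dropped.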
2.3. Fine-tuning configuration
| Method | Supervised fine-tuning (SFT) on assistant-only loss |
| Adapter | LoRA r=128, α=128, dropout=0, applied to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (target_parameters=[] — routed expert weights are NOT adapted) |
| Quantization at train time | 4-bit QLoRA via unsloth==2026.5.2 + bitsandbytes NF4 |
| Hardware | Single RTX PRO 6000 Blackwell (96 GB) on a Threadripper PRO 7975WX workstation |
| Optimizer | AdamW 8-bit (adamw_8bit), lr=2e-4, weight_decay=0.01 |
| Schedule | Cosine, warmup_ratio=0.01, 3 epochs |
| Effective batch | per_device_train_batch_size=1, grad_accum=8, bf16=True, seq_length=8192 |
| Eval cadence | Every half-epoch (eval_strategy=steps, eval_steps=395); prediction_loss_only=True, eval_accumulation_steps=1, bf16_full_eval=True, eval_on_start=True |
| Early stopping | EarlyStoppingCallback(early_stopping_patience=2) on eval_loss |
| Total training wall-clock | 11 h 5 min |
Eval-loss trajectory (lower = better):
| step | epoch | eval_loss |
|---|---|---|
| 395 | 0.50 | 0.4991 |
| 790 | 1.00 | 0.4673 |
| 1185 | 1.50 | 0.4556 |
| 1580 | 2.00 | 0.4432 |
| 1975 | 2.50 | 0.4439 |
| 2370 | 3.00 | 0.4421 ← best |
Final adapter was the step-2370 checkpoint. Total trainable parameters: ~ 256 M (0.7 % of base).
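A quick arithmetic check relating the batch settings to the step counts in the table. With the listed per-device batch size and gradient accumulation, one optimizer step consumes 8 sequences, whereas 790 steps per epoch over 12,657 training examples works out to about 16 examples per step. The difference could come from sequence packing or from a setting not listed here; that explanation is an assumption, not something the card states.

```python
import math

train_examples = 12_657          # train split size from section 2.2
per_device_batch, grad_accum = 1, 8
steps_per_epoch_observed = 790   # step count at epoch 1.00 in the eval-loss table

effective_batch = per_device_batch * grad_accum
steps_if_unpacked = math.ceil(train_examples / effective_batch)
examples_per_step_implied = train_examples / steps_per_epoch_observed

print(effective_batch)                      # 8
print(steps_if_unpacked)                    # 1583
print(round(examples_per_step_implied, 1))  # 16.0
```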
2.4. Merge to BF16
The trained LoRA adapter was merged into a BF16 copy of the base model with peft.PeftModel.merge_and_unload(). Output: 16-shard safetensors, ~66 GB. This is the artifact passed to the FP8 quantizer.
2.5. FP8 block-wise quantization
The merged BF16 model was then quantized using llmcompressor==0.10.1a20260407 + compressed-tensors==0.15.0.1 with the FP8_BLOCK recipe to mirror the Qwen3.6-35B-A3B-FP8 scheme exactly:
QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",  # weights e4m3 block [128, 128];
                         # activations e4m3 dynamic per-group=128
    ignore=[
        "lm_head",
        "re:.*mlp\\.gate$",                # MoE router (NOT gate_proj)
        "re:.*mlp\\.shared_expert_gate$",  # 1-dim shared-expert scalar gate
        "re:.*linear_attn\\..*",           # Mamba/linear-attention projections
                                           # (in_proj_a/b have output dim 32,
                                           # not divisible by 128 under TP=2)
        "re:.*visual\\..*",                # vision encoder (we serve language-only)
    ],
)
The recipe is data-free (no calibration set). Total quantization wall-clock: 57 s on 2× RTX PRO 6000.
The output uses quant_method: compressed-tensors (vLLM has stable native support since v0.5). The base Qwen3.6-35B-A3B-FP8 uses quant_method: fp8 — a different metadata wrapper, but the underlying weight format is identical: e4m3 with [128, 128] block scales.
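To make the [128, 128] block layout concrete, the sketch below computes how many scale entries a weight matrix gets and whether a layer's output dimension still splits into whole blocks after tensor-parallel sharding. The helper names and the 2048-wide example layer are hypothetical, and output-dimension (column-parallel) sharding is assumed; only the block size and the output dim of 32 come from the recipe comments above.

```python
import math

BLOCK = (128, 128)  # (out, in) weight block size used by FP8_BLOCK


def scale_grid(out_features: int, in_features: int, block=BLOCK):
    """Shape of the per-block scale tensor for a weight of shape (out, in)."""
    return (math.ceil(out_features / block[0]), math.ceil(in_features / block[1]))


def shards_align(out_features: int, tp: int, block_rows: int = BLOCK[0]) -> bool:
    """True when each tensor-parallel shard of the output dim holds whole blocks."""
    return out_features % tp == 0 and (out_features // tp) % block_rows == 0


print(scale_grid(2048, 2048))   # (16, 16)
print(shards_align(2048, tp=2)) # True
print(shards_align(32, tp=2))   # False
```

The last line mirrors the reason given for ignoring the linear-attention projections: an output dimension of 32 leaves 16 rows per shard at TP=2, which is less than one 128-row block.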
3. Evaluation
3.1. F1 on the v2.0 held-out test split (1,583 examples)
| metric | base FP8 (no fine-tune) | this model (FP8) | pre-quant BF16 (reference) |
|---|---|---|---|
| F1 | 5.0 % | 63.0 % | 63.5 % |
| Recall | 5.8 % | 63.7 % | 63.3 % |
| Precision | 4.4 % | 62.3 % | 63.8 % |
| Valid JSON | 70.1 % | 94.2 % | 93.3 % |
| Empty-file accuracy | — | 62.5 % | 65.1 % |
| Eval wall-clock (TP=2, concurrency=16) | — | 8.6 min | 10.5 min |
The FP8 quantization preserves accuracy within ~0.5 pp F1 (typical for FP8_BLOCK on a 35B model); recall and valid-JSON went up slightly under FP8, precision and empty-file accuracy went down slightly, and net F1 is essentially unchanged. Eval wall-clock is ~18 % shorter than the BF16 intermediate at the same concurrency (8.6 vs 10.5 min, i.e. ~22 % higher throughput).
Methodology: file-by-file, set-comparison F1 against the teacher labels in the test split. A predicted finding matches a teacher finding when category matches and there is ≥1 line of overlap in the cited line range. JSON-validity is computed as parse_findings(response) is not None, which presumes a parser that returns None on a parse failure; the helper in section 1.4 returns [] in that case.
Eval ran on vllm-minfix5 with --tensor-parallel-size 2 --max-model-len 8192 --gpu-memory-utilization 0.85 against the openai-compatible endpoint at concurrency 16.
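The matching rule described above can be sketched as follows. The card does not say whether matching is greedy or optimal, so greedy one-to-one matching is an assumption here, and the function names are hypothetical.

```python
def match_findings(teacher, predicted):
    """Greedy one-to-one matching: same category and at least one shared line."""
    unused = list(predicted)
    matched = 0
    for t in teacher:
        for p in unused:
            if p["category"] == t["category"] and set(p["lines"]) & set(t["lines"]):
                unused.remove(p)  # each prediction can match at most one teacher finding
                matched += 1
                break
    return matched


def precision_recall_f1(matched, n_teacher, n_predicted):
    """Precision, recall and F1 from matched / teacher / predicted counts."""
    precision = matched / n_predicted if n_predicted else 0.0
    recall = matched / n_teacher if n_teacher else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Summing `matched`, teacher, and predicted counts over all test files before calling `precision_recall_f1` gives a micro-averaged score; averaging per-file scores instead would weight small files more heavily.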
3.2. Per-classification F1 (this FP8 model)
| classification | F1 | recall | precision | files in test | findings (teacher / model / matched) |
|---|---|---|---|---|---|
| jenkinsfile | 76.5 % | 68.4 % | 86.7 % | 16 | 19 / 15 / 13 |
| kubernetes | 74.5 % | 74.1 % | 74.9 % | 134 | 201 / 199 / 149 |
| dockerfile | 71.0 % | 75.7 % | 66.8 % | 274 | 338 / 383 / 256 |
| gitlab_ci | 66.7 % | 78.6 % | 57.9 % | 16 | 14 / 19 / 11 |
| typescript | 66.7 % | 100.0 % | 50.0 % | 4 | 1 / 2 / 1 (very small N) |
| helm | 64.6 % | 70.7 % | 59.4 % | 42 | 58 / 69 / 41 |
| docker_compose | 62.4 % | 62.7 % | 62.0 % | 121 | 255 / 258 / 160 |
| python | 59.4 % | 65.5 % | 54.3 % | 49 | 29 / 35 / 19 |
| terraform | 59.3 % | 59.5 % | 59.1 % | 193 | 153 / 154 / 91 |
| github_actions | 56.9 % | 55.9 % | 57.9 % | 597 | 665 / 642 / 372 |
| cloudformation | 54.5 % | 50.0 % | 60.0 % | 34 | 54 / 45 / 27 |
| arm_template | 44.4 % | 53.3 % | 38.1 % | 11 | 15 / 21 / 8 |
| javascript | n/a (n=2, 0 findings) | — | — | 2 | 0 / 0 / 0 |
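As a consistency check, pooling the per-classification finding counts from the table above (micro-averaging) reproduces the headline recall, precision, and F1 reported in section 3.1. This is a sketch over the rows listed here only; it suggests, but does not prove, that the headline figures are micro-averaged.

```python
# (teacher, model, matched) finding counts per classification, copied from the table above
counts = {
    "jenkinsfile": (19, 15, 13),
    "kubernetes": (201, 199, 149),
    "dockerfile": (338, 383, 256),
    "gitlab_ci": (14, 19, 11),
    "typescript": (1, 2, 1),
    "helm": (58, 69, 41),
    "docker_compose": (255, 258, 160),
    "python": (29, 35, 19),
    "terraform": (153, 154, 91),
    "github_actions": (665, 642, 372),
    "cloudformation": (54, 45, 27),
    "arm_template": (15, 21, 8),
    "javascript": (0, 0, 0),
}

teacher = sum(t for t, _, _ in counts.values())
model = sum(m for _, m, _ in counts.values())
matched = sum(k for _, _, k in counts.values())

recall = matched / teacher
precision = matched / model
f1 = 2 * matched / (teacher + model)  # algebraically equal to 2PR / (P + R)

print(teacher, model, matched)                   # 1802 1842 1148
print(f"{recall:.1%} {precision:.1%} {f1:.1%}")  # 63.7% 62.3% 63.0%
```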
3.3. Top per-category recall (this FP8 model, top 12 by teacher-finding volume)
| category | recall | matched / teacher |
|---|---|---|
| container_running_as_root | 88.1 % | 171 / 194 |
| missing_security_context | 76.5 % | 88 / 115 |
| hardcoded_credential | 75.8 % | 69 / 91 |
| mutable_image_tag | 65.7 % | 249 / 379 |
| unpinned_action_version | 64.2 % | 212 / 330 |
| overly_permissive_ingress | 62.9 % | 22 / 35 |
| missing_resource_limits | 61.7 % | 37 / 60 |
| overly_permissive_workflow_permissions | 60.0 % | 75 / 125 |
| deprecated_resource_feature | 56.5 % | 35 / 62 |
| missing_encryption_in_transit | 52.9 % | 18 / 34 |
| default_service_account_usage | 40.9 % | 9 / 22 |
| public_exposure_unintended | 25.6 % | 11 / 43 ← weakest cell |
4. Limitations
This model inherits the limitations of the v2.0 dataset, which has a pronounced over-representation of CI/CD configuration files and a corresponding under-representation of application source code:
| group | % of v2.0 | distinct repos | notes |
|---|---|---|---|
| CI/CD workflows (github_actions 16.8 %, gitlab_ci 0.5 %, jenkinsfile 0.4 %) | 17.6 % | 1,564+ | Heavy on GitHub Actions because the IaC corpus crawl picked up .github/workflows/*.yml from every repo it cloned, regardless of the repo's primary purpose. |
| Container & deploy (dockerfile 12.0 %, helm 8.6 %, docker_compose 7.1 %) | 27.7 % | 751+ | Well-represented. |
| Core IaC (kubernetes 14.8 %, terraform 11.8 %, cloudformation 9.2 %, arm_template 1.8 %) | 37.7 % | 615+ | Well-represented. |
| Application code (python 6.7 %, typescript 4.1 %, javascript 3.0 %, java 2.2 %, go 0.9 %, rust 0.0 %) | 17.0 % | 5–146 per language | Under-represented: only 146 Python repos, 5 Java repos, 5 Go repos, 1 Rust repo, 0 Ruby repos. The model has learned IaC/container security much more thoroughly than application-code security. |
What this means in practice:
- Strong: Kubernetes / Dockerfile / Helm security review, with per-classification F1 of 65–75 %, above the 63.0 % aggregate for this FP8 model. Terraform / GitHub Actions / CloudFormation sit somewhat below the aggregate, at 54–59 %.
- Weak: Python / JavaScript / TypeScript / Java / Go security review — per-classification F1 in the 49–60 % range, against a small number of test files (4–48 per language). The model can find common application-code issues (hardcoded credentials, basic injection patterns, missing-auth-check) but is not a confident replacement for Semgrep/Bearer at this version.
- Untrained: Ruby (0 examples) — model output for Ruby files should be considered unreliable. Rust has 7 train examples and 1 source repo — same caveat.
In addition, the model inherits five further limitations from the dataset's labeling pipeline:
- File-level only. No multi-file context, no project-wide reasoning. Cross-file dataflow vulnerabilities are out of scope.
- English reasoning. All <think> blocks and JSON values are in English even when source code has non-English comments.
- Teacher-class circularity. Labels were produced and audited by Qwen3.6-class models. The 93.3 % strict precision in the v1.0 manual audit is an audit, not ground truth; treat the gold standard as approximate.
- Severity is calibrated, not exploit-tested. A CRITICAL finding is one the teacher reasoned has direct exploit potential; it has not been validated against a working PoC.
- No CWE field. The taxonomy maps approximately to CWE but the JSON schema does not currently emit a cwe field.
The dataset's published roadmap includes a v3.0 — application-code rebalance entry that explicitly targets these gaps; a future model release will be re-trained against that corpus.
5. Versioning
| Model release | 2026-05-09 |
| Dataset version | aioutfitters/llm-sast-v1 v2.0 (2026-05-09) |
| Base model | Qwen/Qwen3.6-35B-A3B (frozen — adapter merged + quantized; routed expert weights inherited unchanged from base) |
| System prompt SHA[:12] | 7de4fe802a11 (matches system_prompt.txt shipped here) |
| Compatible with | vllm ≥ 0.10 (block-FP8 was stabilized in vLLM 0.9), transformers ≥ 5.2 (qwen3_5_moe model class) |
6. License
Apache-2.0. Base model and dataset are both Apache-2.0; the LoRA delta + quantization is also released under Apache-2.0.
7. Citation
@misc{aioutfitters_llm_sast_qwen36_35b_fp8_2026,
title = {LLM-SAST v1 (Qwen3.6-35B-A3B FP8): an LLM-native static
application security testing model},
author = {AI Outfitters},
year = {2026},
url = {https://huggingface.co/aioutfitters/llm-sast-v1-qwen3.6-35b-a3b-fp8},
note = {Fine-tune of Qwen/Qwen3.6-35B-A3B on aioutfitters/llm-sast-v1 v2.0,
quantized to FP8 e4m3 block [128, 128].}
}