LLM-SAST v1 — Qwen3.6-35B-A3B fine-tune (FP8 block)

A 35B-parameter mixture-of-experts model fine-tuned to perform static application security testing (SAST) directly on a single source file — no scanner in the loop, no retrieval, just (system-prompt, file) → JSON-array of findings. Released as block-wise FP8 (e4m3, [128, 128] weight blocks, dynamic per-group=128 e4m3 activations), the same scheme Qwen/Qwen3.6-35B-A3B-FP8 ships with — so it loads in vLLM exactly like the base FP8 release does.

One file in, JSON out. The model is intended to replace, not audit the output of, Checkov / Trivy / KICS / Semgrep / Bearer for the file types it covers.


Base model	`Qwen/Qwen3.6-35B-A3B`
Fine-tuning data	`aioutfitters/llm-sast-v1` v2.0 (15,822 ShareGPT examples, 18,320 findings, 81 categories)
Method	Supervised fine-tuning (4-bit QLoRA, r=128, α=128) → merge to BF16 → block-wise FP8 quantize
Quantization	`compressed-tensors` FP8 (e4m3) — weights: block `[128, 128]`; activations: dynamic per-group=128. Mirrors `Qwen3.6-35B-A3B-FP8`'s scheme.
Disk size	35 GB (vs ~66 GB BF16 merged; vs ~70 GB base BF16)
F1 on held-out test split (1,583 ex), this FP8 model	63.0 % (recall 63.7 %, precision 62.3 %, valid-JSON 94.2 %)
F1 of the un-fine-tuned base FP8	5.0 %
F1 of the BF16 pre-quant intermediate	63.5 % (FP8 → −0.5 pp, within typical FP8 noise)
Recommended serving	`vllm` ≥ 0.10 with TP=2, BF16 KV cache or FP8 KV cache

1. Quick start

The model is a drop-in replacement for Qwen/Qwen3.6-35B-A3B-FP8 in any vLLM-based deployment. It uses the same chat template and the same tokenizer, plus a single fixed system prompt (shipped as system_prompt.txt).

1.1. Serve with vLLM

docker run -d --name llm-sast --gpus all \
  --ipc=host --shm-size=32g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
    --model aioutfitters/llm-sast-v1-qwen3.6-35b-a3b-fp8 \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --language-model-only \
    --enable-prefix-caching \
    --served-model-name sast

Two RTX PRO 6000 (96 GB) or H100 (80 GB) GPUs comfortably hold the model with TP=2 + 8 K context. A single H100/H200 with TP=1 also works (set --gpu-memory-utilization 0.92).

1.2. Call it like any chat model

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Read the fixed system prompt that ships with this model
with open("system_prompt.txt") as f:
    SYSTEM_PROMPT = f.read()

# Read the file to scan; the context manager closes the handle promptly
with open("infra/main.tf") as f:
    source_code = f.read()
user_msg = (
    f"FILE: infra/main.tf\n\n"
    + "\n".join(f"{i+1:>4}: {line}" for i, line in enumerate(source_code.splitlines()))
)

resp = client.chat.completions.create(
    model="sast",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": user_msg},
    ],
    temperature=0.0,
    max_tokens=2048,
)

print(resp.choices[0].message.content)
# <think> ...analysis... </think>
# [{"lines":[12,13,14],"category":"missing_encryption_at_rest","severity":"HIGH",
#   "reasoning":"...","fix":"..."}]
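
The line-numbered user message built above can be wrapped in a small helper so that every file is formatted the same way. This is a minimal sketch: the function name `format_user_message` is an assumption of this example, while the `FILE:` header and the right-aligned `{n:>4}: ` prefix are copied from the snippet above.

```python
def format_user_message(path: str, source_code: str) -> str:
    """Build the user turn: a FILE header, a blank line, then numbered lines."""
    numbered = "\n".join(
        # Line numbers are 1-indexed and right-aligned to width 4, as in the snippet above.
        f"{i + 1:>4}: {line}"
        for i, line in enumerate(source_code.splitlines())
    )
    return f"FILE: {path}\n\n{numbered}"
```

Calling `format_user_message("infra/main.tf", source_code)` should produce the same string as the inline expression in 1.2.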

1.3. Output contract

The assistant turn is trained to follow this shape (94.2 % of responses on the test split parse as valid JSON):

<think>
<step-by-step expert security reasoning>
</think>
[ <JSON array of findings> ]

Each finding object:

field	type	description
`lines`	`list[int]`	1-indexed inclusive line numbers in the source file
`category`	`string`	One of 81 closed-taxonomy categories (defined in `system_prompt.txt`)
`severity`	`enum`	`CRITICAL` / `HIGH` / `MEDIUM` / `LOW`
`reasoning`	`string`	1–2 sentences specific to the actual code
`fix`	`string`	1–2 sentences with a concrete remediation

If the file is clean, the assistant emits [] after the <think> block.
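
Because a small share of responses do not follow the contract, it may be worth validating each finding before acting on it. The sketch below checks the five fields from the table above. The function name, the optional `n_lines` bound, and the optional `allowed_categories` set are assumptions of this example; the 81-category taxonomy itself lives in system_prompt.txt and is not reproduced here.

```python
from typing import List, Optional, Set

SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}


def finding_errors(
    finding: dict,
    n_lines: Optional[int] = None,
    allowed_categories: Optional[Set[str]] = None,
) -> List[str]:
    """Return contract violations; an empty list means the finding is well-formed."""
    errors = []

    lines = finding.get("lines")
    if not isinstance(lines, list) or not all(
        isinstance(n, int) and not isinstance(n, bool) for n in lines
    ):
        errors.append("lines must be a list of integers")
    elif any(n < 1 or (n_lines is not None and n > n_lines) for n in lines):
        errors.append("lines must be 1-indexed and inside the file")

    category = finding.get("category")
    if not isinstance(category, str):
        errors.append("category must be a string")
    elif allowed_categories is not None and category not in allowed_categories:
        errors.append(f"unknown category: {category}")

    severity = finding.get("severity")
    if not isinstance(severity, str) or severity not in SEVERITIES:
        errors.append("severity must be CRITICAL, HIGH, MEDIUM or LOW")

    for key in ("reasoning", "fix"):
        if not isinstance(finding.get(key), str):
            errors.append(f"{key} must be a string")
    return errors
```

Passing the length of the scanned file as `n_lines` catches findings that cite lines past the end of the file.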

1.4. Parse the response robustly

import json, re

def parse_findings(text: str):
    # Strip <think>...</think> if present
    body = re.sub(r"<think>.*?</think>\s*", "", text, count=1, flags=re.DOTALL)
    # Pull the first JSON array
    m = re.search(r"\[\s*(?:\{.*?\}\s*,?\s*)*\]", body, flags=re.DOTALL)
    if not m:
        return []
    try:
        arr = json.loads(m.group(0))
    except json.JSONDecodeError:
        return []
    return [f for f in arr if isinstance(f, dict)]
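
The helper above returns [] both for a clean file and for an unparseable response. Where that distinction matters, a stricter variant can return None on failure. The sketch below is one possible way to write it, not the parser shipped with the model: the name `parse_findings_strict` is hypothetical, and it uses the standard-library `json.JSONDecoder.raw_decode` instead of a regex, so braces inside string values (for example `${{ secrets.X }}` in a GitHub Actions finding) do not confuse the match.

```python
import json
import re
from typing import List, Optional

_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def parse_findings_strict(text: str) -> Optional[List[dict]]:
    """Return the first JSON array in the response, or None if none parses."""
    body = _THINK_BLOCK.sub("", text, count=1)
    decoder = json.JSONDecoder()
    pos = body.find("[")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting exactly at `pos`.
            value, _end = decoder.raw_decode(body, pos)
        except json.JSONDecodeError:
            pos = body.find("[", pos + 1)  # try the next candidate bracket
            continue
        return [f for f in value if isinstance(f, dict)]
    return None
```

If the `<think>` block is truncated and never closed, the reasoning text stays in `body`, so a bracketed list inside it could be picked up first; callers that care may want to reject responses without a closing `</think>`.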

2. Methodology

2.1. Base model

Started from Qwen/Qwen3.6-35B-A3B — a 35B-parameter MoE+linear-attention hybrid: 60 transformer blocks, 256 routed experts per MoE layer, hybrid Mamba-style linear attention on alternating layers, ~3B active parameters per token.

We chose the 35B-A3B because it showed the best base-model SAST instruction-following among the 27B-to-72B models in our pilot, and because its activated-parameter count (~3B) makes it cheap to serve at production concurrency.

2.2. Dataset

Trained on aioutfitters/llm-sast-v1 v2.0: 15,822 ShareGPT examples split into 12,657 train / 1,582 val / 1,583 test, with the splits disjoint by repository (3,055 distinct source repos).
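
A repository-disjoint split means every file from a given repo lands in the same split. The card does not say how the assignment was made, so the sketch below is only an illustration: `assign_split` hashes the repo name into roughly 80/10/10 buckets (an assumption inferred from the split sizes), and `leaking_repos` checks the disjointness property on any existing assignment. Both names are hypothetical.

```python
import hashlib
from collections import defaultdict


def assign_split(repo: str, val_frac: float = 0.10, test_frac: float = 0.10) -> str:
    """Map a repo name to a split deterministically, so all its files stay together."""
    digest = hashlib.sha256(repo.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # roughly uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def leaking_repos(examples):
    """examples: iterable of (repo, split) pairs. Returns repos seen in more than one split."""
    seen = defaultdict(set)
    for repo, split in examples:
        seen[repo].add(split)
    return {repo: sorted(splits) for repo, splits in seen.items() if len(splits) > 1}
```

An empty dict from `leaking_repos` is the condition the "disjoint by repository" claim describes.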

Pipeline stage	Engine	Output
Repo discovery	GitHub Search API + curated tier-1 vulnerable list	3,055 repos retained
File extraction	Tier-aware sweep (tier1 vulnerable, tier2 production, tier3 community)	~40 K candidate files
Best-of-5 labeling	Qwen3.6-27B-FP8 teacher, T-schedule `[0.0, 0.3, 0.5, 0.7, 1.0]`, ±2-line dedup, majority vote	candidate findings
Single-call judge (v9b)	Same teacher, `default-KEEP unless clearly wrong` prompt + JSON-mode	confirmed findings
Reasoning synthesis	Same teacher, post-hoc	`<think>` block per example
Distribution gates	`max_class_frac=0.30`, `max_cat_frac=0.25`, `max_empty_frac=0.30`	15,822 final examples
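
The best-of-5 labeling row combines a ±2-line dedup with a majority vote. The exact rule is not spelled out in this card, so the following is a sketch under stated assumptions: two findings are treated as the same when their categories match and any pair of cited lines is within 2 of each other, a finding is kept when at least 3 of the 5 samples contain it, and the first occurrence serves as the representative. `Finding`, `majority_vote`, and `min_votes` are names invented for this example.

```python
from typing import List, NamedTuple, Tuple


class Finding(NamedTuple):
    category: str
    lines: Tuple[int, ...]  # 1-indexed line numbers


def _same_finding(a: Finding, b: Finding, tol: int) -> bool:
    """Same category and at least one pair of cited lines within `tol` lines."""
    return a.category == b.category and any(
        abs(x - y) <= tol for x in a.lines for y in b.lines
    )


def majority_vote(samples: List[List[Finding]], tol: int = 2, min_votes: int = 3) -> List[Finding]:
    """Cluster near-duplicate findings across samples and keep the well-supported ones."""
    clusters = []  # each entry: {"rep": Finding, "voters": set of sample indices}
    for idx, sample in enumerate(samples):
        for finding in sample:
            for cluster in clusters:
                if _same_finding(cluster["rep"], finding, tol):
                    cluster["voters"].add(idx)  # a sample votes at most once per cluster
                    break
            else:
                clusters.append({"rep": finding, "voters": {idx}})
    return [c["rep"] for c in clusters if len(c["voters"]) >= min_votes]
```

Counting voters as a set of sample indices keeps one sample from outvoting the others by repeating the same finding.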

Each assistant turn passes the gates: 100 % valid JSON, 100 % <think> reasoning, 100 % closed-taxonomy compliance. The v7 portion was manually audited at 200-sample scale (93.3 % strict per-finding precision); the v8-sweep records were validated by the downstream training F1 (this model). See the dataset card for the full audit + provenance trail.

2.3. Fine-tuning configuration


Method	Supervised fine-tuning (SFT) on assistant-only loss
Adapter	LoRA r=128, α=128, dropout=0, applied to `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` (target_parameters=[] — routed expert weights are NOT adapted)
Quantization at train time	4-bit QLoRA via `unsloth==2026.5.2` + `bitsandbytes` NF4
Hardware	Single RTX PRO 6000 Blackwell (96 GB) on a Threadripper PRO 7975WX workstation
Optimizer	AdamW 8-bit (`adamw_8bit`), `lr=2e-4`, `weight_decay=0.01`
Schedule	Cosine, `warmup_ratio=0.01`, 3 epochs
Effective batch	`per_device_train_batch_size=1`, `grad_accum=8`, `bf16=True`, `seq_length=8192`
Eval cadence	Every half-epoch (`eval_strategy=steps`, `eval_steps=395`); `prediction_loss_only=True`, `eval_accumulation_steps=1`, `bf16_full_eval=True`, `eval_on_start=True`
Early stopping	`EarlyStoppingCallback(patience=2)` on `eval_loss`
Total training wall-clock	11 h 5 min

Eval-loss trajectory (lower = better):

step	epoch	eval_loss
395	0.50	0.4991
790	1.00	0.4673
1185	1.50	0.4556
1580	2.00	0.4432
1975	2.50	0.4439
2370	3.00	0.4421 ← best

Final adapter was the step-2370 checkpoint. Total trainable parameters: ~ 256 M (0.7 % of base).
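
To see why early stopping never fired, the trajectory above can be replayed through a simplified patience counter. This is a sketch of the general idea, not the `transformers` implementation: it treats any strictly lower eval loss as an improvement and ignores the step-0 evaluation that `eval_on_start=True` would add.

```python
# (step, eval_loss) pairs copied from the table above
EVAL_LOSS = [
    (395, 0.4991),
    (790, 0.4673),
    (1185, 0.4556),
    (1580, 0.4432),
    (1975, 0.4439),
    (2370, 0.4421),
]


def replay_early_stopping(history, patience=2):
    """Return (best_step, stop_step); stop_step is None if patience never ran out."""
    best_step, best_loss, bad_evals = None, float("inf"), 0
    for step, loss in history:
        if loss < best_loss:
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, None


print(replay_early_stopping(EVAL_LOSS))  # (2370, None)
```

The only non-improving evaluation is step 1975 (0.4439 vs 0.4432), and step 2370 resets the counter, so training runs the full 3 epochs and the last checkpoint is also the best one.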

2.4. Merge to BF16

The trained LoRA adapter was merged into a BF16 copy of the base model with peft.PeftModel.merge_and_unload(). Output: 16-shard safetensors, ~66 GB. This is the artifact passed to the FP8 quantizer.

2.5. FP8 block-wise quantization

The merged BF16 model was then quantized using llmcompressor==0.10.1a20260407 + compressed-tensors==0.15.0.1 with the FP8_BLOCK recipe to mirror the Qwen3.6-35B-A3B-FP8 scheme exactly:

QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",   # weights e4m3 block [128, 128];
                          # activations e4m3 dynamic per-group=128
    ignore=[
        "lm_head",
        "re:.*mlp\\.gate$",                # MoE router (NOT gate_proj)
        "re:.*mlp\\.shared_expert_gate$",   # 1-dim shared-expert scalar gate
        "re:.*linear_attn\\..*",            # Mamba/linear-attention projections
                                            #   (in_proj_a/b have output dim 32,
                                            #   not divisible by 128 under TP=2)
        "re:.*visual\\..*",                 # vision encoder (we serve language-only)
    ],
)

The recipe is data-free (no calibration set). Total quantization wall-clock: 57 s on 2× RTX PRO 6000.

The output uses quant_method: compressed-tensors (vLLM has had stable native support since v0.5). The base Qwen3.6-35B-A3B-FP8 uses quant_method: fp8, which is a different metadata wrapper, but the underlying weight format is identical: e4m3 with [128, 128] block scales.
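
To make the "[128, 128] block scales" layout concrete, the sketch below computes one scale per 128×128 tile of a weight matrix. It is a pure-Python illustration, not the llmcompressor code path: `block_scales` is a hypothetical name, the scale is assumed to be the tile's absolute maximum divided by 448 (the largest finite e4m3 value), and no actual rounding to FP8 is performed.

```python
E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def block_scales(weight, block=(128, 128)):
    """Return a 2-D grid of per-tile scales for a weight given as a list of rows."""
    rows, cols = len(weight), len(weight[0])
    block_rows, block_cols = block
    grid = []
    for r0 in range(0, rows, block_rows):
        grid_row = []
        for c0 in range(0, cols, block_cols):
            # Absolute maximum over one tile; edge tiles may be smaller than the block.
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block_rows, rows))
                for c in range(c0, min(c0 + block_cols, cols))
            )
            # Dividing the tile by this scale maps its largest value onto E4M3_MAX.
            grid_row.append(amax / E4M3_MAX if amax > 0 else 1.0)
        grid.append(grid_row)
    return grid
```

A 256×256 weight yields a 2×2 grid of scales, so a single outlier only degrades the precision of its own tile rather than of the whole matrix.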

3. Evaluation

3.1. F1 on the v2.0 held-out test split (1,583 examples)

metric	base FP8 (no fine-tune)	this model (FP8)	pre-quant BF16 (reference)
F1	5.0 %	63.0 %	63.5 %
Recall	5.8 %	63.7 %	63.3 %
Precision	4.4 %	62.3 %	63.8 %
Valid JSON	70.1 %	94.2 %	93.3 %
Empty-file accuracy	—	62.5 %	65.1 %
Eval wall-clock (TP=2, concurrency=16)	—	8.6 min	10.5 min

FP8 quantization preserves accuracy to within ~0.5 pp F1, which is typical for FP8_BLOCK on a model of this size. Recall and valid-JSON rate went up slightly under FP8, while precision and empty-file accuracy went down slightly; net F1 is essentially unchanged. The eval also finished in ~18 % less wall-clock time than the BF16 intermediate at the same concurrency (8.6 min vs 10.5 min).

Methodology: file-by-file, set-comparison F1 against the teacher labels in the test split. A predicted finding matches a teacher finding when the category matches and the cited lines overlap by at least one line. JSON validity is computed as parse_findings(response) is not None; this assumes a strict parser that returns None on a parse failure, whereas the lenient helper in 1.4 returns [] in that case.

Eval ran on vllm-minfix5 with --tensor-parallel-size 2 --max-model-len 8192 --gpu-memory-utilization 0.85 against the openai-compatible endpoint at concurrency 16.
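
The matching rule described above can be sketched as follows. Two details are assumptions of this example rather than statements about the actual eval harness: overlap is tested on the listed line numbers (not on the min-to-max span), and each teacher finding can be matched by at most one prediction, using greedy first-fit order. Scores are micro-averaged over all files.

```python
def is_match(pred: dict, gold: dict) -> bool:
    """Same category and at least one shared line number."""
    return pred["category"] == gold["category"] and bool(
        set(pred["lines"]) & set(gold["lines"])
    )


def count_matches(preds, golds) -> int:
    """Greedy one-to-one matching of predictions to teacher findings in one file."""
    used = set()
    matched = 0
    for pred in preds:
        for i, gold in enumerate(golds):
            if i not in used and is_match(pred, gold):
                used.add(i)
                matched += 1
                break
    return matched


def micro_prf(files):
    """files: iterable of (preds, golds) per file. Returns (precision, recall, f1)."""
    tp = n_pred = n_gold = 0
    for preds, golds in files:
        tp += count_matches(preds, golds)
        n_pred += len(preds)
        n_gold += len(golds)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this rule a prediction with the right category but a line range that misses the teacher's lines entirely counts as both a false positive and a false negative.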

3.2. Per-classification F1 (this FP8 model)

classification	F1	recall	precision	files in test	findings (teacher / model / matched)
`jenkinsfile`	76.5 %	68.4 %	86.7 %	16	19 / 15 / 13
`kubernetes`	74.5 %	74.1 %	74.9 %	134	201 / 199 / 149
`dockerfile`	71.0 %	75.7 %	66.8 %	274	338 / 383 / 256
`gitlab_ci`	66.7 %	78.6 %	57.9 %	16	14 / 19 / 11
`typescript`	66.7 %	100.0 %	50.0 %	4	1 / 2 / 1 (very small N)
`helm`	64.6 %	70.7 %	59.4 %	42	58 / 69 / 41
`docker_compose`	62.4 %	62.7 %	62.0 %	121	255 / 258 / 160
`python`	59.4 %	65.5 %	54.3 %	49	29 / 35 / 19
`terraform`	59.3 %	59.5 %	59.1 %	193	153 / 154 / 91
`github_actions`	56.9 %	55.9 %	57.9 %	597	665 / 642 / 372
`cloudformation`	54.5 %	50.0 %	60.0 %	34	54 / 45 / 27
`arm_template`	44.4 %	53.3 %	38.1 %	11	15 / 21 / 8
`javascript`	n/a (n=2, 0 findings)	—	—	2	0 / 0 / 0
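
As a consistency check, summing the teacher / model / matched counts from the table above reproduces the headline numbers in 3.1. The snippet below is plain arithmetic on the published counts; only the variable and function names are introduced here.

```python
# classification: (teacher findings, model findings, matched), copied from the table above
COUNTS = {
    "jenkinsfile": (19, 15, 13),
    "kubernetes": (201, 199, 149),
    "dockerfile": (338, 383, 256),
    "gitlab_ci": (14, 19, 11),
    "typescript": (1, 2, 1),
    "helm": (58, 69, 41),
    "docker_compose": (255, 258, 160),
    "python": (29, 35, 19),
    "terraform": (153, 154, 91),
    "github_actions": (665, 642, 372),
    "cloudformation": (54, 45, 27),
    "arm_template": (15, 21, 8),
    "javascript": (0, 0, 0),
}


def aggregate(counts):
    """Micro-averaged (precision, recall, f1) in percent from per-class counts."""
    teacher = sum(t for t, _, _ in counts.values())
    model = sum(m for _, m, _ in counts.values())
    matched = sum(k for _, _, k in counts.values())
    precision = matched / model
    recall = matched / teacher
    f1 = 2 * matched / (teacher + model)
    return tuple(round(100 * x, 1) for x in (precision, recall, f1))


print(aggregate(COUNTS))  # (62.3, 63.7, 63.0)
```

The totals are 1,802 teacher findings, 1,842 model findings, and 1,148 matches, which gives precision 62.3 %, recall 63.7 %, and F1 63.0 %.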

3.3. Top per-category recall (this FP8 model, top 12 by teacher-finding volume)

category	recall	matched / teacher
`container_running_as_root`	88.1 %	171 / 194
`missing_security_context`	76.5 %	88 / 115
`hardcoded_credential`	75.8 %	69 / 91
`mutable_image_tag`	65.7 %	249 / 379
`unpinned_action_version`	64.2 %	212 / 330
`overly_permissive_ingress`	62.9 %	22 / 35
`missing_resource_limits`	61.7 %	37 / 60
`overly_permissive_workflow_permissions`	60.0 %	75 / 125
`deprecated_resource_feature`	56.5 %	35 / 62
`missing_encryption_in_transit`	52.9 %	18 / 34
`default_service_account_usage`	40.9 %	9 / 22
`public_exposure_unintended`	25.6 %	11 / 43 ← weakest cell

4. Limitations

This model inherits the limitations of the v2.0 dataset, which has a pronounced over-representation of CI/CD configuration files and a corresponding under-representation of application source code:

group	% of v2.0	distinct repos	notes
CI/CD workflows (`github_actions` 16.8 %, `gitlab_ci` 0.5 %, `jenkinsfile` 0.4 %)	17.6 %	1,564+	Heavy on GitHub Actions because the IaC corpus crawl picked up `.github/workflows/*.yml` from every repo it cloned, regardless of the repo's primary purpose.
Container & deploy (`dockerfile` 12.0 %, `helm` 8.6 %, `docker_compose` 7.1 %)	27.7 %	751+	Well-represented.
Core IaC (`kubernetes` 14.8 %, `terraform` 11.8 %, `cloudformation` 9.2 %, `arm_template` 1.8 %)	37.7 %	615+	Well-represented.
Application code (`python` 6.7 %, `typescript` 4.1 %, `javascript` 3.0 %, `java` 2.2 %, `go` 0.9 %, `rust` 0.0 %)	17.0 %	5–146 per language	Under-represented: only 146 Python repos, 5 Java repos, 5 Go repos, 1 Rust repo, 0 Ruby repos. The model has learned IaC/container security much more thoroughly than application-code security.

What this means in practice:

Strong: Kubernetes / Dockerfile / Helm / Terraform / CloudFormation / GitHub Actions security review. Per-classification F1 for these ranges from 54.5 % (CloudFormation) to 74.5 % (Kubernetes), bracketing the 63.0 % aggregate.
Weak: Python / JavaScript / TypeScript / Java / Go security review — per-classification F1 in the 49–60 % range, against a small number of test files (4–48 per language). The model can find common application-code issues (hardcoded credentials, basic injection patterns, missing-auth-check) but is not a confident replacement for Semgrep/Bearer at this version.
Untrained: Ruby (0 examples) — model output for Ruby files should be considered unreliable. Rust has 7 train examples and 1 source repo — same caveat.

In addition, the model inherits five further limitations from the dataset's labeling pipeline:

File-level only. No multi-file context, no project-wide reasoning. Cross-file dataflow vulnerabilities are out of scope.
English reasoning. All <think> blocks and JSON values are in English even when source code has non-English comments.
Teacher-class circularity. Labels were produced and audited by Qwen3.6-class models. The 93.3 % strict precision from the 200-sample manual audit (see 2.2) is a spot check, not ground truth, so treat the gold standard as approximate.
Severity is calibrated, not exploit-tested. A CRITICAL finding is one the teacher reasoned has direct exploit potential; it has not been validated against a working PoC.
No CWE field. The taxonomy maps approximately to CWE but the JSON schema does not currently emit a cwe field.

The dataset's published roadmap includes a v3.0 — application-code rebalance entry that explicitly targets these gaps; a future model release will be re-trained against that corpus.

5. Versioning


Model release	2026-05-09
Dataset version	`aioutfitters/llm-sast-v1` v2.0 (2026-05-09)
Base model	`Qwen/Qwen3.6-35B-A3B` (frozen — adapter merged + quantized; routed expert weights inherited unchanged from base)
System prompt SHA[:12]	`7de4fe802a11` (matches `system_prompt.txt` shipped here)
Compatible with	`vllm` ≥ 0.10 (block-FP8 was stabilized in vLLM 0.9), `transformers` ≥ 5.2 (qwen3_5_moe model class)
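
The system-prompt fingerprint can be checked locally before serving. The card does not name the hash algorithm, so SHA-256 over the raw file bytes is an assumption of this sketch; a mismatch may simply mean a different algorithm or newline normalization rather than a modified prompt. The helper name `prompt_fingerprint` is hypothetical.

```python
import hashlib


def prompt_fingerprint(data: bytes, length: int = 12) -> str:
    """First `length` hex characters of the SHA-256 digest (algorithm is assumed)."""
    return hashlib.sha256(data).hexdigest()[:length]


def prompt_matches(path: str, expected: str = "7de4fe802a11") -> bool:
    """Compare the file at `path` against the fingerprint listed in the table above."""
    with open(path, "rb") as f:
        return prompt_fingerprint(f.read()) == expected
```

Reading the file in binary mode avoids platform-dependent newline translation changing the digest.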

6. License

Apache-2.0. Base model and dataset are both Apache-2.0; the LoRA delta + quantization is also released under Apache-2.0.

7. Citation

@misc{aioutfitters_llm_sast_qwen36_35b_fp8_2026,
  title  = {LLM-SAST v1 (Qwen3.6-35B-A3B FP8): an LLM-native static
            application security testing model},
  author = {AI Outfitters},
  year   = {2026},
  url    = {https://huggingface.co/aioutfitters/llm-sast-v1-qwen3.6-35b-a3b-fp8},
  note   = {Fine-tune of Qwen/Qwen3.6-35B-A3B on aioutfitters/llm-sast-v1 v2.0,
            quantized to FP8 e4m3 block [128, 128].}
}
