LLM-SAST v1 — Qwen3.6-35B-A3B fine-tune (FP8 block)

A 35B-parameter mixture-of-experts model fine-tuned to perform static application security testing (SAST) directly on a single source file: no scanner in the loop, no retrieval, just (system prompt, file) → JSON array of findings. It is released as block-wise FP8 (e4m3, [128, 128] weight blocks, dynamic per-group=128 e4m3 activations), the same scheme Qwen/Qwen3.6-35B-A3B-FP8 ships with, so it loads in vLLM the same way the base FP8 release does.

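One file in, JSON out. The model is designed to replace scanners such as Checkov / Trivy / KICS / Semgrep / Bearer for the file types it covers, not to audit their output; section 4 lists the file types where that replacement is not yet reliable.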

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Fine-tuning data | aioutfitters/llm-sast-v1 v2.0 (15,822 ShareGPT examples, 18,320 findings, 81 categories) |
| Method | Supervised fine-tuning (4-bit QLoRA, r=128, α=128) → merge to BF16 → block-wise FP8 quantize |
| Quantization | compressed-tensors FP8 (e4m3); weights: block [128, 128]; activations: dynamic per-group=128. Mirrors Qwen3.6-35B-A3B-FP8's scheme. |
| Disk size | 35 GB (vs ~66 GB BF16 merged; vs ~70 GB base FP8) |
| F1 on held-out test split (1,583 ex), this FP8 model | 63.0 % (recall 63.7 %, precision 62.3 %, valid-JSON 94.2 %) |
| F1 of the un-fine-tuned base FP8 | 5.0 % |
| F1 of the BF16 pre-quant intermediate | 63.5 % (FP8 → −0.5 pp, within typical FP8 noise) |
| Recommended serving | vllm ≥ 0.10 with TP=2, BF16 KV cache or FP8 KV cache |

1. Quick start

The model is a drop-in replacement for Qwen/Qwen3.6-35B-A3B-FP8 in any vLLM-based deployment. It uses the same chat template and the same tokenizer, plus a single fixed system prompt (shipped as system_prompt.txt).

1.1. Serve with vLLM

docker run -d --name llm-sast --gpus all \
  --ipc=host --shm-size=32g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
    --model aioutfitters/llm-sast-v1-qwen3.6-35b-a3b-fp8 \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --language-model-only \
    --enable-prefix-caching \
    --served-model-name sast

Two RTX PRO 6000 (96 GB) or H100 (80 GB) GPUs comfortably hold the model with TP=2 + 8 K context. A single H100/H200 with TP=1 also works (set --gpu-memory-utilization 0.92).

1.2. Call it like any chat model

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Read the fixed system prompt that ships with this model
with open("system_prompt.txt", encoding="utf-8") as f:
    SYSTEM_PROMPT = f.read()

# Prefix every source line with its 1-indexed number so findings can cite lines
path = "infra/main.tf"
with open(path, encoding="utf-8") as f:
    source_code = f.read()
user_msg = (
    f"FILE: {path}\n\n"
    + "\n".join(f"{i+1:>4}: {line}" for i, line in enumerate(source_code.splitlines()))
)

resp = client.chat.completions.create(
    model="sast",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": user_msg},
    ],
    temperature=0.0,
    max_tokens=2048,
)

print(resp.choices[0].message.content)
# <think> ...analysis... </think>
# [{"lines":[12,13,14],"category":"missing_encryption_at_rest","severity":"HIGH",
#   "reasoning":"...","fix":"..."}]

1.3. Output contract

The assistant turn is expected to have this shape (94.2 % of test-split responses parse as valid JSON, see section 3.1):

<think>
<step-by-step expert security reasoning>
</think>
[ <JSON array of findings> ]

Each finding object:

| field | type | description |
|---|---|---|
| lines | list[int] | 1-indexed inclusive line numbers in the source file |
| category | string | One of 81 closed-taxonomy categories (defined in system_prompt.txt) |
| severity | enum | CRITICAL / HIGH / MEDIUM / LOW |
| reasoning | string | 1–2 sentences specific to the actual code |
| fix | string | 1–2 sentences with a concrete remediation |

If the file is clean, the assistant emits [] after the <think> block.
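Since roughly one response in twenty does not parse, and a parsed finding can still cite a line that does not exist, it may help to check each finding against the contract before acting on it. The following is a minimal sketch, not part of the release: `validate_finding`, `SEVERITIES`, and `FIELD_TYPES` are hypothetical names, and the optional `categories` set is assumed to be loaded by the caller from system_prompt.txt.

```python
from typing import Any, List, Optional, Set

SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}
FIELD_TYPES = {"lines": list, "category": str, "severity": str,
               "reasoning": str, "fix": str}


def validate_finding(finding: Any, n_source_lines: int,
                     categories: Optional[Set[str]] = None) -> List[str]:
    """Return contract violations; an empty list means the finding conforms."""
    if not isinstance(finding, dict):
        return ["finding is not a JSON object"]
    errors = []
    # Every field in the contract must be present with the documented type.
    for name, expected in FIELD_TYPES.items():
        if name not in finding:
            errors.append(f"missing field: {name}")
        elif not isinstance(finding[name], expected):
            errors.append(f"{name} should be {expected.__name__}")
    lines = finding.get("lines")
    if isinstance(lines, list):
        if not lines:
            errors.append("lines is empty")
        for n in lines:
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(n, bool) or not isinstance(n, int) or not 1 <= n <= n_source_lines:
                errors.append(f"line out of range: {n!r}")
    severity = finding.get("severity")
    if isinstance(severity, str) and severity not in SEVERITIES:
        errors.append(f"unknown severity: {severity}")
    category = finding.get("category")
    # The taxonomy check only runs when the caller supplies the category set.
    if categories is not None and isinstance(category, str) and category not in categories:
        errors.append(f"category outside taxonomy: {category}")
    return errors
```

A caller would typically pass `len(source_code.splitlines())` as `n_source_lines` and drop or log any finding that returns a non-empty list.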

1.4. Parse the response robustly

import json, re

def parse_findings(text: str):
    # Strip <think>...</think> if present
    body = re.sub(r"<think>.*?</think>\s*", "", text, count=1, flags=re.DOTALL)
    # Pull the first JSON array
    m = re.search(r"\[\s*(?:\{.*?\}\s*,?\s*)*\]", body, flags=re.DOTALL)
    if not m:
        return []
    try:
        arr = json.loads(m.group(0))
    except json.JSONDecodeError:
        return []
    return [f for f in arr if isinstance(f, dict)]
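The function above returns [] both for a clean file and for an unparseable response, so callers that need to tell the two apart (for example the validity check in section 3.1, which tests for None) need a stricter variant. Below is one possible sketch, not the authors' evaluation code: `parse_findings_strict` is a hypothetical name, and it swaps the regex for the standard library's `json.JSONDecoder.raw_decode`, which tolerates brackets and braces inside string values.

```python
import json
import re
from typing import Optional

_THINK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)
_DECODER = json.JSONDecoder()


def parse_findings_strict(text: str) -> Optional[list]:
    """Return the findings list, or None when no array of objects decodes."""
    body = _THINK.sub("", text, count=1)
    pos = body.find("[")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores the rest.
            value, _ = _DECODER.raw_decode(body, pos)
        except json.JSONDecodeError:
            value = None
        # Accept only a list whose elements are all objects; this skips inner
        # arrays such as a "lines" value left behind by a truncated response.
        if isinstance(value, list) and all(isinstance(f, dict) for f in value):
            return value
        pos = body.find("[", pos + 1)
    return None
```

With this variant, `[]` means the model reported a clean file and `None` means the response could not be parsed, which a pipeline may want to retry or flag rather than treat as clean.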

2. Methodology

2.1. Base model

We started from Qwen/Qwen3.6-35B-A3B, a 35B-parameter MoE + linear-attention hybrid: 60 transformer blocks, 256 routed experts per MoE layer, hybrid Mamba-style linear attention on alternating layers, ~3B active parameters per token.

We chose the 35B-A3B because it gave the best base-model SAST instruction-following on our pilot in the 27B-to-72B range, and because its activated-parameter count (~3B) makes it cheap to serve at production concurrency.

2.2. Dataset

Trained on aioutfitters/llm-sast-v1 v2.0: 15,822 ShareGPT examples, split into 12,657 train / 1,582 val / 1,583 test, disjoint by repository (3,055 distinct source repos).

| Pipeline stage | Engine | Output |
|---|---|---|
| Repo discovery | GitHub Search API + curated tier-1 vulnerable list | 3,055 repos retained |
| File extraction | Tier-aware sweep (tier1 vulnerable, tier2 production, tier3 community) | ~40 K candidate files |
| Best-of-5 labeling | Qwen3.6-27B-FP8 teacher, T-schedule [0.0, 0.3, 0.5, 0.7, 1.0], ±2-line dedup, majority vote | candidate findings |
| Single-call judge (v9b) | Same teacher, default-KEEP unless clearly wrong prompt + JSON-mode | confirmed findings |
| Reasoning synthesis | Same teacher, post-hoc | <think> block per example |
| Distribution gates | max_class_frac=0.30, max_cat_frac=0.25, max_empty_frac=0.30 | 15,822 final examples |

Each assistant turn passes the gates: 100 % valid JSON, 100 % <think> reasoning, 100 % closed-taxonomy compliance. The v7 portion was manually audited at 200-sample scale (93.3 % strict per-finding precision); the v8-sweep records were validated by the downstream training F1 (this model). See the dataset card for the full audit + provenance trail.
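The best-of-5 labeling stage combines ±2-line dedup with a majority vote. The exact procedure is not given here, so the following is only an illustrative sketch under stated assumptions: two findings count as the same when their category matches and any pair of cited lines is within 2 of each other, and a finding survives when more than half of the samples contain it. `same_finding` and `majority_vote` are hypothetical names.

```python
from typing import Dict, List

Finding = Dict[str, object]


def same_finding(a: Finding, b: Finding, tol: int = 2) -> bool:
    """Same category and at least one pair of cited lines within tol lines."""
    if a["category"] != b["category"]:
        return False
    return any(abs(x - y) <= tol for x in a["lines"] for y in b["lines"])


def majority_vote(samples: List[List[Finding]], tol: int = 2) -> List[Finding]:
    """Cluster near-duplicate findings across samples, keep majority-backed ones."""
    clusters = []  # each entry: {"rep": first finding seen, "voters": sample indices}
    for idx, findings in enumerate(samples):
        for f in findings:
            for c in clusters:
                if same_finding(c["rep"], f, tol):
                    c["voters"].add(idx)
                    break
            else:
                # No existing cluster matched, so this finding starts a new one.
                clusters.append({"rep": f, "voters": {idx}})
    needed = len(samples) // 2 + 1  # 3 of 5 for best-of-5
    return [c["rep"] for c in clusters if len(c["voters"]) >= needed]
```

Counting distinct sample indices rather than raw occurrences keeps a single sample that repeats itself from outvoting the others.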

2.3. Fine-tuning configuration

| Setting | Value |
|---|---|
| Method | Supervised fine-tuning (SFT) with assistant-only loss |
| Adapter | LoRA r=128, α=128, dropout=0, applied to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (target_parameters=[]: routed expert weights are NOT adapted) |
| Quantization at train time | 4-bit QLoRA via unsloth==2026.5.2 + bitsandbytes NF4 |
| Hardware | Single RTX PRO 6000 Blackwell (96 GB) on a Threadripper PRO 7975WX workstation |
| Optimizer | AdamW 8-bit (adamw_8bit), lr=2e-4, weight_decay=0.01 |
| Schedule | Cosine, warmup_ratio=0.01, 3 epochs |
| Effective batch | per_device_train_batch_size=1, grad_accum=8, bf16=True, seq_length=8192 |
| Eval cadence | Every half-epoch (eval_strategy=steps, eval_steps=395); prediction_loss_only=True, eval_accumulation_steps=1, bf16_full_eval=True, eval_on_start=True |
| Early stopping | EarlyStoppingCallback(early_stopping_patience=2) on eval_loss |
| Total training wall-clock | 11 h 5 min |

Eval-loss trajectory (lower = better):

| step | epoch | eval_loss |
|---|---|---|
| 395 | 0.50 | 0.4991 |
| 790 | 1.00 | 0.4673 |
| 1185 | 1.50 | 0.4556 |
| 1580 | 2.00 | 0.4432 |
| 1975 | 2.50 | 0.4439 |
| 2370 | 3.00 | 0.4421 (best) |

The final adapter is the step-2370 checkpoint. Total trainable parameters: ~256 M (0.7 % of base).
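The trajectory shows one non-improving evaluation (step 1975) followed by a new best, which is why a patience of 2 did not end training early. The snippet below replays that logic on the reported numbers. It is a simplified stand-in for the Hugging Face callback, assuming a zero improvement threshold, and `early_stopping_trace` is a hypothetical helper name.

```python
def early_stopping_trace(losses, patience=2):
    """Replay patience-based early stopping over (step, eval_loss) pairs.

    Returns (best_step, best_loss, stopped_at), where stopped_at is None
    when the patience budget was never exhausted.
    """
    best_step, best_loss = None, float("inf")
    bad_evals = 0
    for step, loss in losses:
        if loss < best_loss:
            # Any improvement resets the patience counter.
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, best_loss, step
    return best_step, best_loss, None


# Eval-loss values copied from the table above.
trajectory = [(395, 0.4991), (790, 0.4673), (1185, 0.4556),
              (1580, 0.4432), (1975, 0.4439), (2370, 0.4421)]
print(early_stopping_trace(trajectory))  # (2370, 0.4421, None)
```

With patience=1 the same trajectory would have stopped at step 1975 and kept the step-1580 checkpoint instead.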

2.4. Merge to BF16

The trained LoRA adapter was merged into a BF16 copy of the base model with peft.PeftModel.merge_and_unload(). Output: 16-shard safetensors, ~66 GB. This is the artifact passed to the FP8 quantizer.

2.5. FP8 block-wise quantization

The merged BF16 model was then quantized using llmcompressor==0.10.1a20260407 + compressed-tensors==0.15.0.1 with the FP8_BLOCK recipe to mirror the Qwen3.6-35B-A3B-FP8 scheme exactly:

from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",   # weights e4m3 block [128, 128];
                          # activations e4m3 dynamic per-group=128
    ignore=[
        "lm_head",
        "re:.*mlp\\.gate$",                # MoE router (NOT gate_proj)
        "re:.*mlp\\.shared_expert_gate$",   # 1-dim shared-expert scalar gate
        "re:.*linear_attn\\..*",            # Mamba/linear-attention projections
                                            #   (in_proj_a/b have output dim 32,
                                            #   not divisible by 128 under TP=2)
        "re:.*visual\\..*",                 # vision encoder (we serve language-only)
    ],
)

The recipe is data-free (no calibration set). Total quantization wall-clock: 57 s on 2× RTX PRO 6000.

The output uses quant_method: compressed-tensors (vLLM has stable native support since v0.5). The base Qwen3.6-35B-A3B-FP8 uses quant_method: fp8 — a different metadata wrapper, but the underlying weight format is identical: e4m3 with [128, 128] block scales.
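To make "[128, 128] block scales" concrete, the sketch below computes how many scale entries a weight matrix needs and derives one scale per block. It assumes the common convention scale = max|w| / 448, where 448 is the largest finite e4m3 value; the llmcompressor internals are not shown in this card, so treat the details as an assumption. The helper names are hypothetical, and the demo uses a tiny 2×2 block so the result is easy to check by hand.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def block_scale_grid(rows, cols, block=(128, 128)):
    """Shape of the scale tensor for a (rows, cols) weight matrix."""
    return math.ceil(rows / block[0]), math.ceil(cols / block[1])


def shards_align(dim, tp, block=128):
    """True when dim splits evenly across tp ranks into whole blocks."""
    return dim % tp == 0 and (dim // tp) % block == 0


def block_scales(weight, block=(128, 128)):
    """Per-block scale = max|w| / E4M3_MAX for a nested-list matrix."""
    br, bc = block
    n_rows, n_cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, n_rows, br):
        row_scales = []
        for c0 in range(0, n_cols, bc):
            # Largest magnitude inside this block, clipped at the matrix edge.
            amax = max(abs(weight[r][c])
                       for r in range(r0, min(r0 + br, n_rows))
                       for c in range(c0, min(c0 + bc, n_cols)))
            row_scales.append(amax / E4M3_MAX)
        scales.append(row_scales)
    return scales


print(block_scale_grid(2048, 4096))       # (16, 32)
print(shards_align(32, tp=2))             # False: 16 rows per rank, not a whole block
demo = [[1.0, -448.0, 3.0, 4.0],
        [5.0, 6.0, -224.0, 8.0]]
print(block_scales(demo, block=(2, 2)))   # [[1.0, 0.5]]
```

The `shards_align` check illustrates why projections with an output dimension of 32 are left unquantized in the ignore list above.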

3. Evaluation

3.1. F1 on the v2.0 held-out test split (1,583 examples)

| metric | base FP8 (no fine-tune) | this model (FP8) | pre-quant BF16 (reference) |
|---|---|---|---|
| F1 | 5.0 % | 63.0 % | 63.5 % |
| Recall | 5.8 % | 63.7 % | 63.3 % |
| Precision | 4.4 % | 62.3 % | 63.8 % |
| Valid JSON | 70.1 % | 94.2 % | 93.3 % |
| Empty-file accuracy | | 62.5 % | 65.1 % |
| Eval wall-clock (TP=2, concurrency=16) | | 8.6 min | 10.5 min |

FP8 quantization costs ~0.5 pp F1 relative to the BF16 intermediate (typical for FP8_BLOCK on a 35B model). Recall and valid-JSON rate went up slightly under FP8, while precision and empty-file accuracy went down slightly, so net F1 is essentially unchanged. Eval wall-clock is ~18 % shorter than for the BF16 intermediate at the same concurrency (8.6 min vs 10.5 min).

Methodology: file-by-file, set-comparison F1 against the teacher labels in the test split. A predicted finding matches a teacher finding when category matches and there is ≥1 line of overlap in the cited line range. JSON-validity is computed as parse_findings(response) is not None.

Eval ran on vllm-minfix5 with --tensor-parallel-size 2 --max-model-len 8192 --gpu-memory-utilization 0.85 against the openai-compatible endpoint at concurrency 16.
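The matching rule described above (same category, at least one line of overlap) can be sketched as follows. This is an illustration rather than the evaluation harness itself: it assumes greedy one-to-one matching and treats each finding's lines as the inclusive range from its smallest to its largest cited line. `count_matches` and `prf` are hypothetical names.

```python
def ranges_overlap(a_lines, b_lines):
    """True when the inclusive [min, max] ranges of two line lists intersect."""
    return max(min(a_lines), min(b_lines)) <= min(max(a_lines), max(b_lines))


def count_matches(teacher, predicted):
    """Return (n_teacher, n_predicted, n_matched) for one file."""
    used = set()  # indices of predictions already paired with a teacher finding
    matched = 0
    for t in teacher:
        for j, p in enumerate(predicted):
            if j in used:
                continue
            if p["category"] == t["category"] and ranges_overlap(t["lines"], p["lines"]):
                used.add(j)
                matched += 1
                break
    return len(teacher), len(predicted), matched


def prf(n_teacher, n_predicted, n_matched):
    """Precision, recall, and F1 from finding counts."""
    precision = n_matched / n_predicted if n_predicted else 0.0
    recall = n_matched / n_teacher if n_teacher else 0.0
    total = n_teacher + n_predicted
    f1 = 2 * n_matched / total if total else 0.0
    return precision, recall, f1


teacher = [{"lines": [12, 13, 14], "category": "a"}, {"lines": [30], "category": "b"}]
predicted = [{"lines": [14, 15], "category": "a"},
             {"lines": [30], "category": "c"},
             {"lines": [50], "category": "b"}]
print(count_matches(teacher, predicted))  # (2, 3, 1)
```

Here only the first prediction matches: the second has the right line but the wrong category, and the third has the right category but no line overlap.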

3.2. Per-classification F1 (this FP8 model)

| classification | F1 | recall | precision | files in test | findings (teacher / model / matched) |
|---|---|---|---|---|---|
| jenkinsfile | 76.5 % | 68.4 % | 86.7 % | 16 | 19 / 15 / 13 |
| kubernetes | 74.5 % | 74.1 % | 74.9 % | 134 | 201 / 199 / 149 |
| dockerfile | 71.0 % | 75.7 % | 66.8 % | 274 | 338 / 383 / 256 |
| gitlab_ci | 66.7 % | 78.6 % | 57.9 % | 16 | 14 / 19 / 11 |
| typescript | 66.7 % | 100.0 % | 50.0 % | 4 | 1 / 2 / 1 (very small N) |
| helm | 64.6 % | 70.7 % | 59.4 % | 42 | 58 / 69 / 41 |
| docker_compose | 62.4 % | 62.7 % | 62.0 % | 121 | 255 / 258 / 160 |
| python | 59.4 % | 65.5 % | 54.3 % | 49 | 29 / 35 / 19 |
| terraform | 59.3 % | 59.5 % | 59.1 % | 193 | 153 / 154 / 91 |
| github_actions | 56.9 % | 55.9 % | 57.9 % | 597 | 665 / 642 / 372 |
| cloudformation | 54.5 % | 50.0 % | 60.0 % | 34 | 54 / 45 / 27 |
| arm_template | 44.4 % | 53.3 % | 38.1 % | 11 | 15 / 21 / 8 |
| javascript | n/a (n=2, 0 findings) | n/a | n/a | 2 | 0 / 0 / 0 |
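As a consistency check, pooling the finding counts from the rows above (a micro-average) reproduces the aggregate recall, precision, and F1 reported in section 3.1 to one decimal place. The snippet is a small worked calculation; the variable names are illustrative only.

```python
# (teacher, model, matched) finding counts copied from the table above.
ROWS = {
    "jenkinsfile": (19, 15, 13),
    "kubernetes": (201, 199, 149),
    "dockerfile": (338, 383, 256),
    "gitlab_ci": (14, 19, 11),
    "typescript": (1, 2, 1),
    "helm": (58, 69, 41),
    "docker_compose": (255, 258, 160),
    "python": (29, 35, 19),
    "terraform": (153, 154, 91),
    "github_actions": (665, 642, 372),
    "cloudformation": (54, 45, 27),
    "arm_template": (15, 21, 8),
    "javascript": (0, 0, 0),
}

# Micro-averaging pools the counts before dividing, so large classes
# such as github_actions dominate the aggregate.
n_teacher = sum(t for t, _, _ in ROWS.values())
n_model = sum(m for _, m, _ in ROWS.values())
n_matched = sum(k for _, _, k in ROWS.values())

recall = n_matched / n_teacher
precision = n_matched / n_model
f1 = 2 * n_matched / (n_teacher + n_model)

print(n_teacher, n_model, n_matched)  # 1802 1842 1148
print(f"recall={recall:.1%} precision={precision:.1%} f1={f1:.1%}")
# recall=63.7% precision=62.3% f1=63.0%
```

Because github_actions contributes more than a third of all teacher findings, its below-average F1 pulls the aggregate down more than the small classes pull it up.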

3.3. Top per-category recall (this FP8 model, top 12 by teacher-finding volume)

| category | recall | matched / teacher |
|---|---|---|
| container_running_as_root | 88.1 % | 171 / 194 |
| missing_security_context | 76.5 % | 88 / 115 |
| hardcoded_credential | 75.8 % | 69 / 91 |
| mutable_image_tag | 65.7 % | 249 / 379 |
| unpinned_action_version | 64.2 % | 212 / 330 |
| overly_permissive_ingress | 62.9 % | 22 / 35 |
| missing_resource_limits | 61.7 % | 37 / 60 |
| overly_permissive_workflow_permissions | 60.0 % | 75 / 125 |
| deprecated_resource_feature | 56.5 % | 35 / 62 |
| missing_encryption_in_transit | 52.9 % | 18 / 34 |
| default_service_account_usage | 40.9 % | 9 / 22 |
| public_exposure_unintended | 25.6 % | 11 / 43 (weakest cell) |

4. Limitations

This model inherits the limitations of the v2.0 dataset, which has a pronounced over-representation of CI/CD configuration files and a corresponding under-representation of application source code:

| group | % of v2.0 | distinct repos | notes |
|---|---|---|---|
| CI/CD workflows (github_actions 16.8 %, gitlab_ci 0.5 %, jenkinsfile 0.4 %) | 17.6 % | 1,564+ | Heavy on GitHub Actions because the IaC corpus crawl picked up .github/workflows/*.yml from every repo it cloned, regardless of the repo's primary purpose. |
| Container & deploy (dockerfile 12.0 %, helm 8.6 %, docker_compose 7.1 %) | 27.7 % | 751+ | Well-represented. |
| Core IaC (kubernetes 14.8 %, terraform 11.8 %, cloudformation 9.2 %, arm_template 1.8 %) | 37.7 % | 615+ | Well-represented. |
| Application code (python 6.7 %, typescript 4.1 %, javascript 3.0 %, java 2.2 %, go 0.9 %, rust 0.0 %) | 17.0 % | 1–146 per language | Under-represented: only 146 Python repos, 5 Java repos, 5 Go repos, 1 Rust repo, 0 Ruby repos. The model has learned IaC/container security much more thoroughly than application-code security. |

What this means in practice:

  • Strong: Kubernetes / Dockerfile / Helm / Terraform / CloudFormation / GitHub Actions security review. Per-classification F1 ranges from 54.5 % (CloudFormation) to 74.5 % (Kubernetes) around the 63.0 % aggregate; Kubernetes, Dockerfile, and Helm are above 60 %.
  • Weak: Python / JavaScript / TypeScript / Java / Go security review. Section 3.2 reports 59.4 % F1 for Python on 49 test files; TypeScript (4 files) and JavaScript (2 files) are too small to measure reliably, and no per-classification figures are reported for Java or Go. The model can find common application-code issues (hardcoded credentials, basic injection patterns, missing-auth-check) but is not a confident replacement for Semgrep/Bearer at this version.
  • Untrained: Ruby (0 examples) — model output for Ruby files should be considered unreliable. Rust has 7 train examples and 1 source repo — same caveat.

In addition, the model inherits five further limitations from the dataset's labeling pipeline:

  • File-level only. No multi-file context, no project-wide reasoning. Cross-file dataflow vulnerabilities are out of scope.
  • English reasoning. All <think> blocks and JSON values are in English even when source code has non-English comments.
  • Teacher-class circularity. Labels were produced and audited by Qwen3.6-class models. The 93.3 % strict precision in the v1.0 manual audit is an audit, not ground truth — treat the gold standard as approximate.
  • Severity is calibrated, not exploit-tested. A CRITICAL finding is one the teacher reasoned has direct exploit potential; it has not been validated against a working PoC.
  • No CWE field. The taxonomy maps approximately to CWE but the JSON schema does not currently emit a cwe field.

The dataset's published roadmap includes a v3.0 — application-code rebalance entry that explicitly targets these gaps; a future model release will be re-trained against that corpus.

5. Versioning

| Field | Value |
|---|---|
| Model release | 2026-05-09 |
| Dataset version | aioutfitters/llm-sast-v1 v2.0 (2026-05-09) |
| Base model | Qwen/Qwen3.6-35B-A3B (frozen; adapter merged + quantized; routed expert weights inherited unchanged from base) |
| System prompt SHA[:12] | 7de4fe802a11 (matches system_prompt.txt shipped here) |
| Compatible with | vllm ≥ 0.10 (block-FP8 was stabilized in vLLM 0.9), transformers ≥ 5.2 (qwen3_5_moe model class) |

6. License

Apache-2.0. The base model and dataset are both Apache-2.0; the LoRA delta and the quantized weights are also released under Apache-2.0.

7. Citation

@misc{aioutfitters_llm_sast_qwen36_35b_fp8_2026,
  title  = {LLM-SAST v1 (Qwen3.6-35B-A3B FP8): an LLM-native static
            application security testing model},
  author = {AI Outfitters},
  year   = {2026},
  url    = {https://huggingface.co/aioutfitters/llm-sast-v1-qwen3.6-35b-a3b-fp8},
  note   = {Fine-tune of Qwen/Qwen3.6-35B-A3B on aioutfitters/llm-sast-v1 v2.0,
            quantized to FP8 e4m3 block [128, 128].}
}