PPL(Q)/PPL(base) and KLD seem significantly worse than Step 3.5 Flash

#2
by tarruda - opened

For your IQ4_XS quants of each model:

| Model | PPL(Q)/PPL(base) | KLD |
| --- | --- | --- |
| Step 3.5 Flash | +2.5984% | 0.042753 ± 0.000301 |
| Step 3.5 Flash - 2026-06-03 | +11.1222% | 0.127994 ± 0.000696 |
| Step 3.7 Flash | +15.6194% | 0.159934 ± 0.000946 |

It is surprising since they have the same architecture.

I also can reproduce infinite reasoning loops with the GGUFs (both unsloth IQ4_XS and stepfun Q4_K_S).

The IQ4_XS recipe for both models was the same, so IMO it's more likely down to the training the model has undergone since then? I also need to re-convert these to pick up the tokenizer fix, but I've been waiting to see if the Step-3.5-Flash MTP PR merges soon because I'd rather do both at once.

is the issue still there after the reupload? no matter what they did with the training data, IQ4_XS shouldn't be massively worse than Q4_K.

> is the issue still there after the reupload?

Haven't tried

I feel that there's something wrong with Step 3.7 on llama.cpp. It gets stuck in reasoning loops very easily at > 30k or so context. The same tests using the official stepfun API work fine, so it is either a bug in the llama.cpp implementation or quantization simply impacts it much more than previous versions.

@LagOps there aren't issues AFAIK, but I haven't really played with the model outside of quantizing it and validating the PPL/KLD aren't insane so YMMV? /shrug

@AesSedai I've updated the table to include the new perplexity/kld values from the newly updated 3.5 Flash. This is looking more like some llama.cpp regression.

I did use a different imatrix corpus for the old quants, but surely that's not that big of an impact? Hmm.

Edit 1:
Testing the other imatrix on Step-3.5-Flash to see if that reproduces the lower PPL / KLD.

Edit 2:
Validated that it isn't the imatrix corpus at least:

| Quant | Size | Mixture | PPL | 1-(Mean PPL(Q)/PPL(base)) | KLD |
| --- | --- | --- | --- | --- | --- |
| IQ4_XS | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.685042 ± 0.012863 | +11.1222% | 0.127994 ± 0.000696 |
| IQ4_XS_ED | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.684072 ± 0.012842 | +11.0821% | 0.128012 ± 0.000697 |

@aessedai I know that this is not scientific, but I did try your old Q4_K_M quant for Step 3.5 Flash (which I had cached locally) and it seems solid. I couldn't reproduce any infinite reasoning loop with it (still haven't tried the new one though).

In the Step 3.7 PR there were RoPE changes: https://github.com/ggml-org/llama.cpp/pull/23845/changes#diff-f35bc90601d40b7d19e7abd866ba0e45ce96fa13211399d8ef8c3577b7329142 Could this be something that potentially affects quants in long contexts?

@ubergarm's IQ4_XS quant had similar perplexity to your previous one at ~+2.5%: https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF#iq4_xs-10053-gib-438-bpw

Hard to say what could factor into this change, but maybe it is worth re-running the perplexity/kld test on an old llama.cpp version?

I did the conversion after the RoPE changes, so that's integrated into my 3.7 quant.

I'm testing a couple more things out, I had a couple of local tweaks that I'm trying to rule out as the cause of the discrepancy now. Worst case I'll try testing some older llama versions, yeah.

Okay, I double-checked a completely clean and recompiled llama.cpp without my tweaks and it still looks the same (this makes me feel better about the changes I haven't PR'd yet)😀

| Quant | Size | Mixture | PPL | 1-(Mean PPL(Q)/PPL(base)) | KLD |
| --- | --- | --- | --- | --- | --- |
| IQ4_XS | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.685042 ± 0.012863 | +11.1222% | 0.127994 ± 0.000696 |
| IQ4_XS_ED | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.684072 ± 0.012842 | +11.0821% | 0.128012 ± 0.000697 |
| IQ4_XS_UPSTREAM | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.685042 ± 0.012863 | +11.1316% | 0.128100 ± 0.000697 |

So yes, at this point I'm inclined to think there has been a regression somewhere: the file size and BPW are identical, so I know the recipe hasn't drifted.

In case you want to run git bisect, a good commit would be 492bc319782b1f13f302911f4c73437382cc8bb9, which is probably the version I used when I ran lm-evaluation-harness here: https://huggingface.co/AesSedai/Step-3.5-Flash-GGUF/discussions/3

I started testing older llama.cpp versions and can almost guarantee that 3dadc88b589ca43b8fca0e1beb22d4b78a09b4dd doesn't have the issue. At least I haven't been able to spot the repeated patterns I see on llama.cpp master, and the behavior looks a lot more like what I see when using stepfun API.

Doing a git bisect now. It will take a while because I have to run the agentic loop 3-4 times in each step to confirm if it is a good/bad commit, but should be done by tomorrow.
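A bisect like this needs a pass/fail signal for each run. As a rough sketch only (the thread does not say how the loops were detected, so the heuristic, the function names, and the thresholds below are all hypothetical), a run could be flagged when its output ends with the same chunk of text repeated several times in a row:

```python
def has_reasoning_loop(text: str, min_period: int = 20,
                       max_period: int = 2000, min_repeats: int = 3) -> bool:
    """Return True if `text` ends with one chunk repeated back to back.

    Hypothetical heuristic: for every candidate period length, take the
    last `period` characters and check whether the text ends with that
    chunk repeated `min_repeats` times.
    """
    n = len(text)
    longest = min(max_period, n // min_repeats)
    for period in range(min_period, longest + 1):
        chunk = text[n - period:]
        if text.endswith(chunk * min_repeats):
            return True
    return False


def commit_is_bad(transcripts: list[str]) -> bool:
    """Treat a commit as bad if any of its runs ended in a loop."""
    return any(has_reasoning_loop(t) for t in transcripts)
```

A wrapper script around this could be handed to `git bisect run`, which treats exit code 0 as good and exit codes 1 to 127 (except 125, which means skip) as bad. Since the loops are intermittent, several runs per commit are still needed before trusting a "good" verdict.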

I found the cause of the repeated reasoning patterns I was seeing that made it behave differently than the Stepfun API. Bisect showed this as the first bad commit: https://github.com/ggml-org/llama.cpp/pull/18675

Doesn't seem like that PR changed anything that could cause differences in perplexity, so that might be a separate regression in llama.cpp.

I reported the autoparser bug: https://github.com/ggml-org/llama.cpp/issues/24181

I can also try to bisect the perplexity/kld loss if you teach me how to run it locally. Never really ran any perplexity kld tests, not even sure if I can with 128G RAM.

I'm concerned that it might be an issue with the convert itself instead of being an inference-time issue too.

I've uploaded the reference logits here: https://huggingface.co/datasets/AesSedai/reference-logits/blob/main/Step-3.7-Flash-BF16-512ctx-wiki.test.raw.bin

The corpus is just the wiki.test.raw, also uploaded there: https://huggingface.co/datasets/AesSedai/reference-logits/blob/main/wiki.test.raw

Running a PPL test is pretty simple and doesn't require the reference logits; running KLD does require them. They're about the same effort and it's easy to do both in one command, e.g.:

./llama.cpp/build/bin/llama-perplexity \
  --threads 48 --flash-attn on -lv 4 \
  --batch-size 8192 --ubatch-size 8192 \
  --file /mnt/srv/host/resources/KLD/wiki.test.raw \
  --kl-divergence-base /mnt/srv/snowdrift/ref-logits/Step-3.7-Flash-BF16-512ctx-wiki.test.raw.bin --kl-divergence \
  --model /mnt/srv/snowdrift/gguf/Step-3.7-Flash-GGUF/aes_sedai/Step-3.7-Flash-Q5_K_M.gguf

You can adjust the threads and batch/ubatch size to suit your needs; that won't change the result. Point --file to the wiki.test.raw file, --kl-divergence-base to the reference logits, and --model to your quantized model. Just don't add or adjust any --ctx-size parameter: the reference logits were generated at 512 context, which is the default. That will output a block at the end similar to this:

====== Perplexity statistics ======
Mean PPL(Q)                   :   1.911601 ±   0.007329
Mean PPL(base)                :   1.892159 ±   0.007192
Cor(ln(PPL(Q)), ln(PPL(base))):  99.00%
Mean ln(PPL(Q)/PPL(base))     :   0.010223 ±   0.000540
Mean PPL(Q)/PPL(base)         :   1.010275 ±   0.000546
Mean PPL(Q)-PPL(base)         :   0.019442 ±   0.001035

====== KL divergence statistics ======
Mean    KLD:   0.017023 ±   0.000119
Maximum KLD:   2.310791
99.9%   KLD:   0.477511
99.0%   KLD:   0.204605
95.0%   KLD:   0.081936
90.0%   KLD:   0.046680
Median  KLD:   0.001903
10.0%   KLD:   0.000006
 5.0%   KLD:   0.000002
 1.0%   KLD:  -0.000001
 0.1%   KLD:  -0.000007
Minimum KLD:  -0.000316

You're looking for Mean PPL(Q) and Mean KLD for comparison.
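For comparing several runs, the relevant numbers can be pulled out of that block with a few regular expressions. This is a minimal sketch, assuming the log format shown above; the function name and dictionary keys are made up for illustration, and the percentage is computed on the assumption that the percent column in the tables earlier in this thread is the Mean PPL(Q)/PPL(base) ratio minus one.

```python
import re


def parse_perplexity_stats(log: str) -> dict[str, float]:
    """Extract Mean PPL(Q), the PPL ratio, and Mean KLD from the log text."""
    patterns = {
        "ppl_q": r"Mean PPL\(Q\)\s*:\s*([-\d.]+)",
        "ppl_ratio": r"Mean PPL\(Q\)/PPL\(base\)\s*:\s*([-\d.]+)",
        "mean_kld": r"Mean\s+KLD:\s*([-\d.]+)",
    }
    stats = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log)
        if match:
            stats[key] = float(match.group(1))
    # Assumption: the table percentage is (ratio - 1) expressed in percent.
    if "ppl_ratio" in stats:
        stats["ppl_increase_pct"] = (stats["ppl_ratio"] - 1.0) * 100.0
    return stats
```

Applied to the sample output above, the ratio of 1.010275 corresponds to roughly +1.03%.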

If the issue is with the conversion, should I try to convert the BF16 again using an old llama.cpp version, recreate the quants and then run perplexity + kld against the logits?

I think that would be a fair first test: convert to BF16 on the old llama.cpp and on the new llama.cpp and double-check those are identical (or close), then quantize to IQ4_XS with each version and compare those. I don't think they would end up bit-identical, but I'm sure you could have a frontier LLM whip up a tensor comparison script to tell you the % difference on the raw weights?

That would rule out the convert-side issue at least. If that checks out fine, then it's an inference-time issue and that would mean git-bisecting to find which commit produces the PPL/KLD difference.
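The comparison step of such a script could look roughly like this. It is only a sketch of one possible metric (relative L2 difference per tensor), and it assumes the tensors from both files have already been loaded and dequantized into flat sequences of floats keyed by tensor name; reading the GGUF files is not shown, and all names here are hypothetical.

```python
import math
from typing import Mapping, Sequence


def relative_diff_pct(a: Sequence[float], b: Sequence[float]) -> float:
    """Relative L2 difference ||a - b|| / ||a||, in percent."""
    if len(a) != len(b):
        raise ValueError("tensor sizes differ")
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    norm = math.sqrt(sum(x * x for x in a))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return 100.0 * diff / norm


def compare_tensors(old: Mapping[str, Sequence[float]],
                    new: Mapping[str, Sequence[float]]) -> dict[str, float]:
    """Per-tensor % difference for every tensor name present in both files."""
    report = {}
    for name in sorted(old.keys() & new.keys()):
        report[name] = relative_diff_pct(old[name], new[name])
    return report
```

Plain Python is used here to keep the sketch dependency-free; for real model sizes an array library would be the practical choice. Tensors that appear in only one of the two files are ignored by this version and would be worth reporting separately.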

I did the conversion + quantization (Q8_0 / IQ3_S / IQ3_S / IQ4_XS recipe) using an old version with a backport of the Step 3.7 Flash PR (minus the vision part) and your imatrix. The result is similar to your recent quants:

[image: results of the re-converted quant]

I used Step 3.7 to avoid downloading the 3.5 original tensors, but should probably do it next.

One thing I'm curious about: if the problem is in the initial GGUF conversion, wouldn't I need to use reference logits and an imatrix generated from the original BF16 conversion?

PPL would be independent of the logit set used, at least, but for KLD, yeah, you'd have to have the originals again.
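The distinction can be illustrated with a toy sketch, assuming the usual definitions: perplexity is computed from the log-probabilities the quantized model itself assigns to the correct tokens, while KL divergence compares the quantized model's full output distribution against the base model's at each position. The function names are illustrative and this is not llama.cpp's code.

```python
import math


def perplexity(token_logprobs):
    """PPL = exp(-mean log p(correct token)); needs only the model's own output."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) for one position; needs the base model's logits."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

Only `kl_divergence` takes the base logits as input, which is why a changed BF16 conversion would call for regenerated reference logits for KLD but not for Mean PPL(Q).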

Wouldn't it make sense to test this by:

  1. Using an up-to-date version of llama.cpp
  2. Regenerating the imatrix for Step 3.5 Flash (so the old version)
  3. Regenerating the same IQ4_XS recipe for 3.5 Flash (so again, the old version)

If the PPL is still fine, then we know llama.cpp has not regressed. If it has, then we can start bisecting.

Just to confirm: Step 3.7 Flash is using a new imatrix generated with the 3.7 model, right? Want to make sure you didn't accidentally reuse the 3.5 imatrix to make the 3.7 quants.

I made a new imatrix for each of them, correct. I haven't taken the time to re-test this yet but I'll try to over the weekend.

Did you manage to retest by any chance? No worries if you haven't had time yet, but I was just curious :)

No, I haven't yet. I've had other projects consuming my GPU time and this hasn't been an "it's totally broken across every quantization"-level of issue so it's a bit backburnered at the moment. I will review it but no guarantee on the timeframe.

There's a possibility that this could just have been incorrect perplexity/kld measurement the first time.

The issue I mentioned here was mostly caused by this, and opting out of the autoparser (patch in the GH issue) for Step 3.x makes it behave very similarly to the official API, so the model weights look fine.
