PPL(Q)/PPL(base) and KLD seem significantly worse than Step 3.5 Flash
For your IQ4_XS quants of each model:
| Model | PPL(Q)/PPL(base) | KLD |
|---|---|---|
| Step 3.5 Flash | +2.5984% | 0.042753 ± 0.000301 |
| Step 3.5 Flash - 2026-06-03 | +11.1222% | 0.127994 ± 0.000696 |
| Step 3.7 Flash | +15.6194% | 0.159934 ± 0.000946 |
It is surprising since they have the same architecture.
I also can reproduce infinite reasoning loops with the GGUFs (both unsloth IQ4_XS and stepfun Q4_K_S).
The IQ4_XS recipe for both models was the same, so IMO it's more up to the training the model underwent since then? I also need to re-convert these to pick up the tokenizer fix but I've been waiting to see if the Step-3.5-Flash MTP PR merges soon because I'd rather do both at once.
is the issue still there after the reupload? no matter what they did with the training data, IQ4_XS shouldn't be massively worse than Q4_K.
> is the issue still there after the reupload?
Haven't tried
I feel that there's something wrong with Step 3.7 on llama.cpp. It gets stuck in reasoning loops very easily once the context goes past 30k or so. The same tests using the official stepfun API work fine, so it is either a bug in the llama.cpp implementation or quantization simply impacts it much more than previous versions.
I did use a different imatrix corpus for the old quants, but surely that's not that big of an impact? Hmm.
Edit 1:
Testing the other imatrix on Step-3.5-Flash to see if that reproduces the lower PPL / KLD.
Edit 2:
Validated that it isn't the imatrix corpus at least:
| Quant | Size | Mixture | PPL | Mean PPL(Q)/PPL(base) - 1 | KLD |
|---|---|---|---|---|---|
| IQ4_XS | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.685042 ± 0.012863 | +11.1222% | 0.127994 ± 0.000696 |
| IQ4_XS_ED | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.684072 ± 0.012842 | +11.0821% | 0.128012 ± 0.000697 |
@aessedai I know that this is not scientific, but I did try your old Q4_K_M quant for Step 3.5 Flash (which I had cached locally) and it seems solid. I couldn't reproduce any infinite reasoning loop with it (still haven't tried the new one though).
In the Step 3.7 PR there were RoPE changes: https://github.com/ggml-org/llama.cpp/pull/23845/changes#diff-f35bc90601d40b7d19e7abd866ba0e45ce96fa13211399d8ef8c3577b7329142 Could this be something that potentially affects quants in long contexts?
@ubergarm's IQ4_XS quant had similar perplexity to your previous one at ~+2.5%: https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF#iq4_xs-10053-gib-438-bpw
Hard to say what could factor into this change, but maybe it is worth re-running the perplexity/kld test on an old llama.cpp version?
I did the conversion after the RoPE changes, so that's integrated into my 3.7 quant.
I'm testing a couple more things out, I had a couple of local tweaks that I'm trying to rule out as the cause of the discrepancy now. Worst case I'll try testing some older llama versions, yeah.
Okay, I double-checked a completely clean and recompiled llama.cpp without my tweaks and it still looks the same (this makes me feel better about the changes I haven't PR'd yet)😀
| Quant | Size | Mixture | PPL | Mean PPL(Q)/PPL(base) - 1 | KLD |
|---|---|---|---|---|---|
| IQ4_XS | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.685042 ± 0.012863 | +11.1222% | 0.127994 ± 0.000696 |
| IQ4_XS_ED | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.684072 ± 0.012842 | +11.0821% | 0.128012 ± 0.000697 |
| IQ4_XS_UPSTREAM | 91.31 GiB (3.93 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 2.685042 ± 0.012863 | +11.1316% | 0.128100 ± 0.000697 |
Which, yes, at this point inclines me to think there has been a regression somewhere: the file size and BPW are identical, so I know the recipe hasn't drifted.
In case you want to run git bisect, a good commit would be 492bc319782b1f13f302911f4c73437382cc8bb9, which is probably the version I used when I ran lm-evaluation-harness here: https://huggingface.co/AesSedai/Step-3.5-Flash-GGUF/discussions/3
I started testing older llama.cpp versions and can almost guarantee that 3dadc88b589ca43b8fca0e1beb22d4b78a09b4dd doesn't have the issue. At least I haven't been able to spot the repeated patterns I see on llama.cpp master, and the behavior looks a lot more like what I see when using stepfun API.
Doing a git bisect now. It will take a while because I have to run the agentic loop 3-4 times in each step to confirm if it is a good/bad commit, but should be done by tomorrow.
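The bisect procedure described above, where each candidate commit is exercised several times before it is labelled, can be sketched roughly as follows. This is only an illustration: `run_once` is a hypothetical callback standing in for one run of the agentic loop, and treating a commit as bad when any single run reproduces the loop is an assumption, not something stated in the thread. In practice `git bisect` picks the commits itself; the sketch only shows why the repeated runs multiply the cost of each step.

```python
from typing import Callable, Sequence


def is_bad(commit: str, run_once: Callable[[str], bool], repeats: int = 4) -> bool:
    """Label a commit bad if any of `repeats` runs reproduces the problem.

    `run_once` is a hypothetical callback: it returns True when a single
    test run on that commit shows the repeated reasoning pattern.
    """
    return any(run_once(commit) for _ in range(repeats))


def first_bad_commit(
    commits: Sequence[str],
    run_once: Callable[[str], bool],
    repeats: int = 4,
) -> str:
    """Binary search for the first bad commit.

    Assumes `commits` is ordered oldest to newest, commits[0] is known
    good, and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid], run_once, repeats):
            hi = mid  # the regression is at or before mid
        else:
            lo = mid  # the regression is after mid
    return commits[hi]
```

Because a good commit can only be confirmed by running all `repeats` attempts without a reproduction, good steps are the expensive ones, which fits the "should be done by tomorrow" estimate.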
I found the cause of the repeated reasoning patterns I was seeing that made it behave differently than the Stepfun API, bisect showed this as the first bad commit: https://github.com/ggml-org/llama.cpp/pull/18675
Doesn't seem like that PR changed anything that could cause differences in perplexity, so that might be a separate regression in llama.cpp.
I reported the autoparser bug: https://github.com/ggml-org/llama.cpp/issues/24181
I can also try to bisect the perplexity/KLD loss if you teach me how to run it locally. I've never really run any perplexity or KLD tests, and I'm not even sure I can with 128 GB of RAM.
I'm concerned that it might be an issue with the convert itself instead of being an inference-time issue too.
I've uploaded the reference logits here: https://huggingface.co/datasets/AesSedai/reference-logits/blob/main/Step-3.7-Flash-BF16-512ctx-wiki.test.raw.bin
The corpus is just the wiki.test.raw, also uploaded there: https://huggingface.co/datasets/AesSedai/reference-logits/blob/main/wiki.test.raw
Running a PPL test is pretty simple and doesn't require the reference logits; running KLD does require them. They're about the same effort and it's easy to do both in one command, e.g.:
./llama.cpp/build/bin/llama-perplexity \
--threads 48 --flash-attn on -lv 4 \
--batch-size 8192 --ubatch-size 8192 \
--file /mnt/srv/host/resources/KLD/wiki.test.raw \
--kl-divergence-base /mnt/srv/snowdrift/ref-logits/Step-3.7-Flash-BF16-512ctx-wiki.test.raw.bin --kl-divergence \
--model /mnt/srv/snowdrift/gguf/Step-3.7-Flash-GGUF/aes_sedai/Step-3.7-Flash-Q5_K_M.gguf
You can adjust the threads and batch/ubatch size to suit your needs; that won't change the result. Point --file to the wiki.test.raw file, --kl-divergence-base to the reference logits, and --model to your quantized model. Just don't add or adjust any --ctx-size parameter, since 512 is the default and the reference logits were generated at 512 context. That will output a block at the end similar to this:
====== Perplexity statistics ======
Mean PPL(Q) : 1.911601 ± 0.007329
Mean PPL(base) : 1.892159 ± 0.007192
Cor(ln(PPL(Q)), ln(PPL(base))): 99.00%
Mean ln(PPL(Q)/PPL(base)) : 0.010223 ± 0.000540
Mean PPL(Q)/PPL(base) : 1.010275 ± 0.000546
Mean PPL(Q)-PPL(base) : 0.019442 ± 0.001035
====== KL divergence statistics ======
Mean KLD: 0.017023 ± 0.000119
Maximum KLD: 2.310791
99.9% KLD: 0.477511
99.0% KLD: 0.204605
95.0% KLD: 0.081936
90.0% KLD: 0.046680
Median KLD: 0.001903
10.0% KLD: 0.000006
5.0% KLD: 0.000002
1.0% KLD: -0.000001
0.1% KLD: -0.000007
Minimum KLD: -0.000316
You're looking for Mean PPL(Q) and Mean KLD for comparison.
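To compare runs without copying numbers by hand, the statistics block can be parsed with a short script. The sketch below is a rough illustration, not part of llama.cpp: the function names are made up, and the regular expression assumes the `Label : value ± error` layout shown above. It also converts `Mean PPL(Q)/PPL(base)` into a percentage (ratio minus one), which appears to be how the percentage column in the tables in this thread is derived.

```python
import re

# A few lines copied from the sample output above.
SAMPLE = """\
Mean PPL(Q) : 1.911601 ± 0.007329
Mean PPL(base) : 1.892159 ± 0.007192
Mean PPL(Q)/PPL(base) : 1.010275 ± 0.000546
Mean KLD: 0.017023 ± 0.000119
"""

# Matches lines of the form "Mean <label> : <value> ± <error>".
STAT_LINE = re.compile(
    r"^\s*(Mean [^:]*?)\s*:\s*(-?\d+\.\d+)\s*±\s*(\d+\.\d+)", re.M
)


def parse_stats(text: str) -> dict[str, tuple[float, float]]:
    """Return {label: (value, error)} for every 'Mean ...' line found."""
    return {
        m.group(1): (float(m.group(2)), float(m.group(3)))
        for m in STAT_LINE.finditer(text)
    }


def excess_percent(stats: dict[str, tuple[float, float]]) -> str:
    """Express Mean PPL(Q)/PPL(base) as a signed percentage over the base."""
    ratio, _ = stats["Mean PPL(Q)/PPL(base)"]
    return f"{(ratio - 1.0) * 100.0:+.4f}%"


if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    print(stats["Mean PPL(Q)"], stats["Mean KLD"], excess_percent(stats))
```

Piping the real `llama-perplexity` output into `parse_stats` should work the same way as long as the labels keep this layout.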
If the issue is with the conversion, should I try to convert the BF16 again using an old llama.cpp version, recreate the quants and then run perplexity + kld against the logits?
I think that would be a fair first test, convert to BF16 on the old llama.cpp and on the new llama.cpp and double check those are identical (or close) and quantize to IQ4_XS on each using the old and new versions then compare those. I don't think they would end up bit-identical but I'm sure you could have a frontier LLM whip up a tensor comparison script to tell you the % difference on the raw weights?
That would rule out the convert-side issue at least. If that checks out fine, then it's an inference-time issue and that would mean git-bisecting to find which commit produces the PPL/KLD difference.
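A tensor comparison of the kind suggested above could look roughly like this for the two BF16 conversions. It is a minimal sketch under stated assumptions: each checkpoint is assumed to be already loaded into a dict mapping tensor name to raw little-endian BF16 bytes (reading those bytes out of the GGUF files is not shown), and the relative L2 difference is just one reasonable choice of "% difference". All function names here are hypothetical. It does not apply to the quantized IQ4_XS tensors, whose block formats would need dequantizing first.

```python
import math
import struct


def bf16_bytes_to_floats(raw: bytes) -> list[float]:
    """Decode little-endian BF16 values.

    A BF16 value is the upper 16 bits of an IEEE float32, so shifting it
    left by 16 bits and reinterpreting as float32 recovers the number.
    """
    count = len(raw) // 2
    halves = struct.unpack(f"<{count}H", raw[: count * 2])
    return [struct.unpack("<f", struct.pack("<I", h << 16))[0] for h in halves]


def relative_diff_percent(a: bytes, b: bytes) -> float:
    """Relative L2 difference ||a - b|| / ||a|| between two tensors, in percent."""
    if len(a) != len(b):
        raise ValueError("tensor sizes differ")
    xs, ys = bf16_bytes_to_floats(a), bf16_bytes_to_floats(b)
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))
    den = math.sqrt(sum(x * x for x in xs))
    if den == 0.0:
        return 0.0 if num == 0.0 else math.inf
    return 100.0 * num / den


def compare_checkpoints(
    old: dict[str, bytes], new: dict[str, bytes]
) -> dict[str, float]:
    """Per-tensor % difference for every tensor name present in both dicts."""
    return {
        name: relative_diff_percent(old[name], new[name])
        for name in sorted(old.keys() & new.keys())
    }
```

Tensors that appear in only one of the two conversions are skipped here; listing `old.keys() ^ new.keys()` separately would flag those.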
I did the conversion + quantization (Q8_0 / IQ3_S / IQ3_S / IQ4_XS recipe) using an old version with the Step 3.7 Flash PR backported (minus the vision part) and your imatrix. The result is similar to your recent quants.
I used Step 3.7 to avoid downloading the 3.5 original tensors, but should probably do it next.
One thing I'm curious about: if the problem is in the initial GGUF conversion, wouldn't I need reference logits and an imatrix generated from the original BF16 conversion?
PPL would be independent of the logit set used, at least, but for KLD, yeah, you'd have to have the originals again.
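The reason for that asymmetry follows from the usual textbook definitions, written here for a corpus of N tokens and vocabulary V. These are the standard forms, given as an aid to the discussion rather than a transcription of llama.cpp's code:

```latex
\mathrm{PPL}(Q) = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N} \ln p_Q(x_i \mid x_{<i})\Big)

\mathrm{KLD}(\mathrm{base} \,\|\, Q) = \frac{1}{N}\sum_{i=1}^{N}\sum_{v \in V}
  p_{\mathrm{base}}(v \mid x_{<i}) \,\ln\frac{p_{\mathrm{base}}(v \mid x_{<i})}{p_Q(v \mid x_{<i})}
```

PPL(Q) only involves the quantized model's own probability for each actual next token, so the corpus alone is enough. KLD needs the base model's full distribution at every position, which is what the saved reference logits provide. If the BF16 conversion itself were faulty, logits generated from it would carry the same fault.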
Wouldn't it make sense to test this by:
- Using an up-to-date version of llama.cpp
- Regenerating the imatrix for Step 3.5 Flash (so the old version)
- Regenerating the same IQ4_XS recipe for 3.5 Flash (so again, the old version)
If the PPL is still fine, then we know llama.cpp has not regressed. If it has, then we can start bisecting.
Just to confirm: Step Flash 3.7 is using a new imatrix generated with the 3.7 model, right? Want to make sure you didn't accidentally reuse the 3.5 imatrix to make the 3.7 quants.
I made a new imatrix for each of them, correct. I haven't taken the time to re-test this yet but I'll try to over the weekend.
Did you manage to retest by any chance? No worries if you haven't had time yet, but I was just curious :)
No, I haven't yet. I've had other projects consuming my GPU time and this hasn't been an "it's totally broken across every quantization"-level of issue so it's a bit backburnered at the moment. I will review it but no guarantee on the timeframe.
There's a possibility that this could just have been incorrect perplexity/kld measurement the first time.
The issue I mentioned here was mostly caused by this, and opting out of the autoparser (patch in the GH issue) for Step 3.x makes it behave very similarly to the official API, so the model weights look fine.
