PPL(Q)/PPL(base) and KLD seem significantly worse than Step 3.5 Flash

by tarruda - opened about 1 month ago

Discussion

tarruda

about 1 month ago

•

edited 27 days ago

For your IQ4_XS quants of each model:

Model	PPL(Q)/PPL(base)	KLD
Step 3.5 Flash	+2.5984%	0.042753 ± 0.000301
Step 3.5 Flash - 2026-06-03	+11.1222%	0.127994 ± 0.000696
Step 3.7 Flash	+15.6194%	0.159934 ± 0.000946

It is surprising since they have the same architecture.

I also can reproduce infinite reasoning loops with the GGUFs (both unsloth IQ4_XS and stepfun Q4_K_S).

AesSedai

Owner about 1 month ago

The IQ4_XS recipe for both models was the same, so IMO it's more up to the training the model underwent since then? I also need to re-convert these to pick up the tokenizer fix but I've been waiting to see if the Step-3.5-Flash MTP PR merges soon because I'd rather do both at once.

LagOps

27 days ago

is the issue still there after the reupload? no matter what they did with the training data, IQ4_XS shouldn't be massively worse than Q4_K.

tarruda

27 days ago

is the issue still there after the reupload?

Haven't tried

I feel that there's something wrong with Step 3.7 on llama.cpp. It is very easy for it to get stuck in reasoning loops at > 30k or so context. The same tests using the official stepfun API work fine, so it is either a bug in the llama.cpp implementation or quantization simply impacts it much more than previous versions.

AesSedai

Owner 27 days ago

@LagOps there aren't issues AFAIK, but I haven't really played with the model outside of quantizing it and validating the PPL/KLD aren't insane so YMMV? /shrug

tarruda

27 days ago

@AesSedai I've updated the table to include the new perplexity/kld values from the newly updated 3.5 Flash. This is looking more like some llama.cpp regression.

AesSedai

Owner 26 days ago

•

edited 26 days ago

I did use a different imatrix corpus for the old quants, but surely that's not that big of an impact? Hmm.

Edit 1:
Testing the other imatrix on Step-3.5-Flash to see if that reproduces the lower PPL / KLD.

Edit 2:
Validated that it isn't the imatrix corpus at least:

Quant	Size	Mixture	PPL	Mean PPL(Q)/PPL(base) - 1	KLD
IQ4_XS	91.31 GiB (3.93 BPW)	Q8_0 / IQ3_S / IQ3_S / IQ4_XS	2.685042 ± 0.012863	+11.1222%	0.127994 ± 0.000696
IQ4_XS_ED	91.31 GiB (3.93 BPW)	Q8_0 / IQ3_S / IQ3_S / IQ4_XS	2.684072 ± 0.012842	+11.0821%	0.128012 ± 0.000697

tarruda

26 days ago

@aessedai I know that this is not scientific, but I tried your old Q4_K_M quant for Step 3.5 Flash (which I had cached locally) and it seems solid: I couldn't reproduce any infinite reasoning loop with it (I still haven't tried the new one, though).

In the Step 3.7 PR there were RoPE changes: https://github.com/ggml-org/llama.cpp/pull/23845/changes#diff-f35bc90601d40b7d19e7abd866ba0e45ce96fa13211399d8ef8c3577b7329142 Could this be something that potentially affects quants in long contexts?

@ubergarm's IQ4_XS quant had similar perplexity to your previous one at ~+2.5%: https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF#iq4_xs-10053-gib-438-bpw

Hard to say what could factor into this change, but maybe it is worth re-running the perplexity/kld test on an old llama.cpp version?

AesSedai

Owner 26 days ago

I did the conversion after the RoPE changes, so that's integrated into my 3.7 quant.

I'm testing a couple more things out, I had a couple of local tweaks that I'm trying to rule out as the cause of the discrepancy now. Worst case I'll try testing some older llama versions, yeah.

AesSedai

Owner 26 days ago

Okay, I double-checked a completely clean and recompiled llama.cpp without my tweaks and it still looks the same (this makes me feel better about the changes I haven't PR'd yet)😀

Quant	Size	Mixture	PPL	Mean PPL(Q)/PPL(base) - 1	KLD
IQ4_XS	91.31 GiB (3.93 BPW)	Q8_0 / IQ3_S / IQ3_S / IQ4_XS	2.685042 ± 0.012863	+11.1222%	0.127994 ± 0.000696
IQ4_XS_ED	91.31 GiB (3.93 BPW)	Q8_0 / IQ3_S / IQ3_S / IQ4_XS	2.684072 ± 0.012842	+11.0821%	0.128012 ± 0.000697
IQ4_XS_UPSTREAM	91.31 GiB (3.93 BPW)	Q8_0 / IQ3_S / IQ3_S / IQ4_XS	2.685042 ± 0.012863	+11.1316%	0.128100 ± 0.000697

Which, yes, at this point inclines me to think there has been a regression somewhere: the file size and BPW are identical, so I know the recipe hasn't drifted.

tarruda

26 days ago

In case you want to run git bisect, a good commit would be 492bc319782b1f13f302911f4c73437382cc8bb9, which is probably the version I used when I ran lm-evaluation-harness here: https://huggingface.co/AesSedai/Step-3.5-Flash-GGUF/discussions/3

tarruda

26 days ago

I started testing older llama.cpp versions and can almost guarantee that 3dadc88b589ca43b8fca0e1beb22d4b78a09b4dd doesn't have the issue. At least I haven't been able to spot the repeated patterns I see on llama.cpp master, and the behavior looks a lot more like what I see when using stepfun API.

Doing a git bisect now. It will take a while because I have to run the agentic loop 3-4 times in each step to confirm if it is a good/bad commit, but should be done by tomorrow.
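A bisect that needs several runs per step can be handed to `git bisect run`, which treats exit code 0 as good and 1 as bad. The following is only a minimal sketch of that idea, not the procedure actually used in this thread: the loop heuristic (the last 200 characters of the output recurring at least 3 times), the "any looping run marks the commit bad" policy, and the command passed on the command line are all assumptions for illustration.

```python
import subprocess
import sys

def looks_like_loop(text, tail_len=200, min_repeats=3):
    """Crude heuristic (an assumption): the final tail_len characters recur min_repeats+ times."""
    if len(text) < tail_len * min_repeats:
        return False
    tail = text[-tail_len:]
    # str.count counts non-overlapping occurrences, including the tail itself.
    return text.count(tail) >= min_repeats

def bisect_exit_code(outputs):
    """Map several run transcripts to a `git bisect run` exit code: 0 = good, 1 = bad."""
    return 1 if any(looks_like_loop(out) for out in outputs) else 0

def main(cmd, runs=4):
    """Run the (hypothetical) test command several times and classify the commit."""
    outputs = [
        subprocess.run(cmd, capture_output=True, text=True).stdout
        for _ in range(runs)
    ]
    return bisect_exit_code(outputs)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: git bisect run python check_loops.py ./run_agent_test.sh
    # (both file names are placeholders; the command must rebuild llama.cpp first)
    sys.exit(main(sys.argv[1:]))
```

Marking a commit bad on a single looping run keeps false "good" verdicts rare, at the cost of being sensitive to a noisy heuristic; a majority vote would be the obvious alternative.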

tarruda

26 days ago

I found the cause of the repeated reasoning patterns I was seeing that made it behave differently than the Stepfun API, bisect showed this as the first bad commit: https://github.com/ggml-org/llama.cpp/pull/18675

Doesn't seem like that PR changed anything that could cause differences in perplexity, so that might be a separate regression in llama.cpp.

tarruda

26 days ago

I reported the autoparser bug: https://github.com/ggml-org/llama.cpp/issues/24181

I can also try to bisect the perplexity/kld loss if you teach me how to run it locally. Never really ran any perplexity kld tests, not even sure if I can with 128G RAM.

AesSedai

Owner 25 days ago

•

edited 25 days ago

I'm concerned that it might be an issue with the convert itself instead of being an inference-time issue too.

I've uploaded the reference logits here: https://huggingface.co/datasets/AesSedai/reference-logits/blob/main/Step-3.7-Flash-BF16-512ctx-wiki.test.raw.bin

The corpus is just the wiki.test.raw, also uploaded there: https://huggingface.co/datasets/AesSedai/reference-logits/blob/main/wiki.test.raw

Running a PPL test is pretty simple and doesn't require the reference logits; running KLD does require them. They're about the same effort and it's easy to do both in one command, e.g.:

./llama.cpp/build/bin/llama-perplexity \
  --threads 48 --flash-attn on -lv 4 \
  --batch-size 8192 --ubatch-size 8192 \
  --file /mnt/srv/host/resources/KLD/wiki.test.raw \
  --kl-divergence-base /mnt/srv/snowdrift/ref-logits/Step-3.7-Flash-BF16-512ctx-wiki.test.raw.bin --kl-divergence \
  --model /mnt/srv/snowdrift/gguf/Step-3.7-Flash-GGUF/aes_sedai/Step-3.7-Flash-Q5_K_M.gguf

You can adjust the threads and batch/ubatch size to suit your needs; that won't change the result. Point --file to the wiki.test.raw file, --kl-divergence-base to the reference logits, and --model to your quantized model. Just don't add or adjust any --ctx-size parameter, since 512 is the default. That will output a block at the end similar to this:

====== Perplexity statistics ======
Mean PPL(Q)                   :   1.911601 ±   0.007329
Mean PPL(base)                :   1.892159 ±   0.007192
Cor(ln(PPL(Q)), ln(PPL(base))):  99.00%
Mean ln(PPL(Q)/PPL(base))     :   0.010223 ±   0.000540
Mean PPL(Q)/PPL(base)         :   1.010275 ±   0.000546
Mean PPL(Q)-PPL(base)         :   0.019442 ±   0.001035

====== KL divergence statistics ======
Mean    KLD:   0.017023 ±   0.000119
Maximum KLD:   2.310791
99.9%   KLD:   0.477511
99.0%   KLD:   0.204605
95.0%   KLD:   0.081936
90.0%   KLD:   0.046680
Median  KLD:   0.001903
10.0%   KLD:   0.000006
 5.0%   KLD:   0.000002
 1.0%   KLD:  -0.000001
 0.1%   KLD:  -0.000007
Minimum KLD:  -0.000316

You're looking for Mean PPL(Q) and Mean KLD for comparison.
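For comparing many runs, the numbers can be pulled out of that block with a few lines of Python. This is a rough sketch, assuming the output keeps the "label : value ± uncertainty" layout shown above; `parse_stats` and `ppl_increase_percent` are made-up helper names, and the percentage is assumed to be the ratio minus one, which is how the percent column in the tables in this thread appears to be derived.

```python
import re

# Sample lines copied from the statistics block above.
SAMPLE = """\
Mean PPL(Q)                   :   1.911601 ±   0.007329
Mean PPL(base)                :   1.892159 ±   0.007192
Mean PPL(Q)/PPL(base)         :   1.010275 ±   0.000546
Mean    KLD:   0.017023 ±   0.000119
"""

def parse_stats(text):
    """Return {label: (value, uncertainty)} for every 'label : x ± y' line."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*:\s*(-?\d+\.\d+)\s*±\s*(\d+\.\d+)", line)
        if m:
            # Collapse padding so 'Mean    KLD' becomes 'Mean KLD'.
            label = " ".join(m.group(1).split())
            stats[label] = (float(m.group(2)), float(m.group(3)))
    return stats

def ppl_increase_percent(stats):
    """Relative PPL increase of the quant over the base model, in percent."""
    ratio, _ = stats["Mean PPL(Q)/PPL(base)"]
    return (ratio - 1.0) * 100.0

stats = parse_stats(SAMPLE)
print(f"PPL increase: {ppl_increase_percent(stats):+.4f}%")
print(f"Mean KLD:     {stats['Mean KLD'][0]:.6f}")
```

Lines without a ± term, such as the correlation and the KLD percentiles, are simply skipped by this parser.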

tarruda

25 days ago

If the issue is with the conversion, should I try to convert the BF16 again using an old llama.cpp version, recreate the quants and then run perplexity + kld against the logits?

AesSedai

Owner 25 days ago

I think that would be a fair first test, convert to BF16 on the old llama.cpp and on the new llama.cpp and double check those are identical (or close) and quantize to IQ4_XS on each using the old and new versions then compare those. I don't think they would end up bit-identical but I'm sure you could have a frontier LLM whip up a tensor comparison script to tell you the % difference on the raw weights?
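Such a comparison script could look roughly like the sketch below. It assumes the tensors from both conversions have already been loaded and dequantized into plain sequences of floats keyed by tensor name; the loading step is deliberately left out, and the helper names and the example tensor name are hypothetical. The metric is the relative L2 difference in percent, which is one reasonable choice among several.

```python
import math

def relative_diff_percent(a, b):
    """Relative L2 difference between two equal-length float sequences, in percent."""
    if len(a) != len(b):
        raise ValueError("tensor sizes differ")
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    norm = math.sqrt(sum(x * x for x in a))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return 100.0 * diff / norm

def compare_tensors(old, new):
    """Compare two {name: values} dicts.

    Returns (per-tensor % difference for shared names, names present on one side only).
    """
    shared = sorted(old.keys() & new.keys())
    report = {name: relative_diff_percent(old[name], new[name]) for name in shared}
    only_one_side = sorted(old.keys() ^ new.keys())
    return report, only_one_side

# Toy data standing in for weights from the old and new conversions.
old = {"blk.0.attn_q.weight": [3.0, 4.0], "output.weight": [1.0, 2.0]}
new = {"blk.0.attn_q.weight": [3.0, 4.5], "output.weight": [1.0, 2.0]}
report, missing = compare_tensors(old, new)
for name, pct in report.items():
    print(f"{name}: {pct:.4f}% difference")
print("present in only one file:", missing)
```

For real models the same logic would normally be written with an array library and applied tensor by tensor to keep memory use bounded.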

That would rule out the convert-side issue at least. If that checks out fine, then it's an inference-time issue and that would mean git-bisecting to find which commit produces the PPL/KLD difference.

tarruda

25 days ago

I did the conversion + quantization (Q8_0 / IQ3_S / IQ3_S / IQ4_XS recipe) using an old version with a backport of the Step 3.7 Flash PR (minus the vision part) and your imatrix. The result is similar to your recent quants:

I used Step 3.7 to avoid downloading the 3.5 original tensors, but should probably do it next.

One thing I'm curious about: if the problem is in the initial GGUF conversion, wouldn't I need to use reference logits and an imatrix generated from the original BF16 conversion?

AesSedai

Owner 22 days ago

PPL would be independent of the logit set used, at least, but for KLD, yeah, you'd have to have the originals again.
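The reason is visible in the textbook definitions: perplexity only uses the probability the tested model assigns to the actual next token, while KL divergence compares the tested model's whole distribution against the base model's at every position. The toy sketch below illustrates that with hand-written probabilities; it follows the standard formulas and is not a copy of how llama-perplexity computes them internally.

```python
import math

def perplexity(quant_probs, tokens):
    """PPL = exp(-mean log q(token)); needs only the tested model's probabilities."""
    nll = -sum(math.log(q[t]) for q, t in zip(quant_probs, tokens)) / len(tokens)
    return math.exp(nll)

def mean_kld(base_probs, quant_probs):
    """Mean over positions of KL(p_base || q_quant); needs the base distributions."""
    total = 0.0
    for p, q in zip(base_probs, quant_probs):
        total += sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0.0)
    return total / len(base_probs)

# Toy example: 2 positions, vocabulary of 3 tokens (made-up numbers).
tokens = [0, 2]                                    # the actual next tokens
base = [[0.70, 0.20, 0.10], [0.10, 0.10, 0.80]]    # reference (BF16) model
quant = [[0.60, 0.25, 0.15], [0.15, 0.15, 0.70]]   # quantized model

print(f"PPL(Q)   = {perplexity(quant, tokens):.4f}")   # no reference needed
print(f"mean KLD = {mean_kld(base, quant):.4f}")       # changes if `base` changes
```

Swapping in a different `base` leaves the PPL(Q) line untouched but changes the KLD, which is why a new BF16 conversion calls for regenerated reference logits.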

Mushoz

21 days ago

Wouldn't it make sense to test this by:

1. Using an up-to-date version of llama.cpp
2. Regenerating the imatrix for Step 3.5 Flash (so the old version)
3. Regenerating the same IQ4_XS recipe for 3.5 Flash (so again, the old version)

If the PPL is still fine, then we know llama.cpp has not regressed. If it has, then we can start bisecting.

Just to confirm: Step 3.7 Flash is using a new imatrix generated with the 3.7 model, right? I want to make sure you didn't accidentally reuse the 3.5 imatrix to make the 3.7 quants.

AesSedai

Owner 20 days ago

I made a new imatrix for each of them, correct. I haven't taken the time to re-test this yet but I'll try to over the weekend.

Mushoz

11 days ago

Did you manage to retest by any chance? No worries if you haven't had time yet, but I was just curious :)

AesSedai

Owner 7 days ago

•

edited 7 days ago

No, I haven't yet. I've had other projects consuming my GPU time and this hasn't been an "it's totally broken across every quantization"-level of issue so it's a bit backburnered at the moment. I will review it but no guarantee on the timeframe.

tarruda

7 days ago

•

edited 7 days ago

There's a possibility that this could just have been incorrect perplexity/kld measurement the first time.

The issue I mentioned here was mostly caused by this, and opting out of the autoparser (patch in the GH issue) for Step 3.x makes it behave very similarly to the official API, so the model weights look fine.
