How to use JohnTdi/Bielik-Minitron-Fit-6B with libraries and local apps.

Libraries

How to use JohnTdi/Bielik-Minitron-Fit-6B with llama-cpp-python:

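# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Downloads the GGUF file from the Hugging Face Hub on first use (needs huggingface-hub).
llm = Llama.from_pretrained(
    repo_id="JohnTdi/Bielik-Minitron-Fit-6B",
    filename="Bielik-Minitron-Fit-6B-Q4_K_M.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.1,  # this model card recommends a temperature of 0-0.2
)
print(response["choices"][0]["message"]["content"])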

Local Apps

llama.cpp

How to use JohnTdi/Bielik-Minitron-Fit-6B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M

Use Docker

docker model run hf.co/JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M


vLLM

How to use JohnTdi/Bielik-Minitron-Fit-6B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "JohnTdi/Bielik-Minitron-Fit-6B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JohnTdi/Bielik-Minitron-Fit-6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Ollama
How to use JohnTdi/Bielik-Minitron-Fit-6B with Ollama:
```
ollama run hf.co/JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M
```

Unsloth Studio

How to use JohnTdi/Bielik-Minitron-Fit-6B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for JohnTdi/Bielik-Minitron-Fit-6B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for JohnTdi/Bielik-Minitron-Fit-6B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for JohnTdi/Bielik-Minitron-Fit-6B to start chatting

Docker Model Runner
How to use JohnTdi/Bielik-Minitron-Fit-6B with Docker Model Runner:
```
docker model run hf.co/JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M
```

Lemonade

How to use JohnTdi/Bielik-Minitron-Fit-6B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull JohnTdi/Bielik-Minitron-Fit-6B:Q4_K_M

Run and chat with the model

lemonade run user.Bielik-Minitron-Fit-6B-Q4_K_M

List all available models

lemonade list

Bielik-Minitron Fit 6B

Depth-pruned (40L → 32L, 6.03B) variant of Bielik-Minitron-7B-v3.0-Instruct, recovered with KD + R-Tuning + LoRA souping. Smaller and ~20–25% faster, intended for short tasks in Polish. An experimental/educational project.

Files

file	size	use
`Bielik-Minitron-Fit-6B-Q4_K_M.gguf`	3.4 GiB	smaller, faster decode; for weaker hardware
`Bielik-Minitron-Fit-6B-Q8_0.gguf`	6.0 GiB	higher quality

Usage (llama.cpp)

llama-cli -m Bielik-Minitron-Fit-6B-Q4_K_M.gguf -p "Kto wygrał mistrzostwa świata w piłce nożnej w 1998 roku?" --temp 0.1
# or run a server:
llama-server -m Bielik-Minitron-Fit-6B-Q8_0.gguf -c 4096 --temp 0.1

Chat template: ChatML (<|im_start|>role … <|im_end|>). Recommended temperature: 0–0.2.
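For anyone sending raw completion prompts instead of using a chat endpoint, the ChatML layout named above can be sketched as follows. This is a minimal illustration: the function name `build_chatml_prompt` and the example system prompt are hypothetical, and the exact whitespace in the template stored inside the GGUF file may differ, so prefer the built-in template whenever the runtime applies it for you.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical example messages (not taken from the model card).
prompt = build_chatml_prompt([
    {"role": "system", "content": "Odpowiadaj krótko."},
    {"role": "user", "content": "Jaka jest stolica Polski?"},
])
print(prompt)
```

Generation should stop at `<|im_end|>`, which closes the assistant turn.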

License

Derived from speakleash/Bielik-Minitron-7B-v3.0-Instruct and subject to the base model's license (check it at the source before any commercial use).

Bielik-fit 6B Instruct: depth-pruned recovery (experiment)

For those who don't like reading much: a smaller Bielik, roughly 20–25% faster, that fits well on slower hardware. Best used at temp 0–0.2. Weaker at long reasoning, comparable on quick tasks.

Full version: This project was mainly educational for me. I wanted to get to know Bielik's architecture from the inside and try out pruning along the way.

The starting point is hard mode: Bielik-Minitron-7B is itself already compressed from an 11B model, so I was cutting a model that had already been through compression once. A breakneck task, and I admit it cost me some nerves.

I took the 40-layer parent (7.48B), cut 8 layers → 32L (−20% depth), and partially recovered quality. Fewer layers = less compute per token = faster decode on weaker hardware. The final model has 6.03B parameters.

Method

1. Measuring and selecting layers to cut (block-influence). I scored every layer with an influence metric: the angle between the block's input and output (cosine input↔output) and the NLL increase under "identity-skip" (skipping the layer). I picked the lowest-influence layers in the mid-to-late region:

CUT = [11, 12, 13, 14, 20, 21, 22, 25]   # 8 of 40 → 32L

I also ran a gradient-saliency probe (per module, per layer) for a possible targeted LoRA. A practical lesson: a 30-sample probe can lie compared with the full lm-eval (−25pp), so I validated cut decisions with the full benchmark.
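The cosine half of the block-influence score can be sketched as below. This is an illustration under assumptions, not the author's script: the helper names and the toy hidden states are hypothetical, the score is taken as the mean of 1 − cosine over tokens, and the second signal (NLL increase under identity-skip) is not modeled here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_influence(inputs, outputs):
    """Mean (1 - cosine) between a block's input and output hidden states over tokens."""
    scores = [1.0 - cosine(x, y) for x, y in zip(inputs, outputs)]
    return sum(scores) / len(scores)


def least_influential(per_layer_states, k):
    """per_layer_states: {layer_index: (inputs, outputs)}; returns the k lowest-scoring layers."""
    scored = {layer: block_influence(x, y) for layer, (x, y) in per_layer_states.items()}
    return sorted(sorted(scored, key=scored.get)[:k])


# Toy hidden states (2 tokens, 3 dims): illustrative values, not real activations.
identity_in = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
states = {
    0: (identity_in, [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]),  # output orthogonal to input
    1: (identity_in, [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]]),  # output almost equals input
    2: (identity_in, [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]),  # moderate change
}
print(least_influential(states, k=1))
```

A block whose output points in nearly the same direction as its input changes the residual stream little, which is why low scores mark candidates for removal.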

2. Training to integrate the "stitched" layers (knowledge distillation). After layers are cut out, the model no longer hangs together. I recovered it with online full-vocab KD: KL(teacher ‖ student) + CE, with the untouched 40L parent as the teacher. Batches were padded. Top-k KD produced gibberish, so I used the full distribution. In plain terms: the untouched big model (the teacher) shows the trimmed one (the student), on the fly, what answer distribution it should produce, and the student learns to imitate it. That "stitches" the model back together where the layers were cut out and recovers most of what the cut lost, without training from scratch.
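The per-token objective KL(teacher ‖ student) + CE can be written out in a few lines. The sketch below is plain Python for readability rather than a training-ready implementation; the weights `kl_weight` and `ce_weight` are assumptions (the text gives no weighting), as are the toy logits.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kd_loss(student_logits, teacher_logits, target_index, kl_weight=1.0, ce_weight=1.0):
    """kl_weight * KL(teacher || student) + ce_weight * CE for one token position."""
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    # Full-vocabulary KL: sum over every vocabulary entry, not only the teacher's top-k.
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student))
    # Cross-entropy against the ground-truth next token.
    ce = -log_p_student[target_index]
    return kl_weight * kl + ce_weight * ce


# Toy logits over a 3-token vocabulary (illustrative values only).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
print(kd_loss(student, teacher, target_index=0))
```

With padded batches, a loss like this would be averaged over real tokens only, with padding positions masked out.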

3. Anti-hallucination training (R-Tuning, closed-book abstention). Grounded abstention (PoQuAD with context) does NOT work: it teaches "is it in the text", not "do I know". R-Tuning works: I probe this checkpoint on questions with gold answers → correct Q→gold, wrong Q→"I don't know". Labels are per-checkpoint, ~50:50 known/unknown, with a low share in the mix (>12% → over-refusal). The effect: the model says "I don't know" where it previously made things up confidently. In plain terms: first I check what the model really doesn't know (I ask it questions whose answers I know and see where it goes wrong), then I teach it to say "I don't know" in those places instead of confidently making things up. The dose needs care: with too many such examples the model starts refusing even things it knows.
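The per-checkpoint labelling step can be sketched like this. Everything named here is hypothetical: `stub_model` stands in for querying the real checkpoint, exact string match stands in for whatever correctness check was actually used, and the 12% cap is read off the ">12% → over-refusal" remark.

```python
import random


def build_rtuning_examples(questions, model_answer, refusal="nie wiem", seed=0):
    """Label each question by whether the current checkpoint already answers it correctly."""
    known, unknown = [], []
    for question, gold in questions:
        prediction = model_answer(question)
        if prediction.strip().lower() == gold.strip().lower():
            known.append({"prompt": question, "target": gold})
        else:
            unknown.append({"prompt": question, "target": refusal})
    # Balance to 50:50 known/unknown by downsampling the larger group.
    rng = random.Random(seed)
    n = min(len(known), len(unknown))
    return rng.sample(known, n) + rng.sample(unknown, n)


def max_rtuning_count(n_other_examples, max_share=0.12):
    """Largest R-Tuning example count that keeps their share of the mix at or below max_share."""
    return int(max_share * n_other_examples / (1.0 - max_share))


def stub_model(question):
    # Hypothetical stand-in for the checkpoint: one right answer, one wrong.
    return {"Stolica Polski?": "Warszawa", "Rok chrztu Polski?": "996"}.get(question, "")


examples = build_rtuning_examples(
    [("Stolica Polski?", "Warszawa"), ("Rok chrztu Polski?", "966")],
    stub_model,
)
print(examples)
```

Because the labels depend on what the current checkpoint gets wrong, they have to be regenerated whenever the checkpoint changes.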

4. LoRA souping. After this kind of cut the model was very sensitive to training: in several runs it gained on one task at the cost of another, so LoRA souping between training runs was a lifesaver. In plain terms: LoRA souping averages the weights from several separate training runs into one model. Instead of choosing "good at one thing or good at the other", I take the average and often end up with a model that is good at both.
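One way to realise this averaging for a single weight matrix is sketched below. It rests on assumptions the text does not settle: uniform weights, and averaging the dense updates scale·B·A instead of the A and B factors separately. All names and the toy matrices are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(adapter):
    """Dense update of one adapter: scale * B @ A."""
    product = matmul(adapter["B"], adapter["A"])
    return [[adapter["scale"] * v for v in row] for row in product]


def soup(base_weight, adapters, weights=None):
    """Add the weighted mean of several LoRA updates to one base weight matrix."""
    weights = weights or [1.0 / len(adapters)] * len(adapters)
    merged = [row[:] for row in base_weight]
    for adapter, w in zip(adapters, weights):
        delta = lora_delta(adapter)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += w * v
    return merged


# Two toy rank-1 adapters on a 2x2 weight (illustrative values only).
base = [[0.0, 0.0], [0.0, 0.0]]
adapter_a = {"B": [[1.0], [0.0]], "A": [[2.0, 0.0]], "scale": 1.0}
adapter_b = {"B": [[0.0], [1.0]], "A": [[0.0, 4.0]], "scale": 1.0}
merged = soup(base, [adapter_a, adapter_b])
print(merged)
```

Averaging the dense updates keeps the result equal to the mean of the individual fine-tuned weights, which averaging A and B factor by factor does not guarantee.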

Results: Open PL LLM Leaderboard (Q8_0)

task	Minitron 7B (parent)	Bielik-fit 32L
psc	95.73	94.34
ppc	75.60	70.70
dyk	73.37	85.42
belebele	87.33	76.67
8tags	78.29	79.69
polemo2	77.15	90.44
AVG	81.25	82.88

The numbers look quite good, and I have to say honestly that this is partly a result of how the model was trained. All mixes and scripts are available on my Hugging Face profile. I checked and filtered the data several times for benchmark leaks (n-gram + embedding-cosine decontamination against the test splits; 0 verbatim leaks on all tasks).
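The n-gram half of that decontamination can be sketched as follows. The n-gram length, the whitespace tokenisation and the function names are assumptions for illustration, and the embedding-cosine pass mentioned above is not covered here.

```python
def ngrams(text, n):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_texts, test_texts, n=8):
    """Drop every training sample that shares at least one word n-gram with a test item."""
    test_grams = set()
    for test in test_texts:
        test_grams |= ngrams(test, n)
    return [t for t in train_texts if not (ngrams(t, n) & test_grams)]


# Toy data with a short n so the overlap is visible (illustrative only).
test_split = ["Ala ma kota i psa"]
train_mix = ["wczoraj Ala ma kota w domu", "Ola ma psa"]
clean = decontaminate(train_mix, test_split, n=3)
print(clean)
```

Exact n-gram matching only catches verbatim overlap; paraphrased leaks are the reason for pairing it with an embedding-similarity check.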

The devil is in the details

A high average ≠ "smarter model". The key signal is the regression on belebele (87 → 77, −10.7pp), which is directly tied to the layer reduction. belebele is a long passage plus multi-step reasoning, which is exactly what the cut layers provided. The 6B model is weaker at longer reasoning and long context; it is better suited to short tasks (classification, factual lookup, short answers). The high polemo2/dyk scores are an effect of further training on PL tasks, not of general superiority. The model was also checked only in Polish, although the training dataset contained some English. I expect a large regression in other languages, but my goal was a small, fast Polish Bielik.

English benchmarks (EN / math / code)

Generative, via the chat endpoint, identical setup for both models (n=120 for ARC/GSM8K, 164 for HumanEval):

benchmark	Bielik-fit 6B	Minitron 7B	diff
ARC-Challenge (EN reasoning)	61.7	81.7	−20.0
GSM8K (math)	50.0	69.2	−19.2
HumanEval pass@1 (code)	34.8	70.1	−35.3

Here fit-6B clearly falls behind, which confirms the whole narrative: cutting layers costs the most on deep reasoning, math and code (the same cause as the belebele regression), and on top of that the training was mostly Polish. So, as intended: fit-6B is a small, fast model for short tasks in Polish, not for reasoning, code or English.

Tool-calling and RAG (vs Minitron 7B)

capability	Bielik-fit 6B	Minitron 7B
Tool-calling (BFCL AST)	0.78	0.94
RAG grounded (EM)	0.60	0.69
RAG abstention (impossible)	0.20	0.53

Cutting layers preserved routing (function choice: 0 name errors) but cost argument precision and self-control (recognizing "I don't know"). Consistent with belebele: pruning cuts depth, not surface knowledge.

Speed: fit-6B vs Minitron 7B (R9700 / RADV Vulkan, gfx1201)

llama-bench, t/s. pp512 = prefill, tg128 = decode, @dN = at a context of N tokens.

Decode (tg128):

context	fit-6B Q4	Minitron 7B Q4	fit-6B Q8	Minitron 7B Q8
0	126	102	85	68
2048	116	94	80	65
8192	91	73	68	54

Prefill (pp512):

context	fit-6B Q4	Minitron 7B Q4	fit-6B Q8	Minitron 7B Q8
0	3940	3038	4726	3822
8192	1718	1345	1830	1456

Sizes: fit-6B Q4 3.39 GiB / Q8 5.97 GiB · Minitron 7B Q4 4.19 GiB / Q8 7.40 GiB.

At the same quant, fit-6B is ~24–30% faster than Minitron 7B (the effect of −20% layers / −19% parameters: 6.03B vs 7.48B). Q4 = faster decode (memory-bound), Q8 = faster prefill (compute-bound, coopmat). For interactive chat on a weaker GPU → Q4.
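The speedup figure can be recomputed directly from the two tables. The snippet below only restates the published tokens-per-second values and divides them; the variable names are illustrative. The ratios come out at roughly 23% to 30%, in line with the stated ~24–30%.

```python
# (fit-6B, Minitron 7B) tokens/s from the tables above, keyed by (quant, context).
decode_tps = {
    ("Q4", 0): (126, 102), ("Q4", 2048): (116, 94), ("Q4", 8192): (91, 73),
    ("Q8", 0): (85, 68), ("Q8", 2048): (80, 65), ("Q8", 8192): (68, 54),
}
prefill_tps = {
    ("Q4", 0): (3940, 3038), ("Q4", 8192): (1718, 1345),
    ("Q8", 0): (4726, 3822), ("Q8", 8192): (1830, 1456),
}


def speedup_percent(fit, parent):
    """How much faster fit-6B is than the parent, in percent."""
    return (fit / parent - 1.0) * 100.0


all_gains = []
for name, table in (("decode", decode_tps), ("prefill", prefill_tps)):
    gains = [speedup_percent(fit, parent) for fit, parent in table.values()]
    all_gains.extend(gains)
    print(f"{name}: {min(gains):.1f}% to {max(gains):.1f}% faster")
```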

Recommendation

The model works best at a temperature of 0–0.2. In that range it matches the full Minitron 7B factually on normal and hard knowledge about Poland (see the appendix), at −20% size and with ~25% faster decode. A compressed model has "thinner" confidence, so at high temperature sampling derails it more easily. For longer reasoning / long context, choose the full model.

The whole project is purely experimental and educational. Coding assistance: Claude Code. I take no responsibility for it. :)

Appendix: factual tests, fit-6B vs Minitron 7B across temperatures

Each question was asked at temperatures 0.0 / 0.2 / 0.4 / 0.6 / 0.8 / 1.0. OK = correct answer (or a correct abstention/correction), HAL/WRG = hallucination/error. Anti-hallucination system prompt, greedy seed. (Judge and question author: Opus 4.8; I take its word for it.)

Level 1: adversarial (traps, niche facts, false premises)

Q1  First cyclist with 5 Tour de France wins?
    fit-6B  OK  OK  OK  OK  HAL HAL      Minitron 7B  HAL HAL HAL HAL HAL HAL
Q2  Which temperature unit was introduced by a Polish scientist?
    fit-6B  OK  OK  OK  HAL HAL HAL      Minitron 7B  HAL HAL HAL HAL HAL HAL
Q3  Mayor of Pacanów on 14 March 1632?
    fit-6B  OK  OK  OK  HAL HAL HAL      Minitron 7B  HAL HAL HAL HAL HAL HAL
Q4  Name of Bolesław Chrobry's cat?
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
Q5  Why does Poland border Spain?
    fit-6B  HAL HAL HAL HAL HAL HAL      Minitron 7B  OK  OK  OK  OK  OK  OK
Q6  Is Warsaw the capital of France?
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
Q7  How many Oscars did Armageddon win?
    fit-6B  OK  OK  HAL OK  OK  HAL      Minitron 7B  HAL OK  OK  OK  OK  OK
Q8  When did Mickiewicz win an Oscar?
    fit-6B  OK  OK  OK  OK  OK  HAL      Minitron 7B  OK  OK  OK  OK  OK  OK
Q9  Why is the Moon bigger than the Sun?
    fit-6B  HAL HAL HAL HAL HAL HAL      Minitron 7B  OK  OK  OK  OK  OK  OK
Q10 In which year did Sobieski found Facebook?
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  HAL HAL HAL
    RESULT: fit-6B 37/60 (62%) | Minitron 7B 38/60 (63%)

Level 2: ordinary facts about Poland

Q1 First crowned king? (Chrobry)                   fit OK×6   Minitron 7B OK×6
Q2 Year of EU accession? (2004)                    fit OK×6   Minitron 7B OK×6
Q3 Highest peak? (Rysy)                            fit OK×6   Minitron 7B OK×6
Q4 Author of "Quo Vadis"? (Sienkiewicz)            fit OK×6   Minitron 7B OK×6
Q5 Year of the Battle of Grunwald? (1410)          fit OK×6   Minitron 7B OK×6
Q6 First Polish pope? (John Paul II)               fit OK×6   Minitron 7B OK×6
Q7 Sea to the north? (Baltic)                      fit OK×6   Minitron 7B OK×6
Q8 Commander at Vienna, 1683? (Sobieski)           fit OK×6   Minitron 7B OK×6
Q9 How many voivodeships? (16)                     fit OK×6   Minitron 7B OK×6
Q10 Composer of mazurkas/polonaises? (Chopin)      fit OK×6   Minitron 7B OK×6
    RESULT: fit-6B 60/60 (100%) | Minitron 7B 60/60 (100%)

Level 3: medium-difficulty facts about Poland

Q1 Year of the Constitution of 3 May? (1791)       fit OK×6   Minitron 7B OK×6
Q2 Last king of Poland? (Poniatowski)              fit OK×6   Minitron 7B OK×6
Q3 Author of "Lalka"? (Prus)                       fit OK×6   Minitron 7B OK×6
Q4 Literature Nobel laureate, 1996? (Szymborska)   fit OK×6   Minitron 7B OK×6
Q5 Year of the Third Partition? (1795)             fit OK×6   Minitron 7B OK×6
Q6 Second river after the Vistula? (Odra)          fit OK×6   Minitron 7B OK×6
Q7 Year of the Warsaw Uprising? (1944)             fit OK×6   Minitron 7B OK×6
Q8 First prime minister after 1989? (Mazowiecki)   fit OK×6   Minitron 7B OK×6
Q9 Year of the Union of Lublin? (1569)             fit OK×6   Minitron 7B OK×6
Q10 Painter of "Battle of Grunwald"? (Matejko)     fit OK×6   Minitron 7B OK×6
    RESULT: fit-6B 60/60 (100%) | Minitron 7B 60/60 (100%)

Level 4: hard facts about Poland (chronology, precise dates)

Q1  President before Wałęsa? (Jaruzelski)
    fit-6B  OK  OK  OK  OK  OK  WRG      Minitron 7B  OK  OK  OK  OK  OK  OK
Q2  President after Wałęsa, 1995? (Kwaśniewski)
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
Q3  First president of the Second Polish Republic? (Narutowicz)
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
Q4  Year of the Baptism of Poland? (966)
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
Q5  Commander of Westerplatte, 1939? (Sucharski)
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
Q6  Year of martial law? (1981)
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
Q7  1920 battle known as the "Miracle on the Vistula"? (Battle of Warsaw)
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
Q8  Mathematician behind normed spaces? (Banach)
    fit-6B  OK  OK  WRG WRG WRG WRG      Minitron 7B  OK  OK  OK  OK  OK  OK
Q9  Year of the Prussian Homage? (1525)
    fit-6B  OK  WRG WRG WRG WRG WRG      Minitron 7B  OK  WRG WRG WRG WRG WRG
Q10 Year of Piłsudski's death? (1935)
    fit-6B  OK  OK  OK  OK  OK  OK       Minitron 7B  OK  OK  OK  OK  OK  OK
    RESULT: fit-6B 50/60 (83%) | Minitron 7B 55/60 (92%)

═══════════════════════════════════════════════════════════════

🇬🇧 English version

Bielik-Minitron Fit 6B — depth-pruned recovery (experiment)

TL;DR: a smaller Bielik, roughly 20–25% faster, that fits on slower hardware. Best used at temp 0–0.2. Weaker at long-form reasoning, comparable on short tasks.

Full version: This was mainly an educational project — I wanted to understand Bielik's architecture from the inside and test pruning along the way.

Starting point on hard mode: Bielik-Minitron-7B is already compressed from an 11B model. So I was cutting a model that had already been compressed once. A tough task — and I'll admit it cost me some nerves.

I took the 40-layer parent (7.48B), cut 8 layers → 32L (−20% depth) and partially recovered quality. Fewer layers = less compute per token = faster decode on weaker hardware. Final model: 6.03B params.

Method

1. Layer selection for cutting (block-influence). I scored each layer by influence: input↔output cosine angle and the NLL increase under "identity-skip". I cut the mid-to-late region of lowest influence: CUT = [11,12,13,14,20,21,22,25]. A practical lesson: a 30-sample probe can lie vs full lm-eval (−25pp), so cut decisions were validated with the full benchmark.

2. Layer-integration training (knowledge distillation). After cutting, the model is "unstitched". I recovered it with online full-vocab KD: KL(teacher‖student) + CE, teacher = the untouched 40L parent. In plain terms: the intact big model (teacher) shows the trimmed one (student) what answer distribution it should produce, and the student learns to imitate it — re-stitching the cut layers and recovering most of what was lost, without training from scratch.

3. Anti-hallucination training (R-Tuning, closed-book abstention). Grounded abstention does NOT work (teaches "is it in the text", not "do I know"). R-Tuning works: probe this checkpoint on gold questions → correct Q→gold, wrong Q→"I don't know". In plain terms: I first check what the model genuinely doesn't know, then teach it to say "I don't know" there instead of confidently making things up. Dose carefully — too much and it starts refusing even things it knows.

4. LoRA souping. After cutting, the model was very sensitive to training — gains in one task came at the cost of another. In plain terms: souping averages the weights of several separate training runs into one model — instead of picking "good at A or good at B", I take the average and often get a model good at both.

Results — Open PL LLM Leaderboard (Q8_0)

task	Minitron 7B (parent)	Bielik-fit 32L
psc	95.73	94.34
ppc	75.60	70.70
dyk	73.37	85.42
belebele	87.33	76.67
8tags	78.29	79.69
polemo2	77.15	90.44
AVG	81.25	82.88

The numbers look quite good — and honestly, that's partly due to how the model was trained. All training mixes and scripts are on my Hugging Face profile. Data was checked and filtered several times for benchmark leaks (n-gram + embedding-cosine decontamination vs test splits — 0 verbatim leakage on all tasks).

The devil is in the details

A high average ≠ "smarter model". The key signal is the belebele regression (87 → 77, −10.7pp), directly caused by the layer reduction. belebele = long passage + multi-step reasoning — exactly what the cut layers provided. The 6B model is weaker at long-form reasoning and long context; better suited to short tasks (classification, factual lookup, short answers). The high polemo2/dyk come from fine-tuning on PL tasks, not general superiority. The model was tested only in Polish; the dataset had some English. I expect large regression in other languages — but the goal was a small, fast Polish Bielik.

English benchmarks (EN / math / code)

Generative, via chat endpoint, identical setup for both models (n=120 for ARC/GSM8K, 164 for HumanEval):

benchmark	Bielik-fit 6B	Minitron 7B	diff
ARC-Challenge (EN reasoning)	61.7	81.7	−20.0
GSM8K (math)	50.0	69.2	−19.2
HumanEval pass@1 (code)	34.8	70.1	−35.3

Here fit-6B clearly lags — confirming the whole narrative: layer cutting costs deep reasoning, math and code the most (same cause as belebele), and training was mostly Polish. As intended: fit-6B is a small, fast model for short Polish tasks — not for reasoning, code or English.

Tool-calling & RAG (vs Minitron 7B)

capability	Bielik-fit 6B	Minitron 7B
Tool-calling (BFCL AST)	0.78	0.94
RAG grounded (EM)	0.60	0.69
RAG abstention (impossible)	0.20	0.53

Cutting preserved routing (function choice — 0 name errors) but cost argument precision and self-control (recognizing "I don't know"). Consistent with belebele: pruning cuts depth, not surface knowledge.

Speed — fit-6B vs Minitron 7B (R9700 / RADV Vulkan, gfx1201)

llama-bench, t/s. pp512 = prefill, tg128 = decode, @dN = at context N tokens.

Decode (tg128):

context	fit-6B Q4	Minitron 7B Q4	fit-6B Q8	Minitron 7B Q8
0	126	102	85	68
2048	116	94	80	65
8192	91	73	68	54

Prefill (pp512):

context	fit-6B Q4	Minitron 7B Q4	fit-6B Q8	Minitron 7B Q8
0	3940	3038	4726	3822
8192	1718	1345	1830	1456

Sizes: fit-6B Q4 3.39 GiB / Q8 5.97 GiB · Minitron 7B Q4 4.19 GiB / Q8 7.40 GiB.

At the same quant fit-6B is ~24–30% faster than Minitron 7B (effect of −20% layers / −19% params: 6.03B vs 7.48B). Q4 = faster decode (memory-bound), Q8 = faster prefill (compute-bound, coopmat). For interactive chat on weaker GPUs → Q4.

Recommendation

The model works best at temperature 0–0.2. There it matches full Minitron 7B on normal and hard Polish knowledge (see appendix), at −20% size and ~25% faster decode. A compressed model has "thinner" confidence, so high temperature lets sampling derail it more easily. For long-form reasoning / long context, use the full model.

This project is purely experimental and educational. Coding assistance: Claude Code. Not responsible for it. :)

(For the temperature-by-temperature factual test appendix, fit-6B vs Minitron 7B across 40 questions in 4 difficulty tiers, see the appendix above; the OK/HAL marks are language-agnostic.)

Format: GGUF · Model size: 6B params · Architecture: llama · Quantizations: 4-bit, 8-bit

Model tree for JohnTdi/Bielik-Minitron-Fit-6B: base model speakleash/Bielik-11B-v3-Base-20250730 → finetuned speakleash/Bielik-Minitron-7B-v3.0-Instruct → quantized: this model