darkmaniac7/Qwen3-8B-abliterated-v2-MNN

	---
	license: apache-2.0
	tags:
	- mnn
	- qwen3
	- mobile
	- on-device
	- tokforge
	- abliterated
	base_model: Qwen/Qwen3-8B
	---

	# Qwen3-8B-abliterated-v2 (MNN)

	Pre-converted, abliterated variant of [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) in MNN format for on-device inference.

	## Model Details
	- Architecture: Qwen3 (standard attention, 36 layers)
	- Parameters: 8B (4-bit quantized)
	- Format: MNN (Alibaba Mobile Neural Network)
	- Vocab: 151,936 tokens
	- Quantization: W4A16 (4-bit weights, 16-bit activations)

	## Files
	| File | Size | Description |
	|------|------|-------------|
	| `llm.mnn` | 631 KB | Model graph |
	| `llm.mnn.weight` | 4.4 GB | Quantized weights |
	| `embeddings_bf16.bin` | 1.2 GB | BF16 embedding table (required) |
	| `llm_config.json` | 4.5 KB | Model config with Jinja chat template |
	| `tokenizer.txt` | 3.0 MB | Tokenizer |
	| `config.json` | 210 B | MNN runtime config |
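
The size of `embeddings_bf16.bin` can be sanity-checked from the vocabulary size in Model Details. The sketch below is only a back-of-the-envelope calculation; the hidden size of 4096 is an assumption taken from the base Qwen3-8B configuration and is not stated in this card.

```python
# Rough size check for the BF16 embedding table listed in the Files table.
VOCAB_SIZE = 151_936   # from the Model Details section of this card
HIDDEN_SIZE = 4096     # ASSUMPTION: Qwen3-8B hidden size, not stated in this card
BYTES_PER_BF16 = 2     # bfloat16 stores each value in 16 bits


def embedding_table_bytes(vocab_size: int, hidden_size: int, bytes_per_value: int) -> int:
    """Bytes needed for a dense vocab_size x hidden_size table."""
    return vocab_size * hidden_size * bytes_per_value


size_bytes = embedding_table_bytes(VOCAB_SIZE, HIDDEN_SIZE, BYTES_PER_BF16)
size_gb = size_bytes / 1e9  # decimal gigabytes
print(f"{size_bytes} bytes is about {size_gb:.2f} GB")
```

Under that assumption the result rounds to the 1.2 GB listed for the file.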

	## Usage with TokForge
	This model is optimized for [TokForge](https://tokforge.ai), an Android app for on-device LLM inference.

	### Performance (Speculative Decoding)
	| Device | SoC | Backend | Autoregressive (tok/s) | Speculative decoding (tok/s) | Uplift |
	|--------|-----|---------|------------------------|------------------------------|--------|
	| S26 Ultra | SM8850 | OpenCL | ~14 | 17.8 | +27% |
	| RedMagic 11 Pro | SM8850 | OpenCL | ~14 | 17.8 | +27% |
	| Lenovo TB520FU | SM8650 | OpenCL | 9.9 | 12.2 | +23% |

	Draft model: [Qwen3-0.6B](https://huggingface.co/darkmaniac7/TokForge-AccelerationPack-Draft)
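
The uplift column follows from the two throughput columns. As a small sketch of that arithmetic, the snippet below recomputes it; the `~14` baseline is treated as exactly 14.0, which is an assumption since the card only gives an approximate figure.

```python
def uplift_percent(baseline_tok_s: float, spec_tok_s: float) -> float:
    """Relative throughput gain of speculative decoding over the baseline, in percent."""
    return (spec_tok_s / baseline_tok_s - 1.0) * 100.0


# (device, baseline tok/s, speculative decoding tok/s) from the table above.
# ASSUMPTION: "~14" is taken as 14.0 for the two SM8850 devices.
rows = [
    ("S26 Ultra", 14.0, 17.8),
    ("RedMagic 11 Pro", 14.0, 17.8),
    ("Lenovo TB520FU", 9.9, 12.2),
]

for device, baseline, spec in rows:
    print(f"{device}: +{uplift_percent(baseline, spec):.0f}%")
```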

	## Abliteration
	This model has been abliterated: the base model's refusal behavior has been suppressed, so it answers prompts that the original Qwen3-8B would decline. Use responsibly.

	## Limitations and Intended Use

	- Intended for TokForge / MNN on-device inference, especially Android phones and tablets.
	- The best-known uplift for this model comes from pairing it with a small CPU draft model for speculative decoding.
	- Real throughput varies by SoC, thermal state, backend, and generation length.
	- This repo is a runtime bundle, not a standard Transformers training checkpoint.
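
To illustrate why a small draft model can raise throughput, here is a toy sketch of greedy speculative decoding. It is not TokForge's or MNN's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the 8B model and the draft model, and the verification loop calls the target once per token for clarity, whereas a real runtime would score all drafted tokens in a single batched pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token id


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position and always emits its own choice.
        for tok in proposal:
            expected = target(tokens)
            tokens.append(expected)
            if expected != tok or len(tokens) >= goal:
                break  # first mismatch (already corrected) or enough tokens
        else:
            # 3. Every proposal was accepted: the target contributes one extra token.
            if len(tokens) < goal:
                tokens.append(target(tokens))
    return tokens


# Hypothetical toy models: the draft agrees with the target except after multiples of 5.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 50


def toy_draft(tokens: List[int]) -> int:
    t = toy_target(tokens)
    return t if tokens[-1] % 5 else (t + 1) % 50


baseline = greedy_decode(toy_target, [7], 20)
speculative = speculative_decode(toy_target, toy_draft, [7], 20)
print(baseline == speculative)
```

In this greedy variant the output is identical to decoding with the target alone; the gain on real hardware depends on how often the draft's proposals are accepted.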

	## Community

	- Website: [tokforge.ai](https://tokforge.ai)
	- Discord: [Join the Discord](https://discord.gg/Acv3CBtfVm)

	## Export
	Converted using MNN's `llmexport` pipeline with `--quant_bit 4 --quant_block 128`.
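
As a rough illustration of what 4-bit quantization with a block size of 128 means, the sketch below quantizes a weight list block by block, giving each block its own scale and offset. This is a generic asymmetric min/max scheme written for illustration; it is an assumption that it resembles what the exporter does, and MNN's actual rounding and storage layout may differ.

```python
import math
from typing import List, Tuple

QuantBlock = Tuple[List[int], float, float]  # (4-bit codes, scale, minimum)


def quantize_block(block: List[float], bits: int = 4) -> QuantBlock:
    """Map a block of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1  # 15 for 4-bit
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in block]
    return codes, scale, lo


def dequantize_block(codes: List[int], scale: float, lo: float) -> List[float]:
    """Reconstruct approximate floats from codes and per-block parameters."""
    return [c * scale + lo for c in codes]


def quantize(weights: List[float], block_size: int = 128, bits: int = 4) -> List[QuantBlock]:
    """Quantize consecutive blocks of block_size weights independently."""
    return [quantize_block(weights[i:i + block_size], bits)
            for i in range(0, len(weights), block_size)]


# Illustrative data only: 300 synthetic "weights" split into blocks of 128, 128 and 44.
weights = [math.sin(i) for i in range(300)]
blocks = quantize(weights)
restored = [w for codes, scale, lo in blocks for w in dequantize_block(codes, scale, lo)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(blocks), max_error)
```

Smaller blocks track local weight ranges more closely at the cost of storing more scale and offset values.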