Instructions to use Evrmind/EVR-1-Maano-8b-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Evrmind/EVR-1-Maano-8b-Instruct with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Evrmind/EVR-1-Maano-8b-Instruct",
	filename="evr-llama-3.1-8b-instruct.gguf",
)

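# Ask a single question and keep the response object
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

# Print only the assistant's reply text
print(response["choices"][0]["message"]["content"])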

Local Apps

llama.cpp

How to use Evrmind/EVR-1-Maano-8b-Instruct with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Evrmind/EVR-1-Maano-8b-Instruct
# Run inference directly in the terminal:
llama-cli -hf Evrmind/EVR-1-Maano-8b-Instruct

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Evrmind/EVR-1-Maano-8b-Instruct
# Run inference directly in the terminal:
llama-cli -hf Evrmind/EVR-1-Maano-8b-Instruct

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Evrmind/EVR-1-Maano-8b-Instruct
# Run inference directly in the terminal:
./llama-cli -hf Evrmind/EVR-1-Maano-8b-Instruct

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Evrmind/EVR-1-Maano-8b-Instruct
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Evrmind/EVR-1-Maano-8b-Instruct

Use Docker

docker model run hf.co/Evrmind/EVR-1-Maano-8b-Instruct


vLLM

How to use Evrmind/EVR-1-Maano-8b-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Evrmind/EVR-1-Maano-8b-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Evrmind/EVR-1-Maano-8b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
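
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the base URL assumes the vLLM server above is listening on its default port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Evrmind/EVR-1-Maano-8b-Instruct",
                       base_url="http://localhost:8000/v1"):
    """Build the same POST request as the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` should return the reply as a string. Passing `base_url="http://localhost:8080/v1"` points the same helper at a local `llama-server` instead.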

Use Docker

docker model run hf.co/Evrmind/EVR-1-Maano-8b-Instruct

Ollama
How to use Evrmind/EVR-1-Maano-8b-Instruct with Ollama:
```
ollama run hf.co/Evrmind/EVR-1-Maano-8b-Instruct
```

Unsloth Studio

How to use Evrmind/EVR-1-Maano-8b-Instruct with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Evrmind/EVR-1-Maano-8b-Instruct to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Evrmind/EVR-1-Maano-8b-Instruct to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Evrmind/EVR-1-Maano-8b-Instruct to start chatting

Pi

How to use Evrmind/EVR-1-Maano-8b-Instruct with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Evrmind/EVR-1-Maano-8b-Instruct

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Evrmind/EVR-1-Maano-8b-Instruct"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Evrmind/EVR-1-Maano-8b-Instruct with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Evrmind/EVR-1-Maano-8b-Instruct

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Evrmind/EVR-1-Maano-8b-Instruct

Run Hermes

hermes

Docker Model Runner
How to use Evrmind/EVR-1-Maano-8b-Instruct with Docker Model Runner:
```
docker model run hf.co/Evrmind/EVR-1-Maano-8b-Instruct
```

Lemonade

How to use Evrmind/EVR-1-Maano-8b-Instruct with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Evrmind/EVR-1-Maano-8b-Instruct

Run and chat with the model

lemonade run user.EVR-1-Maano-8b-Instruct-{{QUANT_TAG}}

List all available models

lemonade list


license: other
license_name: evrmind-free-1.0
license_link: LICENSE.md
language:
  - en
library_name: llama.cpp
tags:
  - llama
  - llama-3.1
  - gguf
  - 3-bit
  - quantization
  - evr
  - evrmind
  - text-generation
  - instruct
  - chat
  - on-device
  - maano
pipeline_tag: text-generation
model-index:
  - name: Evrmind EVR-1 Maano-8b-Instruct
    results:
      - task:
          type: text-generation
        metrics:
          - name: Perplexity (wikitext-2, ctx=512)
            type: perplexity
            value: 7.37
          - name: Coherence (rep4 @ 500 tokens)
            type: repetition-rate
            value: 2.77
          - name: Coherence (rep4 @ 1000 tokens)
            type: repetition-rate
            value: 9.66

Evrmind EVR-1 Maano-8b-Instruct (3.93 GiB)

A custom 3-bit compression of Meta's Llama 3.1 8B Instruct that fits in under 4 GiB while maintaining coherent instruction-following at 1000+ tokens.

EVR-1 is not a standard quantization (not Q2, Q3, Q4, etc). It is a custom compression method developed by Evrmind. The compressed weights average approximately 3 bits per parameter; the total GGUF file (3.93 GiB) includes additional metadata and structure overhead.
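
The size arithmetic behind this paragraph can be checked with a short sketch. The parameter count of 8.03 billion is an assumption (an approximate figure for Llama 3.1 8B that does not appear in this card), and the function names are hypothetical.

```python
GIB = 2**30  # bytes per GiB


def weights_size_gib(n_params, bits_per_param):
    """Size of the weights alone at a given average bit width."""
    return n_params * bits_per_param / 8 / GIB


def effective_bits_per_param(file_size_gib, n_params):
    """Average bits per parameter if the whole file is counted."""
    return file_size_gib * GIB * 8 / n_params


N_PARAMS = 8.03e9  # assumption: approximate Llama 3.1 8B parameter count

print(f"{weights_size_gib(N_PARAMS, 3):.2f} GiB at 3 bits per parameter")
print(f"{effective_bits_per_param(3.93, N_PARAMS):.2f} bits per parameter over the full file")
```

Under that assumption, 3 bits per parameter comes to roughly 2.8 GiB. The gap between that figure and the 3.93 GiB file is what the paragraph above attributes to additional metadata and structure overhead.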

Key Results

| Metric | EVR-1 Instruct (3.93 GiB) |
| --- | --- |
| Coherence @ 500 tokens | 2.77% repetition |
| Coherence @ 1000 tokens | 9.66% repetition |
| Perplexity (wikitext-2, ctx=512) | 7.37 |

Chat Example

User: What are the main causes of the French Revolution?

EVR-1 Maano-8b-Instruct:

The French Revolution, which lasted from 1789 to 1799, was a complex event with multiple causes. However, some of the main contributing factors include:

Financial Crisis: France was deeply in debt from its involvement in the American Revolutionary War and other wars. The burden of taxation fell on the common people, who were already facing economic hardship.

Social Inequality: The French nobility held a significant amount of power over commoners through their control over land and government positions.

Enlightenment Ideas: The ideas presented by Enlightenment thinkers such as Rousseau, Voltaire, Diderot encouraged people to think critically about their government and seek change.

Food Shortages: A series of poor harvests led to widespread hunger and starvation.

Absolute Monarchy: The rule of King Louis XVI was seen as tyrannical by many.

The American Revolution's Influence: The American Revolution inspired the French people with its ideas of liberty, equality, and fraternity.

How to Run

Download the model file and the binary for your platform, then:

# Extract the binary
mkdir -p linux-cuda && tar xzf evrmind-linux-cuda.tar.gz -C linux-cuda

# Run (interactive chat)
cd linux-cuda
LD_LIBRARY_PATH=. ./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99

# Run (single completion)
LD_LIBRARY_PATH=. ./llama-completion -m ../evr-llama-3.1-8b-instruct.gguf -p "Your prompt here" -n 500 -ngl 99

Platform Binaries

| Platform | File | GPU Required |
| --- | --- | --- |
| Linux + NVIDIA | `evrmind-linux-cuda.tar.gz` | NVIDIA GPU (CUDA 12) |
| Linux + Any GPU | `evrmind-linux-vulkan.tar.gz` | Any Vulkan-capable GPU |
| Windows + NVIDIA | `evrmind-windows-cuda.zip` | NVIDIA GPU (CUDA 12) |
| Windows + Any GPU | `evrmind-windows-vulkan.zip` | Any Vulkan-capable GPU |
| macOS (Apple Silicon) | `evrmind-macos-metal.tar.gz` | M1/M2/M3/M4 |
| Android (Termux) | `evrmind-android-vulkan.tar.gz` | Vulkan |

Note: The binaries are the same for all EVR-1 models. You only need to download them once. Just point them at whichever GGUF you want to run.
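
For scripted downloads, the platform table can be expressed as a lookup. This is only an illustrative sketch: the file names come from the table, while the `ARCHIVES` mapping, the key spelling, and the `pick_archive` helper are hypothetical.

```python
# (operating system, GPU backend) -> archive name, copied from the table above
ARCHIVES = {
    ("linux", "cuda"): "evrmind-linux-cuda.tar.gz",
    ("linux", "vulkan"): "evrmind-linux-vulkan.tar.gz",
    ("windows", "cuda"): "evrmind-windows-cuda.zip",
    ("windows", "vulkan"): "evrmind-windows-vulkan.zip",
    ("macos", "metal"): "evrmind-macos-metal.tar.gz",
    ("android", "vulkan"): "evrmind-android-vulkan.tar.gz",
}


def pick_archive(os_name, backend):
    """Return the archive listed for an OS/backend pair, or raise if none is listed."""
    key = (os_name.lower(), backend.lower())
    if key not in ARCHIVES:
        raise ValueError(f"no binary listed for {os_name} + {backend}")
    return ARCHIVES[key]
```

For example, `pick_archive("Linux", "cuda")` selects the archive used in the How to Run commands.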

Flags

| Flag | Description |
| --- | --- |
| `-ngl 99` | Offload all layers to GPU (recommended) |
| `-n 500` | Generate 500 tokens |
| `-p "..."` | Your prompt |
| `-t 8` | Number of CPU threads (for CPU layers) |
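
When calling the binary from a script, the flags above can be assembled into an argument list. The sketch below assumes the `llama-completion` binary from the How to Run section; the `completion_command` helper and its defaults are hypothetical.

```python
def completion_command(model_path, prompt, n_tokens=500, gpu_layers=99, threads=None):
    """Build an argv list for a single completion using the flags from the table."""
    cmd = [
        "./llama-completion",
        "-m", model_path,         # path to the GGUF file
        "-p", prompt,             # your prompt
        "-n", str(n_tokens),      # number of tokens to generate
        "-ngl", str(gpu_layers),  # layers to offload to the GPU
    ]
    if threads is not None:
        cmd += ["-t", str(threads)]  # CPU threads, for layers left on the CPU
    return cmd
```

The resulting list can be passed to `subprocess.run` from the extracted binary directory. On Linux the commands above also set `LD_LIBRARY_PATH=.`, which would need to be supplied through the `env` argument.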

Model Details

Name: Evrmind EVR-1 Maano-8b-Instruct
Base model: Meta Llama 3.1 8B Instruct
Size: 3.93 GiB (GGUF)
Method: EVR-1 (Evrmind Reconstruction), a custom 3-bit compression method
Backends: CUDA, Vulkan, Metal, CPU
Context: Tested up to 2048 tokens; longer contexts have not been validated at 3-bit compression
Chat template: Llama 3.1 instruct format (built-in)
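
Because the chat template is built in, chat-style entry points apply it automatically. For raw-prompt use, the Llama 3.1 instruct layout looks roughly like the sketch below. It is simplified: the official template may also add a default system preamble, and the `format_llama31_prompt` helper is hypothetical.

```python
def format_llama31_prompt(messages):
    """Render chat messages in a simplified Llama 3.1 instruct layout."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Leave an open assistant header so the model writes the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

Whether a given binary adds `<|begin_of_text|>` on its own is not covered here. If it does, the leading token should be dropped to avoid duplicating it.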

Benchmarks

Coherence (5 continuation-style prompts, 500 and 1000 tokens each)

Average 4-gram repetition rate (lower = better):

| Model | Size | rep4 @ 500 | rep4 @ 1000 |
| --- | --- | --- | --- |
| EVR-1 Instruct | 3.93 GiB | 2.77% | 9.66% |
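
One common way to define a 4-gram repetition rate is the share of 4-grams that repeat an earlier 4-gram in the same output. The sketch below implements that definition as an assumption: the card does not specify the exact formula or tokenization behind the reported numbers, and `repetition_rate` is a hypothetical helper.

```python
from collections import Counter


def repetition_rate(tokens, n=4):
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence after the first counts as a repeat.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)
```

For instance, the eight-token sequence `"a b c d a b c d".split()` has five 4-grams, one of which repeats an earlier one, giving a rate of 0.2. Averaging this value over several generations gives a figure comparable in form to the table above.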

Also Available

EVR-1 Maano-8b, base model (not instruction-tuned), for text completion and creative writing
EVR-1 Bafethu-8b-Reasoning, reasoning model (DeepSeek R1)

Intended Use

This model is intended for on-device chat and instruction-following on laptops, desktops, and edge devices where memory is constrained. An Android (Termux) build is also available. There is no iOS build.

Limitations

Math reasoning is limited, as expected for Llama 3.1 8B Instruct at this compression level.
Occasional minor character-level artefacts (e.g., dropped letters) due to 3-bit compression.
Generation quality degrades somewhat beyond 1000 tokens.
As with all heavily quantized models, generated text may contain factual inaccuracies (e.g., incorrect numbers, dates, or scientific details). Always verify factual claims independently.

System Requirements

Storage: ~4 GiB for model weights + ~50 MB for binaries
RAM: 6 GiB minimum (8 GiB recommended)
GPU (recommended): NVIDIA GPU with CUDA 12, Apple Silicon (M1/M2/M3/M4), or any Vulkan-capable GPU
CPU-only: Supported but significantly slower
OS: Linux (x86_64), macOS (Apple Silicon), Windows (x86_64), Android (Termux, ARM64)
Not supported: iOS, 32-bit systems

Safety and Responsible Use

This model inherits the capabilities and limitations of its base model (Meta Llama 3.1 8B Instruct). Like all language models, it can generate incorrect, biased, or harmful content. Users should:

Not rely on this model for factual accuracy without verification
Not use this model to generate content that could cause harm
Apply appropriate content filtering for any user-facing applications
Be aware that 3-bit compression may amplify certain failure modes of the base model

Derivative Works

If you create derivative works, credit "EVR-1 Maano" in your model name and documentation. Commercial use is permitted subject to the Llama 3.1 Community License Agreement.

License

Available for personal, research, and commercial use with attribution, subject to upstream license terms. See LICENSE.md for full terms.

Built with Llama. This model is a derivative of Meta's Llama 3.1 8B Instruct and is subject to the Llama 3.1 Community License Agreement in addition to the Evrmind license.

Citation

@misc{evrmind2026evr1maano8binstruct,
  title={Evrmind EVR-1 Maano-8b-Instruct: A Custom 3-Bit Compression Method for Coherent On-Device Instruction-Following},
  author={Evrmind},
  year={2026},
  url={https://huggingface.co/evrmind/evr-1-maano-8b-instruct}
}

Contact

Email: hello@evrmind.io
Issues: GitHub