---
language:
  - en
license: other
license_name: evrmind-free-1.0
license_link: LICENSE.md
library_name: llama.cpp
tags:
  - llama
  - llama-3.1
  - gguf
  - 3-bit
  - quantization
  - evr
  - evrmind
  - text-generation
  - instruct
  - chat
  - on-device
  - maano
model_name: EVR-1 Maano-8b-Instruct
base_model: meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
model_type: llama
quantized_by: evrmind
num_parameters: 8030000000
---

# Evrmind EVR-1 Maano-8b-Instruct

Llama 3.1 8B Instruct compressed using EVR-1 (Evrmind Reconstruction), a novel compression method developed independently by Evrmind. The compressed weights average approximately 3 bits per parameter; the total GGUF file (3.93 GiB) includes additional metadata and structure overhead.

In our coherence tests (5 continuation-style prompts), EVR-1 Instruct averaged 2.77% repetition (rep4) at 500 tokens and 9.66% at 1000 tokens.

**3.93 GiB | Llama 3.1 8B Instruct | Runs on laptops, desktops, and Android (Termux)**

HuggingFace may display an incorrect parameter count in the sidebar due to the custom compression format. EVR-1 is not a standard quantization (not Q2, Q3, Q4, etc).

## Setup

You need two things: the **model files** (from this HuggingFace repo) and a **platform binary** (from GitHub).

**Step 1:** Clone this repo or download the files:

```bash
# Option A: Clone everything (~4.2 GB, requires git-lfs)
git lfs install
git clone https://huggingface.co/evrmind/evr-1-maano-8b-instruct
cd evr-1-maano-8b-instruct

# Option B: Or download individual files from the "Files" tab above
```

**Step 2:** Download the binary for your platform from the [Downloads](#downloads) table. Save the archive into the `evr-1-maano-8b-instruct` directory, then extract it:

```bash
# Linux + NVIDIA
mkdir -p linux-cuda && tar xzf evrmind-linux-cuda.tar.gz -C linux-cuda

# Linux + Vulkan
mkdir -p linux-vulkan && tar xzf evrmind-linux-vulkan.tar.gz -C linux-vulkan

# macOS (Apple Silicon)
mkdir -p metal && tar xzf evrmind-macos-metal.tar.gz -C metal

# Android (Termux)
mkdir -p android-vulkan && tar xzf evrmind-android-vulkan.tar.gz -C android-vulkan
```

For Windows, extract the `.zip` into a folder with the matching name (e.g., extract `evrmind-windows-cuda.zip` into a folder called `windows-cuda`).

After completing both steps, your directory should look like this:

```
evr-1-maano-8b-instruct/
  evr-llama-3.1-8b-instruct.gguf   <-- model weights
  start-server.sh                    <-- Linux/macOS/Android launcher
  start-server.bat                   <-- Windows launcher
  webui/                             <-- browser interface
  linux-cuda/                        <-- extracted platform binary (example)
    llama-server
    llama-cli
    llama-completion
    ...
```

## Web UI

**Linux, macOS, Android (Termux):**

```bash
./start-server.sh
# Open http://localhost:8080
```

**Windows:**

Double-click `start-server.bat`, or from Command Prompt:

```
start-server.bat
```

Then open http://localhost:8080 in your browser.

**Network access (phone, tablet, other devices on the same WiFi):**

```bash
./start-server.sh --network
```

The script will print the URL to open on other devices. The model runs on your computer; other devices just connect to the web UI. The `--network` and `--cpu` flags are only available in `start-server.sh` (Linux/macOS/Android).

See [WEB_UI.md](WEB_UI.md) for more options and troubleshooting.

## Quick Start (CLI)

These examples assume you have completed [Setup](#setup) and are in the repo directory.

**Linux + NVIDIA GPU:**

```bash
cd linux-cuda
LD_LIBRARY_PATH=. ./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**macOS (Apple Silicon):**

```bash
cd metal
./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**Linux + Vulkan:**

```bash
cd linux-vulkan
LD_LIBRARY_PATH=. ./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**Android (Termux):**

```bash
cd android-vulkan
LD_LIBRARY_PATH=. ./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**Windows + NVIDIA (Command Prompt):**

```cmd
cd windows-cuda
llama-cli.exe -m ..\evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**Windows + Vulkan (Command Prompt):**

```cmd
cd windows-vulkan
llama-cli.exe -m ..\evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**CPU-only (no GPU):**

Use `-ngl 0` instead of `-ngl 99` on any platform. Roughly 5-10x slower but works on any machine.

## Downloads

| Platform | Download | GPU |
|----------|----------|-----|
| Linux + NVIDIA | [evrmind-linux-cuda.tar.gz](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-linux-cuda.tar.gz) | CUDA 12 |
| Linux + Any GPU | [evrmind-linux-vulkan.tar.gz](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-linux-vulkan.tar.gz) | Vulkan |
| Windows + NVIDIA | [evrmind-windows-cuda.zip](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-windows-cuda.zip) | CUDA 12 |
| Windows + Any GPU | [evrmind-windows-vulkan.zip](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-windows-vulkan.zip) | Vulkan |
| macOS (Apple Silicon) | [evrmind-macos-metal.tar.gz](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-macos-metal.tar.gz) | Apple Silicon |
| Android (Termux) | [evrmind-android-vulkan.tar.gz](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-android-vulkan.tar.gz) | Vulkan |

The model weights (`evr-llama-3.1-8b-instruct.gguf`, ~4.2 GB) are available from the **Files** tab on this HuggingFace page. Platform binaries are hosted on [GitHub Releases](https://github.com/evrmind-uk/evr-llama/releases/tag/v1.0.0). You can verify downloads with [SHA256SUMS.txt](SHA256SUMS.txt).

> **Note:** The binaries are the same for all EVR-1 models. You only need to download them once. Just point them at whichever GGUF you want to run.

## Why EVR-1 Maano-8b-Instruct?

Standard quantizations at 3-4 GiB can produce repetition during extended generation. In our tests (5 continuation-style prompts), EVR-1 Maano-8b-Instruct maintained coherent output with an average repetition rate of 2.77% (rep4) at 500 tokens and 9.66% at 1000 tokens.

**EVR-1 Maano-8b-Instruct (3.93 GiB):**
> User: "What are the main causes of the French Revolution?"
>
> "The French Revolution, which lasted from 1789 to 1799, was a complex event with multiple causes. However, some of the main contributing factors include: 1. Financial Crisis: France was deeply in debt from its involvement in the American Revolutionary War... 2. Social Inequality: The French nobility held a significant amount of power... 3. Enlightenment Ideas: The ideas presented by Enlightenment thinkers such as Rousseau, Voltaire..." *(continues coherently for 500+ words)*

## Benchmarks

### Coherence (lower is better)

Average 4-gram repetition rate (lower = better), 5 continuation-style prompts:

| Model | Size | rep4 @ 500 | rep4 @ 1000 |
|-------|------|-----------|-------------|
| **EVR-1 Instruct** | **3.93 GiB** | **2.77%** | **9.66%** |

### Perplexity

| Model | Size | Perplexity (wikitext-2, ctx=512) |
|-------|------|--------------------------------|
| **EVR-1 Instruct** | **3.93 GiB** | **7.37** |

### Accuracy (EVR-1 base model reference numbers)

| Benchmark | EVR-1 Base (3.93 GiB) | Q3_K_M (3.83 GiB) | Q4_K_M (4.69 GiB) |
|-----------|----------------------|--------------------|-------|
| ARC-Challenge (25-shot, 1172q) | 59.8% | 60.8% | 61.3% |
| Perplexity (wikitext-2, ctx=512) | 6.70 | 7.02 | 6.58 |

*Coherence tested with 5 continuation-style prompts at 500 and 1000 tokens each, temperature 0, no repeat penalty. Accuracy numbers above are from the [EVR-1 base model](https://huggingface.co/evrmind/evr-1-maano-8b), shown here for reference. See [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for full coherence results and sample outputs.*

## Limitations

- Context window has been tested up to 2048 tokens. Longer contexts may work but have not been validated at 3-bit compression.
- Occasional minor character-level artefacts due to 3-bit compression.
- Math reasoning is limited at this compression level.
- As with all heavily quantized models, generated text may contain factual inaccuracies (e.g., incorrect numbers, dates, or scientific details). Always verify factual claims independently.

## System Requirements

- **Storage:** ~4 GiB for model weights + ~50 MB for binaries
- **RAM:** 6 GiB minimum (8 GiB recommended)
- **GPU (recommended):** NVIDIA (CUDA 12), Apple Silicon, or any Vulkan GPU
- **CPU-only:** Supported but slower (use `-ngl 0` or `--cpu` flag)
- **OS:** Linux, macOS (Apple Silicon), Windows, Android (Termux)
- **Not supported:** iOS, 32-bit systems

## Safety and Responsible Use

This model can generate incorrect, biased, or harmful content. Users should apply appropriate content filtering for user-facing applications. See [MODEL_CARD.md](MODEL_CARD.md) for details.

## Derivative Works

If you create derivative works, credit **"EVR-1 Maano"** in your model name and documentation. Commercial use is permitted subject to the Llama 3.1 Community License Agreement.

## License

This model is dual-licensed:

1. **[Evrmind Free License 1.0](LICENSE.md)**: Covers the EVR-1 compression and distribution. Permits personal, research, and commercial use with attribution.
2. **[Llama 3.1 Community License](META_LLAMA_LICENSE.md)**: Covers the underlying Llama 3.1 weights. Permits commercial use for entities with fewer than 700 million monthly active users.

Both licenses apply. See [LICENSE.md](LICENSE.md) and [META_LLAMA_LICENSE.md](META_LLAMA_LICENSE.md) for full terms.

## Also Available

- **[EVR-1 Maano-8b](https://huggingface.co/evrmind/evr-1-maano-8b)**, base model for text completion
- **[EVR-1 Bafethu-8b-Reasoning](https://huggingface.co/evrmind/evr-1-bafethu-8b-reasoning)**, reasoning model (DeepSeek R1)

## Contact

- Email: hello@evrmind.io
- Issues: [GitHub](https://github.com/evrmind-uk/evr-llama/issues)