---
language:
- en
license: other
license_name: evrmind-free-1.0
license_link: LICENSE.md
library_name: llama.cpp
tags:
- llama
- llama-3.1
- gguf
- 3-bit
- quantization
- evr
- evrmind
- text-generation
- instruct
- chat
- on-device
- maano
model_name: EVR-1 Maano-8b-Instruct
base_model: meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
model_type: llama
quantized_by: evrmind
num_parameters: 8030000000
---

# Evrmind EVR-1 Maano-8b-Instruct

Llama 3.1 8B Instruct compressed using EVR-1 (Evrmind Reconstruction), a novel compression method developed independently by Evrmind. The compressed weights average approximately 3 bits per parameter; the total GGUF file (3.93 GiB) includes additional metadata and structure overhead. In our coherence tests (5 continuation-style prompts), EVR-1 Instruct averaged 2.77% repetition (rep4) at 500 tokens and 9.66% at 1000 tokens.

**3.93 GiB | Llama 3.1 8B Instruct | Runs on laptops, desktops, and Android (Termux)**

HuggingFace may display an incorrect parameter count in the sidebar due to the custom compression format. EVR-1 is not a standard quantization (not Q2, Q3, Q4, etc).

## Setup

You need two things: the **model files** (from this HuggingFace repo) and a **platform binary** (from GitHub).

**Step 1:** Clone this repo or download the files:

```bash
# Option A: Clone everything (~4.2 GB, requires git-lfs)
git lfs install
git clone https://huggingface.co/evrmind/evr-1-maano-8b-instruct
cd evr-1-maano-8b-instruct

# Option B: Or download individual files from the "Files" tab above
```

**Step 2:** Download the binary for your platform from the [Downloads](#downloads) table.
Save the archive into the `evr-1-maano-8b-instruct` directory, then extract it:

```bash
# Linux + NVIDIA
mkdir -p linux-cuda && tar xzf evrmind-linux-cuda.tar.gz -C linux-cuda

# Linux + Vulkan
mkdir -p linux-vulkan && tar xzf evrmind-linux-vulkan.tar.gz -C linux-vulkan

# macOS (Apple Silicon)
mkdir -p metal && tar xzf evrmind-macos-metal.tar.gz -C metal

# Android (Termux)
mkdir -p android-vulkan && tar xzf evrmind-android-vulkan.tar.gz -C android-vulkan
```

For Windows, extract the `.zip` into a folder with the matching name (e.g., extract `evrmind-windows-cuda.zip` into a folder called `windows-cuda`).

After completing both steps, your directory should look like this:

```
evr-1-maano-8b-instruct/
  evr-llama-3.1-8b-instruct.gguf   <-- model weights
  start-server.sh                  <-- Linux/macOS/Android launcher
  start-server.bat                 <-- Windows launcher
  webui/                           <-- browser interface
  linux-cuda/                      <-- extracted platform binary (example)
    llama-server
    llama-cli
    llama-completion
    ...
```

## Web UI

**Linux, macOS, Android (Termux):**

```bash
./start-server.sh
# Open http://localhost:8080
```

**Windows:** Double-click `start-server.bat`, or from Command Prompt:

```
start-server.bat
```

Then open http://localhost:8080 in your browser.

**Network access (phone, tablet, other devices on the same WiFi):**

```bash
./start-server.sh --network
```

The script will print the URL to open on other devices. The model runs on your computer; other devices just connect to the web UI.

The `--network` and `--cpu` flags are only available in `start-server.sh` (Linux/macOS/Android). See [WEB_UI.md](WEB_UI.md) for more options and troubleshooting.

## Quick Start (CLI)

These examples assume you have completed [Setup](#setup) and are in the repo directory.

**Linux + NVIDIA GPU:**

```bash
cd linux-cuda
LD_LIBRARY_PATH=. ./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**macOS (Apple Silicon):**

```bash
cd metal
./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**Linux + Vulkan:**

```bash
cd linux-vulkan
LD_LIBRARY_PATH=. ./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**Android (Termux):**

```bash
cd android-vulkan
LD_LIBRARY_PATH=. ./llama-cli -m ../evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**Windows + NVIDIA (Command Prompt):**

```cmd
cd windows-cuda
llama-cli.exe -m ..\evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**Windows + Vulkan (Command Prompt):**

```cmd
cd windows-vulkan
llama-cli.exe -m ..\evr-llama-3.1-8b-instruct.gguf -ngl 99
```

**CPU-only (no GPU):** Use `-ngl 0` instead of `-ngl 99` on any platform. Roughly 5-10x slower but works on any machine.

## Downloads

| Platform | Download | GPU |
|----------|----------|-----|
| Linux + NVIDIA | [evrmind-linux-cuda.tar.gz](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-linux-cuda.tar.gz) | CUDA 12 |
| Linux + Any GPU | [evrmind-linux-vulkan.tar.gz](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-linux-vulkan.tar.gz) | Vulkan |
| Windows + NVIDIA | [evrmind-windows-cuda.zip](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-windows-cuda.zip) | CUDA 12 |
| Windows + Any GPU | [evrmind-windows-vulkan.zip](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-windows-vulkan.zip) | Vulkan |
| macOS (Apple Silicon) | [evrmind-macos-metal.tar.gz](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-macos-metal.tar.gz) | Apple Silicon |
| Android (Termux) | [evrmind-android-vulkan.tar.gz](https://github.com/evrmind-uk/evr-llama/releases/download/v1.0.0/evrmind-android-vulkan.tar.gz) | Vulkan |

The model weights (`evr-llama-3.1-8b-instruct.gguf`, ~4.2 GB) are available from the **Files** tab on this HuggingFace page.
Platform binaries are hosted on [GitHub Releases](https://github.com/evrmind-uk/evr-llama/releases/tag/v1.0.0). You can verify downloads with [SHA256SUMS.txt](SHA256SUMS.txt).

> **Note:** The binaries are the same for all EVR-1 models. You only need to download them once. Just point them at whichever GGUF you want to run.

## Why EVR-1 Maano-8b-Instruct?

Standard quantizations at 3-4 GiB can produce repetition during extended generation. In our tests (5 continuation-style prompts), EVR-1 Maano-8b-Instruct maintained coherent output with an average repetition rate of 2.77% (rep4) at 500 tokens and 9.66% at 1000 tokens.

**EVR-1 Maano-8b-Instruct (3.93 GiB):**

> User: "What are the main causes of the French Revolution?"
>
> "The French Revolution, which lasted from 1789 to 1799, was a complex event with multiple causes. However, some of the main contributing factors include:
>
> 1. Financial Crisis: France was deeply in debt from its involvement in the American Revolutionary War...
> 2. Social Inequality: The French nobility held a significant amount of power...
> 3. Enlightenment Ideas: The ideas presented by Enlightenment thinkers such as Rousseau, Voltaire..."

*(continues coherently for 500+ words)*

## Benchmarks

### Coherence (lower is better)

Average 4-gram repetition rate (lower = better), 5 continuation-style prompts:

| Model | Size | rep4 @ 500 | rep4 @ 1000 |
|-------|------|-----------|-------------|
| **EVR-1 Instruct** | **3.93 GiB** | **2.77%** | **9.66%** |

### Perplexity

| Model | Size | Perplexity (wikitext-2, ctx=512) |
|-------|------|--------------------------------|
| **EVR-1 Instruct** | **3.93 GiB** | **7.37** |

### Accuracy (EVR-1 base model reference numbers)

| Benchmark | EVR-1 Base (3.93 GiB) | Q3_K_M (3.83 GiB) | Q4_K_M (4.69 GiB) |
|-----------|----------------------|--------------------|-------|
| ARC-Challenge (25-shot, 1172q) | 59.8% | 60.8% | 61.3% |
| Perplexity (wikitext-2, ctx=512) | 6.70 | 7.02 | 6.58 |

*Coherence tested with 5 continuation-style prompts at 500 and 1000 tokens each, temperature 0, no repeat penalty. Accuracy numbers above are from the [EVR-1 base model](https://huggingface.co/evrmind/evr-1-maano-8b), shown here for reference. See [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for full coherence results and sample outputs.*

## Limitations

- Context window has been tested up to 2048 tokens. Longer contexts may work but have not been validated at 3-bit compression.
- Occasional minor character-level artefacts due to 3-bit compression.
- Math reasoning is limited at this compression level.
- As with all heavily quantized models, generated text may contain factual inaccuracies (e.g., incorrect numbers, dates, or scientific details). Always verify factual claims independently.
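
If you want to sanity-check generated text against the rep4 figures quoted in the Benchmarks section, the sketch below shows one common way to compute a 4-gram repetition rate. This README does not specify the exact formula or tokenization behind the reported numbers, so treat this as an assumption: `rep_n` is a hypothetical helper that splits on whitespace and reports the fraction of 4-grams that duplicate an earlier one. Results from it may not match the table exactly.

```python
def rep_n(tokens, n=4):
    """Return the fraction of n-grams that duplicate another n-gram.

    Assumed definition (not confirmed by this README):
    1 - (unique n-grams / total n-grams).
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Sequence is shorter than n tokens, so there is nothing to repeat.
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


if __name__ == "__main__":
    # Whitespace tokenization is a simplification; a model tokenizer
    # would give different counts.
    text = "the cat sat on the mat and the cat sat on the mat again"
    print(f"rep4 = {rep_n(text.split()):.2%}")
```

Save the model output to a string, split it into tokens, and pass the list to `rep_n`. Lower values mean less repetition.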

## System Requirements

- **Storage:** ~4 GiB for model weights + ~50 MB for binaries
- **RAM:** 6 GiB minimum (8 GiB recommended)
- **GPU (recommended):** NVIDIA (CUDA 12), Apple Silicon, or any Vulkan GPU
- **CPU-only:** Supported but slower (use `-ngl 0` or `--cpu` flag)
- **OS:** Linux, macOS (Apple Silicon), Windows, Android (Termux)
- **Not supported:** iOS, 32-bit systems

## Safety and Responsible Use

This model can generate incorrect, biased, or harmful content. Users should apply appropriate content filtering for user-facing applications. See [MODEL_CARD.md](MODEL_CARD.md) for details.

## Derivative Works

If you create derivative works, credit **"EVR-1 Maano"** in your model name and documentation. Commercial use is permitted subject to the Llama 3.1 Community License Agreement.

## License

This model is dual-licensed:

1. **[Evrmind Free License 1.0](LICENSE.md)**: Covers the EVR-1 compression and distribution. Permits personal, research, and commercial use with attribution.
2. **[Llama 3.1 Community License](META_LLAMA_LICENSE.md)**: Covers the underlying Llama 3.1 weights. Permits commercial use for entities with fewer than 700 million monthly active users.

Both licenses apply. See [LICENSE.md](LICENSE.md) and [META_LLAMA_LICENSE.md](META_LLAMA_LICENSE.md) for full terms.

## Also Available

- **[EVR-1 Maano-8b](https://huggingface.co/evrmind/evr-1-maano-8b)**, base model for text completion
- **[EVR-1 Bafethu-8b-Reasoning](https://huggingface.co/evrmind/evr-1-bafethu-8b-reasoning)**, reasoning model (DeepSeek R1)

## Contact

- Email: hello@evrmind.io
- Issues: [GitHub](https://github.com/evrmind-uk/evr-llama/issues)