How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
# Run inference directly in the terminal:
llama-cli -hf InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
# Run inference directly in the terminal:
llama-cli -hf InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
# Run inference directly in the terminal:
./llama-cli -hf InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
# Run inference directly in the terminal:
./build/bin/llama-cli -hf InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
Use Docker
docker model run hf.co/InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF:
Quick Links

Mistral-Nemo-Instruct-12B-iMat-GGUF

Important Note: Inferencing in llama.cpp has now been merged in PR #8604. Please ensure you are on release b3438 or newer. Text-generation-web-ui (Ooba) is also working as of 7/23. Kobold.cpp working as of v1.71.

Quantized from Mistral-Nemo-Instruct-2407 fp16

  • Weighted quantizations were creating using fp16 GGUF and groups_merged.txt in 92 chunks and n_ctx=512
  • Static fp16 will also be included in repo
  • For a brief rundown of iMatrix quant performance please see this PR
  • All quants are verified working prior to uploading to repo for your safety and convenience

KL-Divergence Reference Chart (Click on image to view in full size)

Quant-specific Tips:

  • If you are getting a cudaMalloc failed: out of memory error, try passing an argument for lower context in llama.cpp, e.g. for 8k: -c 8192
  • If you have all ampere generation or newer cards, you can use flash attention like so: -fa
  • Provided Flash Attention is enabled you can also use quantized cache to save on VRAM e.g. for 8-bit: -ctk q8_0 -ctv q8_0
  • Mistral recommends a temperature of 0.3 for this model

Original model card can be found here

Downloads last month
256
GGUF
Model size
12B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF

Collection including InferenceIllusionist/Mistral-Nemo-Instruct-12B-iMat-GGUF