How to use from
llama.cpp
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
Use Docker
docker model run hf.co/NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF:Q4_K_M
Quick Links

NB-Llama-3.1-8B-Q4_K_M-GGUF

This model is a quantized version of the original NB-Llama-3.1-8B, converted into the GGUF format using llama.cpp. Quantization significantly reduces the model's memory footprint, enabling efficient inference on a wide range of hardware, including personal devices, without compromising too much quality. These quantized models are mainly provided so that people can test out the models with moderate hardware. If you want to benchmark the models or further finetune the models, we strongly recommend the non-quantized versions.

What is llama.cpp?

llama.cpp is a versatile tool for running large language models optimized for efficiency. It supports multiple quantization formats (e.g., GGML and GGUF) and provides inference capabilities on diverse hardware, including CPUs, GPUs, and mobile devices. The GGUF format is the latest evolution, designed to enhance compatibility and performance.

Benefits of This Model

  • High Performance: Achieves similar quality to the original model while using significantly less memory.
  • Hardware Compatibility: Optimized for running on a variety of hardware, including low-resource systems.
  • Ease of Use: Seamlessly integrates with llama.cpp for fast and efficient inference.

Installation

Install llama.cpp using Homebrew (works on Mac and Linux):

brew install llama.cpp

Usage Instructions

Using with llama.cpp

To use this quantized model with llama.cpp, follow the steps below:

CLI:

llama-cli --hf-repo NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF --hf-file nb-llama-3.1-8b-q4_k_m.gguf -p "Your prompt here"

Server:

llama-server --hf-repo NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF --hf-file nb-llama-3.1-8b-q4_k_m.gguf -c 2048

For more information, refer to the llama.cpp repository.

Additional Resources

Citing & Authors

The model was trained and documentation written by Per Egil Kummervold

Funding and Acknowledgement

Training this model was supported by Google’s TPU Research Cloud (TRC), which generously supplied us with Cloud TPUs essential for our computational needs..

Downloads last month
9
GGUF
Model size
8B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including NbAiLab/nb-llama-3.1-8B-Q4_K_M-GGUF