How to use from
Hermes Agent
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF:F16
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF:F16
Run Hermes
hermes
Quick Links

Llama-3.1-8B-Instruct-FP16-GGUF

Description

Llama 3.1 8B Instruct in FP16 (baseline)

File: llama-3.1-8b-instruct-f16.gguf
Size: 16G
Format: GGUF
Category: baseline

Quick Start

Download

huggingface-cli download annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF llama-3.1-8b-instruct-f16.gguf --local-dir .

Usage with llama.cpp

./llama-cli -m llama-3.1-8b-instruct-f16.gguf -p "Explain quantum computing" -n 128

Benchmark

./llama-bench -m llama-3.1-8b-instruct-f16.gguf -r 3

Project: AI on Edge Devices

This model is part of an LLM compression research project.

Pipeline

  1. Pruning: 20% structured Taylor pruning (MLP layers only)
  2. SmoothQuant: Activation smoothing for stable quantization
  3. Mixed Precision: Sensitivity-based bit-width allocation (Q4/Q5/Q6)

Results

  • Size: 73.6% reduction (15 GB โ†’ 4 GB)
  • Speed: 273% faster inference (1.16 โ†’ 4.33 tok/s)
  • Deployment: Successfully runs on Raspberry Pi 4

Model Card

Created by: Group 2 (Annus, Arslan, Naveed, Danyal)
Institution: LUMS
Date: December 2024

Citation

@misc{llama31-compressed-Llama-3.1-8B-Instruct-FP16-GGUF,
  author = {Group 2},
  title = {Llama-3.1-8B-Instruct-FP16-GGUF},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF}
}
Downloads last month
4
GGUF
Model size
8B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for annus-lums/Llama-3.1-8B-Instruct-FP16-GGUF

Quantized
(644)
this model