Instructions to use gbueno86/Brinebreath-Llama-3.1-70B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use gbueno86/Brinebreath-Llama-3.1-70B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="gbueno86/Brinebreath-Llama-3.1-70B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gbueno86/Brinebreath-Llama-3.1-70B")
model = AutoModelForCausalLM.from_pretrained("gbueno86/Brinebreath-Llama-3.1-70B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use gbueno86/Brinebreath-Llama-3.1-70B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "gbueno86/Brinebreath-Llama-3.1-70B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gbueno86/Brinebreath-Llama-3.1-70B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
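
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and it assumes the server is reachable at `http://localhost:8000` and returns the usual OpenAI-style chat completion response. Pass `base_url="http://localhost:30000"` to target an SGLang server instead.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="gbueno86/Brinebreath-Llama-3.1-70B"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes the OpenAI-compatible response shape:
    {"choices": [{"message": {"content": ...}}]}
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` returns the generated reply as a string.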

Use Docker

docker model run hf.co/gbueno86/Brinebreath-Llama-3.1-70B

SGLang

How to use gbueno86/Brinebreath-Llama-3.1-70B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "gbueno86/Brinebreath-Llama-3.1-70B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gbueno86/Brinebreath-Llama-3.1-70B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "gbueno86/Brinebreath-Llama-3.1-70B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gbueno86/Brinebreath-Llama-3.1-70B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use gbueno86/Brinebreath-Llama-3.1-70B with Docker Model Runner:
```
docker model run hf.co/gbueno86/Brinebreath-Llama-3.1-70B
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Brinebreath-Llama-3.1-70B

I made this merge after running into some problems with Cathallama. It has behaved well over several days of testing.

Notable Performance

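7 percentage point higher overall success rate on MMLU-PRO than Meta-Llama-3.1-70B-Instruct, both at Q4_0 (49.0% vs. 42.0%)
Higher score than the base model in every MMLU-PRO category tested (5 percentage points in each)
11 of 14 manual test cases passed, versus 7 of 14 for the base model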

Creation workflow

Models merged

meta-llama/Meta-Llama-3.1-70B-Instruct
NousResearch/Hermes-3-Llama-3.1-70B
abacusai/Dracarys-Llama-3.1-70B-Instruct
VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct

flowchart TD
    A[Hermes 3] -->|Merge with| B[Meta-Llama-3.1]
    C[Dracarys] -->|Merge with| D[Meta-Llama-3.1]
    B -->| | E[Merge]
    D -->| | E[Merge]
    G[SauerkrautLM] -->|Merge with| E[Merge]
    E[Merge] -->| | F[Brinebreath]

Testing

Hyperparameters

Temperature: 0.0 for automated, 0.9 for manual
Penalize repeat sequence: 1.05
Consider N tokens for penalize: 256
Penalize repetition of newlines
Top-K sampling: 40
Top-P sampling: 0.95
Min-P sampling: 0.05
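
For anyone reproducing these settings through the llama.cpp server API rather than its web UI, the sketch below shows one way the list above might translate into a request body. The mapping from the UI labels to the field names (`repeat_penalty`, `repeat_last_n`, `penalize_nl`, and so on) is an assumption based on the llama.cpp server's `/completion` parameters around this build, and `completion_payload` is a hypothetical helper; check the server README for your version before relying on it.

```python
import json

# Sampling settings from the list above, keyed by llama.cpp server field names.
# The label-to-field mapping is an assumption, not taken from the model card.
SAMPLING = {
    "repeat_penalty": 1.05,   # "Penalize repeat sequence"
    "repeat_last_n": 256,     # "Consider N tokens for penalize"
    "penalize_nl": True,      # "Penalize repetition of newlines" (assumed enabled)
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
}


def completion_payload(prompt, automated):
    """Return a request body using temperature 0.0 (automated) or 0.9 (manual)."""
    temperature = 0.0 if automated else 0.9
    return {"prompt": prompt, "temperature": temperature, **SAMPLING}


print(json.dumps(completion_payload("Who are you?", automated=True), indent=2))
```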

llama.cpp Version

Build: b3600-1-g2339a0be
Flags: -fa -ngl -1 -ctk f16 --no-mmap

Tested Files

Brinebreath-Llama-3.1-70B.Q4_0.gguf
Meta-Llama-3.1-70B-Instruct.Q4_0.gguf
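
As a rough guide to why Q4_0 files were used for testing, the weight footprint can be estimated from the bits stored per weight. This is a back-of-the-envelope sketch: the 70e9 parameter count is a nominal assumption, and 4.5 bits per weight follows from the Q4_0 block layout (32 four-bit weights plus one 16-bit scale). Real GGUF files differ somewhat because some tensors are stored at other precisions.

```python
PARAMS = 70e9      # nominal parameter count (assumption, not from the model card)
Q4_0_BITS = 4.5    # (32 * 4 bits + 16-bit scale) / 32 weights per block
BF16_BITS = 16


def weight_gb(params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


print(f"Q4_0: ~{weight_gb(PARAMS, Q4_0_BITS):.1f} GB")  # ~39.4 GB
print(f"BF16: ~{weight_gb(PARAMS, BF16_BITS):.1f} GB")  # ~140.0 GB
```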

Manual testing

Category	Test Case	Brinebreath-Llama-3.1-70B.Q4_0.gguf	Meta-Llama-3.1-70B-Instruct.Q4_0.gguf
Common Sense	Ball on cup	OK	OK
	Big duck small horse	OK	OK
	Killers	OK	OK
	Strawberry r's	KO	KO
	9.11 or 9.9 bigger	KO	KO
	Dragon or lens	KO	KO
	Shirts	OK	KO
	Sisters	OK	KO
	Jane faster	OK	OK
Programming	JSON	OK	OK
	Python snake game	OK	KO
Math	Door window combination	OK	KO
Smoke	Poem	OK	OK
	Story	OK	OK

Note: OK marks a passed test case and KO a failed one. See sample_generations.txt in the root folder of the repo for the raw generations.
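
The manual results above can be tallied with a few lines of Python. This sketch simply transcribes the table; the variable and function names are illustrative.

```python
# (category, test case, Brinebreath Q4_0, Meta-Llama-3.1-70B-Instruct Q4_0)
RESULTS = [
    ("Common Sense", "Ball on cup", "OK", "OK"),
    ("Common Sense", "Big duck small horse", "OK", "OK"),
    ("Common Sense", "Killers", "OK", "OK"),
    ("Common Sense", "Strawberry r's", "KO", "KO"),
    ("Common Sense", "9.11 or 9.9 bigger", "KO", "KO"),
    ("Common Sense", "Dragon or lens", "KO", "KO"),
    ("Common Sense", "Shirts", "OK", "KO"),
    ("Common Sense", "Sisters", "OK", "KO"),
    ("Common Sense", "Jane faster", "OK", "OK"),
    ("Programming", "JSON", "OK", "OK"),
    ("Programming", "Python snake game", "OK", "KO"),
    ("Math", "Door window combination", "OK", "KO"),
    ("Smoke", "Poem", "OK", "OK"),
    ("Smoke", "Story", "OK", "OK"),
]


def passed(column):
    """Count OK results in column 2 (Brinebreath) or 3 (Meta-Llama)."""
    return sum(1 for row in RESULTS if row[column] == "OK")


# Test cases where the two models disagree.
differing = [row[1] for row in RESULTS if row[2] != row[3]]

print(f"Brinebreath: {passed(2)}/{len(RESULTS)}")  # Brinebreath: 11/14
print(f"Meta-Llama:  {passed(3)}/{len(RESULTS)}")  # Meta-Llama:  7/14
print("Differ on:", ", ".join(differing))
```

The two models disagree on four cases (Shirts, Sisters, Python snake game, Door window combination), and Brinebreath passes all four.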

MMLU-PRO

Model	Success %
Brinebreath-Llama-3.1-70B.Q4_0.gguf	49.0%
Meta-Llama-3.1-70B-Instruct.Q4_0.gguf	42.0%

MMLU-PRO category	Brinebreath-Llama-3.1-70B.Q4_0.gguf	Meta-Llama-3.1-70B-Instruct.Q4_0.gguf
Business	45.0%	40.0%
Law	40.0%	35.0%
Psychology	85.0%	80.0%
Biology	80.0%	75.0%
Chemistry	50.0%	45.0%
History	65.0%	60.0%
Other	55.0%	50.0%
Health	70.0%	65.0%
Economics	80.0%	75.0%
Math	35.0%	30.0%
Physics	45.0%	40.0%
Computer Science	60.0%	55.0%
Philosophy	50.0%	45.0%
Engineering	45.0%	40.0%
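
The per-category gap can be checked directly from the table above. The sketch below transcribes the scores and computes the difference in percentage points for each category, plus an unweighted mean. The card does not state how the overall success rate was aggregated or how many questions each category contained, so the unweighted mean is an illustration only and is not expected to reproduce the 49.0% and 42.0% overall figures.

```python
# category: (Brinebreath Q4_0, Meta-Llama-3.1-70B-Instruct Q4_0), in percent
SCORES = {
    "Business": (45.0, 40.0),
    "Law": (40.0, 35.0),
    "Psychology": (85.0, 80.0),
    "Biology": (80.0, 75.0),
    "Chemistry": (50.0, 45.0),
    "History": (65.0, 60.0),
    "Other": (55.0, 50.0),
    "Health": (70.0, 65.0),
    "Economics": (80.0, 75.0),
    "Math": (35.0, 30.0),
    "Physics": (45.0, 40.0),
    "Computer Science": (60.0, 55.0),
    "Philosophy": (50.0, 45.0),
    "Engineering": (45.0, 40.0),
}

# Difference in percentage points per category.
deltas = {cat: brine - base for cat, (brine, base) in SCORES.items()}

# Unweighted means across the 14 categories (every category counted equally).
mean_brine = sum(b for b, _ in SCORES.values()) / len(SCORES)
mean_base = sum(m for _, m in SCORES.values()) / len(SCORES)

for cat, delta in deltas.items():
    print(f"{cat:17s} {delta:+.1f}")
print(f"Unweighted mean: {mean_brine:.1f}% vs {mean_base:.1f}%")  # 57.5% vs 52.5%
```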