QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF
This is a quantized version of louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1, created using llama.cpp.
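As a rough illustration of how one of the quantized files might be loaded from Python, here is a minimal sketch using llama-cpp-python. The file naming pattern `Romulus-cpt-Llama-3.1-8B-v0.1.<QUANT>.gguf`, the default `Q4_K_M` label, and the helper names `gguf_filename` and `load_quantized` are assumptions made for this example; check the repository file list before relying on them.

```python
REPO_ID = "QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF"
BASE_NAME = "Romulus-cpt-Llama-3.1-8B-v0.1"


def gguf_filename(quant: str) -> str:
    """Build the assumed GGUF file name for a quantization label such as 'Q4_K_M'."""
    return f"{BASE_NAME}.{quant}.gguf"


def load_quantized(quant: str = "Q4_K_M", n_ctx: int = 4096):
    """Download (if needed) and load one quantized file from the Hugging Face Hub."""
    # Imported lazily so gguf_filename() works without llama-cpp-python installed.
    from llama_cpp import Llama

    return Llama.from_pretrained(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        n_ctx=n_ctx,  # forwarded to the Llama constructor
    )
```

Calling `llm = load_quantized()` and then `llm("...", max_tokens=128)` should return a completion dictionary whose text is under `["choices"][0]["text"]`. Since the model is a continually pre-trained base model, expect raw continuations rather than chat-style answers.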
Original Model Card
Romulus, continually pre-trained models for French law.
Romulus is a series of continually pre-trained models enriched in French law and intended to serve as the basis for a fine-tuning process on labeled data. Please note that these models have not been aligned for the production of usable text as they stand, and will certainly need to be fine-tuned for the desired tasks in order to produce satisfactory results.
The training corpus contains 34,864,949 tokens (counted with the meta-llama/Meta-Llama-3.1-8B tokenizer).
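A token count of this kind could be reproduced along the following lines. This is only a sketch: the helpers `count_tokens` and `llama_encode` are hypothetical, and the card does not say whether special tokens or the prompt template were included in the count, so the result may differ from the figure above.

```python
from typing import Callable, Iterable, Sequence


def count_tokens(texts: Iterable[str], encode: Callable[[str], Sequence]) -> int:
    """Sum the number of tokens produced by `encode` over all texts."""
    return sum(len(encode(text)) for text in texts)


def llama_encode() -> Callable[[str], Sequence]:
    """Return an encoder backed by the Llama 3.1 tokenizer (gated; needs Hub access)."""
    # Imported lazily so count_tokens() can be used with any encoder.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
    # Assumption: special tokens are excluded from the count.
    return lambda text: tokenizer.encode(text, add_special_tokens=False)
```

With access to the gated tokenizer, `count_tokens(corpus_texts, llama_encode())` would give the total for an iterable of strings named `corpus_texts` (a placeholder name).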
Hyperparameters
The following table outlines the key hyperparameters used for training Romulus.
| Parameter | Description | Value |
|---|---|---|
| `max_seq_length` | Maximum sequence length for the model | 4096 |
| `load_in_4bit` | Whether to load the model in 4-bit precision | False |
| `model_name` | Pre-trained model name from Hugging Face | meta-llama/Meta-Llama-3.1-8B |
| `r` | Rank of the LoRA adapter | 128 |
| `lora_alpha` | Alpha value for the LoRA module | 32 |
| `lora_dropout` | Dropout rate for LoRA layers | 0 |
| `bias` | Bias type for LoRA adapters | none |
| `use_gradient_checkpointing` | Whether to use gradient checkpointing | unsloth |
| `train_batch_size` | Per device training batch size | 8 |
| `gradient_accumulation_steps` | Number of gradient accumulation steps | 8 |
| `warmup_ratio` | Warmup steps as a fraction of total steps | 0.1 |
| `num_train_epochs` | Number of training epochs | 1 |
| `learning_rate` | Learning rate for the model | 5e-5 |
| `embedding_learning_rate` | Learning rate for embeddings | 1e-5 |
| `optim` | Optimizer used for training | adamw_8bit |
| `weight_decay` | Weight decay to prevent overfitting | 0.01 |
| `lr_scheduler_type` | Type of learning rate scheduler | linear |
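To make the interaction between these values concrete, the sketch below derives a few quantities from the table. It assumes a single training device, uses the `use_rslora=True` setting from the training script (so the LoRA scaling is `lora_alpha / sqrt(r)` rather than `lora_alpha / r`), and the `schedule` helper with its example dataset size is hypothetical; the step counts are approximations of what the trainer would compute.

```python
import math
from typing import Tuple

per_device_train_batch_size = 8
gradient_accumulation_steps = 8
max_seq_length = 4096
warmup_ratio = 0.1
r = 128
lora_alpha = 32

# Sequences contributing to one optimizer step on a single device.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Upper bound on tokens per optimizer step (each sequence has at most max_seq_length tokens).
max_tokens_per_step = effective_batch_size * max_seq_length

# LoRA scaling factor: alpha / r by default, alpha / sqrt(r) with rank-stabilized LoRA.
standard_scaling = lora_alpha / r
rslora_scaling = lora_alpha / math.sqrt(r)


def schedule(num_examples: int, num_devices: int = 1) -> Tuple[int, int]:
    """Approximate (optimizer steps, warmup steps) for a single epoch."""
    steps = math.ceil(num_examples / (effective_batch_size * num_devices))
    return steps, math.ceil(warmup_ratio * steps)


print(effective_batch_size, max_tokens_per_step)
print(standard_scaling, round(rslora_scaling, 4))
print(schedule(10_000))  # 10_000 is an illustrative dataset size, not the real one
```

With a rank of 128, the rank-stabilized factor is more than ten times the default one, which is worth keeping in mind when comparing these settings with configurations that do not enable rsLoRA.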
Training script
Romulus was trained with Unsloth on an NVIDIA H100 instance (Azure, EST US) provided by the Microsoft for Startups program, using the following script:
# -*- coding: utf-8 -*-
import os
from typing import (
    Dict,
)
from datasets import load_dataset
from unsloth import (
    FastLanguageModel,
    is_bfloat16_supported,
    UnslothTrainer,
    UnslothTrainingArguments,
)
max_seq_length = 4096
dtype = None
load_in_4bit = False
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token="hf_token",
)
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "embed_tokens",
        "lm_head",
    ],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=True,
    loftq_config=None,
)
prompt = """### Référence :
{}
### Contenu :
{}"""
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    """
    Format input examples into prompts for a language model.

    This function takes a batch of examples containing references and texts,
    combines them into formatted prompts, and appends an end-of-sequence token.

    Parameters
    ----------
    examples : dict
        A dictionary containing two keys:
        - 'ref': A list of references.
        - 'texte': A list of corresponding text content.

    Returns
    -------
    dict
        A dictionary with a single key 'text', containing a list of formatted prompts.

    Notes
    -----
    - The function assumes the existence of a global `prompt` variable, which is a
      formatting string used to combine the reference and text.
    - The function also assumes the existence of a global `EOS_TOKEN` variable,
      which is appended to the end of each formatted prompt.
    - The input lists 'ref' and 'texte' are expected to have the same length.

    Examples
    --------
    >>> examples = {
    ...     'ref': ['Reference 1', 'Reference 2'],
    ...     'texte': ['Content 1', 'Content 2']
    ... }
    >>> formatting_prompts_func(examples)
    {'text': ['<formatted_prompt_1><EOS>', '<formatted_prompt_2><EOS>']}
    """
    refs = examples["ref"]
    texts = examples["texte"]
    outputs = []

    for ref, text in zip(refs, texts):
        # Fill the template, then mark the end of the document for the model.
        text = prompt.format(ref, text) + EOS_TOKEN
        outputs.append(text)

    return {
        "text": outputs,
    }
cpt_dataset = load_dataset(
    "louisbrulenaudet/Romulus-cpt-fr",
    split="train",
    token="hf_token",
)
cpt_dataset = cpt_dataset.map(
    formatting_prompts_func,
    batched=True,
)
trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=cpt_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=1e-5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        report_to="wandb",
        save_steps=350,
        run_name="romulus-cpt",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
trainer_stats = trainer.train()
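The prompt template used in the script can be tried in isolation, without loading the model or the dataset. The following is a small sketch under stated assumptions: `EOS_TOKEN` is a stand-in string (the script reads it from `tokenizer.eos_token`), `format_batch` is a hypothetical standalone copy of the formatting logic, and the sample reference and text are invented for illustration.

```python
PROMPT = """### Référence :
{}
### Contenu :
{}"""

# Stand-in value; the training script uses tokenizer.eos_token instead.
EOS_TOKEN = "<|end_of_text|>"


def format_batch(examples: dict) -> dict:
    """Apply the template to a batch with 'ref' and 'texte' columns."""
    return {
        "text": [
            PROMPT.format(ref, texte) + EOS_TOKEN
            for ref, texte in zip(examples["ref"], examples["texte"])
        ]
    }


# Invented sample record, shaped like a batch passed by datasets.map(batched=True).
batch = {
    "ref": ["Article 1 (exemple)"],
    "texte": ["Texte de l'article."],
}
formatted = format_batch(batch)
print(formatted["text"][0])
```

Because the model was trained on text in this shape, prompting it with a reference followed by an empty `### Contenu :` section is a plausible way to elicit a continuation, although the card does not prescribe an inference format.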
Citing & Authors
If you use this code in your research, please use the following BibTeX entry.
@misc{louisbrulenaudet2024,
  author = {Louis Brulé Naudet},
  title = {Romulus, continually pre-trained models for French law},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/Romulus-cpt-fr}},
}
Feedback
If you have any feedback, please reach out at louisbrulenaudet@icloud.com.