Instructions to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF")

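# Load model directly; a GGUF repo needs gguf_file (file name taken from the llama-cpp-python example)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF", gguf_file="Romulus-cpt-Llama-3.1-8B-v0.1.Q2_K.gguf", dtype="auto")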

llama-cpp-python

How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF",
	filename="Romulus-cpt-Llama-3.1-8B-v0.1.Q2_K.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)
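
The call above returns an OpenAI-style completion dictionary rather than a plain string. The following is a minimal sketch of pulling out the generated text, assuming the usual `choices[0]["text"]` layout of a non-streaming completion. The helper name `completion_text` and the sample dictionary are hypothetical and only stand in for a real `llm(...)` result.

```python
def completion_text(output: dict) -> str:
    """Return the text of the first choice in a completion-style dict."""
    choices = output.get("choices", [])
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["text"]


# Hand-written stand-in for the dict returned by llm(...); not real model output.
sample_output = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": "Once upon a time, ...", "finish_reason": "length"}
    ],
}

print(completion_text(sample_output))
```

With `echo=True`, as in the example above, the returned text starts with the prompt itself.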

Local Apps

llama.cpp

How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M

Use Docker

docker model run hf.co/QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M


vLLM

How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
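
The same request can be sent from Python. This is a sketch using only the standard library, assuming the server exposes the OpenAI-compatible `/v1/completions` route shown in the curl call. The function names `build_completion_request` and `complete` are hypothetical helpers, and the network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str,
    model: str,
    prompt: str,
    max_tokens: int = 512,
    temperature: float = 0.5,
) -> urllib.request.Request:
    """Build a POST request mirroring the curl example's JSON payload."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8000",
    "QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF",
    "Once upon a time,",
)
# print(complete(request))  # uncomment once the server is running
```

Changing `base_url` to `http://localhost:30000` targets the SGLang server from the section that follows, since its curl call uses the same route and payload.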

Use Docker

docker model run hf.co/QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M

SGLang

How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Ollama
How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with Ollama:
```
ollama run hf.co/QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M
```

Unsloth Studio

How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF to start chatting

Docker Model Runner
How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with Docker Model Runner:
```
docker model run hf.co/QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M
```

Lemonade

How to use QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Romulus-cpt-Llama-3.1-8B-v0.1-GGUF-Q4_K_M

List all available models

lemonade list

QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF

This is a quantized version of louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1, created using llama.cpp.

Original Model Card

Romulus, continually pre-trained models for French law.

Romulus is a series of continually pre-trained models enriched in French law and intended to serve as the basis for a fine-tuning process on labeled data. Please note that these models have not been aligned for the production of usable text as they stand, and will certainly need to be fine-tuned for the desired tasks in order to produce satisfactory results.

The training corpus contains 34,864,949 tokens, as counted with the meta-llama/Meta-Llama-3.1-8B tokenizer.
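
A corpus token count of this kind is typically obtained by encoding every document and summing the lengths. The sketch below shows that bookkeeping with a pluggable `encode` callable. The whitespace tokenizer and the two sample strings are toy stand-ins invented for illustration. In practice `encode` would be the `encode` method of the meta-llama/Meta-Llama-3.1-8B tokenizer, and whether special tokens were included in the reported figure is not stated in this card.

```python
from typing import Callable, Iterable, Sequence


def count_tokens(texts: Iterable[str], encode: Callable[[str], Sequence[int]]) -> int:
    """Sum the number of token ids produced by `encode` over all texts."""
    return sum(len(encode(text)) for text in texts)


def whitespace_encode(text: str) -> list:
    """Toy tokenizer: one dummy token id per whitespace-separated word."""
    return list(range(len(text.split())))


# Invented sample documents, not taken from the training corpus.
corpus = ["Article 1 du Code civil", "Les lois sont exécutoires"]

print(count_tokens(corpus, whitespace_encode))  # 5 + 4 = 9
```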

Hyperparameters

The following table outlines the key hyperparameters used for training Romulus.

| Parameter | Description | Value |
|---|---|---|
| `max_seq_length` | Maximum sequence length for the model | 4096 |
| `load_in_4bit` | Whether to load the model in 4-bit precision | False |
| `model_name` | Pre-trained model name from Hugging Face | meta-llama/Meta-Llama-3.1-8B |
| `r` | Rank of the LoRA adapter | 128 |
| `lora_alpha` | Alpha value for the LoRA module | 32 |
| `lora_dropout` | Dropout rate for LoRA layers | 0 |
| `bias` | Bias type for LoRA adapters | none |
| `use_gradient_checkpointing` | Whether to use gradient checkpointing | unsloth |
| `train_batch_size` | Per device training batch size | 8 |
| `gradient_accumulation_steps` | Number of gradient accumulation steps | 8 |
| `warmup_ratio` | Warmup steps as a fraction of total steps | 0.1 |
| `num_train_epochs` | Number of training epochs | 1 |
| `learning_rate` | Learning rate for the model | 5e-5 |
| `embedding_learning_rate` | Learning rate for embeddings | 1e-5 |
| `optim` | Optimizer used for training | adamw_8bit |
| `weight_decay` | Weight decay to prevent overfitting | 0.01 |
| `lr_scheduler_type` | Type of learning rate scheduler | linear |
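
Two derived quantities follow from these values: the effective batch size and the LoRA scaling factor. The sketch below works them out, assuming a single training device and the rank-stabilized scaling `lora_alpha / sqrt(r)`, which applies because the training script sets `use_rslora=True`. The variable names are chosen for illustration only.

```python
import math

# Values copied from the hyperparameter table.
per_device_batch_size = 8
gradient_accumulation_steps = 8
max_seq_length = 4096
r = 128
lora_alpha = 32

# Sequences contributing to each optimizer step (single device assumed).
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Upper bound on tokens per optimizer step, reached only if every sequence is full length.
max_tokens_per_step = effective_batch_size * max_seq_length

# Classic LoRA scales the adapter update by alpha / r.
classic_scaling = lora_alpha / r

# Rank-stabilized LoRA scales by alpha / sqrt(r) instead.
rslora_scaling = lora_alpha / math.sqrt(r)

print(effective_batch_size)      # 64
print(max_tokens_per_step)       # 262144
print(classic_scaling)           # 0.25
print(round(rslora_scaling, 2))  # 2.83
```

With a rank as high as 128, the rank-stabilized factor is roughly eleven times larger than the classic one, so the two settings are not interchangeable at the same `lora_alpha`.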

Training script

Romulus was trained with Unsloth on an NVIDIA H100 Azure EST US instance, provided by the Microsoft for Startups program, using the following script:

# -*- coding: utf-8 -*-

from datasets import load_dataset
from unsloth import (
    FastLanguageModel,
    is_bfloat16_supported,
    UnslothTrainer,
    UnslothTrainingArguments,
)

max_seq_length = 4096
dtype = None
load_in_4bit = False

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token="hf_token",  # placeholder: substitute your own Hugging Face access token
)

model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "embed_tokens",
        "lm_head",
    ],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=True,
    loftq_config=None,
)

prompt = """### Référence :
{}
### Contenu :
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """
    Format input examples into prompts for a language model.

    This function takes a dictionary of examples containing references and texts,
    combines them into formatted prompts, and appends an end-of-sequence token.

    Parameters
    ----------
    examples : dict
        A dictionary containing two keys:
        - 'ref': A list of references.
        - 'texte': A list of corresponding text content.

    Returns
    -------
    dict
        A dictionary with a single key 'text', containing a list of formatted prompts.

    Notes
    -----
    - The function assumes the existence of a global `prompt` variable, which is a
      formatting string used to combine the reference and text.
    - The function also assumes the existence of a global `EOS_TOKEN` variable,
      which is appended to the end of each formatted prompt.
    - The input lists 'ref' and 'texte' are expected to have the same length.

    Examples
    --------
    >>> examples = {
    ...     'ref': ['Reference 1', 'Reference 2'],
    ...     'texte': ['Content 1', 'Content 2']
    ... }
    >>> formatting_prompts_func(examples)
    {'text': ['<formatted_prompt_1><EOS>', '<formatted_prompt_2><EOS>']}
    """
    refs = examples["ref"]
    texts = examples["texte"]
    outputs = []

    for ref, text in zip(refs, texts):
        text = prompt.format(ref, text) + EOS_TOKEN
        outputs.append(text)

    return {
        "text": outputs,
    }


cpt_dataset = load_dataset(
    "louisbrulenaudet/Romulus-cpt-fr",
    split="train",
    token="hf_token",  # placeholder: substitute your own Hugging Face access token
)

cpt_dataset = cpt_dataset.map(
    formatting_prompts_func,
    batched=True,
)

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=cpt_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=1e-5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        report_to="wandb",
        save_steps=350,
        run_name="romulus-cpt",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()
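
To make the training format concrete, the sketch below applies the same template to a single made-up record without loading the model or the dataset. The end-of-sequence string `<|end_of_text|>` is an assumption about the base tokenizer's `eos_token`, and the sample reference and text are invented placeholders, not rows from louisbrulenaudet/Romulus-cpt-fr.

```python
# Same template as in the training script.
prompt = """### Référence :
{}
### Contenu :
{}"""

# Assumption: stand-in for tokenizer.eos_token of the base model.
EOS_TOKEN = "<|end_of_text|>"


def formatting_prompts_func(examples: dict) -> dict:
    """Combine each reference and text into one prompt and append the EOS token."""
    outputs = [
        prompt.format(ref, texte) + EOS_TOKEN
        for ref, texte in zip(examples["ref"], examples["texte"])
    ]
    return {"text": outputs}


# Invented placeholder record using the dataset's column names.
batch = {"ref": ["Code civil, art. 1"], "texte": ["Texte de l'article."]}

formatted = formatting_prompts_func(batch)
print(formatted["text"][0])
```

Because the model has only seen text in this layout during continued pre-training, prompts that follow the same `### Référence :` / `### Contenu :` structure are the closest match to its training distribution.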

Citing & Authors

If you use this code in your research, please cite it using the following BibTeX entry.

@misc{louisbrulenaudet2024,
  author =       {Louis Brulé Naudet},
  title =        {Romulus, continually pre-trained models for French law},
  year =         {2024},
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/Romulus-cpt-fr}},
}

Feedback

If you have any feedback, please reach out at louisbrulenaudet@icloud.com.

GGUF

Model size: 8B params

Architecture: llama

Quantization levels: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
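
A rough way to relate these bit widths to memory needs is parameters times bits per weight, divided by eight. The sketch below does that arithmetic under the assumption of exactly 8 billion parameters and a uniform bit width. Real GGUF files are larger than these figures, because k-quant formats mix precisions across tensors and store per-block scales, so treat the numbers as lower bounds rather than file sizes.

```python
PARAMS = 8e9  # assumption: the "8B params" figure taken as exactly 8 billion


def naive_size_gb(bits_per_weight: int) -> float:
    """Weights-only size in GB (10^9 bytes) at a uniform bit width."""
    return PARAMS * bits_per_weight / 8 / 1e9


for bits in (2, 3, 4, 5, 6, 8):
    print(f"{bits}-bit: at least ~{naive_size_gb(bits):.1f} GB")
```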

Model tree for QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF

Base model: meta-llama/Llama-3.1-8B

Finetuned: meta-llama/Llama-3.1-8B-Instruct

Quantized (644): this model