llama-natural-instructions-13b / README.md

chainyo

Librarian Bot: Add base_model information to model (#4)

159d653 over 2 years ago

metadata

language:
  - en
library_name: transformers
tags:
  - peft
  - LoRA
datasets:
  - Muennighoff/natural-instructions
pipeline_tag: text-generation
base_model: decapoda-research/llama-13b-hf

LoRA LLaMA Natural Instructions

This model is a fine-tuned version of llama-13b from Meta, on the Natural Instructions dataset from AllenAI, using the LoRA training technique.

⚠️ This model is for research purposes only (see the license).

WandB Report

Click on the badge below to see the full report on Weights & Biases.

Usage

Installation

pip install loralib bitsandbytes datasets git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git sentencepiece

Format of the input

The input should be a string of text with the following format:

from typing import Union

prompt_template = {
    "prompt": "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
    "response": "### Response:",
}

def generate_prompt(
    definition: str,
    inputs: str,
    targets: Union[None, str] = None,
) -> str:
    """Generate a prompt from instruction and input."""
    res = prompt_template["prompt"].format(
        instruction=definition, input=inputs
    )

    if targets:
        res = f"{res}{targets}"

    return res

def get_response(output: str) -> str:
    """Get the response from the output."""
    return output.split(prompt_template["response"])[1].strip()

Feel free to use these utility functions to generate the prompt and to extract the response from the model output.

definition is the instruction describing the task. It's generally a single sentence explaining the expected output and the reasoning steps to follow.
inputs is the input to the task. It can be a single sentence or a paragraph. It's the context used by the model to generate the response to the task.
targets is the expected output of the task. It's used for training the model. It's not required for inference.
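
As a quick illustration of the template, the sketch below repeats the two helpers from above and prints the prompt they build. The sentiment task and the simulated model output are made up for illustration; they are not taken from the dataset or from a real model run.

```python
from typing import Optional

prompt_template = {
    "prompt": "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
    "response": "### Response:",
}


def generate_prompt(definition: str, inputs: str, targets: Optional[str] = None) -> str:
    """Generate a prompt from instruction and input."""
    res = prompt_template["prompt"].format(instruction=definition, input=inputs)
    if targets:
        res = f"{res}{targets}"
    return res


def get_response(output: str) -> str:
    """Get the response from the output."""
    return output.split(prompt_template["response"])[1].strip()


# Hypothetical task, for illustration only.
prompt = generate_prompt(
    "Classify the sentiment of the sentence as positive or negative.",
    "I loved this movie.",
)
print(prompt)

# Simulate a decoded model output: the prompt followed by the generated answer.
simulated_output = prompt + "positive"
print(get_response(simulated_output))
```

At inference time, leave targets unset so the prompt ends right after the response marker and the model continues from there.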

Inference

You can either load the base model and apply the LoRA adapters to it, or load the full model (adapters and weights) directly from this repository.

The tokenizer

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("wordcab/llama-natural-instructions-13b")
# Pad on the left so that generation continues right after the prompt.
tokenizer.padding_side = "left"
tokenizer.pad_token_id = 0

Load the model with the adapters

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    model,
    "wordcab/llama-natural-instructions-13b",
    torch_dtype=torch.float16,
    device_map={"": 0},
)

Load the full model

import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "wordcab/llama-natural-instructions-13b",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

Evaluation mode

Don't forget to put the model in evaluation mode. If you are using PyTorch 2.0 or higher, also call torch.compile on the model.

model.eval()
# Compare the major version as an integer; comparing version strings is fragile.
if int(torch.__version__.split(".")[0]) >= 2:
    model = torch.compile(model)

Generate the response

from transformers import GenerationConfig

# Build the prompt with the generate_prompt helper defined above.
prompt = generate_prompt(
    "In this task, you have to analyze the full sentences and do reasoning and quick maths to find the correct answer.",
    "You are now a superbowl star. You are the quarterback of the team. Your team is down by 3 points. You are in the last 2 minutes of the game. The other team has a score of 28. What is the score of your team?",
)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=2048)
input_ids = inputs["input_ids"].to(model.device)

generation_config = GenerationConfig(
    temperature=0.2,
    top_p=0.75,
    top_k=40,
    num_beams=4,
)

with torch.no_grad():
    gen_outputs = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=50,
    )

# Decode the first returned sequence and keep only the text after the response marker.
s = gen_outputs.sequences[0]
output = tokenizer.decode(s, skip_special_tokens=True)
response = get_response(output)
print(response)
# Output: 25

You can also try other prompts that are not math-related! 🤗
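
One thing to watch for: get_response indexes the second element of the split, so it raises an IndexError when the response marker is missing from the decoded text. This may happen, for example, if a very long prompt is truncated before the marker. The sketch below is a defensive variant; get_response_safe and its default parameter are hypothetical names introduced here, not part of the original utilities.

```python
RESPONSE_MARKER = "### Response:"


def get_response_safe(output: str, default: str = "") -> str:
    """Return the text after the first response marker, or `default` if the marker is absent."""
    parts = output.split(RESPONSE_MARKER)
    if len(parts) < 2:
        # No marker found: fall back instead of raising IndexError.
        return default
    # Same behavior as the original helper when the marker is present.
    return parts[1].strip()


print(get_response_safe("### Instruction:\nAdd 2 and 3.\n\n### Response:\n5"))
print(repr(get_response_safe("text without any marker")))
```

Returning a default keeps batch evaluation loops running; whether an empty string or an explicit error is preferable depends on how the outputs are consumed.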

Benchmark

We benchmarked our model on the following tasks: BoolQ, PIQA, WinoGrande, OpenBookQA.

Model	BoolQ	PIQA	WinoGrande	OpenBookQA	Precision	Inference time (s)
Original LLaMA 7B	76.5	79.8	70.1	57.2	fp32	3
Original LLaMA 13B	78.1	80.1	73	56.4	fp32	>5
LoRA LLaMA 7B	63.9	51.3	48.9	31.4	8bit	0.65
LoRA LLaMA 13B	70	63.93	51.6	50.4	8bit	1.2

Link to the 7B model: wordcab/llama-natural-instructions-7b

Overall, our LoRA models score lower than the original models from Meta when compared against the results reported in the original paper.

We attribute the performance degradation to loading the model in 8bit and to using the adapters from the LoRA training. Thanks to the 8bit quantization, the model is roughly 4 times faster than the original model, and the results are still decent.

Some complex tasks, such as WinoGrande and OpenBookQA, are more difficult to solve with the adapters.
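
To make the trade-off concrete, the short sketch below recomputes the speedups and the score gaps from the numbers in the benchmark table. The original 13B inference time is reported only as ">5" seconds, so 5.0 is used here as a lower bound (an assumption), which makes the 13B speedup a conservative estimate.

```python
# Inference times in seconds, copied from the benchmark table.
# The original 13B time is given as ">5", so 5.0 is a lower bound (assumption).
times = {
    "7B": {"original_fp32": 3.0, "lora_8bit": 0.65},
    "13B": {"original_fp32": 5.0, "lora_8bit": 1.2},
}

# Benchmark scores copied from the table: (original, LoRA) per task.
scores = {
    "7B": {"BoolQ": (76.5, 63.9), "PIQA": (79.8, 51.3), "WinoGrande": (70.1, 48.9), "OpenBookQA": (57.2, 31.4)},
    "13B": {"BoolQ": (78.1, 70.0), "PIQA": (80.1, 63.93), "WinoGrande": (73.0, 51.6), "OpenBookQA": (56.4, 50.4)},
}

speedups = {size: t["original_fp32"] / t["lora_8bit"] for size, t in times.items()}
gaps = {
    size: {task: round(orig - lora, 2) for task, (orig, lora) in tasks.items()}
    for size, tasks in scores.items()
}

for size in times:
    print(f"{size}: about {speedups[size]:.1f}x faster, score gaps: {gaps[size]}")
```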

Training Hardware

This model was trained on a GCP instance with 16x NVIDIA A100 40GB GPUs.