Unable to load checkpoint shards

#21

by Tilakraj0308 - opened Sep 29, 2023

Discussion

Tilakraj0308

Sep 29, 2023

I got an error along these lines:
.cache\huggingface\hub\models--mistralai--Mistral-7B-Instruct-v0.1\snapshots\d635d39671aaceec5ef84b745bc21625b324b7f8\pytorch_model-00001-of-00002.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

19Peppe95

Oct 3, 2023

I have the same issue, any news?

Tilakraj0308

Oct 3, 2023

@19Peppe95 The error occurs because the system runs out of RAM when loading the model in one go.
You can use CTransformers to load the model, or try a GGUF version of the model, which is basically a much smaller (quantized) version of it.
In short: use the GGUF version of this model, https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF,
and load it from your downloads with CTransformers. Hopefully that works.
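
For anyone who wants to try this route, here is a minimal sketch of loading a GGUF file with CTransformers. It is not taken from this thread: the file name `mistral-7b-instruct-v0.1.Q4_K_M.gguf` and the `gpu_layers=0` setting are assumptions, so check the file list in the GGUF repository and pick the quantization that fits your RAM. The helper names `build_prompt` and `load_gguf_model` are hypothetical.

```python
# Sketch: load a quantized GGUF build with ctransformers instead of the full checkpoint.
# Assumptions (not from the thread): MODEL_FILE and gpu_layers=0.

REPO_ID = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
MODEL_FILE = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"  # assumed name; verify in the repo


def build_prompt(user_message: str) -> str:
    """Wrap one user turn in the [INST] ... [/INST] instruct format."""
    return f"[INST] {user_message} [/INST]"


def load_gguf_model():
    """Download the GGUF file if needed and load it on CPU."""
    # Imported lazily so the helper above works without the package installed.
    from ctransformers import AutoModelForCausalLM  # pip install ctransformers

    return AutoModelForCausalLM.from_pretrained(
        REPO_ID,
        model_file=MODEL_FILE,
        model_type="mistral",
        gpu_layers=0,  # CPU only; raise this to offload layers to a GPU
    )
```

Usage would look like `llm = load_gguf_model()` followed by `print(llm(build_prompt("Who are you?"), max_new_tokens=64))`. The first call downloads several gigabytes.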

Starlento

Oct 13, 2023 (edited Oct 17, 2023)

> @19Peppe95 The error occurs because the system runs out of RAM when loading the model in one go.
> You can use CTransformers to load the model, or try a GGUF version of the model, which is basically a much smaller (quantized) version of it.
> In short: use the GGUF version of this model, https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF,
> and load it from your downloads with CTransformers. Hopefully that works.

~~I also have the RAM issue. But it is weird that the model is only 14GB in total and I have 64GB RAM and 24GB VRAM available.~~
Just found out that there was a download issue: the .bin files were corrupted, so memory usage while loading them became uncontrollable.
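
One way to check for a broken download is to compare each shard's SHA-256 against the value shown on the file's page in the model repository. The sketch below assumes you have copied those expected hashes by hand; the function names and the `expected_hashes` mapping are hypothetical, not part of any library.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large shards never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_shards(snapshot_dir, expected_hashes):
    """Return the file names that are missing or whose hash does not match.

    expected_hashes maps file name -> SHA-256 hex string (assumed to be
    copied manually from the repository's file pages).
    """
    bad = []
    for name, expected in expected_hashes.items():
        path = Path(snapshot_dir) / name
        if not path.is_file() or sha256_of_file(path) != expected.lower():
            bad.append(name)
    return bad
```

If a shard turns up in the returned list, deleting it from the cache or passing `force_download=True` to `from_pretrained` should fetch a fresh copy.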

Tilakraj0308 changed discussion status to closed Oct 16, 2023

lysandre

Oct 16, 2023

cc @ybelkada regarding low-memory methods to load larger models

ybelkada

Oct 16, 2023

Hi everyone,
If you are facing CPU OOM issues while loading the model, please consider using a sharded checkpoint with small shards. For this model I would recommend this repository: https://huggingface.co/bn22/Mistral-7B-Instruct-v0.1-sharded
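
As a rough illustration of why loading can exhaust CPU RAM, the sketch below estimates the size of the weights alone at different precisions and shows one common way to reduce peak memory with Transformers. The parameter count of about 7.24 billion is an approximation, the function names are hypothetical, and `low_cpu_mem_usage=True` needs the `accelerate` package installed. Whether these settings would have fixed the error reported in this thread is not confirmed.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}
APPROX_PARAMS = 7.24e9  # approximate parameter count for Mistral 7B (assumption)


def weights_gib(num_params, dtype):
    """Memory needed for the weights alone, in GiB (no activations or overhead)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def load_low_memory(repo_id="mistralai/Mistral-7B-Instruct-v0.1"):
    """Load in half precision while avoiding a second full copy of the weights."""
    # Imported lazily so the estimate above runs without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.float16,  # avoid upcasting the weights to float32
        low_cpu_mem_usage=True,     # requires accelerate
    )


for dtype in ("float32", "float16"):
    print(f"{dtype}: about {weights_gib(APPROX_PARAMS, dtype):.1f} GiB of weights")
```

Smaller shards help because each shard file is read into memory in turn, so the temporary overhead on top of the model itself is bounded by the largest shard rather than by a multi-gigabyte file.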

shantanudave

Mar 18, 2024

@ybelkada Hi, the model you shared is not available anymore? :(

ybelkada

Mar 18, 2024

Hi @shantanudave,
Indeed, please use https://huggingface.co/alexsherstinsky/Mistral-7B-v0.1-sharded instead.
