Instructions for using upstage/llama-30b-instruct with libraries and local apps.

Libraries

How to use upstage/llama-30b-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="upstage/llama-30b-instruct")

# Load model directly (this is a text-only LLaMA model, so use the causal LM class)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("upstage/llama-30b-instruct")
model = AutoModelForCausalLM.from_pretrained("upstage/llama-30b-instruct")

Local Apps

vLLM

How to use upstage/llama-30b-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "upstage/llama-30b-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "upstage/llama-30b-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
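
The same request can be issued from Python instead of curl. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the response parsing assumes the usual OpenAI-style completions shape (`choices[0].text`), which the text above does not show.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="upstage/llama-30b-instruct",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request to a running server and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the server replies with {"choices": [{"text": ...}, ...]}.
    return body["choices"][0]["text"]


# Example (requires a server started as shown above):
# print(complete("Once upon a time,"))
```

Passing `base_url="http://localhost:30000"` should target a server listening on port 30000 instead, such as the SGLang setup described on this page.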

Use Docker

docker model run hf.co/upstage/llama-30b-instruct

SGLang

How to use upstage/llama-30b-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "upstage/llama-30b-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "upstage/llama-30b-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "upstage/llama-30b-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "upstage/llama-30b-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use upstage/llama-30b-instruct with Docker Model Runner:
```
docker model run hf.co/upstage/llama-30b-instruct
```

What is this?

by TheBloke - opened Jul 13, 2023

Discussion

TheBloke

Jul 13, 2023

Is this a new instruction fine-tuned model? If so, could you provide some info on what it was trained on?

Thanks in advance

Limerobot

upstage org Jul 19, 2023

@TheBloke
Apologies for the delay. We recently updated the model card for this model. Please refer to it for more information.
Thank you

kagevazquez

Jul 19, 2023

Your "contact us" should be higher up. Great work!

TheBloke

Jul 19, 2023

Wow, yeah, this looks really interesting. I will do quantisations of it now, so more people can run it and learn about it.

Now that Llama 2 is out, are you planning to bring out a llama-2-13b-instruct, and/or maybe llama-2-70b-instruct? It's a shame there's no Llama 2 34B yet but apparently it's coming fairly soon.

TheBloke

Jul 19, 2023

By the way, I suggest you put your full model card in all the variants. The 30B 2048 is definitely the most interesting, I think, but it only has a very short model card, so the user has to click elsewhere to learn what it is. I would copy the full model card to each model, with a brief line explaining what is different about each particular one. Less work for the user = more interest!

nxnhjrjtbjfzhrovwl

Jul 20, 2023

edited Jul 20, 2023

Invading this discussion a bit: I would like to know if we will ever get a 65B 2048. After all, it's clear that 30B 2048 got much better results than 30B 1024, so 65B would probably follow this trend.

Limerobot

upstage org Jul 24, 2023

@TheBloke Thank you for your interest in our model. Taking into account the number of GPUs available to us, we're planning to fine-tune the Llama2 model. We'll soon release the Llama2-70b model which has been trained with 200k data. We appreciate your valuable suggestions. :)

@nxnhjrjtbjfzhrovwl Given that the Llama2-70b model is better than the 65b, we're planning to fine-tune the Llama2-70b-2048 model first.

Limerobot changed discussion status to closed Jul 24, 2023

TheBloke

Jul 24, 2023

Great to hear!

Ideally you would do Llama2-70B-4096, given that Llama 2 has increased context?
