Instructions to use upstage/llama-30b-instruct with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers
How to use upstage/llama-30b-instruct with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="upstage/llama-30b-instruct")

    # Load model directly (LLaMA is a text-only causal language model)
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("upstage/llama-30b-instruct")
    model = AutoModelForCausalLM.from_pretrained("upstage/llama-30b-instruct")
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use upstage/llama-30b-instruct with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "upstage/llama-30b-instruct"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "upstage/llama-30b-instruct",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
Use Docker
docker model run hf.co/upstage/llama-30b-instruct
- SGLang
How to use upstage/llama-30b-instruct with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "upstage/llama-30b-instruct" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "upstage/llama-30b-instruct",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
Use Docker images
    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "upstage/llama-30b-instruct" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "upstage/llama-30b-instruct",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
- Docker Model Runner
How to use upstage/llama-30b-instruct with Docker Model Runner:
docker model run hf.co/upstage/llama-30b-instruct
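The curl calls shown for vLLM and SGLang can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_completion_request` and `complete` are hypothetical, and reading `choices[0]["text"]` assumes the server returns the usual OpenAI-style completions response; adjust the base URL to whichever server is running (port 8000 for the vLLM example, 30000 for SGLang).

```python
import json
import urllib.request

MODEL_ID = "upstage/llama-30b-instruct"


def build_completion_request(base_url, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route.

    Hypothetical helper: mirrors the JSON body used in the curl examples.
    """
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, **kwargs):
    """Send the request and return the generated text.

    Assumes an OpenAI-style response with the text under choices[0]["text"].
    """
    request = build_completion_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires a server started as in the vLLM example (port 8000).
    print(complete("http://localhost:8000", "Once upon a time,"))
```

Keeping request construction separate from the network call makes it easy to inspect the payload before pointing it at a live server.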
What is this?
Is this a new instruction fine-tuned model? If so, could you provide some info on what it was trained on?
Thanks in advance
Your "contact us" should be higher up. Great work!
Wow yeah this looks really interesting. I will do quantisations of it now, so more people can run it and learn about it
Now that Llama 2 is out, are you planning to bring out a llama-2-13b-instruct, and/or maybe llama-2-70b-instruct? It's a shame there's no Llama 2 34B yet but apparently it's coming fairly soon.
By the way I suggest you put your full model card in all the variants. The 30B 2048 is definitely the most interesting I think, but it only has a very short model card where the user has to click elsewhere to learn what this is. I would copy the full model card to each model, with a brief line explaining what is different about each particular one. Less work for the user = more interest!
Invading this discussion a bit: I would like to know if we will ever get a 65B 2048. After all, it's clear that 30B 2048 got much better results than 30B 1024, so 65B would probably follow this trend.
@TheBloke Thank you for your interest in our model. Taking into account the number of GPUs available to us, we're planning to fine-tune the Llama2 model. We'll soon release the Llama2-70b model which has been trained with 200k data. We appreciate your valuable suggestions. :)
@nxnhjrjtbjfzhrovwl Given that the Llama2-70b model is better than the 65b, we're planning to fine-tune the Llama2-70b-2048 model first.
Great to hear!
Ideally you would do Llama2-70B-4096, given that it has an increased context length?