Instructions to use phanerozoic/PirateTalk-13b-v1-GPTQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use phanerozoic/PirateTalk-13b-v1-GPTQ-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="phanerozoic/PirateTalk-13b-v1-GPTQ-4bit")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("phanerozoic/PirateTalk-13b-v1-GPTQ-4bit")
model = AutoModelForMultimodalLM.from_pretrained("phanerozoic/PirateTalk-13b-v1-GPTQ-4bit")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use phanerozoic/PirateTalk-13b-v1-GPTQ-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "phanerozoic/PirateTalk-13b-v1-GPTQ-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "phanerozoic/PirateTalk-13b-v1-GPTQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/phanerozoic/PirateTalk-13b-v1-GPTQ-4bit

SGLang

How to use phanerozoic/PirateTalk-13b-v1-GPTQ-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "phanerozoic/PirateTalk-13b-v1-GPTQ-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "phanerozoic/PirateTalk-13b-v1-GPTQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "phanerozoic/PirateTalk-13b-v1-GPTQ-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "phanerozoic/PirateTalk-13b-v1-GPTQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use phanerozoic/PirateTalk-13b-v1-GPTQ-4bit with Docker Model Runner:
```
docker model run hf.co/phanerozoic/PirateTalk-13b-v1-GPTQ-4bit
```

Introducing PirateTalk-13b-v1-GPTQ-4bit: Building upon the foundation of the dependable 13b Llama 2 Chat architecture, we proudly unveil the 4-bit quantized iteration of the original PirateTalk-13b-v1 model. Utilizing GPTQ's advanced 4-bit GPU-quantization, this model promises a refined GPU-optimized experience without diluting its intrinsic piratical essence.

Objective: The launch of PirateTalk-13b-v1-GPTQ-4bit embodies our initiative to cater to a wider community of enthusiasts. Recognizing the VRAM constraints some users face, we embarked on this quantization journey. Our aim was to deliver the same captivating PirateTalk experience while considerably reducing the VRAM footprint, making the model more accessible to those with limited GPU resources.

Model Evolution: PirateTalk-13b-v1-GPTQ-4bit is a significant milestone in our quest for GPU-optimized quantization. Through GPTQ's 4-bit quantization technique, we have balanced GPU efficiency with the immersive narrative of our pirate dialect.

Performance Insights: Our experience with PirateTalk-13b-v1-GPTQ-4bit has been enlightening. While the quantized model tends to produce responses of shorter length, what stands out is its ability to retain the core piratical tone and essence that we intended. This balancing act between VRAM efficiency and maintaining a recognizable narrative style showcases the potential of 4-bit GPTQ quantization.

Technical Specifications: With an emphasis on GPU adaptability, PirateTalk-13b-v1-GPTQ-4bit's move to 4-bit GPTQ quantization underlines our dedication to deploying cutting-edge solutions that prioritize GPU efficiency and increased accessibility.

Future Endeavors: Buoyed by the achievements of PirateTalk-13b-v1-GPTQ-4bit, our sights are firmly set on the adventurous seas of further quantization, with 2-bit quantization beckoning us from the horizon.

Downloads last month: 1

Safetensors

Model size

13B params

Tensor type

I32

F16