Instructions to use llmfan46/MiniMax-M3-uncensored-heretic-balanced with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use llmfan46/MiniMax-M3-uncensored-heretic-balanced with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llmfan46/MiniMax-M3-uncensored-heretic-balanced", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("llmfan46/MiniMax-M3-uncensored-heretic-balanced", trust_remote_code=True)
model = AutoModelForMultimodalLM.from_pretrained("llmfan46/MiniMax-M3-uncensored-heretic-balanced", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Notebooks

Google Colab, Kaggle

Local Apps

vLLM

How to use llmfan46/MiniMax-M3-uncensored-heretic-balanced with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "llmfan46/MiniMax-M3-uncensored-heretic-balanced"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "llmfan46/MiniMax-M3-uncensored-heretic-balanced",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
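The same request can also be sent from Python. The following is only a minimal sketch using the standard library, assuming a server is already listening on `http://localhost:8000` as in the curl example; the helper names `build_chat_payload` and `chat` are made up for illustration and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url, payload, timeout=120):
    """POST the payload to the OpenAI-compatible chat endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = build_chat_payload(
    "llmfan46/MiniMax-M3-uncensored-heretic-balanced",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
```

With the server running, `chat("http://localhost:8000", payload)` should return an OpenAI-style response whose text is under `["choices"][0]["message"]["content"]`. For the SGLang server the only change would be the port (30000).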

Use Docker

```
docker model run hf.co/llmfan46/MiniMax-M3-uncensored-heretic-balanced
```

SGLang

How to use llmfan46/MiniMax-M3-uncensored-heretic-balanced with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "llmfan46/MiniMax-M3-uncensored-heretic-balanced" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "llmfan46/MiniMax-M3-uncensored-heretic-balanced",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "llmfan46/MiniMax-M3-uncensored-heretic-balanced" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "llmfan46/MiniMax-M3-uncensored-heretic-balanced",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```

Docker Model Runner
How to use llmfan46/MiniMax-M3-uncensored-heretic-balanced with Docker Model Runner:
```
docker model run hf.co/llmfan46/MiniMax-M3-uncensored-heretic-balanced
```

GLM5.2 pls!!!!

by Pragmatism0220 - opened 4 days ago

Discussion

Pragmatism0220

4 days ago

Thank you for your great work!!! 😊 🙏 GLM 5.2 was open-sourced a few hours ago, and it is currently one of the most advanced open-source models available. Do you have any plans to create a heretic model for it? An FP8 version in particular might be better suited for deployment. Thanks! Thank you once again for your continuous great work!

llmfan46

Owner, 4 days ago (edited 4 days ago)

> Thank you for your great work!!! 😊 🙏 GLM 5.2 was open-sourced a few hours ago, and it is currently one of the most advanced open-source models available. Do you have any plans to create a heretic model for it? An FP8 version in particular might be better suited for deployment. Thanks! Thank you once again for your continuous great work!

GLM 5.2 is 1.51 TB, meaning 1510 GB. To have any chance of abliterating this model you would need at least 8x B300s running for many hours. Each B300 costs around $50k, so 8 of them would cost $400,000 (and that's only the GPUs; it doesn't take into account the motherboard, rack, RAM, CPU, SSD, etc.). Unless you are loaded, no one has a system like this locally, meaning you are stuck renting the GPUs. I can let you imagine the cost of renting a remote machine that has 8x B300s...

In case it's not obvious, this is also why this MiniMax-M3-uncensored-heretic is a premium gated paid-access model: abliterating MiniMax-M3 turned out to be very expensive and difficult.
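The purchase figure above is simple multiplication. As a rough sketch, using the numbers quoted in the post (8 GPUs at roughly $50k each) and a made-up helper name:

```python
def gpu_purchase_cost_usd(num_gpus, price_per_gpu_usd):
    """GPU-only purchase cost; excludes motherboard, rack, RAM, CPU, SSD, etc."""
    return num_gpus * price_per_gpu_usd


# Figures quoted in the post: 8x B300 at about $50k each (an approximate price).
cost = gpu_purchase_cost_usd(num_gpus=8, price_per_gpu_usd=50_000)
print(f"${cost:,}")  # $400,000
```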

IrisColt

4 days ago

> I can let you imagine the cost of renting a remote machine that has 8x B300s...

$150/hour, heh

llmfan46

Owner, 3 days ago (edited 3 days ago)

> > I can let you imagine the cost of renting a remote machine that has 8x B300s...
>
> $150/hour, heh

There comes a time when you have to assert yourself, and sometimes that means saying things that need to be said.

Price varies between providers, but yes, this is why MiniMax-M3-uncensored-heretic-balanced and MiniMax-M3-uncensored-heretic-aggressive are premium gated paid-access models.

I made MiniMax-M2.7 a freely available public model because it is a 457 GB model based on the "old" minimax_m2 architecture, which is text-only, so at the time support from the tools and dependencies had more or less caught up. It "only" required 2x B300s and "only" a dozen hours. Don't get me wrong, it was still expensive and took many hours of work and effort to get it done.

MiniMax-M3 is a whole different beast. It's an 854 GB model based on the very new minimax_m3_vl architecture, and it required at least 5x B300s to uncensor (I tried with 4x B300s but no dice: it caused OOM around batch size 64 and was too tight overall). It took many retries to get it working, with a lot of roadblocks, troubleshooting and issues along the way, especially since there are a lot of dependencies and moving parts in there and support for this model is still very spotty, as it came out just a few days ago. The model is very finicky about what it requires (latest CUDA, latest PyTorch, latest transformers, etc.), and a lot of things can go wrong if even one thing is not exactly the expected version. This model ended up costing a lot of money to abliterate and took many hours to finally get uncensored.
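To put the GPU counts above in perspective, here is a rough sketch of the memory left over once the weights are loaded. It assumes 288 GB of memory per B300, a figure that is not stated in the thread, and it only counts the weights, not activations or whatever else the abliteration run keeps in memory.

```python
B300_VRAM_GB = 288  # assumption: per-GPU memory, not stated in the thread


def headroom_gb(model_size_gb, num_gpus, vram_per_gpu_gb=B300_VRAM_GB):
    """Aggregate GPU memory minus the size of the model weights."""
    return num_gpus * vram_per_gpu_gb - model_size_gb


# (model, size in GB, number of B300s) as described in the post
runs = [
    ("MiniMax-M2.7", 457, 2),  # worked
    ("MiniMax-M3", 854, 4),    # OOM around batch size 64
    ("MiniMax-M3", 854, 5),    # worked
]
for name, size_gb, gpus in runs:
    print(f"{name}: {gpus}x B300 leaves {headroom_gb(size_gb, gpus)} GB")
```

Under that assumption the 4x B300 attempt still had a few hundred GB free on paper, which suggests the weights alone are a poor predictor of what a run actually needs.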

GLM-5.2, yes, that model is 1510 GB, so for an abliteration we are talking about costs in the thousands here! (In case it's not obvious: the bigger the model, the more it costs per hour, because it needs more VRAM, which means more B300 units rented per hour, and so a higher cost.)
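As a back-of-the-envelope sketch of why this lands "in the thousands": take the $150/hour figure quoted earlier in the thread for an 8x B300 machine and multiply by the run time. The durations below are hypothetical, since nobody in the thread says how long a GLM-5.2 run would take.

```python
def rental_cost_usd(hourly_rate_usd, hours):
    """Total rental cost for a machine billed by the hour."""
    return hourly_rate_usd * hours


RATE_8X_B300_USD = 150  # $/hour, the figure quoted earlier in the thread

# Hypothetical run times in hours, including retries; not from the thread.
for hours in (12, 24, 48):
    print(f"{hours} h -> ${rental_cost_usd(RATE_8X_B300_USD, hours):,}")
```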

So don't get me started on non-paid requests for a GLM-5.2 abliteration. Again, this model is 1510 GB, meaning the people who are requesting it have very expensive hardware to be able to run such a model at all. Let's do some calculations. The only ways to run GLM-5.2 in BF16 would be:

3x or 4x Mac Studio M3 Ultra with 512 GB of unified memory each, which comes to somewhere between $40,000 and $52,000.

If you go the NVIDIA route instead:

Easily over $400,000, since this amount only accounts for the 8x B300s.

Even if you go with a lower quant, we are still talking about a 753B-parameter model. This is obviously a model for people who are loaded.

In case it is not clear, I will not be taking cheapskate unpaid requests for abliterating gigantic models from people who are acting like bums when they are clearly sporting hardware that amounts to tens of thousands, if not hundreds of thousands, of dollars.

Don't be a miserly Scrooge McDuck. If you have a request to uncensor a big model, you can send me a message on Ko-fi so we can estimate the total compute cost and agree on the amount of my commission fees. You send the payment(s), and only then will I accept the request:

https://ko-fi.com/llmfan46
