Instructions to use LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

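# Load model and tokenizer directly
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct", trust_remote_code=True, dtype="auto")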

Local Apps

vLLM

How to use LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
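
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal illustration, assuming the server started above is listening on `http://localhost:8000`; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires a running server, so it is left commented out):
# req = build_chat_request(
#     "http://localhost:8000",
#     "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct",
#     "What is the capital of France?",
# )
# reply = send_chat_request(req)
# print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before any network call. Pointing `base_url` at port 30000 should work the same way for the SGLang server described further down, since both expose the OpenAI-compatible route.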

Use Docker

docker model run hf.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct

SGLang

How to use LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct with Docker Model Runner:
```
docker model run hf.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct
```

model.generate() crashes: AttributeError 'AttentionInterface' has no attribute 'get_interface' (transformers==5.0.0)

#11

by Bias92 - opened Feb 7

Discussion

Bias92

Feb 7

Follow-up: _tied_weights_keys fix confirmed ✅ — but new AttentionInterface error in model.generate()

Thanks for the update, @nuxlear! I pulled the latest snapshot with force_download=True and confirmed:

✅ _tied_weights_keys is now a dict ({"lm_head.weight": "transformer.wte.weight"}) — the original 'list' object has no attribute 'keys' error during post_init is fixed.
✅ Model loads successfully (weights 100% materialized).

However, model.generate() now crashes with a new error:

AttributeError: 'AttentionInterface' object has no attribute 'get_interface'

Tested:

attn_implementation="eager" → same error
use_cache=False → same error

So it doesn’t look KV-cache/SDPA-specific — the failure happens before that.

Possible root cause (hypothesis):
modeling_exaone.py appears to use the older Transformers v4-style attention selection (class-based dispatch like ExaoneSelfAttention / ExaoneFlashAttention / ExaoneSdpaAttention based on config). In Transformers v5, attention dispatch is handled via AttentionInterface / ALL_ATTENTION_FUNCTIONS. This mismatch may be why generate() hits AttentionInterface.get_interface and fails.
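
One way to narrow this down, whichever way the mismatch runs, is to check whether the installed Transformers build actually exposes the method named in the error. The sketch below is a generic probe; the helper name `probe_attribute` is hypothetical, and it assumes `AttentionInterface` is importable from the top-level `transformers` package.

```python
import importlib
from typing import Optional


def probe_attribute(module_name: str, class_name: str, attr_name: str) -> Optional[bool]:
    """Report whether module_name.class_name has attr_name.

    Returns None when the module or the class cannot be found, so that
    "not installed" is distinguishable from "installed but missing the attribute".
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    cls = getattr(module, class_name, None)
    if cls is None:
        return None
    return hasattr(cls, attr_name)


# Example usage for this report (prints True, False, or None):
# print(probe_attribute("transformers", "AttentionInterface", "get_interface"))
```

A result of `False` would suggest that the remote modeling code calls a method the installed library does not provide, which points at a version mismatch rather than at the KV cache or SDPA path.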

Repro:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    trust_remote_code=True,
    attn_implementation="eager",
)

x = tok("hi\n", return_tensors="pt")
x = {k: v.to(model.device) for k, v in x.items()}

y = model.generate(**x, max_new_tokens=16, do_sample=False, use_cache=True)
# -> AttributeError: 'AttentionInterface' object has no attribute 'get_interface'

Environment:

transformers==5.0.0
torch==2.10.0
Python 3.13.2, Mac / M4 Pro (Apple Silicon)
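
For anyone reproducing this, the version details above can be gathered with a short script instead of being copied by hand. This is a minimal sketch using only the standard library; the function name `collect_versions` is hypothetical.

```python
import platform
from importlib import metadata


def collect_versions(packages: list[str]) -> dict[str, str]:
    """Return the Python version, platform string, and installed package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Print one line per entry, ready to paste into a bug report.
for key, value in collect_versions(["transformers", "torch"]).items():
    print(f"{key}=={value}")
```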

Happy to help test changes or submit a PR once the attention code is updated.

nuxlear

LG AI Research org Feb 9

Could you upgrade your transformers version to 5.1.0?
The new modeling code is generated with the latest version, which is also compatible with EXAONE-MoE.

Please refer to this PR for more details: https://github.com/huggingface/transformers/pull/43622
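
A script that loads this model could guard against the older release before calling from_pretrained. The following is a rough sketch, assuming 5.1.0 is the minimum as suggested above; `parse_release` and `meets_minimum` are hypothetical helpers, and pre-release suffixes such as `rc1` or `.dev0` are simply ignored.

```python
from importlib import metadata

MIN_TRANSFORMERS = (5, 1, 0)  # minimum suggested in this thread


def parse_release(version: str) -> tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '5.1.0.dev0' -> (5, 1, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # segment such as 'dev0' has no leading digits
        parts.append(int(digits))
        if digits != piece:
            break  # segment such as '0rc1' carries a suffix; stop here
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple[int, ...] = MIN_TRANSFORMERS) -> bool:
    """Compare the parsed release against the minimum, padding short versions with zeros."""
    release = parse_release(version)
    padded = release + (0,) * (len(minimum) - len(release))
    return padded >= minimum


try:
    installed = metadata.version("transformers")
    status = "ok" if meets_minimum(installed) else "upgrade needed"
    print(f"transformers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

If the check reports that an upgrade is needed, `pip install -U "transformers>=5.1.0"` is one way to get a compatible release.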

Bias92

Feb 10

Confirmed that the issue is resolved with transformers==5.1.0.

Model loads and model.generate() works without the AttentionInterface error.
Tested on Python 3.13.2, Mac / M4 Pro (Apple Silicon)

Thanks @nuxlear for the quick response!

nuxlear changed discussion status to closed Feb 10
