---
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen3-8B
pipeline_tag: text-generation
language:
- en
tags:
- peft
- lora
- grpo
- trl
- multi-turn
- intent-discovery
datasets:
- kixlab/DiscoverLLM-multiturn-preferences
---

# DiscoverLLM-technical-writing-Qwen3-8B

LoRA adapter fine-tuned from [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) for
**collaborative technical writing (articles, explanations)** with the **DiscoverLLM** training framework
([paper](https://arxiv.org/abs/2602.03429) · [project page](https://tsook.github.io/discoverllm/)).
DiscoverLLM trains LLMs to help users figure out what they want: it uses intent discovery as
the reward signal and optimizes against a user simulator that maintains a latent intent hierarchy.

Trained with **GRPO** on
[`kixlab/DiscoverLLM-multiturn-preferences`](https://huggingface.co/datasets/kixlab/DiscoverLLM-multiturn-preferences)
using [TRL](https://github.com/huggingface/trl) and [PEFT](https://github.com/huggingface/peft).

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_id    = "Qwen/Qwen3-8B"
adapter_id = "kixlab/DiscoverLLM-technical-writing-Qwen3-8B"

# Load the tokenizer from the adapter repo, then attach the LoRA adapter to the base model
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

# Example prompt in the adapter's target domain (technical writing)
messages = [{"role": "user", "content": "Help me write an article explaining how LoRA fine-tuning works."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

> Note: if the base model [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B) is gated for your account,
> accept its license on the Hub before loading this adapter.
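
Because the adapter targets multi-turn collaboration, it can help to keep the full message history and pass it back on every turn. The following is a minimal sketch, not part of the released code: `add_turn` and `echo_generate` are hypothetical names, and `echo_generate` is a stand-in for a function that would wrap `tokenizer.apply_chat_template` and `model.generate` from the snippet above.

```python
def add_turn(messages, user_text, generate_fn):
    """Append a user turn, generate a reply from the full history, and store it."""
    messages.append({"role": "user", "content": user_text})
    reply = generate_fn(messages)  # generate_fn sees every previous turn
    messages.append({"role": "assistant", "content": reply})
    return reply


def echo_generate(messages):
    """Hypothetical stand-in for a real model call; echoes the last user message."""
    return f"(reply to: {messages[-1]['content']})"


history = []
add_turn(history, "Help me draft an article on LoRA.", echo_generate)
add_turn(history, "Make the intro shorter.", echo_generate)
```

After two turns, `history` holds four messages with alternating `user` and `assistant` roles, which is the format `apply_chat_template` expects.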

## Training details

- **Method:** GRPO (offline) via [TRL](https://github.com/huggingface/trl). DiscoverLLM uses the standard Group Relative Policy Optimization (GRPO; Shao et al., 2024) algorithm; the contribution is the simulator-derived reward.
- **Adapter:** LoRA (r=32, alpha=64; all attention + MLP projections)
- **Framework versions:** PEFT 0.18.0 / TRL 0.26.2 / Transformers 4.57.4 / PyTorch 2.9.0
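
To illustrate the group-relative part of GRPO mentioned above, the sketch below normalizes the rewards of several responses sampled for the same prompt so that each response is scored relative to its group. It is an illustration of the general idea, not the training code used for this adapter: the function name, the `eps` value, the use of the population standard deviation, and the reward numbers are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center one group's rewards on the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical simulator-derived rewards for four responses to one prompt
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses with above-average reward get a positive advantage and below-average ones a negative advantage, so the policy update depends on how a response compares with its group rather than on the absolute reward scale.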

## Citation

```bibtex
@article{kim2026discoverllm,
  title={DiscoverLLM: From Executing Intents to Discovering Them},
  author={Kim, Tae Soo and Lee, Yoonjoo and Yu, Jaesang and Chung, John Joon Young and Kim, Juho},
  journal={arXiv preprint arXiv:2602.03429},
  year={2026}
}
```