---
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen3-8B
pipeline_tag: text-generation
language:
- en
tags:
- peft
- lora
- grpo
- trl
- multi-turn
- intent-discovery
datasets:
- kixlab/DiscoverLLM-multiturn-preferences
---

# DiscoverLLM-technical-writing-Qwen3-8B

LoRA adapter fine-tuned from [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) for **collaborative technical writing (articles, explanations)** with the **DiscoverLLM** training framework ([paper](https://arxiv.org/abs/2602.03429) · [project page](https://tsook.github.io/discoverllm/)).

DiscoverLLM trains LLMs to help users figure out what they want. It uses intent discovery as the reward signal and optimizes the model against a user simulator that maintains a latent intent hierarchy.

Trained with **GRPO** on [`kixlab/DiscoverLLM-multiturn-preferences`](https://huggingface.co/datasets/kixlab/DiscoverLLM-multiturn-preferences) using [TRL](https://github.com/huggingface/trl) and [PEFT](https://github.com/huggingface/peft).

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_id = "Qwen/Qwen3-8B"
adapter_id = "kixlab/DiscoverLLM-technical-writing-Qwen3-8B"

# Load the tokenizer from the adapter repo and the base weights in bf16.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

# Attach the LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(base, adapter_id)

# Format a single user turn with the chat template and move it to the model's device.
messages = [{"role": "user", "content": "Help me write an article explaining how transformers work."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Sample a reply and decode only the newly generated tokens.
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

> Note: the base model [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B) may be gated.
> If it is, accept its license on the Hub before loading this adapter.

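
Because the adapter targets multi-turn collaboration, it may help to keep the whole conversation in `messages` and re-apply the chat template on every turn. The sketch below is one possible way to do that, not part of the DiscoverLLM release: `chat_turn` and `make_generate_fn` are hypothetical helper names, and the generation settings simply mirror the single-turn example above.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(
    history: List[Message],
    user_text: str,
    generate_fn: Callable[[List[Message]], str],
) -> str:
    """Append a user turn, generate a reply from the full history, and record it."""
    history.append({"role": "user", "content": user_text})
    reply = generate_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def make_generate_fn(model, tokenizer, max_new_tokens: int = 512):
    """Wrap a loaded model and tokenizer (hypothetical helper, see the lead-in)."""

    def generate_fn(history: List[Message]) -> str:
        # Re-render the full conversation so the model sees every earlier turn.
        inputs = tokenizer.apply_chat_template(
            history, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        out = model.generate(
            inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7
        )
        # Decode only the tokens generated after the prompt.
        return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

    return generate_fn
```

With `model` and `tokenizer` loaded as in the example above, `generate = make_generate_fn(model, tokenizer)` followed by repeated calls to `chat_turn(history, "...", generate)` on the same `history` list would carry the conversation forward.
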
## Training details

- **Method:** GRPO (offline) via [TRL](https://github.com/huggingface/trl). DiscoverLLM uses the standard Group Relative Policy Optimization (GRPO; Shao et al., 2024) algorithm; the contribution is the simulator-derived reward.
- **Adapter:** LoRA (r=32, alpha=64; all attention + MLP projections)
- **Framework versions:** PEFT 0.18.0 / TRL 0.26.2 / Transformers 4.57.4 / PyTorch 2.9.0

## Citation

```bibtex
@article{kim2026discoverllm,
  title={DiscoverLLM: From Executing Intents to Discovering Them},
  author={Kim, Tae Soo and Lee, Yoonjoo and Yu, Jaesang and Chung, John Joon Young and Kim, Juho},
  journal={arXiv preprint arXiv:2602.03429},
  year={2026}
}
```