Abstract
Human-centered evaluation reveals significant gaps between synthetic and real-world LLM personalization performance, with models struggling to extract user attributes and generate truly personalized responses that match human quality judgments.
Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect 550 human conversations and human judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.
Community
Personalization is becoming a core promise of LLM systems: chatbots remember your job, interests, preferences, and past conversations to tailor responses. But “personalized” does not always mean helpful; it can also feel uncomfortable, offensive, or just unnecessary.
This raises a basic but surprisingly under-examined question: who decides whether personalization is actually helpful, the model or the person being personalized for?
Most existing benchmarks rely heavily on synthetic personas, simulated conversations, and LLM judges. In this work, we put real humans back into the loop.
We study personalization as a three-stage pipeline:
🧠 Attribute extraction: what should the system infer from conversation history?
🎯 Relevance matching: which attributes actually matter for the current request?
✍️ Response generation: does personalization improve the user experience?
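As a rough illustration of how the three stages fit together, here is a minimal Python sketch. It is not the paper's implementation: the `Attribute` type, the `personalize` function, and the `toy_*` stand-ins are all hypothetical names introduced here, and in a real system each stage would wrap a model call rather than a string heuristic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Attribute:
    """One inferred fact about the user (hypothetical type for this sketch)."""
    text: str


def personalize(
    history: List[str],
    prompt: str,
    extract: Callable[[List[str]], List[Attribute]],
    is_relevant: Callable[[Attribute, str], bool],
    generate: Callable[[str, List[Attribute]], str],
) -> str:
    """Chain the three stages: extract, match, generate."""
    attributes = extract(history)                                  # stage 1
    relevant = [a for a in attributes if is_relevant(a, prompt)]   # stage 2
    return generate(prompt, relevant)                              # stage 3


# Toy stand-ins (assumptions for illustration only, not real extractors).
def toy_extract(history: List[str]) -> List[Attribute]:
    # Treat every first-person statement as a user attribute.
    return [Attribute(turn) for turn in history if turn.startswith("I ")]


def toy_is_relevant(attribute: Attribute, prompt: str) -> bool:
    # Relevant if the attribute shares a word longer than 3 letters with the prompt.
    def content_words(s: str) -> set:
        return {w.lower().strip(".,?!") for w in s.split() if len(w) > 3}
    return bool(content_words(attribute.text) & content_words(prompt))


def toy_generate(prompt: str, relevant: List[Attribute]) -> str:
    # Append the selected attributes so the stage-2 output is visible.
    if not relevant:
        return prompt
    return f"{prompt} (using: {'; '.join(a.text for a in relevant)})"


if __name__ == "__main__":
    history = ["I am training for a marathon", "What's the weather?", "I am vegetarian"]
    print(personalize(history, "Suggest a marathon recovery meal",
                      toy_extract, toy_is_relevant, toy_generate))
```

Note that the keyword matcher in this toy example drops "I am vegetarian" even though a person might consider it relevant to a meal request. That is the kind of stage-2 disagreement that can only be measured against human relevance judgments.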
Using 550 real user conversations and nearly 19,000 human judgments, we find systematic human–LLM gaps at every stage:
• Models extract noisy and overgeneralized attributes from real conversations. Synthetic data underestimates this difficulty.
• LLMs and humans disagree on which attributes should be used for a new question (κ=0.30), even though agreement within each group is higher (κ=0.60 and 0.43).
• LLMs select 2–3× more attributes as relevant than humans do, suggesting a tendency to over-personalize.
• Even with human-selected relevant attributes, 54.6% of personalized responses are judged no better than generic ones by humans.
• LLM judges often overestimate personalization quality, sometimes rewarding surface-level attribute mentions that humans do not find useful.
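To make the agreement and over-selection numbers above concrete, here is a small Python sketch of how such statistics could be computed from binary relevance labels. It assumes κ denotes Cohen's kappa between two raters, which the text does not specify, and the label vectors are made-up toy data, not the paper's. The function names `cohens_kappa` and `selection_ratio` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def selection_ratio(llm: Sequence[int], human: Sequence[int]) -> float:
    """How many attributes the LLM marks relevant per attribute a human marks relevant."""
    if sum(human) == 0:
        raise ValueError("human rater selected no attributes")
    return sum(llm) / sum(human)


if __name__ == "__main__":
    # Toy labels: 1 = attribute judged relevant to the prompt, 0 = not relevant.
    human = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    llm = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
    print(f"kappa = {cohens_kappa(human, llm):.2f}")
    print(f"LLM selects {selection_ratio(llm, human):.1f}x as many attributes")
```

In this toy data the LLM never misses an attribute the human selected, yet kappa stays low because the extra selections count as disagreements. Over-selection alone is enough to depress agreement.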
We also find that lightweight training substantially improves attribute verification and relevance matching. But response-level personalization remains much harder, likely because “good personalization” is inherently individual.
Our takeaway is simple: Synthetic users and LLM judges aren't enough to capture the complex nature of human preferences. We highlight the importance of human data and call for re-centering humans in LLM personalization.
Get this paper in your agent:
hf papers read 2606.06614
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash