EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments Paper • 2607.02440 • Published 1 day ago • 33
Program-as-Weights: A Programming Paradigm for Fuzzy Functions Paper • 2607.02512 • Published 1 day ago • 36
AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents Paper • 2607.02255 • Published 1 day ago • 32
GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity Paper • 2607.00152 • Published 3 days ago • 3
When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling Paper • 2606.28661 • Published 6 days ago • 2
Building to the Test: Coding Agents Deliver What You Check, Not What You Requested Paper • 2606.28430 • Published 7 days ago • 4
AI translation of literary texts is "fine", but readers still prefer human translations Paper • 2606.26040 • Published 9 days ago • 4
Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents? Paper • 2607.01211 • Published 2 days ago • 6
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0) Paper • 2604.17091 • Published Apr 18 • 23
Agent READMEs: An Empirical Study of Context Files for Agentic Coding Paper • 2511.12884 • Published Nov 17, 2025 • 29
AtomiMed: Hierarchical Atomic Fact-Checking for Universal Clinical-Aware Medical Report Evaluation Paper • 2606.31292 • Published 3 days ago • 4
Cross-Domain Generalization Failure in Lightweight Intrusion Detection Models for IIoT Networks Paper • 2607.00553 • Published 2 days ago • 5
CausalMix: Data Mixture as Causal Inference for Language Model Training Paper • 2607.01104 • Published 2 days ago • 16
PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception Paper • 2606.28322 • Published 7 days ago • 35
RepoRescue: An Empirical Study of LLM Agents on Whole-Repository Compatibility Rescue Paper • 2607.01213 • Published 2 days ago • 2
TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning Paper • 2606.32017 • Published 3 days ago • 8
SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions Paper • 2606.30573 • Published 4 days ago • 4
Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination Paper • 2607.00924 • Published 2 days ago • 4
AutoTrainess: Teaching Language Models to Improve Language Models Autonomously Paper • 2606.31551 • Published 3 days ago • 11