Been wrapping my head around this theoretical banger. Bro just casually answered a fundamental question: why does weight decay work? In other words, why should penalizing weight magnitude help a model perform better on unseen data? He argues that the minimum weight norm a neural network needs to represent a target dataset is closely related to the Kolmogorov complexity (KC) of that dataset, i.e. smaller weight norms correspond to simpler solutions (lower KC), and simpler solutions tend to generalize better. It also helps explain why bigger models can still generalize well on noisy data: there's enough room to reach a low-complexity solution. So the question now hinges not on parameter count but on how much information is encoded in those parameters. And if norm is related to complexity, researchers can design regularizers that control complexity more directly, cool! It only holds for fixed precision tho, and he explains clearly why.
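To make the "penalize weight magnitude" part concrete, here is a minimal sketch of weight decay on a toy linear regression. It is not from the paper and says nothing about Kolmogorov complexity; it only shows that adding a decay term to the gradient shrinks the norm of the learned weights. The function name `fit_linear`, the synthetic data and the hyperparameters are all made up for illustration.

```python
import numpy as np

def fit_linear(X, y, weight_decay=0.0, lr=0.05, steps=2000):
    """Gradient descent on 0.5 * MSE + 0.5 * weight_decay * ||w||^2."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        # Data-fit gradient plus the decay term that pulls w toward zero.
        grad = X.T @ (X @ w - y) / n + weight_decay * w
        w -= lr * grad
    return w

# Hypothetical toy problem: 20 features, only 3 of them matter, noisy targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=0.5, size=40)

w_plain = fit_linear(X, y)                    # no penalty
w_decay = fit_linear(X, y, weight_decay=0.1)  # with weight decay

norm_plain = np.linalg.norm(w_plain)
norm_decay = np.linalg.norm(w_decay)
print(f"||w|| without decay: {norm_plain:.3f}")
print(f"||w|| with decay:    {norm_decay:.3f}")
```

The decayed run ends up with the smaller weight norm. Whether that smaller norm really means a "simpler" solution is the claim the paper argues for, not something this toy shows.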
Anthropic’s new read introduces a new autoencoder (NLA) that lets an LLM explain its activations (numbers) in natural language (words). They trained Claude (with NLA) to translate its activations into human-readable text. NLA has two parameterized models: an activation verbalizer that converts activations to text, and an activation reconstructor that tries to recreate the original activations from that text. While this is cool, it took GRPO to get here lol, which shows how cutting-edge we can get when research is open-sourced. Very useful for work on interpretability and alignment btw.
Supercool! You can now easily train a JEPA world model (15M params) end-to-end on a single GPU, with planning done in under 1s 🤯.
- trained with a classic prediction loss + SIGReg
- plans purely in raw pixels
- beats SOTA DINO-WM and PLDM
- single hyper-parameter with no heuristics
- fully open sourced!!
Kimi team dropped a major improvement to the transformer architecture and it quietly targets one of the most taken-for-granted components: residual connections.
For nearly a decade, ever since they were introduced, transformers have relied on residuals that simply add all previous layer outputs with equal weight. It works but it’s also kind of… dumb.
Kimi’s new paper, “Attention Residuals (AttnRes)”, replaces that with something much more intelligent:
→ instead of blindly summing past layers,
→ it learns which layers matter,
→ and dynamically weights their contributions across depth.
So attention is no longer just over tokens… it’s now also over layers (depth). That effectively turns depth into a dynamic memory system, phenomenal!
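Here is a rough sketch of the idea of "attention over depth", to contrast it with a plain residual sum. This is not the paper's actual formulation. It assumes one hypothetical learned query vector per layer, scaled dot-product scores and a softmax across earlier layer outputs; the names `standard_residual` and `depth_attention` are made up for illustration.

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_residual(layer_outputs):
    """Plain residual stream: every earlier output is added with weight 1."""
    return layer_outputs.sum(axis=0)

def depth_attention(layer_outputs, query):
    """Softmax-weighted mix of earlier layer outputs, computed per token.

    layer_outputs: (num_layers, seq_len, d_model) outputs of earlier layers
    query:         (d_model,) hypothetical learned vector for the current layer
    """
    d_model = layer_outputs.shape[-1]
    scores = layer_outputs @ query / np.sqrt(d_model)  # (num_layers, seq_len)
    weights = softmax(scores, axis=0)                  # sums to 1 over depth
    mixed = np.einsum("ls,lsd->sd", weights, layer_outputs)
    return mixed, weights

rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(4, 5, 8))  # 4 layers, 5 tokens, d_model = 8
query = rng.normal(size=8)

summed = standard_residual(layer_outputs)
mixed, weights = depth_attention(layer_outputs, query)

# With an all-zero query every layer gets the same weight (1 / num_layers).
uniform_mixed, uniform_weights = depth_attention(layer_outputs, np.zeros(8))
```

Each token gets its own distribution over layers in `weights`, so different tokens can lean on different depths. A zero query gives every layer equal weight, which is the "add everything equally" behavior up to a constant factor.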
Incredible work!! They claim this is the year of recursive language models (I hope so). As models get bigger and better, managing their context windows to fit longer prompts has been a standing engineering problem. They propose an inference technique that lets the model externally break long prompts down into snippets that it can recursively call itself on, instead of feeding the entire prompt directly into the transformer. This could make models cheaper and more efficient, but I doubt big tech will adopt it since they profit more from the current approach (bigger models = longer context windows = more expensive models). Once again such work came from the academia/OSS community, cuz I doubt big tech would have shared these findings lol. They probably have much better inference methods that we may never know of haha. Paper: https://arxiv.org/pdf/2512.24601
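A heavily simplified sketch of the recursive idea, for intuition only. The paper's actual mechanism may look quite different. Here the "model" is a stub function, the splitting is a naive halving by characters, and `recursive_answer`, `call_model` and `max_chars` are hypothetical names made up for this example.

```python
from typing import Callable

def recursive_answer(prompt: str, question: str,
                     call_model: Callable[[str], str],
                     max_chars: int = 2000) -> str:
    """Answer `question` about `prompt` without passing more than
    `max_chars` characters of the prompt to the model in one call."""
    if len(prompt) <= max_chars:
        return call_model(f"{question}\n\n{prompt}")
    # Too long: split in half and recurse on each snippet.
    mid = len(prompt) // 2
    left = recursive_answer(prompt[:mid], question, call_model, max_chars)
    right = recursive_answer(prompt[mid:], question, call_model, max_chars)
    # One final call over the two partial answers (truncated so the
    # per-call size bound still holds and the recursion terminates).
    combined = (left + "\n" + right)[:max_chars]
    return call_model(f"{question}\n\n{combined}")

# Stub standing in for a real LLM call: it just looks for a marker string
# and records how long each input was.
call_lengths = []

def fake_model(text: str) -> str:
    call_lengths.append(len(text))
    return "secret=42" if "secret=42" in text else "none"

question = "What is the secret?"
long_prompt = "filler " * 1000 + "secret=42 " + "filler " * 300  # ~9k chars

answer = recursive_answer(long_prompt, question, fake_model, max_chars=2000)
print(answer, max(call_lengths))
```

No single call sees more than `max_chars` characters of the prompt, no matter how long the prompt is. A naive split like this can cut right through the relevant passage, so a real system needs something smarter for choosing the snippets.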