Jaward Sesay

Jaward

AI & ML interests

Building Lectūra Labs | CS Grad Student @BIT | AI/ML Research: Autonomous Agents, LLMs | Building The Cursor for Learning | Role Model Karpathy

Recent Activity

updated a dataset 1 day ago

Jaward/lectura-agents-data

posted an update 2 days ago

Been wrapping my head around this theoretical banger. Bro just casually answered a fundamental yet very important question: why does weight decay work? In other words, why should penalizing weight magnitude help a model perform better on unseen data?
He argues that the minimum weight norm a neural network needs to represent a target dataset is closely related to the Kolmogorov complexity (KC) of that dataset, i.e. smaller weight norms correspond to simpler solutions (lower KC), and simpler solutions tend to generalize better. This explains why bigger models generalize well on noisy data: there’s enough room to account for the optimal KC. So the question no longer hinges on parameter count but on how much information is encoded in those parameters. Thus, if norm is related to complexity, researchers can design regularizers that control complexity more directly, cool! It only holds for fixed precision tho, and he explained clearly why.
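For anyone who wants to see the "penalize weight magnitude" part concretely, here is a minimal sketch of weight decay as an L2 penalty in plain gradient descent. The toy data, the `train` and `l2_norm` helpers, and the hyper-parameter values are all made up for illustration and are not taken from the paper; it only shows that the penalty pulls the fitted weights toward a smaller norm.

```python
def train(xs, ys, weight_decay, lr=0.05, steps=500):
    """Fit y ~ w . x by gradient descent on mean squared error plus an L2 penalty."""
    w = [0.0] * len(xs[0])
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        # Weight decay: the gradient of weight_decay * ||w||^2 is 2 * weight_decay * w,
        # so every step also shrinks each weight toward zero.
        w = [wi - lr * (g + 2.0 * weight_decay * wi) for wi, g in zip(w, grad)]
    return w


def l2_norm(w):
    """Euclidean norm of the weight vector."""
    return sum(wi * wi for wi in w) ** 0.5


# Toy data (made up): the target depends almost entirely on the first feature.
xs = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.8], [4.0, -0.6]]
ys = [2.1, 3.9, 6.2, 7.8]

w_plain = train(xs, ys, weight_decay=0.0)
w_decay = train(xs, ys, weight_decay=0.1)
print(l2_norm(w_plain), l2_norm(w_decay))
```

The decayed run ends with the smaller norm. Whether that smaller norm really tracks lower Kolmogorov complexity is the paper's claim, not something this toy demonstrates.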

posted an update 26 days ago

Anthropic’s new read introduces a new autoencoder (NLA) that lets an LLM express its activations (numbers) in natural language (words). They trained Claude (with NLA) to translate its activations into human-readable text. NLA has two parameterized models: an activation verbalizer that converts activations to text, and an activation reconstructor that tries to recreate the original activations from that text. While this is cool, it took GRPO to get here lol, proving how cutting-edge we can get when research is open-sourced. Very useful for work on interpretability and alignment btw.
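The verbalizer/reconstructor loop described above can be pictured with a tiny toy. This is only a sketch of the activation -> text -> activation round trip: the three-word `VOCAB`, the `word:strength` text format, and every function name here are hypothetical, and the real system uses trained language models rather than a lookup table.

```python
import math

# Hypothetical toy vocabulary: each word stands for one direction in a 3-d activation space.
VOCAB = {
    "cat": [1.0, 0.0, 0.0],
    "code": [0.0, 1.0, 0.0],
    "negative": [0.0, 0.0, 1.0],
}


def verbalize(activation, top_k=2):
    """Activation -> text: describe the vector using its top_k strongest words."""
    scores = {
        word: sum(a * d for a, d in zip(activation, direction))
        for word, direction in VOCAB.items()
    }
    words = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return " ".join(f"{word}:{scores[word]:.1f}" for word in words)


def reconstruct(text):
    """Text -> activation: rebuild a vector from the description alone."""
    out = [0.0, 0.0, 0.0]
    for token in text.split():
        word, strength = token.split(":")
        out = [o + float(strength) * d for o, d in zip(out, VOCAB[word])]
    return out


def reconstruction_error(activation):
    """Distance between the original activation and its text round trip."""
    rebuilt = reconstruct(verbalize(activation))
    return math.sqrt(sum((a - r) ** 2 for a, r in zip(activation, rebuilt)))


activation = [0.9, 0.1, 0.4]
print(verbalize(activation))
print(reconstruction_error(activation))
```

A lower error means the text kept more of what was in the activation. Here the weak "code" component is dropped by the two-word description, so the round trip is lossy.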

posted an update 26 days ago

Post

126

upvoted a paper about 1 month ago

ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

Paper • 2604.27711 • Published Apr 30 • 41

liked a model 2 months ago

CohereLabs/cohere-transcribe-03-2026

Automatic Speech Recognition • 2B • Updated 4 days ago • 315k • 971

posted an update 2 months ago

Post

178

Supercool! You can now easily train a JEPA world model (15M params) end-to-end on a single GPU, with planning done in under 1s 🤯.
- trained with classic prediction loss + SIGReg.
- plans purely in raw pixels.
- beats SOTA DINO-WM and PLDM.
- single hyper-parameter with no heuristics.
- fully open sourced!!

Paper/Code/Data: https://le-wm.github.io/
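To make the "prediction loss + one regularizer with a single hyper-parameter" recipe concrete, here is a rough sketch. Big caveat: `gaussian_reg` below is a simplified stand-in (moment matching on random 1-d projections), not the actual SIGReg; `lam`, `world_model_loss`, and applying the regularizer to the target embeddings are all assumptions made for illustration.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def gaussian_reg(embeddings, num_projections=8, seed=0):
    """Stand-in regularizer (NOT SIGReg itself): project the batch onto random unit
    directions and penalize each 1-d projection for drifting from mean 0, variance 1."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    penalty = 0.0
    for _ in range(num_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(d * d for d in direction) ** 0.5
        direction = [d / norm for d in direction]
        proj = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / num_projections


def world_model_loss(predicted, target, lam=0.1):
    """Prediction loss on next-step embeddings plus one weighted regularizer."""
    pred_loss = sum(mse(p, t) for p, t in zip(predicted, target)) / len(predicted)
    return pred_loss + lam * gaussian_reg(target)


# A collapsed encoder maps every frame to the same embedding.
collapsed = [[0.0, 0.0]] * 4
print(world_model_loss(collapsed, collapsed))
```

The collapsed batch gets a perfect prediction loss of zero, which is exactly why a prediction-only objective is not enough; the regularizer term is what makes that solution cost something.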

posted an update 3 months ago

Post

172

Kimi team dropped a major improvement to the transformer architecture and it quietly targets one of the most taken-for-granted components: residual connections.

For nearly a decade (ever since they were introduced), transformers have relied on residuals that simply add all previous layer outputs with equal weight. It works, but it’s also kind of… dumb.

Kimi’s new paper, “Attention Residuals (AttnRes)”, replaces that with something much more intelligent:
→ instead of blindly summing past layers,
→ it learns which layers matter,
→ and dynamically weights contributions across depth.

So attention is no longer just over tokens… it’s now also over layers (depth). That effectively turns depth into a dynamic memory system, phenomenal!
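Here is a rough sketch of the contrast between "sum everything equally" and "attend over depth". It is an illustration under assumptions, not the paper's exact formulation: the `query` vector is a hypothetical learned parameter, and scoring by a plain dot product followed by a softmax across layers is my simplification.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def plain_residual(layer_outputs):
    """Standard residual stream: every earlier output is added with weight 1."""
    dim = len(layer_outputs[0])
    return [sum(out[i] for out in layer_outputs) for i in range(dim)]


def attention_residual(layer_outputs, query):
    """Depth-wise attention: score each earlier output against a query,
    softmax across layers, then take the weighted sum."""
    scores = [sum(q * o for q, o in zip(query, out)) for out in layer_outputs]
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    mixed = [sum(w * out[i] for w, out in zip(weights, layer_outputs)) for i in range(dim)]
    return mixed, weights


# Outputs of three earlier layers for one token (made-up numbers).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(plain_residual(outputs))
print(attention_residual(outputs, query=[2.0, 0.0]))
```

With the query pointing along the first dimension, the first and third layers get more weight than the second, whereas the plain residual treats all three the same.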

upvoted a paper 4 months ago

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Paper • 2602.04515 • Published Feb 4 • 39

posted an update 4 months ago

Post

260

Data supporting the findings of our new work on personalized embodied teaching/learning is out; paper coming soon.
Jaward/lectura-agents-data

liked a dataset 4 months ago

Jaward/lectura-agents-data

Viewer • Updated 1 day ago • 280 • 3.67k • 21

New activity in Jaward/lectura-agents-data 4 months ago

Delete 'claude-4.5' config
#3 opened 4 months ago by Jaward

Delete 'claude-4.5' config
#2 opened 4 months ago by Jaward

Delete 'claude-4.5' config
#1 opened 4 months ago by Jaward

published a dataset 4 months ago

Jaward/lectura-agents-data

Viewer • Updated 1 day ago • 280 • 3.67k • 21

posted an update 5 months ago

Post

959

Incredible work!! They claim this is the year of recursive language models (I hope so). As models get bigger and better, managing their context windows to fit longer prompts has been a standing engineering problem. They propose an inference technique that lets the model externally crunch long prompts down into snippets that it can recursively call itself on, instead of feeding the entire prompt directly into the transformer. This could make models cheaper and more efficient, but I doubt big tech will adopt it since they profit more from the current approach (bigger models = longer context windows = more expensive models). Once again, such work came from the academia/OSS community, cuz I doubt big tech would have shared these findings lol. They probably have much better inference methods that we may never know of haha.
Paper: https://arxiv.org/pdf/2512.24601
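The "crunch the prompt into snippets and recurse" idea can be sketched roughly like this. It is a simplified illustration of my reading of the post, not the paper's actual method: `recursive_answer`, the line-based chunking, the `max_lines` budget, and the keyword-matching `toy_llm` stand-in for a real model call are all hypothetical.

```python
def recursive_answer(lines, question, llm, max_lines=4):
    """Answer `question` over `lines`, never showing `llm` more than max_lines at once."""
    if len(lines) <= max_lines:
        return llm(lines, question)
    # Boil each chunk down to the lines the model thinks are relevant.
    snippets = []
    for start in range(0, len(lines), max_lines):
        snippets.extend(llm(lines[start:start + max_lines], question))
    if len(snippets) >= len(lines):
        # Guard: without shrinkage the recursion would never reach the base case.
        raise ValueError("snippets did not shrink; recursion would not terminate")
    # Recurse on the much shorter list of snippets.
    return recursive_answer(snippets, question, llm, max_lines)


def toy_llm(lines, question):
    """Stand-in for a model call: keeps the lines that mention the question keyword."""
    return [line for line in lines if question.lower() in line.lower()]


# A 20-line "long prompt" with two relevant lines buried in filler.
doc = [f"filler line {i}" for i in range(20)]
doc[3] = "The launch code is 7421"
doc[17] = "Backup launch site is Oslo"

print(recursive_answer(doc, "launch", toy_llm))
```

No single call ever sees more than four lines, yet both relevant lines survive to the final answer. A real setup would swap `toy_llm` for an actual model call and would need a smarter way to decide what counts as relevant.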