Intro
Forget brute-force Transformers. MoR (Mixture-of-Recursions) is the covert agent that decides which tokens get a full interrogation and which just get a nod. It's efficiency with a spine.
What’s Broken in Transformers
- All tokens endure the full 24-layer gauntlet, even “the”, “and”, “lol”. Pure waste.
- Layers stack unique weights → bloated models and budget nightmares.
- No VIP exits: even finished tokens linger, sucking GPU juice and dragging latency.
- The KV cache explodes: every token stores keys and values at every layer (see the quick math after this list).
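To make that last point concrete, here is a back-of-the-envelope sketch of how fast a vanilla KV cache balloons. The layer count, sequence length, head sizes, and fp16 precision below are illustrative assumptions, not measurements from any particular model.

```python
# Rough KV-cache size for a vanilla Transformer; all numbers are
# illustrative assumptions, not measurements.
layers = 24          # every layer keeps its own keys and values
tokens = 8192        # sequence length sitting in the cache
heads = 16
head_dim = 128
bytes_per_value = 2  # fp16

# factor of 2 for K and V, stored at every layer for every token
kv_bytes = 2 * layers * tokens * heads * head_dim * bytes_per_value
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # 1.5 GiB, before any batching
```

Multiply that by a serving batch and the memory bill gets ugly fast, which is exactly the pressure point MoR's selective caching goes after.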
MoR’s Three Smart Moves
- Parameter Sharing via Recursion: Instead of a unique stack of layers, reuse one block in a loop. Fewer weights, same brain. (Medium, arXiv)
- Adaptive Token Routing: A tiny router judges each token's IQ. Easy ones leave early; complex ones stick around for more recursions. Token-level compute efficiency. (Medium, arXiv)
- Selective KV Caching: Only tokens still “in the game” take up cache memory. Optionally, reuse the first pass's KV to cut even more (with some trade-offs). A sketch of all three moves follows this list. (Medium, arXiv)
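To see how the three moves fit together, here is a minimal PyTorch-flavored sketch. The class name, hyperparameters, and the threshold-based exit rule are my own illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Minimal sketch of MoR's three moves: one shared block applied as a
    loop, a tiny router that lets easy tokens exit early, and bookkeeping
    that shows how a selective KV cache would shrink each step.
    Names and hyperparameters are illustrative, not the paper's."""

    def __init__(self, d_model=512, n_heads=8, max_recursions=4, exit_threshold=0.5):
        super().__init__()
        # Move 1: parameter sharing. ONE block reused max_recursions times,
        # instead of max_recursions distinct layers.
        self.shared_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Move 2: adaptive routing. A single linear head scores whether a
        # token still needs more compute.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions
        self.exit_threshold = exit_threshold

    def forward(self, x):  # x: (batch, seq, d_model)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        cached_per_step = []  # Move 3 stand-in: how many tokens a selective
                              # KV cache would store at each recursion step.

        for _ in range(self.max_recursions):
            cached_per_step.append(int(active.sum()))

            # Run the shared block; keep the update only for tokens still
            # "in the game". (A real implementation would gather the active
            # tokens first so skipped ones cost no FLOPs at all.)
            updated = self.shared_block(x)
            x = torch.where(active.unsqueeze(-1), updated, x)

            # Tokens whose continue-score falls below the threshold exit
            # early and keep their current hidden state.
            keep_going = torch.sigmoid(self.router(x)).squeeze(-1)
            active = active & (keep_going > self.exit_threshold)
            if not active.any():
                break

        return x, cached_per_step
```

The point of the sketch is the control flow: one set of weights, a per-token decision on every pass, and a cache that only grows for tokens still in the loop.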
Benchmarks That Dazzle
- At 1.7B parameters, MoR matches vanilla Transformer validation loss with 1/3 the unique weights. (Medium)
- Peek at the Pareto frontier: at equal FLOPs, MoR reaches lower perplexity and better few-shot accuracy while boosting throughput by roughly 2×. (arXiv)
Routing Modes: Subtle Differences
- Expert-choice: Each recursion step re-scores tokens and picks who continues. Dynamic, but needs careful loss balancing.
- Token-choice: Each token picks its total recursion depth upfront. Simple, but less flexible (see the sketch after this list). (Medium, arXiv)
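Here is a minimal sketch of the token-choice flavor, with expert-choice noted in the comments for contrast. The router shape and the argmax depth assignment are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

d_model, max_r = 512, 4
shared_block = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
depth_router = nn.Linear(d_model, max_r)  # one logit per possible recursion depth

def token_choice_forward(hidden):
    # Token-choice: score each token ONCE and commit it to a recursion
    # depth before any compute happens.
    depths = depth_router(hidden).argmax(dim=-1) + 1    # (batch, seq), values in 1..max_r

    # Expert-choice, by contrast, would re-score tokens at every step and
    # keep only the top-scoring ones: dynamic, but it needs an auxiliary
    # balancing loss so the selection doesn't collapse.
    for step in range(1, max_r + 1):
        still_running = (depths >= step).unsqueeze(-1)  # fixed schedule, no mid-flight exits
        hidden = torch.where(still_running, shared_block(hidden), hidden)
    return hidden

out = token_choice_forward(torch.randn(2, 16, d_model))
```

The trade is visible in the loop: token-choice gives a perfectly predictable compute budget per token, but a token can't ask for extra passes once it discovers it needed them.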
Trade-Off Signals
- Token-choice lacks in-flight flexibility.
- KV reuse saves memory but slightly dents accuracy.
- Routing is fixed after training; you can't tweak it on the fly.
- Not golden for teeny models (<135M parameters). (Medium)
Takeaway
MoR isn't a Transformer-smashing rebellion; it's practical evolution. Smart compute, tight models, smarter exits. Finally, modern AI brains that stop overthinking.