Monday, 1 September 2025

Deep Techie — “Mixture-of-Recursions: Smart, Lean, Mean”

Intro

Forget brute-force Transformers. Mixture-of-Recursions (MoR) is the covert agent that decides which tokens get a full interrogation and which just get a nod. It's efficiency with a spine.

What’s Broken in Transformers

  • All tokens endure the full 24-layer gauntlet, even “the”, “and”, and “lol.” Pure waste.

  • Layers stack unique weights → bloated models and budget nightmares.

  • No VIP exits: even done tokens linger, sucking GPU juice and dragging latency.

  • The KV cache balloons because every token stores keys and values at every layer.

  • Zero dynamic depth: hard tokens get exactly as many passes as easy ones, with no room for extra internal reasoning. (Medium, arXiv)
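
To put a number on the KV-cache complaint above, here's a back-of-the-envelope sketch. It assumes plain multi-head attention (each token stores d_model values for K and another d_model for V at every layer) and 16-bit values. The config numbers are made up for illustration and are not taken from any particular model.

```python
def kv_cache_bytes(n_layers, n_tokens, d_model, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token at every layer.

    Assumes standard multi-head attention, where K and V each hold d_model
    values per token per layer (no grouped-query or multi-query sharing).
    """
    return 2 * n_layers * n_tokens * d_model * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
full = kv_cache_bytes(n_layers=24, n_tokens=4096, d_model=2048)
print(f"{full / 2**20:.0f} MiB per sequence")  # prints "768 MiB per sequence"
```

The cost is linear in layers and tokens, so any token that skips a layer (or a recursion step) is memory you get back.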

MoR’s Three Smart Moves

  1. Parameter Sharing via Recursion
    Instead of a unique layer stack, reuse one block like a loop. Fewer weights, same brain. (Medium, arXiv)

  2. Adaptive Token Routing
    A tiny router judges how hard each token is. Easy ones leave early; complex ones stick around for more recursion steps. That's compute efficiency at the token level. (Medium, arXiv)

  3. Selective KV Caching
    Only tokens still “in the game” at a given recursion step take up KV memory. Optionally, reuse the first pass’s KV at later steps to save even more (at some cost in accuracy). (Medium, arXiv)
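
The three moves fit together in a few lines. What follows is a toy sketch in plain Python, not the paper's implementation: `shared_block`, `router_score`, the 0.5 threshold, and the random weights are all hypothetical stand-ins, there is no attention, and the "KV cache" simply stores each active token's hidden vector in place of real keys and values.

```python
import math
import random

random.seed(0)
D = 4            # hidden size (toy value)
MAX_STEPS = 3    # maximum recursion depth (toy value)

# Move 1: a single weight matrix, reused at every recursion step.
W_SHARED = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(D)]
# Move 2: a tiny linear router (hypothetical design, random weights).
W_ROUTER = [random.uniform(-1.0, 1.0) for _ in range(D)]


def shared_block(h):
    """Stand-in for a Transformer block: residual plus tanh(W h)."""
    return [h[i] + math.tanh(sum(W_SHARED[i][j] * h[j] for j in range(D)))
            for i in range(D)]


def router_score(h):
    """Sigmoid of a linear projection; higher means 'keep recursing'."""
    z = sum(w * x for w, x in zip(W_ROUTER, h))
    return 1.0 / (1.0 + math.exp(-z))


def mor_forward(hidden, threshold=0.5):
    """Apply the shared block up to MAX_STEPS times with per-token exits."""
    hidden = [list(h) for h in hidden]   # copy so the input is untouched
    active = list(range(len(hidden)))    # token indices still recursing
    depth = [0] * len(hidden)            # recursion steps used per token
    kv_cache = []                        # one dict per recursion step
    for _ in range(MAX_STEPS):
        if not active:
            break
        # Move 3: only tokens still active get a cache entry at this step.
        kv_cache.append({i: tuple(hidden[i]) for i in active})
        for i in active:
            hidden[i] = shared_block(hidden[i])
            depth[i] += 1
        # Tokens scoring below the threshold exit after this step.
        active = [i for i in active if router_score(hidden[i]) >= threshold]
    return hidden, depth, kv_cache


tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(6)]
_, depth, kv_cache = mor_forward(tokens)
print("steps per token:", depth)
print("cached tokens per step:", [len(c) for c in kv_cache])
```

The thing to watch is the second printout: the cache can only shrink from one step to the next, because a token that has exited never writes another entry.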

Benchmarks That Dazzle

  • At 1.7B parameters, MoR matches a vanilla Transformer's validation loss with 1/3 the unique weights. (Medium)

  • Peek at the Pareto frontier: at equal training FLOPs, MoR reaches lower perplexity and better few-shot accuracy while boosting throughput by ~2×. (arXiv+1)

Routing Modes: Subtle Differences

  • Expert-choice: At each recursion step, the router checks which tokens continue. Dynamic, but it needs careful loss balancing.

  • Token-choice: Tokens pick their total recursion depth upfront. Simple, but less flexible. (Medium, arXiv)
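
The difference between the two modes is easiest to see side by side. This is a simplified sketch with hand-picked scores: the function names, the `keep_ratio` parameter, and the rule that every token gets the first step are assumptions for illustration, not details confirmed by the sources.

```python
def expert_choice(scores_per_step, keep_ratio=0.5):
    """Per-step routing: after each step, only the top-scoring fraction
    of the surviving tokens continues to the next step.

    scores_per_step[s][i] is the router score for token i at step s.
    Returns the number of recursion steps each token receives.
    """
    n = len(scores_per_step[0])
    active = list(range(n))
    depth = [0] * n
    for scores in scores_per_step:
        for i in active:
            depth[i] += 1
        k = max(1, int(len(active) * keep_ratio))
        active = sorted(active, key=lambda i: scores[i], reverse=True)[:k]
    return depth


def token_choice(depth_scores):
    """Upfront routing: each token commits to a total depth once.

    depth_scores[i][d] is token i's score for using d + 1 recursion steps.
    """
    return [max(range(len(s)), key=lambda d: s[d]) + 1 for s in depth_scores]


# Hand-picked toy scores for four tokens and three recursion steps.
per_step = [[0.9, 0.2, 0.7, 0.4],
            [0.8, 0.1, 0.3, 0.6],
            [0.5, 0.5, 0.5, 0.5]]
upfront = [[0.1, 0.2, 0.7],
           [0.8, 0.1, 0.1],
           [0.3, 0.4, 0.3],
           [0.2, 0.3, 0.5]]

print(expert_choice(per_step))  # [3, 1, 2, 1]
print(token_choice(upfront))    # [3, 1, 2, 3]
```

In this sketch, expert-choice fixes how many tokens survive each step (4, then 2, then 1), while nothing in token-choice caps how many tokens pick the deepest setting.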

Trade-Off Signals

  • Token-choice lacks in-flight flexibility.

  • KV reuse saves memory but slightly dents accuracy.

  • Routing behavior is fixed once training ends; you can’t tweak it on the fly.

  • Not golden for teeny models under 135M parameters. (Medium)

Takeaway

MoR isn’t a Transformer-smashing rebellion; it’s practical evolution. Smart compute, tight models, smarter exits. Finally, modern AI brains that stop overthinking.
