Monday, 1 September 2025

Deep Techie — “Mixture-of-Recursions: Smart, Lean, Mean”

Intro

Forget brute-force Transformers. MoR is the covert agent that decides which tokens get a full interrogation and which just get a nod. It's efficiency with a spine.

What’s Broken in Transformers

  • All tokens endure the full 24-layer gauntlet, even “the”, “and”, “lol.” Pure waste.

  • Every layer carries its own unique weights → bloated models and training-budget nightmares.

  • No VIP exits: even done tokens linger, sucking GPU juice and dragging latency.

  • KV cache explodes with every token at every layer.

  • Zero adaptive depth: every token gets the same fixed compute, with no room to “think longer” on hard ones.

MoR’s Three Smart Moves

  1. Parameter Sharing via Recursion
    Instead of a stack of unique layers, reuse one block in a loop. Fewer unique weights, same brain.

  2. Adaptive Token Routing
    A tiny router judges each token’s IQ. Easy ones leave early; complex ones stick around. Token-level compute efficiency.

  3. Selective KV Caching
    Only tokens still “in the game” occupy the cache. Optionally, reuse the first recursion’s KV on later passes to save even more memory (with some accuracy trade-offs). A minimal sketch of all three moves follows this list.

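To make the three moves concrete, here is a minimal PyTorch-style sketch. Everything in it (the class name, the 0.5 exit threshold, the recursion count, using nn.TransformerEncoderLayer as the shared block) is an illustrative assumption, not the paper's actual architecture or API.

```python
import torch
import torch.nn as nn


class MoRSketch(nn.Module):
    """Toy Mixture-of-Recursions block: one shared layer, looped with a router."""

    def __init__(self, d_model=512, n_heads=8, max_recursions=3):
        super().__init__()
        # (1) Parameter sharing via recursion: ONE block reused in a loop
        #     instead of a stack of unique layers -> far fewer unique weights.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        # (2) Adaptive token routing: a tiny linear router scores each token;
        #     low-scoring ("easy") tokens stop recursing early.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions

    def forward(self, x):  # x: (batch, seq, d_model)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break
            # (3) Selective KV caching: only tokens still "in the game" take
            #     part in attention. Here they are simply masked out; a real
            #     implementation would also drop them from the KV cache.
            padding_mask = ~active  # True = ignore this position
            updated = self.shared_block(x, src_key_padding_mask=padding_mask)
            # Retired tokens keep their last hidden state untouched.
            x = torch.where(active.unsqueeze(-1), updated, x)
            # The router decides who continues into the next recursion.
            keep_score = torch.sigmoid(self.router(x)).squeeze(-1)
            active = active & (keep_score > 0.5)
        return x
```

In this sketch an easy token like “the” would likely drop out of the loop after one pass, while a content-heavy token keeps recursing up to max_recursions times.
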
Benchmarks That Dazzle

  • At 1.7B parameters, MoR matches vanilla Transformer validation loss with 1/3 the unique weights (see the back-of-envelope arithmetic after this list).

  • On the compute Pareto frontier, MoR turns equal FLOPs into lower perplexity and better few-shot accuracy, while boosting inference throughput by ~2×.

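The “1/3 the unique weights” figure is just the parameter-sharing arithmetic. A back-of-envelope illustration, where the 24-layer depth, 3 recursions, and per-layer parameter count are assumed numbers, not the paper’s exact configuration:

```python
# Assumed numbers, for illustration only.
params_per_layer = 85_000_000              # hypothetical weights in one Transformer layer

vanilla_unique = 24 * params_per_layer     # 24 distinct layers, each with its own weights
mor_unique = 8 * params_per_layer          # one 8-layer block, looped 3 times

effective_depth = 8 * 3                    # a token can still pass through 24 layer applications
print(mor_unique / vanilla_unique)         # 0.333... -> roughly 1/3 the unique weights
```
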
Routing Modes: Subtle Differences

  • Expert-choice: Each recursion step picks which tokens continue. Fully dynamic, but it needs careful loss balancing during training (see the sketch after this list).

  • Token-choice: Tokens commit to their total recursion depth upfront. Simple, but less flexible.

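A toy contrast of the two modes, where the top-k of 3, the 0–1 router scores, and the depth formula are all illustrative assumptions rather than the paper’s actual routing rules:

```python
import torch

scores = torch.rand(2, 6)  # router scores for a toy (batch=2, seq=6) input

# Expert-choice: at EACH recursion step, the step itself keeps its k
# highest-scoring tokens, so depth adapts step by step -- but the selection
# needs careful loss balancing during training.
k = 3
topk = scores.topk(k, dim=-1).indices
expert_keep = torch.zeros_like(scores, dtype=torch.bool)
expert_keep[torch.arange(scores.size(0)).unsqueeze(1), topk] = True

# Token-choice: each token commits to its TOTAL recursion depth up front from
# a single routing decision -- simpler, but it cannot revise the plan mid-flight.
max_recursions = 3
token_depth = (scores * max_recursions).ceil().clamp(min=1).long()  # depth in {1, 2, 3}
```
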
Trade-Off Signals

  • Token-choice lacks in-flight flexibility.

  • KV reuse saves memory but slightly dents accuracy.

  • Routing is fixed post-training; you can’t tweak it on the fly.

  • Not golden for teeny models (<135M parameters).

Takeaway

MoR isn’t a Transformer-smashing rebellion; it’s practical evolution. Smart compute, tight models, smarter exits. Finally, modern AI brains that stop overthinking.
