Intro
Forget brute-force Transformers. MoR (Mixture-of-Recursions) is the covert agent that decides which tokens get a full interrogation and which just get a nod. It's efficiency with a spine.
What’s Broken in Transformers
- All tokens endure the full 24-layer gauntlet, even “the”, “and”, “lol”. Pure waste.
- Layers stack unique weights → bloated models and budget nightmares.
- No VIP exits: even finished tokens linger, sucking GPU juice and dragging latency.
- The KV cache explodes: every token stores keys and values at every layer (see the quick math after this list).
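To make that last point concrete, here is a back-of-the-envelope sketch of how fast a vanilla KV cache balloons. The layer count, sequence length, head sizes, and fp16 precision below are illustrative assumptions, not measurements from any particular model.

```python
# Rough KV-cache size for a vanilla Transformer; all numbers are
# illustrative assumptions, not measurements.
layers = 24          # every layer keeps its own keys and values
tokens = 8192        # sequence length sitting in the cache
heads = 16
head_dim = 128
bytes_per_value = 2  # fp16

# factor of 2 for K and V, stored at every layer for every token
kv_bytes = 2 * layers * tokens * heads * head_dim * bytes_per_value
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # 1.5 GiB, before any batching
```

Multiply that by a serving batch and the memory bill gets ugly fast, which is exactly the pressure point MoR's selective caching goes after.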
MoR’s Three Smart Moves
- Parameter Sharing via Recursion: Instead of a unique stack of layers, reuse one block in a loop. Fewer weights, same brain. (Medium, arXiv)
- Adaptive Token Routing: A tiny router judges each token's IQ. Easy ones leave early; complex ones stick around for more recursions. Token-level compute efficiency. (Medium, arXiv)
- Selective KV Caching: Only tokens still “in the game” take up cache memory. Optionally, reuse the first pass's KV to cut even more (with some trade-offs). A sketch of all three moves follows this list. (Medium, arXiv)
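To see how the three moves fit together, here is a minimal PyTorch-flavored sketch. The class name, hyperparameters, and the threshold-based exit rule are my own illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Minimal sketch of MoR's three moves: one shared block applied as a
    loop, a tiny router that lets easy tokens exit early, and bookkeeping
    that shows how a selective KV cache would shrink each step.
    Names and hyperparameters are illustrative, not the paper's."""

    def __init__(self, d_model=512, n_heads=8, max_recursions=4, exit_threshold=0.5):
        super().__init__()
        # Move 1: parameter sharing. ONE block reused max_recursions times,
        # instead of max_recursions distinct layers.
        self.shared_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Move 2: adaptive routing. A single linear head scores whether a
        # token still needs more compute.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions
        self.exit_threshold = exit_threshold

    def forward(self, x):  # x: (batch, seq, d_model)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        cached_per_step = []  # Move 3 stand-in: how many tokens a selective
                              # KV cache would store at each recursion step.

        for _ in range(self.max_recursions):
            cached_per_step.append(int(active.sum()))

            # Run the shared block; keep the update only for tokens still
            # "in the game". (A real implementation would gather the active
            # tokens first so skipped ones cost no FLOPs at all.)
            updated = self.shared_block(x)
            x = torch.where(active.unsqueeze(-1), updated, x)

            # Tokens whose continue-score falls below the threshold exit
            # early and keep their current hidden state.
            keep_going = torch.sigmoid(self.router(x)).squeeze(-1)
            active = active & (keep_going > self.exit_threshold)
            if not active.any():
                break

        return x, cached_per_step
```

The point of the sketch is the control flow: one set of weights, a per-token decision on every pass, and a cache that only grows for tokens still in the loop.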
Benchmarks That Dazzle
- At 1.7B parameters, MoR matches vanilla Transformer validation loss with 1/3 the unique weights. (Medium)
- Peek at the Pareto frontier: at equal FLOPs, MoR reaches lower perplexity and better few-shot accuracy while boosting throughput by roughly 2×. (arXiv)
Routing Modes: Subtle Differences
- Expert-choice: Each recursion step re-scores tokens and picks who continues. Dynamic, but needs careful loss balancing.
- Token-choice: Each token picks its total recursion depth upfront. Simple, but less flexible (see the sketch after this list). (Medium, arXiv)
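Here is a minimal sketch of the token-choice flavor, with expert-choice noted in the comments for contrast. The router shape and the argmax depth assignment are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

d_model, max_r = 512, 4
shared_block = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
depth_router = nn.Linear(d_model, max_r)  # one logit per possible recursion depth

def token_choice_forward(hidden):
    # Token-choice: score each token ONCE and commit it to a recursion
    # depth before any compute happens.
    depths = depth_router(hidden).argmax(dim=-1) + 1    # (batch, seq), values in 1..max_r

    # Expert-choice, by contrast, would re-score tokens at every step and
    # keep only the top-scoring ones: dynamic, but it needs an auxiliary
    # balancing loss so the selection doesn't collapse.
    for step in range(1, max_r + 1):
        still_running = (depths >= step).unsqueeze(-1)  # fixed schedule, no mid-flight exits
        hidden = torch.where(still_running, shared_block(hidden), hidden)
    return hidden

out = token_choice_forward(torch.randn(2, 16, d_model))
```

The trade is visible in the loop: token-choice gives a perfectly predictable compute budget per token, but a token can't ask for extra passes once it discovers it needed them.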
Trade-Off Signals
- Token-choice lacks in-flight flexibility.
- KV reuse saves memory but slightly dents accuracy.
- Routing is fixed after training; you can't tweak it on the fly.
- Not golden for teeny models (<135M parameters). (Medium)
Takeaway
MoR isn't a Transformer-smashing rebellion; it's practical evolution. Smart compute, tight models, smarter exits. Finally, modern AI brains that stop overthinking.