Discussion about this post

Steeven

One thing I don’t get is how the compute is supposed to be decentralized at frontier scale. At a certain level of scale, a decentralized compute footprint is still easily identifiable. You can hide training a 60B model, but hiding a 2T model would be much more challenging. Where exactly are you storing the B200s such that they can’t be easily identified, and how do you set up procurement for all those GPUs without identifying yourself?
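
To put rough numbers on that gap, here is a back-of-envelope sketch using the common C ≈ 6·N·D training-FLOPs estimate. The per-GPU throughput, utilization, token budget, and run length below are all illustrative assumptions, not measured figures for any real cluster or chip.

```python
# Back-of-envelope: how many GPUs a given-size training run needs.
# All constants are illustrative assumptions, not claims about real hardware.

def gpus_needed(params, tokens_per_param=20, peak_flops=2e15,
                utilization=0.4, run_days=120):
    """Estimate GPU count from the rough C ~= 6*N*D training-FLOPs rule."""
    tokens = params * tokens_per_param          # Chinchilla-style token budget
    total_flops = 6 * params * tokens           # C ~= 6*N*D
    effective_flops = peak_flops * utilization  # sustained FLOP/s per GPU (assumed)
    gpu_seconds = total_flops / effective_flops
    return gpu_seconds / (run_days * 86400)

print(f"60B model: ~{gpus_needed(60e9):,.0f} GPUs")
print(f"2T model:  ~{gpus_needed(2e12):,.0f} GPUs")
```

Under those assumptions the 60B run fits on a few dozen GPUs, while the 2T run needs tens of thousands, which is exactly the kind of footprint that shows up in procurement orders and power contracts.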

Maybe it’s a bet on more efficient training driving down the power, time, and hardware requirements, but those same gains would also push frontier data-center capabilities forward, so they wouldn’t narrow the gap.

DiLoCo and other low-communication training algorithms can also be applied to frontier models that train across multiple data centers, though they benefit decentralized training more (see the sketch below).
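
For readers who haven’t seen it, here is a rough sketch of the DiLoCo-style communication pattern: workers take many local optimizer steps and only exchange parameter deltas occasionally, which is what makes training across loosely connected sites (or data centers) workable. The toy model, hyperparameters, and helper function below are illustrative, not the paper’s implementation.

```python
# Minimal sketch of a DiLoCo-style round (many local steps, rare sync)
# on a toy regression task. Settings are placeholders, not the paper's.
import copy
import torch
import torch.nn as nn

def diloco_round(global_model, workers, inner_steps=50, outer_lr=0.7,
                 outer_momentum=0.9, outer_state=None):
    """One outer step: each worker trains locally, then parameter deltas
    are averaged and applied with Nesterov-style outer momentum."""
    global_vec = nn.utils.parameters_to_vector(global_model.parameters()).detach()
    deltas = []
    for data, target in workers:                      # each worker = one data shard
        local = copy.deepcopy(global_model)
        opt = torch.optim.AdamW(local.parameters(), lr=1e-3)  # inner optimizer
        for _ in range(inner_steps):                  # H local steps, no communication
            opt.zero_grad()
            loss = nn.functional.mse_loss(local(data), target)
            loss.backward()
            opt.step()
        local_vec = nn.utils.parameters_to_vector(local.parameters()).detach()
        deltas.append(global_vec - local_vec)         # "outer gradient"
    outer_grad = torch.stack(deltas).mean(0)          # one all-reduce per round
    if outer_state is None:
        outer_state = torch.zeros_like(outer_grad)
    outer_state = outer_momentum * outer_state + outer_grad   # momentum buffer
    new_global = global_vec - outer_lr * (outer_grad + outer_momentum * outer_state)
    nn.utils.vector_to_parameters(new_global, global_model.parameters())
    return outer_state

# Toy usage: 4 "data centers", each with its own shard of a linear problem.
torch.manual_seed(0)
true_w = torch.randn(8, 1)
workers = []
for _ in range(4):
    x = torch.randn(256, 8)
    workers.append((x, x @ true_w + 0.01 * torch.randn(256, 1)))
model = nn.Linear(8, 1, bias=False)
state = None
for _ in range(10):
    state = diloco_round(model, workers, outer_state=state)
err = (model.weight.detach().T - true_w).norm()
print(f"weight error after 10 outer rounds: {err:.4f}")
```

The relevant point for the regulation argument is that communication happens once per round rather than every step, so the same trick cuts interconnect requirements whether the participants are two hyperscaler campuses or many small sites.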

This post isn’t the first I’ve seen arguing that decentralized training changes everything (although it is the most comprehensive, so well done). But I’m still not clear on how this breaks the logic of compute regulation, since (A) decentralized runs are nowhere near the rapidly advancing frontier, and (B) at a certain scale, decentralized training still requires massive purchase orders of GPUs and power, which would still allow compute to be verified.

IntExp

great review!

scaling at the agent/swarm level is where foom will manifest, especially if decentrally trained models remain inferior to SOTA open-weight models.

the intelligence of decentrally cooperating agents, as a function of time and swarm size, is the relevant pulse of an intelligence explosion.
