3 Comments
Steeven

One thing I don’t get is how the compute is supposed to be decentralized at frontier scale. At a certain scale, your decentralized compute footprint is still easily identifiable. You can hide training a 60B model, but hiding a 2T model would be much more challenging. Where exactly are you storing the B200s such that they can’t be easily identified, and how do you set up procurement for all those GPUs without identifying yourself?

Maybe it’s a bet on more efficient training driving down the power, time, and hardware requirements, but all of those gains would likewise push frontier data center capabilities forward and wouldn’t narrow the gap.

DiLoCo and other low-communication training algorithms can also be applied to frontier models trained across multiple data centers, but they benefit decentralized training more.
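For intuition, here’s a minimal numerical sketch of the DiLoCo pattern (the toy least-squares objective, worker count, and all hyperparameters are my own illustrative choices, not from the paper): each worker runs several local steps on its own data shard, and only the averaged “pseudo-gradient” is communicated each round, which is what makes low-bandwidth, multi-site training feasible.

```python
import numpy as np

def local_steps(theta, data, H=10, inner_lr=0.05):
    """One worker: H inner SGD steps on its own least-squares shard."""
    X, y = data
    for _ in range(H):
        grad = X.T @ (X @ theta - y) / len(y)
        theta = theta - inner_lr * grad
    return theta

def diloco_round(theta, shards, momentum, outer_lr=0.7, beta=0.9):
    """One communication round: local training, then an outer Nesterov step."""
    # Every worker starts from the same shared parameters.
    local_params = [local_steps(theta.copy(), d) for d in shards]
    # Pseudo-gradient: average displacement of workers from the start point.
    # This is the only thing that needs to cross the network each round.
    pseudo_grad = theta - np.mean(local_params, axis=0)
    momentum = beta * momentum + pseudo_grad
    theta = theta - outer_lr * (pseudo_grad + beta * momentum)  # Nesterov step
    return theta, momentum

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
# Four workers, each holding a private data shard.
shards = []
for _ in range(4):
    X = rng.normal(size=(64, 2))
    shards.append((X, X @ true_w + 0.01 * rng.normal(size=64)))

theta, m = np.zeros(2), np.zeros(2)
for _ in range(30):  # 30 communication rounds total
    theta, m = diloco_round(theta, shards, m)
```

The point of the structure is the communication ratio: here each worker syncs once per 10 local steps, and the real algorithm stretches that to hundreds of steps between rounds.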

This post isn’t the first I’ve seen arguing that decentralized training changes everything (although it is the most comprehensive, so well done). But I’m still not clear on how this breaks the logic of compute regulation, since (a) decentralized runs are nowhere near the rapidly advancing frontier, and (b) at a certain scale, decentralized training still requires massive purchase orders of GPUs and power, which would still allow compute to be verified.

Jaime Sevilla

TBC I also think it would be extremely challenging to hide a frontier training run, but that's not necessarily the point.

Decentralized training could leverage spare compute from multiple owners. Hyperscalers and neoclouds own about 50% of compute, while the rest is more diffuse.

I could see this being used to, e.g., decentrally train models past 1e25 FLOP across many entities, which would then be subject to the EU AI Act.
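For a sense of scale, a back-of-the-envelope calculation (the GPU count and per-GPU throughput are my own assumptions, not figures from the post): even a modest pool of contributed accelerators crosses the 1e25 FLOP threshold in a couple of months.

```python
# Rough arithmetic: how long does a pooled fleet take to reach the
# EU AI Act's 1e25 FLOP threshold for general-purpose models?
THRESHOLD_FLOP = 1e25
PER_GPU_FLOPS = 1e15   # assumed effective FLOP/s per H100-class GPU,
                       # after utilization losses
N_GPUS = 2_000         # assumed pool size, spread across many small owners

seconds = THRESHOLD_FLOP / (PER_GPU_FLOPS * N_GPUS)
days = seconds / 86_400
print(f"{days:.0f} days")  # → 58 days
```

That timescale is short enough that no single participant needs a procurement order large enough to stand out on its own.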

By necessity, it's much harder to impose regulations on this larger group of agents. There will still likely be a coordinating entity you can hold accountable, but it is not a given that the coordinator will be salient or that they will cooperate.

IntExp

great review!

scaling at the agent/swarm level is where foom will manifest, especially if the decentrally trained models remain inferior to SOTA open weight models.

decentrally cooperating agents' intelligence as a function of time and swarm size is the relevant pulse of intexp