The Coherence Domain — Autonomous Industries

NVLink 5 moves data between GPUs at 1.8 terabytes per second over a span of a few meters. A long-haul fiber link between two campuses fifty miles apart moves data at 400 gigabits per second, with a round-trip latency near a millisecond from propagation alone and more once switching and routing are added. In common units the bandwidth gap is roughly 36 to 1, and the latency gap against the submicrosecond fabric inside a rack is three orders of magnitude or more. Those gaps are the reason a frontier training cluster cannot be split across two sites, regardless of how much power is available at each one.

This is the constraint that quietly governs the entire next phase of AI infrastructure. It is not a software constraint. It is not a procurement preference. It is a property of how synchronous gradient descent over hundreds of thousands of accelerators actually works, expressed through the physical media that connect them. The unit of frontier AI infrastructure is set by the longest interconnect that sustains the required throughput at the required latency. Everything else, including how much land must be assembled, how much power must be delivered behind a single utility meter, and how the underwriting category is drawn, follows from that.

The industry has a name for the boundary within which this physics holds. It is the coherence domain.

What A Coherence Domain Is

A coherence domain is the region inside which a set of GPUs behaves as if it were a single machine. Inside that region, memory references between accelerators complete in submicrosecond time, all-reduce operations across the full population finish within a step budget, and the loss function actually converges. Outside that region, the same workload runs, but it runs slower, and the slowdown is not linear. It compounds with model size.

The coherence domain has a tiered structure, and each tier corresponds to a physical medium.

At the innermost tier sits NVLink. On Blackwell-generation hardware, NVLink 5 delivers 1.8 terabytes per second of bidirectional bandwidth per GPU. The NVL72 rack architecture takes that further and presents 72 GPUs as one coherent unit, with the entire population reachable through a copper backplane that spans the height of a single rack. The reach of NVLink is measured in meters. Inside that radius, the fabric is effectively transparent. Outside it, you are no longer on NVLink.

The next tier is InfiniBand. NDR runs at 400 gigabits per second per port today. XDR is on the roadmap at 800 gigabits per second. InfiniBand reaches farther than NVLink, hundreds of meters across a data hall, and it is the medium by which thousands of NVL72 racks are stitched into a single training fabric. The bandwidth per link is more than an order of magnitude lower than a GPU's NVLink bandwidth (400 gigabits per second is 50 gigabytes per second, against 1.8 terabytes per second), and the latency floor is higher. The architecture compensates by treating each rack as a scale-up unit and the InfiniBand layer as scale-out connective tissue between them.

The outermost tier is Ethernet, with optical transceivers and increasingly long reaches. Ethernet can carry the data, but it cannot deliver the latency profile that synchronous training requires across the full GPU population. It is the right medium for inter-region replication, for inference traffic, for storage, and for control planes. It is not the right medium for the inner loop of a frontier pretraining run.

Those three tiers, NVLink inside the rack, InfiniBand across the hall, Ethernet beyond, define a nested set of radii. The coherence domain is the outermost radius at which the fabric still sustains the throughput and latency that the training workload demands. For a frontier-scale run, that radius is the size of a single campus.
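The per-link figures quoted for the inner two tiers are stated in different units, which makes the gap easy to misjudge. A minimal sketch that puts them on one scale follows; the dictionary keys and the helper name are illustrative choices, and the only inputs are the numbers given in the text, converted with decimal units.

```python
# Per-link figures quoted in the text, expressed in gigabits per second.
# 1.8 terabytes per second x 8 bits per byte = 14,400 gigabits per second.
TIER_GBPS = {
    "NVLink 5 (per GPU)": 14_400.0,
    "InfiniBand XDR (per port)": 800.0,
    "InfiniBand NDR (per port)": 400.0,
}


def ratio_to(reference: str, other: str) -> float:
    """Return how many times more bandwidth `reference` has than `other`."""
    return TIER_GBPS[reference] / TIER_GBPS[other]


if __name__ == "__main__":
    for name, gbps in TIER_GBPS.items():
        print(f"{name:28s} {gbps:>8.0f} Gb/s")
    print("NVLink 5 vs NDR:", ratio_to("NVLink 5 (per GPU)", "InfiniBand NDR (per port)"))
    print("NVLink 5 vs XDR:", ratio_to("NVLink 5 (per GPU)", "InfiniBand XDR (per port)"))
```

On these figures a GPU's NVLink bandwidth is 36 times one NDR port and 18 times one XDR port. Real fabrics attach several ports per node, so the effective gap per GPU depends on the build.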

The Distance That Kills Throughput

The physical reason the radius is finite is mundane and unforgiving. Signals attenuate. Copper traces lose decibels per meter. Optical fiber introduces propagation delay at roughly five microseconds per kilometer. Each signal regeneration adds latency, jitter, and cost. Each switch hop adds delay on the order of hundreds of nanoseconds, and because every collective operation in every step pays that toll, the delays accumulate across a training run into real wall-clock time.
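The propagation figure above is enough to put a floor under the latency of any fiber path. A rough sketch, using the five microseconds per kilometer quoted in the text; the 300-meter data hall run is an assumed value for illustration, and the result counts propagation only, so real routes (which are longer than the straight-line distance and pass through switching equipment) will be slower.

```python
KM_PER_MILE = 1.609344
FIBER_DELAY_US_PER_KM = 5.0  # propagation delay quoted in the text


def round_trip_us(distance_km: float) -> float:
    """Round-trip propagation delay in microseconds over a fiber of this length."""
    return 2.0 * distance_km * FIBER_DELAY_US_PER_KM


if __name__ == "__main__":
    paths = [
        ("data hall run, 300 m (assumed)", 0.3),
        ("inter-campus, fifty miles", 50 * KM_PER_MILE),
    ]
    for label, km in paths:
        print(f"{label:32s} {round_trip_us(km):10.1f} us round trip")
```

The fifty-mile path comes out near 800 microseconds before any equipment delay, against a few microseconds across the hall.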

Synchronous gradient descent imposes a step budget. At the end of every step, the gradients computed on every GPU must be reduced into a single aggregated gradient and broadcast back. If any subset of GPUs falls behind, the entire population waits. The slowest link in the fabric sets the floor for the whole cluster. Tail latency, not average latency, is what governs throughput at this scale.
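Why the tail rather than the average governs can be seen with one line of probability. The sketch below assumes each worker independently has a small chance of landing in its latency tail on a given step; independence and the one percent figure are simplifying assumptions, not measurements, and the function name is illustrative.

```python
def prob_step_hits_tail(num_workers: int, tail_prob: float) -> float:
    """Probability that at least one worker is in its latency tail on a step.

    Assumes workers are independent, which real fabrics only approximate.
    """
    return 1.0 - (1.0 - tail_prob) ** num_workers


if __name__ == "__main__":
    for n in (1, 72, 1_000, 100_000):
        p = prob_step_hits_tail(n, tail_prob=0.01)
        print(f"{n:>7d} workers: {p:.3f} chance a step waits on a tail event")
```

With a one-in-a-hundred tail per worker, a single 72-GPU rack already waits on a tail event about half the time, and at a hundred thousand workers essentially every step does. Because the step finishes only when the last worker does, the population effectively runs at its tail latency.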

When a cluster is split across two campuses, even fifty miles apart, the inter-campus link becomes the slowest link by a wide margin. Round-trip latency rises from microseconds to milliseconds. Bandwidth per GPU pair collapses by orders of magnitude. The all-reduce step that completed in a few hundred microseconds on a single site now stretches into the millisecond regime. Across a training run, the effective throughput of the cluster drops by ten to thirty percent. Published research on geographically distributed training has been remarkably consistent on that range.
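The sensitivity of the reduction step to link latency can be sketched with the standard latency-bandwidth cost model for a ring all-reduce, in which each of the 2(N-1) steps moves one N-th of the payload. This is a textbook model, not a description of any particular operator's collective library, and every number in the example (rank count, payload size, latencies, bandwidth) is a hypothetical chosen only to show the shape of the effect.

```python
def ring_allreduce_seconds(
    num_ranks: int,
    payload_bytes: float,
    link_latency_s: float,
    link_bytes_per_s: float,
) -> float:
    """Estimated time for a ring all-reduce under the latency-bandwidth model.

    A ring runs at the pace of its slowest link, so pass the worst link's
    latency and bandwidth when the ring crosses between sites.
    """
    steps = 2 * (num_ranks - 1)          # reduce-scatter plus all-gather
    chunk = payload_bytes / num_ranks    # bytes sent per step
    return steps * (link_latency_s + chunk / link_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical inputs: 64 ranks, 1 MB payload, 50 GB/s links.
    ranks, payload, bandwidth = 64, 1e6, 50e9
    on_site = ring_allreduce_seconds(ranks, payload, 2e-6, bandwidth)
    cross_site = ring_allreduce_seconds(ranks, payload, 1e-3, bandwidth)
    print(f"single site: {on_site * 1e6:10.1f} us")
    print(f"cross site:  {cross_site * 1e6:10.1f} us")
```

Production collectives use hierarchical and tree algorithms that hide part of this cost, so the sketch overstates the penalty for a well-tuned system. What it shows is that the latency term is multiplied by the number of steps, which is why a millisecond-class link dominates the total.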

A ten percent throughput loss on a billion-dollar cluster is a hundred-million-dollar opportunity cost over the life of the asset. A thirty percent loss is fatal. No operator running a frontier pretraining run accepts that tax voluntarily. The decision to build single-site is not a preference. It is the only configuration the physics permits.
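The arithmetic behind those figures is simple enough to write down. The sketch assumes, as the paragraph implicitly does, that the value produced by the cluster over its life is about equal to its cost and scales linearly with throughput; the function name is illustrative.

```python
def opportunity_cost_usd(cluster_cost_usd: int, throughput_loss_pct: int) -> int:
    """Value forgone when a cluster runs below full throughput.

    Assumes output value over the asset's life roughly equals its cost
    and scales linearly with throughput.
    """
    return cluster_cost_usd * throughput_loss_pct // 100


if __name__ == "__main__":
    for loss_pct in (10, 30):
        cost = opportunity_cost_usd(1_000_000_000, loss_pct)
        print(f"{loss_pct}% loss on a $1B cluster: ${cost:,}")
```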

Why Frontier Training Is Synchronous

A reasonable question at this point is whether the workload itself can be rewritten to tolerate distance. The honest answer is partially, and not at the frontier.

Research on asynchronous and locally synchronous training, most visibly DiLoCo and related work, has shown that pretraining can be made more tolerant of high-latency links by relaxing the synchronization requirement and allowing local optimizer steps to run on each island of GPUs before periodic global averaging. These methods work. They are deployed for some fine-tuning workloads, some continual training pipelines, and increasingly for inference-time learning. They do not yet match the convergence quality or sample efficiency of fully synchronous pretraining on the largest models.
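The communication pattern these methods rely on, several local optimizer steps per island followed by a periodic global average, can be shown on a toy problem. The sketch below is plain local SGD with parameter averaging on scalar quadratic losses. It is a simplification and not DiLoCo itself, which applies an outer optimizer to the averaged update, and the targets, learning rate, and step counts are arbitrary illustrative values.

```python
def local_sgd(targets, rounds, local_steps, lr=0.1, w0=0.0):
    """Minimize the average of 0.5 * (w - c)^2 over one target c per island.

    Each island takes `local_steps` gradient steps on its own loss, then all
    islands average their parameters. Returns the final parameter and the
    number of global synchronizations used.
    """
    w_global = w0
    syncs = 0
    for _ in range(rounds):
        local_results = []
        for c in targets:
            w = w_global
            for _ in range(local_steps):
                grad = w - c          # gradient of 0.5 * (w - c)^2
                w -= lr * grad
            local_results.append(w)
        w_global = sum(local_results) / len(local_results)  # global average
        syncs += 1
    return w_global, syncs


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0, 6.0]    # optimum of the averaged loss is 3.0
    sync_every_step = local_sgd(targets, rounds=500, local_steps=1)
    sync_every_ten = local_sgd(targets, rounds=50, local_steps=10)
    print("synchronous:", sync_every_step)
    print("local steps:", sync_every_ten)
```

Both runs take 500 gradient steps per island, but the second crosses the slow link 50 times instead of 500. On this toy problem the two reach the same answer; the convergence gap described in the text arises on large, non-convex models and is not something a sketch this small can reproduce.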

The gap is meaningful. A frontier lab choosing between a synchronous run on a single site and an asynchronous run distributed across two sites is choosing between a model that reaches the target loss in the planned number of tokens and a model that needs measurably more tokens, or measurably more steps, or both, to reach the same loss. At the scale of a 2026 frontier run, where the marginal token costs real money and the wall-clock matters, that gap is decisive.

The shape of the research curve matters too. Asynchronous methods may eventually close the gap. They have not closed it yet, and the labs making campus decisions today are making them on the basis of the workload as it exists, not the workload that might exist in 2029. The infrastructure is being committed on a synchronous assumption.

What The Hyperscalers Have Already Conceded

The pattern is visible in every announced frontier-scale campus. xAI built Colossus in Memphis as a single contiguous site, then extended it horizontally on adjacent parcels rather than splitting across regions. Meta's largest training clusters consolidate on single campuses with intra-campus InfiniBand fabrics. OpenAI's Stargate architecture, in every public description, points at single-site builds of multi-gigawatt scale rather than distributed federations of smaller sites.

None of these operators would choose this topology if the physics allowed otherwise. Single-site builds concentrate construction risk, transmission risk, permitting risk, water risk, and community risk into one geography. The diversification argument for spreading load across multiple campuses is strong on every dimension except the one that governs the actual workload. The operators have all made the same choice because the workload makes it for them.

The convergence is informative. When competitors with different cost structures, different geographies, different political relationships, and different design teams all settle on the same topology, the constraint is upstream of their decisions.

The Integrated Mill Parallel

There is a precedent for this pattern in heavy industry, and the precedent clarifies what is happening.

In the second half of the nineteenth century, steel production was reorganized around a constraint that had nothing to do with management theory and everything to do with thermodynamics. Molten metal cools rapidly. The blast furnace, the steelmaking converter or open hearth, and the rolling mill could each be operated as separate facilities, but the metal between them had to be reheated at every transition, and reheating was expensive enough to dictate the geometry of the entire industry.

The integrated mill solved this by placing the converter, the furnace, and the rolling line on a single contiguous site, with the steel moving from one stage to the next while it was still hot enough to work. Andrew Carnegie's Edgar Thomson Steel Works opened in 1875 on that pattern. Homestead followed in 1881. The economics of integrated steel, the cost curves that eventually consolidated American steel production into a small number of vast sites, all flowed from that physical fact. Steel cooled too quickly to transport between separate facilities, so the facilities had to be one facility.

The parallel to frontier training is precise. Gradients cannot be transported between separate campuses without prohibitive latency, so the campuses have to be one campus. The economics of single-site multi-gigawatt training, the underwriting categories that will eventually form around it, and the scarcity of land that supports it are all downstream of a physical constraint, not of a strategic choice.

It is worth noting how Carnegie's competitors responded. They did not argue that the integrated mill was a passing fashion. They built integrated mills. The constraint propagated until the industry's topology matched the physics.

What Becomes True By 2028

If the analysis above is right, several things follow over the next thirty months.

The unit of frontier training settles at the single-site campus, at scales between two and five gigawatts of IT load. Sites below that scale serve the second tier of model development, regional inference, fine-tuning, and specialized workloads. Sites at that scale are the locations where the actual frontier moves.

The contiguous land required to support a multi-gigawatt single-site campus expands. A two-gigawatt campus with realistic power density, cooling envelope, substation footprint, and expansion buffer occupies roughly two thousand five hundred to five thousand acres of contiguous, developable land. The number of such parcels in the United States with viable transmission interconnection is small and shrinking. The scarcity is not theoretical. It is already showing up in the land market.

Underwriting models adapt. The category that emerges is not data center land in general but coherent-footprint land in particular. Parcels that can support a single-site multi-gigawatt build trade as one asset class. Parcels that cannot, however good they might be for distributed deployment, trade as a different asset class. The spread between the two categories widens as the physics becomes common knowledge.

The asynchronous research path continues. It may matter enormously for the workloads downstream of pretraining. It does not, in the period under discussion, change where the frontier campus has to be built.

The physics does not negotiate. The infrastructure follows.