You must overprovision to avoid latency from queueing. Modern latency expectations in a data center are around a microsecond of transit time between any two machines, less if they are neighbors. Random fluctuations in demand require you to provision for peaks. If an application uses a scatter-gather fan-out of 100 (realistic for data-centric apps), then the slowest of those 100 requests sets the latency of that stage of the app. So your capacity should handle peaks at the 99th percentile, or better.
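To see why the fan-out drives the percentile requirement, here is a minimal back-of-envelope sketch (assuming leaf latencies are independent; the fan-out of 100 is taken from the example above):

```python
# Tail-at-scale arithmetic: a scatter-gather stage finishes only when the
# slowest of its N parallel requests returns. If each leaf meets a latency
# bound with probability p, the whole stage meets it with probability p**N.

def stage_hit_rate(p_leaf: float, fanout: int) -> float:
    """Probability that all `fanout` parallel requests meet the bound."""
    return p_leaf ** fanout

for p_leaf in (0.99, 0.999, 0.9999):
    print(f"leaf meets bound {p_leaf:.2%} of the time -> "
          f"stage meets it {stage_hit_rate(p_leaf, 100):.1%} of the time")
```

At a fan-out of 100, leaves that individually make the bound 99% of the time give the stage only a ~37% chance of making it; the per-leaf rate has to be roughly 99.99% for the stage to reach 99%. Hence "the 99th percentile, or better."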
The domain I come from has 10 million cores in a unified network within a single cloud data center. Networking at rack scale is uninteresting: the hardware for it is cheap, and the perf problems are insignificant. A 25 Tbps full-duplex switch is a single chip, and units built around them ship in volumes of hundreds of thousands. Even a rough protocol like RoCE works smoothly on a single hop through those.
It starts to get interesting at cluster scale, which is generally about 20 racks and 50,000 cores, though even that is fairly easy for general-purpose compute. It does become challenging for supercomputers and ML clusters, which generate orders of magnitude more traffic than general-purpose compute; hence the effort in TPUv4 to innovate at that level.
The design goal is to eliminate queueing so that no overprovisioning is required, and to keep everything to a single hop. For now, we constrained the design to 100% capacity: with a perfect balance between throughput and demand, there is no tail latency.
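To make the contrast with a queued fabric concrete, here is a minimal illustration using the textbook M/M/1 model (an illustrative assumption, not the actual design):

```python
# With random (Poisson) arrivals, mean queueing delay blows up as utilization
# approaches 100% -- this is what forces conventional fabrics to overprovision.
# A perfectly scheduled system with deterministic, balanced arrivals (D/D/1)
# can run at 100% utilization with zero queueing delay.

def mm1_mean_wait(rho: float) -> float:
    """Mean time spent queueing, in units of the service time (M/M/1)."""
    assert 0.0 <= rho < 1.0, "an M/M/1 queue is unstable at >= 100% load"
    return rho / (1.0 - rho)

for rho in (0.5, 0.9, 0.99, 0.999):
    print(f"utilization {rho:.1%}: mean wait = {mm1_mean_wait(rho):7.1f}x service time")
```

Pushing a stochastic fabric from 90% to 99.9% utilization multiplies the mean queueing delay by over a hundred; removing the randomness, rather than adding headroom, is what lets the design run at 100%.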
Ten million cores is a real stretch; a cluster scale of 50,000 cores is much more achievable with some level of aggregation.
Having said that, we would likely provision compute in an entirely different manner.
Definitely not something we would tackle in the short term. Smaller implementations under a few hundred endpoints are more addressable for NOCs, LANs, and AI/ML.