
Google says its AI supercomputer is faster, greener than Nvidia A100 chip

Daniel Nenni



April 4 (Reuters) - Alphabet Inc's (GOOGL.O) Google released on Tuesday new details about the supercomputers it uses to train its artificial intelligence models, saying the systems are both faster and more power-efficient than comparable systems from Nvidia Corp (NVDA.O).

Google has designed its own custom chip called the Tensor Processing Unit, or TPU. It uses those chips for more than 90% of the company's work on artificial intelligence training, the process of feeding data through models to make them useful at tasks such as responding to queries with human-like text or generating images.

The Google TPU is now in its fourth generation. Google on Tuesday published a scientific paper detailing how it has strung more than 4,000 of the chips together into a supercomputer using its own custom-developed optical switches to help connect individual machines. Improving these connections has become a key point of competition among companies that build AI supercomputers because so-called large language models that power technologies like Google's Bard or OpenAI's ChatGPT have exploded in size, meaning they are far too large to store on a single chip.

The models must instead be split across thousands of chips, which must then work together for weeks or more to train the model. Google's PaLM model - its largest publicly disclosed language model to date - was trained by splitting it across two of the 4,000-chip supercomputers over 50 days. Google said its supercomputers make it easy to reconfigure connections between chips on the fly, helping avoid problems and tweak for performance gains.

"Circuit switching makes it easy to route around failed components," Google Fellow Norm Jouppi and Google Distinguished Engineer David Patterson wrote in a blog post about the system. "This flexibility even allows us to change the topology of the supercomputer interconnect to accelerate the performance of an ML (machine learning) model."

While Google is only now releasing details about its supercomputer, it has been online inside the company since 2020 in a data center in Mayes County, Oklahoma. Google said that startup Midjourney used the system to train its model, which generates fresh images after being fed a few words of text.

In the paper, Google said that for comparably sized systems, its chips are up to 1.7 times faster and 1.9 times more power-efficient than a system based on Nvidia's A100 chip that was on the market at the same time as the fourth-generation TPU. An Nvidia spokesperson declined to comment. Google said it did not compare its fourth-generation TPU to Nvidia's current flagship H100 chip because the H100 came to the market after Google's chip and is made with newer technology. Google hinted that it might be working on a new TPU that would compete with the Nvidia H100 but provided no details, with Jouppi telling Reuters that Google has "a healthy pipeline of future chips."

 
Chip companies beware, systems companies are coming for you, absolutely! Bespoke silicon!

Bespoke chips are one of the great trends of 2022 that will have a lasting impact. In 2021, bespoke chips were designed by a wide range of forward-looking, innovative systems companies, such as Apple, Meta, Amazon, Google, and Tesla. These companies view made-to-order, optimized silicon as critical to maintaining their competitive advantage, especially in the areas of big data and machine learning (ML). The implications will forever change the semiconductor industry. The main impacts include:
  • Bespoke chips are now designed by systems companies with system design experience. The semiconductor and electronic design automation (EDA) companies will need to adjust their offerings to accommodate this new line of thinking.
  • Great demands are being placed on silicon technology. To meet the advanced integrated system demands required of the bespoke chips, great silicon and design technology will need to be developed and deployed. Companies like Intel are rising to the challenge as they continue to drive Moore’s Law and More-than-Moore techniques such as 2.5D and 3D-IC.
  • A greater demand for, and greater challenges in, multiphysics simulation. As bespoke chips take on more and more characteristics of systems, the multiphysics concerns of systems companies are drawn into IC design. Physics such as electromagnetics, thermal, and stress will not only become required of design tools, but they must be solved concurrently, since these physics affect one another (see the sketch after this list).
  • The need for open multiphysics platforms will proliferate. No single EDA company will meet all the demands of bespoke silicon design. EDA companies should work together to provide solutions as any single vendor solution will be insufficient.
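To illustrate the concurrency point, here is a toy sketch (all constants, function names, and values are illustrative assumptions, not any EDA vendor's tool): leakage power depends on temperature and temperature depends on power, so the two single-physics solves have to be iterated to a self-consistent operating point rather than run once each in sequence.

# Toy illustration of concurrent multiphysics (all constants and names are
# illustrative, not from any EDA tool): leakage power rises with temperature,
# and temperature rises with power, so neither solve is correct on its own.

def total_power_w(temp_c, p_dynamic=10.0, p_leak_25c=2.0):
    """Dynamic power plus leakage that grows roughly exponentially with temperature."""
    return p_dynamic + p_leak_25c * 1.04 ** (temp_c - 25.0)

def junction_temp_c(power_w, ambient_c=25.0, theta_ja_c_per_w=2.0):
    """Steady-state junction temperature from power and package thermal resistance."""
    return ambient_c + theta_ja_c_per_w * power_w

temp = 25.0
for iteration in range(100):
    power = total_power_w(temp)          # electrical solve at the current temperature
    new_temp = junction_temp_c(power)    # thermal solve at the current power
    if abs(new_temp - temp) < 0.01:      # converged to a self-consistent operating point
        break
    temp = new_temp

print(f"self-consistent point after {iteration} iterations: {power:.1f} W at {temp:.1f} C")

Running the electrical and thermal models once each, in sequence, would report roughly 12 W at 49 C; iterating them to consistency lands near 19 W in the low 60s C with these made-up numbers, which is exactly the kind of gap that drives the demand for concurrent multiphysics analysis.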
 
...
  • The need for open multiphysics platforms will proliferate. No single EDA company will meet all the demands of bespoke silicon design. EDA companies should work together to provide solutions as any single vendor solution will be insufficient.
...
Can you expand on this? It's not clear why a 'one-stop-shop' vendor can't exist.
 
Can you expand on this? It's not clear why a 'one-stop-shop' vendor can't exist.

It certainly can but innovation would suffer. I have seen it before and hope not to see it again. Competition spurs innovation and in the semiconductor ecosystem innovation is key to moving forward quickly and cost effectively.
 
Interoperability will be crucial as more problem-specific bespoke silicon is produced. I think we will see an explosion of chip interconnects, but with it will come a new bottleneck at the edges of custom silicon. Exchanging data between custom silicon is the problem around the corner.

"Circuit Switching" is the future.... Still - there is lots of room left to optimize topology of interconnects..... As elegant as it might appear - it is still waaaaay too complicated and energy consuming.
 
Interoperability will be crucial as more problem-specific bespoke silicon is produced. I think we will see an explosion of chip interconnects, but with it will come a new bottleneck at the edges of custom silicon. Exchanging data between custom silicon is the problem around the corner.
UCIe is intended to solve this problem.
"Circuit Switching" is the future.... Still - there is lots of room left to optimize topology of interconnects..... As elegant as it might appear - it is still waaaaay too complicated and energy consuming.
For streams of data, like AI/ML applications typically use, circuit switching can be a significant advantage. For smaller messages with very low latency requirements, packet-switching looks more efficient, especially for congestion management. Why do you think circuit switching is the future? Or is it just for targeted futures?
 
There will be alternatives to UCIe.. UCIe is only the beginning

Packet switching "looks more efficient" because there is an assumption of some cost/performance tradeoffs.... If you can get circuit switching at the same cost as packet (or even lower) then packet switching stops making sense. (a core aspect of what I am understanding as packet switching includes externally managed addressing, routing and transmission/retransmissions - are we aligned here?)

Congestion management benefits presume that there is congestion. If circuit designs can provision for 100% of possibilities then there is no congestion. Again - there is some fixation that eliminating congestion is contingent on cost, but if the cost of 100% throughput moves to zero, then it is hardly a thought.

I am focused on local / known data exchange - such as in semiconductors, in a data center rack and on a desk. Communications between distant / unknown systems need not be replicated to connect known and co-located systems. This is the greatest fallacy in engineering/computing today and it leads to all kinds of inefficiencies and overheads. Fundamentally, shifting away from "handling" data between unknown and unintelligent systems will unleash incredible breakthroughs in speed, power, flexibility and latency.
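As context for where the cost of "100% throughput" comes from, here is a back-of-the-envelope sketch of the standard scaling argument (my own illustration, with no relation to any particular product): a strictly non-blocking crossbar needs N^2 crosspoints, while a rearrangeably non-blocking Benes network built from 2x2 elements needs on the order of N*log2(N) of them, so guaranteeing zero congestion in hardware gets expensive quickly as port counts grow.

# Back-of-the-envelope cost of a fully non-blocking fabric (illustrative only).
import math

def crossbar_crosspoints(n_ports):
    """Strictly non-blocking crossbar: one crosspoint per input/output pair."""
    return n_ports * n_ports

def benes_2x2_switches(n_ports):
    """Rearrangeably non-blocking Benes network: (2*log2 N - 1) stages of N/2 2x2 switches."""
    stages = 2 * int(math.log2(n_ports)) - 1
    return stages * n_ports // 2

for n in (64, 1_024, 65_536):
    print(f"{n:>6} ports: crossbar {crossbar_crosspoints(n):>13,} crosspoints, "
          f"Benes {benes_2x2_switches(n):>9,} 2x2 switches")

Whether either curve can be made cheap enough at very large port counts is precisely the point under debate in this thread.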
 
Can you expand on this? It's not clear why a 'one-stop-shop' vendor can't exist.
Hi M.Y. - The one-stop-shop idea assumes that a single company will give you the best solutions across all problem spaces. That is not realistic as 3D-IC is driving chip designers to take on more system-level analysis physics than traditionally appear in a monolithic chip design. Things like thermo-mechanical stress/warpage analysis are addressed by powerful and mature commercial solutions - just not from EDA companies. This analysis is crucial for multi-die assemblies like all the recent X86 microprocessor products on the market. Similar challenges arise from electromagnetic interactions, and thermal analysis.
These issues are not trivial and have decades of industry experience to inform them. It is extremely naive to think you can just hire a few guys, write some code, and end up with a competent industrial solution for these problems. That is why all leading designers pull together specialized solutions from multiple companies with a proven track record in each technology. While there are benefits to integration, they do not outweigh the benefits of superior technology. And no one can be the best at everything.
 
There will be alternatives to UCIe.. UCIe is only the beginning
Maybe, but every significant chip company in the industry is working on UCIe and promising to deliver it. And that article is just marketing material, and Eliyan is a client of the article's author.
Packet switching "looks more efficient" because there is an assumption of some cost/performance tradeoffs.... If you can get circuit switching at the same cost as packet (or even lower) then packet switching stops making sense. (a core aspect of what I am understanding as packet switching includes externally managed addressing, routing and transmission/retransmissions - are we aligned here?)
No, we're not aligned. Are you referring to transient circuits or persistent circuits? If you can support a very large number of persistent circuits, it is possible that latency could be lower than with packet switching, but then the question becomes... at what level does the circuit exist? What level are you referring to? Just the physical ports? Higher?

Ethernet does have an externally managed address, in that the endpoint address is the port's MAC address, which is assigned by the manufacturer. Other interconnects assign addresses at the endpoint, which are discovered, or through fabric managers, which can assign them centrally.
Congestion management benefits presume that there is congestion. If circuit designs can provision for 100% of possibilities then there is no congestion.
The only interconnects which don't congest are direct interconnects (point to point), like a dragonfly. Direct interconnects with intermediate routing, like a torus, can and usually do congest.
Again - there is some fixation that eliminating congestion is contingent on cost, but if the cost of 100% throughput moves to zero, then it is hardly a thought.
You're going to have to move beyond this sort of dreaminess for us to discuss it. I know what comes next... I'm just having a failure of imagination.
I am focused on local / known data exchange - such as in semiconductors, in a data center rack and on a desk.
Three problem spaces which have pretty much nothing in common now.
Communications between distant / unknown systems need not be replicated to connect known and co-located systems.
I don't know what this means. Can you explain it?
This is the greatest fallacy in engineering/computing today and it leads to all kinds of inefficiencies and overheads. Fundamentally, shifting away from "handling" data between unknown and unintelligent systems will unleash incredible breakthroughs in speed, power, flexibility and latency.
@cliff , are you paying Chris to post stuff like this to amuse yourself?
 
No, but I've thought about it.
ROFL

I completely understand that I am coming from far out in left field.
..... I can see that cliff is very open to encouraging someone that is even more left field than he is.

And as much as I confuse.. I do promise that there is a solid landing on the other side of imagination
(granted from an engineering perspective there are unicorns in that space)
 
If you are into learning more about the platform: https://arxiv.org/ftp/arxiv/papers/2304/2304.01433.pdf
Or more about the optical switching, https://research.google/pubs/pub51587/

Personally, I like SRD better for general purpose switching since it supports a more any-to-any, random access pattern needed for remote procedure calls to services all around a data center, including resources scaled out over thousands of machines: https://blog.ipspace.net/2022/12/quick-look-aws-srd.html
 
Maybe, but every significant chip company in the industry is working on UCIe and promising to deliver it. And that article is just marketing material, and Eliyan is a client of the article's author.
Yes - the author of the paper is an investor / cofounder of Eliyan, which is building upon the UCIe concept with an alternative implementation. I can see some value to the offering - but it will be hard to compete against an open standard - although there will be variations. Everyone wants to eke out an edge in the market.
.... as you might expect - I have an alternative in the works that is a lot less cumbersome compared to CXL and UCIe. (I know, I know - we can come back to it later)

No, we're not aligned. Are you referring to transient circuits or persistent circuits? If you can support a very large number of persistent circuits, it is possible that latency could be lower than with packet switching, but then the question becomes... at what level does the circuit exist? What level are you referring to? Just the physical ports? Higher?

Ethernet does have an externally managed address, in that the endpoint address is the port's MAC address, which is assigned by the manufacturer. Other interconnects assign addresses at the endpoint, which are discovered, or through fabric managers, which can assign them centrally.

The only interconnects which don't congest are direct interconnects (point to point), like a dragonfly. Direct interconnects with intermediate routing, like a torus, can and usually do congest.
The design provisions for the creation of point-to-point circuits for all ports simultaneously - on an on-demand/as-required basis. These are direct port-to-port circuits from system to system - exactly like dragonfly. So yes - this is exactly the idea/application. I don't use any intermediate routing like a torus or a switched packet system like Ethernet. The closest concept is RapidIO, which pretends to be a peripheral master/slave connection by assigning slave addresses to systems along with master addresses. Cumbersome, but it works in some cases.

What I do is provision for 100% data transfer coverage between any number of systems. We can support fully bidirectional transmissions by binding two circuits ... this expands to permitting a one-to-many and many-to-one connection with congestion constraints at the "one" - but logically possible with multiple ingress ports configured.
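(For illustration only: a minimal sketch of one way an on-demand, port-to-port circuit table with bidirectional binding could be modelled. The class, method names, and behaviour here are hypothetical guesses at the idea being described, not the actual design.)

# Minimal sketch (hypothetical names and behaviour, not the actual design):
# a box with one port per system, where any free pair of ports can be bound
# into a dedicated circuit on demand and released when the transfer is done.
class CircuitBox:
    def __init__(self, n_ports):
        self.peer = [None] * n_ports   # peer[p] = port that p is circuit-bound to, or None

    def connect(self, a, b):
        """Bind ports a and b into a point-to-point circuit if both are free."""
        if a != b and self.peer[a] is None and self.peer[b] is None:
            self.peer[a], self.peer[b] = b, a
            return True
        return False                   # one endpoint busy: congestion at the endpoint, not the fabric

    def release(self, a):
        """Tear the circuit down from either end."""
        b = self.peer[a]
        if b is not None:
            self.peer[a] = self.peer[b] = None

box = CircuitBox(8)
assert box.connect(0, 5)       # system 0 <-> system 5, dedicated until released
assert not box.connect(3, 5)   # port 5 is busy; a second talker must wait or use another ingress
box.release(0)
assert box.connect(3, 5)

In this reading, the connect() failure path corresponds to the "congestion constraints at the one" mentioned above: a busy endpoint rather than a busy fabric.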

The MAC address is leveraged by an Ethernet switch for the assignment of an address - so I will add the precision that the network address is aligned with the MAC address and that the MAC address itself is not the address for the network. In a cross-over configuration, one must assign addresses manually in the IP stack. Fabrics also assign a logical address to each physical connection - either by using a hardware signature or implementing one of their own. My point is - these "addresses" are for the benefit of the network routing and switching....

You're going to have to move beyond this sort of dreaminess for us to discuss it. I know what comes next... I'm just having a failure of imagination.

Let me see if I can help draw a picture.... imagine a box that has a connection for every system (two if you want to talk bidirectional, but let's imagine the data transfers are small and the transmission speed is fast so that we can flip directions on a single circuit)

So every system is connected to this box - and every system will either read or write to this box.

Now imagine that, in this box, there is a system delegate or proxy. Let me pull you into my dreamworld of AI and ask you to imagine that this delegate or proxy is like a delegate at the UN... Every country is present - insofar as they have a delegate at the discussion hall - and each delegate can find a table at which to negotiate a deal with every other delegate. So many papers are exchanged - in person - on behalf of the remote governments. Regardless of how each delegate is instructed - they act with the full authority of the home country. So - with local intelligence - each delegate finds the other and deals direct - resulting in seamless communication. And... if we move back to a computer - we know that we can make an AI delegate more efficient than a human delegate - so if our AI delegate can act as efficiently as a machine and as independently as a human - we can see how systems exchanging data across sufficient "tables" in our UN box can exchange any amount of information without congestion: delegates find each other, there are sufficient tables at which to sit, and the AI delegate is machine efficient. Actually - NVIDIA kinda does this through its AI-trained switches... it trains its AI-powered switches to learn what paths need to be created.

That is exactly what I put in a box.

The difference is that - unlike the NVIDIA AI-trained switches - I have discovered how to train the delegates - "just in time" - so there is no upfront training. Very much like science fiction - but oh so much simpler (I am really not as smart or well trained as others on here).

So there is no congestion - there are always enough tables to support the creation of a conversation / link. And my AI delegates are assigned a table at which to transact on a just-in-time basis.

Three problem spaces which have pretty much nothing in common now.
I don't know what this means. Can you explain it?

Computer networks were created to connect systems that were miles apart. They were connected in a way similar to the telephone. Telephones connected to a local switchboard and were manually plugged into another connection to create a circuit. When physically separated - an address is required to differentiate one thing versus another. If you live in a huge country, you rely on postal codes or zip codes. If you live in a small rural town you say "the house with the big blue barn". When everything can be aware of the other thing - each can self-identify. I can introduce myself in a room and exchange a business card with anyone. This is a simple, intelligent and ad-hoc circuit (biz card transfer). In an Ethernet world, I have to go register, get a number, find someone else's number - then line up to provide my card to the room's business card proctor and have that proctor stuff my card in the recipient's inbox. What results is a lot of junk - zero-trust implementations of rejecting every card until each is validated and verified, etc. It is just silly to impose a business card proctor on me to exchange my information with another person in the room.

And our answer is for computer systems that sit beside each other on the same desk to do the same - just faster. Or for two IP blocks in the same electronic design to do the same. If I - as a user, designer or technician - can commonly control two systems, why can't I make them transfer data directly? In all cases I am in control of all three scenarios - I can set the rules for successful interaction.

In a first simulation - I wired two USB controllers together - directly wired them. Literally! I hit receive on the first system - and send on the second system. And the first 16 bits flew across at 457MBps. .... it took some time to scale the data, and scale the number of systems - but we did it.

It comes down to this simple axiom: "if lines are cheap, use circuit switching; if computing is cheap, use packet switching" - Roberts, "The Evolution of Packet Switching," Proceedings of the IEEE (Volume 66, Issue 11), Nov 1978.

I don't know what this means. Can you explain it?

It means I can train intelligent systems that are side by side, directly connect them to a common magic box - such that they transfer data without network control or any external address. Each system and data exchange is fully secure and private. And by "intelligent system" I mean "any processor that can run logic" - so I treat semiconductors, laptops and servers all identically and interchangeably.

As to scaling - as long as I can connect to an external optical interconnect (available) I can conceivably connect 100,000 systems to a point-to-point switched interconnect at under 1 ns in an any-to-any and any-to-many configuration. Such a platform supports 1.4 petabytes per second of simultaneous switching at under 1 kW, using technology that exists today.

It really is all unicorns and rainbows.

I already have it working. If anyone would like to peer review a paper illustrating it working - please DM me.
(and by peer review, I mean review a wannabe peer's work - I am truly working on better explaining why/how this really works)
 
Or more about the optical switching, https://research.google/pubs/pub51587/

Personally, I like SRD better for general purpose switching since it supports a more any-to-any, random access pattern needed for remote procedure calls to services all around a data center, including resources scaled out over thousands of machines: https://blog.ipspace.net/2022/12/quick-look-aws-srd.html

If I hadn't created a magic box that works with negligible power and completely eliminates tail latencies - I would agree.
I will say that my design does tend to abuse the term "remote procedure calls" but let's say that delegated access to any / all connected resources is totally the objective!
 
There will be alternatives to UCIe.. UCIe is only the beginning
UCIe is several layers. You can swap in a PHY like Eliyan and still have a UCIe stack. It just needs both chips to agree, but would not disturb the OS or applications if done right.
Packet switching "looks more efficient" because there is an assumption of some cost/performance tradeoffs.... If you can get circuit switching at the same cost as packet (or even lower) then packet switching stops making sense. (a core aspect of what I am understanding as packet switching includes externally managed addressing, routing and transmission/retransmissions - are we aligned here?)
Packet switching assumes each packet is allowed a different destination. Modern switches assume this is only the outlier, and cache for MRU routing patterns, but they still need thousands of those. Circuit switching like Google is doing requires dominant flows to be reasonably stable, and for it to be economic to over-provision capacity. The over-provisioning is already commonplace: packet switching is so unstable when low latencies are required (and the money-making flows are all low-latency with long-pole sensitivity) that data center networks carry only a small fraction of their peak capacity, and not by accident.

Situations like machine learning may be ideal for this, when they can afford to have all-to-all links from per-rack routers and all the flows are balanced over long periods of time (configure a subset of the machines allocated to solving one problem for several days or even weeks). That is clover for circuit switching. But it is not so obviously practical when the task is to connect a million VMs to a million storage servers using balanced scale-out and random access.
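To put rough numbers on why stable, long-lived flows suit circuit switching while short random messages do not, here is a toy amortization sketch (the setup time and link speed are my own assumptions, not figures from the TPUv4 paper):

# Toy amortization arithmetic (assumed numbers, purely illustrative): a circuit
# only pays off when it stays up long enough to hide the cost of setting it up.
SETUP_S = 1e-3          # assume ~1 ms to (re)configure an optical circuit switch
LINK_BPS = 100e9        # assume a 100 Gb/s link

def setup_fraction(transfer_seconds):
    """Fraction of total time spent establishing the circuit rather than moving data."""
    return SETUP_S / (SETUP_S + transfer_seconds)

rpc_transfer_s = 1_000_000 * 8 / LINK_BPS   # a 1 MB request/response: ~80 us on the wire
ml_flow_s = 7 * 24 * 3600                   # an all-to-all flow held for a week of training

print(f"1 MB RPC:          {setup_fraction(rpc_transfer_s):.0%} of the time is circuit setup")
print(f"week-long ML flow: {setup_fraction(ml_flow_s):.2e} of the time is circuit setup")

With these assumed numbers, a short request/response spends most of its life waiting for the circuit to come up, while a week-long training flow amortizes the same setup into insignificance - the same trade-off the Roberts quote later in this thread captures.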
Congestion management benefits presume that there is congestion. If circuit designs can provision for 100% of possibilities then there is no congestion. Again - there is some fixation that eliminating congestion is contingent on cost, but if the cost of 100% throughput moves to zero, then it is hardly a thought.
Classic WAN-style congestion is not a data center issue, because links are over-provisioned to ensure uniform low latency. See the SRD paper in my previous post.
I am focused on local / known data exchange - such as in semiconductors, in a data center rack and on a desk. Communications between distant / unknown systems need not be replicated to connect known and co-located systems. This is the greatest fallacy in engineering/computing today and it leads to all kinds of inefficiencies and overheads. Fundamentally, shifting away from "handling" data between unknown and unintelligent systems will unleash incredible breakthroughs in speed, power, flexibility and latency.
You are correct, and people tend to rush to scale-out too early. Up to about 100 terabytes of data you can generally solve problems on co-located tightly coupled machines. The devil in the details is how you survive failures ranging from ensuring your data is replicated, to how rapidly you can fail over to new hardware if your machine has a fatal fault. And then there are issues of usage density. Distributed machines can put many more resources into random and parallel access so that even if your problem fits in one machine, it might work a lot better scaled across 100 machines shared with 100 other clients in a cloud (at least the data side of things: computation tends to benefit from concentration in as few machines as possible). Which is why data storage scale out is enormous in the cloud.
 
@Tanj - Good points - and completely within the realm of current engineering approaches.

In addition to my earlier explanation: Our design does not over-provision - only 100% of potential data exchange is implemented. Like the Google example - this is like AI routing or circuit creation, but there is no training required. Our method lets connected systems create the circuit in our magic box as they want it, whenever they want it.

When I consider distributed systems - I am connecting them to each other - and not to the middle. Like the two laptops on your desk.... a direct circuit using USB up to 120Gbps these days. ;-)
 
In addition to my earlier explanation: Our design does not over-provision - only 100% of potential data exchange is implemented. Like the Google example - this is like AI routing or circuit creation, but there is no training required. Our method lets connected systems create the circuit in our magic box as they want it, whenever they want it.
You must overprovision to avoid latency from queueing. Modern latency expectations in a data center are around a microsecond transit time between any two machines. Less, if they are neighbors. Random fluctuations in demand require you to provision to handle peaks. If an application uses scatter-gather ratios of 100 (realistic for data-centric apps) then the slowest of those 100 requests will set the latency of the stage in an app. So, your capacity should handle peaks in the 99th percentile, or better.
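To put numbers on the scatter-gather point, here is a small illustrative calculation in the spirit of the usual "tail at scale" argument (the fan-out and percentages are my own assumptions):

# Illustrative "tail at scale" arithmetic (my own numbers): a stage that fans out
# to 100 workers finishes only when the slowest reply arrives, so the chance of
# hitting at least one slow worker compounds with fan-out.
fanout = 100
p_slow_per_worker = 0.01            # each worker exceeds the latency budget 1% of the time

p_stage_slow = 1 - (1 - p_slow_per_worker) ** fanout
print(f"P(stage blows its budget) = {p_stage_slow:.1%}")          # ~63%

# To keep the whole stage under budget 99% of the time, each worker must be under
# budget with probability 0.99 ** (1/fanout), i.e. roughly the 99.99th percentile.
per_worker_target = 0.99 ** (1 / fanout)
print(f"required per-worker success rate = {per_worker_target:.6f}")  # ~0.9999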
When I consider distributed systems - I am connecting them to each other - and not to the middle. Like the two laptops on your desk.... a direct circuit using USB up to 120Gbps these days. ;-)
The domain I come from has 10 million cores in a data center in a unified network (a single cloud data center). Networking within rack scale is uninteresting. The hardware for that is cheap, the perf problems are insignificant. A 25Tbps full duplex switch is a single chip and units built around them ship in volumes of 100's of thousands. Even a rough protocol like RoCE works smoothly on a single hop through those.

It starts to get interesting at cluster scale, which is generally about 20 racks and 50,000 cores. Though even that is fairly easy for general purpose compute. It does become challenging for supercomputers and ML clusters, which have orders of magnitude more traffic than general purpose compute. Hence the effort for TPUv4 to innovate at that level.
 
The design provisions for the creation of point-to-point circuits for all ports simultaneously - on an on-demand/as-required basis. These are direct port-to-port circuits from system to system - exactly like dragonfly. So yes - this is exactly the idea/application. I don't use any intermediate routing like a torus or a switched packet system like Ethernet.
You understand that a torus can be and usually is just a topological variation on a switched packet system, right? In a torus, or a dragonfly for that matter, the switch is just integrated into the nodes. Both direct interconnects and systems with discrete switches can use packet switching or circuit switching. The decision of which to use is independent of the networks' topologies.
The MAC address is leveraged by an Ethernet switch for the assignment of an address - so I will add the precision that the network address is aligned with the MAC address and that the MAC address itself is not the address for the network.
This sentence is difficult to comprehend. Are you saying this magic box you're building works with Ethernet networks?
In a cross-over configuration, one must assign addresses manually in the IP stack. Fabrics also assign a logical address to each physical connection - either by using a hardware signature or implementing one of their own. My point is - these "addresses" are for the benefit of the network routing and switching....
I don't understand your point. What is a "cross-over configuration"? Are you really trying to describe a crossbar?

Yes, some interconnects, Fibre Channel and InfiniBand come to mind, use fabric managers to dynamically assign identifiers to ports to make switching tables more efficient. I can imagine what a pure point-to-point interconnect might look like, as in a crossbar, so you don't need addressing, but your terminology and your referencing of standard terms is not precise or easy to interpret.
Let me see if I can help draw a picture.... imagine a box that has a connection for every system (two if you want to talk bidirectional, but let's imagine the data transfers are small and the transmission speed is fast so that we can flip directions on a single circuit)
Most people in the field call this a crossbar.
So every system is connected to this box - and every system will either read or write to this box.

Now imagine that, in this box, there is a system delegate or proxy. Let me pull you into my dreamworld of AI and ask you to imagine that this delegate or proxy is like a delegate at the UN... Every country is present - insofar as they have a delegate at the discussion hall - and each delegate can find a table at which to negotiate a deal with every other delegate. So many papers are exchanged - in person - on behalf of the remote governments. Regardless of how each delegate is instructed - they act with the full authority of the home country. So - with local intelligence - each delegate finds the other and deals direct - resulting in seamless communication. And... if we move back to a computer - we know that we can make an AI delegate more efficient than a human delegate - so if our AI delegate can act as efficiently as a machine and as independently as a human - we can see how systems exchanging data across sufficient "tables" in our UN box can exchange any amount of information without congestion: delegates find each other, there are sufficient tables at which to sit, and the AI delegate is machine efficient.
I have no idea what you're talking about.
Actually - NVIDIA kinda does this through its AI-trained switches... it trains its AI-powered switches to learn what paths need to be created.
Nvidia has two families of networks. NVLink, which they use to provide high-throughput, low-latency distributed shared virtual memory between GPUs, and the (formerly Mellanox) family of InfiniBand and Ethernet switches and adapters. I've never seen any evidence that either network family uses "AI-powered" switches to control routing. Can you point to evidence that they do?
That is exactly what I put in a box.

The difference is that - unlike the NVIDIA AI-trained switches - I have discovered how to train the delegates - "just in time" - so there is no upfront training. Very much like science fiction - but oh so much simpler (I am really not as smart or well trained as others on here).

So there is no congestion - there are always enough tables to support the creation of a conversation / link. And my AI delegates are assigned a table at which to transact on a just-in-time basis.
Same comment as above.
Computer networks were created to connect systems that were miles apart. They were connected in a way similar to the telephone. Telephones connected to a local switchboard and were manually plugged into another connection to create a circuit. When physically separated - an address is required to differentiate one thing versus another. If you live in a huge country, you rely on postal codes or zip codes. If you live in a small rural town you say "the house with the big blue barn". When everything can be aware of the other thing - each can self-identify. I can introduce myself in a room and exchange a business card with anyone. This is a simple, intelligent and ad-hoc circuit (biz card transfer). In an Ethernet world, I have to go register, get a number, find someone else's number - then line up to provide my card to the room's business card proctor and have that proctor stuff my card in the recipient's inbox. What results is a lot of junk - zero-trust implementations of rejecting every card until each is validated and verified, etc. It is just silly to impose a business card proctor on me to exchange my information with another person in the room.
This is nonsense. Ethernet, for example, was originally a CSMA/CD bus. The POTS network was the original circuit-switching network. Your use of inappropriate and incorrect examples erodes your credibility.
And our answer is for computer systems that sit beside each other on the same desk to do the same - just faster. Or for two IP blocks in the same electronic design to do the same. If I - as a user, designer or technician - can commonly control two systems, why can't I make them transfer data directly? In all cases I am in control of all three scenarios - I can set the rules for successful interaction.
This is exactly how some interconnects, like PCIe, work. This is also my understanding of how NVLink works. So does Ethernet, in its native layer 2 form. I'd have to think about the various modes of FC and IB to compare them, but this discussion isn't worth stretching my brain like that.
In a first simulation - I wired two USB controllers together - directly wired them. Literally! I hit receive on the first system - and send on the second system. And the first 16 bits flew across at 457MBps. .... it took some time to scale the data, and scale the number of systems - but we did it.

It comes down to this simple axiom: "if lines are cheap, use circuit switching; if computing is cheap, use packet switching" - Roberts, "The Evolution of Packet Switching," Proceedings of the IEEE (Volume 66, Issue 11), Nov 1978.
I know what circuit switching is.
It means I can train intelligent systems that are side by side, directly connect them to a common magic box - such that they transfer data without network control or any external address. Each system and data exchange is fully secure and private. And by "intelligent system" I mean "any processor that can run logic" - so I treat semiconductors, laptops and servers all identically and interchangeably.

As to scaling - as long as I can connect to an external optical interconnect (available) I can conceivably connect 100,000 systems to a point-to-point switched interconnect at under 1 ns in an any-to-any and any-to-many configuration. Such a platform supports 1.4 petabytes per second of simultaneous switching at under 1 kW, using technology that exists today.

It really is all unicorns and rainbows.

I already have it working. If anyone would like to peer review a paper illustrating it working - please DM me.
(and by peer review, I mean review a wannabe peer's work - I am truly working on better explaining why/how this really works)
Since light or electricity travels about 1 foot per nanosecond, connecting 100,000 systems with 1ns latency is quite a trick. I agree, it does appear to be all unicorns and rainbows.
 
Ok - too much too fast. Got it.

A dragonfly cabling arrangement is a direct data topology ... I was trying to say that this is what we create - a direct point-to-point data path - implemented dynamically and temporarily. A cross-over cable is a cable that has had the internal wires crossed over so as to directly connect two systems. https://en.wikipedia.org/wiki/Ethernet_crossover_cable

Ethernet is one (very popular) network. And yes, Ethernet is a shared transmission medium based on a CSMA/CD bus. Networks - however - were created prior to Ethernet - they were a cumbersome arrangement of point-to-point circuits. Ethernet was a way to establish a local network by eliminating the set-up and tear-down of the point-to-point data circuits that connected the earliest systems. Internet Protocol addressing later extended this "shared transmission" beyond the local segment. There is no misunderstanding. Ethernet (IMHO) works for the Internet but isn't appropriate for local connections (ironic).

The NVSwitch chip is the interconnect which allows multiple instances of NVLink—the direct chip-to-chip connection—to talk and enable lots of Nvidia graphics cards to work together. 25.1 billion transistors on the NVLINK Switch are required to support 256 connections - supporting 56TBps. You are right of course that the switches are not themselves Ai-enabled. I mixed up NVIDIA's AI marketing fluff with dynamically reconfiguring CLOS and BENES arrangements. Cray's BlackWidow processor is a more appropriate example. It effectively trains itself to establish the best 3-tier connection between systems to reduce latency and find the most efficient route.

The system to which I refer creates a bridge between any of the connected points in a switch box - a temporary data bridge - for however long it's required. All of these bridges are non-blocking. Pretty simple. As 'creating a bridge' whenever required might seem a bit science fiction - I drew a parallel to delegated negotiations (data exchanges) between people. If you can accept that people can enter a room and have a conversation without some network authority - then consider that systems that are made aware of each other can be logically programmed to do the same.

As to protocol: connection protocols are completely irrelevant to the design and functioning of the switching arrangement. I argue that a packet-based connection between system and switchbox is best left as a circuit.... packet overheads are useless in my world. With direct connections to/from each system - our magic box creates direct point-to-point / point-to-multipoint circuits on the fly - for as many connections as are required. Each signal received by the box is unpacked from its medium and protocol. A preferred implementation between system and switchbox is PCIe - and an optical medium might be best for distance. The result is that this magic box doesn't care if it is replacing a NoC, a desktop switch or a data center switching platform. Arguably it would intermediate any and all of these.

The intelligence (logic) of how to exchange data between known systems is put in the connected systems - not the switchbox. The whole "rely on an external network" paradigm is overwrought and unnecessary when systems are "known". Network design is the cause of its multiple security issues.

I noted that the switch latency for the 100,000 systems is 1 ns for each connection. Each connection happens in parallel and independent of the others, as all are non-blocking and there is no congestion. I am simply proposing something way simpler than anything implemented thus far, and therefore confusing enough without twisting my statements. Each switched connection is delayed by a few clock cycles across the switch - and each cycle is constrained by at least propagation speeds. So yes - everything conforms to physical realities. Two systems each within 10 meters of the central box would exchange data in about 80ns....
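For anyone checking the arithmetic, here is a propagation-only sketch (my own assumed velocity factors, not the poster's design figures); at desk and rack distances the speed of light in the medium sets the floor before any switching cycles are added:

# Propagation-only sanity check (assumed velocity factors, illustrative only).
C_M_PER_NS = 0.2998                     # speed of light in vacuum, metres per nanosecond

def propagation_ns(path_m, velocity_factor):
    """One-way time on the wire for a given path length and signalling medium."""
    return path_m / (C_M_PER_NS * velocity_factor)

# Two systems each 10 m from the central box: roughly 20 m of end-to-end path.
for medium, vf in (("free space", 1.0), ("typical fibre/copper", 0.67)):
    print(f"{medium:>20}: {propagation_ns(20, vf):5.0f} ns before any switching cycles")

With these assumptions the 20 m path alone costs roughly 67 ns at the vacuum speed of light and about 100 ns in typical fibre or copper, before counting any cycles through the central box.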

Honestly I cannot explain things within a constrained view of data exchange when preponderant engineering views of data exchange include a lot of unnecessary steps and preclude a simpler truth.

Next time you talk to someone - ask yourself why it works.
- You talk - they listen ....
- if they didn't care to listen then they didn't hear you.
- If they did listen - they did hear you.

Not sure why a "data exchange" needs to be more complicated than that.
 