
Google says its AI supercomputer is faster, greener than Nvidia A100 chip

You must overprovision to avoid latency from queueing. Modern latency expectations in a data center are around a microsecond of transit time between any two machines, less if they are neighbors. Random fluctuations in demand require you to provision to handle peaks. If an application uses scatter-gather ratios of 100 (realistic for data-centric apps), then the slowest of those 100 requests sets the latency of that stage of the app. So your capacity should handle peaks at the 99th percentile, or better.
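To make that fan-out effect concrete, here is a small illustrative simulation; the latency distribution and the 1% slow-path rate are invented for the example, not measured numbers:

```python
# Illustrative only: a 100-way scatter-gather stage is governed by the tail of
# the per-request latency distribution, because the stage finishes only when
# its slowest sub-request returns. All numbers here are made up for the demo.
import random

random.seed(1)

def request_latency_us():
    base = random.gauss(1.0, 0.1)       # ~1 us typical transit + service time
    if random.random() < 0.01:          # 1% of requests hit a slow path
        base += random.uniform(5, 20)
    return max(base, 0.1)

FANOUT = 100
stage = sorted(max(request_latency_us() for _ in range(FANOUT))
               for _ in range(10_000))
print(f"median stage latency: {stage[len(stage) // 2]:.1f} us")
print(f"p99 stage latency:    {stage[int(0.99 * len(stage))]:.1f} us")
```

With a 1% slow path per request, roughly 1 - 0.99^100 ≈ 63% of stages contain at least one slow request, which is why you provision to the 99th percentile (or better) of the per-request distribution if you want the stage latency to stay predictable.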

The domain I come from has 10 million cores in a data center on a unified network (a single cloud data center). Networking at rack scale is uninteresting: the hardware is cheap and the performance problems are insignificant. A 25 Tbps full-duplex switch is a single chip, and units built around them ship in volumes of hundreds of thousands. Even a rough protocol like RoCE works smoothly on a single hop through those.

It starts to get interesting at cluster scale, which is generally about 20 racks and 50,000 cores, though even that is fairly easy for general-purpose compute. It does become challenging for supercomputers and ML clusters, which have orders of magnitude more traffic than general-purpose compute. Hence the effort in TPUv4 to innovate at that level.

The design is to eliminate queuing so that no overprovisioning is required. Also, the goal is to keep everything to a single hop. For now, we constrained the design to 100% capacity, so there is no tail latency because there is a perfect balance between throughput and demand.

10 million cores is a real stretch. Cluster scale of 50,000 cores is much more achievable with some level of aggregation.

Having said that, we would likely provision compute in a whole other manner.

Definitely not something we would tackle in the short term. Smaller node implementations, under a few hundred endpoints, are more addressable for NoCs, LANs and AI/ML.
 
Ok - too much too fast. Got it.

A dragonfly cabling arrangement is a direct data topology ... I was trying to say that this is what we create - a direct point-to-point data path - implemented dynamically and temporarily.
Ok, that answers one of my questions, are circuits transient or persistent? They're transient.
A cross-over cable is a cable that has had the internal wires crossed over so as to directly connect two systems. https://en.wikipedia.org/wiki/Ethernet_crossover_cable
I know what a crossover-cable is, but you left off the cable part. I get confused easily.
Ethernet is one (very popular) network. And yes, Ethernet is a shared transmission medium based on a CSMA/CD bus.
Ethernet began as a CSMA/CD bus. Each bus was a "segment", and the segments were interconnected by "bridges". Later, as people wanted to create larger LANs, Ethernet was extended to include point-to-point links, which replaced buses, though the term bridges remained, especially in the IEEE 802.1 specifications. For at least 20 years bridges have been referred to as switches, though still not by the IEEE. (When the bridging specification was enhanced to include traffic classes and other data center specific features, for example, the spec was called Data Center Bridging. Virtualization support for endpoints was called Edge Virtual Bridging. That's approaching 20 years ago though.)

I don't know of any Ethernet products which are still buses. In the 1990s, I got my first tutorial on how Ethernet worked from a network-expert colleague of mine, coincidentally named Chris, who schooled me a bit in Ethernet and IP. Back then, integrated L2/L3 devices were called bridging routers. I remember rolling my eyes at the jargon.

We could spend hundreds of posts discussing Ethernet and its current state in the industry, but that's not relevant here. I only bring it up because you mentioned Ethernet technology and misstated its history and some aspects of how it works.
Networks, however, were created prior to Ethernet - they were a cumbersome arrangement of point-to-point circuits. Ethernet was a way to establish a local network by eliminating the set-up and tear-down of the point-to-point data circuits that connected the earliest systems. This was expanded via Internet Protocol addressing to extend this "shared transmission". There is no misunderstanding. Ethernet (IMHO) works for the Internet but isn't appropriate for local connections (ironic).
You keep digging yourself in deeper and deeper... I recommend you stop digging. First of all, modern Ethernet is point-to-point. I was correcting your history. Second, Internet Protocol was created in a government agency called ARPA, by Vinton Cerf and Bob Kahn. They also created TCP. And, while there is a metropolitan area version of Ethernet, most of the internet is actually run over Wide Area Networks. The association between Ethernet and Internet Protocol is loose, and getting looser every year as WiFi and cellular are connecting the most devices by volume.
The NVSwitch chip is the interconnect which allows multiple instances of NVLink - the direct chip-to-chip connection - to talk, and enables lots of Nvidia graphics cards to work together. 25.1 billion transistors on the NVLink Switch are required to support 256 connections, supporting 56 TBps. You are right, of course, that the switches are not themselves AI-enabled. I mixed up Nvidia's AI marketing fluff with dynamically reconfiguring Clos and Beneš arrangements. Cray's BlackWidow processor is a more appropriate example: it effectively trains itself to establish the best 3-tier connection between systems to reduce latency and find the most efficient route.
The Cray interconnect you're discussing is from 2007, or thereabouts. The latest Cray interconnect (Cray is owned by HPE now) is called Slingshot, and it is a very fine piece of work. I recommend you read about it.

The system to which I refer creates a bridge between any of the connected points in a switch box - a temporary data bridge - for however long it's required. All of these bridges are non-blocking. Pretty simple. As 'creating a bridge' whenever required might seem a bit like science fiction, I drew a parallel to delegated negotiations (data exchanges) between people. If you consider that people can enter a room and have a conversation without some network authority, then consider that systems made aware of each other can be logically programmed to do the same.
They don't sound like science fiction, they're called transient circuits, at the hardware level. At Layer 4 they're called connections, unless you're using IB, and then they're "queue-pairs".
As to protocol: connection protocols are completely irrelevant to the design and functioning of the switching arrangement. I argue that a packet-based connection between system and switchbox is best left as a circuit.... packet overheads are useless in my world. With direct connections to/from each system, our magic box creates direct point-to-point / point-to-multipoint circuits on the fly, for as many connections as are required. Each signal received by the box is unpacked from its medium and protocol. A preferred implementation between system and switchbox is PCIe, and an optical medium might be best for distance. The result is that this magic box doesn't care whether it is replacing a NoC, a desktop switch or a data center switching platform. Arguably it would intermediate any and all of these.
I sort of agree, but some modern data center Ethernet switches are flow-aware, and programmable by the P4 language and its run-time environment.
The intelligence (logic) of how to exchange data between known systems is put in the connected systems, not the switchbox. The whole "rely on an external network" paradigm is overwrought and unnecessary when systems are "known". Network design is the cause of its multiple security issues.

I noted that the switch latency for the 100,000 systems is 1 ns for each connection. Each connection happens in parallel and independent of the others, as all are non-blocking and there is no congestion. I am simply proposing something way simpler than anything implemented thus far, and it is therefore confusing enough without twisting my statements. Each switched connection is delayed by a few clock cycles across the switch, and each cycle is constrained by at least propagation speeds. So yes, everything conforms to physical realities. Two systems, each within 10 meters of the central box, would exchange data in about 80 ns....
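For a rough sanity check of that figure, a back-of-envelope sketch: the 4-cycle / 4 GHz figures behind the 1 ns number are ones I use later in this thread, while the propagation velocity is an assumption and dominates the total.

```python
# Back-of-envelope latency for two systems, each 10 m from a central switch box.
CLOCK_HZ = 4e9          # assumed switch clock (quoted later in this thread)
SWITCH_CYCLES = 4       # claimed cycles to cross the switch
LINK_M = 10             # distance from each system to the box
V_PROP = 2.5e8          # m/s, ~0.83c; an assumption - varies with cable/fiber

switch_ns = SWITCH_CYCLES / CLOCK_HZ * 1e9     # 1.0 ns across the fabric
prop_ns = 2 * LINK_M / V_PROP * 1e9            # source -> box -> destination
print(f"switch traversal: {switch_ns:.0f} ns")
print(f"propagation:      {prop_ns:.0f} ns")
print(f"total (excluding endpoint controllers): {switch_ns + prop_ns:.0f} ns")
```

With those assumptions the propagation term alone is about 80 ns and the switch traversal contributes roughly 1 ns, so the wire, not the fabric, sets the floor; a slower medium (closer to 0.67c) pushes the total nearer to 100 ns.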

Honestly I cannot explain things within a constrained view of data exchange when preponderant engineering views of data exchange include a lot of unnecessary steps and preclude a simpler truth.

Next time you talk to someone - ask yourself why it works.
- You talk - they listen ....
- if they didn't care to listen then they didn't hear you.
- If they did listen - they did hear you.

Not sure why a "data exchange" needs to be more complicated than that.
You sound so sure of yourself, yet your understanding of networking science and the field seems lean.

Latencies in switches are usually measured in the time it takes to move a byte from one port to another in the switch. In the case of hardware circuit switching, there is an additional latency for circuit set-up time, and I find it difficult to believe anything in current hardware takes only 1ns.
 

Describing my understanding of networking science as lean is fair - and I would add "rushed" and "haphazard". Without the 20-40 years of network engineering experience, I am trying to explain how this platform works by relying on a 1990s data communications course from my MBA, a peripheral interest in networking (no pun intended), and a part-time dive into communications theory and practice over the last few years. I am trying to explain the operation of an exchange platform without a good analogue, and doing so by grasping at concepts in the dark. The fact is, the platform works - explaining how it works with respect to established understandings, and against preconceived constraints, is heady work. Perhaps the explanation might be better approached via a neuromorphic or synaptic approach.... again, loaded terms without much depth yet on my side. Another research dive is scheduled.

Putting aside my miscommunications, misunderstandings, misquotes and malignments of network theory, we at least agree on the switch latency calculations and methods. What I can show is a 4-cycle data exchange circuit. Implemented with a 4 GHz clock - a safe speed within modern low-density designs - that translates to 1 nanosecond using current hardware.

Definitely going to stop digging into analogues that have proven to be confusing at best. Time to review, regroup and restate.
 
I noted that the switch latency for the 100,000 systems is 1ns for each connection. Each connection happens in parallel and independent to the other as all are non-blocking and there is no congestion.
This is the crux. If you have created this magical crossbar at such a scale, then clearly you have a winner. The rest of your text is superfluous clutter, since such powerful magic can be reused in many ways.

This grail has been sought for a century, and there are many physical realities working to trip up approaches, so this is where your credibility is stretched. If you have the IP protection to allow you to explain this then that would be interesting. If not, then I expect your audience will be left in doubt.
 

@Tanj - sure - we do have a granted patent on the method. It is obscured by the fact that we implemented it first with USB - hardly a conspicuous data center implementation. I will have an updated kitchen-sink filing within the month that has more direct applications for data centers and semiconductor design.

I have published several IEEE papers demonstrating / explaining the method in very simplistic terms. Just this past week I completed a draft of a Vivado Verilog simulation demonstrating how we achieve transfers in 4 clock cycles, translating to 1 ns and lower. I am also now working on a paper describing a working implementation on an FPGA, with dramatic power results. Describing how we would scale to 102,400 nodes is presently based on rough math and approximate transistor extrapolations, so I will be looking to firm up my own math/estimates with someone with a lot more experience in the field.
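For context on why the scaling step draws skepticism, here is the naive baseline any such design has to beat; this is the generic crosspoint count for a flat N x N crossbar, not a statement about our method:

```python
# Naive crosspoint count for a flat N x N crossbar. This is only the textbook
# baseline that makes single-stage switching at large port counts hard - it is
# why multi-stage (Clos / Benes) fabrics are normally used at scale.
def crosspoints(ports: int) -> int:
    return ports * ports

for n in (256, 4_096, 102_400):
    print(f"{n:>7} ports -> {crosspoints(n):,} crosspoints")
# 102,400 ports -> 10,485,760,000 crosspoints, before any buffering,
# arbitration or wiring is accounted for.
```

That quadratic growth is exactly what the firmed-up math has to show we avoid (or can afford), which is why I want more experienced eyes on the estimates.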

The feedback/pushback here is particularly helpful for me to shape the explanation and establish compelling proof points.
 
How do you prove power savings on an FPGA? Aren't you better off synthesizing and proving power with simulations?
 
The FPGA power estimation for the switch logic is provided by the FPGA manufacturer - this would be a third-party estimate and a more reliable / independent assessment. I will have my engineer take a simulation reading as well.
 
You're going to turn this into an ASIC after you prove out the logic, correct?
Sure - as soon as I lock in a customer / investor.

Immediately, we are implementing a 4-port USB consumer data exchange switch using the FPGA - 40 ns cross-switch latency at 100 MHz. It is developed on the Efinix platform, which has lots of edge I/Os for IoT. Considering the USB controller latency of about 150 ns x 2, the 5 Gbps 4-port switch is still actually faster than the Tomahawk switch.
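The arithmetic behind that comparison (the merchant-silicon figure in the comment is a commonly cited ballpark for cut-through port-to-port latency, not something we measured):

```python
# End-to-end latency of the 4-port FPGA prototype as described above.
SWITCH_CYCLES = 4
FPGA_CLOCK_HZ = 100e6
USB_CONTROLLER_NS = 150        # per side, as quoted

switch_ns = SWITCH_CYCLES / FPGA_CLOCK_HZ * 1e9    # 40 ns across the fabric
total_ns = switch_ns + 2 * USB_CONTROLLER_NS       # add ingress + egress controllers
print(f"fabric: {switch_ns:.0f} ns, end to end: {total_ns:.0f} ns")   # 40 ns, 340 ns
# Large merchant-silicon Ethernet switch ASICs are typically quoted at several
# hundred ns port to port, which is the basis of the comparison.
```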

We are working with a customer who is looking for a PCIe switch, and this would also be an ASIC. A rather interesting any-to-any configuration.
 
I guess it didn't dawn on me that anybody would estimate the power that their ASIC would use with an FPGA, whose purpose is to emulate the functionality. It doesn't make sense to me.

Edit: Do they have an efficient built-in RISC-V CPU that you are using, or provide a CPU on their board? Would investors be interested in that proof of concept, or are you better off prototyping this on an EDA system using open source RISC-V cores?
 
I have published several IEEE papers demonstrating / explaining the method in very simplistic terms. Just this past week I completed a draft of a Vivado Verilog simulation demonstrating how we achieve transfers in 4 clock cycles - translating to 1ns and lower.
Which IEEE journals were your articles published in? I'm an IEEE member, and I'd like to read them.

Four clock cycles translated to 1ns... you're designing to a 4GHz clock speed?
 
Mr Blue, let me get in on this. I like the idea of simulating a CPU (or 2) for 4 clock cycles. To tell you the truth, I was going to do something like this anyway, like add 2 numbers. Let's do Chris's stuff for free. Maybe he can have an easier time with an investor.

Note, I cannot distribute any foundry stuff, so we will use my standard cells. They are a little larger than what the foundry provides. We went with 3 fins.

CPU: The open source WARP-V (preferred 32 bit). I just called Steve Hoover. He is in.
Process: TSMC16 and GF14 (I want to do both)
FinFETS: Standard threshold
L1 cache: Gimme a number of SRAM cells. If not, I will pick.
L2: Let's just stick to L1 + HBM3 to keep it simpler. We can make the interposer on Skywater.
Synthesis: I will use Yosys, unless somebody wants to run Synopsys. I can supply liberty files on our stdcells
P&R: We have it
Analog blocks: We got it, including an NRZ SerDes.
Clock: 6.4GHz or 3.2GHz

Mr. Blue, you are going to love (or hate) this. I called @simguru (blast from the past) and asked him to show off his cosimulation abilities. I told him that he is not allowed to have an alias like that without being called out on it. Steve Hoover and I will handle the cosimulation in parallel.

No funds change hands and no advertising. This is a SemiWiki freebie.

We won't MPW it, but we will get everything else ready on it, both simulation and layout. No eFPGA (yet)
 
I think it's great that you're making this offer to Chris. Very cool, and I hope it works, and my skepticism is proven an over-reaction.
 
Which IEEE journals were your articles published in? I'm an IEEE member, and I'd like to read them.

Four clock cycles translated to 1ns... you're designing to a 4GHz clock speed?

Constructing secure peer data connectivity for mobile systems

Re-envisioning digital architectures connecting CE hardware for security, reliability and low energy

I am calculating to a 4 GHz speed. The design is not constrained to a specific clock - the connections will dictate it. I use 4 GHz to reflect a "fair" implementation speed.
It is easily implemented at 6.4 GHz - although it is best matched to a PHY clock, and 6.4 can be a bit high.
 

Totally Awesome Cliff

.... but it might be where I fear I lose everyone. ;-)

The switching fabric/arrangement itself does not require a CPU or SerDes, and we have no SRAM requirement.
The open digital CPO design from Ranovus is an ideal implementation, as it would integrate directly from the OE without any SerDes.

It would be amazing to implement our switch between four CPUs to demonstrate a faster-switching NoC.
.... or implement it as a PCIe switch if you have access to standard IP controller / PHY blocks (SerDes needed here to connect to the analog interface).
 
I think it's great that you're making this offer to Chris. Very cool, and I hope it works, and my skepticism is proven an over-reaction.

I can share the architecture specifications under NDA - it's a hard thing to dump into the world without a modicum of security.

The ASIC designer who wrote them is a bit of an expert in peripheral switching, so they have been tested.

My challenge has been implementing it in order to get the "AHA" moment.
 
Constructing secure peer data connectivity for mobile systems

Re-envisioning digital architectures connecting CE hardware for security, reliability and low energy

I am calculating to a 4 GHz speed. The design is not constrained to a specific clock - the connections will dictate it. I use 4 GHz to reflect a "fair" implementation speed.
It is easily implemented at 6.4 GHz - although it is best matched to a PHY clock, and 6.4 can be a bit high.
I just read the second paper a couple of times. Just so we're on the same page, that paper was not a peer-reviewed article for an IEEE journal; it was a paper submitted for a presentation at the 2018 IEEE Conference on Consumer Electronics. Correct?

This is going to take me a while to digest, partially because using USB 3.0 (which is now obsolete in a technological sense, replaced by USB 4.0, aka Thunderbolt) with a Cypress ASIC (which naturally has a bus at its core, because it is a USB device) as a testbed is confusing. That makes the implementation a lot easier, because electrically there's only one transmission going on at a time. I'm not immediately seeing how the concept relates to thousands of datacenter servers inter-communicating.

Push-Pull architecture the way it's defined in the paper reminds me a lot of RDMA Reads and Writes, though I'm not aware of RDMA being implemented on any circuit-switching interconnects. In fact, I'm only aware of one large-scale circuit-switching interconnect that ever made it to production, and that product is out of production.

Perhaps because I'm such a networking technology dinosaur, I'm having trouble deciphering the paper and what's so special about the concept.

I'll keep thinking about it. If you want to give me hints to keep me on track, go for it.
 
You already have patents and now you want me to sign an NDA to provide a free service? No Darn Advantage for me to risk signing a legal document. Offer withdrawn.
 
The papers were presented at the conference, yes. We demonstrated the switching on site, although it was only a bridge at the time.

I agree that the nature of the implementation is confusing, particularly if you view it from traditional networking. (There is no network - only the "effect" of one.)
I have the same issues with the patents. I can't tell you how many times I explained that this is not a traditional data exchange. I still have to tell the patent office that it is not another USB bridge, lol. There is no control over data on the bus. There is no host-to-host data control (and yet cooperation between source and destination creates the impact).

The closest comparison is the PLX non-transparent switch / bridge (a traditional adaptation). Data is directed throughout the transfer from source to destination. To do that, it has to be stopped at the bridge, at which time addressing is added via a protocol extension and manipulated so that a host-to-device protocol can be connected to another host.
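For readers unfamiliar with that comparison, here is a toy sketch of the address translation a non-transparent bridge performs; the window addresses are invented for illustration, and this shows the traditional approach I'm contrasting against, not our method:

```python
# Toy model of non-transparent bridge (NTB) style address translation: traffic
# that lands in a window of one host's address space is re-addressed into the
# other host's memory. Bases and sizes below are made up for illustration.
def translate(addr: int, local_base: int, size: int, peer_base: int) -> int:
    """Map an address inside the local NT window onto the peer's memory."""
    if not (local_base <= addr < local_base + size):
        raise ValueError("address outside the NT window")
    return peer_base + (addr - local_base)

# Host A writes to 0x9000_0100 in its own map; the bridge redirects the write
# into host B's buffer at 0x4000_0100.
print(hex(translate(0x9000_0100, local_base=0x9000_0000, size=0x1000,
                    peer_base=0x4000_0000)))
```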

The presented diagram doesn't do it justice. Once data arrives at the end of a first USB 3.0 segment, the data is decompiled and dumped raw onto a bus. Nothing else happens from that direction. On the opposing side, the USB 3.0 connection reads the raw data from the bus as if it were a mass storage device. It doesn't matter if the protocols are the same or different on each side of the bus, only that the data rates are balanced. I learned that adding buffers at either side of the bridging bus ensures stability when rates fluctuate. The result is nearly no time / energy spent at the transfer point.
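A minimal sketch of the rate-decoupling role those buffers play, assuming a bursty writer and a steady reader with matched average rates; this is a toy model, not our actual bridge logic:

```python
# Toy model: a FIFO between the two sides of the bridging bus absorbs short-term
# rate fluctuations, so nothing stalls or drops as long as average rates match.
from collections import deque

fifo, DEPTH = deque(), 16      # hypothetical buffer depth
dropped = peak = 0

for tick in range(10_000):
    # Ingress side: bursty producer, 4 words every 4th tick (1 word/tick average).
    if tick % 4 == 0:
        for word in range(4):
            if len(fifo) < DEPTH:
                fifo.append((tick, word))
            else:
                dropped += 1
    peak = max(peak, len(fifo))
    # Egress side: steady consumer, 1 word per tick.
    if fifo:
        fifo.popleft()

print(f"peak buffer occupancy: {peak}/{DEPTH}, words dropped: {dropped}")  # 4/16, 0
```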

The nature of the switch is that it doesn't care whether systems send/receive the data as USB 3, USB 4, Thunderbolt, RS-232 or whatever; it is fully decoded at the controller and only the raw data is exchanged. There is no protocol over the switch.

Extending from a 2-port switch to a multi-port demonstration has required us to wait for an affordable platform with sufficient I/O (because I don't have the resources of Elon Musk). Basically, a different endpoint reads. We have a way of determining which endpoint will read the data, so the switching is simply pre-coordinating reads. ;-)

We are finishing up a multiport implementation ... over the next few weeks I may also implement a 6-port demonstration / simulation.
 
You already have patents and now you want me to sign an NDA to provide a free service? No Darn Advantage for me to risk signing a legal document. Offer withdrawn.
Sorry - have I had too many experiences with Synopsys and other IP companies?
Each time I discuss the detailed particulars of a semiconductor company's implementation, I agree not to share that information - the information is limited to my review and use, and not for public dissemination. Isn't that standard?

We simply don't want detailed design documents floating around the internet. Not saying you will share them - but at some point there needs to be some agreed control point for what is publicly disclosed.

What am I missing?
 