
Any thoughts on the AMD/Microsoft alliance in AI/ML?

Arthur Hanson

Well-known member
Will the AMD/Microsoft alliance become a serious player and power in AI/ML? Microsoft has the money/software skills and AMD may have the processor skills. Any thoughts on the impact of this alliance are appreciated.
 
Correct me if I am wrong, but MSFT is unique among the hyperscalers in not having any AI hardware/CPUs of their own. Given that AI looks to be a significant workload for Azure, it seems like something MSFT should want to consider… plus it is beneficial for MSFT for there to be competition with NVDA and to keep the Wintel/Azure platform very competitive. kk
 
Correct me if I am wrong, but MSFT is unique among the hyperscalers in not having any AI hardware/CPUs of their own.
Microsoft has a long history of using FPGAs for inferencing, and Altera added many features to their product line to support Azure/Bing use. Read up on the Catapult project, which started around 2010 with Bing, using FPGA inferencing to rank search results. The advantage of FPGAs has been that you can roll out new models with custom data flows as fast as they can be trained, while an ASIC has a minimum two-year on-ramp for fundamental changes, and the AI-customized FPGAs could keep up on throughput.

On the training side the SKUs you can lease have been mostly based on Nvidia GPUs. Azure was a lead developer of the OCP OAM module allowing 8 GPUs to be hosted on a server, and there are OAM modules not limited to Nvidia.

There are rumors of AI processor development in Azure but nothing clear about where this will play.
 
I don’t think AMD has any meaningful AI products/ecosystem compared to the other two big boys. I don’t understand why MSFT is looking for AMD’s help at all.
 
2 years for fundamental changes? Have you been talkin to Mr. Blue? Stop with the corporate travel meetings. Use the phone!
I thought my ears were ringing. When was the last time, Cliff, you led a design team which produced a new generation of a complex ASIC with new custom functionality, starting from the high-level design down to a customer PoC-ready A0?
 
I just do simple stuff. Would that complex code change also apply to the verilog that would be loaded into an FPGA?
 
I just do simple stuff. Would that complex code change also apply to the verilog that would be loaded into an FPGA?
That's often how the design changes are prototyped before spending money on backend chip development. You analog guys are expensive, not to mention mask sets and stuff like that.
 
FPGA for prototyping... yes, but also a top level (perhaps several) testbench SCHEMATIC that has blocks with several views (schematic, verilog, veriloga, and a simplified schematic view). Pictures are worth 1000 lines of specs. MPWs are a must!
Now this brings up the problem. 12 weeks is too long for an MPW run. This is Fred's fault. He never works a full day, just 18 hours maximum. Fred, work harder.

I heard that Amazon is primed to take over the foundry business. 24 hour gds2mask2wafer2test2slice2package for prime members only.
 
Pictures are worth 1000 lines of specs.
You need both.
MPWs are a must!
Yeah...
Now this brings up the problem. 12 weeks is too long for an MPW run.
I'm out of date, so I don't know what a typical shuttle turn-around time is anymore.
This is Fred's fault. He never works a full day, just 18 hours maximum. Fred, work harder.
Who is Fred?
I heard that Amazon is primed to take over the foundry business. 24 hour gds2mask2wafer2test2slice2package for prime members only.
Smoking pot in the morning is bad for your health.
 
Mr. Blue and Mr. Tanj, you guys are really pissing me off whenever you use that 4 letter F word. They are killing my ASIC business.

Sadly, we cannot beat them. We are adding an "e" to the front of that terrible word. We've got another 7-man team putting together beautiful 5-input LUT fabrics made with multiple colors and charming domino patterns.

I wish I could use pot as an excuse. It is my embedded personality.
 
FPGA for prototyping... yes,
But the algorithm guys crank out a new prototype every week and want to ship the best one every month. Hence they like FPGAs.

Of course, it is expensive and verification/validation is non-trivial, but the FPGAs were very competitive; you would need a leading-edge, moderately large ASIC to beat them the last time I checked in on that stuff, which is a few years ago now. The rumors may be right that Azure is working on their own chips, so things could change. Maybe they found an ASIC design that they are comfortable is as flexible as FPGAs.

You should look at some of the FPGAs optimized for inferencing. There are a ton of hard blocks in them delivering tens of teraflops of calculations, with support for high-performance memory, and they can be connected together to make a multichip pipeline. They are in effect already ASICs, just with a very flexible fabric.
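A rough back-of-envelope sketch of the sizing that drives that kind of choice; every number below (compute per query, query rate, hard-block teraflops, sustained efficiency) is a hypothetical placeholder for illustration, not a figure for any specific FPGA or model.

import math

# Back-of-envelope: can a pipeline of inference-optimized FPGAs keep up with a
# target query rate? All numbers are assumed placeholders.
model_gflop_per_query = 20.0       # compute per inference, in GFLOPs (assumed)
target_queries_per_sec = 10_000    # service-level throughput target (assumed)

fpga_peak_tflops = 30.0            # "tens of teraflops" of hard-block compute (assumed)
sustained_efficiency = 0.4         # fraction of peak sustained on a real model (assumed)

required_tflops = model_gflop_per_query * target_queries_per_sec / 1_000.0
sustained_per_fpga = fpga_peak_tflops * sustained_efficiency

# Devices needed if the model is pipelined across chips with negligible link overhead.
fpgas_needed = math.ceil(required_tflops / sustained_per_fpga)

print(f"required: {required_tflops:.0f} TFLOPS sustained")
print(f"per FPGA: {sustained_per_fpga:.0f} TFLOPS sustained")
print(f"pipeline size: {fpgas_needed} FPGAs")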
 
FPGAs for high performance and flexible changes, sure, but...

2 years (less manufacturing time) to "adjust" most existing designs is ridiculous. I guess you are also used to working with bureaucracy, inefficiency, and incompetency, or perhaps you lost your original staff and didn't shrink wrap the designs (DUTs with TBs and lots of notes on the schematics), which is already covered by the incompetency statement.
 
FPGAs for high performance and flexible changes, sure, but...

2 years (less manufacturing time) to "adjust" most existing designs is ridiculous. I guess you are also used to working with bureaucracy, inefficiency, and incompetency, or perhaps you lost your original staff and didn't shrink wrap the designs (DUTs with TBs and lots of notes on the schematics), which is already covered by the incompetency statement.
Two years is not the time to “adjust” an ASIC. No one spends ASIC development dollars on “adjustments” anymore, at least for advanced processes. This is for a significant new generation of an existing complex ASIC design, meaning new features. For a new design of a new architecture (say, a CXL 3.0 switch ASIC with caching and embedded cores for running fabric management, among other things), two years would be an outstanding performance. I suspect most (all?) of the first-generation designs appearing for CXL 3.0 will be realized in FPGAs.
 
Agree with that. So it isn't an apples to apples comparison. You overhaul the designs when you make the new chip. What you coulda, shoulda done, more calibration, and more testing.
 
What you coulda, shoulda done, more calibration, and more testing.
Not in any ASIC project I'm aware of. ASIC development is very expensive, as proven by your experience that TSMC doesn't think a small ASIC design company is serious unless they have $50M+ in the bank. To justify an ASIC all-layer spin you need a significant business case, like new major features that are necessary for competitive parity or leadership, much higher performance/throughput, support of new interconnects or radios (e.g. PCI Gen 5 or 5G), stuff like that. This is one reason why FPGAs are getting more attention; ASIC development is becoming so expensive it rules out marginal business cases. Also, as TanJ mentioned, the FPGA industry is getting smarter and adding additional capabilities to their FPGAs.
 
The $50M was for 5nm (EUV). For ASICs, you would use 16/14. I read in articles NREs of $20M and up, but we can do high performance for $3M to get to MPW. Of course, I am talking in generalities, and I already have the tools and lots of IP for the customers.

Edit: We are talking ASICs here. I took exception to the 2-year time frame, and for chips created by us practical types: 1 ps jitter, 10 bits (0.1% error), 28 Gsps, RISC-V, that kind of stuff.
 
Additional note: I assume ASICs are the way to go for custom chips on the edge, where DUV processes are used. The 2-year timeframe is unacceptable IMO for these IoT/edge devices. The goal is a TTM of less than 1 year, which includes 2 MPWs on 2 different process nodes, with NREs of < $3M. This is what we are preparing for anyway, and it is far less than 2 years for a tweak. I gotta believe the spending will need to slow down and automation will speed up. Also note that I consider a 50-man company massive, so consider the source.
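A minimal schedule sketch of what a sub-1-year TTM with two MPW spins implies, using the 12-week shuttle turnaround quoted earlier in the thread; the split of the remaining weeks between design, verification, and bring-up is an assumed placeholder, not anyone's published flow.

# Rough schedule math behind a <1-year TTM that absorbs two MPW runs.
# The 12-week shuttle figure is from earlier in this thread; the rest is assumed.
weeks_per_year = 52
mpw_runs = 2
weeks_per_mpw = 12                               # shuttle turnaround (called too long above)

fab_weeks = mpw_runs * weeks_per_mpw             # 24 weeks just waiting on silicon
engineering_budget = weeks_per_year - fab_weeks  # 28 weeks left for everything else

# Hypothetical split of that remaining budget across the two spins.
design_and_verification_weeks = 18
bringup_and_test_per_spin = 5                    # x2 spins = 10 weeks

planned_total = fab_weeks + design_and_verification_weeks + mpw_runs * bringup_and_test_per_spin
print(f"fab time: {fab_weeks} wk, engineering budget: {engineering_budget} wk")
print(f"planned total: {planned_total} wk against a {weeks_per_year}-wk target")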
 
The $50M was for 5nm (EUV). For ASICs, you would use 16/14. I read in articles NREs of $20M and up, but we can do high performance for $3M to get to MPW. Of course, I am talking in generalities, and I already have the tools and lots of IP for the customers.

Edit: We are talking ASICs here. I took exception to the 2-year time frame, and for chips created by us practical types: 1 ps jitter, 10 bits (0.1% error), 28 Gsps, RISC-V, that kind of stuff.
Remember, this thread began as speculation on Azure and ML chips. If they are doing that, their benchmarks are alternatives like Google TPU, Nvidia H100, and Graphcore Colossus. You are not going to beat those with a 16/12 ASIC, not even a large one. The stage for tinkering like that was a few years ago. Azure needs to be leading edge in this area.
 