Lisa Su announces AI Ryzen Chips, Game Changer?

FWIW I think Sapphire Rapids came out way behind the curve with 8 memory channels; you can see it choke vs. Genoa with 12, both DDR5. Intel did announce support a few weeks ago for high-speed buffered DDR5 DIMMs, which may tilt that, but for now the only configurations available cannot match Genoa throughput. And in benchmarks, memory bandwidth generally sets the ceiling at high core counts.
On this point we agree.
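A rough back-of-envelope on the channel-count gap, assuming both sockets run DDR5-4800 (launch speeds; these are theoretical peaks, and sustained numbers will be lower):

Code:
# Theoretical peak DRAM bandwidth = channels * transfer rate (MT/s) * 8 bytes per transfer
def peak_bw_gbs(channels, mts):
    return channels * mts * 8 / 1000  # GB/s

sapphire_rapids = peak_bw_gbs(8, 4800)   # 8 channels of DDR5-4800
genoa = peak_bw_gbs(12, 4800)            # 12 channels of DDR5-4800

print(f"Sapphire Rapids: {sapphire_rapids:.0f} GB/s peak")   # ~307 GB/s
print(f"Genoa:           {genoa:.0f} GB/s peak")              # ~461 GB/s
print(f"Genoa advantage: {genoa / sapphire_rapids:.1f}x")     # 1.5x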
 
For us followers, this is a good discussion.

IMO, Blueone's debate with Simguru was informative and hilarious, and one of the best (and most memorable) threads that I've read. It reminded me of an Abbott and Costello routine. Mr. Blue, you did very well in that debate. I am proud of you.
 
I've been taking a closer look at Intel's Sapphire Rapids integrated accelerators, namely AMX, DSA, and IAA. AMX is interesting, but the really curious ones are DSA and IAA. DSA and IAA are both based on the same proprietary on-die PCIe device, meaning their use is not defined by the x86 instruction set or architecture. Put another way, AMX, being operated by x86 instructions, will be part of AMD's license, so AMD could implement AMX. I haven't found any specific architecture information on the AMD AI accelerator, but links posted earlier imply that this is a proprietary Xilinx-based FPGA functional unit, which is probably not implemented in CPU instructions, and instead has hardware and software interfaces that are probably I/O-based, like DSA and IAA. I'm looking forward to seeing an in-depth discussion of the Ryzen AI architecture and implementation. Being a separate programmable functional unit could give Ryzen AI a richer set of acceleration capabilities, much like Intel's DSA and IAA, but if so, Ryzen AI will also be a proprietary capability that requires specialized programming and will only be available on AMD CPUs. DSA and IAA are like that for Intel CPUs.
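To make the instruction-set vs. I/O-device distinction concrete, here's a minimal sketch of how software on Linux would discover each kind of accelerator. The flag name and the sysfs path follow the usual kernel (idxd driver) conventions as I understand them, so treat them as assumptions rather than gospel:

Code:
# ISA-level features (like AMX) show up as CPU flags; device-level accelerators
# (like DSA/IAA) show up as enumerable devices. Assumes Linux with the idxd driver.
import os

def cpu_flags():
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

flags = cpu_flags()
print("AMX tiles (ISA feature):", "amx_tile" in flags)

dsa_path = "/sys/bus/dsa/devices"   # populated by the idxd driver when DSA/IAA are present
devices = os.listdir(dsa_path) if os.path.isdir(dsa_path) else []
print("DSA/IAA devices (I/O devices):", devices or "none found")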

Since AMD and Intel appear to be taking very different and incompatible paths to AI acceleration, this could break one of the most important value propositions for x86 CPUs: you can choose between Intel and AMD CPUs depending on which company's products happen to be ahead in a given product generation, without substantially redesigning your software. Microsoft figures prominently in AMD's Ryzen AI announcement, which I find very curious indeed. Will there be two versions of Windows in the future? One for clients and servers with Intel CPUs and one for AMD CPUs? For Microsoft's development teams it may not be such a big deal, perhaps just conditional compilation, but for marketing, distribution, and support it looks a lot more complicated.

I'm currently reading the specs for Intel's DSA and IAA accelerators to see what I can ascertain about competitiveness, and for my own idle curiosity. One odd thing that popped up already: DSA has sophisticated hardware support for Optane DIMMs. Obviously, since Optane was cancelled, that stuff has no future. Once I internalize and digest what I'm reading, I'll post a few thoughts on whether I'm convinced they'll be competitive enough to force software developers to produce software versions specifically for Sapphire Rapids and later Intel CPUs. My first gut reaction isn't positive, but I'm still mulling it. On the one hand, proprietary features like this would create application "stickiness" to a brand's CPUs; on the other hand, customers, even the big cloud providers, seem to have a negative view of these "architectural lock-in" tactics, unless the proprietary hardware is owned by the customers themselves. Since I think all merchant chip vendors are in trouble with the big cloud providers and Apple, given their increasing in-house chip development capabilities and TSMC's superior chip manufacturing (witness Apple's disengagement from Broadcom on 5G modems this week), I'm not sure what the outcomes will be, but AMD and Intel are both raising my eyebrows with these announcements.
 
Aren’t servers already sticky and optimized for Intel or AMD? With Bulldozer you could get great MT performance if you played to architectural strengths, and with Zen, EPYC market share didn’t really start to surge until AMD’s performance advantage could be counted as multiples of Skylake at cheaper prices. This leads me to assume that for many server and HPC applications there is already stickiness. To say nothing of the success of dedicated IP in Intel iGPUs and the more mixed attempt at patching up FX and Opteron FP performance with Fusion.
 
Aren’t servers already sticky and optimized for Intel or AMD? With Bulldozer you could get great MT performance if you played to architectural strengths, and with Zen, EPYC market share didn’t really start to surge until AMD’s performance advantage could be counted as multiples of Skylake at cheaper prices. This leads me to assume that for many server and HPC applications there is already stickiness. To say nothing of the success of dedicated IP in Intel iGPUs and the more mixed attempt at patching up FX and Opteron FP performance with Fusion.
There will be performance differences, but so long as you stick to the x86 instruction set it's just a matter of tuning. Changing the number of software threads, for example, is typically easy in most products. The different "performance" and "efficiency" core strategy is more of an annoyance, but the code still runs. With these proprietary accelerators the code won't run without them.
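As a trivial illustration of that kind of tuning, sizing a worker pool to whatever core count the part happens to have, here's a minimal Python sketch (the workload is a stand-in, of course):

Code:
# Portable tuning: the pool adapts to the core count, no ISA-specific code involved.
import os
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    return sum(chunk)   # stand-in for real per-thread work

data = list(range(1_000_000))
n_workers = os.cpu_count() or 4                  # 16-core desktop or 96-core server alike
step = len(data) // n_workers + 1
chunks = [data[i:i + step] for i in range(0, len(data), step)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(work, chunks))
print(n_workers, total)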
 
Since AMD and Intel appear to be taking very different and incompatible paths to AI acceleration, this could break one of the most important value propositions for x86 CPUs: you can choose between Intel and AMD CPUs depending on which company's products happen to be ahead in a given product generation, without substantially redesigning your software. Microsoft figures prominently in AMD's Ryzen AI announcement, which I find very curious indeed. Will there be two versions of Windows in the future? One for clients and servers with Intel CPUs and one for AMD CPUs? For Microsoft's development teams it may not be such a big deal, perhaps just conditional compilation, but for marketing, distribution, and support it looks a lot more complicated.
AI users have transitioned to higher-level modelling frameworks like PyTorch, with CUDA relegated to use as a backend language. This has meant there is a lot more portability than there was a few years ago, and all the new engines come with the necessary backend transformations. Of course, Nvidia still has some of the best-optimized backends, so the other AI chips have a high bar to compete against, but you don't need to redesign your AI models. Windows has a long track record of portability: gigabytes of drivers and multiple alternative modules in the kernel binding to whatever CPU, GPU, or IO interfaces are there. The distribution and support needs have precedents.
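A minimal sketch of what that framework-level portability looks like; the model never mentions CUDA, and the backend is chosen at runtime (any vendor-specific device string beyond "cuda" and "cpu" is an assumption about whatever backend a given accelerator ships):

Code:
# The model definition is backend-agnostic; only the device string changes.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"   # or a vendor backend's device string

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
x = torch.randn(32, 128, device=device)
y = model(x)    # the same model code runs on whichever backend is present
print(y.shape, device)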
I'm currently reading the specs for Intel's DSA and IAA accelerators to see what I can ascertain about competitiveness, and for my own idle curiosity. One odd thing that popped up already: DSA has sophisticated hardware support for Optane DIMMs. Obviously, since Optane was cancelled, that stuff has no future. Once I internalize and digest what I'm reading, I'll post a few thoughts on whether I'm convinced they'll be competitive enough to force software developers to produce software versions specifically for Sapphire Rapids and later Intel CPUs. My first gut reaction isn't positive, but I'm still mulling it. On the one hand, proprietary features like this would create application "stickiness" to a brand's CPUs; on the other hand, customers, even the big cloud providers, seem to have a negative view of these "architectural lock-in" tactics, unless the proprietary hardware is owned by the customers themselves. Since I think all merchant chip vendors are in trouble with the big cloud providers and Apple, given their increasing in-house chip development capabilities and TSMC's superior chip manufacturing (witness Apple's disengagement from Broadcom on 5G modems this week), I'm not sure what the outcomes will be, but AMD and Intel are both raising my eyebrows with these announcements.
The stickiness comes if the customers see the difference. If the accelerators mostly work invisibly, boosting the infrastructure, the clouds will not see that as sticky. They will want to have the advantages. Oh, they might grumble privately about extra work, but it comes with the territory.

They get much more cautious about anything that can create customer lock-in, which leaves them hostage to one vendor's pricing. Not that it stops the CPU vendors from trying; those margins are essential to them, and they have no problem telling end users about new features the cloud cannot block, things like AMX. But the hypervisor (the cloud) can hide I/O devices. That probably creates some interesting architecture discussions in CPU design.
 
I've been looking over IAA, the Sapphire Rapids In-Memory Analytics Accelerator. (The acronym IAA bugs me, because IAA always referred to the Intel Achievement Award, which is the company-wide recognition award for great achievements. I can't read IAA and not think of that. All of the Intel-internal jokes years ago about running out of TLAs appear to be coming true.)

IAA looks very interesting for in-memory columnar relational databases. It could be used for other types of databases, like Cassandra for example (Cassandra is a so-called NoSQL database, but it's also columnar), but I suspect the real targets were the relational products. Namely, SAP HANA, Oracle Database In-Memory, Microsoft In-Memory SQL Server, and Snowflake, to name the big guys. Since Intel is not exactly rich in database technology experts, I think it's a safe guess that IAA was defined by asking the big database companies what they would like in CPU features. The hardware features like data compression, filtering, encryption, aggregation, etc., look very much like what DBMS development engineers would want. But my educated guess is that using these features will be quite invasive to the DBMS code. So I wonder, how many products will be ready to use this accelerator?
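To make the target workload concrete, here's a tiny sketch of the scan-filter-aggregate pattern a columnar engine spends much of its time on. This is purely illustrative of the kind of loop IAA is meant to offload, not IAA code; a real engine would also keep the columns compressed, which is exactly where hardware decompress-and-filter would help:

Code:
# Columnar layout: one list per column instead of one record per row.
sales_region = ["east", "west", "east", "north", "west", "east"]
sales_amount = [100,     250,    75,     300,     50,     125]

# Filter on one column, aggregate another: the classic scan-filter-aggregate step.
total_east = sum(amt for region, amt in zip(sales_region, sales_amount)
                 if region == "east")
print(total_east)   # 300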

Well, here's the Intel announcement for the Sapphire Rapids accelerators, and the quotes from the DBMS companies look pretty mushy to me.


I think IAA has significant potential as a differentiator, but I suspect it'll be a while before we see any commercial products using it. I suspect we will see Intel contributing to various Apache projects (open source) to have versions which use IAA, but that's not going to make IAA a real differentiator over AMD EPYC for what could be a couple of years or more. Databases also like a lot of DRAM throughput, especially columnar databases, so Sapphire Rapids' shortfall in memory channels could dull the attraction of this feature. On the other hand, Sapphire Rapids does multi-socket systems a lot better than anything AMD has, so that could be a balancing factor if server vendors invest in 4-8 socket scale-up servers. Relational database products typically scale well on scale-up 4+ socket servers.
 
So I wonder, how many products will be ready to use this accelerator?
You want to look at PostgreSQL and ClickHouse, probably the most active and flexible OSS databases at the moment, and see if IAA shows up in commits. I believe Redshift is based on PostgreSQL, too. Database accelerators have been around for a while. Oracle had them on its final generations of Sun servers, and IBM has database acceleration in Power. I would expect Intel hired a few folks with experience, as well as talked to customers.
I suspect we will see Intel contributing to various Apache projects (open source) to have versions which use IAA, but that's not going to make IAA a real differentiator over AMD EPYC for what could be a couple of years or more.
Yes, Intel tends to commit reference code for all their new gadgets.
Databases also like a lot of DRAM throughput, especially columnar databases, so Sapphire Rapids' shortfall in memory channels could dull the attraction of this feature. On the other hand, Sapphire Rapids does multi-socket systems a lot better than anything AMD has, so that could be a balancing factor if server vendors invest in 4-8 socket scale-up servers. Relational database products typically scale well on scale-up 4+ socket servers.
DRAM is expensive. A couple of GB of DRAM costs about the same to manufacture as a core, and you could say that CPUs are a tool to sell memory. The flip side of that is if you have a lot of DRAM you want a lot of cores to make sure the memory can earn its keep. Memory is only earning value when some core is processing it. An 8-socket Sapphire fully kitted out with https://news.skhynix.com/sk-hynix-develops-mcr-dimm/ will probably have a very lucrative balance of big memory and big value.
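Taking that ratio at face value (a couple of GB of DRAM costing about the same to make as one core) and plugging in a purely illustrative configuration, the "CPUs are a tool to sell memory" arithmetic looks roughly like this. The 60-core and 1 TB-per-socket figures are hypothetical inputs chosen for the example, not numbers from anyone's price list:

Code:
# Back-of-envelope using the ratio above: ~2 GB of DRAM ~ 1 core in manufacturing cost.
gb_per_core_equiv = 2          # cost parity assumption from the post
cores_per_socket = 60          # hypothetical big Sapphire Rapids SKU
dram_gb_per_socket = 1024      # hypothetical 1 TB per socket
sockets = 8

cpu_cost_units = sockets * cores_per_socket                        # in "core equivalents"
dram_cost_units = sockets * dram_gb_per_socket / gb_per_core_equiv

print(f"CPU silicon: {cpu_cost_units} core-equivalents")
print(f"DRAM:        {dram_cost_units:.0f} core-equivalents")
print(f"DRAM / CPU:  {dram_cost_units / cpu_cost_units:.1f}x")     # memory dominates the bill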
 
You want to look at PostgreSQL and ClickHouse, probably the most active and flexible OSS databases at the moment, and see if IAA shows up in commits. I believe Redshift is based on PostgreSQL, too. Database accelerators have been around for a while. Oracle had them on its final generations of Sun servers, and IBM has database acceleration in Power. I would expect Intel hired a few folks with experience, as well as talked to customers.
Redshift is based on the technology from a company Amazon acquired, called ParAccel.

Aurora, their more general-purpose RDBMS, has "two heads": one for Postgres, one for MySQL.
DRAM is expensive. A couple of GB of DRAM costs about the same to manufacture as a core, and you could say that CPUs are a tool to sell memory. The flip side of that is if you have a lot of DRAM you want a lot of cores to make sure the memory can earn its keep. Memory is only earning value when some core is processing it. An 8-socket Sapphire fully kitted out with https://news.skhynix.com/sk-hynix-develops-mcr-dimm/ will probably have a very lucrative balance of big memory and big value.
This is an astute observation. Intel has been worried about CPU versus DRAM "share of wallet" for a long time now, perhaps two decades, and CPUs have lost that battle. Optane DIMMs were an attempt to turn that around a bit, but Optane never had the performance required, or the ease of scalability of flash and DRAM.

As for memory only earning value when some core is processing it, that is correct too. And the cost problem, sequestered DRAM in servers, is the target value proposition for CXL 3.0, which aims at creating a shared pool of memory, both in specialized devices and in sequestered server memory, and allowing it to be shared datacenter-wide. CXL sharing is also claimed to work for other stranded resources in physical servers, like GPUs. The problem remains for the DRAM manufacturers, though: they mostly still have a non-differentiated product to sell, and selling more of it won't increase their gross margins much.
 
Optane DIMMs were an attempt to turn that around a bit
Actual pricing showed the goal was to take a slice of the wallet on both sides.
sequestered DRAM in servers, is the target value proposition for CXL 3.0, which aims at creating a shared pool of memory, both in specialized devices and in sequestered server memory, and allowing it to be shared datacenter-wide.
An 8-socket huge memory server often is the whole pool.

CXL 3 may scale to racks but not to the data center. The latency through networks plus the management of sharing puts SSD in the game at data center scale for pooling. And the pooling of DDR will not change the cost structure I alluded to; indeed it risks making expensive DRAM less valuable by reducing its access to cores. The real breakthrough, a practical DRAM competitor at fundamentally lower cost of production, is as yet not visible. I do believe CXL's most valuable aspect is that, unlike DDR channels, CXL is open to any technology. So the incentive to find something new is revitalized, as they now have a market which is not captive to the DRAM state machine in the DDR channel.
 
Redshift is based on the technology from a company Amazon acquired, called ParAccel.

Aurora, their more general-purpose RDBMS, has "two heads": one for Postgres, one for MySQL.
Redshift was based on PostgreSQL 8.0.2. That work may have begun at ParAccel, sure. The attraction of Postgres at Amazon is that it clones the Oracle dialect and semantics of SQL, and Amazon was an Oracle shop interested both in moving themselves to something cheaper, and in leveraging the AWS market for folks fluent with Oracle.
 
CXL 3 may scale to racks but not to the data center. The latency through networks plus the management of sharing puts SSD in the game at data center scale for pooling. And the pooling of DDR will not change the cost structure I alluded to; indeed it risks making expensive DRAM less valuable by reducing its access to cores. The real breakthrough, a practical DRAM competitor at fundamentally lower cost of production, is as yet not visible. I do believe CXL's most valuable aspect is that, unlike DDR channels, CXL is open to any technology. So the incentive to find something new is revitalized, as they now have a market which is not captive to the DRAM state machine in the DDR channel.
I share your skepticism about CXL memory sharing scalability, but I have a lot of friends working on CXL, and they tell me I'm mistaken. You do know that CXL 3.0 includes its own fabric definition, and that it uses very skinny protocol layers that are more akin to PCIe than Ethernet/IP and InfiniBand? The problem as I see it is that DDRx has latencies to memory of less than 100ns, and even single PCIe switch port-to-port latencies alone are in that range, so a multi-stage switched fabric of CXL switches seems like it'll need caches for latency hiding. CXL fabrics of greater radix than a top-of-rack switch look suspiciously high-latency or complex to me.
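A quick latency stack-up behind that concern. The per-hop numbers are rough assumptions in the range cited above (sub-100ns local DDRx, roughly 100ns per switch traversal), not measurements:

Code:
# Rough latency stack-up for a CXL.mem access through a switched fabric, in nanoseconds.
local_dram = 90              # typical local DDRx access, per the < 100ns figure above
link_and_controller = 40     # assumed CXL link + far-side memory controller overhead
switch_hop = 100             # assumed per-switch port-to-port latency

def fabric_latency_ns(hops):
    # request and response each traverse every switch hop
    return local_dram + link_and_controller + 2 * hops * switch_hop

for hops in (0, 1, 2, 3):
    t = fabric_latency_ns(hops)
    print(f"{hops} switch hop(s): ~{t} ns ({t / local_dram:.1f}x local DRAM)")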
 
Redshift was based on PostgreSQL 8.0.2. That work may have begun at ParAccel, sure. The attraction of Postgres at Amazon is that it clones the Oracle dialect and semantics of SQL, and Amazon was an Oracle shop interested both in moving themselves to something cheaper, and in leveraging the AWS market for folks fluent with Oracle.
Here we go again. You're incorrect, but if you want to argue about this send me a PM. We shouldn't be polluting a thread about AMD CPUs with a discussion about relational database technology.
 
I share your skepticism about CXL memory sharing scalability, but I have a lot of friends working on CXL, and they tell me I'm mistaken. You do know that CXL 3.0 includes its own fabric definition, and that it uses very skinny protocol layers that are more akin to PCIe than Ethernet/IP and InfiniBand? The problem as I see it is that DDRx has latencies to memory of less than 100ns, and even single PCIe switch port-to-port latencies alone are in that range, so a multi-stage switched fabric of CXL switches seems like it'll need caches for latency hiding. CXL fabrics of greater radix than a top-of-rack switch look suspiciously high-latency or complex to me.
The routing is based on GenZ work and should be low latency. The transition to CPO will help with latency and power, making rack a likely sweet spot.

But that rack is likely 2k to 4k cores. I don't see any sign of managing even that level of complexity yet. The rack level switch will need to be managed and smart, even for the PCIe flows.
 
Alright, you just can't help yourself, can you? Unfortunately, now I have to defend myself.

I have no doubt that the Redshift SQL parser and other SQL statement processing logic is heavily based on open-source Postgres, partially for application compatibility, partially because Postgres is the best open-source DBMS (IMO), and because Postgres is distributed under a very liberal license which allows Amazon to modify and distribute products based on all or portions of the Postgres code without paying any fees or making any commitment to contribute their modifications back to the Postgres project:


BTW, I made a mistake in a previous post. Amazon did not acquire ParAccel; they purchased a license to the code a long time ago. I keep forgetting that.

Redshift has four competitive advantages over Postgres I'm aware of:

1. Scale-out architecture. ParAccel was a scale-out product; Postgres still doesn't support scale-out. And ParAccel supported native scale-out processing, not the less efficient SQL-to-SQL rewriting across database instances that are not scale-out aware, like some scale-out databases do. Redshift has native scale-out.

2. Compiled queries. This was a key feature of ParAccel. Rather than interpreting the steps of a query plan, Redshift produces a custom program and compiles it to assembler code. The savings in CPU time can be considerable for complex queries. Postgres doesn't compile queries either. (A toy sketch of the idea follows below this list.)

3. Amazon uses a custom-written storage subsystem for its databases, which takes advantage of its cloud storage architecture and Amazon hardware. Even the Aurora version of Postgres does this.

4. Custom Graviton-based processing hardware. RA3 nodes. Note the reference to AQUA.
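As promised in item 2, a toy sketch of the interpretation-vs-compilation distinction. Real engines generate C++ or machine code per query rather than Python, so this only shows the shape of the idea:

Code:
# The same filter expressed as an interpreted plan vs. generated ("compiled") code.
rows = [(i, i % 7) for i in range(10)]

# Interpreted style: walk a generic plan structure for every row.
plan = [("filter", lambda r: r[1] == 3), ("project", lambda r: r[0])]
def run_interpreted(rows, plan):
    out = rows
    for op, fn in plan:
        out = [fn(r) for r in out] if op == "project" else [r for r in out if fn(r)]
    return out

# Compiled style: generate a specialized function once, then run it directly.
src = "def q(rows):\n    return [r[0] for r in rows if r[1] == 3]\n"
namespace = {}
exec(compile(src, "<generated>", "exec"), namespace)   # the per-query codegen step
run_compiled = namespace["q"]

print(run_interpreted(rows, plan))   # [3]
print(run_compiled(rows))            # [3]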


I'm sure the current version of the Redshift code has little resemblance, if any, to the original ParAccel code from ten years ago, but "based on Postgres" is a misstatement, even if it is on the Amazon site.
 
The routing is based on GenZ work and should be low latency. The transition to CPO will help with latency and power, making rack a likely sweet spot.

But that rack is likely 2k to 4k cores. I don't see any sign of managing even that level of complexity yet. The rack level switch will need to be managed and smart, even for the PCIe flows.
CXL, like PCIe, is not aware of cores, only of root complexes.
 
The core count indicates the complexity of VMs and clients that will need to be supported. That managed switch is doomed to need to provide security and isolation on a matching scale.
 
The core count indicates the complexity of VMs and clients that will need to be supported. That managed switch is doomed to need to provide security and isolation on a matching scale.
CXL.mem works at the level of physical addresses, so virtual addresses are not visible to the fabric switches or the fabric management software. CXL.io is essentially the PCIe protocol, so VMs are visible to the root complexes, but PCIe switch fabrics (like GigaIO) already handle this today. I don't see a scaling issue due to cores and VMs.
 
Each CPU port to the switch in such a system is in effect its own physical address space, including secure IDs in the high order bits and PIDs elsewhere in the packet. Other CPUs do not trust them and the managed switch has effectively a huge virtual mapping problem to sort out what is allowed to see what and how to transform the address on the way through the switch. You might be able to build a supercomputer with a single humongous client based on naive physical addresses, but that is not how clouds are built. They are built for paranoid, isolated clients. It is a big, complex problem already for 2-socket 128-core servers. It will be awesomely challenging at rack level.

And no, this is not off topic on a "game changer" thread. Both AMD and Intel are deeply aware of this and working on their own solutions. The game has barely started.
 