
Lisa Su announces AI Ryzen Chips, Game Changer?

Arthur Hanson

Well-known member

AI/ML is about to come on strong with a rapid growth cycle, and I feel this will be a game changer not only for the semiconductor industry but the beginning of a world-changing shift in everything. Combined with massive amounts of cheap memory, AI/ML will touch everything, and we are just seeing the tip of the spear. I see automation, education, and many other fields going through more change in a generation than has occurred in centuries. This will radically change many professions in the next decade, especially education, which in its current form will become obsolete under the Einstein rule of never memorizing anything you can look up. Combined with AI/ML, this will make much of our current education and training systems obsolete by leveraging the best and the brightest, who will be able to recreate themselves in programs running on everything from supercomputers to smartphones. Any thoughts or comments are sought and appreciated.
 
M1 Macs and many cellphone SoCs have had these for years, and Ice Lake and Tiger Lake Intel GPUs initially added acceleration for these workloads (with Meteor Lake seeming to double down on this). Undoubtedly AMD having this hardware in consumer platforms will further spread adoption in the Windows/Linux x86 ecosystem. However, given that Intel has had the hardware for a while now, it seems the biggest hurdle is getting software developers to use it (as always). The nature of Windows and Linux software development (in comparison to the more walled gardens of macOS, iOS, and Android) is also a big inhibitor. Hopefully AMD's marketing (god knows Intel can't) and Intel's install base can allow these platforms to utilize this hardware. As you mentioned, TPUs on consumer devices are one of those things that are just awesome to have, but you can't really say exactly what they can do, since the use cases are so varied and can impact everything by at least a little bit.
 
I'm not sure what the Ryzen AI processor is, but it looks like it's based on an FPGA, since Su mentioned Xilinx in her presentation. If that's the case, you are completely correct that software enabling, especially in Windows, is going to be critical. There's nothing on AMD's web site about how it works or how it is enabled by Windows. More likely, and I'm just guessing, it'll be enabled by an AMD-supplied kernel driver or library functions that have to be specifically used by applications. Since the only application mentioned regards graphics, it could be used by the AMD graphics driver. But I'm just guessing and I could be completely wrong. Since AMD's business depends on being instruction-level compatible with Intel, and this is a separate "processor" (more likely what the industry calls an "accelerator"), I would further guess that any application outside of AMD's internal stuff will have to be specifically enabled, which is not one of AMD's strong points (I'm being generous too).


What features or hardware are you referring to in Intel CPUs? GPUs? Only Apple seems to have a specific ML processor, the Neural Engine, in the M1/M2 family. It sounds like a different thing than what AMD is doing, and Apple claims it accelerates voice recognition, image processing, and video analysis. There isn't enough information I've found on the design details of either product to determine if one is more useful than the other.


I agree that Apple has an overwhelming advantage in the integration of special purpose processors or accelerators in its products, due to owning the OS on every platform.

As for the stupidly named Ryzen 7040 being more power efficient than the M1, that should be quite a trick. AMD's memory is on the motherboard in DIMMs, while Apple's is in the CPU package, which sure looks like a power savings, especially at DDR5 speeds.
 
What features or hardware are you referring to in Intel CPUs? GPUs? Only Apple seems to have a specific ML processor, the Neural Engine, in the M1/M2 family. It sounds like a different thing than what AMD is doing, and Apple claims it accelerates voice recognition, image processing, and video analysis. There isn't enough information I've found on the design details of either product to determine if one is more useful than the other.

Per https://www.anandtech.com/show/16084/intel-tiger-lake-review-deep-dive-core-11th-gen/3
AI Acceleration: AVX-512, Xe-LP, and GNA2.0
One of the big changes for Ice Lake last time around was the inclusion of an AVX-512 on every core, which enabled vector acceleration for a variety of code paths. Tiger Lake retains Intel’s AVX-512 instruction unit, with support for the VNNI instructions introduced with Ice Lake.

It is easy to argue that since AVX-512 has been around for a number of years, particularly in the server space, we haven’t yet seen it propagate into the consumer ecosphere in any large way – most efforts for AVX-512 have been primarily by software companies in close collaboration with Intel, taking advantage of Intel’s own vector gurus and ninja programmers. Out of the 19-20 or so software tools that Intel likes to promote as being AI accelerated, only a handful focus on the AVX-512 unit, and some of those tools are within the same software title (e.g. Adobe CC).

There has been a famous ruckus recently with the Linux creator Linus Torvalds suggesting that ‘AVX-512 should die a painful death’, citing that AVX-512, due to the compute density it provides, reduces the frequency of the core as well as removes die area and power budget from the rest of the processor that could be spent on better things. Intel stands by its decision to migrate AVX-512 across to its mobile processors, stating that its key customers are accustomed to seeing instructions supported across its processor portfolio from Server to Mobile. Intel implied that AVX-512 has been a win in its HPC business, but it will take time for the consumer platform to leverage the benefits. Some of the biggest uses so far for consumer AVX-512 acceleration have been for specific functions in Adobe Creative Cloud, or AI image upscaling with Topaz.

Intel has enabled new AI instruction functionality in Tiger Lake, such as DP4a, which is an Xe-LP addition. Tiger Lake also sports an updated Gaussian Neural Accelerator 2.0, which Intel states can offer 1 Giga-OP of inference within one milliwatt of power – up to 38 Giga-Ops at 38 mW. The GNA is mostly used for natural language processing, or wake words. In order to enable AI acceleration through the AVX-512 units, the Xe-LP graphics, and the GNA, Tiger Lake supports Intel’s latest DL Boost package and the upcoming OneAPI toolkit.
and for Meteor Lake:
[image]


I am not a software dev or architect, but to me it seems like Intel got to AI/ML for client a few years too late for Apple's taste (similar time to market as Apple silicon), and AMD is playing catch-up with both of them. Although this is not super surprising given AMD's MUCH smaller, more focused engineering staff.
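For a concrete sense of what the VNNI/DL Boost instructions quoted above actually do on the CPU side, here is a minimal sketch of the int8 dot-product kernel that vpdpbusd accelerates. The function name and data layout are my own illustration, not from any vendor library, and it assumes an AVX512F+AVX512VNNI part compiled with something like gcc -O2 -mavx512f -mavx512vnni.

Code:
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of n bytes: a[] unsigned 8-bit, b[] signed 8-bit, n a multiple of 64. */
int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n)
{
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        /* vpdpbusd: multiply u8*s8 pairs, sum groups of 4 into the 32-bit lanes */
        acc = _mm512_dpbusd_epi32(acc, va, vb);
    }
    return _mm512_reduce_add_epi32(acc);  /* horizontal sum of the 16 lanes */
}

The same kernel written in plain scalar C is exactly what a compiler, framework, or library has to recognize and hand off to this unit before the hardware gets used at all, which is the software-enabling problem discussed above.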
 
I've read that announcement, but it says nothing really, and Intel hasn't elaborated that I've seen. So I have no idea what "Integrated AI Acceleration" is. AVX512 is in the x86 instruction set, so AMD has it. I have no doubt that everyone with a GPU, which is everyone :), is working on AI/ML optimizations or extensions or special code. So I'm not convinced Intel has anything special yet. Nor am I convinced AMD's strategy is better than Apple's. We'll see what the in-depth analysis shows, eventually. I'm not even convinced AMD's chiplet technology lead is maintainable. I think the days of broad-market merchant chips, like Intel and AMD sell, are going to be less exciting. AMD's strategy of being a better Intel only works if Intel keeps being less than effective in design and execution, and Microsoft stays uncompetitive in client CPU SiP/SoC development.
 
Never said anything special. More so that they have the hardware to accelerate those workloads (a big first step, as the software has to run on something). As for GPUs, it doesn't seem like RDNA has these capabilities (unlike NVIDIA's and Intel's GPU architectures), hence why they need a separate accelerator. As for Meteor Lake, it seems like it might be a discrete unit on the SoC die, since they are drumming up the fact that this is different than the built-in accelerators already in the Xe slices. The rumor mill also seems to point in this direction, but time will tell if it is doubling down or just a sizable gen-over-gen improvement to the Xe-LP architecture's AI/ML capabilities.

As a side note, hopefully Qualcomm can figure out how to implement their AI/ML silicon in Windows, because while I don't need an ultra-portable 2-in-1 anymore, back when I was in college I would have killed for a nice Qualcomm one if Windows could actually fully utilize its capabilities and I didn't have to deal with software compatibility issues. Super long battery life, silent operation, light weight, cellular connectivity (not super useful given the excellent WiFi network at ISU, but a cool thing to have), and AI/ML hardware? Yes please! In the meantime I can't complain about how efficient modern x86 laptops are getting either (with AMD's superior node and better sleep states/power-saving tricks, or Intel's hybrid architecture), as they are starting to give ARM makers (including Apple) a run for their money.
 
I was wrong about lack of described AI (really ML) assists in Sapphire Rapids. The assists are called Advanced Matrix Extensions (AMX), and are for matrix manipulation functions. Lots of information on GitHub:



A big miss on my part.
 
I was wrong about lack of described AI (really ML) assists in Sapphire Rapids. The assists are called Advanced Matrix Extensions (AMX), and are for matrix manipulation functions.
It is easy to overlook, since superficially the "512-bit" looks just like tinkering with the vector instructions that have been around for a while. But 512 is just the natural size of the data path, since it matches the cache line width, while the AMX engine is likely a whole different block of IP. Different FP formats, matrix operations, and other future acceleration are likely to also be 512-bit wide, but that does not mean it is unimportant.

These blocks are effectively accelerators closely coupled to the memory path and synchronized to the instruction pipeline. This can be quite effective for inferencing of small to medium sized models.
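To make the "different block of IP" point concrete, here is a hedged sketch of driving the AMX tile unit directly with intrinsics, following the pattern of Intel's public AMX sample code. The struct, function name, and fixed 16x64 tile shapes are illustrative; it assumes Linux on a Sapphire Rapids class part with gcc/clang and -mamx-tile -mamx-int8, and that B has already been packed in the 4-byte-interleaved layout the tile multiply expects.

Code:
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   /* Linux arch_prctl: request extended state */
#define XFEATURE_XTILEDATA  18       /* the AMX tile-data state component */

/* 64-byte tile configuration: palette 1, three 16-row x 64-byte tiles. */
struct tilecfg {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];              /* bytes per row for each tile register */
    uint8_t  rows[16];               /* rows for each tile register */
};

/* C += A * B for one int8 tile pair, accumulated into int32. */
void amx_tile_matmul(const int8_t A[16][64], const int8_t B[16][64],
                     int32_t C[16][16])
{
    /* The OS must grant permission before the large tile state can be used. */
    syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);

    struct tilecfg cfg;
    memset(&cfg, 0, sizeof cfg);
    cfg.palette_id = 1;
    for (int t = 0; t < 3; t++) { cfg.colsb[t] = 64; cfg.rows[t] = 16; }
    _tile_loadconfig(&cfg);

    _tile_zero(0);                   /* tile 0 is the int32 accumulator */
    _tile_loadd(1, A, 64);           /* 16 rows x 64 int8 */
    _tile_loadd(2, B, 64);
    _tile_dpbssd(0, 1, 2);           /* signed int8 dot-products into int32 */
    _tile_stored(0, C, 64);          /* 16 rows x 16 int32 */
    _tile_release();
}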

Intel also introduces an interrupt-free, user-mode model of interacting with accelerators that can be off-chip in Sapphire Rapids. The commands are issued in a memory-mapped FIFO using an uncached write, which returns a status if the FIFO overflows, and the responses from the accelerators can be watched using memory coherency instructions just like locks. Essentially this allows accelerators to be part of the user space, no kernel or hypervisor pathways needed, and none of the awful headaches in the interrupt vectoring machinery. It allows acceleration to scale to millions of ops per second per core at minimal overhead.
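As a rough sketch of that submission model: the ENQCMD instruction is exactly that non-posted 64-byte store, reporting back whether the shared work queue accepted the descriptor, and completion then shows up as an ordinary coherent write from the device. The descriptor layout, field names, and portal handle below are hypothetical placeholders (the real DSA/QAT formats are richer), and reading the intrinsic's return value as "nonzero means retry" is my assumption from the instruction's documented flag behavior. Assumes an ENQCMD-capable CPU and -menqcmd.

Code:
#include <immintrin.h>
#include <stdint.h>

struct cmd_desc {                    /* hypothetical 64-byte command */
    uint32_t opcode;
    uint32_t flags;
    uint64_t src, dst, len;
    volatile uint64_t *completion;   /* device writes a status word here */
    uint8_t  pad[24];
} __attribute__((aligned(64)));

/* Submit one command to a memory-mapped work-queue portal and wait for it. */
int submit(void *wq_portal, struct cmd_desc *d)
{
    *d->completion = 0;
    /* 64-byte non-posted store to the portal; retry while the queue is full. */
    while (_enqcmd(wq_portal, d))
        _mm_pause();

    while (*d->completion == 0)      /* wait for the device's coherent write */
        _mm_pause();
    return (int)*d->completion;
}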
 
It is easy to overlook, since superficially the "512-bit" looks just like tinkering with the vector instructions that have been around for a while. But 512 is just the natural size of the data path, since it matches the cache line width, while the AMX engine is likely a whole different block of IP. Different FP formats, matrix operations, and other future acceleration are likely to also be 512-bit wide, but that does not mean it is unimportant.
No, I just wasn't paying attention. Intel had become... uninteresting... for a while. If the power consumption of AMX is similar to AVX-512, it could be a controversial feature.
These blocks are effectively accelerators closely coupled to the memory path and synchronized to the instruction pipeline. This can be quite effective for inferencing of small to medium sized models.
That appears to be Intel's target market. Focus on the apps that can be addressed by general purpose CPU extensions. The marketing strategy makes sense.
Intel also introduces an interrupt-free, user-mode model of interacting with accelerators that can be off-chip in Sapphire Rapids. The commands are issued in a memory-mapped FIFO using an uncached write, which returns a status if the FIFO overflows, and the responses from the accelerators can be watched using memory coherency instructions just like locks. Essentially this allows accelerators to be part of the user space, no kernel or hypervisor pathways needed, and none of the awful headaches in the interrupt vectoring machinery. It allows acceleration to scale to millions of ops per second per core at minimal overhead.
This sounds a lot like what CXL 1.0 was supposed to accomplish. I wonder if that's how AMX actually works?
 
This sounds a lot like what CXL 1.0 was supposed to accomplish. I wonder if that's how AMX actually works?
The user-mode interaction is supported by S-IOV (not SR-), which I believe is in PCIe Gen 5, not requiring CXL, but I could be wrong.

AMX is incorporated into the instruction set. The mechanism I described is found in the QAT-3 (2?) accelerators on Sapphire. The memory mapped FIFO might be used for other things in future. Interestingly, it somewhat resembles how the matrix acceleration in Apple M-1 and M-2 is programmed.
 
The user-mode interaction is supported by S-IOV (not SR-), which I believe is in PCIe Gen 5, not requiring CXL, but I could be wrong.
I don't think it's SIOV. SIOV is an I/O virtualization mechanism, and AMX uses instruction set extensions, which also means the AMX acceleration needs to be in the coherent domain.
AMX is incorporated into the instruction set. The mechanism I described is found in the QAT-3 (2?) accelerators on Sapphire. The memory mapped FIFO might be used for other things in future. Interestingly, it somewhat resembles how the matrix acceleration in Apple M-1 and M-2 is programmed.
I may be out of date, but QAT is an I/O-based accelerator, making it quite different from AMX.
 
I don't think it's SIOV. SIOV is an I/O virtualization mechanism, and AMX uses instruction set extensions, which also means the AMX acceleration needs to be in the coherent domain.
S-IOV unifies the address space so both CPU and accelerator are in the user address space. Effectively it creates a VM slice transparently across several chips.
I may be out of date, but QAT is an I/O-based accelerator, making it quite different from AMX.
The new QAT is on die (each of the 4 dies on the server socket) and considerably better. My favorite function is the Swiss-knife compression-encryption pipeline which can be reconfigured for multiple algorithms and runs at many GB/s with microsecond latency, but there are also functions for SSL acceleration and smart DMA. It is integrated with the on-die fabric for data (including cache), similar to what IBM does on Power or Apple on M-series.

The bus fabric is the lynchpin of the next CPUs. Crossing efficiently between chiplets, plenty of bandwidth, low latency, coherently integrated with cache, and supporting a distributed VM image for users.
 
S-IOV unifies the address space so both CPU and accelerator are in the user address space. Effectively it creates a VM slice transparently across several chips.
SIOV is in the world of interrupts and I/O drivers. Instructions aren't. Every shared memory symmetric multi-processor creates transparency across cores on different chips.
The new QAT is on die (each of the 4 dies on the server socket) and considerably better. My favorite function is the Swiss-knife compression-encryption pipeline which can be reconfigured for multiple algorithms and runs at many GB/s with microsecond latency, but there are also functions for SSL acceleration and smart DMA. It is integrated with the on-die fabric for data (including cache), similar to what IBM does on Power or Apple on M-series.
Yeah, but I think accessing the QAT logic is still done as an I/O operation. There aren't compression instruction set extensions, correct? This Hot Chips conference presentation implies that AMX and QAT (and DSA) are completely different classes of accelerators, and AMX is the only one with instruction set extensions.

The bus fabric is the lynchpin of the next CPUs. Crossing efficiently between chiplets, plenty of bandwidth, low latency, coherently integrated with cache, and supporting a distributed VM image for users.
This explanation on intel.com seems to say SIOV works like I think it does:


It's a software-based capability.
 
SIOV is in the world of interrupts and I/O drivers. Instructions aren't. Every shared memory symmetric multi-processor creates transparency across cores on different chips.
You are mistaking the setup for the usage. From the application POV, SIOV is transparent memory across heterogeneous chips. CXL.cache adds coherency for external chips, but QAT is already coherent.

S-IOV is primarily about how to share the virtual memory addressing, and mostly runs automatically in the page tables of compliant devices, not software. Drivers are not needed at the user level.

Yeah, but I think accessing the QAT logic is still done as an I/O operation. There aren't compression instruction set extensions, correct? This Hot Chips conference presentation implies that AMX and QAT (and DSA) are two completely different classes of accelerators, and AMX is the only one with instruction set extensions.
Correct, AMX has instructions while QAT has commands (wide instructions) written directly into a FIFO. AMX is synchronized by dependency tracking on defined register sets, while QAT results are synchronized by semaphores in coherent memory. The QAT approach is far more extensible, uses less silicon, and is just as efficient (and very different to the user compared to doing classic I/O). It points to the future of acceleration.
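As a hedged illustration of what "semaphores in coherent memory" can look like on the consuming side: rather than spin-polling the completion word an accelerator will write, user code can arm UMONITOR on that cache line and sleep in UMWAIT until the line is written or a TSC deadline expires. The semaphore variable here is my own placeholder, not a QAT structure, and it assumes a WAITPKG-capable CPU compiled with -mwaitpkg.

Code:
#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>               /* __rdtsc */

void wait_for_completion(volatile uint64_t *sem)
{
    while (*sem == 0) {
        _umonitor((void *)sem);      /* arm monitoring on the semaphore's line */
        if (*sem != 0)               /* re-check after arming to avoid a race */
            break;
        /* ctrl=0 requests the deeper C0.2 state; wakes on a write to the
         * monitored line or when the TSC deadline passes. */
        _umwait(0, __rdtsc() + 100000);
    }
}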

This explanation on intel.com seems to say SIOV works like I think it does:

That is a very superficial summary. IMO there is a part of Intel that understands where they are going with S-IOV (along with the non-temporal writes used for command queuing and the coherency-conditional waits for sync), but Intel as a whole has not made a clear external story about it. Maybe they will now that Sapphire is out and they want to show its strengths.

Apple is somewhere in between with its accelerators. It uses command FIFOs in memory mapped space, but moves data through register sets instead of shared memory, at least for the M-series accelerators that have been reverse-engineered so far.
 
You are mistaking the setup for the usage. From the application POV, SIOV is transparent memory across heterogeneous chips. CXL.cache adds coherency for external chips, but QAT is already coherent.

S-IOV is primarily about how to share the virtual memory addressing, and mostly runs automatically in the page tables of compliant devices, not software. Drivers are not needed at the user level.
Tanj, you are mistaken. Scalable I/O Virtualization (SIOV) is about I/O virtualization, not virtual memory addressing. I/O virtualization in servers using PCIe was based on the PCIe capabilities for Single Root I/O Virtualization (SR-IOV) and Multi-Root I/O Virtualization (MR-IOV), "root" here referring to a PCIe root complex (for those here not familiar with PCIe architecture). Both of these strategies were intended for OS virtualization using virtual machines, where each virtual machine runs a separate OS image. SR-IOV was the only version that became really popular; MR-IOV allows I/O devices to be shared by multiple servers with separate PCIe root complexes, and was very complex to implement. PCIe IOV supports Physical Functions (PFs), which are "owned" by the VMM, and Virtual Functions (VFs), which are the virtualized PCIe device instantiations as seen by the virtual machines, which might be running different operating systems (for example, Windows and Linux on the same physical server).
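For reference, this is roughly how SR-IOV VFs get created on a Linux host today: the PF driver exposes a sysfs attribute, and writing a count to it makes that many VFs appear as new PCIe functions that can then be assigned to VMs. A minimal sketch, with a placeholder bus/device/function address:

Code:
#include <stdio.h>

/* Ask the PF at the given PCI address to instantiate num_vfs Virtual Functions. */
int enable_vfs(const char *pf_bdf, int num_vfs)
{
    char path[256];
    snprintf(path, sizeof path,
             "/sys/bus/pci/devices/%s/sriov_numvfs", pf_bdf);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", num_vfs);     /* e.g. "0000:3b:00.0" with 4 -> 4 new VFs */
    return fclose(f);
}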

PCIe IOV strategies do not directly support OS containers, which have increased dramatically in popularity after PCIe SR/MR IOV was defined. Containers require much lower hardware and software overhead than VMMs/VMs, and if the objective is inter-application security and isolation, containers are considered a better strategy by the IT community. Each container runs as an Operating System process in a single parent OS instance, which is invisible to PCIe IOV. No VMM is required. SIOV was designed to allow PCIe IOV to be used by OS containers, and it does this by a software-based abstraction layer that is visible to the container management software, like Docker or Kubernetes.

SIOV stands for Scalable I/O Virtualization because that is what it is about. I/O virtualization. You are misunderstanding how it works and why it exists. If you don't think so, provide evidence from Intel. If I'm proven wrong, I'll gladly admit it.

Correct, AMX has instructions while QAT has commands (wide instructions) written directly into a FIFO. AMX is synchronized by dependency tracking on defined register sets, while QAT results are synchronized by semaphores in coherent memory. The QAT approach is far more extensible, uses less silicon, and is just as efficient (and very different to the user compared to doing classic I/O). It points to the future of acceleration.
There isn't any public information I can find on Sapphire Rapids QAT programming. On intel.com this is what they say about QAT on that CPU:


"Contact your Intel Field Representative". I'm retired, so that's not helpful. :)

The latest available Intel QAT Programmer's Guide (it's a PDF download only) still references a driver model. This also got me thinking. QAT has been available for many years (more than ten), in discrete ASICs, PCIe cards, and logic integrated into server chipsets (PCH). There must be a lot of software out there using QAT. If Sapphire Rapids QAT had a different programming model than the previous versions, all of that software would have to be rewritten to use Sapphire Rapids QAT, and I don't believe Intel would go that way. Perhaps I'm wrong, but Intel, especially x86 Intel, is a company resolute about backward compatibility. For example, the CPU word length in the x86 programmer's guide is still 16 bits. 32 bit words are called "doublewords", and 64 bit words are "quadwords". I find it difficult to believe that a company which still defines a memory word as 16 bits in its assembler language would require reprogramming of user software to take advantage of a new feature which does the exact same thing as a previous feature (albeit in a chipset). Just saying.
 
Modern security architecture goes far deeper than containers. The state of the art is VMs in hardware supporting blind hypervisors: HVs which are unable to read or write clear-text memory belonging to clients. Both Intel and AMD have good schemes for this, though they differ in how they implement it. I prefer what AMD is doing, but both are competent, and clients do not need to know the details; it is just annoying for HV writers to have to support both. But security architecture is a separate discussion.

The Intel S-IOV spec is here:

The key thing for user-space work actually looks like just a minor detail: the PASID and the IOMMU support. These allow user-space addresses to be used by IO devices. The IOMMU page tables are modified to support 3 levels of translation: host physical, guest physical, and guest virtual. In SR-IOV the IOMMU only supported the first two levels which meant that IO either went through a guest kernel transition, or the PCIe-ATS hack was used to allow the IO device to supply the missing level of guest virtual translation. In S-IOV the CPU supports the missing level and it is integrated with memory space security. From the user point of view the device shares the same guest virtual address space so you can write the program in a unified way on both sides, with no kernel transitions in the paths.
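A purely illustrative way to see the difference between those translation stages (the function and types below are a toy model, not any real API): with SR-IOV the IOMMU only covered the guest-physical to host-physical stage, so a device had to be handed guest-physical addresses or lean on PCIe ATS, whereas with S-IOV and a PASID the guest-virtual to guest-physical stage is applied too, so the device can work directly on the application's own pointers.

Code:
#include <stdint.h>

typedef uint64_t (*xlate_fn)(uint64_t);   /* one translation stage, collapsed to a lookup */

/* S-IOV-style DMA target: both stages applied by the IOMMU on behalf of the
 * device, so the caller can pass the same guest-virtual address the
 * application uses; with SR-IOV only the second stage was available. */
uint64_t dma_target(uint64_t guest_va, xlate_fn gva_to_gpa, xlate_fn gpa_to_hpa)
{
    return gpa_to_hpa(gva_to_gpa(guest_va));
}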

You can get the flavor of QAT programming here:
 
Tanj, you seem like a really nice person, but this conversation is giving me a feeling of deja vu. It reminds me of one I had recently with another member in a thread about RISC-V, who used my mental weaknesses to take the thread so far off track I remember having to reread the first post to remind myself what the thread was about. In this thread we were diverted into discussing SIOV, which is not really related to the accelerators in Sapphire Rapids or Intel's competitiveness with AMD's 7040, and now you want to divert to nested hypervisors which have nothing to do with SIOV. And, IMO, you're still misunderstanding or misrepresenting SIOV, I'm not sure which, with concepts like "In S-IOV the CPU supports the missing level...", implying that there's hardware support specifically for SIOV, and there isn't. Yes, SIOV uses memory protection in the CPU, but so does a game of solitaire. And user-mode I/O too? Standard PCIe devices can also support user-mode I/O, as RDMA network adapters have been doing for two decades. So I give up. You win, and I'll take the blame for helping you take this thread off course.
 
Did not mean to do that.

My recent background is in cloud servers, which is a main focus of the Sapphire Rapids SKUs just announced. And I've been watching AMD Genoa for a long time too, and comparing them. The thread topic was whether Genoa is a game changer. Hence my analysis of features where Intel was ahead for server loads, where making customer VMs work effortlessly at near-zero overhead is a big deal.

So I don't think we are far off course. But yeah, thrashed it to death.

FWIW I think Sapphire came out way behind the curve with 8 memory channels; you can see it choke vs. Genoa with 12, both DDR5. Intel did announce a few weeks ago support for high-speed buffered DDR5 DIMMs, which may tilt that, but for now the only configurations available cannot match Genoa throughput. And in benchmarks, memory bandwidth generally sets the ceiling at high core counts.

Genoa IS a game changer in that it keeps AMD ahead even on Intel's launch day, with the configurations actually available. But the advantage will likely switch back and forth.
 