Apple’s New M1 Ultra Packs a Revolutionary GPU

M. Y. Zuo

Active member

Apple’s new M1 Ultra SoC, announced yesterday, appears to be a genuine breakthrough. The M1 Ultra is built from two M1 Max chips and features a GPU integration approach not seen on the market before. While the SoC contains two GPUs, one per M1 Max die, games and applications running on an Apple M1 Ultra see a single GPU.



During the M1 Ultra’s unveiling, Apple acknowledged that the M1 Max SoC has a feature the company didn’t disclose last year. From the beginning, the M1 Max was designed to support a high-speed interconnect via a silicon interposer. According to Apple, this inter-chip network, dubbed UltraFusion, can provide 2.5TB/s of low-latency bandwidth. The company claims this is “more than 4x the bandwidth of the leading multi-chip interconnect technology.”



Apple doesn’t appear to be referencing CPU interconnects here. AMD’s Epyc CPUs use Infinity Fabric, which supports a maximum of 204.8GB/s of bandwidth across the entire chip when paired with DDR4-3200. Intel’s Skylake Xeons use Ultra Path Interconnect (UPI), with a 41.6GB/s connection between two sockets. Neither of these is anywhere close to 625GB/s, a quarter of Apple’s 2.5TB/s figure. Apple may instead be referencing Nvidia’s GA100, which can offer ~600GB/s of bandwidth via NVLink 3.0. If we assume NVLink 3.0 is the appropriate comparison, Apple is claiming its new desktop SoC offers 4x the inter-chip bandwidth of Nvidia’s top-end server GPU.
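As a quick sanity check on that framing, the ratio of Apple’s figure to the NVLink 3.0 number works out to a bit over 4x. This is only a rough sketch using the approximate figures cited above, with decimal gigabytes assumed:

    // Rough sanity check of the "more than 4x" claim, using the figures cited above.
    let ultraFusion = 2_500.0      // GB/s, Apple's 2.5TB/s UltraFusion figure
    let nvlink3     = 600.0        // GB/s, approximate NVLink 3.0 bandwidth on GA100
    print(ultraFusion / nvlink3)   // ~4.17, consistent with "more than 4x"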



According to Apple, providing such a massive amount of bandwidth allows the M1 Ultra to behave, and be recognized in software, as a single chip with a unified 128GB memory pool it shares with the CPU. The company claims there’s never been anything like it. They might be right. We know Nvidia and AMD have both done some work on the concept of disaggregated GPUs, but neither company has ever brought such a product to market.
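As a rough illustration of what “recognized in software as a single chip” means in practice, a Metal application on macOS should enumerate one logical GPU backed by the unified memory pool. This is only a sketch; the printed device name and sizes are assumptions, not measured output:

    import Metal

    // Sketch: enumerate GPUs as Metal sees them. On an M1 Ultra this is expected
    // to return a single MTLDevice, not two separate GPUs.
    for device in MTLCopyAllDevices() {
        let workingSetGiB = Double(device.recommendedMaxWorkingSetSize) / 1_073_741_824
        print(device.name)                                  // e.g. "Apple M1 Ultra" (assumed)
        print("unified memory:", device.hasUnifiedMemory)   // true on Apple silicon
        print("max working set: \(workingSetGiB) GiB")
    }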

The Long Road to GPU Chiplets​

The concept of splitting GPUs into discrete blocks and aggregating them together on-package predates common usage of the word “chiplet”, even though that’s what we’d call this approach today. Nvidia performed a study on the subject several years ago.


Image by Nvidia

GPUs are some of the largest chips manufactured on any given iteration of a process node. The same economy of scale that makes CPU chiplets affordable and efficient could theoretically benefit GPUs the same way. The problem with GPU chiplets is that scaling workloads across multiple cards typically requires a great deal of fabric bandwidth between the chips themselves. The more chiplets you want to combine, the more difficult it is to wire all of them together with no impact on sustained performance.
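As a back-of-envelope illustration of that wiring problem (the numbers below are hypothetical, not vendor figures): if memory addresses are interleaved evenly across two chiplets, roughly half of each chiplet’s memory accesses land on the other die, and the die-to-die fabric has to carry all of that traffic.

    // Illustrative model only: cross-die traffic for a two-chiplet GPU with
    // evenly interleaved memory. Both inputs are hypothetical.
    let perDieMemoryDemand = 400.0   // GB/s each die wants from memory (assumed)
    let remoteFraction     = 0.5     // share of accesses that land on the other die
    let fabricPerDirection = perDieMemoryDemand * remoteFraction
    print("\(fabricPerDirection) GB/s per direction across the die-to-die fabric")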

Memory bandwidth and latency limitations are part of why AMD, Intel, and Nvidia have never shipped a dual graphics solution that could easily take advantage of the integrated GPU built into many CPUs today. Apple has apparently found a way around this problem where PC manufacturers haven’t. The reason for that may be explained more by the company’s addressable market than by design shortcomings at Intel or AMD.

Apple Has Unique Design Incentives​

Both Intel and AMD manufacture chips for other people to build things with. Apple builds components only for itself. Intel and AMD maintain and contribute to manufacturing ecosystems for desktops and laptops, and their customers value flexibility.

Companies like Dell, HP, and Lenovo want to be able to combine CPUs and GPUs in various ways to hit price points and appeal to customers. From Apple’s perspective, however, the money customers forked over for a third-party GPU represents revenue it could have earned for itself. While both Apple and PC OEMs earn additional profits when they sell a system with a discrete GPU, sharing those profits with AMD, Nvidia, and Intel is the price OEMs pay for not doing the GPU R&D themselves.


Apple did not disclose the test it used to make this determination.

A PC customer who builds a 16-core desktop almost certainly expects the ability to upgrade the GPU over time. Some high core count customers don’t care much about GPU performance, but for those that do, the ability to upgrade a system over time is a major feature. Apple, in contrast, has long downplayed system upgradability.

The closest x86 chips to the M1 Ultra would be the SoCs inside the Xbox Series X and PlayStation 5. While neither console features on-package RAM, they both offer powerful GPUs integrated directly on-package in systems meant to sell for $500. One reason we don’t see such chips in the PC market is because OEMs value flexibility and modularity more than they value the ability to standardize on a handful of chips for years at a time.

It may be that one reason we haven’t seen this kind of chip from AMD, Intel, or Nvidia is because none of them have had particular incentive to build it.

How Apple’s M1 Max Uses Memory Bandwidth​

When Apple’s M1 Max shipped, tests showed that the CPU cores can’t access the system’s full bandwidth. Out of the 400GB/s of theoretical bandwidth available to the M1 Max, the CPU can only use ~250GB/s of it.


Graph by Anandtech.

The rest of the bandwidth is allocated to other blocks of the SoC. Anandtech measured the GPU as pulling ~90GB/s of bandwidth and the rest of the fabric at 40-50GB/s during heavy use.
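For context, numbers like these generally come from multithreaded streaming tests. A minimal sketch of that kind of measurement (not Anandtech’s actual methodology, and deliberately untuned) might look like:

    import Foundation

    // Minimal streaming-read bandwidth sketch. Each thread sums a buffer far larger
    // than any cache; total bytes / elapsed time approximates aggregate read bandwidth.
    let threads = 8
    let floatsPerThread = 64 << 20                 // 64M floats = 256 MB per thread
    let buffers = (0..<threads).map { _ in
        [Float](repeating: 1.0, count: floatsPerThread)
    }

    let start = DispatchTime.now()
    DispatchQueue.concurrentPerform(iterations: threads) { i in
        var acc: Float = 0
        for v in buffers[i] { acc += v }           // streaming read
        if acc.isNaN { print("unexpected:", acc) } // keep the loop from being optimized away
    }
    let elapsed = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
    let gbPerSec = Double(threads * floatsPerThread * 4) / elapsed / 1e9
    print(String(format: "~%.0f GB/s aggregate read bandwidth", gbPerSec))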

Given this kind of spec sheet, slapping down two chips side by side, with duplicate RAM pools, doesn’t automatically sound like an enormous achievement. AMD already mounts eight compute chiplets (plus an I/O die) on a common package in a 64-core Epyc CPU. But this is where Apple’s scaling claims carry weight.

For the M1 Ultra GPU to work as a unified solution, both GPUs must share data and memory addresses across the two physical dies. In a conventional multi-GPU solution, a pair of cards with 16GB of VRAM each will appear as two 16GB cards, not a single card with 32GB of VRAM. Nvidia’s NVLink allows two or more GPUs to pool VRAM, but the degree of performance improvement varies considerably depending on the workload.


As with CPUs, Apple did not disclose its test criteria.

As for what kind of GPU performance customers should expect? That’s unclear. The M1 Max performs well in video processing workloads but is a mediocre gaming GPU. The M1 Ultra should see strong scaling thanks to doubling up its GPU resources, but Apple’s lackluster support for Mac gaming could undercut any performance advantage the hardware can deliver.

Apple’s big breakthrough here is in creating a GPU in two distinct slices that apparently behaves like a single logical card. AMD and Nvidia have continued working on graphics chiplets over the years, implying we’ll see discrete chiplet solutions from both companies in the future. We’ll have more to say about the performance ramifications of Apple’s design once we see what benchmarks show us about scaling.


From: https://www.extremetech.com/extreme/332546-apples-new-m1-ultra-packs-a-revolutionary-gpu
 
Another interesting but challenging question is what Microsoft is going to do on the desktop and notebook platforms.

Using Intel's CPUs is definitely Microsoft's plan A. But what's the plan B for Microsoft in case Intel CPUs can't compete?
 
The Apple fine print can be found here: https://www.apple.com/newsroom/2022...s-most-powerful-chip-for-a-personal-computer/

For the CPU and GPU tests indicating performance superior to the 3090 (notes #2 and #3), Apple discloses that "Performance was measured using select industry-standard benchmarks."
= could literally be any piece of software.

FWIW, Apple uses the 12600K + DDR5 to compare against the M1 Max, and for the M1 Ultra it's the 12900K + DDR5.

For the claim of "In fact, the new Mac Studio with M1 Ultra can play back up to 18 streams of 8K ProRes 422 video — a feat no other chip can accomplish." - note #4 literally says they tested on preproduction software, on the Mac, without mentioning if they even tried this on any other device.

Lastly for the claim of using 1,000 kWh less than a high end PC over a year; they specifically chose an Alienware PC without an iGPU (12900KF). A lot of applications don't need a discrete GPU so this definitely influences the results somewhat.
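For scale, 1,000 kWh per year corresponds to an average power difference of a bit over 100W if the machine ran around the clock. This is just arithmetic on the headline number, not Apple's methodology:

    // What "1,000 kWh less per year" implies as an average power delta, assuming 24/7 operation.
    let kWhSavedPerYear = 1_000.0
    let hoursPerYear    = 365.0 * 24.0                        // 8,760 hours
    let averageWatts    = kWhSavedPerYear * 1_000.0 / hoursPerYear
    print("\(averageWatts) W average difference")             // ~114 W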

....

IMO the M1 Ultra is still a huge engineering achievement (and no doubt TSMC 5nm helps a lot), but Apple's usual superlative claims without sufficient detail apply here as well.
 
Apple's CPUs will remain behind until it develops multi-issue pipelining. Here is a story of problems with the new Apple M1 chip. I read most of the slowness compared to x86_64 desktop CPUs as a lack of pipelining. John von Neumann understood in the 1950s that complex CPUs have more computing power than simple CPU architectures, because the more indexing instruction types a CPU has, the more computing power it offers.

Here is the article:

https://wccftech.com/apple-m1-ultra...-show-intel-amd-cpus-still-ahead/?&beta=1
 
What do you mean by multi-issue?

Almost all processors today are superscalar.

The problem is that these benchmarks can't utilise the full performance of the M1. We will have to wait for updates...

Another problem is that some people (mostly the "silicongang") are jumping to conclusions too quickly without considering all the details, and they are spreading a lot of misinformation...
 
We obviously disagree. Read the Wccftech article comments questioning Apple's benchmarking.
 
Apple's CPUs will remain behind until it develops multi-issue pipelining.
I also don't understand your comment here. Apple's performance-core designs are exceptionally wide, parallel, out-of-order architectures, and their ROBs are also very large. They are clearly superscalar designs, and they have been since their first ARMv7 Swift architecture. They are also clearly optimized to have a large amount of combinatorial logic per pipeline stage in order to reduce the number of flip-flops in the design and thus reduce the dynamic power consumed. It is not a bug but a feature of the architecture that it only operates up to ~3.0-3.2 GHz and delivers single-core performance comparable to contemporary x86 cores operating at 5.0-5.5 GHz.

That aside, Apple's marketing claims are picked to highlight what they want to highlight, which is power efficiency. Are their solutions better in absolute terms? No. Do I believe that if capped to a fixed power budget, their architecture, which is a node ahead and optimized for low dynamic power, will shine against the competition? Yes. The quoted benchmark results don't include any power figures, and as such people lean towards the TDP of the solution, but I suspect that Apple's "equivalent system" power figures are the result of spreadsheets rather than measurements: the DVFS of x86 systems doesn't seem to allow for a fixed target power envelope.
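A first-order way to see why capping clocks around 3 GHz buys efficiency is the textbook dynamic-power relation P ~ C*V^2*f: since voltage has to rise roughly with frequency, dynamic power scales roughly with the cube of the clock. The clock values below are approximate and the model ignores leakage and any IPC differences:

    import Foundation

    // First-order CV^2f model: relative dynamic power of two clock targets,
    // assuming voltage tracks frequency. Clock figures are approximate.
    let appleClockGHz = 3.2   // roughly where M1-family performance cores top out
    let x86ClockGHz   = 5.2   // typical boost clock of contemporary x86 parts
    print(pow(appleClockGHz / x86ClockGHz, 3.0))   // ~0.23 of the dynamic power, before IPC differences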
 
The comments to the Wccftech article are that the M1s are optimized for a few specific applications (video?) but are slow for other (compiled) applications. Maybe Apple is too concerned with low power, and now that it no longer has a process-shrink advantage, its CPUs are not fast. Why are the M1 die and cache so large?
 
The comments to the Wccftech article are that the M1s are optimized for a few specific applications (video?) but are slow for other (compiled) applications. Maybe Apple is too concerned with low power, and now that it no longer has a process-shrink advantage, its CPUs are not fast. Why are the M1 die and cache so large?
If the benchmarks are running under emulation, that will impact the results more than running natively, and the same goes for specific apps. That will change with time, but it is a legitimate concern right now. Architecturally, the large L2$ for the performance cores is there to reduce latency when the large OoO back-end misses, and the large system caches are there to manage coherency between the various subsystems and reduce off-chip bandwidth to the DRAM. I suspect that the system cache benefits their media and graphics more than compute, though. They have opted to use LPDDR5 for higher bandwidth as well as lower switching energy on the interface, but the "cost" is a lack of upgradability (no DIMMs). I think that architecture is fine for a mobile product, but for (Professional!) desktops it is really sad to see.
There was a teardown video for the Mac Studio, and the take-away for me was: physical ports can be repaired easily, if that is a failure point on the product. The SSD can be replaced, but it is proprietary to Apple anyhow, so not something you will be able to DIY. And with any electrical component failure, the whole thing goes in the bin.
 
I thought of a question. Does anyone know the IPC (instructions per clock cycle) of the Apple M chips? My impression is that both AMD's and Intel's IPCs have significantly increased compared to their competition. I think that for my Verilog compiler application, more cache does not help.
 
I thought of a question. Does anyone know the IPC (instructions per clock cycle) of the Apple M chips? My impression is that both AMD's and Intel's IPCs have significantly increased compared to their competition. I think that for my Verilog compiler application, more cache does not help.
In practice IPC is dominated by memory latency. Cores get little done because they are waiting, not because they can't run parallel instructions. Apple's choice of LPDDR5 is probably motivated by:
- much lower energy per bit compared to DDR5 (keeps power down under high loads)
- more bandwidth per chip (LPDDR5 is around 6.4GB/s per GB, DDR5 is around 1.3)
- more parallelism (LPDDR5 sustains 4 independent memory ops per GB, DDR5 is at 0.5)

The latter two make a huge improvement in IPC under heavy loads. Even though the static latency of LPDDR5 is a little higher than DDR5's, when you put it under load the queues build up far more slowly. The bandwidth simply allows more work to be done, and the independent ops allow parallel computation (including parallel operations of a single core, and competing threads in the GPU) to not get in each other's way so much.

This week we saw that Nvidia has reached much the same conclusion about the superiority of LPDDR5.

To paraphrase Cray from long ago, it is all about the memory.
 
I think that for Verilog compilation, cache misses are not important. Verilog-generated machine code has a very high density of jump instructions, with many sequences staying within the x86_64 pipelines. The reason is that Verilog operations are really mostly 2-by-2 vector cross products.
 
I just checked the die size for M1 Max at ~20 mm x 21.6mm according to AnandTech (or 19 mm x 22 mm according to TechInsights), so the two of these making up M1 Ultra cannot possibly fit into a single exposure field (26 mm x 33 mm). So the Ultra size must be defined by the CoWoS interposer. M1 Ultra should be considered a type of SiP (System-in-Package), not SoC. It's two chips packaged together.
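For reference, the arithmetic behind that statement, using the approximate dimensions quoted above:

    // Reticle-limit check with the approximate die dimensions cited above.
    let m1MaxArea   = 20.0 * 21.6            // ~432 mm^2 per M1 Max die (AnandTech estimate)
    let reticleArea = 26.0 * 33.0            // 858 mm^2 maximum exposure field
    print(2 * m1MaxArea > reticleArea)       // true: 864 mm^2 exceeds a single reticle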
 
I just checked the die size for M1 Max at ~20 mm x 21.6mm according to AnandTech (or 19 mm x 22 mm according to TechInsights), so the two of these making up M1 Ultra cannot possibly fit into a single exposure field (26 mm x 33 mm). So the Ultra size must be defined by the CoWoS interposer. M1 Ultra should be considered a type of SiP (System-in-Package), not SoC. It's two chips packaged together.
They could in theory dice the wafer up into pairs of M1 Max dies; if you look at the layout, one reticle on the M1 Ultra appears to be rotated 180 degrees compared to the other, and there could be interconnect on the silicon between the two, crossing what would be the (non-scribed) scribe channel. But this would definitely cause some interesting manufacturing problems, since either the reticle would have to be rotated or dual reticles used, one for each way up.

Or they could butt two M1 Max dies up to each other on the silicon interposer, where the gap is too small to see on the photos (usually less than 100um). Which would also help yield, so this is a more likely option.
 
The M1 Ultra uses two M1 Max dies, each of which takes up 432 mm² of a TSMC N5 wafer (864 mm² total). The 12900K uses 215 mm² of Intel 7 (formerly Intel 10nm++), and the GeForce 30-series die (628 mm²) uses Samsung 8nm (and its TDP includes DRAM power). So if we put these together, the difference is not something Intel and Nvidia can overcome with better microfabrication alone. More transistors allow clocks to be reduced, resulting in better performance per watt. So at least from a silicon design perspective, both NVIDIA and Intel are quite OK, I think.
 