Key Takeaways
- The evolution of AI workloads has transitioned from single-GPU systems to large rack-based clusters designed for parallelism and efficiency.
- Foundational models like AlexNet (2012) and BERT (2018) have driven the need for data parallelism and more advanced hardware capabilities in AI training.
- New forms of parallelism, such as pipeline parallelism and tensor slicing, have emerged to accommodate the growing size and complexity of AI models.
- Mixture of Experts (MoE) architecture introduced sparsity, allowing only subsets of experts to be activated during training, thus reducing power and computation requirements.
- Future trends in AI hardware include the adoption of microscaling formats (FP4/FP8) for efficiency and the continued growth of rack-based systems to support increasingly complex AI workloads.
The evolution of AI workloads has profoundly influenced hardware design, shifting from single-GPU systems to massive rack-based clusters optimized for parallelism and efficiency. As outlined in this Hot Chips 2025 tutorial, this transformation began with foundational models like AlexNet in 2012 and continues with today’s multi-trillion-parameter behemoths.
Early AI breakthroughs relied on affordable compute. AlexNet, trained on ImageNet using two NVIDIA GTX 580 GPUs, demonstrated convolutional neural networks’ potential for image classification. Priced at $499, the GTX 580 offered 1.58 TFLOPS of FP32 performance and 192 GB/s of memory bandwidth. Training took 5-6 days in FP32, which was necessary for convergence given its advantages in dynamic range and precision over formats like Int8.
By 2015, ResNet-50 introduced data parallelism, distributing training across multiple GPUs such as those in Facebook’s “Big Sur” system with eight K80 cards. Data parallelism splits each batch of data across identical model copies, using AllReduce operations to average gradients and keep the weights synchronized, accelerating training without increasing the per-GPU batch size.
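For readers unfamiliar with the mechanics, here is a minimal single-process sketch of that idea: each replica processes its own shard of the batch, gradients are averaged across replicas (standing in for an AllReduce over NCCL), and every replica applies the identical update. The model and batch sizes are arbitrary placeholders, not anything from the tutorial.

```python
# Minimal single-process sketch of data parallelism: each replica sees a
# shard of the batch, gradients are averaged (emulating AllReduce), and
# every replica applies the identical update. Real systems distribute
# this across GPUs with torch.distributed and NCCL.
import copy
import torch

torch.manual_seed(0)
N_REPLICAS = 4
model = torch.nn.Linear(16, 1)
replicas = [copy.deepcopy(model) for _ in range(N_REPLICAS)]

# One global batch, split into per-replica shards.
x = torch.randn(32, 16)
y = torch.randn(32, 1)
x_shards = x.chunk(N_REPLICAS)
y_shards = y.chunk(N_REPLICAS)

# Local forward/backward on each shard.
for rep, xs, ys in zip(replicas, x_shards, y_shards):
    loss = torch.nn.functional.mse_loss(rep(xs), ys)
    loss.backward()

# "AllReduce": average gradients across replicas, parameter by parameter.
for params in zip(*(rep.parameters() for rep in replicas)):
    avg_grad = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = avg_grad.clone()

# Identical SGD step on every replica keeps the copies in sync.
lr = 0.1
with torch.no_grad():
    for rep in replicas:
        for p in rep.parameters():
            p -= lr * p.grad
```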
The 2018 BERT model marked the transformer era, shifting the focus to natural language processing. Google’s TPUv2, with BF16 precision (matching FP32’s dynamic range at reduced mantissa precision), enabled faster training: a TPUv2 pod links 256 chips in a 2D torus topology, and BERT-large was trained in four days on 64 of these ASICs. Reduced-precision formats like BF16 and FP16, accelerated by NVIDIA’s Tensor Cores in V100 GPUs, balanced accuracy and throughput.
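The pattern below is a generic PyTorch mixed-precision sketch, not BERT’s actual training code: matrix multiplies run in BF16 inside an autocast region while master weights and the optimizer step stay in FP32. Because BF16 keeps FP32’s exponent range, loss scaling is usually unnecessary; the layer sizes and hyperparameters are illustrative.

```python
# A minimal BF16 mixed-precision sketch: compute-heavy ops run in BF16
# inside the autocast region, while master weights and the optimizer
# step remain in FP32. Illustrative sizes, not a real BERT config.
import torch

model = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.GELU(),
                            torch.nn.Linear(256, 2))   # FP32 master weights
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 128)
target = torch.randint(0, 2, (8,))

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(x)                 # matmuls execute in BF16
    loss = torch.nn.functional.cross_entropy(logits, target)

loss.backward()                        # gradients land in FP32 parameters
opt.step()
opt.zero_grad()
```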
As models grew, new forms of parallelism emerged. Pipeline parallelism, introduced in GPipe (2018), divides layers across nodes, though its throughput is limited by inter-node bandwidth (PCIe or Ethernet at roughly 8 GB/s, far below intra-node links). NVIDIA’s DGX-1 V100 systems exemplified this, connecting eight GPUs via NVLink for efficient communication.
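A toy GPipe-style sketch makes the trade-off concrete: layers are grouped into stages that would each live on a different node, and the batch is split into micro-batches so stages can overlap work. Only activations cross stage boundaries, which is exactly the traffic constrained by inter-node bandwidth. The three-stage model below is hypothetical and runs in a single process for clarity.

```python
# Toy GPipe-style pipeline (single process, CPU): layers are grouped into
# stages that would each live on a different node/GPU, and the batch is
# split into micro-batches. Only activations cross stage boundaries.
import torch

stages = [
    torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()),  # stage 0
    torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()),  # stage 1
    torch.nn.Sequential(torch.nn.Linear(64, 10)),                   # stage 2
]

def pipeline_forward(batch, n_micro=4):
    outputs = []
    for micro in batch.chunk(n_micro):          # split into micro-batches
        act = micro
        for stage in stages:                    # in a real system each hop
            act = stage(act)                    # is a send/recv over the
        outputs.append(act)                     # inter-node fabric
    return torch.cat(outputs)

logits = pipeline_forward(torch.randn(32, 64))
print(logits.shape)  # torch.Size([32, 10])
```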
Tensor slicing (model parallelism), added in Megatron-LM (2019), splits matrix operations across GPUs within a node, offering near-linear scaling. For instance, the 2021 Megatron-Turing NLG (530B parameters) used 280 A100 GPUs per replica: 8-way tensor slicing per node, pipeline across 35 nodes, and data parallelism across replicas. Training took three months, highlighting the need for larger clusters.
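The core of tensor slicing is easy to show in a few lines: the weight matrix of one linear layer is split column-wise across tensor-parallel ranks, each rank performs its local matmul, and the output slices are gathered (a following row-parallel layer would instead AllReduce partial sums, as in Megatron-LM). The sketch below simulates this in a single process with arbitrary dimensions and checks that the sliced result matches the unsliced matmul.

```python
# Single-process sketch of Megatron-style tensor slicing: one linear
# layer's weight is split column-wise across "GPUs", each shard computes
# a slice of the output, and the slices are concatenated.
import torch

torch.manual_seed(0)
TP = 4                                  # tensor-parallel degree
x = torch.randn(8, 512)                 # activations, replicated on all ranks
w = torch.randn(512, 2048)              # full weight of the linear layer

w_shards = w.chunk(TP, dim=1)           # each rank holds 512 x (2048 / TP)
partial_outs = [x @ ws for ws in w_shards]   # local matmul per rank
y_tp = torch.cat(partial_outs, dim=1)   # gather slices along the feature dim

assert torch.allclose(y_tp, x @ w, atol=1e-5)   # matches the unsliced matmul
```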
Mixture of Experts (MoE), introduced in GShard (2020), brought sparsity. Unlike dense models, which require full compute for every token, MoE activates only a subset of experts, reducing power and computation. DeepSeekMoE (2024) pushed this further with fine-grained experts small enough to fit on a single GPU, minimizing overlap between experts and inter-GPU communication. Intra-GPU memory bandwidth (e.g., 3.35 TB/s of HBM) far exceeds scale-up links like NVLink (900 GB/s aggregate).
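A minimal top-k routing sketch illustrates where the sparsity savings come from: the router scores every expert for each token, but only the top-k experts actually execute, so compute scales with k rather than with the total expert count. The expert count, top-k value, and dimensions below are illustrative, not DeepSeekMoE’s actual configuration.

```python
# Minimal MoE top-k routing sketch: only the top-k experts per token run,
# so compute scales with k, not with the total number of experts.
# Illustrative sizes, not a real MoE configuration.
import torch

torch.manual_seed(0)
N_EXPERTS, TOP_K, D = 8, 2, 64
experts = torch.nn.ModuleList(
    [torch.nn.Linear(D, D) for _ in range(N_EXPERTS)])
router = torch.nn.Linear(D, N_EXPERTS)

tokens = torch.randn(16, D)
scores = router(tokens).softmax(dim=-1)            # (tokens, experts)
topk_w, topk_idx = scores.topk(TOP_K, dim=-1)      # keep only k experts/token

out = torch.zeros_like(tokens)
for e in range(N_EXPERTS):
    token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
    if token_ids.numel() == 0:
        continue                                    # inactive expert: no compute
    out[token_ids] += (topk_w[token_ids, slot].unsqueeze(-1)
                       * experts[e](tokens[token_ids]))
```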
Hardware adapted accordingly. Scale-up domains like NVLink or AMD’s Infinity Fabric provide high-bandwidth, low-latency intra-rack connectivity. NVIDIA’s GB200 NVL72 packs 72 GPUs into a rack with 1.8 TB/s of NVLink bandwidth per GPU, using copper for short reaches and optics for longer ones. AMD’s MI300X supports 32 GPUs per rack in Azure deployments.
Cluster scaling poses challenges: SERDES rates limit copper reach, driving optical adoption; power densities of 100 kW+ per rack mandate liquid cooling; and bandwidth demand doubling every two years strains infrastructure. A useful taxonomy distinguishes scale-up fabrics (e.g., UALink) from scale-out networks (Ethernet/InfiniBand) and front-side networks.
Future trends include microscaling formats like FP4/FP8 for efficiency, with computational throughput soaring from roughly 1 PFLOP of FP16 in the 2018 DGX V100 to a projected 100 PFLOPs of FP4 in a single-GPU package by 2027. Racks as AI building blocks, from Google’s TPU tori to Meta’s data centers, underscore this shift toward continent-scale systems.
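To make the microscaling idea concrete, the sketch below fake-quantizes a tensor the way MX-style formats do: small blocks of values share one scale, and each element is rounded to a coarse FP4-like (E2M1) magnitude grid. This is a numerical illustration of the concept, not a bit-exact implementation of the OCP MX specification (which, for example, restricts the shared scale to a power of two).

```python
# Illustrative emulation of a microscaling (MX-style) format: blocks of
# values share one scale, and each element is rounded to an FP4-like
# (E2M1) magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}. A numerical sketch
# of the idea, not a bit-exact OCP MX implementation.
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32

def mx_quantize(x: torch.Tensor) -> torch.Tensor:
    """Fake-quantize a 1-D tensor: per-block shared scale + FP4-like rounding."""
    out = torch.empty_like(x)
    for start in range(0, x.numel(), BLOCK):
        blk = x[start:start + BLOCK]
        scale = blk.abs().max() / FP4_GRID[-1] + 1e-12   # shared block scale
        mags = (blk.abs() / scale).unsqueeze(-1)
        nearest = FP4_GRID[(mags - FP4_GRID).abs().argmin(dim=-1)]
        out[start:start + BLOCK] = torch.sign(blk) * nearest * scale
    return out

x = torch.randn(128)
xq = mx_quantize(x)
print("max abs error:", (x - xq).abs().max().item())
```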
Bottom line: AI’s progression from narrow tasks to AGI/ASI demands hardware innovations in parallelism, precision, and interconnects, balancing compute density against energy constraints. As clusters grow to hundreds of thousands of GPUs, racks optimized for these workloads enable unprecedented scale.