
Ultra Ethernet and UALink IP solutions scale AI clusters
by Don Dingee on 12-19-2024 at 6:00 am

Key Takeaways

  • AI infrastructure is experiencing rapid growth due to the demands of larger models requiring substantial training loads and low inference latency.
  • Synopsys has launched Ultra Ethernet and UALink IP solutions to address the interconnect challenges faced by data centers supporting AI acceleration clusters.
  • The Ultra Ethernet Consortium aims to enhance Ethernet technology for scalable interconnect solutions, while UALink focuses on intelligent connections among accelerator nodes to improve data movement efficiency.

AI infrastructure requirements are booming. Larger AI models carry hefty training loads and inference latency requirements, driving an urgent need to scale AI acceleration clusters in data centers. Advanced GPUs and NPUs address the computational load. However, insufficient bandwidth or excessive latency between servers can limit AI performance, faster interconnects tend to chew up massive amounts of power, and scale magnifies these issues rapidly. Two new initiatives, Ultra Ethernet and UALink, target the scale out and scale up needs, respectively, of AI acceleration clusters. Synopsys brings proven Ethernet and PCIe IP, including its 224G Ethernet PHY, to its new Ultra Ethernet and UALink IP solutions to take on efficient, scalable data center interconnects.

Bringing Ultra Ethernet and UALink technology to data centers

“Moving all the data in and around an AI cluster running a large language model like Llama 3 and its successors poses interconnect challenges,” says Priyank Shukla, Principal Product Manager for Interface IPs at Synopsys. “By 2030, just the interconnects for training these models may consume 70% of data center power.” (See more AI data center 2030 insights from McKinsey.)

Advanced AI infrastructure cluster interconnects aren’t optional – LLM needs are already well beyond what a single GPU can accomplish. GPUs such as NVIDIA’s H100 are at the reticle limit, meaning the design already consumes the largest fabricable die size even in an advanced process, which makes adding more functionality on one chip difficult. Meta’s anecdotes on its Llama 3 training projects indicate 16,000 H100 nodes at work for 70 days. They also suggest that the model size doubles every four to six months, which will soon drive node counts to hundreds of thousands.
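A quick back-of-the-envelope extrapolation in Python illustrates that trajectory. The 16,000-node starting point comes from Meta’s figures above; the assumption that node count tracks model size, and the exact doubling period, are illustrative only:

# Back-of-envelope extrapolation of cluster growth, assuming node count scales
# roughly with model size and requirements double every four to six months.
# The growth model is illustrative, not from Meta or Synopsys.

def projected_nodes(start_nodes: int, months: float, doubling_months: float) -> int:
    """Nodes needed after `months` if requirements double every `doubling_months`."""
    return round(start_nodes * 2 ** (months / doubling_months))

for months in (6, 12, 18, 24):
    slow = projected_nodes(16_000, months, doubling_months=6)
    fast = projected_nodes(16_000, months, doubling_months=4)
    print(f"after {months:2d} months: ~{slow:,} to ~{fast:,} nodes")

At a four-month doubling period, two years is enough to push a 16,000-node cluster past a million nodes, which is why the interconnect problem dominates so quickly.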

Two distinct aspects of the interconnect problem require different solutions. First is the classic bisectional bandwidth challenge of moving copious amounts of data between many nodes in a cluster, some within a rack and some several racks away, at low latency. Second is bringing potentially millions of endpoints in and out of the cluster, again at high speed with low latency. Two recently formed consortia, each with Synopsys as a member, worked quickly on new architectures to meet these exact challenges.

The Ultra Ethernet Consortium was formed with the backing of the Linux Foundation, seeking a supercomputing-ready scalable interconnect as an evolutionary path for Ethernet. Along with a faster PHY layer, Ultra Ethernet adds some twists to the technology to achieve scale out, including remote direct memory access (RDMA), packet spraying and out-of-order recovery, sender and receiver-based congestion control, link layer retry, switch offload, and more.
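A minimal Python sketch shows the packet spraying and out-of-order recovery concept at work: a message is split into sequenced packets, sprayed round-robin across several paths, and reassembled by sequence number at the receiver. The functions and data structures here are invented for illustration and are not the Ultra Ethernet transport’s actual framing or algorithms:

# Conceptual sketch only: spray one flow's packets across multiple paths and
# recover ordering at the receiver. Not the Ultra Ethernet wire protocol.
import random

def spray(message: bytes, chunk_size: int, num_paths: int):
    """Split a message into (sequence, payload, path) packets, round-robin over paths."""
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    return [(seq, payload, seq % num_paths) for seq, payload in enumerate(chunks)]

def receive(packets):
    """Paths have different delays, so packets arrive shuffled; reorder by sequence."""
    arrived = random.sample(packets, k=len(packets))    # simulate out-of-order arrival
    in_order = sorted(arrived, key=lambda pkt: pkt[0])  # out-of-order recovery
    return b"".join(payload for _, payload, _ in in_order)

message = b"gradient all-reduce shard " * 8
assert receive(spray(message, chunk_size=32, num_paths=4)) == message

Spreading a single flow across many paths keeps all links busy and avoids hot spots, while sequence-based recovery removes the need to deliver packets strictly in order.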

The UALink Consortium (short for Ultra Accelerator Link) provides an intelligent interconnect between accelerator nodes, including direct load, store, and atomic operations for software coherency. Based on the IEEE P802.3dj PHY layer, the initial release of the UALink specification defines a 200Gbps connection for up to 1024 accelerators. Scale up capability comes through a switched architecture, which connects nodes at low latency.
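A conceptual Python sketch, assuming an invented switch-and-node model rather than anything from the UALink specification, shows what load, store, and atomic operations across a switched pod of accelerators look like:

# Minimal sketch of the scale-up idea: accelerators issue loads, stores, and
# atomics against memory that physically resides on a peer accelerator, with a
# switch routing each operation to the owning node. Class names and the flat
# word-addressed memory are invented for illustration.

class Accelerator:
    def __init__(self, node_id: int, mem_words: int = 1024):
        self.node_id = node_id
        self.mem = [0] * mem_words

class Switch:
    """Routes memory operations to the accelerator that owns the target address."""
    def __init__(self, accelerators):
        self.nodes = {acc.node_id: acc for acc in accelerators}  # spec allows up to 1024

    def store(self, node_id: int, addr: int, value: int) -> None:
        self.nodes[node_id].mem[addr] = value

    def load(self, node_id: int, addr: int) -> int:
        return self.nodes[node_id].mem[addr]

    def atomic_add(self, node_id: int, addr: int, delta: int) -> int:
        mem = self.nodes[node_id].mem
        mem[addr] += delta  # hardware performs this without a race
        return mem[addr]

pod = Switch([Accelerator(i) for i in range(8)])
pod.store(node_id=3, addr=0, value=42)              # write into node 3's memory
print(pod.load(node_id=3, addr=0))                  # 42
print(pod.atomic_add(node_id=3, addr=0, delta=1))   # 43

The point of the direct load/store/atomic semantics is that software sees a shared pool of accelerator memory rather than an explicit message-passing network, which cuts the overhead of moving activations and gradients between nodes.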

A simplified view of a few racks shows these concepts at work, with UALink between nodes and Ultra Ethernet as the broadside interface to the AI infrastructure cluster.

UALink and Ultra Ethernet roles in AI infrastructure clusters

Proven Synopsys IP solutions speed implementation timeline

“Enabling scale up and scale out at once is a big story for our customers,” says Shukla. Keeping a foot in both consortia and aligning specification development with IP solution capability, Synopsys already has its Ultra Ethernet and UALink IP solutions in place. “Our efforts flowed from 25 years of IP solution development in Ethernet and PCIe technology and over 5,000 customer tape-outs.”

  • The Synopsys UALink IP solution comprises PHY, controller, and verification IP. The PHY is engineered for a 200Gbps per lane transfer rate, the controller implements memory-sharing capabilities for connecting up to 1024 nodes, and the verification suite provides protocol checking.
  • The Synopsys Ultra Ethernet IP solution starts with its proven 224G Ethernet PHY IP. It adds MAC and PCS controller layers to deliver 1.6Tbps SERDES links with minimal congestion, again with verification IP for advanced protocol features.
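As a rough sanity check on the link rates, simple lane arithmetic shows how a 224G SerDes PHY composes into a 1.6Tbps Ethernet port. The eight-lane configuration and the 200G usable rate per lane are typical assumptions for 1.6T Ethernet, not figures quoted above:

# Rough link-rate arithmetic; lane count and usable per-lane rate are assumptions.
serdes_raw_gbps = 224        # per-lane signaling rate of the 224G PHY
lanes = 8                    # assumed lanes aggregated into one 1.6T port
usable_gbps_per_lane = 200   # effective rate after coding/FEC overhead

print(f"raw on the wire: {serdes_raw_gbps * lanes} Gbps")       # 1792 Gbps
print(f"usable payload:  {usable_gbps_per_lane * lanes} Gbps")  # 1600 Gbps, i.e. 1.6 Tbps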

Both IP solutions optimize for power efficiency. “Think of it this way: if we save one picojoule of energy per bit, a data center may be able to save a gigawatt of interconnect power,” concludes Shukla. “We have low-risk IP solutions for Ultra Ethernet and UALink that are ready now for customer SoC designs.” With ecosystem interoperability established and engagements underway, he expects Ultra Ethernet and UALink chipsets to emerge in the next 18 to 24 months.
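The picojoule-to-gigawatt comment is straightforward energy arithmetic: power saved equals energy saved per bit times aggregate bits moved per second. Back-solving the implied traffic, which is not a number from the article, shows the scale at which such savings accrue:

# Power saved = (energy saved per bit) x (aggregate bits per second).
# The traffic figure is back-solved from the quote, not stated in the article.
saving_joules_per_bit = 1e-12   # one picojoule per bit
power_saved_watts = 1e9         # one gigawatt

implied_traffic_bps = power_saved_watts / saving_joules_per_bit
print(f"implied aggregate interconnect traffic: {implied_traffic_bps:.0e} bits/s")
# -> 1e+21 bits/s summed across every link, the kind of aggregate traffic only
#    very large future AI data centers would carry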

A three-year window from consortia formation through specification release to chipset products would be blazing fast. Still, Synopsys is confident because of its deep involvement in specification development and reuse of crucial IP elements. More details on the Ultra Ethernet and UALink IP solutions are available online from Synopsys.

News: Synopsys Announces Industry’s First Ultra Ethernet and UALink IP Solutions to Connect Massive AI Accelerator Clusters

Video: Industry First Ultra Ethernet and UALink IP

Blog post: Enabling Massive AI Clusters with the Industry’s First Ultra Ethernet and UALink IP Solutions

Also Read:

Synopsys Brings Multi-Die Integration Closer with its 3DIO IP Solution and 3DIC Tools

Enhancing System Reliability with Digital Twins and Silicon Lifecycle Management (SLM)

A Master Class with Ansys and Synopsys, The Latest Advances in Multi-Die Design
