We are currently in the hockey-stick growth phase of artificial intelligence (AI). Advances in AI are happening at a lightning pace, and while the rate of adoption is exploding, so is model size. Over the past couple of years, we’ve gone from about two billion parameters to Google Brain’s recently announced trillion-parameter AI language model, the largest yet. But with model size and complexity growing faster than hardware compute capability, AI hardware is running out of steam, and OEMs are looking for new solutions.
Compute challenges are nothing new to the technology industry, so what is different this time around? To date, the tried-and-true way to increase performance has been to multiply the number of tiles, cores, or chips. However, more cores and larger chips amplify a long-standing problem for designers. Chip developers must continuously battle the innate physics of large chips, combating skew, process variation, and aging effects, and these effects are only multiplied as companies transition to smaller process geometries. As engineers add more processors to a design, implementing a synchronous design at high frequencies becomes an almost insurmountable task. Physical design engineers are forced to overdesign these massive chips, leading to unnecessary clocking overhead and either a decrease in inference rates or an increase in training times.
At the recent Linley Spring Processor Conference, Movellus’ Aakash Jani presented on the challenges and opportunities of scaling performance in very large, many-core chip designs. He showed how an innovative approach to clocking allows greater synchronization throughout the design and enables more efficient, scalable performance for emerging AI applications.
Requirements Driving Scalable AI
Data centers, autonomous vehicles, and computer vision are some of the applications pushing the limits of scalable AI. The old approach of throwing more chips or processors at the problem does not lead to a scalable solution. Refer to the Figure below.
In the era of AI, big multicore chips are the new normal. More tiles or cores require more area, and more area leads to more power consumption, more interconnect, more latency, and more skew. All of these chip-infrastructure overhead problems are amplified in larger designs, and all of them are impacted by the clock network. The above graph shows some well-known AI processors that use a multicore approach to increase performance. The problem is that as more cores are added, the performance per core decreases. This is due to chip infrastructure overhead and, to a large degree, inefficiencies in the clock network.
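The diminishing returns described above can be sketched with a toy scaling model. The function and its numbers below are purely hypothetical (not taken from the graph or from Movellus data); they only illustrate how a per-core overhead that grows with core count erodes aggregate throughput.

```python
# Toy model with hypothetical numbers: each added core increases clock-network
# margin (skew guard-banding, interconnect latency), which shaves a small
# fraction off every core's effective performance.
def effective_throughput(num_cores, perf_per_core=1.0, overhead_per_core=0.002):
    """Aggregate throughput when clocking overhead scales with core count.

    Purely illustrative; the 0.2%-per-core overhead figure is an assumption.
    """
    efficiency = max(0.0, 1.0 - overhead_per_core * num_cores)
    return num_cores * perf_per_core * efficiency

for cores in (16, 64, 256):
    total = effective_throughput(cores)
    print(f"{cores:4d} cores -> total {total:7.2f}, per-core {total / cores:.3f}")
```

Running the sketch shows per-core performance sliding as the array grows, mirroring the trend the graph depicts: the cores themselves are not slower, but the infrastructure tax around them compounds.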
Today, designers address clocking issues with a divide-and-conquer approach. They may tackle the biggest offenders first and make incremental changes until they meet design requirements. But if we approach the problem holistically, there is an opportunity for major gains in power efficiency and performance. Additionally, we can open the door to creating large synchronous clock domains, allowing engineers to scale their systolic arrays for the next generation of multi-trillion-parameter models.
Movellus presented its holistic clocking solution: intelligent clock networks. What, exactly, is an intelligent clock network? Every chip begins with a perfect clock signal. However, as the signal travels through the chip, it is delayed and distorted by process variation and the physics of the chip itself. Intelligent clock networks bypass most of these problems to help clock architects deliver an ideal clock signal to every flop. These networks achieve this lofty goal using smart clock IP modules placed strategically throughout the chip. Smart clock modules use Movellus’ intelligent clock network technology to actively compensate for skew, process variation, and aging. Smart clock modules are also aware of other smart clock modules and can synchronize with them via a closed feedback loop to create large synchronous clock domains. The beauty of this approach is that it eliminates the need for a multitude of retiming flops and clock domain crossing (CDC) buffers, thereby avoiding a great deal of clocking overhead and system latency. It also reduces design complexity and greatly eases timing closure.
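The closed-feedback-loop idea can be illustrated with a minimal sketch. The class and numbers below are assumptions for illustration only (this is not Movellus’ implementation): each module measures its phase against a shared reference and trims a local delay line, DLL-style, until the skew across modules converges.

```python
# Illustrative sketch, not Movellus' actual design: each "smart clock module"
# closes a feedback loop that trims a local delay line toward a shared
# reference phase, driving module-to-module skew toward zero.
class SmartClockModule:
    def __init__(self, insertion_delay_ps):
        self.insertion_delay_ps = insertion_delay_ps  # fixed path delay (skew source)
        self.trim_ps = 0.0                            # adjustable delay-line trim

    def phase_ps(self):
        return self.insertion_delay_ps + self.trim_ps

    def adjust(self, reference_ps, gain=0.5):
        # Proportional feedback: correct a fraction of the phase error per step.
        error = reference_ps - self.phase_ps()
        self.trim_ps += gain * error

def synchronize(modules, cycles=20):
    """Iterate the feedback loop; return the residual skew in picoseconds."""
    for _ in range(cycles):
        reference = max(m.phase_ps() for m in modules)  # align to the slowest path
        for m in modules:
            m.adjust(reference)
    return max(m.phase_ps() for m in modules) - min(m.phase_ps() for m in modules)

skew = synchronize([SmartClockModule(d) for d in (120.0, 95.0, 143.0)])
print(f"residual skew: {skew:.3f} ps")
```

In this toy version, modules with shorter insertion delays add trim until they match the slowest path, so every flop sees an aligned edge without retiming flops in between; the real IP performs this compensation in hardware and also tracks variation and aging over time.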
The above chart compares Movellus’ intelligent clock network approach with today’s popular solutions, including a tool-driven methodology with clock tree synthesis (CTS) and a semi-custom strategy that implements a mesh. The chart shows design tradeoffs regarding fmax, useful clock period, process flexibility, power and area efficiency, and ease of timing closure. Intelligent clock networks combine the advantages of today’s solutions, offering the performance of a mesh at the power consumption of a tree.
Movellus showed how an intelligent clock network that takes a holistic approach to clocking delivers a significant performance enhancement compared to optimizing individual clock network components. The company introduced its new product, Maestro AI, an intelligent clock network IP platform. Maestro AI enables SoC designers to remove unwanted, accumulating system-level latency in larger chips and chiplets, and its intelligent clock network solutions occupy a much smaller area than alternatives. The solution enables designers to expand the size of synchronous clock domains. Since it is offered as soft IP, it is easily configurable to customer application requirements and portable to any process technology.
On-Demand Access to Aakash’s talk and presentation
You can listen to Aakash’s talk, “Advantages of Large-Scale Synchronous Clocking Domains in SoCs and Chiplets,” here, under Session 4. You will find his presentation slides here, under Day 2 – AM Sessions.