Performance, Power and Area (PPA) Benefits Through Intelligent Clock Networks

by Kalar Rajendiran on 12-10-2021 at 10:00 am

One of the sessions at the Linley Fall Processor Conference 2021 was the SoC Design session. With a horizontal focus, it included presentations relevant to a variety of market applications. The talk by Mo Faisal, CEO of Movellus, caught my attention as it promises to solve a chronic issue in synchronizing clock networks. While clock synchronization reduces the chance of signal hazards, the act of synchronization leads to performance, power and area inefficiencies. Over the years, many different approaches have been deployed to reduce these inefficiencies. But most of these techniques still depend on clock meshes and/or clock tree trunks and traces, and use clock buffers to fan out the clock signals.

While Mo’s talk was titled “Clock Networks in a multi-core AI SoC,” the solution he presented is applicable to all SoCs. The following is a synthesis of what I gathered from his presentation.

Drawbacks of Traditional Solutions

Traditional clock networks are either a mesh or a tree implemented with wires and buffers. The buffers have no insight into what is going on in the SoC, and the implementation is typically overdesigned with clock buffers. Movellus claims that SoCs lose about 30%-50% of their performance due to inefficiencies introduced by clock networks. In addition, clock networks add latency and impose a significant power overhead on the SoC’s total dynamic power (TDP) budget. Improving the quality of clock distribution networks can improve the PPA of the entire SoC.

Movellus’ Solution

Through its intelligent clock network technology named Maestro, Movellus can ameliorate or eliminate the inefficiencies introduced by traditional clock networks. Maestro technology consists of multiple components to achieve this. In his presentation, Mo shows a smart clock module (SCM) that is aware of on-chip variation (OCV), skew and temperature drift. The SCM senses and compensates for these effects, dynamically aligning the clock network across the entire SoC. It pushes the common clock point very close to the flops that the clocks drive.

Figure: What is Maestro ICN

Movellus’ architectural innovation drives the delivery of the following three benefits.

      • Latency Reduction
      • Energy Efficiency
      • Max Throughput

While the above attributes are typical requirements for most applications, they are particularly critical for today’s AI-driven edge applications.

The Maestro solution is offered in soft IP form and fits into any EDA tool flow, making it easy to integrate into any SoC.

Some Use Cases

The Maestro technology can bring benefits to both heterogeneous and homogeneous SoCs. A heterogeneous SoC consists of many different subsystems with different priorities, whether speed, power or timing closure. Refer to the figure below.

Figure: Maestro applications in an SoC

While Mo showcases the value of Maestro technology using a homogeneous SoC example through the bulk of his presentation, the insights gained can be directly applied to the different subsystems of a heterogeneous SoC such as the one shown above. One example is multi-rate communication without clock-domain-crossing (CDC) FIFOs. Consider an SoC with a compute core running at a higher frequency while the rest of the chip runs at half that clock rate. With the Maestro solution, data can be moved from I/O flop to I/O flop without having to add retiming flops and CDC FIFOs. In an AI SoC where the data bus is very wide, the Maestro solution saves a lot of retiming flops, reducing latency and improving PPA.

Mo calls the Maestro solution a very high-quality large-scale synchronization method at the lowest power possible.

Higher Speed

With Maestro, the SCM pushes the common clock point very close to the flops. Refer to the figure below for the intra-core example used. The core is 3 sq. mm in the N7 node, running at 2.5GHz. The divergent insertion delay was reduced from 750ps to 200ps. Even with the 5ps Maestro overhead, the OCV-driven speed sacrifice drops from 26% to 8.3%, delivering about an 18% gain in useful cycle time.

Figure: Intra-core clock network increasing Fmax
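
To make the intra-core numbers concrete, here is a minimal Python sketch of the arithmetic. It uses only the figures quoted above (2.5GHz, 26% and 8.3%). Reading the "about 18%" as the difference between the two sacrifice percentages, expressed in percentage points of the clock cycle, is my interpretation rather than something stated in the talk, and the function and variable names are purely illustrative.

```python
# Figures quoted in the paragraph above.
FREQ_GHZ = 2.5
SACRIFICE_TRADITIONAL = 0.26  # fraction of each cycle given up to OCV margin
SACRIFICE_MAESTRO = 0.083     # quoted figure, already includes the 5 ps overhead


def clock_period_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def useful_time_ps(period_ps: float, sacrifice: float) -> float:
    """Cycle time left for logic after the OCV-driven sacrifice."""
    return period_ps * (1.0 - sacrifice)


period = clock_period_ps(FREQ_GHZ)
before = useful_time_ps(period, SACRIFICE_TRADITIONAL)
after = useful_time_ps(period, SACRIFICE_MAESTRO)
# Assumption: the quoted gain is the difference in percentage points.
gain_points = (SACRIFICE_TRADITIONAL - SACRIFICE_MAESTRO) * 100.0

print(f"clock period: {period:.0f} ps")
print(f"useful time, traditional: {before:.1f} ps")
print(f"useful time, Maestro: {after:.1f} ps")
print(f"recovered: {gain_points:.1f} percentage points of the cycle")
```

Under these assumptions the 400ps cycle goes from roughly 296ps of useful time to roughly 367ps, which matches the "about 18%" quoted in the talk.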

Lower Power

Traditional global clock networks typically use some variation of a clock mesh to bring the clock to all the cores. The mesh is always on and always consuming power. Refer to the figure below for the example used. In this example, the traditional approach burns 2.5W all the time, independent of the SoC’s run-time utilization level. The total dynamic power (TDP) of the example SoC is 50W. Under the traditional approach, the global clock distribution power of 2.5W is 5% of the TDP. At a 20% utilization level, the same 2.5W is 25% of the 10W dynamic power consumption. Generally speaking, average utilization levels are well below 100%.

For this example, a Maestro implementation helps keep the global clock distribution power at or below 2.5% of the TDP under various utilization levels.

Figure: Global clock network Maestro benefits
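
The global clock power example can be reproduced with a short Python sketch. The 50W TDP, the 2.5W always-on clock power and the 2.5% Maestro figure come from the example above. The sketch assumes, as the example implies, that dynamic power scales linearly with utilization. The 50% utilization point and all names are my own additions for illustration.

```python
# Figures quoted in the example above.
TDP_W = 50.0                         # total dynamic power of the example SoC
CLOCK_POWER_TRADITIONAL_W = 2.5      # always-on global clock distribution
MAESTRO_CAP_FRACTION_OF_TDP = 0.025  # "at or below 2.5% of the TDP"


def clock_share(clock_power_w: float, utilization: float,
                tdp_w: float = TDP_W) -> float:
    """Share of the dynamic power at a given utilization that goes to the
    global clock network. Assumes dynamic power = TDP * utilization."""
    dynamic_w = tdp_w * utilization
    return clock_power_w / dynamic_w


# 100% and 20% are the cases from the talk; 50% is an illustrative extra.
for utilization in (1.0, 0.5, 0.2):
    share = clock_share(CLOCK_POWER_TRADITIONAL_W, utilization)
    print(f"{utilization:.0%} utilization: clock network is {share:.1%} "
          f"of dynamic power")

# Upper bound on Maestro global clock power implied by the 2.5% figure.
maestro_cap_w = MAESTRO_CAP_FRACTION_OF_TDP * TDP_W
print(f"Maestro global clock power is at most {maestro_cap_w:.2f} W")
```

The point of the sketch is that a fixed 2.5W cost grows as a share of the power budget whenever utilization drops, which is why an always-on mesh is expensive at typical utilization levels.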

Resultant Benefits

While the above examples quantified the efficiency gains along the speed and energy dimensions, there are other tangible benefits from using the Maestro technology. One is the ease of handling multi-rate clocks in a heterogeneous SoC. Another is the ease of implementing the global-level clock network. Once the intra-core clock network is fixed, the global clock network gets automatically corrected. All that is needed is to hook it up with a normal global-level clock tree straight out of clock tree synthesis. There is no need to balance the global clock distribution. The die area savings and latency reduction from avoiding a large number of buffers and/or retiming flops could be significant too.

New Opportunities to Innovate

Mo encourages SoC architects and implementation specialists to think of new use cases Maestro technology could enable in their designs. What can one do with a large-scale synchronization capability like this? Does this help with simplification of software? What can you do with extra timing margin?

Mo closes his talk with the following teaser. He suggests that the performance sacrificed to accommodate OCV effects is only 1/3 of the performance gain that the Maestro solution can deliver to an SoC. There are other details of the Maestro architecture that were not disclosed during the presentation. For more details, contact Movellus.
