Techniques to Reduce Timing Violations using Clock Tree Optimizations in Synopsys IC Compiler II

by eInfochips on 08-27-2020 at 10:00 am

The semiconductor industry is growing rapidly, driven by demand for high-speed circuits and low-power designs in new and evolving technologies such as IoT, networking chips, AI and robotics.

At lower technology nodes, timing closure becomes a major challenge because increased on-chip variation (OCV) changes both interconnect delay and cell delay. It is difficult for the clock to reach every flop at almost the same instant, which is what is needed to avoid timing violations. The clock structure also consumes most of the power in a design, and the effect of OCV is greater in the clock network than in signal and other paths. It is therefore important to minimize the effect of OCV in the clock network by building a proper clock structure, which reduces timing violations and variation effects while meeting the clock skew and latency requirements.

This article describes techniques to reduce timing violations and power consumption using an optimized mesh clock tree structure with different optimization switches. We used a mesh clock tree structure because it provides low skew and is less affected by OCV in high-performance VLSI designs than a conventional clock tree structure.

Keywords: clock tree synthesis (CTS), clock tree optimization, concurrent clock and data optimization (CCD), on-chip variation (OCV), design rule violations (DRVs), lower technology nodes, place and route flow.

Introduction
Clock tree synthesis is the process that distributes the clock evenly to all sequential elements in a design while meeting the clock tree design rule violation (DRV) limits, such as max transition, max capacitance and max fanout, balancing the skew and minimizing insertion delay.

There are many types of clock structures namely H-Tree, X-Tree, Conventional clock tree, Multi source clock tree, Mesh Tree etc. In this article, we will focus on clock tree optimization of a mesh clock tree.

Mesh Tree Structure
A mesh tree has clock nets laid out in a grid pattern and driven by clock inverters and buffers. This structure gives lower skew, latency and on-chip variation than other clock structures. The network of inverter and buffer drivers from the clock port to the clock mesh drivers is known as the pre-mesh clock structure. An example of a clock mesh tree is shown in figure 1 below.

A mesh tree structure has high power consumption and requires substantial routing resources because the whole layer is consumed by the clock tree structure. The mesh is generally created on the top layer to take advantage of the lower metal resistance there and to save routing resources for signal nets in the lower layers. A design can contain one mesh tree or several.

Mesh terminals are created at a particular pitch in the X and Y directions, chosen on the basis of various experiments. The first step is to create the mesh terminals, as shown in figure 2. Clock tree synthesis follows, with skew groups created according to the flop distribution in the design. First-level routing is done from the mesh terminal to the first buffer to reserve routing resources for the first-level clock nets. An inverter is connected to the clock gating cell, and the network of clock inverters and buffers is then built up to the clock sinks, as shown in figure 2.

These clock gating cells are cloned according to the number of fanout sink points. In first-level cloning, the tool looks at the sink points and checks whether the fanout exceeds a certain limit. If it does, the clock gaters are cloned again according to the design rule violation (DRV) limits: max fanout, max capacitance and max transition. After cloning, clock tree synthesis is executed, followed by clock_opt, which performs timing, power and area optimizations.
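The fanout-driven cloning step can be illustrated with a small sketch. It is not the tool's algorithm: it only shows how a fanout limit sets the minimum number of clock gater copies. The function names, the sink names and the limit of 32 are hypothetical, and a real flow would also group sinks by placement and check the capacitance and transition limits.

```python
import math


def clones_needed(sink_count, max_fanout):
    """Minimum number of gater copies so that no copy drives more than max_fanout sinks."""
    if max_fanout <= 0:
        raise ValueError("max_fanout must be positive")
    return max(1, math.ceil(sink_count / max_fanout))


def split_sinks(sinks, max_fanout):
    """Split the sink list into roughly equal groups of at most max_fanout sinks."""
    copies = clones_needed(len(sinks), max_fanout)
    group_size = max(1, math.ceil(len(sinks) / copies))
    return [sinks[i:i + group_size] for i in range(0, len(sinks), group_size)]


# Hypothetical example: one clock gater driving 70 flops with a fanout limit of 32.
sinks = [f"flop_{i}" for i in range(70)]
groups = split_sinks(sinks, max_fanout=32)
print(len(groups), [len(g) for g in groups])
```

Splitting into equal-sized groups, rather than filling each copy up to the limit, keeps the load on the cloned gaters similar, which is one plausible way to avoid introducing skew between them.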

eInfochips clock flow

Figure 2 : Clock Flow

Block configuration

eInfochips Mesh Tree Structure

Mesh Layer: M13 (Mesh Terminals)

Target Latency: 250ps

Target Skew: 35ps

Mesh Terminal pitch X: 40.128 microns

Mesh Terminal pitch Y: 40.128 microns

For each experiment, I have provided a table comparing the results for the same block with and without the optimization switch.

The comparison points are skew, setup slack, buffer count, inverter count, launch path and capture path latency, and the power consumed by the clock network.

To collect these, I checked the pattern of violating paths in each design and picked one highly violating path from each. All of these switches are applied at the clock_opt stage.


Experiments
1) Enabling Global routing for timing and skew optimization.

Default : set_app_options -name cts.compile.enable_global_route -value false

Exp1 : set_app_options -name cts.compile.enable_global_route -value true

This option enables the global router during the initial stage of clock tree synthesis. By default it is false, and a virtual router is used for initial synthesis instead.

A virtual router is used at the pre-optimization stage for fast prediction of the wire pattern. It does not perform layer assignment and does not consider whether enough routing resources are available.

Global routing is the first step of the actual wire implementation. It tries to avoid global congestion, and it takes longer but gives more accurate timing results.

The advantage of the global router is therefore more accurate timing, with optimization based on an estimate of the routability and congestion in the design.

| Results | Default | Using Switch |
| --- | --- | --- |
| Setup slack | -46.1ps | 9ps |
| Launch path latency | 247.7ps | 222.3ps |
| Capture path latency | 172.5ps | 191.02ps |
| Skew | 75.2ps | 31.3ps |
| CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X32, X8, X12, X12 |
| CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X32, X4, X8, X12 |
| CKBUF Count | 6841 | 7273 |
| CKINV Count | 844 | 864 |
| CKBUFF Power | 42.2mW | 42.9mW |
| CKINV Power | 4.08mW | 4.11mW |

Because global routing was enabled during clock tree synthesis, the synthesis was based on the actual wire implementation. Launch path latency and skew decreased, while capture path latency increased. We got a setup margin of 9ps at the cost of an increased buffer and inverter count. The total power consumed by the clock buffers and inverters in the whole design increased by 0.7mW and 0.03mW respectively. If the clock buffer count and power budgets can be relaxed, this switch is useful for reducing timing violations.
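The figures in the Experiment 1 table can be cross-checked with a few lines of arithmetic. The sketch below assumes that the reported skew is simply the launch path latency minus the capture path latency of the selected path, which matches the tabulated values; the dictionary keys and function names are illustrative only.

```python
# Values copied from the Experiment 1 table (latency in ps, power in mW).
default_run = {"launch_ps": 247.7, "capture_ps": 172.5,
               "ckbuf_count": 6841, "ckinv_count": 844,
               "ckbuf_mw": 42.2, "ckinv_mw": 4.08}
switch_run = {"launch_ps": 222.3, "capture_ps": 191.02,
              "ckbuf_count": 7273, "ckinv_count": 864,
              "ckbuf_mw": 42.9, "ckinv_mw": 4.11}


def path_skew(run):
    """Skew of the reported path: launch latency minus capture latency."""
    return run["launch_ps"] - run["capture_ps"]


def delta(key):
    """Change from the default run to the run with the switch enabled."""
    return switch_run[key] - default_run[key]


print(f"default skew: {path_skew(default_run):.1f} ps")
print(f"switch skew:  {path_skew(switch_run):.1f} ps")
print(f"extra buffers: {delta('ckbuf_count')}, extra inverters: {delta('ckinv_count')}")
print(f"buffer power: {delta('ckbuf_mw'):+.2f} mW, inverter power: {delta('ckinv_mw'):+.2f} mW")
```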

2) Concurrent clock and data optimization(CCD)
set_app_options -name clock_opt.flow.enable_ccd -value true

When set to true, this app option performs concurrent clock and data (CCD) optimization at the clock_opt stage. The CCD technique optimizes the data path and the clock path concurrently.

This app option also performs area and power optimization at the clock_opt stage.

| Results | Default | Switch |
| --- | --- | --- |
| Setup slack | -46.1ps | 7ps |
| Launch path latency | 247.7ps | 232.9ps |
| Capture path latency | 172.5ps | 187.133ps |
| Skew | 75.2ps | 45.67ps |
| CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X12, X32, X32 |
| CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X32, X8, X20 |
| CKBUF Count | 6841 | 5829 |
| CKINV Count | 844 | 703 |
| CKBUFF Power | 42.2mW | 41.4mW |
| CKINV Power | 4.08mW | 3.79mW |

From the above table, the default experiment had -46.1ps setup slack, while CCD optimization gave a margin of 7ps. From observing the 10 to 15 most violating paths, we concluded that CCD applies useful-skew techniques during datapath optimization to improve the timing QoR. To fix a setup violation, the tool adjusts the launch and capture paths so that the launch clock path delay plus data path delay is reduced and the capture path delay is increased. The overall clock buffer and inverter count is lower than in the default experiment, so power consumption and area are reduced.
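The useful-skew effect can be sketched with the standard setup check, in which slack is the required time minus the arrival time. The clock latencies below come from the Experiment 2 table, but the clock period, data path delay and setup time are hypothetical values chosen only for illustration, so the absolute slack numbers are not the article's. Clock uncertainty and OCV derates are ignored.

```python
def setup_slack_ps(period, launch_latency, capture_latency, data_delay, setup_time):
    """Setup slack = required time - arrival time (all values in ps)."""
    arrival = launch_latency + data_delay
    required = period + capture_latency - setup_time
    return required - arrival


# Hypothetical path constants, not taken from the article.
PERIOD, DATA_DELAY, SETUP_TIME = 500.0, 420.0, 30.0

# Clock latencies of the reported path, from the Experiment 2 table.
before = setup_slack_ps(PERIOD, 247.7, 172.5, DATA_DELAY, SETUP_TIME)
after = setup_slack_ps(PERIOD, 232.9, 187.133, DATA_DELAY, SETUP_TIME)

skew_before = 247.7 - 172.5
skew_after = 232.9 - 187.133

print(f"slack before: {before:.1f} ps, after: {after:.1f} ps")
print(f"slack gained: {after - before:.1f} ps, skew removed: {skew_before - skew_after:.1f} ps")
```

With the data path delay held constant, the slack gain equals the reduction in launch-minus-capture latency. Any further gain, as CCD can deliver, has to come from a shorter data path.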

3) Applying NDR
Default : set_app_options -name clock_opt.flow.optimize_ndr -value false

Exp : set_app_options -name clock_opt.flow.optimize_ndr -value true

The tool applies non-default routing rules (NDR) to long timing-critical nets during clock_opt optimization to improve timing. Applying an NDR to a timing-critical net increases its width, which decreases the net's resistance and therefore its delay.
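To see why a wider wire helps, the sketch below uses the usual sheet-resistance model, R = R_sheet × length / width, together with a lumped Elmore estimate of wire delay. All numbers (sheet resistance, wire length, widths and capacitances) are hypothetical and not taken from the article or any technology file. The wider wire is assumed to have somewhat more capacitance, which partly offsets the gain.

```python
def wire_resistance_ohm(sheet_res_ohm_per_sq, length_um, width_um):
    """Resistance of a rectangular wire from its sheet resistance."""
    return sheet_res_ohm_per_sq * length_um / width_um


def elmore_delay_ps(r_wire_ohm, c_wire_ff, c_load_ff):
    """Lumped Elmore delay of one wire segment driving a load (ohm * fF = 1e-3 ps)."""
    return r_wire_ohm * (c_wire_ff / 2.0 + c_load_ff) * 1e-3


# Hypothetical numbers for a 200 um net on one metal layer.
SHEET_RES = 0.05   # ohm per square (assumed)
LENGTH_UM = 200.0
C_LOAD_FF = 10.0

r_default = wire_resistance_ohm(SHEET_RES, LENGTH_UM, width_um=0.1)
r_ndr = wire_resistance_ohm(SHEET_RES, LENGTH_UM, width_um=0.2)  # double-width NDR

delay_default = elmore_delay_ps(r_default, c_wire_ff=40.0, c_load_ff=C_LOAD_FF)
delay_ndr = elmore_delay_ps(r_ndr, c_wire_ff=48.0, c_load_ff=C_LOAD_FF)  # assumed +20% wire cap

print(f"R: {r_default:.0f} ohm -> {r_ndr:.0f} ohm")
print(f"wire delay: {delay_default:.2f} ps -> {delay_ndr:.2f} ps")
```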

| Results | Default | Switch |
| --- | --- | --- |
| Setup slack | -46.1ps | -21ps |
| Launch path latency | 247.7ps | 228.5ps |
| Capture path latency | 172.5ps | 175.8ps |
| Skew | 75.2ps | 52.7ps |
| CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X24, X32, X20 |
| CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X32, X24, X8 |
| CKBUF Count | 6841 | 6912 |
| CKINV Count | 844 | 854 |
| CKBUFF Power | 42.2mW | 42.5mW |
| CKINV Power | 4.08mW | 4.16mW |

From the above table, the WNS is -46.1ps in the default experiment and -21ps with NDR optimization. Launch path latency is lower than in the default experiment because NDR is applied to timing-critical nets, which decreases their net delay. However, the total clock buffer count, inverter count and power consumption all increased. Even after applying NDR to timing-critical nets, the setup slack is still negative, although better than in the default experiment; because no timing margin was available, we did not see any power optimization.

4) Enabling Area Recovery
set_app_options -name clock_opt.flow.enable_clock_power_recovery -value area

This option turns on clock power recovery in clock_opt optimization. The valid values are auto, none, power and area. The default is auto when the CCD flow is enabled; in a non-CCD flow, auto means none. Setting the value to area turns on area recovery mode, in which the optimization is driven by area.

| Results | Default | Switch |
| --- | --- | --- |
| Setup slack | -46.1ps | 2ps |
| Launch path latency | 247.7ps | 229ps |
| Capture path latency | 172.5ps | 177.8ps |
| Skew | 75.2ps | 51.2ps |
| CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X18, X4, X8 |
| CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X32, X24, X4 |
| CKBUF Count | 6841 | 6894 |
| CKINV Count | 844 | 811 |
| CKBUFF Power | 42.2mW | 42.1mW |
| CKINV Power | 4.08mW | 4.15mW |

The table shows that the combined number of clock buffers and inverters is greater than in the default experiment (the buffer count rises while the inverter count falls), so the total area is greater in this experiment. clock_opt first tries to fix timing violations and then optimizes area if margin is available. After timing optimization, the setup margin was not sufficient for area recovery, so area optimization did not take place. Area recovery therefore requires timing margin.

5) Enabling power recovery
set_app_options -name clock_opt.flow.enable_clock_power_recovery -value power

As explained in experiment 4, this option also accepts the value power, with which the tool optimizes the design for power consumption.

| Results | Default | Switch |
| --- | --- | --- |
| Setup slack | -46.1ps | 9ps |
| Launch path latency | 247.7ps | 229.3ps |
| Capture path latency | 172.5ps | 179.6ps |
| Skew | 75.2ps | 49.7ps |
| CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X32, X24, X8 |
| CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X18, X4, X8 |
| CKBUF Count | 6841 | 6980 |
| CKINV Count | 844 | 885 |
| CKBUFF Power | 42.2mW | 42.1mW |
| CKINV Power | 4.08mW | 4.11mW |

Here again, priority is given to timing rather than power. No margin was available in this experiment either, so power recovery was not performed.
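The clock cell counts and clock power of the two recovery experiments can be totalled for a side-by-side comparison. The sketch below uses only the counts and power figures from the Experiment 4 and Experiment 5 tables. Treating the combined cell count as a proxy for area is an assumption, since the cells differ in drive strength; the function names are illustrative.

```python
# Counts and power copied from the Experiment 4 and Experiment 5 tables.
runs = {
    "default": {"ckbuf": 6841, "ckinv": 844, "ckbuf_mw": 42.2, "ckinv_mw": 4.08},
    "area": {"ckbuf": 6894, "ckinv": 811, "ckbuf_mw": 42.1, "ckinv_mw": 4.15},
    "power": {"ckbuf": 6980, "ckinv": 885, "ckbuf_mw": 42.1, "ckinv_mw": 4.11},
}


def clock_cells(run):
    """Combined clock buffer and inverter count (a rough proxy for clock area)."""
    return run["ckbuf"] + run["ckinv"]


def clock_power_mw(run):
    """Combined clock buffer and inverter power in mW."""
    return run["ckbuf_mw"] + run["ckinv_mw"]


baseline = runs["default"]
for name, run in runs.items():
    cell_delta = clock_cells(run) - clock_cells(baseline)
    power_delta = clock_power_mw(run) - clock_power_mw(baseline)
    print(f"{name:8s} cells={clock_cells(run)} ({cell_delta:+d})  "
          f"power={clock_power_mw(run):.2f} mW ({power_delta:+.2f})")
```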

6) Disabling Path groups for optimization if margin is available
set_app_options -name ccd.skip_path_groups   -value {reg2mem mem2reg}

set_app_options -name clock_opt.flow.enable_ccd -value true

This app option skips the path groups named in the list during CCD optimization. We can skip path groups that are not timing critical so that the tool puts most of its effort into the paths that are.
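One way to decide which path groups to list is to compare each group's worst slack against a margin and skip only the groups that clear it. The sketch below does that and prints the resulting command string. The slack values, the margin and the helper names are hypothetical; in practice the slacks would come from a timing report.

```python
def groups_to_skip(worst_slack_ps, margin_ps):
    """Path groups whose worst slack exceeds the margin, in their original order."""
    return [group for group, slack in worst_slack_ps.items() if slack > margin_ps]


def skip_command(groups):
    """Format the app option line used in this experiment for the given groups."""
    return "set_app_options -name ccd.skip_path_groups -value {" + " ".join(groups) + "}"


# Hypothetical worst setup slack per path group, in ps.
worst_slack_ps = {"reg2reg": -46.1, "reg2mem": 35.0, "mem2reg": 22.0}

skipped = groups_to_skip(worst_slack_ps, margin_ps=10.0)
print(skip_command(skipped))
```

A positive margin is used instead of zero so that a group close to violating is still optimized.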

| Results | Default | Switch |
| --- | --- | --- |
| Setup slack | -46.1ps | 6ps |
| Launch path latency | 247.7ps | 221.9ps |
| Capture path latency | 172.5ps | 175ps |
| Skew | 75.2ps | 46.9ps |
| CK capture path BUF/INV | Buff: X8, X24, X8, X4, X8 | Buff: X8, X32, X24, X24, X8 |
| CK launch path BUF/INV | Buff: X8, X8, X12, X12 | Buff: X8, X8, X12, X12 |
| CKBUF Count | 6841 | 5968 |
| CKINV Count | 844 | 674 |
| CKBUFF Power | 42.2mW | 41.3mW |
| CKINV Power | 4.08mW | 3.09mW |

In this block, two path groups have timing margin, so the tool does not spend effort optimizing those paths while CCD optimization is enabled. The tool therefore concentrates on the timing-critical paths, and we get a positive timing margin while the clock buffer count, inverter count and power are all reduced.

7) Hold Fixing
set_app_options -name ccd.hold_control_effort -value high

set_app_options -name clock_opt.flow.enable_ccd -value true

The first app option controls the hold optimization effort. It has five values: none, low, medium, high and ultra. By default it is set to low.

The table below reports hold slack in place of setup slack.

| Results | Default | Switch |
| --- | --- | --- |
| Hold slack | -89ps | 35ps |
| Launch path latency | 238.76ps | 230.34ps |
| Capture path latency | 194.57ps | 206.46ps |
| Skew | 44.19ps | 23.88ps |
| CK capture path BUF/INV | Buff: X8, X4, X16, X12, X8 | Buff: X8, X28, X12, X32, X4 |
| CK launch path BUF/INV | Buff: X8, X12, X32, X8 | Buff: X8, X18, X32, X4; Delay Buff: DLX2 |
| CKBUF Count | 6841 | 6124 |
| CKINV Count | 844 | 756 |
| CKBUFF Power | 42.2mW | 40.3mW |
| CKINV Power | 4.08mW | 4.05mW |

In this experiment, hold timing is met with a 35ps margin and the skew is also reduced. One SVT delay buffer (DLX2) is added in the launch path to increase the data path delay. With CCD optimization enabled, the total clock buffer and inverter count and the total power consumption are both reduced.
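The effect of the added delay buffer can be sketched with the standard hold check, in which data launched on a clock edge must not reach the capturing flop before that flop's hold window for the same edge closes. Every number below is hypothetical and chosen only to show the mechanism. The values are not from the table, and clock uncertainty and OCV derates are ignored.

```python
def hold_slack_ps(launch_latency, capture_latency, min_data_delay, hold_time):
    """Hold slack = earliest data arrival - hold requirement (all values in ps)."""
    arrival = launch_latency + min_data_delay
    required = capture_latency + hold_time
    return arrival - required


# Hypothetical path: the capture clock arrives later than the launch clock.
LAUNCH, CAPTURE, MIN_DATA, HOLD = 200.0, 230.0, 40.0, 20.0
DELAY_BUFFER = 30.0  # assumed delay of one inserted delay buffer

before = hold_slack_ps(LAUNCH, CAPTURE, MIN_DATA, HOLD)
after = hold_slack_ps(LAUNCH, CAPTURE, MIN_DATA + DELAY_BUFFER, HOLD)

print(f"hold slack before: {before:.1f} ps, after adding the delay buffer: {after:.1f} ps")
```

In this simple model each picosecond of added data path delay buys one picosecond of hold slack, at the cost of the same amount of setup slack on that path.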

Conclusion
All of these optimization switches are used to optimize power, area or timing. Timing is given the highest priority; only if timing margin is available does the tool then try to optimize the design for power and area. These switches will not necessarily reduce timing violations, since the outcome depends on the block complexity. We can use the above switches to optimize timing, but after optimization we still need to check the results against the target latency and target skew.
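As a rough illustration of that final check, the sketch below compares the per-path figures from the experiment tables against the block targets (250ps latency, 35ps skew). The tables report one selected path per experiment, so this is only an indication: a real check would use the block-level latency and skew reported by the tool. The experiment labels and helper names are illustrative.

```python
TARGET_LATENCY_PS = 250.0
TARGET_SKEW_PS = 35.0

# (launch path latency, skew) of the reported path with each switch enabled, from the tables.
results = {
    "1: global route": (222.3, 31.3),
    "2: CCD": (232.9, 45.67),
    "3: NDR": (228.5, 52.7),
    "4: area recovery": (229.0, 51.2),
    "5: power recovery": (229.3, 49.7),
    "6: skip path groups": (221.9, 46.9),
    "7: hold fixing": (230.34, 23.88),
}


def meets_targets(latency_ps, skew_ps):
    """True when both the latency and the skew are within the block targets."""
    return latency_ps <= TARGET_LATENCY_PS and skew_ps <= TARGET_SKEW_PS


passing = [name for name, (latency, skew) in results.items() if meets_targets(latency, skew)]
for name, (latency, skew) in results.items():
    status = "within targets" if name in passing else "skew above target"
    print(f"{name:20s} latency={latency:7.2f} ps  skew={skew:6.2f} ps  {status}")
```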

eInfochips (An Arrow Company) can help tech companies solve CTS implementation challenges in their ASIC designs by drawing on efficient, well-practiced ASIC design processes. We have subject matter experts to work on highly challenging product design and development requirements. Our expertise helps semiconductor and product companies shorten their time-to-market while addressing challenges related to power, timing and area. For more information, contact us today.

Authors
Haswant Kumar (ASIC Physical Design Engineer)
Bhavik Balwani (ASIC Physical Design Engineer)

