The semiconductor industry is growing rapidly, driven by demand for high-speed circuits and low-power designs in new and evolving technologies such as IoT, networking chips, AI and robotics.
At lower technology nodes, timing closure becomes a major challenge because on-chip variation (OCV) increases, which changes both interconnect delay and cell delay. To avoid timing violations, the clock has to reach every flop at almost the same instant, which is difficult to achieve. The clock structure also consumes most of the power in a design, and the effect of OCV is greater in the clock network than in signal and other paths. It is therefore important to minimize the effect of OCV in the clock network by building a proper clock structure that reduces timing violations and variation effects while meeting the clock skew and latency requirements.
This article presents techniques to reduce timing violations and power consumption using an optimized mesh clock tree structure with different optimization switches. We use a mesh clock tree because it provides low skew and is less affected by OCV in high-performance VLSI designs than a conventional clock tree structure.
Keywords: clock tree synthesis (CTS), clock tree optimization, concurrent clock and data optimization (CCD), on-chip variation (OCV), design rule violations (DRVs), lower technology nodes, place and route flow.
Introduction
Clock tree synthesis (CTS) is the process of distributing the clock evenly to all sequential elements in a design while meeting the clock tree design rule constraints (DRVs such as max transition, max capacitance and max fanout), balancing the skew and minimizing insertion delay.
There are many types of clock structures namely H-Tree, X-Tree, Conventional clock tree, Multi source clock tree, Mesh Tree etc. In this article, we will focus on clock tree optimization of a mesh clock tree.
Mesh Tree Structure
A mesh tree has clock nets in a grid pattern that are driven by clock inverters and buffers. This structure gives lower skew, latency and on-chip variation than other clock structures. The network of inverter and buffer drivers from the clock port to the clock mesh drivers is known as the pre-mesh clock structure. An example of a clock mesh tree is shown in figure 1 below.
A mesh tree structure has high power consumption and requires a lot of routing resources because a whole layer is consumed by the clock tree structure. The mesh is generally created on the top layer to take advantage of the lower resistance of the top metals and to save routing resources for signal nets in the lower layers. A design can contain one mesh tree or multiple mesh trees.
Mesh terminals are created at a particular pitch in the X and Y directions, chosen on the basis of experiments. The first step is to create the mesh terminals; clock tree synthesis follows, with skew groups created according to the flop distribution in the design. First-level routing is done from the mesh terminal to the first buffer to reserve routing resources for the first-level clock nets. An inverter is connected to the clock gating cell. The network of clock inverters and buffers is then created up to the clock sinks, as shown in figure 2.
These clock gating cells are cloned according to the number of fanout sink points. In first-level cloning, the tool looks at the sink points and checks whether the fanout exceeds a certain limit. If it does, the clock gaters are cloned again according to the design rule violation (DRV) checks (max fanout, max capacitance and max transition). After cloning, clock tree synthesis is executed, followed by clock_opt, which performs timing, power and area optimizations.
Figure 2 : Clock Flow
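The fanout-driven cloning of clock gating cells described above can be illustrated with a small calculation. This is only a sketch of the idea, not the tool's algorithm: the function name clones_needed and the fanout limit of 32 are assumptions chosen for the example, and a real flow also checks max capacitance and max transition.

```python
import math


def clones_needed(sink_count: int, max_fanout: int) -> int:
    """Number of clock-gater copies so that no copy drives more than max_fanout sinks.

    Hypothetical helper for illustration; it models the max-fanout rule only.
    """
    if sink_count < 0 or max_fanout <= 0:
        raise ValueError("sink_count must be >= 0 and max_fanout must be > 0")
    # At least one gater always exists, even if it drives no sinks.
    return max(1, math.ceil(sink_count / max_fanout))


# Hypothetical case: one clock gater driving 100 flops with a fanout limit of 32.
copies = clones_needed(100, 32)
print(f"clock gater copies required: {copies}")
```

With these assumed numbers, the original gater would be split into several copies, each driving a share of the sinks.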
Block configuration
Mesh Layer: M13 (Mesh Terminals)
Target Latency: 250ps
Target Skew: 35ps
Mesh Terminal pitch X: 40.128 microns
Mesh Terminal pitch Y: 40.128 microns
For each experiment, a table compares the results of the same block with and without the optimization switch.
The comparison points are skew, setup slack, buffer count, inverter count, launch path and capture path latency, and power consumption of the clock network.
For this, I checked the pattern of violating paths in each design and picked one highly violating path from each design. All these switches are applied at the clock_opt stage.
Experiments
1) Enabling Global routing for timing and skew optimization.
Default : set_app_options -name cts.compile.enable_global_route -value false
Exp1 : set_app_options -name cts.compile.enable_global_route -value true
During clock tree synthesis, this option enables the global router at the initial stage. By default the option is false, and a virtual router is used instead of the global router during initial synthesis.
A virtual router is used at the pre-optimization stage for fast prediction of the wire pattern. It does not perform layer assignment and does not consider whether enough routing resources are available.
Global routing is the first step of the actual wire implementation. It tries to avoid global congestion. It takes longer to run but gives more accurate timing results.
So the advantage of the global router is more accurate timing results, with optimization based on an estimate of the routability and congestion in the design.
Results | Default | Using Switch |
Setup slack | -46.1ps | 9ps |
Launch path latency | 247.7ps | 222.3ps |
Capture path latency | 172.5ps | 191.02ps |
Skew | 75.2ps | 31.3ps |
CK capture path BUF/INV | Buff : X8, X24, X8, X4, X8 | Buff : X8, X32, X8, X12, X12 |
CK launch path BUF/INV | Buff : X8, X8, X12, X12 | Buff : X8, X32, X4, X8, X12 |
CKBUF Count | 6841 | 7273 |
CKINV Count | 844 | 864 |
CKBUFF Power | 42.2mW | 42.9mW |
CKINV Power | 4.08mW | 4.11mW |
Because global routing was enabled during clock tree synthesis, the synthesis was based on the actual wire implementation. Launch path latency and skew decreased, while capture path latency increased, and we got a margin of 9ps in setup timing at the cost of increased buffer and inverter counts. The total power consumed by the clock buffers and inverters in the whole design increased by 0.7mW and 0.03mW respectively. If the clock buffer count and power budgets can be relaxed, this switch is useful for reducing timing violations.
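The skew and power figures in the table for experiment 1 can be cross-checked with a few lines of arithmetic. The sketch below assumes skew on the selected path is simply launch clock latency minus capture clock latency; the function name skew_ps is introduced here for illustration only.

```python
def skew_ps(launch_latency_ps: float, capture_latency_ps: float) -> float:
    """Skew on one path, taken here as launch latency minus capture latency."""
    return launch_latency_ps - capture_latency_ps


# Latencies from the experiment 1 table (ps).
default_skew = skew_ps(247.7, 172.5)
switch_skew = skew_ps(222.3, 191.02)

# Clock cell power from the same table (mW).
buf_power_delta_mw = 42.9 - 42.2
inv_power_delta_mw = 4.11 - 4.08

print(f"default skew      : {default_skew:.1f} ps")
print(f"global-route skew : {switch_skew:.1f} ps")
print(f"CKBUF power delta : {buf_power_delta_mw:.2f} mW")
print(f"CKINV power delta : {inv_power_delta_mw:.2f} mW")
```

The computed skews match the Skew row of the table once rounded to one decimal place.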
2) Concurrent clock and data optimization(CCD)
set_app_options -name clock_opt.flow.enable_ccd -value true
When set to true, this app option performs concurrent clock and data (CCD) optimization at the clock_opt stage. In this technique, the tool optimizes the data path and the clock path concurrently.
This option also performs area and power optimization at the clock_opt stage.
Results | Default | Switch |
Setup slack | -46.1ps | 7ps |
Launch path latency | 247.7ps | 232.9ps |
Capture path latency | 172.5ps | 187.133ps |
Skew | 75.2ps | 45.67ps |
CK capture path BUF/INV | Buff : X8, X24, X8, X4, X8 | Buff : X8, X12, X32, X32 |
CK launch path BUF/INV | Buff : X8, X8, X12, X12 | Buff : X8, X32, X8, X20 |
CKBUF Count | 6841 | 5829 |
CKINV Count | 844 | 703 |
CKBUFF Power | 42.2mW | 41.4mW |
CKINV Power | 4.08mW | 3.79mW |
From the above table, we can see that the default experiment had -46.1ps setup slack, while with CCD optimization we got a margin of 7ps. From the 10 to 15 most violating paths, we concluded that CCD applies useful skew techniques during datapath optimization to improve the timing QoR. To fix the setup violation, the tool adjusts the launch and capture paths so that the launch clock path delay plus data path delay is reduced and the capture path delay is increased. The overall clock buffer and inverter count is lower than in the default experiment, so power consumption and area are reduced.
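To see how much of the setup improvement can be attributed to useful skew, the slack gain can be split into a clock part and a remainder. The sketch below uses a simplified setup model, slack = period + capture latency - launch latency - data delay - setup time, and ignores OCV derates, clock uncertainty and CRPR, so the split is an approximation and not a tool report. The function name is hypothetical.

```python
def clock_contribution_ps(launch_before: float, launch_after: float,
                          capture_before: float, capture_after: float) -> float:
    """Setup slack gained purely from clock latency changes (simplified model).

    A shorter launch latency and a longer capture latency both help setup.
    """
    return (launch_before - launch_after) + (capture_after - capture_before)


# Latencies and slacks from the experiment 2 table (ps).
clock_gain = clock_contribution_ps(247.7, 232.9, 172.5, 187.133)
total_gain = 7.0 - (-46.1)
remaining_gain = total_gain - clock_gain  # data path and other effects

print(f"total setup slack gain : {total_gain:.1f} ps")
print(f"from clock latencies   : {clock_gain:.3f} ps")
print(f"from data path / other : {remaining_gain:.3f} ps")
```

Under this simplified model, a little over half of the improvement on the selected path comes from shifting the launch and capture clock latencies, with the rest coming from the data path.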
3) Applying NDR
Default : set_app_options -name clock_opt.flow.optimize_ndr -value false
Exp : set_app_options -name clock_opt.flow.optimize_ndr -value true
The tool applies non-default routing rules (NDR) to long timing-critical nets during clock_opt optimization to improve timing. Applying NDR to a timing-critical net increases its width, which lowers its resistance and therefore its net delay.
Results | Default | Switch |
Setup slack | -46.1ps | -21ps |
Launch path latency | 247.7ps | 228.5ps |
Capture path latency | 172.5ps | 175.8ps |
Skew | 75.2ps | 52.7ps |
CK capture path BUF/INV | Buff : X8, X24, X8, X4, X8 | Buff : X8, X24, X32, X20 |
CK launch path BUF/INV | Buff : X8, X8, X12, X12 | Buff : X8, X32, X24, X8 |
CKBUF Count | 6841 | 6912 |
CKINV Count | 844 | 854 |
CKBUFF Power | 42.2mW | 42.5mW |
CKINV Power | 4.08mW | 4.16mW |
From the above table, WNS is -46.1ps in the default experiment and -21ps with NDR optimization. Launch path latency is lower than in the default experiment because NDR is applied to timing-critical nets, which decreases the net delays. However, the total clock buffer count, inverter count and power consumption increased. The setup slack is still negative after applying NDR on timing-critical nets, although it is better than in the default experiment; since no timing margin was available, we did not see any power optimization.
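The effect of a wider NDR wire on resistance can be sketched with the usual sheet-resistance relation R = Rs * L / W. All numbers below (sheet resistance, net length, widths) are hypothetical values picked for illustration and do not come from the block in this article. The sketch also ignores the extra capacitance of a wider wire, which offsets part of the delay benefit in practice.

```python
def wire_resistance_ohm(sheet_res_ohm_per_sq: float,
                        length_um: float, width_um: float) -> float:
    """Resistance of a rectangular wire: sheet resistance times number of squares."""
    if width_um <= 0:
        raise ValueError("width_um must be positive")
    return sheet_res_ohm_per_sq * length_um / width_um


# Hypothetical net: 200 um long, 0.5 ohm/sq sheet resistance.
default_r = wire_resistance_ohm(0.5, 200.0, 0.05)  # default-width wire
ndr_r = wire_resistance_ohm(0.5, 200.0, 0.10)      # double-width NDR wire

print(f"default width resistance : {default_r:.0f} ohm")
print(f"2x width NDR resistance  : {ndr_r:.0f} ohm")
print(f"resistance ratio         : {ndr_r / default_r:.2f}")
```

Doubling the width halves the resistance of the net, which is the mechanism behind the lower net delay on the NDR-routed nets.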
4) Enabling Area Recovery
set_app_options -name clock_opt.flow.enable_clock_power_recovery -value area
This option turns on power recovery in clock_opt optimization. The valid values are auto, none, power and area. By default it is auto when the CCD flow is enabled. In a non-CCD flow, auto means none. The value area turns on area recovery mode, in which the optimization is driven by area.
Results | Default | Switch |
Setup slack | -46.1ps | 2ps |
Launch path latency | 247.7ps | 229ps |
Capture path latency | 172.5ps | 177.8ps |
Skew | 75.2ps | 51.2ps |
CK capture path BUF/INV | Buff : X8, X24, X8, X4, X8 | Buff : X8, X18, X4, X8 |
CK launch path BUF/INV | Buff : X8, X8, X12, X12 | Buff : X8, X32, X24, X4 |
CKBUF Count | 6841 | 6894 |
CKINV Count | 844 | 811 |
CKBUFF Power | 42.2mW | 42.1mW |
CKINV Power | 4.08mW | 4.15mW |
From the above table, the combined count of clock buffers and inverters is higher than in the default experiment (7705 versus 7685), so the total area is greater in this experiment. clock_opt first tries to fix timing violations and then optimizes area if margin is available. After timing optimization, the setup margin was not sufficient for area recovery, so area optimization did not take place. Area recovery therefore requires timing margin.
5) Enabling power recovery
set_app_options -name clock_opt.flow.enable_clock_power_recovery -value power
As explained in experiment 4, this option also accepts the value power, with which the tool optimizes the design for power consumption.
Results | Default | Switch |
Setup slack | -46.1ps | 9ps |
Launch path latency | 247.7ps | 229.3ps |
Capture path latency | 172.5ps | 179.6ps |
Skew | 75.2ps | 49.7ps |
CK capture path BUF/INV | Buff : X8, X24, X8, X4, X8 | Buff : X8, X32, X24, X8 |
CK launch path BUF/INV | Buff : X8, X8, X12, X12 | Buff : X8, X18, X4, X8 |
CKBUF Count | 6841 | 6980 |
CKINV Count | 844 | 885 |
CKBUFF Power | 42.2mW | 42.1mW |
CKINV Power | 4.08mW | 4.11mW |
Here again, priority is given to timing rather than power. Margin was not available in this case either, so power recovery was not done.
6) Disabling Path groups for optimization if margin is available
set_app_options -name ccd.skip_path_groups -value {reg2mem mem2reg}
set_app_options -name clock_opt.flow.enable_ccd -value true
This app option skips the path groups named in the list. We can skip path groups that are not timing critical, so that the tool puts most of its effort into the paths that are timing critical.
Results | Default | Switch |
Setup slack | -46.1ps | 6ps |
Launch path latency | 247.7ps | 221.9ps |
Capture path latency | 172.5ps | 175ps |
Skew | 75.2ps | 46.9ps |
CK capture path BUF/INV | Buff : X8, X24, X8, X4, X8 | Buff : X8, X32, X24, X24, X8 |
CK launch path BUF/INV | Buff : X8, X8, X12, X12 | Buff : X8, X8, X12, X12 |
CKBUF Count | 6841 | 5968 |
CKINV Count | 844 | 674 |
CKBUFF Power | 42.2mW | 41.3mW |
CKINV Power | 4.08mW | 3.09mW |
In this block, two path groups have timing margin, so the tool does not spend its resources optimizing those paths while CCD optimization is enabled. The tool therefore concentrates on the timing-critical paths; as a result we get a positive timing margin, and the clock buffer count, inverter count and power are reduced.
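Choosing which path groups to skip amounts to picking the groups whose worst slack already has comfortable margin. The sketch below shows one way to build the option value from a slack summary. The slack numbers, the 50ps threshold and the helper name skippable_groups are all hypothetical; only the option name ccd.skip_path_groups and the group names reg2mem and mem2reg come from the experiment above.

```python
def skippable_groups(worst_slack_ps: dict, min_margin_ps: float) -> list:
    """Path groups whose worst setup slack is at least min_margin_ps."""
    return sorted(name for name, slack in worst_slack_ps.items()
                  if slack >= min_margin_ps)


# Hypothetical worst setup slack per path group (ps).
worst_slack_ps = {
    "reg2reg": -40.0,
    "in2reg": 5.0,
    "reg2mem": 60.0,
    "mem2reg": 85.0,
}

groups = skippable_groups(worst_slack_ps, min_margin_ps=50.0)
command = ("set_app_options -name ccd.skip_path_groups -value {"
           + " ".join(groups) + "}")
print(command)
```

Groups with negative or marginal slack stay in the optimization scope, so the tool's effort goes to the paths that need it.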
7) Hold Fixing
set_app_options -name ccd.hold_control_effort -value high
set_app_options -name clock_opt.flow.enable_ccd -value true
The first app option controls the hold optimization effort. It has five values: none, low, medium, high and ultra. By default it is set to low.
The hold slack is given in the table below.
Results | Default | Switch |
Hold slack | -89ps | 35ps |
Launch path latency | 238.76ps | 230.34ps |
Capture path latency | 194.57ps | 206.46ps |
Skew | 44.19ps | 23.88ps |
CK capture path BUF/INV | Buff : X8, X4, X16, X12, X8 | Buff : X8, X28, X12, X32, X4 |
CK launch path BUF/INV | Buff : X8, X12, X32, X8 | Buff : X8, X18, X32, X4; Delay Buff : DLX2 |
CKBUF Count | 6841 | 6124 |
CKINV Count | 844 | 756 |
CKBUFF Power | 42.2mW | 40.3mW |
CKINV Power | 4.08mW | 4.05mW |
In this experiment, hold timing is met with a 35ps margin and the skew is also reduced. One SVT delay buffer, DLX2, is added in the launch path to increase the data path delay. The total clock buffer and inverter count and the total power consumption are reduced by enabling CCD optimization.
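The reason an added delay buffer fixes a hold violation can be shown with a simplified hold check: data launched on a clock edge must not reach the capture flop earlier than the hold time after the same edge arrives there. The numbers below are hypothetical and are not taken from the table above; the model ignores OCV derates and clock uncertainty.

```python
def hold_slack_ps(launch_latency_ps: float, data_delay_ps: float,
                  capture_latency_ps: float, hold_time_ps: float) -> float:
    """Simplified hold slack: data arrival time minus required hold time."""
    arrival = launch_latency_ps + data_delay_ps
    required = capture_latency_ps + hold_time_ps
    return arrival - required


# Hypothetical path: short data path, capture clock later than launch clock.
before = hold_slack_ps(200.0, 30.0, 230.0, 20.0)

# Same path with an assumed 45 ps delay buffer added to the data path.
after = hold_slack_ps(200.0, 30.0 + 45.0, 230.0, 20.0)

print(f"hold slack before delay buffer: {before:.1f} ps")
print(f"hold slack after delay buffer : {after:.1f} ps")
```

In this assumed case, the extra data path delay moves the hold slack from negative to positive without touching the clock latencies.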
Conclusion
All the optimization switches are used to optimize power, area or timing. Timing is given the highest priority; if timing margin is available afterwards, the tool tries to optimize the design for power and area. These switches will not necessarily reduce timing violations, since that depends on the block complexity. So we can use the above switches to optimize timing, but after the optimization we still need to check the results against the target latency and target skew.
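The setup experiments can be put side by side with a short script. The values below are copied from the result tables of experiments 1 to 6; the hold experiment is left out because it reports hold slack instead of setup slack. The summarize helper is introduced here only for illustration.

```python
# (setup slack in ps, CKBUF power in mW, CKINV power in mW) from the tables above.
results = {
    "default": (-46.1, 42.2, 4.08),
    "global route": (9.0, 42.9, 4.11),
    "CCD": (7.0, 41.4, 3.79),
    "NDR": (-21.0, 42.5, 4.16),
    "area recovery": (2.0, 42.1, 4.15),
    "power recovery": (9.0, 42.1, 4.11),
    "skip path groups": (6.0, 41.3, 3.09),
}


def summarize(table: dict) -> list:
    """Return (name, setup slack, total clock power, setup met?) per experiment."""
    rows = []
    for name, (slack_ps, buf_mw, inv_mw) in table.items():
        rows.append((name, slack_ps, round(buf_mw + inv_mw, 2), slack_ps >= 0))
    return rows


summary = summarize(results)
for name, slack_ps, power_mw, met in summary:
    status = "met" if met else "violated"
    print(f"{name:<17} slack {slack_ps:>6.1f} ps  clock power {power_mw:>5.2f} mW  setup {status}")

meets_setup = [name for name, _, _, met in summary if met]
```

Such a summary makes it easier to weigh the setup margin of each switch against its clock power on the selected path before choosing a combination for a given block.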
Authors
Haswant Kumar (ASIC Physical Design Engineer)
Bhavik Balwani (ASIC Physical Design Engineer)