Perhaps you’ve heard the term “clock gating” and you’re wondering how it works, or maybe you know what clock gating is and you’re wondering how to best implement it. Either way, this post is for you.
Why Power Matters
I can’t help but laugh when I watch a movie where the main characters are shrunk down to the size of grains of sand and they have to fight off ants the size of T-Rexes. It’s not that I’m amused by giant ants; it’s the unquestioned assumption that our bodies would act the same at such a small scale that gets to me. I know, it’s just a movie, but it still makes me cringe. Things just wouldn’t scale that way. If I’ve got my physics right, our mass would scale cubically, while our surface area would scale quadratically as would our strength. As a result, we’d be super strong, but it wouldn’t matter since we’d freeze to death almost instantly from heat loss. Plenty of other factors come into play, and I’m sure I’d get them wrong if I tried, but you get the idea.
But, what does “Honey I Shrunk the Kids” have to do with clock gating? Well, I began designing silicon in the 90s. At that time the only thing that mattered was performance. Since then, transistors have shrunk a bit–a lot actually–just like Rick Moranis. And their properties scale by different factors. One that’s getting out of control is power. According to “The Dark Silicon Problem and What it Means for CPU Designers”, heat generation per unit of silicon area is “somewhere between the inside of a nuclear reactor and the surface of a star,” and that was in 2013. Power is now a first-order concern. In fact, we find ourselves in a new situation where we have more transistors available to us than we can afford to use. In a very real sense, the best way to get more performance is now to save more power.
I was motivated to write this post today by a Linkedin notification I received this morning, letting me know that I had been quoted in a post by Brian Bailey of Semiconductor Engineering entitled “Taking Power More Seriously”. Bailey provides an excellent high-level overview of the myriad challenges of designing for power. One of those challenges is to implement fine-grained clock gating. As an EDA developer myself, of tools that, among other things, automate clock gating, I felt it timely to dive deeper into the topic.
What is clock gating?
Several factors contribute to a circuit’s power consumption. The logic gates have static or leakage power that is roughly constant as long as a voltage is applied to them, and they have dynamic or switching power resulting from toggling wires. Flip-flops are rather power-hungry, accounting for maybe ~20% of total power. Clocks can consume even more, perhaps ~40%! Global clocks go everywhere, and they toggle twice each cycle. As we’ll see, clock gating avoids toggling the clock when clock pulses are not needed. This reduces the power consumption of clock distribution and flip-flops, and it can even reduce dynamic power for logic gates.
Even in a busy circuit, when you look closer, most of the logic is not doing meaningful work most of the time. In this trace of a WARP-V CPU core, for example, the CPU is executing instructions nearly every cycle. But the logic computing branch targets isn’t busy. It is only needed for branch instructions. And floating-point logic is only needed for floating-point instructions, etc. Most signal values in the trace below are gray, indicating that they aren’t used.
As previously noted, a significant portion of overall power is consumed by driving clock signals to flip-flops so the flip-flops can propagate their input values to their outputs for the next cycle of execution. If most of these flip-flop input signals are meaningless, there’s no need to propagate them, and we’re wasting a lot of power.
Clock gating cuts out clock pulses that aren’t needed. (Circuits may also be designed to depend on the absence of a clock pulse, but let’s not confuse matters with that case.) The circuit below shows two clock gating blocks (in blue) that cut out unneeded clock pulses and only pulse the clock when a meaningful computation is being performed.
In addition to reducing clock distribution and flip-flop power, clock gating also guarantees that flip-flop outputs are not wiggling when there are no clock pulses. This reduces downstream dynamic power consumption. In all, clock gating can save a considerable amount of power relative to an ungated circuit.
Implementing Clock Gating
A prerequisite for clock gating is knowing when signals are meaningful and when they are not. This is among the aspects of higher-level awareness inherent in a Transaction-Level Verilog model. The logic of a “transaction” is expressed under the condition that indicates its validity. Since a single condition can apply to all the logic along a path followed by transactions, the overhead of applying validity is minimal.
Validity is not just about clock gating. It helps to separate the wheat from the chaff, so to speak. The earlier CPU waveform, for example, is from a TL-Verilog model. Debugging gets easier as we have automatically filtered out the majority of signal values, having identified them as meaningless. And we know they are meaningless because of automatic checking that ensures that these values are not inadvertently consumed by meaningful computations.
With this awareness in our model, default fine-grained clock gating comes for free. The valid conditions are used by default to enable our clock pulses.
The full implications of having clock gating in place from the start may not be readily apparent. I’ve never been on a project that met its goals for clock gating. We always went to silicon with plenty of opportunity left on the table. This is because power savings is always the last thing to be implemented. Functionality has to come first. Without it, verification can’t make progress, and verification is always the long pole. Logic designers can’t afford to give clock gating any real focus until they have worked through their functional bug backlogs, which doesn’t happen until the end is in sight. At this point, many units have already been successfully implemented without full clock gating. The project is undoubtedly behind schedule, and adding clock gating would necessitate implementation rework including the need to address new timing and area pressure. Worse yet, it would bring with it a whole new flood of functional bugs. As a result, well, let’s just say we’re heating the planet faster than necessary. Getting clock gating into the model from the start, at no cost, completely flips the script.
Conclusion
Power is now a first-order design constraint, and clock gating is an important part of an overall power strategy. Register transfer level modeling does not lend itself to the successful use of clock gating. A transaction-level design can have clock gating in place from the start, having a shift-left effect on the project schedule and resulting in lower-power silicon (and indirectly higher performance and lower area as well). If you are planning to produce competitive silicon, it’s important to have a robust clock-gating methodology in place from the start.
Also Read:
Clock Aging Issues at Sub-10nm Nodes
Analyzing Clocks at 7nm and Smaller Nodes
Methodology to Minimize the Impact of Duty Cycle Distortion in Clock Distribution Networks
Share this post via:
The Intel Common Platform Foundry Alliance