In this first part of a two-part blog series, we look at defining worst-case conditions, focusing specifically on device power.
With great power, comes great responsibility…
With each new technology node, and FinFET nodes in particular, the dynamic conditions within a chip are becoming more complex in terms of process speed, thermal activity and supply variation. Dennard scaling allowed power to be scaled down with each successive node, so that power per unit area stayed roughly constant. However, as John Hennessy highlighted at last year’s AI Hardware Summit, this has not been the case since the mid-2000s, and we have seen a steady increase in power per unit of silicon area. Hennessy made the point that, with Dennard scaling ending and Moore’s Law slowing down, transistor power and cost are no longer heading in the right direction, and there is no free ride for future performance from process developments alone.
Worst case is getting worse!
This means that chips have a propensity to run hotter and in-chip voltage drops are getting bigger. Increased process variation and the end of Dennard scaling combine to mean the worst case is definitely getting worse! In addition to worst-case performance, which we will cover in the second part of this blog, SoC designers are being forced to focus on worst-case power and voltage drop scenarios. It is no coincidence that, to address these issues, the majority of FinFET SoC designs include a fabric of sensors for in-chip process, voltage and temperature monitoring.
Worst-case power is not just about the maximum power dissipation, although that is naturally a good starting point. It is also about bursts of activity, which cause temperature cycling, and power differences, which cause temperature gradients across the chip. FinFET processes require particular attention for potential hotspots: not only do they offer fantastic logic densities, with the associated increase in power per unit area, but their 3D fin-type structures are not great at dissipating heat. Ideally, strategies need to be implemented to reduce the maximum hotspot junction temperature (Tj), as this impacts lifetime and leakage current. Strategies are also needed to reduce temperature gradients and cycling, which impact reliability. The trend with very large FinFET SoCs is to embed tens of temperature sensors to monitor potential hotspots around the chip or, alternatively, to use the recently launched Distributed Thermal Sensor (DTS) from Moortec.
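To make the hotspot and gradient idea concrete, here is a minimal Python sketch of how readings from a set of embedded temperature sensors might be summarized. The `SensorReading` type, the sensor names, the coordinates and the temperatures are all hypothetical and chosen purely for illustration; a real monitoring fabric will expose its own interface.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class SensorReading:
    """One reading from a hypothetical in-chip temperature sensor."""
    name: str
    x_mm: float    # assumed sensor position on the die
    y_mm: float
    temp_c: float  # reported junction temperature in degrees C


def summarize(readings):
    """Return (hottest reading, max-min spread in C, steepest gradient in C/mm)."""
    hottest = max(readings, key=lambda r: r.temp_c)
    spread = hottest.temp_c - min(r.temp_c for r in readings)
    steepest = 0.0
    # Compare every pair of sensors and keep the largest temperature
    # difference per millimetre of separation.
    for a, b in combinations(readings, 2):
        dist = math.hypot(a.x_mm - b.x_mm, a.y_mm - b.y_mm)
        if dist > 0:
            steepest = max(steepest, abs(a.temp_c - b.temp_c) / dist)
    return hottest, spread, steepest


# Illustrative values only.
readings = [
    SensorReading("cpu_cluster", 0.0, 0.0, 70.0),
    SensorReading("ai_core", 3.0, 4.0, 95.0),
    SensorReading("io_ring", 6.0, 8.0, 80.0),
]
hottest, spread, steepest = summarize(readings)
print(hottest.name, spread, steepest)
```

With more sensors the pairwise comparison grows quadratically, which is fine for tens of sensors; logging the same summary over time would also expose temperature cycling.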
Strategies employed for thermal management range from simple thermal cut-off, where some or, in the worst case, all of the circuitry is switched off or ramped down once a certain temperature is reached, to more sophisticated dynamic frequency scaling (DFS) and dynamic voltage and frequency scaling (DVFS) schemes, where the operating point, in terms of clock frequency and supply voltage, can be dropped to a lower level to reduce power. Thermal load balancing involves allocating upcoming tasks to processors based on their free processing capacity and their temperature. In all these cases, an accurate temperature sensor delays the point at which action needs to be taken, and therefore ensures maximum processing power is maintained for as long as possible. Less accurate temperature sensors require a larger temperature guard band (Check out our previous blog to learn more), which means that for AI chips the processors will be switched off, or moved to a slower throughput mode, earlier than necessary, and that’s not good for AI.
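The effect of sensor accuracy on the guard band can be illustrated with a short Python sketch. It assumes a simple policy in which the throttle threshold is the maximum allowed junction temperature minus the sensor's worst-case error, and in which the operating point steps down one level for every few degrees above that threshold. The operating points, the 105 C limit, the step size and the function names are hypothetical values chosen for illustration, not figures from any real device.

```python
# Hypothetical DVFS operating points as (clock in MHz, supply in V),
# ordered from fastest to slowest.
OPERATING_POINTS = [(2000, 0.90), (1500, 0.80), (1000, 0.70)]


def throttle_threshold(tj_max_c, sensor_error_c):
    """Highest sensor reading at which the true temperature is still
    guaranteed to be at or below tj_max_c, given a +/- sensor error."""
    return tj_max_c - sensor_error_c


def select_operating_point(reading_c, tj_max_c, sensor_error_c, step_c=5.0):
    """Pick an operating point from a reading; None means thermal cut-off."""
    threshold = throttle_threshold(tj_max_c, sensor_error_c)
    if reading_c < threshold:
        return OPERATING_POINTS[0]  # full performance
    # Step down one level at the threshold, plus one more level for each
    # additional step_c degrees above it.
    steps = 1 + int((reading_c - threshold) // step_c)
    if steps >= len(OPERATING_POINTS):
        return None  # no slower point left: cut off
    return OPERATING_POINTS[steps]


# Same 102 C reading, two different sensor accuracies.
print(select_operating_point(102.0, 105.0, sensor_error_c=1.0))
print(select_operating_point(102.0, 105.0, sensor_error_c=5.0))
```

In this sketch, the sensor with a 1 C error lets the chip stay at full speed at a 102 C reading, while the sensor with a 5 C error has already forced a step down to the slower operating point.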
Associated with worst-case power are worst-case currents, which cause IR voltage drops on chip. Changes in voltage drop due to step changes in workload are particularly difficult to predict in advance. Large SoCs are invariably software driven, but how the end customer will program these chips, and what their worst-case workload profile looks like, is not always clear. Including voltage and temperature monitors on-chip, especially for critical blocks, gives visibility of the on-chip conditions and how these change with different workload profiles.
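As a rough illustration of why a step change in workload matters, the Python sketch below applies Ohm's law (V = I × R) to a hypothetical workload current profile. It assumes a single lumped resistance for the power delivery path and ignores inductive droop, so it is a simplification rather than a real power-integrity analysis. The supply voltage, resistance, currents and minimum voltage are made-up values.

```python
def ir_drop_profile(v_nominal, r_pdn_ohm, currents_a, v_min):
    """For each load current, return (current, local voltage, violates v_min).

    Models only the static IR drop across one lumped resistance.
    """
    results = []
    for current in currents_a:
        v_local = v_nominal - current * r_pdn_ohm
        results.append((current, v_local, v_local < v_min))
    return results


# Hypothetical workload: idle, moderate load, then a step to heavy load.
profile = ir_drop_profile(
    v_nominal=0.80,    # assumed supply at the regulator, in volts
    r_pdn_ohm=0.005,   # assumed 5 milliohm path resistance
    currents_a=[2.0, 10.0, 20.0],
    v_min=0.72,        # assumed minimum voltage for timing closure
)
for current, v_local, violates in profile:
    print(f"{current:5.1f} A -> {v_local:.3f} V  violation={violates}")
```

An on-chip voltage monitor close to the critical block effectively measures the local voltage directly, so the real workload profile does not have to be predicted in advance.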
Multiple potential hotspots & temperature gradients?
SoC development teams are faced not just with resolving traditional worst-case timing issues but also with worst-case power. The latter can lead to multiple potential hotspots, temperature gradients and difficult-to-predict voltage drops across large SoCs. Embedding a fabric of accurate in-chip monitors on SoCs provides excellent visibility of on-chip conditions.
This is seen as an essential tool for bring-up, characterization and optimization on a per-die basis, especially for SoC development teams who are pushing the limits on advanced FinFET nodes but who want to stay on the right side in worst-case conditions. As the old saying goes, ‘with great power comes great responsibility’, and this is certainly the case when it comes to managing power conditions on advanced node devices.
Look out for the second part of this blog series, entitled “Staying on the right side in worst case conditions – Performance”, which will be available early July. In it, we will look at defining the worst case in terms of chip performance, where timing analysis is key!