Webinars are a quick way to come up to speed with emerging trends in our semiconductor world, so I just finished watching an interesting one from Moortec about the benefits of embedded in-chip monitoring for data center and AI chip design. My first exposure to a data center was back in the 1960s during an elementary school class, when they wheeled in a Teletype machine connected to a telephone line. At the other end was a centralized computer system in some air-conditioned room, running a Civil War game app that had us students choosing how to run a campaign with our resources and then predicted the outcome of the battle. In the 1970s at the University of Minnesota our data center was powered by machines from Control Data Corporation, and at my first job with Intel in 1978 the data center was powered by IBM mainframes in a remote location that we accessed from Oregon.
Living in Oregon we know something about data centers because of the low cost of electricity from our plentiful hydro power generators, the moderate climate, and generous tax breaks for companies like Google to locate here. In 2018 data centers in the US consumed some 90 billion kilowatt-hours of electricity, while globally that consumption was 416 terawatt-hours, which was 3% of the total electrical output. This growing trend in data center power consumption causes heat-induced reliability issues for the semiconductor components mounted on the boards that fill racks of equipment.
Much new VC money in 2018 has poured into AI chip startups, so let’s just summarize both the data center and AI chip design challenges:
· Reliability and long MTBF (Mean Time Between Failures)
· Low service interruption
· Big die sizes at advanced nodes
· High volume with high manufacturing yield required
· Fine grain DVFS (Dynamic Voltage and Frequency Scaling) control
· Chip supply voltage noise
· High data throughput
· Intense and bursty computations
· Constrained power
· Variable CPU core usage, or utilisation
· Continual optimisation of algorithms for data analysis and manipulation
One method to deal with all of these chip design challenges is to place PVT (Process, Voltage, Temperature) monitors in your AI or data center chips. They let you measure in real time what’s happening deep within each chip, then use that info to make decisions about changing the Vdd values or local clock speeds to ensure chip reliability and meet MTBF goals. Take the example of a typical AI chip which may have CPU clusters with thousands of cores, as shown below, where 16 cores form each cluster and PVT sensor blocks (colored blocks) are placed around each cluster:
The temperature monitors will let you know if the junction temperatures are within specification, for example staying below a limit of 110C. Thermal monitors can be used to:
· Avoid Electrical Over Stress (EOS)
· Mitigate Electromigration effects
· Limit hot carrier aging
· Prevent thermal runaway
Semiconductor processes are not uniform, so you cannot expect that silicon will be centered on the TT corner. Instead you can expect:
· Process variability across each die
· Variation caused by lithography
· Reliability effects like aging
· FinFET variations
IC designers start out with an ideal power supply concept like a Vdd value of 1.1V, but then you have to deal with the non-ideal physical realities of on-chip voltages like:
· Interconnect resistance causing dynamic IR drops along Vdd paths
· Dynamic versus static power
· Electromigration effects on power, clock and interconnect
Static Timing Analysis (STA) tools are run on chips before tapeout to ensure that your design meets speed criteria across all PVT corners, but with actual physical local variations on advanced nodes it’s conceivable that one die region has a temperature of 50C, Vdd of 0.8V and SS corner, while another region has a slightly different temperature of 65C, Vdd of 0.9V and TT corner. Your STA tool needs to handle these on-chip variations (OCV) while calculating path delays.
Not all thermal monitors are created equal, so if Moortec provides a thermal monitor with +/- 2C accuracy, and another vendor has a +/- 5C accuracy thermal monitor, go with the 2C monitor in order to provide tighter control to your thermal throttling system, which in turn provides greater power savings and allows for the highest data throughput.
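The link between monitor accuracy and throttling headroom can be sketched in a few lines of Python. This is only an illustration, assuming the throttle point is set at the junction limit minus the worst-case sensor error; the 110C limit comes from the earlier example, and the function names are hypothetical rather than anything from Moortec.

```python
TJ_MAX_C = 110.0  # junction temperature limit, taken from the example earlier in the article


def throttle_threshold_c(accuracy_c, tj_max_c=TJ_MAX_C):
    """Highest sensor reading that is still safe to run at.

    Assumption: the true temperature may be up to accuracy_c above the
    reading, so the guard-band equals the sensor accuracy.
    """
    return tj_max_c - accuracy_c


def should_throttle(reading_c, accuracy_c, tj_max_c=TJ_MAX_C):
    """True when the reading has reached the guard-banded threshold."""
    return reading_c >= throttle_threshold_c(accuracy_c, tj_max_c)


if __name__ == "__main__":
    for accuracy in (2.0, 5.0):
        threshold = throttle_threshold_c(accuracy)
        print(f"+/-{accuracy:.0f}C monitor: throttle at {threshold:.0f}C")
    # The same 106C reading is fine with the tighter monitor,
    # but forces throttling with the looser one.
    print(should_throttle(106.0, 2.0))
    print(should_throttle(106.0, 5.0))
```

Under these assumptions the more accurate monitor lets the chip keep running at full speed for a few more degrees before the throttling system has to step in.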
Consider the power savings for a data center with 100,000 servers (Facebook has roughly 400,000, for example), where you could save 2W per chip by using the Moortec PVT approach versus a less accurate monitor that requires 6C more thermal guard-banding. The webinar provided a case study with calculations, showing that if this saving per chip were scaled upward then a data center could save around $2M per year in electricity costs.
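For a feel of how a per-chip saving scales, here is a rough back-of-the-envelope calculation in Python. It is not the webinar's case study: it assumes one chip per server, continuous operation, and an electricity price of $0.10 per kWh, all of which are placeholder assumptions.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, assuming the servers run continuously


def annual_savings(servers, watts_saved_per_server, usd_per_kwh):
    """Return (kWh saved per year, USD saved per year).

    usd_per_kwh is an assumed electricity price, not a figure from the webinar.
    """
    kwh = servers * watts_saved_per_server * HOURS_PER_YEAR / 1000.0
    return kwh, kwh * usd_per_kwh


if __name__ == "__main__":
    # 100,000 servers and 2W saved per chip come from the text above;
    # one chip per server and $0.10/kWh are assumptions.
    kwh, usd = annual_savings(100_000, 2.0, usd_per_kwh=0.10)
    print(f"{kwh:,.0f} kWh/year, ${usd:,.0f}/year")
```

With these simple inputs the saving is 200 kW of continuous load, or about 1.75 million kWh per year. The dollar amount depends heavily on the assumed price and on how many monitored chips sit in each server, so the webinar's figure of around $2M rests on its own case-study assumptions, which are not reproduced here.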
Just as tighter thermal guard-banding benefits data center chips and systems, the same holds for voltage guard-banding: Moortec’s highly accurate 1% voltage monitors mean fewer watts wasted compared with less accurate monitors. An example system with a Vdd of 0.8V and a 20W target shows a worst-case value of 20.4W using Moortec voltage monitors, while a less accurate voltage monitor has a worst-case value of 22.1W, about 10% over the 20W target versus 2% with Moortec. Again, Moortec outlined that there were material cost savings for data center operators.
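The worst-case numbers above are consistent with dynamic power scaling roughly with the square of the supply voltage. The short Python sketch below assumes that relationship and treats monitor accuracy as the voltage margin that must be added on top of the nominal Vdd. The 5% figure for the less accurate monitor is an inference that lands close to the quoted 22.1W; it is not a number stated in the webinar.

```python
def worst_case_power_w(target_w, voltage_margin):
    """Worst-case power when Vdd must be raised by voltage_margin (a fraction).

    Assumption: power scales with Vdd squared, as dynamic power does.
    """
    return target_w * (1.0 + voltage_margin) ** 2


if __name__ == "__main__":
    target_w = 20.0  # power target from the example above
    for margin in (0.01, 0.05):  # 1% is from the text; 5% is an assumed value
        power = worst_case_power_w(target_w, margin)
        overhead_pct = (power / target_w - 1.0) * 100.0
        print(f"{margin:.0%} margin: {power:.2f} W ({overhead_pct:.1f}% over target)")
```

A 1% margin gives about 20.4W, matching the Moortec case, while a 5% margin gives about 22.05W, or roughly 10% over the target.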
SoCs that use Adaptive Voltage Scaling (AVS) in closed loop benefit from using embedded Process or Voltage Monitors that tell the PMIC (Power Management IC) what the actual silicon values are.
Voltage scaling optimization
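A closed AVS loop of the kind described above can be sketched as follows. This is a toy Python model, not Moortec's or any PMIC vendor's interface: the process monitor is represented by a measured speed in MHz, the linear silicon model is invented purely for the demo, and all names, step sizes and limits are hypothetical.

```python
def avs_step(vdd_mv, measured_mhz, target_mhz,
             step_mv=10, vmin_mv=700, vmax_mv=900, margin_mhz=20):
    """One loop iteration: return the next Vdd request (in mV) for the PMIC."""
    if measured_mhz < target_mhz:
        # Silicon is too slow at this voltage: raise Vdd.
        return min(vmax_mv, vdd_mv + step_mv)
    if measured_mhz > target_mhz + margin_mhz:
        # Silicon is faster than needed: lower Vdd to save power.
        return max(vmin_mv, vdd_mv - step_mv)
    return vdd_mv  # within the margin band: hold


def toy_silicon_mhz(vdd_mv):
    """Invented linear speed-versus-voltage model, used only for this demo."""
    return 2.5 * (vdd_mv - 400)


def run_avs(start_mv, target_mhz, iterations=20):
    """Iterate the loop and return the Vdd it settles on."""
    vdd_mv = start_mv
    for _ in range(iterations):
        vdd_mv = avs_step(vdd_mv, toy_silicon_mhz(vdd_mv), target_mhz)
    return vdd_mv


if __name__ == "__main__":
    print(run_avs(start_mv=800, target_mhz=900))
```

Starting from 800mV, the loop steps the supply down until the measured speed sits just at the target, which is the point of feeding actual silicon measurements back to the PMIC instead of relying on a fixed worst-case Vdd.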
There’s only one IP vendor dedicated 100% to PVT monitoring for ICs, and that’s Moortec. They started in the UK back in 2005 and now have customers around the globe using the most popular nodes from the major foundries. You can take the next step and contact the office nearest to your timezone: UK, USA, China, Taiwan, Israel, Europe, South Korea, Russia, Japan.
Watch the entire 35-minute webinar recording online, after a brief registration process.