Machine learning is transforming how information processing works and what it can accomplish. The push to design hardware and networks to support machine learning applications is affecting every aspect of the semiconductor industry. In a video recently published by Synopsys, Navraj Nandra, Sr. Director of Marketing, takes us on a comprehensive tour of these changes and how many of them are being used to radically drive down the power budgets for the finished systems.
According to Navraj, the best estimates that we have today put the human brain’s storage capacity at 2.5 petabytes, the equivalent of 300 years of continuous streaming video. The brain’s computation speed is estimated to be 30X faster than the best available computer systems, and it consumes only 20 watts of power. These are truly impressive statistics. If we see these as design targets, we have a long way to go. Nevertheless, there have been tremendous strides in what electronic systems can do in the machine learning arena.
Navraj highlighted one such product to illustrate how advanced current technology has become. The Nvidia Xavier chip has 9 billion transistors and contains an 8-core CPU and a 512-core Volta GPU, with a new deep learning accelerator that can perform 30 trillion operations per second. It uses only 30W. Clearly this is some of the largest and fastest commercial silicon ever produced. Many large chips are now being designed specifically for machine learning applications.
There is, however, a growing gap between the performance of DRAM memories and CPUs. Navraj estimates that this gap is growing at a rate of 50% per year. Novel packaging techniques and other advances have given system designers choices, though. DDR5 is the champion when it comes to capacity, and HBM is ideal where more bandwidth is needed. One comparison shows that dual-rank RDIMM with 3DS offers nearly 256 GB of RAM at a bandwidth of around 51 GB/s. HBM2 in the same comparison gives a bandwidth of over 250 GB/s, but only supports 16 GB. Machine learning requires both high bandwidth and high capacity. By mixing these two memory types as needed, systems can achieve excellent results. HBM also has the advantage of very high PHY energy efficiency. Even compared to LPDDR4, it is much more efficient when measured in picojoules per bit.
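To make the capacity versus bandwidth trade-off concrete, here is a minimal sketch that plugs the round figures quoted above into two simple ratios: how long it would take to read the entire memory once, and how much bandwidth is available per GB of capacity. The class and field names are hypothetical, and the inputs are the approximate values from the comparison ("nearly 256 GB", "around 51 GB/s", "over 250 GB/s"), not datasheet numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryOption:
    """Hypothetical container for the approximate figures quoted in the text."""
    name: str
    capacity_gb: float      # total capacity in GB
    bandwidth_gb_s: float   # peak bandwidth in GB/s

    def full_sweep_seconds(self) -> float:
        # Idealized time to read the whole memory once at peak bandwidth.
        return self.capacity_gb / self.bandwidth_gb_s

    def bandwidth_per_gb(self) -> float:
        # GB/s of bandwidth available for each GB of capacity.
        return self.bandwidth_gb_s / self.capacity_gb


# Assumed round numbers taken from the comparison in the text.
DDR5_RDIMM_3DS = MemoryOption("DDR5 dual-rank RDIMM with 3DS", 256.0, 51.0)
HBM2 = MemoryOption("HBM2", 16.0, 250.0)

if __name__ == "__main__":
    for option in (DDR5_RDIMM_3DS, HBM2):
        print(
            f"{option.name}: sweep {option.full_sweep_seconds():.3f} s, "
            f"{option.bandwidth_per_gb():.2f} GB/s per GB"
        )
```

With these inputs, HBM2 offers far more bandwidth per GB of capacity while the DDR5 configuration holds 16 times as much data, which is why a mix of the two suits workloads that need both.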
Up until 2015, deep learning was less accurate than humans at recognizing images. In 2015, ResNet, using 152 layers, exceeded human accuracy by achieving an error rate of 3.6%. If we look at the memory requirements of two of the prior champion recognition algorithms, we can better understand the training and inference memory requirements. ResNet-50 and VGG-19 both needed 32 GB for training. With optimization they needed 25 MB and 12 MB respectively for inference. Nevertheless, deep neural networks create three issues for interface IP: capacity, irregular memory access, and bandwidth.
Navraj asserts that the capacity issue can be handled with virtualization. Irregular memory access can have large performance impacts. However, moving cache coherency into hardware can improve overall performance. CCIX can take things even further and allow cache coherence across multiple chips, in effect giving the entire system a unified cache with full coherence. A few years ago you would have had a dedicated GPU chip with GDDR memory connected via PCIe to a CPU with DDR4. Using new technology, a DNN engine can have fast HBM2 memory and connect with CCIX to an apps processor using DDR5, with massive capacity. The CCIX connection boosts bandwidth between these cores from PCIe’s 8GB/s to 25GB/s. The system also benefits from the improved performance that cache coherence brings.
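As a rough illustration of what that link speedup means, the sketch below computes the idealized time to move a buffer across each connection, using the 8 GB/s and 25 GB/s figures exactly as quoted above. The function name and the 16 GB example buffer (the HBM2 capacity mentioned earlier) are assumptions for illustration, and the model ignores protocol overhead, latency, and lane count.

```python
# Link bandwidths as quoted in the text, in GB/s (assumed to be usable throughput).
PCIE_GB_S = 8.0
CCIX_GB_S = 25.0


def transfer_seconds(size_gb: float, link_gb_s: float) -> float:
    """Idealized transfer time: size divided by bandwidth, with no overhead."""
    if link_gb_s <= 0:
        raise ValueError("link bandwidth must be positive")
    return size_gb / link_gb_s


# Ratio of the two quoted bandwidths.
speedup = CCIX_GB_S / PCIE_GB_S

if __name__ == "__main__":
    buffer_gb = 16.0  # hypothetical buffer, sized to the HBM2 capacity quoted earlier
    print(f"PCIe: {transfer_seconds(buffer_gb, PCIE_GB_S):.2f} s")
    print(f"CCIX: {transfer_seconds(buffer_gb, CCIX_GB_S):.2f} s")
    print(f"Speedup: {speedup:.3f}x")
```

Under these assumptions the quoted figures work out to a speedup of a little over 3X on raw bandwidth alone, before counting any gain from hardware cache coherence.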
Navraj also surveys techniques used for on-chip power reduction and for reducing power consumption in interface IP. PCIe 5.0 utilizes multiple techniques for power reduction. Both the controller and the PHY present big opportunities for power reduction. Using a variety of techniques, controller standby power can be reduced by up to 95%. PHY standby power can be less than 100 µW.
Navraj’s talk ends by discussing the changes that are happening in the networking space and the optimal processor architectures for machine learning. Network architectures and technologies are bifurcating based on diverging requirements for enterprise servers and data centers. Data center operators like Facebook, Google and Amazon are pushing for the highest speeds. The roadmap for data center network interfaces includes speeds of up to 800 Gbps in 2020, whereas enterprise servers are looking to optimize power and operate at 100 Gbps by then for a single-lane interface.
The nearly hour-long video, entitled “IP with Near Zero Energy Budget Targeting Machine Learning Applications”, is filled with interesting information covering nearly every aspect of computing that touches on machine learning applications. While a near zero energy budget is a bit ambitious, aggressive techniques can make huge differences in overall power needs. With that in mind, I highly recommend viewing this video.