When we think of datacenters, we think of serried ranks of high-performance servers. Recent announcements from Google (on the Tensor Processing Unit), Facebook and others have opened our eyes to the role that specialized hardware and/or GPUs can play in support of deep/machine learning and big data analytics. But most of us would probably still consider those applications, while important, somewhat niche in their role in the datacenter.
Several years ago, motivated by what they knew was already happening at Google and Amazon, Microsoft started to build their own machine learning system to enhance the capabilities of Bing. But rather than develop a custom device, or build on a GPU platform, they decided to build on FPGAs. As we know, FPGA-based solutions can be significantly cheaper to build and deploy when you know you are going to be the sole customer. And of course FPGAs have the advantage of re-programmability. The Microsoft team built an FPGA-based platform they called Catapult and demonstrated that it significantly accelerated machine learning algorithms in Bing (compared with previous software-only approaches, I assume).
Fast forward to 2015. Even the most starry-eyed Microsoft supporter would admit that Bing has a long way to go to catch up with the leader in search and is unlikely to drive significant revenue for Microsoft in the near future. What the company really wants are more ways to propel their major online services – Azure (the MS Cloud) and Office 365. Catapult was appealing for both of these services, but not necessarily for machine learning.
A major problem for Azure has been managing the high volume of PCIe network traffic to and from virtual machines through virtual network (VN) adapters. When this gets up to GB/sec for a VM, the VN management load on the CPU becomes substantial. Offloading this work to a dedicated system that carries the physical traffic and handles network virtualization can significantly improve throughput. Network cards would be one solution, but the Azure team didn’t find this approach adaptable enough to support what they needed in a flexible VN fabric on the server side. After all, if you want maximum flexibility in VM management in the cloud, you need corresponding flexibility in VN management. The Azure team felt this could best be handled through FPGAs, particularly because they can be reprogrammed as load balancing and other rules change.
All of this required a major rework of Catapult, but now the hardware is done and is being rolled out. And this is no longer a few specialized boxes to serve specialized needs. Azure appears to need a Catapult system in every server (exact details are difficult to find, but it looks like one per server). And you can add to that the deep/machine learning requirements to support Bing and, later, the encryption/compression and machine learning requirements to support Office 365.
This is a whole new ball-game for FPGA deployment. Since a large datacenter contains many hundreds of thousands of servers, Microsoft’s demand alone has apparently shifted worldwide FPGA volumes significantly. Catapult, by the way, is based on Altera FPGAs. Intel EVP Diane Bryant is on record as saying this is why Intel bought Altera last year. She also anticipates that, for similar reasons, one third of all servers in datacenters will contain FPGAs (presumably optical connectivity sets the limit on volume, since that is where FPGAs maybe can’t help – for now, but stay tuned since Intel was talking about both FPGAs and photonics at the OCP summit this year).
Of course you could argue that Microsoft and Intel have misread the market and that the virtual networking functionality will be replaced by ASIC hardware solutions (especially optical). I’m not so sure, at least for the next few years. This is an area of critical differentiation for cloud services providers, so they’ll each want their own solutions. The economics of ASICs may not be a big factor in those budgets, but adaptability could be a very big factor, especially as capabilities in cloud services are evolving quickly. Eventually differentiation always moves on to other factors, but it’s not clear that is going to happen here anytime soon.