Accelerating compute-intensive software functions by moving them into hardware has a long history, stretching back (as far as I remember) to floating-point co-processors. Modern SoCs are stuffed with such accelerators, from signal processors to graphics processors, codecs and many more. All of these accelerators work extremely well for functions with broad application, where any need for ongoing configurability can be handled through switches or firmware / software upgrades in ways that don’t significantly compromise performance.
But that constraint doesn’t always fit well with needs in the very dynamic markets which are common today, where competitive differentiation continually changes targets for solution-providers. That’s why FPGAs have become hot in big datacenter applications. Both Amazon Web Services (AWS) and Microsoft Azure have announced FPGA-based capabilities within their datacenters, for differentiated high-speed networking and to provide customizable high-performance options to cloud customers. The value proposition is simple – as demands change, the FPGA can be adapted more quickly than you could build a new ASIC, and often more cheaply given relatively low volumes in these applications.
Naturally there is a middle ground between ASIC and FPGA options. FPGA SoCs might be an answer in some cases, but when you’re stretching for a differentiated edge or wanting to offer an SoC solution to those who are, it’s not hard to imagine cases where an application-specific ASIC shell around an embedded FPGA core might be just right. You get all the flexibility of the FPGA core, combined with high performance plus low power and area of the fit-to-purpose ASIC functionality around the core. Target applications include data intensive AI / machine learning, 5G wireless, automotive ADAS and datacenter and networking applications.
As in any good FPGA, you expect support for logic and ALUs, DSP functions, also block RAMs (BRAM) and smaller RAM blocks (LRAMs in the picture above). When you want to customize the embedded FPGA (eFPGA) in your SoC, you go through the usual design cycle to map a logic design onto the primitives in the eFPGA. If you are using the Achronix Speedcore technology, you will use their ACE design tools.
Now take this a step further. When you write a piece of software, you can profile it to find areas where some additional focus could greatly speed up performance. The same concept can apply in your eFPGA design. By profiling benchmark test cases (Achronix works collaboratively with customers to do this), you can identify performance bottlenecks. Based on this analysis, Achronix can then build custom blocks for certain functions, which can be tiled into the eFPGA. Now you have the advantage of the high-performance shell along with configurability in the eFPGA, yet with significantly better PPA than you would get in a conventional eFPGA.
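To make the software side of that analogy concrete, here is a minimal sketch of profiling a small workload to find its hottest function, using Python's standard cProfile and pstats modules. The workload, the function names (matmul, checksum, workload, hottest_function) and the matrix size are all made up for illustration; this says nothing about how Achronix profiles customer benchmarks.

```python
import cProfile
import pstats


def matmul(a, b):
    """Naive matrix multiply: the deliberately heavy kernel in this toy workload."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0
            for k in range(m):
                acc += a[i][k] * b[k][j]
            out[i][j] = acc
    return out


def checksum(matrix):
    """Cheap post-processing step, included only for contrast."""
    total = 0
    for row in matrix:
        for value in row:
            total += value
    return total


def workload(size=40):
    """Hypothetical benchmark: build two matrices, multiply them, sum the result."""
    a = [[(i + j) % 7 for j in range(size)] for i in range(size)]
    b = [[(i * j) % 5 for j in range(size)] for i in range(size)]
    return checksum(matmul(a, b))


def hottest_function(func):
    """Profile func() and return the name of the function with the most self time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (prim calls, calls, tottime, cumtime, callers)
    ranked = sorted(stats.stats.items(), key=lambda item: item[1][2], reverse=True)
    (_file, _line, name), _timings = ranked[0]
    return name
```

Calling hottest_function(workload) should point at matmul, which is the kind of answer that tells you where a dedicated block would pay off.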
Achronix offer several application examples where the benefit of their Speedcore Custom Blocks is quite obvious. The first is for a YOLO (you only look once) function supporting a convolutional neural net (CNN) in real-time object detection. By converting a matrix-multiply operation to a custom block they have been able to reduce the size of the eFPGA by 35%.
In another example for networking, they have been able to build custom functions which can examine network traffic at line speed (400Gb/s line rate), for example to do header inspection. In this example, the purple blocks are the custom packet segment extraction/insertion blocks.
Another especially interesting example is use of this capability in building TCAMs. These functions are widely used in networking but are typically considered very expensive to implement in standalone FPGAs. However they can be very feasible in application-specific uses in an eFPGA when implemented as Custom Blocks.
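For readers who haven't met a TCAM, the sketch below is a small software model of what one does: each entry stores a value plus a mask of "care" bits, and a lookup returns the first entry whose cared-about bits match the key. The class names, the prefix helper and the routing-style entries are assumptions for illustration only. A hardware TCAM compares the key against every entry in parallel, whereas this model simply loops, so it shows the function, not the performance.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TcamEntry:
    value: int   # pattern bits
    mask: int    # 1 = compare this bit, 0 = don't care
    result: str  # what a hit on this entry returns


class Tcam:
    """Software model of a ternary CAM; entries earlier in the list win."""

    def __init__(self, entries: Iterable[TcamEntry]):
        self.entries = list(entries)

    def lookup(self, key: int) -> Optional[str]:
        # Hardware checks all entries at once; here we scan in priority order.
        for entry in self.entries:
            if (key & entry.mask) == (entry.value & entry.mask):
                return entry.result
        return None


def ip(address: str) -> int:
    """Convert dotted-quad IPv4 text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(address))


def prefix(address: str, length: int, result: str) -> TcamEntry:
    """Build an entry that matches the top `length` bits of an IPv4 address."""
    mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
    return TcamEntry(value=ip(address), mask=mask, result=result)


# Hypothetical table, longest prefix first so the most specific rule wins.
table = Tcam([
    prefix("10.1.0.0", 16, "port1"),
    prefix("10.0.0.0", 8, "port2"),
    prefix("0.0.0.0", 0, "default"),
])
```

With this table, table.lookup(ip("10.1.2.3")) hits the /16 entry, while an address outside 10.0.0.0/8 falls through to the catch-all entry.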
One final example – string search. This has many applications, not least in genome matching, another hot area. (If you don’t like that example, think of how many programs contain string-equality operations, how that operation dominates many profiles and is therefore likely to be a bottleneck in real-time matching on streams or fast matching on giant datasets.) FPGAs are already used to accelerate these operations but are still not fast enough, which makes this a great candidate for Custom Block acceleration. Achronix show an example where they can reduce time to do a match from 72 cycles to 1 cycle and massively reduce area.
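As a rough intuition for why a dedicated comparator can collapse many steps into one, here is a toy model contrasting a byte-at-a-time compare with a single wide compare (XOR every bit at once, then test for zero). The function names and the "steps" counter are illustrative assumptions; the step counts are not Achronix's cycle figures and say nothing about how their block is actually built.

```python
def serial_equal(a: bytes, b: bytes):
    """Compare one byte per step; return (equal, steps_taken)."""
    if len(a) != len(b):
        return False, 0
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            return False, steps
    return True, steps


def wide_equal(a: bytes, b: bytes):
    """Model a wide comparator: one XOR across all bits, then a zero test."""
    if len(a) != len(b):
        return False, 0
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return diff == 0, 1
```

In this model, matching two equal 7-byte strings takes 7 steps serially and 1 step with the wide comparator; the wide version stays at 1 step however long the strings get, at the cost of more comparison hardware.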
No big surprise in a way – we all know that custom is going to be much faster and smaller than FPGA. The difference here is that now you can embed custom in eFPGA – pretty neat. Of course, this takes work. Robert Blake, the CEO of Achronix, told me that you might typically expect a 6-month cycle for profiling and custom block development. And there will be an NRE (you didn’t think it would be free, did you?). But if it can deliver this kind of advantage, it may be worth the investment.
Achronix’s business is growing very nicely, thanks to development in each of their FPGA accelerator lines. They expect to close 2017 at >$100M, with a strong pipeline, apparently well balanced between their standalone FPGA (Speedster) and embedded applications. Speedcore, introduced to customers in 2015, is their fastest-growing product line and is already in production on TSMC 16nm and at testchip and first designs in TSMC 7nm.
You can read more HERE. You can also see Achronix present at ARM TechCon on:
· Reprogrammable Logic in an Arm-Based SoC, presented by Kent Orthner, Systems Architect
· Smaller, Faster and Programmable – Customizing Your On-Chip FPGA, presented by Steve Mensor, VP of Marketing
· Customize Your eFPGA – Control Your Destiny for Machine Learning, 5G and Beyond, presented by Kent Orthner, Systems Architect