ARM’s Cortex-M4 processor core represented quite a breakthrough in digital signal controller technology when launched in 2010. Adding a single-cycle multiplier and SIMD instructions enabled basic DSP algorithms while retaining the low power benefits of an MCU. New technology circa 2016 – embedded programmable logic – can extend the Cortex-M4 or other core for the same DSP operations using significantly less energy.
Flex Logix has published a new case study in presentation format exploring performance and power consumption of a stock ARM Cortex-M4 in TSMC 40G versus the same algorithms offloaded into EFLX embedded programmable logic tiles. For the comparison, EFLX figures are from TSMC 40ULP (with comparable dynamic power), and leakage is nullified with power gating. The study also takes out memory access overhead for the Cortex-M4, assuming instructions and data are cached.
As in many MCU applications, the crux of this argument is reducing energy per unit of algorithmic work. Shortening the bursts of active computation and allowing functional blocks to be power gated more often results in overall energy savings and longer device battery life. Rather than using a complete DSP core and C programming, the EFLX configuration can be tuned in RTL for the exact algorithm at hand. (Several posts have introduced the EFLX technology – navigate to FPGA > Flex Logix to see the previous discussions.)
Conceptually, this is a similar idea to using a full-sized external FPGA for algorithm offload, but with major differences in power consumption. EFLX is an embedded FPGA, in the same process node as the MCU core alongside it. There are no high-speed transceivers, which are one of the big power hogs in an FPGA. EFLX reconfigurable building blocks (RBBs) and tiles have been optimized for fine grain clock gating, and the interconnect fabric is optimized with power gating – reducing leakage power some 36x.
As we suggested in another post on IoT processing a few days ago, a fast multiplier is great for many applications, but it is insufficient for many others. To illustrate the differences, Flex Logix chose to study a 5 tap FIR filter and a single-stage BIQUAD filter, DSP algorithms that involve both multiplies and data accesses. The computations certainly can be performed on a Cortex-M4 alone – for the 5 tap FIR, 8080 clock cycles are required for 256 samples.
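For readers who want the two algorithms in concrete form, here is a minimal reference sketch in Python. It is not the code used in the Flex Logix study: the function names, the tap values, and the direct form I biquad structure are all assumptions chosen for illustration. The last lines simply divide the quoted cycle count by the sample count.

```python
def fir(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n-k], with x[n<0] = 0."""
    history = [0] * len(taps)            # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]     # shift the new sample into the delay line
        out.append(sum(c * h for c, h in zip(taps, history)))  # one MAC per tap
    return out


def biquad(samples, b, a):
    """Single-stage biquad in direct form I (an assumed structure).

    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    with b = (b0, b1, b2) and a = (a1, a2).
    """
    x1 = x2 = y1 = y2 = 0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x                   # age the input history
        y2, y1 = y1, y                   # age the output history
        out.append(y)
    return out


# Hypothetical 5-tap coefficients, for illustration only.
TAPS = [1, 2, 3, 2, 1]
IMPULSE = [1, 0, 0, 0, 0, 0, 0, 0]
print(fir(IMPULSE, TAPS))                # impulse response equals the taps, then zeros

# Cycle figures quoted in the article for the Cortex-M4 running the 5 tap FIR.
M4_CYCLES = 8080
SAMPLES = 256
CYCLES_PER_SAMPLE = M4_CYCLES / SAMPLES
print(CYCLES_PER_SAMPLE)                 # 31.5625
```

Each output sample of either filter costs five multiply-accumulates plus the loads and stores that move the history along. That data movement is part of why the quoted cycle count works out to roughly 31.6 cycles per sample rather than 5.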
The DSP version of the EFLX-100 tile provides 2 MACs and 88 LUTs. Tiles can be arrayed in up to a 5×5 configuration to get more multipliers and LUTs. A version of the 5 tap FIR with 32-bit data and 16-bit coefficients needs 5 EFLX DSP tiles to supply the multipliers; no additional logic is required, and there are LUTs to spare. The 16-bit BIQUAD implementation needs only 3 EFLX-100 tiles. Both versions can be optimized at the RTL level for more efficient multiply sequencing.
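The tile arithmetic is easy to tabulate. The sketch below only multiplies out the per-tile figures quoted above; the helper name `array_resources` is hypothetical, and it does not model how a 32-bit by 16-bit multiply maps onto the MACs, which the article does not describe.

```python
MACS_PER_TILE = 2        # DSP version of the EFLX-100, per the article
LUTS_PER_TILE = 88       # per the article
MAX_ARRAY = (5, 5)       # largest array mentioned: 5x5 tiles


def array_resources(tiles):
    """Return the total MACs and LUTs available in an array of `tiles` tiles."""
    max_tiles = MAX_ARRAY[0] * MAX_ARRAY[1]
    if not 1 <= tiles <= max_tiles:
        raise ValueError(f"tile count must be between 1 and {max_tiles}")
    return {"macs": tiles * MACS_PER_TILE, "luts": tiles * LUTS_PER_TILE}


print(array_resources(5))    # 5 tap FIR, 32-bit data: {'macs': 10, 'luts': 440}
print(array_resources(3))    # 16-bit BIQUAD: {'macs': 6, 'luts': 264}
print(array_resources(25))   # full 5x5 array: {'macs': 50, 'luts': 2200}
```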
Keep in mind that RTL is synthesized using the Synopsys Synplify Pro engine, not some proprietary piece of magic. Gate level simulation for this study was performed in Mentor Graphics Questa, and power analysis was done with Cadence Voltus, providing a level, reproducible playing field. Both the Cortex-M4 and the EFLX were run for 256 data samples. Since the EFLX-based hardware acceleration handles one sample per clock cycle, the Cortex-M4’s sizable advantage in dynamic power is completely offset by the much larger number of cycles it needs to perform the same function. Again, the Cortex-M4 power doesn’t include any memory access.
The energy delta is massively in favor of the EFLX configuration. For the 32-bit 5 tap FIR, EFLX has a 1.75x advantage; for a 16-bit filter, that jumps to 4.76x. The 16-bit BIQUAD has similar results with a 1.49x advantage.
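The reasoning behind these ratios is that energy is power multiplied by time, and time is cycles divided by clock frequency. The sketch below works through that relationship. It assumes both implementations run at the same clock frequency and that the EFLX version takes exactly 256 cycles for 256 samples (ignoring any pipeline fill). The power values are arbitrary placeholders, not figures from the study, so the printed ratio is not expected to match the ones above.

```python
def energy(power_w, cycles, freq_hz):
    """Energy in joules: power multiplied by run time (cycles / frequency)."""
    return power_w * cycles / freq_hz


def energy_advantage(p_cpu, cycles_cpu, p_accel, cycles_accel):
    """Ratio of CPU energy to accelerator energy at the same clock frequency.

    The frequency cancels out, so only power and cycle counts matter.
    A result above 1.0 means the accelerator uses less energy.
    """
    return (p_cpu * cycles_cpu) / (p_accel * cycles_accel)


M4_CYCLES = 8080     # from the article: 5 tap FIR, 256 samples
EFLX_CYCLES = 256    # assumption: one sample per clock, no pipeline latency

# The accelerator can draw up to this many times the CPU's power and still break even.
BREAK_EVEN_POWER_RATIO = M4_CYCLES / EFLX_CYCLES
print(BREAK_EVEN_POWER_RATIO)            # 31.5625

# Placeholder power figures in arbitrary units, NOT taken from the study.
P_M4 = 1.0
P_EFLX = 10.0
print(energy_advantage(P_M4, M4_CYCLES, P_EFLX, EFLX_CYCLES))   # 3.15625
```

Under these assumptions, the accelerator can burn many times the processor's power per cycle and still come out ahead, because it is active for far fewer cycles.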
EFLX tiles take only 0.13 mm², so these implementations are not using up a lot of extra area. Leakage power can start to dominate at lower frequencies, but the simple solution is power gating when the EFLX-based hardware accelerator is not in use – and there is negligible wake-up overhead, unlike an MCU core that takes energy just to come out of sleep.
Follow the link to the complete study presentation with all the background on the Flex Logix landing page (PDF, registration not required):
I don’t think Flex Logix is picking on an ARM Cortex-M4 per se. It’s just that the Cortex-M4 is extremely popular in wearable and IoT applications because of its computational punch and relative energy efficiency compared with other conventional solutions. The fact is any MCU-style core would probably have similar issues being asked to take on heavier DSP algorithms. The approach of adding a small chunk of DSP hardware (or more general purpose logic) with synthesizable, optimizable, power and clock gated embedded programmable logic while keeping the rest of the IP around a favorite processor core is quite compelling.