October 14, 2016 / November 22, 2019 by Don Dingee

Adding DSP hardware shrinks energy for MCU core
by Don Dingee on 10-14-2016 at 4:00 pm
Categories: eFPGA, Flex Logix, FPGA, IoT, IP

ARM’s Cortex-M4 processor core represented quite a breakthrough in digital signal controller technology when launched in 2010. Adding a single-cycle multiplier and SIMD instructions enabled basic DSP algorithms while retaining the low power benefits of an MCU. New technology circa 2016 – embedded programmable logic – can extend the Cortex-M4 or other core for the same DSP operations using significantly less energy.

Flex Logix has published a new case study in presentation format exploring performance and power consumption of a stock ARM Cortex-M4 in TSMC 40G versus the same algorithms offloaded into EFLX embedded programmable logic tiles. For the comparison, EFLX figures are from TSMC 40ULP (with comparable dynamic power), and leakage is nullified with power gating. The study also takes out memory access overhead for the Cortex-M4, assuming instructions and data are cached.

As in many MCU applications, the crux of this argument is reducing energy per unit of algorithmic work. Shortening the bursts of active computation and allowing functional blocks to be power gated more often result in overall energy savings and longer device battery life. Rather than using a complete DSP core and C programming, the EFLX configuration can be tuned in RTL for the exact algorithm at hand. (Several posts have introduced the EFLX technology – navigate to FPGA > Flex Logix to see the previous discussions.)

Conceptually, this is similar to using a full-sized external FPGA for algorithm offload, but with major differences in power consumption. EFLX is an embedded FPGA, in the same process node as the MCU core alongside it. There are no high-speed transceivers, which are among the big power hogs in an FPGA. EFLX reconfigurable building blocks (RBBs) and tiles have been optimized for fine-grained clock gating, and the interconnect fabric is optimized with power gating – reducing leakage power by some 36x.

As we suggested in another post on IoT processing a few days ago, a fast multiplier is great for many applications, but it is insufficient for many others. To illustrate the differences, Flex Logix chose to study a 5-tap FIR filter and a single-stage BIQUAD filter, DSP algorithms that involve both multiplies and data accesses. The computations certainly can be performed on a Cortex-M4 alone – for the 5-tap FIR, 8080 clock cycles are required for 256 samples.
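For readers who want the two filters spelled out, here is a minimal Python reference sketch of a direct-form 5-tap FIR and a single-stage biquad. This is not the code or RTL used in the Flex Logix study. The function names, the direct-form I biquad structure, and the coefficient sign convention are assumptions chosen for illustration. The only figures taken from the study are the 8080 cycles and the 256 samples.

```python
def fir5(samples, coeffs):
    """Direct-form 5-tap FIR: y[n] = sum(coeffs[k] * x[n-k]) for k = 0..4."""
    if len(coeffs) != 5:
        raise ValueError("expected exactly 5 coefficients")
    history = [0] * 5  # holds x[n], x[n-1], ..., x[n-4]
    out = []
    for x in samples:
        history = [x] + history[:-1]  # shift the new sample into the delay line
        out.append(sum(c * h for c, h in zip(coeffs, history)))
    return out


def biquad(samples, b, a):
    """Single-stage direct-form I biquad (assumed structure):
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    """
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x  # age the input history
        y2, y1 = y1, y  # age the output history
        out.append(y)
    return out


# Figures quoted from the study for the Cortex-M4 running the 5-tap FIR.
CORTEX_M4_FIR_CYCLES = 8080
NUM_SAMPLES = 256
cycles_per_sample = CORTEX_M4_FIR_CYCLES / NUM_SAMPLES
```

Each FIR output takes five multiplies and four adds, and the biquad takes five multiplies per sample, so both filters are dominated by multiply-accumulate work plus the data movement around it. Dividing the quoted cycle count by the sample count gives roughly 31.6 Cortex-M4 cycles per FIR output sample.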

The DSP version of the EFLX-100 tile provides 2 MACs and 88 LUTs. Tiles can be arrayed in up to a 5×5 configuration to get more multipliers and LUTs. For a version of the 5-tap FIR with 32-bit data and 16-bit coefficients, 5 EFLX DSP tiles supply the needed multipliers, and the LUTs in those tiles cover the remaining logic with room to spare. The 16-bit BIQUAD implementation needs only 3 EFLX-100 tiles. Both versions can be optimized at the RTL level for more efficient multiply sequencing.
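The tile arithmetic is easy to tabulate. The short sketch below uses only the per-tile figures quoted above (2 MACs and 88 LUTs per DSP tile, arrays up to 5×5). The function name `tile_resources` is hypothetical, and the sketch says nothing about how the study actually mapped multipliers to filter taps.

```python
MACS_PER_TILE = 2    # EFLX-100 DSP tile, as quoted above
LUTS_PER_TILE = 88   # EFLX-100 DSP tile, as quoted above
MAX_TILES = 5 * 5    # largest array mentioned: 5x5


def tile_resources(num_tiles):
    """Total MACs and LUTs available in an array of EFLX-100 DSP tiles."""
    if not 1 <= num_tiles <= MAX_TILES:
        raise ValueError("tile count must be between 1 and 25")
    return {
        "macs": num_tiles * MACS_PER_TILE,
        "luts": num_tiles * LUTS_PER_TILE,
    }


fir_array = tile_resources(5)      # 32-bit data, 16-bit coefficient 5-tap FIR
biquad_array = tile_resources(3)   # 16-bit BIQUAD
full_array = tile_resources(MAX_TILES)
```

By this count the 5-tile FIR array offers 10 MACs and 440 LUTs, the 3-tile BIQUAD array offers 6 MACs and 264 LUTs, and a full 5×5 array tops out at 50 MACs and 2200 LUTs.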

Keep in mind that RTL is synthesized using the Synopsys Synplify Pro engine, not some proprietary piece of magic. Gate level simulation for this study was performed in Mentor Graphics Questa, and power analysis done with Cadence Voltus, providing a level, reproducible playing field. Both the Cortex-M4 and the EFLX were run for 256 data samples. Since the EFLX-based hardware acceleration handles one sample per clock cycle, the Cortex-M4's sizable advantage in dynamic power is completely offset by the far larger number of cycles it needs to perform the same function. Again, the Cortex-M4 power doesn’t include any memory access.

The energy delta is massively in favor of the EFLX configuration. For the 32-bit 5-tap FIR, EFLX has a 1.75x energy advantage; for a 16-bit filter, that jumps to 4.76x. The 16-bit BIQUAD also favors EFLX, with a 1.49x advantage.
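A rough way to see how cycle count and power trade off is to treat energy as power times cycles divided by clock frequency. The sketch below applies that to the FIR figures quoted in this post, under loud assumptions that are mine rather than the study's: both implementations run at the same clock, the EFLX block needs exactly 256 cycles for 256 samples (pipeline latency ignored), and the 8080-cycle figure applies to the 32-bit FIR case. The function names are hypothetical.

```python
def energy_advantage(mcu_power, mcu_cycles, accel_power, accel_cycles):
    """E_mcu / E_accel at a shared clock, using E = P * cycles / f."""
    return (mcu_power * mcu_cycles) / (accel_power * accel_cycles)


def implied_power_ratio(advantage, mcu_cycles, accel_cycles):
    """P_accel / P_mcu that would produce the given energy advantage."""
    return (mcu_cycles / accel_cycles) / advantage


M4_CYCLES = 8080    # quoted for the 5-tap FIR over 256 samples
EFLX_CYCLES = 256   # ASSUMPTION: one sample per clock, no pipeline latency

# Power ratio at which the two implementations use equal energy.
break_even = implied_power_ratio(1.0, M4_CYCLES, EFLX_CYCLES)

# Power ratio implied by the quoted 1.75x advantage (ASSUMED to pair
# with the 8080-cycle figure).
fir32_ratio = implied_power_ratio(1.75, M4_CYCLES, EFLX_CYCLES)
```

Under these assumptions the accelerator could draw up to about 31.6x the Cortex-M4's active power and still break even on energy, and the quoted 1.75x advantage would correspond to roughly 18x the power for about 1/32 of the time. Treat these as back-of-envelope numbers, not results from the study.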

EFLX tiles take only 0.13 mm², so these implementations are not using up a lot of extra area. Leakage power can start to dominate at lower frequencies, but the simple solution is power gating when the EFLX-based hardware accelerator is not in use – and there is negligible wake-up overhead, unlike an MCU core that takes energy just to come out of sleep.

Follow the link to the complete study presentation with all the background on the Flex Logix landing page (PDF, registration not required):

EFLX: Energy Efficient Embedded FPGA for DSP Applications

I don’t think Flex Logix is picking on an ARM Cortex-M4 per se. It’s just that the Cortex-M4 is extremely popular in wearable and IoT applications because of its computational punch and relative energy efficiency compared with other conventional solutions. The fact is any MCU-style core would probably have similar issues being asked to take on heavier DSP algorithms. The approach of adding a small chunk of DSP hardware (or more general purpose logic) with synthesizable, optimizable, power and clock gated embedded programmable logic while keeping the rest of the IP around a favorite processor core is quite compelling.
