Floating-point computation has been a staple of mainframe, minicomputer, supercomputer, workstation, and PC platforms for decades. Almost all modern microprocessor IP supports the IEEE 754 floating-point standard. Embedded design, for reasons of power and area and thereby cost, often eschews floating-point hardware, forcing designers into fixed-point computation.
Digital signal processing technology has been dominated by fixed-point implementations and algorithms. Much of the reason for that is the nature of the input stream, often quantized to 10 or 12 bits via analog-to-digital conversion of real-world variables. In particular, audio, video, and mobile baseband operations fit well in fixed-point formats. Fixed-point DSP is very efficient, accomplishing computations in fewer cycles and with less power than the same algorithms on a general-purpose CPU.
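To make the fixed-point idea concrete, here is a minimal Python sketch of a 12-bit quantizer feeding a Q15 multiply. The function names (`adc_quantize`, `q15_mul`), the choice of Q15, and the sample values are illustrative assumptions, not taken from any particular DSP or toolchain.

```python
def adc_quantize(x, bits=12, full_scale=1.0):
    """Map a real value in [-full_scale, full_scale) to a signed integer code.

    Hypothetical model of an ideal ADC: round to nearest level, then saturate.
    """
    levels = 1 << (bits - 1)                      # 2048 levels per polarity at 12 bits
    code = int(round(x / full_scale * levels))
    return max(-levels, min(levels - 1, code))    # clip to the representable range


def q15_mul(a, b):
    """Multiply two Q15 integers with rounding and saturation."""
    product = a * b                               # Q15 * Q15 gives a Q30 product
    rounded = (product + (1 << 14)) >> 15         # add half an LSB, drop 15 bits
    return max(-32768, min(32767, rounded))       # saturate to 16 bits


code = adc_quantize(0.5)                          # 12-bit code for 0.5
sample_q15 = code << 4                            # 11 fractional bits -> 15 fractional bits
gain_q15 = int(round(0.75 * 32768))               # 0.75 expressed in Q15
result_q15 = q15_mul(sample_q15, gain_q15)        # models 0.5 * 0.75 = 0.375
print(code, sample_q15, result_q15, result_q15 / 32768)
```

Everything here is integer shifts, adds, and multiplies, which is why this style of arithmetic maps so cheaply onto hardware. The cost is that the programmer has to track the position of the binary point by hand.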
Some algorithms suffer from the reduced dynamic range of fixed-point math, and truncated data sometimes affects computational accuracy. Good examples are variants of SLAM (simultaneous localization and mapping) algorithms, with their sensitivity to bias errors. IEEE 754 floating-point offers a notation for dealing with highly precise, large numbers: a sign bit, an exponent field, and a significand, which in single precision take 1, 8, and 23 bits respectively.
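The bias problem is easy to reproduce. The sketch below scales a run of integer samples by a Q15 coefficient twice, once truncating and once rounding to nearest, and measures the mean error of each in output LSBs. The coefficient, the sample range, and the function names are arbitrary choices for illustration.

```python
from statistics import mean

COEFF_Q15 = int(round(0.3 * 32768))     # illustrative coefficient, about 0.3 in Q15


def scale_trunc(x):
    """Scale by the coefficient and truncate (arithmetic shift floors the result)."""
    return (x * COEFF_Q15) >> 15


def scale_round(x):
    """Scale by the coefficient and round to nearest before dropping bits."""
    return (x * COEFF_Q15 + (1 << 14)) >> 15


samples = range(-1000, 1000)
exact = [x * COEFF_Q15 / 32768 for x in samples]   # unquantized reference

# Mean error in output LSBs; a nonzero mean is a systematic bias, not noise.
bias_trunc = mean(scale_trunc(x) - e for x, e in zip(samples, exact))
bias_round = mean(scale_round(x) - e for x, e in zip(samples, exact))
print(f"truncation bias: {bias_trunc:+.3f} LSB, rounding bias: {bias_round:+.3f} LSB")
```

Truncation pulls every result toward negative infinity, so its errors average about half an LSB low instead of cancelling out. An algorithm that integrates many such results, as localization filters do, accumulates that offset as drift.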
Aggravating the problem is the widespread popularity and ease of use of MATLAB. On a PC or supercomputer, MATLAB floating-point models just run, but before implementation on embedded hardware an extensive conversion to fixed-point modeling is usually required. Microcontroller types are quite adept at these conversions out of necessity, since few of them can afford the power for floating-point units.
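A typical conversion step looks something like the following sketch: keep the floating-point model as the reference, build a fixed-point version alongside it, and check the error between the two. This is a hand-written Python illustration with a hypothetical 3-tap filter and test signal, not the MATLAB workflow itself.

```python
import math


def to_q15(v):
    """Convert a real value in [-1, 1) to a saturated Q15 integer."""
    return max(-32768, min(32767, int(round(v * 32768))))


def fir_float(x, h):
    """Reference FIR filter in floating point."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_q15(x_q, h_q):
    """Same filter on Q15 integers with a wide accumulator, rounded once per output."""
    out = []
    for n in range(len(x_q)):
        acc = 0                                    # Q30 accumulator, no intermediate rounding
        for k in range(len(h_q)):
            if n - k >= 0:
                acc += h_q[k] * x_q[n - k]
        y = (acc + (1 << 14)) >> 15                # round back to Q15
        out.append(max(-32768, min(32767, y)))     # saturate
    return out


h = [0.25, 0.5, 0.25]                              # illustrative low-pass taps
x = [0.9 * math.sin(2 * math.pi * n / 16) for n in range(64)]

ref = fir_float(x, h)
fixed = [v / 32768 for v in fir_q15([to_q15(v) for v in x], [to_q15(c) for c in h])]
max_err = max(abs(a - b) for a, b in zip(ref, fixed))
print(f"max error: {max_err:.2e} (one Q15 LSB is {1 / 32768:.2e})")
```

For a small filter with a known input range this is manageable. The effort grows quickly once the model has feedback paths, matrix inversions, or signals whose range is hard to bound in advance.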
As models get more complex, the risk and time to convert them from floating- to fixed-point are growing. Embedded teams have tried to sneak by with lower-precision 16-bit floating-point to save power. Still, if floating-point weren’t so expensive in silicon, more people would use it. Advanced applications such as sensor fusion, machine vision, radar with direction-of-arrival algorithms such as MUSIC and ESPRIT, wireless networking with beamforming and MIMO, and others are making the case for new IP with floating-point designed in.
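What 16-bit floating-point gives up can be seen by round-tripping values through the IEEE 754 half-precision format, which Python's standard `struct` module supports via the `e` format code. The helper name `to_half` and the test values are illustrative assumptions.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Half precision keeps an 11-bit significand (10 stored bits plus an implicit one),
# so the spacing between representable values grows quickly with magnitude.
for value in (0.1, 1000.1, 4097.0):
    half = to_half(value)
    print(f"{value!r:>8} -> {half!r:<22} abs error {abs(half - value):.3g}")
```

Near 1000 the representable values are already 0.5 apart, and above 4096 they are 4 apart. That is often acceptable for sample data, but it is easily too coarse for accumulated sums or covariance matrices.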
Cadence Design Systems is betting heavily on floating-point in its DSP families. This week’s announcement at the Linley Processor Conference focused on the Tensilica Xtensa LX7, but floating-point support is moving across its range of IP.
Dror Maydan of Cadence points out that these offerings aren’t really a processor core per se, but rather a processor generator running from pre-configured DSP templates with extensive configurability. Their TIE customization language offers a way to add instructions quickly and differentiate products. He says their DSP customers are using “zero assembly language”; everything is written in C, no matter what is added to it.
A prime example is the recent disclosure by Microsoft at Hot Chips about their HoloLens. Inside is a chip in TSMC 28nm with 24 Tensilica DSP cores. This gave Microsoft a way to add 300 custom instructions – yes, 300, not a typo. “If you can’t add custom instructions, the math density you wind up with is not what you need,” according to Microsoft’s Nick Baker.
Microsoft claims they got a 200x speedup through a mix of hardware accelerators and customized DSP instructions. Maydan says that software effort was done using a Palladium hardware emulator. I got to see a Microsoft HoloLens firsthand, eyes-on, this week at a networking event in Austin, and I can say it’s awesome – blows the Samsung Gear VR out of the water. Developers are still finding ways to optimize code and fully exploit its DSP capability in AR/VR applications.
Maydan shows data indicating that floating-point units cost some 10-15% extra area, but depending on the precision can result in 15-30% lower power on vector matrix multiply operations. As in many applications, the real savings come in a complete algorithm mixing operations, such as a radar – note that front end operations are better in fixed-point, while the back end benefits from floating-point.
Cadence tools handle the C compile including data types and auto-vectorization, while still giving programmers control over the effort.
More specifics of the Xtensa LX7 IP are in the press release.
It’s an interesting trend to watch. Given that software has eaten the world, and the vast majority of it is floating-point, removing the performance/power barrier in embedded DSP implementations is a big step. Developers can create very complex algorithms and get them onto embedded devices much more quickly, avoiding the conversion step and taking models straight to implementation.