I’ve written several articles on High-Level Synthesis (HLS): designing in C, C++ or SystemC, then synthesizing to RTL. There is unquestionable appeal to the concept. A higher level of abstraction enables a function to be described in fewer lines of code (LOC), which immediately offers higher productivity and implies fewer bugs, because the number of bugs in any kind of code scales pretty reliably with LOC. Simulation for architectural design and validation runs multiple orders of magnitude faster, allowing broader experimentation with options. It can also run much larger tests, like image recognition on streaming video, a tough goal for RTL simulations. Yet these methods have seemed largely restricted to specialized design objectives: signal processing functions, some simple ML inference engines, that sort of thing.
I’m always willing to be re-educated, especially when I can hear from customers. Siemens EDA just hosted a webinar, mostly customer talks on the use of HLS with just a little marketing thrown in. It was pretty much a full day of presentations, centered on a few core applications, and it made me rethink my position. The algorithm classes the technology best serves haven’t changed so much. What has changed is that big market needs have shifted to overlap more with those algorithms. Check out which companies presented on these topics. Naturally, when these speakers talked about HLS, they meant Catapult from Siemens EDA.
There’s been a massive worldwide increase in cloud video workload. According to Google, video now accounts for more than 80% of internet traffic, thanks to streaming and YouTube in particular. Aki Kuusela of Google said that this volume demands warehouse-scale encoding with fast throughput. From his perspective the whole warehouse must be viewed as a system – storage, networking, codec, compute, etc. – to optimize for this level of traffic and throughput. Moreover, codecs must seamlessly support a variety of video formats, from the latest formats to popular standards to legacy standards. Think of YouTube: every minute, 500 hours of new content are uploaded, and tens of thousands of live streams must be served simultaneously.
Off-the-shelf solutions can’t meet this need. For the same reason Google built their own ML training platforms (TPUs), they are building their own codecs, which must be optimized across a diversity of traffic, quality, throughput, and availability requirements that only they can reproduce. Google started early with HLS to integrate with the YouTube stack. Nvidia is doing very similar work, also on video codecs. The world leader in GPUs – for gaming, for graphics, for AI – needs to have the fastest and highest-quality video. Of course they are building their own codecs.
Object detection for the Mars sample return program
Another cool video example (but not a codec) is from NASA/JPL, from the team that brought you Ingenuity, the Mars helicopter. Now they are designing a Harris corner detector, an image feature-detection algorithm, as part of development for the Mars sample return project. The original implementation was in RTL as a DSP-like function, but this proved difficult to optimize. The speaker described approaches using SystemC, implementing either a DSP process or a Kahn (essentially self-timed) process, using the flexibility HLS offers for experimenting with these options.
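JPL’s actual implementation isn’t public, but the core of a Harris detector is compact, which is part of what makes it a good HLS candidate. A minimal sketch in the synthesizable-C++ style an HLS flow favors (static bounds, no dynamic allocation); the tile size and the sensitivity constant K are illustrative assumptions, not JPL’s values:

```cpp
#include <array>
#include <cmath>

constexpr int W = 8, H = 8;   // image tile size (assumption)
constexpr float K = 0.04f;    // Harris sensitivity constant (typical 0.04-0.06)

// Harris response at pixel (x, y), summing the structure tensor
// over a 3x3 window. A large positive R indicates a corner.
float harris_response(const std::array<std::array<float, W>, H>& img,
                      int x, int y) {
    float sxx = 0.f, syy = 0.f, sxy = 0.f;
    for (int j = -1; j <= 1; ++j) {
        for (int i = -1; i <= 1; ++i) {
            // central-difference gradients
            float ix = (img[y + j][x + i + 1] - img[y + j][x + i - 1]) * 0.5f;
            float iy = (img[y + j + 1][x + i] - img[y + j - 1][x + i]) * 0.5f;
            sxx += ix * ix;
            syy += iy * iy;
            sxy += ix * iy;
        }
    }
    // R = det(M) - K * trace(M)^2
    float det = sxx * syy - sxy * sxy;
    float tr  = sxx + syy;
    return det - K * tr * tr;
}
```

Fixed loop bounds like these are what let an HLS tool explore unrolling, pipelining, or a streaming (Kahn-style) decomposition without touching the algorithm itself.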
OK, so video applications like these are still in that same algorithmic niche I was talking about earlier. But the business relevance of the video processing niche has exploded, carrying HLS along with it.
NXP, a leader in automotive electronics, is working on a complete baseband for ultra-wideband (UWB), the technology you will soon be using for ultra-secure keyless entry to your car (your current Bluetooth-enabled keyless entry is not so secure), and at some point maybe also for contactless payment, for the same reason. They found their traditional approach to designing the baseband, starting from Simulink, was too slow to converge. Much of the functionality here is signal processing; think filters and equalizers in multiple channels, for example. Such a design demands high levels of parallelism at high clock rates, which is difficult to architect in a timing-unaware platform. The application must also be very low power; think of UWB in a car key fob, running off a coin-cell battery. These designs must build on custom-crafted signal processing.
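To give a flavor of why HLS fits this style of design: a fixed-tap FIR filter written as plain C++ loops can be unrolled by the tool into parallel multipliers, so the parallelism-versus-clock-rate trade-off becomes a synthesis directive rather than a rewrite. A minimal sketch, with a made-up tap count and coefficients (nothing here reflects NXP’s actual baseband):

```cpp
#include <array>
#include <cmath>

constexpr int TAPS = 4;  // illustrative; real channel filters have many more
constexpr std::array<float, TAPS> COEFF = {0.25f, 0.25f, 0.25f, 0.25f};

struct Fir {
    std::array<float, TAPS> delay{};  // shift register of past samples

    // One sample in, one sample out per call (one clock, once pipelined).
    float step(float sample) {
        // Shift in the new sample; HLS maps this to registers.
        for (int i = TAPS - 1; i > 0; --i) delay[i] = delay[i - 1];
        delay[0] = sample;
        // Multiply-accumulate across all taps. Fully unrolled, this
        // becomes TAPS multipliers working in parallel.
        float acc = 0.f;
        for (int i = 0; i < TAPS; ++i) acc += COEFF[i] * delay[i];
        return acc;
    }
};
```

The same source can be resynthesized as one shared multiplier (low power, low rate) or TAPS parallel multipliers (high rate), which is exactly the architectural exploration a Simulink-to-hand-coded-RTL flow makes slow.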
A new company, Viosoft, is building a complete RAN physical layer for 5G (the radio unit piece of the network), from rate matching/channel mapping to modulation, time/frequency synchronization, MIMO/beamforming, RF processing and more. This must handle multiple bandwidth and latency requirements and multiple transmission frequencies. Once more, lots of signal processing with huge demand for flexibility. The application will be built on an FPGA but still must be power optimized because it will be sitting in a potentially remote location.
Wireless, lots of signal processing, and low-power demands: once again requiring custom design solutions, built through HLS.
Smart sensing and wireless power transfer
ST provided a fascinating 3-part pitch. The first section was on infrared sensing for people detection in a room using a smart sensor, a technology useful for energy-saving controls. Sensing is on a grid within a room, allowing machine learning of movement patterns through a neural network, which is where they use HLS.
The next application was a Qi (wireless power transmission) demodulator, a modem-like (and therefore DSP-like) function extracting power rather than information from the signal. The third application was a contactless infrared sensor, something familiar to all of us now thanks to COVID. A prior implementation did the temperature calculations in an embedded processor. This work pushes the calculation into the smart sensor, first establishing corrections for ambient temperature and for the sensed object temperature, then using the Stefan-Boltzmann law (yay physics!) to compute the temperature of the object. Note these are simply formulae, not DSP or ML operations, but they do use floating-point math for precision, so the HLS approach was an easy choice.
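ST’s firmware isn’t shown, but the physics step is straightforward: a thermopile’s signal is proportional to the difference of the fourth powers of the object and ambient temperatures, so the object temperature follows by inverting that relation. A hedged sketch (the function name and the sensitivity constant k are illustrative, standing in for the sensor’s real calibration):

```cpp
#include <cmath>

// Invert the Stefan-Boltzmann relation for a thermopile sensor:
//   signal = k * (Tobj^4 - Tamb^4)
// => Tobj = (signal / k + Tamb^4)^(1/4)
// k is a calibrated sensitivity constant (made-up here); temperatures in kelvin.
double object_temp_kelvin(double signal, double t_amb_kelvin, double k) {
    double t4 = signal / k + std::pow(t_amb_kelvin, 4.0);
    return std::pow(t4, 0.25);
}
```

The fourth root is why floating-point precision matters here, and why moving this onto an HLS-generated floating-point datapath in the sensor itself is attractive.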
What I like here is the applicability of HLS to these consumer-oriented applications, where cost and power will both be critical.
I skipped a couple of talks: one from Nvidia research on modeling interconnect in SystemC to get a feel for latencies as a function of layout, and another from Siemens EDA on MatchLib, the open-source library originally developed by Nvidia in support of this modeling. All good stuff, but not directly relevant to my theme here of the compelling demand for HLS in multiple applications.
Bottom line: best-fit algorithms still tend to be signal-processing-centric, but big markets now see huge value in custom hardware development around those algorithms. You can watch the entire set of talks HERE.