Designing the Right Architecture Using HLS

by Pawan Fangaria on 09-17-2014 at 9:05 am

With the advent of HLS tools, the general notion is that an automated tool can take a design description written in C++/SystemC, optimize it, and deliver perfect RTL. In real life it is not so. Any design description needs a hardware designer’s expertise to adopt the right algorithm and architecture and so fulfil the intent of the design; the desired RTL architecture must be understood before the description is written. Effectively, this is hardware design, not software synthesis. So, more than transforming an abstract hardware description into RTL, the major contribution of an HLS tool lies in improving QoR (Quality of Results) by tuning the micro-architecture according to HLS constraints and by taking the design from technology-independent to technology-specific. Calypto’s HLS process using Catapult has a dedicated ‘Architecture Refinement’ stage between the ESL Reference Model and the ESL Synthesizable Model.

Consider the above example of a simple filter model, in which the ‘multiply and accumulate’ loop can be unrolled for parallelism. The software code uses bit-accurate types (Algorithmic C or SystemC) with proper rounding and known sizes; the taps are internal and the coefficients are external. This software model can be synthesized easily.
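A filter of this kind might look roughly like the sketch below. It is not the code from the webinar: the names `Fir5` and `kTaps` are hypothetical, and plain `int16_t`/`int32_t` stand in for bit-accurate types such as `ac_int`, so rounding is left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kTaps = 5;  // hypothetical tap count

// Plain integers stand in for bit-accurate types (an assumption of this sketch).
struct Fir5 {
    std::array<int16_t, kTaps> taps{};  // internal state: the delay line

    // One call consumes one sample; coefficients arrive from outside.
    int32_t step(int16_t x, const std::array<int16_t, kTaps>& coeffs) {
        // Shift the delay line by one position and insert the new sample.
        for (int i = kTaps - 1; i > 0; --i) taps[i] = taps[i - 1];
        taps[0] = x;

        // Multiply-accumulate loop: the candidate for unrolling.
        int32_t acc = 0;
        for (int i = 0; i < kTaps; ++i) {
            acc += static_cast<int32_t>(taps[i]) * coeffs[i];
        }
        return acc;
    }
};

// Usage sketch: a unit impulse should replay the coefficients in order.
inline void fir5_impulse_check() {
    Fir5 f;
    const std::array<int16_t, kTaps> c{3, -1, 4, 1, -5};
    for (int i = 0; i < kTaps; ++i) {
        assert(f.step(static_cast<int16_t>(i == 0), c) == c[i]);
    }
}
```

Both loops have fixed bounds, which is what lets a tool unroll the accumulate loop into parallel multipliers or leave it rolled to share one.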

Now consider an optimized architecture (reduced area and complexity) of the folded 5-tap filter, as shown in the above picture, in which the coefficients are reduced to 3. Folding exploits coefficient symmetry: taps that share a coefficient are added first and multiplied once. The decision to share or unroll the summing adders can be made in HLS. As shown in the software model, loop merging in HLS can share the folding adder, a choice that becomes technology dependent.
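The folding idea can be sketched as follows, assuming a symmetric coefficient set c0, c1, c2, c1, c0. The function names are hypothetical and the code only illustrates the arithmetic, not the article's model.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Taps = std::array<int16_t, 5>;

// Direct form: five multiplies with the full, mirrored coefficient set.
inline int32_t direct5(const Taps& t, const std::array<int16_t, 5>& c) {
    int32_t acc = 0;
    for (int i = 0; i < 5; ++i) acc += static_cast<int32_t>(t[i]) * c[i];
    return acc;
}

// Folded form: mirrored taps are added first, so three multiplies suffice.
inline int32_t folded5(const Taps& t, const std::array<int16_t, 3>& c) {
    int32_t acc = 0;
    for (int i = 0; i < 2; ++i) {
        const int32_t pre_add =
            static_cast<int32_t>(t[i]) + static_cast<int32_t>(t[4 - i]);
        acc += pre_add * c[i];
    }
    acc += static_cast<int32_t>(t[2]) * c[2];  // centre tap has no mirror
    return acc;
}

// Usage sketch: both forms should agree when the coefficients are symmetric.
inline void folded5_check() {
    const Taps t{7, -2, 5, 9, -4};
    const std::array<int16_t, 3> folded_c{2, -3, 6};
    const std::array<int16_t, 5> full_c{2, -3, 6, -3, 2};
    assert(folded5(t, folded_c) == direct5(t, full_c));
}
```

The pre-adders replace two of the five multipliers. Whether those adders are then shared or replicated is the kind of decision the tool makes under its constraints.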

The untimed HLS model is technology and performance neutral. The number of taps and the appropriate level of folding or unrolling are decided from the system clock, the sampling frequency and other design parameters such as throughput. The area saving from folding is most pronounced in fully unrolled solutions that process one sample per clock cycle.

Above is an example of a circular buffer RAM implementation in which reads and writes are mutually exclusive, which allows a single-port RAM to be used for tap storage. A circular buffer in RAM is typically justified when the number of taps is large.
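A minimal software model of such a circular buffer is sketched below. `CircularFir` and `kDepth` are hypothetical names, and the array simply stands in for the single-port RAM. Each RAM access is either the one write or one of the reads, never both at once.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kDepth = 8;  // hypothetical number of taps

struct CircularFir {
    std::array<int16_t, kDepth> ram{};  // stands in for the single-port RAM
    int head = 0;                       // address of the newest sample

    int32_t step(int16_t x, const std::array<int16_t, kDepth>& coeffs) {
        // First access: the only write; it overwrites the oldest sample.
        head = (head + 1) % kDepth;
        ram[head] = x;

        // Remaining accesses: one read each, walking from newest to oldest.
        int32_t acc = 0;
        int addr = head;
        for (int i = 0; i < kDepth; ++i) {
            assert(addr >= 0 && addr < kDepth);
            acc += static_cast<int32_t>(ram[addr]) * coeffs[i];
            addr = (addr == 0) ? kDepth - 1 : addr - 1;
        }
        return acc;
    }
};
```

No data moves when a sample arrives; only the address pointer advances. The cost is one access per tap per sample, so this suits cases where many clock cycles are available per sample.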

Decimation reduces the sample rate by discarding samples, say 3 out of every 4, so it is wise to avoid the computational overhead of producing samples that will be discarded. Polyphase decimation computes the required result in phases to reduce this overhead.
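The sketch below compares the two approaches for a decimate-by-4 filter. All names and sizes (`kM`, `kL`, `PolyphaseDecimator`) are assumptions made for illustration. The reference routine filters every sample and then discards three of four results; the polyphase version spends only `kL` multiplies per input sample and accumulates one output over each block of `kM` inputs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kM = 4;        // decimation factor: keep 1 sample in 4
constexpr int kL = 2;        // taps per phase (hypothetical)
constexpr int kK = kM * kL;  // total filter taps
using Coeffs = std::array<int16_t, kK>;

// Reference: filter every sample, then throw three of four results away.
inline std::vector<int32_t> filter_then_discard(const std::vector<int16_t>& x,
                                                const Coeffs& h) {
    std::vector<int32_t> out;
    const int n_samples = static_cast<int>(x.size());
    for (int n = 0; n < n_samples; ++n) {
        int32_t acc = 0;
        for (int k = 0; k < kK && k <= n; ++k) {
            acc += static_cast<int32_t>(x[n - k]) * h[k];
        }
        if (n % kM == kM - 1) out.push_back(acc);
    }
    return out;
}

// Polyphase: each input feeds one short sub-filter, kL multiplies per sample.
struct PolyphaseDecimator {
    std::array<std::array<int16_t, kL>, kM> line{};  // one delay line per phase
    int phase = 0;    // position of the incoming sample within its block of kM
    int32_t acc = 0;  // partial result, built up over kM input samples

    // Returns true when y holds a new output sample.
    bool step(int16_t x, const Coeffs& h, int32_t& y) {
        auto& d = line[phase];
        for (int j = kL - 1; j > 0; --j) d[j] = d[j - 1];
        d[0] = x;
        for (int j = 0; j < kL; ++j) {
            acc += static_cast<int32_t>(d[j]) * h[j * kM + (kM - 1 - phase)];
        }
        if (++phase < kM) return false;
        y = acc;
        acc = 0;
        phase = 0;
        return true;
    }
};

inline std::vector<int32_t> polyphase_decimate(const std::vector<int16_t>& x,
                                               const Coeffs& h) {
    PolyphaseDecimator d;
    std::vector<int32_t> out;
    int32_t y = 0;
    for (int16_t s : x) {
        if (d.step(s, h, y)) out.push_back(y);
    }
    return out;
}

// Usage sketch: both routes should give identical decimated outputs.
inline void decimation_check() {
    const std::vector<int16_t> x{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    const Coeffs h{1, 2, 3, 4, 5, 6, 7, 8};
    assert(polyphase_decimate(x, h) == filter_then_discard(x, h));
}
```

Per output sample, the reference performs up to `kM * kK` multiplies while the polyphase form performs `kK`, a saving equal to the decimation factor.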

A more complex example comes from image processing. Below is sample code for image windowing, an edge detector.

It is inefficient to read an image 9 times to produce a single output image. Such cases call for a window and line buffer architecture; a line buffer is a circular buffer delay line implementation with a write and a read every cycle. In the above example, treating positions 0 through 8 as registers, injecting pixels into position 8 and shifting (with appropriately delayed inputs) yields the first pixel_out result at position 4. The line buffer can be implemented using a dual-port RAM with one read and one write, a single-port RAM with guaranteed read-before-write behavior, or double-width ping-pong read/write buffering.
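The window and line buffer idea can be sketched in plain C++ as below. The image size, the function names and especially the `kernel` function are assumptions; the kernel is a placeholder gradient, not the edge detector from the article. The streaming version reads each pixel once, and the reference version re-reads the image for every output.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

constexpr int kW = 8;  // hypothetical image width
constexpr int kH = 6;  // hypothetical image height

// Placeholder kernel (an assumption): a simple gradient around position 4.
inline int kernel(const std::array<int, 9>& w) {
    return std::abs(w[5] - w[3]) + std::abs(w[7] - w[1]);
}

// Streaming version: each pixel is read once, in raster order.
inline std::vector<int> window_stream(const std::vector<int>& img) {
    assert(img.size() == static_cast<std::size_t>(kW * kH));
    std::array<int, kW> line0{};  // holds row r-2
    std::array<int, kW> line1{};  // holds row r-1
    std::array<int, 9> w{};       // window registers, position 8 is newest
    std::vector<int> out;
    for (int r = 0; r < kH; ++r) {
        for (int c = 0; c < kW; ++c) {
            const int px = img[r * kW + c];
            // Shift every window row one column to the left.
            for (int row = 0; row < 3; ++row) {
                w[row * 3] = w[row * 3 + 1];
                w[row * 3 + 1] = w[row * 3 + 2];
            }
            // Read the delayed pixels before the buffers are overwritten.
            w[2] = line0[c];
            w[5] = line1[c];
            w[8] = px;
            line0[c] = line1[c];
            line1[c] = px;
            // The window is full once two rows and two columns have passed.
            if (r >= 2 && c >= 2) out.push_back(kernel(w));
        }
    }
    return out;
}

// Reference: nine image reads for every output pixel.
inline std::vector<int> window_direct(const std::vector<int>& img) {
    std::vector<int> out;
    for (int r = 1; r < kH - 1; ++r) {
        for (int c = 1; c < kW - 1; ++c) {
            std::array<int, 9> w{};
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    w[(dr + 1) * 3 + (dc + 1)] = img[(r + dr) * kW + (c + dc)];
                }
            }
            out.push_back(kernel(w));
        }
    }
    return out;
}
```

Each line buffer sees one read and one write at the same address per pixel, with the read first. That access pattern is what the dual-port, read-before-write and ping-pong options are different ways of providing.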

To implement appropriate hardware for a single-port RAM, a template can be defined for an SPRAM hardware_window class, together with the corresponding SPRAM class constructor and member functions. The RAM access operations are defined so that reads and writes are mutually exclusive. Similarly, shifting the window pixels, injecting data from the delay line and updating the window registers are each defined.
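As a rough illustration of mutually exclusive access with double-width words, the sketch below models a delay line on a single-port RAM. The class name `SpramDelayLine` and its interface are hypothetical; this is neither the hardware_window class from the webinar nor any Catapult API. Two pixels share one RAM word, even cycles only read, and odd cycles only write.

```cpp
#include <array>
#include <cassert>

// Hypothetical software model: a delay of 2 * Depth samples on one RAM port.
template <typename T, int Depth>
class SpramDelayLine {
    static_assert(Depth > 0, "Depth must be positive");
    using Word = std::array<T, 2>;  // double-width word: two pixels

public:
    // One call models one clock cycle and performs exactly one RAM access.
    T step(T in) {
        T out{};
        if (!odd_) {
            rd_ = ram_[addr_];  // even cycle: read only
            out = rd_[0];
            wr0_ = in;          // hold the first pixel of the pair
        } else {
            out = rd_[1];       // second pixel from the earlier read
            ram_[addr_] = Word{wr0_, in};  // odd cycle: write only
            addr_ = (addr_ + 1) % Depth;
        }
        odd_ = !odd_;
        return out;
    }

private:
    std::array<Word, Depth> ram_{};
    Word rd_{};         // read register (ping)
    T wr0_{};           // write holding register (pong)
    int addr_ = 0;
    bool odd_ = false;
};

// Usage sketch: with Depth = 3 the output lags the input by six samples.
inline void spram_delay_check() {
    SpramDelayLine<int, 3> d;
    for (int n = 1; n <= 12; ++n) {
        const int y = d.step(n);
        assert(y == (n > 6 ? n - 6 : 0));
    }
}
```

The stream still advances one pixel per cycle even though the RAM has a single port, because each access moves two pixels at a time.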

The above image shows the synthesis process in Catapult. The RAM array from the SPRAM class instance can be mapped to an SPRAM library. A 3×3 window on a 1920-pixel image width will have a 958-deep, double-width RAM. With 12-bit pixels, two lines to buffer and double width, the RAM must be 48 bits wide (12 × 2 × 2 = 48).

It’s clear from the above examples that the hardware expertise of an RTL designer is valuable when writing a description at a higher level of abstraction, which improves productivity in design exploration and optimization and accelerates verification and validation. For more detail on these examples and the actual synthesis process, attend the on-line webinar (a quick on-line registration is needed) presented by Stuart Clubb of Calypto. Stuart explains the code in great detail, pointing to specific variables, data and operations. It’s a must-attend webinar for designers and ESL specialists exploring how to write hardware descriptions for SoCs at the system level.
