Designing an IDCT for H.265 using High Level Synthesis

by Daniel Payne on 07-27-2015 at 8:00 pm

Math geeks know all about Inverse Discrete Cosine Transforms (IDCT), and a popular use is in the hardware architecture of High Efficiency Video Coding (HEVC), also known as H.265, the new video compression standard that is widely used in consumer and industrial video devices. You could hand-code RTL to create an IDCT function, but that would take more lines of code and more precious engineering time than using a higher-level language like C++ or SystemC. The promise of High Level Synthesis (HLS) is that you can code your video algorithms in less time and with less code than RTL, thus getting to market quicker with less engineering effort.

Uday Das from Calypto presented a tutorial at the #52DAC event last month in San Francisco on the subject “Building an IDCT for H.265 Using Catapult”, so I reviewed the 46 slides and share my impressions in this brief blog. The HEVC specification calls for transform units in four sizes (4×4, 8×8, 16×16 and 32×32) to code the prediction residual. The hardware architecture here uses a row-column decomposition that performs a 1-D operation on each row, followed by another 1-D operation on each column.
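To make the row-column idea concrete, here is a minimal C++ sketch of a 4×4 inverse transform built from two passes of a 1-D routine. It is my own illustration and not code from the tutorial slides: the names idct4_1d and idct4x4_row_column are hypothetical, the coefficients are the integer values of the HEVC 4-point core transform, and the rounding shifts and clipping that a standard-conformant decoder applies between the passes are left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<std::int32_t, 4>;
using Block4 = std::array<Vec4, 4>;

// Integer HEVC 4-point core transform matrix (forward orientation).
static const int kC4[4][4] = {
    {64,  64,  64,  64},
    {83,  36, -36, -83},
    {64, -64, -64,  64},
    {36, -83,  83, -36}};

// 1-D inverse transform: multiply the input by the transposed matrix.
Vec4 idct4_1d(const Vec4& x) {
    Vec4 y{};
    for (std::size_t n = 0; n < 4; ++n)
        for (std::size_t k = 0; k < 4; ++k)
            y[n] += kC4[k][n] * x[k];
    return y;
}

// 2-D inverse transform: 1-D pass over every row, then over every column.
Block4 idct4x4_row_column(const Block4& in) {
    Block4 tmp{};
    Block4 out{};
    for (std::size_t r = 0; r < 4; ++r)
        tmp[r] = idct4_1d(in[r]);              // row pass
    for (std::size_t c = 0; c < 4; ++c) {
        Vec4 col{};
        for (std::size_t r = 0; r < 4; ++r)
            col[r] = tmp[r][c];                // gather one column
        const Vec4 res = idct4_1d(col);        // column pass
        for (std::size_t r = 0; r < 4; ++r)
            out[r][c] = res[r];                // scatter it back
    }
    return out;
}
```

A block holding only a DC coefficient of 1 comes out as a flat block of 64 × 64 = 4096, which is a quick sanity check on the two passes.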


Algorithm
The IDCT algorithm can be described recursively, with a lower-order matrix embedded in each higher-order matrix. The signal flow graph details an 8-point IDCT, A8, made up of a 4-point 1-D IDCT, A4, and an odd matrix, M4.
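Written out, the decomposition looks roughly like the equations below. This is my own reconstruction under stated assumptions, not a copy of the slide: x_0 to x_7 are the eight input coefficients, y_0 to y_7 are the outputs, the entries of M_4 are the integer values from the odd rows of the HEVC 8-point core matrix, and the scaling shifts are omitted.

```latex
\[
\begin{aligned}
\mathbf{e} &= A_4 \begin{pmatrix} x_0 \\ x_2 \\ x_4 \\ x_6 \end{pmatrix}, \qquad
\mathbf{o} = M_4 \begin{pmatrix} x_1 \\ x_3 \\ x_5 \\ x_7 \end{pmatrix}, \qquad
M_4 = \begin{pmatrix}
89 &  75 &  50 &  18 \\
75 & -18 & -89 & -50 \\
50 & -89 &  18 &  75 \\
18 & -50 &  75 & -89
\end{pmatrix} \\
y_k &= e_k + o_k, \qquad y_{7-k} = e_k - o_k, \qquad k = 0, \dots, 3
\end{aligned}
\]
```

The even-indexed inputs go through the smaller IDCT, the odd-indexed inputs go through the odd matrix, and the last line combines the two halves with one add and one subtract per output pair.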

Data flow for this algorithm can be designed using two major functions: Butterfly and Mult_odd.

An interface description can then be written in either C or SystemC, where C code is more compact:


A core class can be written once and then reused for the 4-, 8-, 16- and 32-point versions of the Mult_odd and Butterfly member functions.
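As a rough idea of what such a reusable class could look like, here is a hedged C++ sketch. The names IdctCore and OddCoef, and the use of a size template parameter, are my assumptions and not the class from the tutorial. Only the 4- and 8-point stages are filled in (plus the trivial smaller stages they recurse into); the 16- and 32-point stages would need their own coefficient tables following the same pattern. Scaling shifts are omitted.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Odd-part coefficient tables, indexed by H = N/2 (values from the HEVC
// integer core transform). Hypothetical helper, not from the slides.
template <std::size_t H> struct OddCoef;
template <> struct OddCoef<1> {
    static int at(std::size_t, std::size_t) { return 64; }
};
template <> struct OddCoef<2> {
    static int at(std::size_t j, std::size_t k) {
        static const int m[2][2] = {{83, 36}, {36, -83}};
        return m[j][k];
    }
};
template <> struct OddCoef<4> {
    static int at(std::size_t j, std::size_t k) {
        static const int m[4][4] = {{89,  75,  50,  18},
                                    {75, -18, -89, -50},
                                    {50, -89,  18,  75},
                                    {18, -50,  75, -89}};
        return m[j][k];
    }
};

template <std::size_t N>
class IdctCore {
    static_assert(N == 2 || N == 4 || N == 8, "sketch covers N = 2, 4, 8");
public:
    using Vec = std::array<std::int32_t, N>;
    using Half = std::array<std::int32_t, N / 2>;

    // Multiply the odd-indexed inputs by the odd matrix.
    static Half mult_odd(const Vec& x) {
        Half o{};
        for (std::size_t k = 0; k < N / 2; ++k)
            for (std::size_t j = 0; j < N / 2; ++j)
                o[k] += OddCoef<N / 2>::at(j, k) * x[2 * j + 1];
        return o;
    }

    // Combine even and odd halves into N outputs.
    static Vec butterfly(const Half& e, const Half& o) {
        Vec y{};
        for (std::size_t k = 0; k < N / 2; ++k) {
            y[k] = e[k] + o[k];
            y[N - 1 - k] = e[k] - o[k];
        }
        return y;
    }

    // N-point IDCT: half-size IDCT on even inputs, odd matrix on odd inputs.
    static Vec run(const Vec& x) {
        Half even_in{};
        for (std::size_t j = 0; j < N / 2; ++j) even_in[j] = x[2 * j];
        return butterfly(IdctCore<N / 2>::run(even_in), mult_odd(x));
    }
};

// Base case: a 1-point transform is a scale by 64.
template <>
class IdctCore<1> {
public:
    using Vec = std::array<std::int32_t, 1>;
    static Vec run(const Vec& x) {
        Vec y{};
        y[0] = 64 * x[0];
        return y;
    }
};
```

Calling IdctCore<8>::run with only the second coefficient set to 1 returns 89, 75, 50, 18, -18, -50, -75, -89, which is the first odd basis row of the 8-point matrix. The same two member functions serve every size, and only the coefficient table changes.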

The Butterfly function is common for all sizes, and notice that there is no timing information at this level. The HLS tool Catapult will unroll the loop to create hardware for parallel execution.
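To show what unrolling buys you, the sketch below puts a size-generic butterfly loop next to a hand-unrolled 4-point version. Both functions are hypothetical illustrations of mine and are plain C++ with no tool directives. In an HLS flow you would keep the loop form and let the tool unroll it. The hand-unrolled version is only there to show the four independent adders and subtractors that result.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Loop form: one description for every transform size, no timing implied.
template <std::size_t N>
std::array<std::int32_t, N> butterfly(
        const std::array<std::int32_t, N / 2>& even,
        const std::array<std::int32_t, N / 2>& odd) {
    std::array<std::int32_t, N> y{};
    for (std::size_t k = 0; k < N / 2; ++k) {
        y[k] = even[k] + odd[k];          // upper half of the outputs
        y[N - 1 - k] = even[k] - odd[k];  // mirrored lower half
    }
    return y;
}

// What full unrolling of butterfly<4> amounts to: four independent
// add/subtract operations with no loop counter, free to run in parallel.
std::array<std::int32_t, 4> butterfly4_unrolled(
        const std::array<std::int32_t, 2>& even,
        const std::array<std::int32_t, 2>& odd) {
    std::array<std::int32_t, 4> y{};
    y[0] = even[0] + odd[0];
    y[1] = even[1] + odd[1];
    y[2] = even[1] - odd[1];
    y[3] = even[0] - odd[0];
    return y;
}
```

Because no iteration depends on the result of another, the loop body can be replicated freely, which is why this function is a natural candidate for parallel hardware.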


Our functional model of the 1-D IDCT has instances of function calls and some muxes:

To meet the H.265 specification we have to make a parallel implementation and create a 2-D IDCT using some hierarchy:

Using HLS
Designers use the HLS tool Catapult by adding design files, clicking on the hierarchy tab to select the top-level blocks, then clicking on libraries to select a specific technology and RAM models. Next you click on mapping and choose a target clock frequency, then map your data_in and data_out as RAM.

You next select your main loop and see which resources are being used in the design:

To schedule when operations are to occur you click on the schedule tab and work with a Gantt chart. Finally, you are ready to generate RTL code.

Verification
To double-check that the generated RTL code actually does what we had in mind with our algorithm, we need to create a testbench and verification flow. Most of this process is now push-button automated for us:

The transactors are what convert function calls into pin-level signal activity.
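The C++ side of such a testbench boils down to driving the same stimulus into a golden reference and the design under test, then comparing results. Here is a minimal sketch of that comparison under my own assumptions: idct4_reference, idct4_dut and count_mismatches are hypothetical names, the "DUT" is just a butterfly-form C++ function standing in for the synthesized block, and none of this is the tool's generated verification code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

using Vec4 = std::array<std::int32_t, 4>;

// Golden reference: direct multiply by the transposed HEVC 4-point matrix.
Vec4 idct4_reference(const Vec4& x) {
    static const int c[4][4] = {{64,  64,  64,  64},
                                {83,  36, -36, -83},
                                {64, -64, -64,  64},
                                {36, -83,  83, -36}};
    Vec4 y{};
    for (std::size_t n = 0; n < 4; ++n)
        for (std::size_t k = 0; k < 4; ++k)
            y[n] += c[k][n] * x[k];
    return y;
}

// Stand-in for the design under test: the butterfly form of the same math.
Vec4 idct4_dut(const Vec4& x) {
    const std::int32_t e0 = 64 * (x[0] + x[2]);
    const std::int32_t e1 = 64 * (x[0] - x[2]);
    const std::int32_t o0 = 83 * x[1] + 36 * x[3];
    const std::int32_t o1 = 36 * x[1] - 83 * x[3];
    Vec4 y{};
    y[0] = e0 + o0;
    y[1] = e1 + o1;
    y[2] = e1 - o1;
    y[3] = e0 - o0;
    return y;
}

// Drive both models with the same pseudo-random 16-bit coefficients
// and count the vectors on which their outputs differ.
int count_mismatches(int num_vectors, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> coeff(-32768, 32767);
    int mismatches = 0;
    for (int i = 0; i < num_vectors; ++i) {
        Vec4 x{};
        for (auto& v : x) v = coeff(rng);
        if (idct4_reference(x) != idct4_dut(x)) ++mismatches;
    }
    return mismatches;
}
```

A call such as count_mismatches(1000, 1u) should return zero when the two models agree. In the RTL flow the comparison idea stays the same, with the transactors sitting between the function calls and the pins of the generated design.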


Summary
The tutorial from DAC showed me that C++ and SystemC describe my video hardware more compactly than RTL code does. The Catapult HLS tool is used to control micro-architectural decisions so that I can trade off power, performance and area.

Companies like Google have found that using HLS on their VP9 video compression design was 2X faster than the previous approach of hand-coding RTL, while dramatically reducing the number of lines written. Give the folks at Calypto a call to start discussing how appropriate HLS is for your hardware architecture. You may just find that you can get your next IP or SoC to market in less time with fewer engineers, a nice benefit.