HLS in a Stanford Edge ML Accelerator Design
by Bernard Murphy on 06-16-2022 at 6:00 am

I wrote recently about Siemens EDA’s philosophy of designing quality in from the outset, rather than trying to verify it in after the fact. The first step is moving design up a level of abstraction. They mentioned the advantages of HLS in this respect, and I qualified that to “for DSP-centric applications”. A Stanford group recently presented at a Siemens EDA-hosted webinar, extending that range to ML accelerators for the edge. Their architecture is built around several innovations, together with an enthusiastic endorsement of the value of HLS in designing the accelerator core.


Key Innovations

Karthik Prabhu, a doctoral candidate in EE at Stanford, presented their Chimera SoC, whose goal is to support training at the edge with excellent performance yet at edge-like low power. To that end, the design uses resistive RAM (RRAM) for weight storage, eliminating the need to go off-chip for this data. The SoC architecture supports scale-out to multiple chips, something they call an Illusion system, with chip-to-chip interfacing (protocol not mentioned). I would imagine this might be even more effective in a multi-chiplet implementation, but as a proof of concept I’m sure the multi-chip version is enough.

For ResNet-18 on ImageNet they measured energy at 8.1 mJ/image, latency at 60 ms/image, average power at 136 mW and efficiency at 2.2 TOPS/W. Since the intent is to support on-chip training, they do note RRAM’s drawbacks: high write energy and relatively low write endurance. The tests they ran appear to converge on trained networks within the endurance bound; however, they didn’t mention how they overcome the write-energy issue during training.
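As a quick sanity check, those numbers are self-consistent: 8.1 mJ/image delivered over 60 ms/image works out to 8.1 mJ / 60 ms = 135 mW, matching the reported 136 mW average power to within rounding.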

Architecting the accelerator core

This section could have been taken directly from an earlier Siemens EDA tutorial. The team started with a convolution algorithm (6 nested loops in this case) over input activations, weights and output activations; a sketch of that loop nest follows below. Their goal was to map it to a systolic array of processing elements (PEs), considering many possible variables in the architecture. How many PEs would they need in the array, how many levels of memory hierarchy, and how should they size the buffers in that hierarchy? For dataflow optimization, should they prefer a weight-stationary, output-stationary or row-stationary scheme?
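For readers who haven’t seen this starting point, here is a minimal C++ sketch of the kind of 6-nested-loop convolution they began with. The loop order and names are my own illustration, not Chimera’s actual source:

// 2D convolution as 6 nested loops over output channels (K), input
// channels (C), output rows/columns (OY, OX) and filter rows/columns
// (FY, FX). Dimensions are illustrative template parameters.
template <int K, int C, int OY, int OX, int FY, int FX>
void conv2d(const float in[C][OY + FY - 1][OX + FX - 1],
            const float w[K][C][FY][FX],
            float out[K][OY][OX]) {
    for (int k = 0; k < K; ++k)                  // output channels
        for (int oy = 0; oy < OY; ++oy)          // output rows
            for (int ox = 0; ox < OX; ++ox) {    // output columns
                float acc = 0.0f;
                for (int c = 0; c < C; ++c)              // input channels
                    for (int fy = 0; fy < FY; ++fy)      // filter rows
                        for (int fx = 0; fx < FX; ++fx)  // filter columns
                            acc += w[k][c][fy][fx] * in[c][oy + fy][ox + fx];
                out[k][oy][ox] = acc;
            }
}

The architecture questions above are really questions about how to tile, reorder and parallelize this loop nest across an array of PEs and a buffer hierarchy.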

They used Interstellar, an open-source design space exploration tool for CNN accelerator architectures, also from Stanford, to optimize the architecture. I think this is pretty cool. They input a basic spec of the neural net (the layers in the network and tensor sizes), a range of memory sizes to explore, and cost information for a MAC, a register file and a memory. Based on this input, Interstellar told them to use a 16×16 systolic array with a 9-wide vector unit inside each PE, a 16KB input buffer, no weight buffer and a 32KB accumulation buffer. And many more details!
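Captured as compile-time constants, that recommended configuration looks something like the following. The struct and its names are my own illustration of the numbers above, not Interstellar’s actual output format:

// Architecture parameters reported from the Interstellar exploration,
// expressed as constants a C++/HLS implementation could consume.
struct AcceleratorConfig {
    static constexpr int kArrayRows     = 16;        // 16x16 systolic array
    static constexpr int kArrayCols     = 16;
    static constexpr int kVectorLanes   = 9;         // 9-wide vector per PE
    static constexpr int kInputBufBytes = 16 * 1024; // 16KB input buffer
    static constexpr int kWeightBufBytes = 0;        // no weight buffer
    static constexpr int kAccumBufBytes = 32 * 1024; // 32KB accumulation buffer
};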

Implementation

The Chimera team used Catapult to implement the accelerator, which they accomplished in 2-3 months, a timeframe they reasonably argued would not have been possible had they implemented in RTL. They also stressed another advantage: heavy use of C++ templates to parametrize much of the implementation. That simplified adjusting implementation details, from buffer sizes to how weights were distributed to reduce wiring congestion. This level of parametrization also made it easy to reuse the implementation in follow-on designs.
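A minimal sketch of that parametrization style, assuming hypothetical class and parameter names (the Chimera sources were not shown):

#include <cstdint>

// Array shape and buffer sizes are template parameters, so retuning the
// microarchitecture is a one-line edit followed by re-synthesis in Catapult.
template <int ROWS, int COLS, int LANES, int IN_BUF_BYTES, int ACC_BUF_BYTES>
class SystolicAccelerator {
    int8_t  input_buf[IN_BUF_BYTES];      // input activation buffer
    int32_t accum_buf[ACC_BUF_BYTES / 4]; // partial-sum accumulation buffer
public:
    static constexpr int kMacLanes = ROWS * COLS * LANES; // total MAC lanes
    void clear_accumulators() {
        for (int i = 0; i < ACC_BUF_BYTES / 4; ++i)
            accum_buf[i] = 0;
    }
};

// The Interstellar-recommended configuration from above: 16x16 array,
// 9-wide vectors, 16KB input buffer, 32KB accumulation buffer.
using ChimeraCore = SystolicAccelerator<16, 16, 9, 16 * 1024, 32 * 1024>;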

There’s a nice description of the verification flow. All test development was at the C++ level, allowing for fast testing: a 10-second simulation in C++ versus a 1-hour parallelized simulation in RTL. (Catapult also generated the infrastructure to map these tests to RTL.) They caught almost all bugs at the C++ level and could experiment with design tweaks given the fast turnaround. This also allowed them to verify training, which requires running many samples through the design; only C++-based simulation made that practical.
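In the spirit of that flow, a toy C++ testbench might drive the same stimulus through a golden reference and the synthesizable model and compare outputs. Everything here is illustrative; the function names and the trivial MAC under test are stand-ins:

#include <cmath>
#include <cstdio>
#include <cstdlib>

// Golden reference: a plain 9-wide multiply-accumulate.
float golden_mac(const float* a, const float* b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// Stand-in for the synthesizable Catapult source (identical here; in a
// real flow this version would use fixed-point/hardware datatypes).
float hls_mac(const float* a, const float* b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

int main() {
    const int lanes = 9; // one 9-wide vector MAC, per the Interstellar result
    float a[lanes], b[lanes];
    for (int t = 0; t < 100000; ++t) { // many random tests still run in seconds
        for (int i = 0; i < lanes; ++i) {
            a[i] = static_cast<float>(std::rand()) / RAND_MAX;
            b[i] = static_cast<float>(std::rand()) / RAND_MAX;
        }
        if (std::fabs(golden_mac(a, b, lanes) - hls_mac(a, b, lanes)) > 1e-5f) {
            std::printf("mismatch at test %d\n", t);
            return 1;
        }
    }
    std::printf("all tests passed\n");
    return 0;
}

The same C++ tests are then reused against the generated RTL via the Catapult-produced infrastructure, which is what makes the 10-second-versus-1-hour comparison meaningful.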

An interesting bottom line to this work is that they implemented Chimera in 40nm (I’m guessing for the RRAM support?). A comparison SoC, implemented in 16nm, shows higher core energy and about the same total energy and latency per image. Not bad! All in all, a useful validation from an obviously credible academic research source. You can watch the session HERE.
