The Linley Group held its Fall Processor Conference 2021 last week. There were a number of informative talks from various companies updating the audience on the latest research and development work happening in the industry. The presentations were grouped by focus into eight sessions: Applying Programmable Logic to AI Inference, SoC Design, Edge-AI Software, High-Performance Processors, Low-Power Sensing & AI, Server Acceleration, Edge-AI Processing, and High-Performance Processor Design.
Edge-AI processing has garnered a lot of attention in recent years, and hardware accelerators are increasingly being designed in for this important function. One of the presentations in the Edge-AI Processing session was titled “A Packet-based Approach for Optimal Neural Network Acceleration.” The talk was given by Sharad Chole, Co-Founder and Chief Scientist at Expedera, Inc. Sharad makes a strong case for rethinking how deep learning accelerators (DLAs) are implemented. He presents details of Expedera’s DLA platform and how its packet-based accelerator solution enables optimal results. The following is what I gathered from Expedera’s presentation at the conference.
As fast as the market for edge processing is growing, the performance, power, and cost requirements of these applications are growing even faster. AI adoption is shifting processing requirements toward data manipulation rather than general-purpose computing. Deep learning models are evolving rapidly, and an ideal accelerator solution must optimize for many different metrics at once. Hardware accelerator solutions are being sought to meet the needs of a growing number of consumer and commercial applications.
Inefficiencies in Neural Network Acceleration
Traditional DLA architectures break down neural networks (NNs) into granular work units for execution. This approach directly limits performance because the hardware cannot execute higher-level functions directly. While CPU-centric solutions offer flexibility and room for optimization, they are non-deterministic and fall short on power efficiency. Interpreter-centric solutions offer layer-level reordering optimizations, but they require large amounts of on-chip memory, a resource that is precious, particularly in edge devices. Benchmark studies indicate that current AI inference SoCs operate at only 20-40% utilization. Efforts to improve performance efficiency often prove counterproductive: increasing throughput with larger batch sizes hurts latency; improving accuracy with higher compute precision consumes additional bandwidth and power; and targeting higher system utilization increases software complexity. Deploying trained models with these solutions is cumbersome and time-consuming.
Packet-based Approach for NN Acceleration
To overcome these inefficiencies, Expedera breaks neural networks down into work units sized optimally for its DLA, calling them packets. A packet in this context is a contiguous fragment of a NN layer, along with its context and dependencies. Packets allow simple compilation of the neural network into packet streams, which are executed natively on the Expedera DLA platform.
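Expedera's actual packet format and compiler are proprietary, so the following is only a toy sketch of the general idea, under my own assumptions: each layer's output rows are split into contiguous fragments, and each fragment records the upstream fragments it depends on, so the whole network flattens into a stream of self-describing work units. The `Packet` class, the row-based tiling, and the field names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical illustration only -- Expedera's real packet format is proprietary.
# A "packet" here is a contiguous fragment of one layer's work, plus context
# (layer name, output-row range) and dependencies (ids of producing packets).

@dataclass
class Packet:
    packet_id: int
    layer: str                  # which NN layer this fragment belongs to
    row_range: tuple            # contiguous slice [start, end) of output rows
    deps: list = field(default_factory=list)  # upstream packet ids

def packetize(layers, rows_per_packet=4, total_rows=8):
    """Split each layer's output rows into contiguous packets, chaining
    dependencies so a fragment waits only on the fragments that produce
    the rows it consumes (assumed 1:1 row mapping for simplicity)."""
    stream, prev_layer_pkts, next_id = [], [], 0
    for layer in layers:
        cur = []
        for start in range(0, total_rows, rows_per_packet):
            end = start + rows_per_packet
            # Depend on any previous-layer packet whose rows overlap ours.
            deps = [p.packet_id for p in prev_layer_pkts
                    if p.row_range[0] < end and p.row_range[1] > start]
            pkt = Packet(next_id, layer, (start, end), deps)
            next_id += 1
            cur.append(pkt)
            stream.append(pkt)
        prev_layer_pkts = cur
    return stream

# Two layers, 8 output rows each, 4 rows per packet -> a 4-packet stream.
stream = packetize(["conv1", "conv2"])
```

Because every packet carries its own context and dependency list, a runtime can execute the stream natively, overlapping independent fragments without ever materializing a full layer's intermediate output.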
This packet-based approach yields many benefits. The design is simplified and performance is improved. Memory and bandwidth requirements are drastically reduced, allowing the DLA hardware to be better right-sized. For example, on a popular benchmark the packet-based approach was shown to reduce DDR transfers by more than 5x compared to layer-based processing. The approach provides cascading benefits, including less intermediate data movement, higher throughput, lower system power, and reduced BOM cost.
As edge processing workloads evolve, applications need to support multiple models and increasing data rates. And SoCs need to support a mix of applications. The packet-based DLA codesign approach delivers a high-performance solution that is scalable and power efficient. It allows for parallel use of independent resources. Expedera DLA enables zero-cost context switching and provides for multi-tenant application support.
Expedera’s Compiler achieves high performance out of the box. Its Estimator allows right-sizing of the DLA hardware, and its Runtime scheduler orchestrates the best sequence of NN segments based on application requirements, enabling seamless deployments.
Benefits of Expedera’s DLA Platform
- Best performance per Watt
- Smaller designs
- Lower power
A Deterministic Advantage
Because packets are complete with context and dependencies, the packet-stream approach guarantees cycle-accurate DLA performance. Packet-DLA codesign enables deterministic, high-performance compilation: the exact execution cycles, as well as memory and bandwidth needs, are known ahead of time. This determinism is a prized advantage in edge applications, where low and consistent latency is important.
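To see why ahead-of-time knowledge yields determinism, consider this toy sketch (my own construction, not Expedera's tooling): if every packet's work unit is fixed at compile time, total cycles and peak scratch memory are simple exact sums and maxima over the stream, computable before the model ever runs. The per-packet cost table below is invented for illustration.

```python
# Hypothetical illustration: with fixed work units, the execution budget is
# a pure function of the compiled packet stream -- no runtime profiling,
# no variance. The cost numbers below are made up for the example.

CYCLES = {"conv": 120, "pool": 30}    # assumed cycles per packet type
MEMORY = {"conv": 2048, "pool": 512}  # assumed scratch bytes per packet type

def static_budget(packet_stream):
    """Return (total_cycles, peak_memory) for a sequential packet stream,
    known exactly at compile time."""
    total_cycles = sum(CYCLES[op] for op in packet_stream)
    peak_memory = max(MEMORY[op] for op in packet_stream)
    return total_cycles, peak_memory

budget = static_budget(["conv", "conv", "pool"])  # → (270, 2048)
```

A real compiler would account for overlap between parallel resources rather than a sequential sum, but the principle is the same: with deterministic work units, latency is a compile-time quantity, not a runtime measurement.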
In a nutshell, Expedera’s customers can easily and rapidly implement AI SoC designs for edge processing that deliver optimal deep learning acceleration for their applications. Achieving optimal AI performance means solving a multi-dimensional problem. With a comprehensive SDK built on Apache TVM**, Expedera’s accelerator IP platform enables ideal accelerator configuration selection, accurate NN quantization, and seamless deployment.
To learn about how Expedera’s DLA IP performance compares against other DLA IP solutions, refer to a whitepaper published by the Linley Group. The whitepaper titled “Expedera Redefines AI Acceleration for the Edge” can be downloaded from here.
** Apache TVM is an open-source machine learning compiler framework for CPUs, GPUs, and machine learning accelerators.