Area-optimized AI inference for cost-sensitive applications
by Don Dingee on 02-15-2023 at 6:00 am

Often, AI inference brings to mind more complex applications hungry for more processing power. At the other end of the spectrum, applications like home appliances and doorbell cameras can offer limited AI-enabled features but must be narrowly scoped to keep costs to a minimum. New area-optimized AI inference technology from Expedera is taking on this challenge, targeting 1 TOPS performance in the smallest possible chip area.

Optimized for one model, but maybe not for others

Fitting into an embedded device brings constraints and trade-offs. For example, many teams concentrate on developing the inference model for an application using a GPU-based implementation, only to discover that no amount of optimization will get them anywhere near the required power-performance-area (PPA) envelope.

A newer approach uses a neural processing unit (NPU) to handle AI inference workloads more efficiently, delivering the required throughput with less die area and power consumption. NPU hardware typically scales up or down to meet throughput requirements, often measured in tera operations per second (TOPS). In addition, compiler software can translate models developed in popular AI modeling frameworks like PyTorch, TensorFlow, and ONNX into run-time code for the NPU.
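
For concreteness, the hand-off to such a compiler typically starts from a framework export. The short sketch below shows one plausible path, exporting a PyTorch model to ONNX before feeding it to a vendor tool flow; the model choice, file name, and input shape are assumptions for illustration, not details of Expedera's flow.

    # Hypothetical sketch: export a trained PyTorch model to ONNX, a common
    # interchange step before an NPU vendor's compiler generates run-time code.
    import torch
    import torchvision.models as models

    model = models.mobilenet_v2(weights=None)   # stand-in for a team's trained model
    model.eval()

    dummy_input = torch.randn(1, 3, 224, 224)   # one 224x224 RGB frame
    torch.onnx.export(
        model,
        dummy_input,
        "mobilenet_v2.onnx",                    # file handed off to the NPU tool flow
        input_names=["image"],
        output_names=["logits"],
        opset_version=13,
    )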

Following a long-held principle of embedded design, there’s a strong temptation for designers to optimize the NPU hardware for their application, wringing out every last cent of cost and milliwatt of power. If only a few AI inference models are in play, it might be possible to optimize the hardware tightly using a deep understanding of model internals.

Model parameters manifest as operations, weights, and activations, varying considerably from model to model. Below is a graphic comparing several popular lower-end neural network models.

Comparison of operations, weights, and activations across several popular lower-end neural network models
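
As a rough way to see these differences yourself, a few lines of framework code can tally a model's weight count and activation footprint. The sketch below uses PyTorch forward hooks on a stand-in MobileNetV2; the resulting numbers depend on the exact model and input shape and are not Expedera's figures.

    # Tally weights and per-inference activation values for a small model.
    import torch
    import torchvision.models as models

    model = models.mobilenet_v2(weights=None).eval()
    weight_count = sum(p.numel() for p in model.parameters())

    activation_counts = []              # one entry per leaf layer's output tensor
    def record_activation(module, inputs, output):
        if isinstance(output, torch.Tensor):
            activation_counts.append(output.numel())

    hooks = [m.register_forward_hook(record_activation)
             for m in model.modules() if len(list(m.children())) == 0]

    with torch.no_grad():
        model(torch.randn(1, 3, 224, 224))   # one 224x224 RGB frame

    for h in hooks:
        h.remove()

    print(f"weights: {weight_count / 1e6:.1f}M values, "
          f"activations: {sum(activation_counts) / 1e6:.1f}M values per inference")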

On top of these differences sits the neural network topology – how execution units interconnect in layers – adding to the variation. Supporting different models for additional features or modes leads to overdesigning with a one-size-fits-all NPU big enough to cover performance in all cases. However, living with the resulting cost and power inefficiencies may be untenable.
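
To put rough numbers on that sizing problem, the arithmetic below uses assumed, approximate operation counts for a few small networks and a 30 fps target. The figures are illustrative rather than vendor data, but they show how a one-size-fits-all NPU ends up sized for the heaviest model it might ever run.

    # Illustrative sizing arithmetic (assumed operation counts, not vendor data).
    approx_gops_per_frame = {       # rough ops per inference, in GOPs
        "MobileNetV2": 0.6,
        "EfficientNet-Lite0": 0.8,
        "Tiny YOLOv3": 5.6,
    }
    fps = 30                        # assumed real-time target

    for name, gops in approx_gops_per_frame.items():
        tops_needed = gops * fps / 1000.0
        print(f"{name}: ~{tops_needed:.2f} TOPS at {fps} fps")

    worst_case = max(approx_gops_per_frame.values()) * fps / 1000.0
    print(f"a one-size-fits-all NPU would be sized for ~{worst_case:.2f} TOPS")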

NPU co-design solves optimization challenges

It may seem futile to optimize AI inference in cost-sensitive devices when models are unknown at the start of a project, or when more than one model must run to support different modes. But is it possible to tailor an NPU more closely to a use case without an enormous investment in design time, or the risk of an AI inference model changing later?

Here’s where Expedera’s NPU co-design philosophy shines. The key is not hardcoding models in hardware but instead using software to map models to hardware resources efficiently. Expedera does this with a unique work sequencing engine, breaking operations down into metadata sent to execution units as a packet stream. As a result, layer organization becomes virtual, operations are ordered efficiently, and hardware utilization increases to 80% or more.

Expedera uses packet-centric scalability to move up and down in AI inference performance while maintaining efficiency
In some contexts, packet-centric scalability unlocks higher performance, but in Expedera’s area-optimized NPU technology, packets can also help scale performance down for the smallest chip area.
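
To make the packet idea concrete, here is a purely conceptual sketch of how a work-sequencing engine could slice layer operations into metadata packets and stream them to execution units. The packet fields and scheduling are assumptions for illustration, not Expedera's actual design.

    # Conceptual sketch of a packet-based work sequencer (illustrative only).
    from dataclasses import dataclass
    from typing import Iterator, List, Tuple

    @dataclass
    class WorkPacket:
        layer: str             # which (virtual) layer this slice belongs to
        op: str                # e.g. "conv3x3", "depthwise", "pointwise"
        tile: Tuple[int, int]  # output tile this packet covers
        weight_offset: int     # reference into weight storage, not the weights themselves

    def sequence_layers(layers: List[dict]) -> Iterator[WorkPacket]:
        """Flatten layer descriptions into an ordered stream of work packets."""
        for layer in layers:
            for tile in layer["tiles"]:
                yield WorkPacket(layer["name"], layer["op"], tile, layer["weight_offset"])

    layers = [
        {"name": "conv1", "op": "conv3x3",   "tiles": [(0, 0), (0, 1)], "weight_offset": 0},
        {"name": "dw2",   "op": "depthwise", "tiles": [(0, 0)],         "weight_offset": 864},
    ]

    for packet in sequence_layers(layers):
        print(packet)   # in hardware, each packet would be dispatched to an execution unit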

Smallest possible NPU for simple models

Customers say a smaller NPU that matches requirements and keeps costs to a minimum can make the difference between having AI inference or not having it in cost-sensitive applications. On the other hand, a general-purpose NPU might have to be overdesigned by as much as 3x, driving up die size, power requirements, and cost until a design is no longer economically feasible.

Starting with its Origin NPU architecture, fielded in over 8 million devices, Expedera tuned its engine for a set of low- to mid-complexity neural networks, including MobileNet, EfficientNet, NanoDet, Tiny YOLOv3, and others. The result is the new Origin E1 edge AI processor, putting area-optimized 1 TOPS AI inference performance in soft NPU IP ready for any process technology.

“The focus of the Origin E1 is to deliver the ideal combination of small size and lower power consumption for 1 TOPS needs, all within an easy-to-deploy IP,” says Paul Karazuba, VP of Marketing for Expedera. “As Expedera has already done the optimization engineering required, we deliver time-to-market and risk-reduction benefits for our customers.”

Seeing a company invest in more than just simple throughput criteria to satisfy challenging embedded device requirements is refreshing. For more details on the area-optimized AI inference approach, please visit Expedera’s website.

Blog post: Sometimes Less is More—Introducing the New Origin E1 Edge AI Processor

NPU IP product page: Expedera Origin E1
