Inference and eFPGA are both data flow architectures. A single inference layer can take over a billion multiply-accumulates. Flex Logix directly connects the compute and RAM resources into a data path, like an ASIC, then repeats this layer by layer. Flex Logix uses a breakthrough interconnect architecture that needs less than half the silicon area of a traditional mesh interconnect and fewer metal layers, while delivering higher utilization and higher performance. The ISSCC 2014 paper detailing this technology won the ISSCC Lewis Winner Award for Outstanding Paper. The interconnect continues to be improved, resulting in new patents.
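To put the "over a billion multiply-accumulates" figure in perspective, the MAC count of a single standard convolution layer can be estimated from its shape. The sketch below is illustrative only: the function name and the layer dimensions (a 512x512 output map, 64 input and 64 output channels, 3x3 kernel) are assumptions chosen for the example, not figures from Flex Logix.

```python
def conv_layer_macs(out_h, out_w, in_ch, out_ch, kernel=3):
    """MACs for one standard 2D convolution layer (no groups, bias ignored).

    Every output element needs in_ch * kernel * kernel multiply-accumulates,
    and there are out_h * out_w * out_ch output elements.
    """
    return out_h * out_w * out_ch * in_ch * kernel * kernel


# Hypothetical layer shape, picked only to illustrate the order of magnitude.
macs = conv_layer_macs(out_h=512, out_w=512, in_ch=64, out_ch=64, kernel=3)
print(f"{macs:,} MACs for one layer")
```

For this assumed shape the count is roughly 9.7 billion MACs, so a single layer on a larger image can comfortably exceed the one-billion mark.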
We can easily scale up our Inference and eFPGA architectures to deliver compute capacity of any size. Flex Logix does this using a patented tiling architecture with interconnects at the edge of the tiles that automatically form a larger array of any size.
TIGHTLY COUPLED SRAM AND COMPUTE
SRAM is closely coupled with our compute tiles using another patented interconnect. Inference efficiency is achieved by closely coupling local SRAM with compute, which is 100x more energy efficient than DRAM bandwidth. This interconnect is also useful for many eFPGA applications.
Our eFPGA compiler has been in use by dozens of customers for several years. Our inference compiler takes TensorFlow Lite and ONNX models and programs our inference architecture, using our eFPGA compiler in the back end. A performance modeler for our inference architecture is available now.
Software drivers will be available for common server operating systems and for real-time operating systems for MCUs and FPGAs.
Flex Logix chip products will be available in PCIe Card format for Edge Servers and Gateways. Other formats like U.2 or M.2 can be supplied as well.
SUPERIOR LOW-POWER DESIGN METHODOLOGY
Flex Logix has numerous architecture and circuit design technologies to deliver the highest throughput at the lowest power.
InferX™ X1 Edge Inference Co-Processor
High Throughput, Low Cost, Low Power
The InferX X1 Edge Inference Co-Processor is optimized for what the edge needs: large models, run at batch=1. InferX X1 offers throughput close to that of data center boards that sell for thousands of dollars, but does so at single-digit watts and at a fraction of the price. InferX X1 is programmed using TensorFlow Lite and ONNX; a performance modeler is available now. InferX X1 is based on our nnMAX architecture, integrating 4 tiles for 4K MACs and 8MB of L2 SRAM. InferX X1 connects to a single x32 LPDDR4 DRAM. Four lanes of PCIe Gen3 connect to the host processor; an x32 GPIO link is available for hosts without PCIe. Two X1s can work together to increase throughput up to 2x.
InferX X1 has excellent Inference Efficiency, delivering more throughput on tough models for fewer dollars and fewer watts.
nnMAX™ Inference Acceleration Architecture
High Precision, Modular & Scalable
nnMAX is programmed with TensorFlow Lite and ONNX. Numerics supported are INT8, INT16 and BFloat16, and they can be mixed layer by layer to maximize prediction accuracy. INT8/16 activations are processed at full rate; BFloat16 at half rate. Hardware converts between INT and BFloat as needed, layer by layer. 3×3 convolutions of stride 1 are accelerated by Winograd hardware: YOLOv3 is 1.7x faster, ResNet-50 is 1.4x faster. This is done at full precision. Weights are stored in non-Winograd form to keep memory bandwidth low. nnMAX is a tile architecture, so any throughput required can be delivered with the right amount of SRAM for your model. Cheng Wang, Co-Founder and Senior VP of Flex Logix, presented a detailed update on nnMAX at the Autonomous Vehicle Hardware Summit.
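The idea behind Winograd acceleration can be shown with the textbook one-dimensional F(2,3) minimal filtering algorithm, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. This is a generic sketch, not a description of the nnMAX hardware; the function names are illustrative, and the tile size the hardware actually uses is not stated here. The filter transform is computed on the fly from plain weights, mirroring the point that weights are stored in non-Winograd form.

```python
from fractions import Fraction


def direct_f23(d, g):
    """Two outputs of a 3-tap, stride-1 filter computed directly: 6 multiplies."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3): 4 element-wise multiplies."""
    half = Fraction(1, 2)
    # Filter transform, derived on the fly from the plain 3-tap weights.
    u = [g[0], (g[0] + g[1] + g[2]) * half, (g[0] - g[1] + g[2]) * half, g[2]]
    # Input transform: additions and subtractions only.
    v = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # The only multiplications in the data path.
    m = [v[i] * u[i] for i in range(4)]
    # Output transform: additions and subtractions only.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


d = [1, 2, 3, 4]   # four consecutive input samples
g = [1, 0, -1]     # 3-tap filter
assert winograd_f23(d, g) == direct_f23(d, g)
```

Nesting this algorithm in two dimensions gives F(2x2, 3x3), which needs 16 multiplications per 2x2 output tile where the direct method needs 36. Exact fractions are used here so the equality check holds without rounding concerns.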
nnMAX has excellent Inference Efficiency, delivering more throughput on tough models for fewer dollars and fewer watts.
Think Inference Efficiency
TOPS is a misleading marketing metric. It is the number of MACs times the frequency: it is a peak number. Having a lot of MACs increases cost but only delivers throughput if the rest of the architecture is right.
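As a rough illustration of why TOPS is only a peak number, it can be computed from nothing more than the MAC count and the clock. The sketch below assumes the common convention of counting each MAC as two operations (a multiply and an add); set `ops_per_mac=1` to match the literal "MACs times frequency" wording. The 1 GHz clock is a placeholder assumption, not a published specification.

```python
def peak_tops(num_macs, clock_hz, ops_per_mac=2):
    """Peak tera-operations per second from MAC count and clock frequency.

    This assumes every MAC is busy on every cycle, which real models
    rarely achieve; it says nothing about delivered throughput.
    """
    return num_macs * clock_hz * ops_per_mac / 1e12


# 4K MACs (4096) at an assumed, purely illustrative 1 GHz clock.
print(peak_tops(num_macs=4096, clock_hz=1e9))
```

Nothing in this formula depends on memory bandwidth, SRAM capacity, or the model being run, which is why two chips with the same TOPS can deliver very different throughput.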
The right metric to focus on is Throughput: for your model, your image size, your batch size. Even ResNet-50 is a better indicator of throughput than TOPS (ResNet-50 is not the best benchmark because of its small image size: real applications process megapixel images). Inference Efficiency is achieved by getting the most throughput for the least cost (and power).
In the absence of cost information we can get a sense of throughput/$ by plotting throughput/TOPS, throughput/number of DRAMs & throughput/MB of SRAM: the most efficient architecture will need to get good throughput from each of these major cost factors. See our Inference Efficiency slides for more information.
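The three ratios named above are simple to compute once throughput and the main cost factors are known. The sketch below shows one way to tabulate them; the class name, field names, and every number in it are made-up placeholders for illustration and do not describe any real product.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    throughput_fps: float  # measured on your model, image size, and batch size
    peak_tops: float
    num_drams: int
    sram_mb: float

    def efficiency(self):
        """Throughput per unit of each major cost factor."""
        return {
            "fps_per_tops": self.throughput_fps / self.peak_tops,
            "fps_per_dram": self.throughput_fps / self.num_drams,
            "fps_per_mb_sram": self.throughput_fps / self.sram_mb,
        }


# Placeholder values only: a small chip versus a larger one.
chip_a = Accelerator("chip_a", throughput_fps=100, peak_tops=8, num_drams=1, sram_mb=8)
chip_b = Accelerator("chip_b", throughput_fps=200, peak_tops=100, num_drams=8, sram_mb=64)

for chip in (chip_a, chip_b):
    print(chip.name, chip.efficiency())
```

In this invented comparison, chip_b has twice the raw throughput but scores lower on all three ratios, which is the pattern the metrics are meant to expose.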
Dialog Semiconductor is using eFPGA to Increase Configurability of Dialog’s Advanced Mixed Signal Products
SEE THE PRESS RELEASE HERE.
Learn about Dialog’s Advanced Mixed Signal segment, click here for December 2019 presentation and webcast