
AI-Enabled Embedded Systems / EfficientNet: Compound Model Scaling and More

Al Gharakhanian

Intriguing Embedded Module from SolidRun Armed with Gyrfalcon AI Accelerator

SolidRun (www.solid-run.com) announced the availability of the i.MX 8M Mini System on Module (SOM) with some serious AI-acceleration horsepower, thanks to the Gyrfalcon Lightspeeur® 2803S accelerator chip. This module is a powerful platform containing all the ingredients needed to quickly prototype an AI-enabled edge device. The module measures 47mm x 30mm and is jam-packed with features such as:

  • Various grades of NXP host processor (based on ARM Cortex-A53)
  • 4GB LPDDR4 memory
  • Bluetooth and Wi-Fi connectivity (u-blox module)
  • PCIe 2.0
  • Robust multimedia prowess (support for 20 audio channels, MIPI-DSI, a 1080p encoder/decoder, and camera and display interfaces)
  • AI inference via Gyrfalcon's Lightspeeur® 2803S accelerator chip (24 TOPS/W, 9mm x 9mm)
  • Power dissipation: 3.3W
  • MSRP: $56

Admittedly the list above does not do justice to all the capabilities of this product, but the intention here is not to duplicate the data sheet. What is significant, in my view, can be summarized in just four numbers:

1. 24 TOPS/W (quite sufficient for most edge vision inference workloads)
2. 3.3 Watts (high for battery-operated devices but suitable for a whole host of industrial, automotive, medical, and robotics applications)
3. 47mm x 30mm (can be much smaller if optimized for a specific application)
4. $56 (given time, volume, and good negotiation the price can be much lower)

I am certain we will see a flurry of small and low-cost embedded systems with tremendous AI horsepower that can serve a myriad of applications and use cases. So why is this significant?
Imagine a scenario in which every instrument, device, machine, or vehicle costing more than a few hundred dollars can be easily enhanced (by a similar module) and enabled to use historical data to gain predictive, inferential, and recognition abilities above and beyond its baseline features. Do you see value in this? I certainly do. Not in every case imaginable, but in most. Don't get me wrong: I am not ignoring or redefining IoT here. This goes above and beyond IoT. Successful deployment of IoT requires an infrastructure angle, but there are hundreds of legacy use cases (that can benefit from AI) that will do just fine without being connected to millions of other nodes. I believe most of the buzz around AI has centered on IoT, autonomous vehicles, surveillance cameras, and robots, but there are numerous other applications that can also benefit from having artificial cognitive capability.


Mixed Precision Training with 8-bit Floating Point

A group of researchers at Intel Labs has demonstrated that 8-bit floating point representation (FP-8) can be as effective as FP-16 and FP-32 during training.
A bit of history first. The choice of numerical representation for weights, activations, errors, and gradients in Deep Neural Networks (DNNs) can have a dramatic impact on the die size and power dissipation of training and inference chips. It should come as no surprise that there are dozens of research teams working feverishly to find the most efficient numerical representation without sacrificing accuracy. This journey has been relatively easy for inference chips; even 8-bit integer representation has produced remarkable results there. Unfortunately, the same can't be claimed for training. Presently the most common numerical format for training is 16-bit floating point (FP-16). There is ample evidence that FP-16 can come very close to FP-32 in validation accuracy. A number of teams have also attempted to use integer representation for training, but the results have been mixed at best. Some have succeeded in improving the outcomes, but at the expense of additional hardware (for stochastic rounding).
Researchers at Intel Labs have moved away from INT-8 and have instead proposed a new scalable solution using FP-8 compute primitives that no longer require additional hardware for stochastic rounding, resulting in a significant reduction in the cost and complexity of the MAC units. They have shown state-of-the-art accuracy with FP-8 representation of weights, activations, errors, and weight gradients across a broad set of popular data sets. In some cases, their accuracy has been better than FP-32. A truly remarkable accomplishment.
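To get a feel for how coarse an 8-bit float is, here is a minimal sketch of FP-8 rounding. This is not the Intel team's actual scheme; it assumes a 1-4-3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits) and ignores subnormals and special values, and the function name is my own:

```python
import math

def quantize_fp8(x, exp_bits=4, man_bits=3):
    """Round x to the nearest value representable in a simplified FP-8
    format (1 sign bit, exp_bits exponent bits, man_bits mantissa bits).
    Subnormals, infinities, and NaN are ignored for clarity."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    bias = 2 ** (exp_bits - 1) - 1           # 7 for a 4-bit exponent
    e = math.floor(math.log2(mag))
    e = max(min(e, bias + 1), 1 - bias)      # clamp to the normal range
    m = mag / 2.0 ** e                        # mantissa in [1, 2)
    m = round(m * 2 ** man_bits) / 2 ** man_bits  # keep man_bits fraction bits
    return sign * m * 2.0 ** e

# With only 3 mantissa bits, nearby values collapse onto a coarse grid:
print(quantize_fp8(0.3))   # 0.3125 -- the nearest representable value
print(quantize_fp8(1.0))   # 1.0 is exactly representable
```

The coarseness of this grid is exactly why naive FP-8 training was expected to need stochastic rounding; the Intel result is notable because it avoids that extra hardware.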


EfficientNet: Compound Model Scaling in CNNs
Kudos to Quoc Le and his team at Google AI for coming up with the concept of “Compound Model Scaling”. The findings have led to dramatic improvements in the size and efficiency of Convolutional Neural Network (CNN) implementations.

Background
The standard practice for choosing a CNN architecture is to start with a baseline model and apply various scaling strategies to improve its accuracy and efficiency while staying within a given resource budget. Scaling in CNNs can be done in three ways:
1. Width Scaling: Increase the number of channels (neurons) in each layer
2. Depth Scaling: Add layers to various stages of the network (more convolutional, pooling, or fully connected layers)
3. Resolution Scaling: Increase the input resolution
Traditionally the process of scaling up a network has been tedious, to say the least, requiring numerous guesses and arbitrary attempts.
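These three dimensions have very different costs. As a rough rule of thumb (a simplification consistent with the EfficientNet paper's analysis), a CNN's compute grows roughly linearly with depth but quadratically with width and with input resolution. A tiny sketch of that cost model (the function name is my own):

```python
def relative_flops(depth=1.0, width=1.0, resolution=1.0):
    """Approximate FLOPs of a scaled CNN relative to its baseline:
    roughly linear in depth, quadratic in width and input resolution."""
    return depth * width ** 2 * resolution ** 2

print(relative_flops(depth=2.0))       # 2.0 -- doubling depth doubles cost
print(relative_flops(width=2.0))       # 4.0 -- doubling width quadruples cost
print(relative_flops(resolution=2.0))  # 4.0 -- so does doubling resolution
```

This asymmetry is part of why ad-hoc scaling of a single dimension tends to give diminishing returns.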

The Novelty
The Google researchers have shown that it is critical to balance the scaling of all dimensions of the network (width, depth, and resolution), and have proposed a method to find the optimal scaling balance. Furthermore, they have shown that an optimal balance can be achieved by scaling each dimension by a constant ratio. They have proposed a formal process (called compound scaling) that uses a simple grid search to come up with three fixed scaling coefficients (one for each dimension) and that optimizes the accuracy of the model given available resources. In practice, the optimization starts by picking a bare-minimum baseline model and finding the three scaling coefficients (given the available computational resources). Finally, the baseline network is scaled across all three dimensions using these coefficients.
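A minimal sketch of that process, assuming the paper's constraint that the per-dimension ratios satisfy alpha * beta^2 * gamma^2 ≈ 2 (so each increment of the compound coefficient phi roughly doubles FLOPs). The grid values and tolerance here are illustrative, and the baseline dimensions in the usage line are made up:

```python
import itertools

def candidate_ratios(target=2.0, tol=0.1):
    """Grid-search (alpha, beta, gamma) triples -- depth, width, and
    resolution ratios -- whose combined FLOPs growth alpha * beta^2 * gamma^2
    stays close to the target (~2x per unit of the compound coefficient)."""
    grid = [round(1.0 + 0.05 * i, 2) for i in range(9)]  # 1.00 .. 1.40
    for a, b, g in itertools.product(grid, repeat=3):
        if abs(a * b * b * g * g - target) <= tol:
            yield a, b, g

def compound_scale(base_depth, base_width, base_res, alpha, beta, gamma, phi):
    """Scale all three dimensions together by the chosen ratios raised to phi."""
    return (round(base_depth * alpha ** phi),
            round(base_width * beta ** phi),
            round(base_res * gamma ** phi))

# The ratios reported for EfficientNet (alpha=1.2, beta=1.1, gamma=1.15)
# satisfy the constraint and show up among the grid-search candidates:
assert (1.2, 1.1, 1.15) in list(candidate_ratios())
print(compound_scale(18, 64, 224, 1.2, 1.1, 1.15, phi=3))  # -> (31, 85, 341)
```

In the paper, each surviving candidate is ranked by actually training the scaled model; the grid search only narrows the space of ratios to try.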

Results
The team has demonstrated truly remarkable results. In one specific case they achieved state-of-the-art 84.4% top-1 and 97.1% top-5 accuracy on ImageNet while being 8.4x smaller and 6.1x faster at inference than the best-of-breed CNNs.


A Few Market Data Points

• According to IDC the worldwide shipments of AI-optimized processors for edge systems will reach 340M units in 2019 and will increase to 1.5B units in 2023
• According to Woodside Capital the total VC investment in semiconductor companies reached nearly $1B, 67% of which had to do with AI
• According to Strategy Analytics, the number of automotive image sensors will grow from 110M in 2019 to 330M in 2026
 