For vision DSP IP running convolutional neural networks (CNNs), a big driver of performance is increasing the bits processed per cycle with parallel multiply-accumulate (MAC) units. Tom Simon did a great job in recent posts of introducing CNNs at a high level, so I’ll look at what is architecturally behind Cadence’s latest announcement: the Tensilica Vision P6 DSP.
CNNs are essentially pattern matching and sifting engines. A server farm goes after a large labeled dataset of images and derives a set of coefficients, or weights in a convolution, in a training procedure. Once those coefficients are defined, they can be loaded into an embedded CNN engine that rapidly processes a new incoming image by sifting it through successive convolutional layers until the desired pattern is found.
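To make the "sifting" concrete, here is a minimal sketch of what one convolutional filter does: slide a small grid of trained weights over the image and sum the products at each position. The image, the kernel values, and the function name `conv2d_valid` are all made up for illustration, and like most CNN frameworks this sketch skips the kernel flip (strictly speaking it is a cross-correlation). Every step of the inner loop is one MAC.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    # One multiply-accumulate (MAC) per kernel tap
                    acc += image[y + i][x + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# Hypothetical 4x4 image containing a vertical edge
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked weights that respond to a dark-to-bright vertical edge;
# in a real CNN these would come out of training
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)  # each row is [0, 18, 0]: the filter fires only on the edge
```

A real network stacks many such filters across many layers, which is why the MAC count, and the number of MACs a DSP can issue per cycle, matters so much.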
However, architecturally speaking, running a CNN over the entire incoming image is still inefficient and extremely computationally intensive. Most of an image is, well, boring – the actual information used in recognizing objects is contained in a few candidate regions of interest. A vision DSP can run more conventional algorithms to enhance the image and extract those regions, handing them over to the neural network side for quicker object recognition.
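A back-of-the-envelope sketch shows why handing over only regions of interest helps. Every size below (frame resolution, region size and count, kernel size, channel counts) is hypothetical, picked only to show the shape of the arithmetic, and `conv_layer_macs` is an illustrative helper rather than anything from Cadence.

```python
def conv_layer_macs(height, width, kernel, c_in, c_out):
    """MAC count for one stride-1 convolutional layer with same-size output."""
    return height * width * kernel * kernel * c_in * c_out

# Hypothetical first layer: 3x3 kernels, 3 input channels, 16 output channels
full_frame = conv_layer_macs(1080, 1920, 3, 3, 16)

# Hypothetical alternative: four 128x128 regions of interest from the same frame
rois = 4 * conv_layer_macs(128, 128, 3, 3, 16)

ratio = full_frame / rois
print(full_frame)  # 895795200 MACs for the whole frame
print(rois)        # 28311552 MACs for the four regions
print(ratio)       # 31.640625, i.e. roughly 32x less work in this made-up case
```

The saving scales with how small the interesting regions are relative to the frame, minus whatever the conventional algorithms cost to find them.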
That implies vision DSP IP needs to be good at both jobs, handling image processing and CNNs. Most DSPs have concentrated on the image processing side: more memory bandwidth, VLIW operations, floating point operations, deep pipelining, and more.
Leaps in memory bandwidth to keep the DSP fed are a good thing, but from there what makes a CNN perform well starts to differ. Image sensor data typically comes in with less than 16-bit resolution, and 8-bit coefficients are plenty wide for CNNs. Optimizing 8- and 16-bit operations and launching many operations on small data elements in a single cycle is the way to faster CNNs at the back end of the vision subsystem.
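As a rough illustration of why 8-bit coefficients can be enough, the sketch below quantizes a handful of made-up weights and pixel values to signed 8-bit integers, runs the dot product as integer MACs, and compares the rescaled result to the floating-point answer. The scale factors, values, and function names are assumptions for the example and say nothing about how the Vision P6 DSP handles quantization internally.

```python
def quantize_int8(values, scale):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dot_int8(a, b):
    """Integer multiply-accumulate over 8-bit operands.

    A single int8 product can reach 16384, so summing many taps
    needs an accumulator wider than 16 bits.
    """
    acc = 0
    for x, w in zip(a, b):
        assert -128 <= x <= 127 and -128 <= w <= 127
        acc += x * w
    return acc

# Hypothetical weights and normalized pixel values
weights = [0.50, -0.25, 0.125, 1.0]
pixels = [0.9, 0.1, 0.4, 0.7]

w_scale = 1.0 / 127  # assumed: values lie in [-1, 1]
p_scale = 1.0 / 127

qw = quantize_int8(weights, w_scale)
qp = quantize_int8(pixels, p_scale)

approx = dot_int8(qp, qw) * w_scale * p_scale  # rescale back to float
exact = sum(p * w for p, w in zip(pixels, weights))

print(exact, approx)  # the two agree to within a couple of percent
```

Narrow operands are what let a fixed-width vector unit pack in more lanes: the same register holds twice as many 8-bit elements as 16-bit ones.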
In the Vision P5 DSP, Cadence had the memory bandwidth and pipelining well handled. Just 7 months after the Vision P5 introduction, the opportunity for increasing CNN performance while maintaining image processing performance became clear. In the Vision P6 DSP, the big change is increasing the vector processing capability from 64 MACs to 256 MACs. Other enhancements include FP16 support in the optional 32-way SIMD vector floating point unit, and new custom instruction capability supporting CNNs.
The result is a massive 9728 bits processed per cycle – which Pulin Desai, Director of Product Marketing in the Imaging/Vision Group at Cadence, says is more than twice that of current competing DSP IP. Cadence has the Vision P6 DSP targeted at 1.1 GHz in 16nm FF. Existing software for the Vision P5 DSP will run on the Vision P6 DSP, and users can recompile to take advantage of the new features.
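For a sense of scale, the MAC counts and target clock quoted above can be turned into a theoretical peak. This is a sketch under a generous assumption, namely that all 256 MACs retire one operation every cycle at the full 1.1 GHz target; real workloads will land below that, and the figure is not a Cadence number.

```python
MACS_PER_CYCLE_P5 = 64    # from the article
MACS_PER_CYCLE_P6 = 256   # from the article
TARGET_CLOCK_HZ = 1.1e9   # 1.1 GHz target in 16nm FF, from the article

# Ratio of MAC resources between the two generations
speedup = MACS_PER_CYCLE_P6 / MACS_PER_CYCLE_P5

# Assumed: every MAC is busy every cycle (an upper bound, not a benchmark)
peak_gmacs = MACS_PER_CYCLE_P6 * TARGET_CLOCK_HZ / 1e9

print(speedup)               # 4.0
print(round(peak_gmacs, 1))  # 281.6 billion MACs per second, theoretical peak
```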
Blending traditional image processing with neural networks in one IP block has a lot of merit, using the strengths of each approach in vision processing to improve performance and reduce power. Seeing a DSP vendor like Cadence go back and optimize 8- and 16-bit operations specifically for CNNs is an architectural twist that may have a large payoff, although at this point the only comparison we have is Cadence's own.
More from Cadence in their press release:
Are CNNs the new battleground for embedded vision? We are seeing the DSP IP and the mobile GPU IP vendors all talking CNNs. Something to watch.