Macbeth may have been uncertain of what he saw but, until recently, image recognition systems would have fared even less well. The energy and innovation put into increasingly complex algorithms always seemed to fall short of what any animal (including us humans) is able to do without effort. Machine vision algorithms have especially struggled to be robust to distortions, different lighting conditions, different poses, partially obscured images, low-quality images, shifts, and many more factors. Macbeth’s dagger might have been recognized face-on in ideal lighting but probably not in this troubled vision.
The problem seems to have been in the approach, a sort of brute-force attempt to algorithm our way to recognition. A different research path asked if we could model recognition on how we see and especially how the visual cortex maps images to recognized objects. First the image is broken up into small regions. Pixels within each region are tested against a function to detect a particular feature such as a diagonal edge. The function is simple – a weighted sum of the inputs, checked against a threshold function to determine if the output should trigger. Then a second feature test, now for a different feature (perhaps color), is applied again across each region in the image. This process repeats across multiple feature tests, as many as 100 or more.
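The feature test described above can be sketched in a few lines of Python. This is only an illustration, not the Cadence implementation: the names `feature_test`, `feature_map` and `DIAGONAL`, the 2x2 weights, the threshold value and the choice to slide the region one pixel at a time are all assumptions made for the example.

```python
def feature_test(region, weights, threshold):
    """Weighted sum of a region's pixels, checked against a step threshold."""
    total = sum(p * w
                for row_p, row_w in zip(region, weights)
                for p, w in zip(row_p, row_w))
    return 1 if total >= threshold else 0


def feature_map(image, weights, threshold):
    """Apply one feature test to every k-by-k region of the image."""
    k = len(weights)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Cut out the k-by-k region whose top-left corner is (r, c).
            region = [image[r + i][c:c + k] for i in range(k)]
            out_row.append(feature_test(region, weights, threshold))
        out.append(out_row)
    return out


# Hypothetical weights that respond to a top-left to bottom-right diagonal.
DIAGONAL = [[1, -1],
            [-1, 1]]

# A tiny 4x4 "image" containing one diagonal line.
image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

diagonal_map = feature_map(image, DIAGONAL, threshold=2)
for row in diagonal_map:
    print(row)
```

The output map fires only in the regions that sit on the diagonal. A second feature would simply be another set of weights passed through the same `feature_map` call, which is why many feature tests can share one piece of machinery.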
All of these outputs are fed into a second layer of neurons. The same process repeats, this time with a different set of functions which extract slightly higher-level details from the first level. This then continues through multiple layers until the final outputs provide a high-level characterization of the recognized object. Implement that in a compute engine of some kind and you have a Convolutional Neural Network (CNN). What the CNN is good at recognizing is determined by the weights used in each feature test; these are not set by hand but rather by training the CNN over sample images. Which is as it should be since this is mimicking functions of the brain.
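To give a feel for what "training the weights" means, here is a minimal sketch for a single weighted-sum-and-threshold unit. It uses the classic perceptron learning rule, which is a much simpler stand-in for the gradient-based training that real CNNs use, and it should not be read as the method Cadence applies. The sample patches, labels and the names `predict` and `train` are invented for the illustration.

```python
def predict(weights, bias, pixels):
    """Fire (1) if the weighted sum plus bias is positive, else 0."""
    total = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1 if total > 0 else 0


def train(samples, max_epochs=1000):
    """Perceptron rule: nudge the weights whenever a sample is misclassified."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(max_epochs):
        mistakes = 0
        for pixels, label in samples:
            error = label - predict(weights, bias, pixels)
            if error != 0:
                mistakes += 1
                # Move each weight toward (or away from) the input pattern.
                weights = [w + error * p for w, p in zip(weights, pixels)]
                bias += error
        if mistakes == 0:
            break  # every sample is now classified correctly
    return weights, bias


# Hypothetical training set: flattened 2x2 patches, label 1 = diagonal edge.
samples = [
    ([1, 0, 0, 1], 1),  # diagonal
    ([0, 1, 1, 0], 0),  # anti-diagonal
    ([1, 1, 0, 0], 0),  # horizontal bar (top)
    ([0, 0, 1, 1], 0),  # horizontal bar (bottom)
    ([0, 0, 0, 0], 0),  # empty
    ([1, 1, 1, 1], 0),  # full
]

weights, bias = train(samples)
print(weights, bias)
```

On this toy data the unit settles on weights of `[1, -1, -1, 1]`, a diagonal detector that nobody wrote by hand. A real CNN does the same kind of thing across many features and layers at once, with far more samples.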
CNNs were something of a backwater in vision research until 2012 when a CNN-based solution beat all comers in a widely respected image recognition challenge. Since then almost all competitive work in this area has switched to CNNs. These are now achieving correct detection rates (CDRs) in the high ninety percent range, besting not only other solutions but also human recognition in some cases, such as identifying breeds of dogs and species of birds.
Cadence has implemented a CNN on the Tensilica Vision P5 DSP. They used as their reference a test known as the German Traffic Sign Recognition Benchmark, a small sample of which is shown here. This should give you some sense of the recognition challenge: low lighting, glare, dappled lighting, signs at angles, signs barely in focus – these are fully up to the limits of our own ability to recognize. Cadence was able to achieve CDRs of over 99%, and nearly 99.6% with a proprietary algorithm which beats all known results to date. They have also demonstrated with this algorithm the ability to trade off a small compromise in accuracy for greatly reduced run times.
The Tensilica Vision P5 DSP is pretty much ideal for building CNNs. As a DSP, multiply-accumulate instructions (for all those weighted sum calculations) are native. It supports high levels of parallelism through a VLIW architecture and the ability to load long words from memory every cycle, so multiple image regions can be processed in parallel. And it has many other features which support the special functionality required by CNNs. All this is good but the results ultimately testify to the strength of the solution. Running the Cadence algorithm, this approach is able to identify 850 traffic signs per second. For pedestrian and obstacle recognition, and as we progress towards greater autonomy in cars, that kind of quick reaction time is critical.
No complete vision system will be implemented solely with a CNN. A complete system must first identify areas in the image to which recognition should be applied, then recognize, and finally provide guidance based on what has been recognized. This requires an architecture that is adept across a wide range of functions, supports a rich set of operations and multiple data types, and is balanced to support traditional vision algorithms as well as CNNs. These are demands for which the Vision P5 solution is well suited.
I find it interesting that Cadence is investing heavily in the software part of this solution. Rather than run benchmarks based on open-source software, they’ve built their own software and have gone deep enough to produce a best-in-class CNN algorithm. Where they take this next should be interesting to watch. To learn more about the Tensilica Vision P5 DSP and the Cadence CNN algorithm, click HERE.
If Macbeth had access to this technology, events might have taken a different though less poetic and certainly less dramatic turn:
Is this a dagger which I see before me? Let’s check this gizmo. No – definitely not a dagger. Well, can’t argue with technology. I’ll just have to tell the wife it’s off. I’m not going to kill the king.