In all the enthusiasm around machine learning (ML) and intelligent vision, we tend to forget the front end of this process. The image captured on a CCD camera goes through some very sophisticated image processing before ML even gets to work on it. The devices and IP blocks that do this are called image signal processors (ISPs). You might not be aware (I wasn’t) that Arm is in this game and has been for 15+ years, working with companies like Nikon, Sony, Samsung and HiSilicon. It is now targeting the trillion IoT devices it expects to see, of which a high percentage are likely to need vision in one form or another.
So what do ISPs do? As Thomas Ensergueix (Sr Dir of Embedded at Arm) explained it to me, this largely comes down to raising the level of visual acuity in “digital eyes” to the level we have in our own eyes. A big factor here is handling the high dynamic range (HDR) that you will often find in raw images. And to get better than human eyes, you want performance in low-light conditions and the ability to handle 4K resolution (professional photography level) at smartphone frame rates or better.
Look at the images of a street scene above, a great example of the dynamic range problem. Everything is (just barely) visible in the standard image on the left, but in attempting to balance between the bright sky and the rest of the image, the street becomes quite dark; you wouldn’t even know there was a pedestrian on the right, about to walk out into the road. You can’t fix this problem by twiddling global controls; you need much more sophisticated processing through a quite complex pipeline.
An ISP pipeline starts with raw processing and raw noise reduction, followed by a step called de-mosaicing, which fills out the incomplete color images that result from how imagers manage color (a color filter array overlaying the CCD, so each pixel records only one color). Then the image goes into HDR management and color management steps. Arm views the noise reduction, HDR management and color management as particular differentiators for its product line.
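To make the de-mosaicing step concrete, here is a minimal sketch in Python. It assumes an RGGB Bayer layout for the color filter array and fills in each missing color by averaging the nearest samples of that color. The function names are mine, purely for illustration, and nothing here is taken from Arm's implementation; a production ISP will use far more sophisticated, edge-aware interpolation.

```python
def bayer_channel(row, col):
    """Which color an (assumed) RGGB color filter array samples at (row, col)."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def demosaic(raw):
    """Turn a single-channel Bayer frame into a grid of (R, G, B) tuples.

    The measured color at each pixel is kept as-is; the two missing colors
    are the mean of same-color samples in the surrounding 3x3 window.
    Assumes the frame is at least 2x2 so every window holds all three colors.
    """
    height, width = len(raw), len(raw[0])
    image = []
    for r in range(height):
        out_row = []
        for c in range(width):
            sums = {"R": 0.0, "G": 0.0, "B": 0.0}
            counts = {"R": 0, "G": 0, "B": 0}
            # Gather neighbors, clipping the window at the frame borders.
            for rr in range(max(0, r - 1), min(height, r + 2)):
                for cc in range(max(0, c - 1), min(width, c + 2)):
                    ch = bayer_channel(rr, cc)
                    sums[ch] += raw[rr][cc]
                    counts[ch] += 1
            pixel = {ch: sums[ch] / counts[ch] for ch in "RGB"}
            # Do not blur the color this pixel actually measured.
            pixel[bayer_channel(r, c)] = float(raw[r][c])
            out_row.append((pixel["R"], pixel["G"], pixel["B"]))
        image.append(out_row)
    return image
```

Feeding this a raw frame in which every red site reads 200, every green site 100 and every blue site 50 gives back a uniform (200, 100, 50) image, which is a handy sanity check that the interpolation is pulling each color from the right neighbors.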
Thomas said that in particular they use their Iridix technology to manage HDR better than conventional approaches. Above on the left, an image has been optimized using conventional global dynamic range compression. You can see the castle walls quite clearly and the sky isn’t a completely white blur, but it doesn’t accurately reflect what you would see yourself. The image on the right is much closer. You can see clouds in the sky, and the castle walls are clearer, as are other areas. This is because Iridix uses local tone mapping, adjusting each region of the image according to its surroundings, rather than global balancing to get a better image.
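The difference between global and local tone mapping is easy to sketch. The Python below is not Iridix, whose internals are not described here; it is a toy under my own assumptions. Both versions use the same simple compression curve, L / (L + a), and differ only in where the adaptation level a comes from: the mean of the whole frame (global) or the mean of a small window around each pixel (local). All function names and the test scene are hypothetical.

```python
def make_scene(rows=4, cols=8):
    """Toy HDR frame: dark textured left half, bright textured right half."""
    scene = []
    for r in range(rows):
        row = []
        for c in range(cols):
            low, high = (0.02, 0.08) if c < cols // 2 else (8.0, 12.0)
            row.append(low if (r + c) % 2 == 0 else high)
        scene.append(row)
    return scene


def window_mean(lum, r, c, radius):
    """Mean luminance in a square window around (r, c), clipped at the borders."""
    rows = range(max(0, r - radius), min(len(lum), r + radius + 1))
    cols = range(max(0, c - radius), min(len(lum[0]), c + radius + 1))
    values = [lum[rr][cc] for rr in rows for cc in cols]
    return sum(values) / len(values)


def compress(value, adaptation):
    """Map luminance into [0, 1); a value equal to `adaptation` lands on 0.5."""
    total = value + adaptation
    return value / total if total > 0 else 0.0


def global_tone_map(lum):
    """One curve for the whole frame, adapted to the frame-wide mean."""
    flat = [v for row in lum for v in row]
    mean = sum(flat) / len(flat)
    return [[compress(v, mean) for v in row] for row in lum]


def local_tone_map(lum, radius=1):
    """A separate curve per pixel, adapted to the mean of its neighborhood."""
    return [
        [compress(lum[r][c], window_mean(lum, r, c, radius))
         for c in range(len(lum[0]))]
        for r in range(len(lum))
    ]


scene = make_scene()
global_out = global_tone_map(scene)
local_out = local_tone_map(scene, radius=1)
```

In the global result the dark half is crushed toward zero because the bright half dominates the frame-wide mean, so the texture in the shadows all but disappears. In the local result the same shadow texture is spread across a usable part of the output range. A naive local operator like this one tends to produce halos where dark and bright regions meet, which is part of why real implementations are considerably more involved.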
Arm recently introduced two new products that include this capability: the Mali-C52 for full-range applications requiring near human-eye response, and the Mali-C32 for value-priced applications. In addition to improved HDR management, they use Arm's Sinter and Temper technologies to reduce spatial and temporal noise in images. In color management, beyond basic handling, Arm has just introduced a new 3D color enhancer to allow subjective tuning of color. Finally, all of this is built on a new pixel pipeline which can handle 600M pixels/sec, easily enabling DSLR resolution at 60 frames/sec.
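It is worth doing the arithmetic on that throughput figure. The short Python sketch below works out the per-frame pixel budget at 60 frames/sec and compares it with a 4K UHD frame (3840 x 2160). It counts active pixels only and ignores blanking intervals and any other pipeline overhead, which is my simplifying assumption; the function names are illustrative.

```python
PIXEL_RATE = 600_000_000  # pixels per second, the throughput quoted above
FRAME_RATE = 60           # frames per second


def pixels_per_frame(pixel_rate, frame_rate):
    """Largest frame, in pixels, that fits the throughput at this frame rate."""
    return pixel_rate // frame_rate


def required_rate(width, height, frame_rate):
    """Pixels per second needed to stream frames of this size."""
    return width * height * frame_rate


budget = pixels_per_frame(PIXEL_RATE, FRAME_RATE)  # 10,000,000 pixels
uhd_rate = required_rate(3840, 2160, FRAME_RATE)   # 497,664,000 pixels/sec
headroom = PIXEL_RATE - uhd_rate                   # spare throughput at 4K60
```

So the pipeline has room for frames of about 10 megapixels at 60 frames/sec, and 4K at that rate uses a little under 500M pixels/sec, leaving roughly 100M pixels/sec of headroom.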
So when you think about smart vision for pedestrian detection, intruder detection or whatever application you want to target, spare a thought for the front-end image processing. In vision as in everything else, garbage in inevitably becomes garbage out. Even less-than-perfect input limits the accuracy of what can come out. Object recognition has to start with the best possible input to deliver credible results. A strong ISP plays a big part in meeting that objective.