I touched earlier on challenges that can appear in AI systems which operate as black boxes, particularly deep learning systems. The problems are limited when such systems are applied to simple recognition tasks, e.g. recognizing a speed limit posted on a sign. In these cases, the recognition task is (from a human viewpoint) simply choosing from among a limited set of easily distinguished options, so an expert observer can easily determine if/when the system made a bad decision.
But as AI is extended to more complex tasks, it becomes increasingly difficult to accurately grade the performance of those systems. Certainly, there will still be many cases where an expert observer can classify performance easily enough. But what about cases where the expert observer isn’t sure? Where is the student surpassing the master and where is the student simply wrong?
This reminded me of an important Indian mathematician, Srinivasa Ramanujan, whose methods were in some ways as opaque as current AI systems. A movie released this year, The Man Who Knew Infinity, covers Ramanujan’s career and challenges. He had an incredible natural genius for mathematics, but chose to present results with little or no evidence for how he got there (apparently because he couldn’t afford the extra paper required to write out the proofs).
This lack of demonstrated proofs raised concerns among professional mathematicians of Ramanujan’s time. Mathematical rigor requires a displayed proof leading to the result, so that other experts can validate (or disprove) the claim. This is not unlike the above-mentioned concern with modern deep learning systems. For conclusions which a human expert can easily classify there is no problem, but for more complex assertions a bald statement of a conclusion is insufficient. We want to know how the system arrived at that conclusion for one of two reasons: it might be wrong and if so we want to know where it went wrong so we can fix it (perhaps by improving the training set), or it might be right in which case we’d like to know why so we can improve our own understanding.
Recent work at UC Berkeley and the Max Planck Institute for Informatics has made progress in this direction for deep learning systems. The underlying mechanics are the same, but the researchers use multiple training datasets so that the system can both deliver a conclusion and justify the sub-steps leading to that conclusion. The domain for the study is image recognition, specifically determining aspects of what is happening in an image (for example, what sport is being played).
The research team noted that a system-generated chain of reasoning may not correspond to how a human expert would think of a problem, so a better approach needs some user friendliness. Instead of presenting the user with a proof, let them ask questions which the system should answer, an approach known as visual question answering (VQA). While this may not lead to mathematically rigorous proofs, it seems very appropriate for many domains where a human expert wants to feel sufficiently convinced but may not need every possible proof point.
The method requires two principal components: VQA augmented by spatial attention, where the system looks at localized image features (such as a figure) to draw conclusions (this person is holding a bat), and more global activity recognition/explanation (this is a baseball game). The training datasets were annotated through crowdsourcing with “proposition because explanation” labels.
The research team also wanted to point to an object supporting a proposition. For example, if the VQA asserted “this person is holding a bat”, they wanted to point to the bat. This is where the attention aspect of the model becomes important. You could imagine this kind of capability being critical in a medical diagnosis where perhaps a key aspect of the diagnosis rests on an assumption that a dark spot in an X-ray corresponds to a tumor. “This patient has a tumor as shown in this X-ray” is hardly a sufficient proposition, whereas “this patient has a tumor in the liver as shown at this location in this X-ray” is much more usable information and something a doctor could confirm or challenge.
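To make the pointing idea a little more concrete, here is a minimal sketch of soft spatial attention in plain Python. It is not the UCB/MPI model: the region names, the three-number feature vectors and the query vector are all made-up values for illustration, and the helper names `softmax` and `attend` are my own. The only point is the mechanism, in which each image region gets a relevance score against the question, the scores become weights, and the highest-weighted region is the one the system would point to as evidence.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, query):
    """Score each region against the query; return (weights, index of best region)."""
    # Dot product as a simple relevance score (an assumption, not the paper's scoring).
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    best = max(range(len(weights)), key=weights.__getitem__)
    return weights, best

# Hypothetical 2x2 grid of image regions with made-up feature vectors.
regions = {
    "top-left": [0.1, 0.0, 0.2],
    "top-right": [0.9, 0.8, 0.1],  # pretend this region contains the bat
    "bottom-left": [0.2, 0.1, 0.0],
    "bottom-right": [0.0, 0.3, 0.1],
}
query = [1.0, 1.0, 0.0]  # made-up embedding standing in for "holding a bat"

names = list(regions)
weights, best = attend([regions[n] for n in names], query)
print(f"evidence region: {names[best]}")
for name, weight in zip(names, weights):
    print(f"{name}: {weight:.2f}")
```

In a real system the features would come from a trained network rather than being typed in by hand, but the output has the same shape: a conclusion plus a location a human expert can inspect and challenge.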
As we aim to push AI into more complex domains, this kind of justification process will become increasingly important. Which of us would trust a medical diagnosis delivered by a machine without a medical expert first reviewing and approving that diagnosis? Collision avoidance in a car may not allow time to review before taking action, but subsequent litigation may quite possibly demand review on whether the action taken was reasonable. Even (and perhaps especially) where AI is being used to guide scientific discovery or proof, the AI will need to demonstrate a chain of reasoning which human experts can test for robustness. It will be a very long time before “because my AI system said so” will be considered a sufficient alternative to peer review. Which is really the point. Important decisions, whether made by people or machines, should not be exempt from peer review.
You can read the UCB/MPI arXiv paper HERE.