When I search for ‘what is the next big thing in AI?’ I find a variety of suggestions around refining and better productizing what we already know. That is understandable for any venture aiming to monetize innovation in the near term, but I am more interested in where AI can move outside the box, to solve problems well beyond the purview of today’s deep learning and LLM technologies. One example is tackling math problems, a known area of weakness even for the biggest LLMs, GPT-4 included. OpenAI’s Q* and Google’s Gemini both make claims in this space.
I like this example because it illustrates an active area of research in reasoning with interesting ideas while also clarifying the scale of the mountain that must be climbed on the way to anything resembling artificial general intelligence (AGI).
Math word problems
Popular accounts like to illustrate LLM struggles with math through grade school word problems, for example (credit to Timothy Lee for this example):
John gave Susan five apples and then gave her six more. Susan then ate three apples and gave three to Charlie. She gave her remaining apples to Bob, who ate one. Bob then gave half his apples to Charlie. John gave seven apples to Charlie, who gave Susan two-thirds of his apples. Susan then gave four apples to Charlie. How many apples does Charlie have now?
Language recognition is obviously valuable in some aspects of understanding this problem, say in translating from a word-based problem statement to an equation-based equivalent. But I feel this step is incidental to LLMs’ trouble with math. The real problem is in evaluating the equations, which requires a level of reasoning beyond the statistical prompt/response matching of an LLM.
Working a problem in steps and positional arithmetic
The nature of an LLM is to respond in one shot to a prompt; this works well for language-centric questions. Language variability is tightly bounded by semantic constraints, so a reasonable match to the prompt is likely to be found with high confidence in the model, triggering an appropriate response. Math problems can have much more variability in values and operations; any given string of operations is therefore much less likely to appear in a general training pile, no matter how large the pile.
We humans learn early that you don’t try to solve such a problem in one shot. You solve one step at a time. This decomposition, called chain-of-thought reasoning, is something that must be added to a model. In the example above, first calculate how many apples Susan has after John hands over his apples. Then move to the next step. Obvious to anyone with skill in arithmetic.
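To make the decomposition concrete, here is the apple problem worked one step at a time, tracking the intermediate state that a chain-of-thought trace would make explicit (a hand-written sketch, not model output):

```python
# Track each person's apple count and apply the story one step at a time,
# the way a chain-of-thought trace exposes intermediate state.
apples = {"John": 0, "Susan": 0, "Charlie": 0, "Bob": 0}

apples["Susan"] += 5 + 6                                # John gives Susan 5, then 6 more
apples["Susan"] -= 3                                    # Susan eats 3
apples["Susan"] -= 3; apples["Charlie"] += 3            # Susan gives 3 to Charlie
apples["Bob"] += apples["Susan"]; apples["Susan"] = 0   # Susan gives her remainder to Bob
apples["Bob"] -= 1                                      # Bob eats 1
half = apples["Bob"] // 2                               # Bob gives half his apples to Charlie
apples["Bob"] -= half; apples["Charlie"] += half
apples["Charlie"] += 7                                  # John gives 7 to Charlie
given = apples["Charlie"] * 2 // 3                      # Charlie gives Susan two-thirds of his
apples["Charlie"] -= given; apples["Susan"] += given
apples["Susan"] -= 4; apples["Charlie"] += 4            # Susan gives 4 to Charlie

print(apples["Charlie"])  # -> 8
```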
Zooming further in, suppose you want to solve 5847+15326 (probably not apples). It is overwhelmingly likely that this exact calculation appears nowhere in the training dataset. Instead, the model must learn how to do arithmetic on positional notation numbers. First compute 7+6 = 13, put the 3 in the 1s position of the result and carry the 1. And so on. Easy as an explicit algorithm, but that’s cheating; here the model must learn how to do long addition. That requires training examples for adding two digits, each between 0 and 9, plus multiple training examples which demonstrate the process of long addition in chain-of-thought reasoning. This training will in effect build a set of rules in the model, but captured in the usual tangle of model parameters rather than as discernible rules. Once training is finished against whatever you decided was a sufficient set of examples, the model is ready to run against addition tests not seen in the training set.
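The explicit algorithm the model must learn to imitate looks like this, spelled out column by column the way the training examples would demonstrate it:

```python
def long_add(a: str, b: str) -> str:
    """Add two non-negative integers digit by digit with carries,
    exactly as a long-addition chain-of-thought would spell it out."""
    a, b = a.zfill(len(b)), b.zfill(len(a))   # pad to equal length
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):  # rightmost column first
        s = int(da) + int(db) + carry             # e.g. 7 + 6 = 13
        digits.append(str(s % 10))                # write the 3 ...
        carry = s // 10                           # ... and carry the 1
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(long_add("5847", "15326"))  # -> 21173
```

Written this way the algorithm works for numbers of any length; the research question is whether a model trained only on examples of it generalizes the same way.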
This approach, which you might consider meta-pattern recognition, works quite well, up to a point. Remember that this is training to infer rules by example rather than by mathematical proof. We humans know that the long addition algorithm works no matter how big the numbers are. A model trained on examples should behave similarly for a while, but as the numbers get bigger it will at some point run beyond the scope of its training and will likely start to hallucinate. One paper shows such a model delivering 86% accuracy on calculations using 5-digit numbers – much better than the 5-6% of native GPT methods – but dropping to 41% for 12-digit numbers.
Progress is being made, but clearly this is still a research topic. A truly robust system would also need to move up another level, to learning absolute and abstract mathematical facts – for example, a true induction argument that long addition works for numbers of any length.
Beyond basic arithmetic
So much for basic arithmetic, especially as expressed in word problems. UC Berkeley researchers have developed an extensive set of math problems, called MATH, together with AMPS, a pretraining dataset. MATH is drawn from high school math competitions covering prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. AMPS, the far larger training dataset, is drawn from Khan Academy and Mathematica script examples and runs to 23GB, versus 570GB for GPT-3 training.
In a not-too-rigorous search, I have been unable to find research papers on learning models for any of these areas outside of arithmetic. Research in these domains would be especially interesting since solutions to such problems are likely to be more complex. There is also the question of how to decompose solution attempts into chain-of-thought reasoning granular enough to guide effective training for LLMs rather than human learners. I expect that could be an eye-opener, not just for AI but also for neuroscience and education.
What seems likely to me (and others) is that each such domain will require, as basic arithmetic requires, its own fine-tuning training dataset. Then we can imagine similar sets of training for pre-college physics, chemistry, etc. At least enough to cover commonsense know-how. As all these fine-tuning subsets compound, at some point “fine-tuning” a core LLM will no longer make sense. We will need to switch to new types of foundation model. So watch out for that.
While example-based intuition won’t fly in math, there are many domains outside the hard sciences where best-guess answers are just fine. I think this is where we will ultimately see the payoff of this moonshot research. One interesting direction here further elaborates chain-of-thought from linear reasoning to exploration methods with branching searches along multiple different paths. Again, a technique well known in algorithm circles but, I believe, quite new to machine learning methods.
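A minimal sketch of the branching idea, with a toy puzzle standing in for a learned model: instead of committing to a single linear chain of steps, the search keeps many partial paths alive and extends each in several ways. The two operations and the puzzle itself are invented for illustration; in a real system each branch would be a candidate reasoning step proposed and scored by the model.

```python
from collections import deque

def branch_search(start: int, target: int, max_depth: int = 10):
    """Breadth-first exploration over branching 'thought' paths:
    from each partial path, try applying +3 or *2 and keep searching."""
    queue = deque([(start, [])])        # (current value, steps taken so far)
    while queue:
        value, path = queue.popleft()
        if value == target:
            return path                 # first (shortest) successful path
        if len(path) < max_depth:
            queue.append((value + 3, path + ["+3"]))  # branch 1
            queue.append((value * 2, path + ["*2"]))  # branch 2
    return None

print(branch_search(1, 11))  # -> ['+3', '*2', '+3']
```

A linear chain-of-thought would have to guess the right operation at every step; the branching search recovers from wrong turns by construction, at the cost of exploring many paths.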
Yann LeCun has written more generally and certainly much more knowledgeably on this area as the big goal in machine learning, combining what I read as recognition (something we sort of have a handle on), reasoning (a very, very simple example covered in this blog), and planning (hinted at in the branching searches paragraph above). Here’s a temporary link: A Path Towards Autonomous Machine Intelligence. If the link expires, try searching for the title, or more generally “Yann LeCun planning”.
Very cool stuff. Opportunity for new foundation models and no doubt new hardware accelerators 😀