It is tempting to think that everything GPT-related is just chasing the publicity bandwagon, and that articles on the topic, especially those with evidently impossible claims (as this one might seem to make), are simply clickbait. In fact there are practical reasons for hosting at least a subset of these large language models (LLMs) on edge devices such as phones, especially for greatly improved natural language processing. At the same time, the sheer size of such models, routinely associated with massive cloud-based platforms, presents a challenge for any attempt to move an LLM to the edge. Making the transition to mobile GPT requires some pretty major innovation.
Why is mobile GPT worth the effort?
When Siri and similar capabilities appeared, we were captivated, at first. Talking to a machine and having it understand us seemed like science fiction become fact. The illusion fell apart as we started to realize that their "understanding" is very shallow, leaving us angrily repeating variants of a request in the hope that at some point the AI will get it right. Phrase-level recognition (rather than word-level) helps somewhat but is still defeated by the flexibility of natural language. As a user, your question will probably be most effective if you already know the trained trigger phrases, rather than if you simply ask a natural question. This hardly rises to the level of natural language processing.
LLMs have proven very successful at understanding natural language requests, effectively through brute force. They are trained on worldwide datasets using transformer methods built on self-attention, recognizing common patterns in natural language unconstrained by keyword proximity or pre-determined phrase structures. In this way they can extract and rephrase intent, or suggest refinements of intent, from a natural language request. This is much more like real NLP, and it is valuable quite independently of any ability to retrieve factual references from the internet. But it still runs in the cloud.
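To make the self-attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer. All names, dimensions and weights here are illustrative, not any particular model's implementation; the point is simply that every token attends to every other token through learned projections, with no dependence on fixed phrases or keyword proximity.

```python
# Minimal sketch of scaled dot-product self-attention, NumPy only.
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) token embeddings; W*: (d_model, d_head) learned projections."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                 # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ v                               # context-aware mix of values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 16, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)   # (6, 8): one context-aware vector per token
```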
How can mobile GPT be accomplished?
More capable smart speakers do some local processing beyond basic voice pickup (speech recognition, tokenization), then pass the real understanding problem back to the cloud. There are proposals that a similar hybrid approach could be used from our phones to upgrade NLP quality, but that has the usual downsides: latency and privacy concerns. It would be greatly preferable to keep all or most of the compute on the edge device.
Do we really need the same huge model used in the cloud? GPT-4 has about a trillion parameters, wildly infeasible to fit in a mobile application. In 2021 DeepMind published a significant advance, their Retrieval-Enhanced Transformer (Retro), which breaks away from storing facts in model weights, recognizing that these can be retrieved directly from plain text data or searches instead. (There are some wrinkles in doing this effectively, but that is the basic principle.) This change alone can reduce the model to a few percent of its original size: closer, but still bulky for a handheld device.
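A toy sketch of the retrieval principle follows. This is not DeepMind's Retro architecture, just the core idea: facts live in a plain-text database, fetched by embedding similarity and supplied to the model as context, rather than being memorized in weights. The hashed bag-of-words embedder below is a stand-in; a real system would use a trained neural encoder.

```python
# Toy illustration of retrieval-augmented generation: facts are fetched
# from a text database by embedding similarity instead of stored in weights.
import numpy as np
from zlib import crc32

def embed(text, dim=256):
    """Stand-in embedder (hashed bag-of-words); real systems use a trained encoder."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[crc32(word.strip('.,?!').encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

chunks = [
    "The Eiffel Tower is 330 metres tall.",
    "Austin is the capital of Texas.",
    "Water boils at 100 degrees Celsius at sea level.",
]
index = np.stack([embed(c) for c in chunks])   # precomputed chunk embeddings

def retrieve(query, k=1):
    """Return the k database chunks most similar to the query."""
    sims = index @ embed(query)                # cosine similarity (unit vectors)
    return [chunks[i] for i in np.argsort(-sims)[:k]]

# Retrieved text is prepended to the prompt, so a much smaller model can
# answer factual questions it never memorized.
context = retrieve("How tall is the Eiffel Tower?")[0]
prompt = f"Context: {context}\nQuestion: How tall is the Eiffel Tower?"
```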
Going lower requires pruning and quantization. Quantization is already familiar from mapping trained CNNs and other models to edge devices: you selectively replace floating-point weights with fixed-point equivalents, down to 8-bit, 4-bit or even 2-bit, constantly checking accuracy to ensure you haven't gone too far. Together with compression and decompression, the more aggressive the quantization, the smaller you can make the model; inference also becomes faster and lower power because DDR traffic is reduced. Pruning tests the sensitivity of results to selectively replacing weights with zeroes; for many NLP models a significant percentage of the weights turn out not to matter much. With an effective sparsity handler, pruning further reduces effective model size and further increases performance. How much improvement these techniques deliver depends on the AI platform.
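Here is a minimal sketch of both steps applied to a single weight matrix. The bit-width, sparsity target and error check are illustrative choices, not CEVA's flow; a production flow iterates these settings against task accuracy rather than raw reconstruction error, and packs the low-bit integers for storage.

```python
# Hedged sketch of magnitude pruning followed by symmetric fixed-point
# quantization, NumPy only. Thresholds and bit-widths are illustrative.
import numpy as np

def prune(w, sparsity=0.5):
    """Magnitude pruning: zero out the smallest |weights| up to a target sparsity."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

def quantize(w, bits=8):
    """Symmetric fixed-point quantization: floats -> signed integers + one scale."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for 8-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale                               # recover weights as q * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(512, 512)).astype(np.float32)

w_pruned = prune(w, sparsity=0.5)      # half the weights become zero
q, scale = quantize(w_pruned, bits=8)  # 4x smaller than float32 storage
error = np.abs(w_pruned - q * scale).max()  # a real flow checks task accuracy instead
```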
CEVA NeuPro-M for Mobile GPT
The NeuPro-M NPU IP is a family of AI processor engines designed for embedded applications. The pre-processing software that maps a model onto these engines can reduce effective model size by up to 20:1. Starting from a Retro-style model of a few tens of billions of parameters (a few percent of a trillion), that brings total LLM size down to around a billion parameters, comfortably within the capacity of a modern edge AI engine such as the NeuPro-M IP.
NeuPro-M cores come in configurations of one to eight parallel engines. Each engine provides a set of accelerators, including a true sparsity module to optimize throughput for unstructured sparsity, a neural multiplier for attention and softmax computations, and a vector processing unit to handle any special-purpose customization an application might need.
Cores share a common L2 memory and can run streams in parallel, for higher throughput and especially to overlap time-consuming softmax normalization computations with attention steps, effectively eliminating the latency overhead normally incurred by normalization. Ranging from 4 TOPS per core up to 256 TOPS per core, NeuPro-M can deliver over 1,200 TOPS.
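The sketch below illustrates why normalization need not serialize attention: the softmax denominator can be accumulated alongside the weighted-value sum and applied once at the end (the "online softmax" trick). This illustrates the mathematical idea only; it is not a description of CEVA's hardware scheme.

```python
# Online softmax: fold the normalizer into the streaming attention loop
# so no separate normalization pass is needed at the end of each row.
import numpy as np

def attention_row_online(q, K, V):
    """Attention output for one query, processing keys/values as a stream."""
    m = -np.inf                       # running max (numerical stability)
    denom = 0.0                       # running softmax denominator
    acc = np.zeros(V.shape[1])        # running weighted sum of values
    for k, v in zip(K, V):
        s = q @ k / np.sqrt(len(q))   # one attention score
        m_new = max(m, s)
        correction = np.exp(m - m_new)            # rescale old accumulators
        denom = denom * correction + np.exp(s - m_new)
        acc = acc * correction + np.exp(s - m_new) * v
        m = m_new
    return acc / denom                # single deferred normalization

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K, V = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))

# Matches the conventional softmax-then-multiply result exactly.
scores = K @ q / np.sqrt(8)
weights = np.exp(scores - scores.max())
ref = (weights / weights.sum()) @ V
assert np.allclose(attention_row_online(q, K, V), ref)
```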
If we want to replace Siri and similar voice-based apps, we also need voice recognition and text-to-speech for a fully voice-centric interface. CEVA already offers voice pickup in WhisPro, and speech recognition on input and text-to-speech on output can each be handled by small transformers running on the NeuPro-M. So you can build a full voice-based pipeline on this platform: from voice input, to recognition and intelligent response, to speech output.
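The overall shape of such an on-device pipeline is simple to express. Every stage function below is a hypothetical placeholder (these are not CEVA APIs) standing in for a model that would run on the NPU: a small ASR transformer, the compact LLM, and a TTS transformer.

```python
# Shape of an all-on-device voice pipeline; stage names are placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoicePipeline:
    pick_up: Callable[[bytes], bytes]        # voice pickup / denoising
    speech_to_text: Callable[[bytes], str]   # small ASR transformer
    respond: Callable[[str], str]            # compact on-device LLM
    text_to_speech: Callable[[str], bytes]   # small TTS transformer

    def run(self, mic_audio: bytes) -> bytes:
        clean = self.pick_up(mic_audio)
        request = self.speech_to_text(clean)
        reply = self.respond(request)        # no cloud round-trip needed here
        return self.text_to_speech(reply)
```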
If you really want your phone to write a detailed essay on a complex topic from a voice prompt, it may still need to turn to the internet to retrieve the factual data from which to generate that essay. More realistically, in many cases (find a restaurant, find a movie on the TV, tell me the weather in Austin) it will probably only need to go to the internet for that last piece of data, no longer needing that step to accurately understand your question. Pretty cool.
You can learn more at CEVA’s ChatGPT Mobile page.