Mobile LLMs Aren’t Just About Technology. Realistic Use Cases Matter
by Bernard Murphy on 10-16-2024 at 6:00 am

Arm has been making noise about running large language models (LLMs) on mobile platforms. At first glance that sounds wildly impractical, other than Arm acting as an intermediary between a phone and a cloud-based LLM. However, Arm has partnered with Meta to run Llama 3.2 on-device or in the cloud, apparently seamlessly. Running in the cloud is not surprising, but running on-device needed more explaining, so I talked to Ian Bratt (VP of ML Technology and Fellow) at Arm to dig deeper.


Start with what’s under the hood

I think we’re conditioned now to expect that every new (hardware) announcement signals a new type of accelerator, but that is not what Arm is claiming. First, they are starting from Llama 3.2 lightweight models built for edge deployment, not just with a smaller parameter count but also with pruning (zeroing parameters that have low impact on result accuracy) and something Meta calls knowledge distillation:

… uses a larger network to impart knowledge on a smaller network, with the idea that a smaller model can achieve better performance using a teacher than it could from scratch.
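
To make those two techniques concrete, here is a minimal PyTorch sketch of generic magnitude pruning and a distillation loss. This is not Meta’s actual Llama 3.2 recipe; the function names, sparsity level, and temperature are illustrative assumptions only.

# Hypothetical sketch: magnitude pruning (zero low-impact weights) and a
# teacher-student distillation loss. Generic illustration, not Meta's recipe.
import torch
import torch.nn.functional as F

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero the smallest-magnitude fraction of weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() <= threshold, torch.zeros_like(weight), weight)

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student token distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

# Toy usage: prune a random weight matrix, compute a distillation loss on random logits.
w_pruned = magnitude_prune(torch.randn(256, 256), sparsity=0.5)
loss = distillation_loss(torch.randn(8, 32000), torch.randn(8, 32000))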

The Arm demonstration platform uses 4 CPU cores on a middle-of-the-road phone. Let me repeat that – 4 CPUs, no added NPU. Arm then put a lot of (repeatable) work into optimization. Starting from a trained model, they heavily compress the weights from BFloat16 down to 4-bit. They compile operations through their hand-optimized Kleidi libraries and run on CPUs with ISA extensions for matrix operations that Arm has had in place for years.
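
As a rough illustration of that compression step, the sketch below quantizes a BFloat16 weight matrix to 4-bit integers with one scale per group of weights. The group size and symmetric scheme are assumptions for illustration; the exact format Arm and Meta use, and the Kleidi kernel details, are not described here.

# Minimal sketch of group-wise symmetric 4-bit weight quantization (assumed scheme).
import torch

def quantize_int4(weights: torch.Tensor, group_size: int = 32):
    """Quantize a BFloat16 weight matrix to int4 values (-8..7) with per-group scales."""
    w = weights.float().reshape(-1, group_size)                 # split into groups
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)  # one scale per group
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, shape):
    """Reconstruct an approximate float weight matrix from int4 values plus scales."""
    return (q.float() * scale).reshape(shape)

w = torch.randn(128, 128, dtype=torch.bfloat16)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)   # approximate reconstruction of w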

No magic other than aggressive optimization, in a way that should be repeatable across applications. Ian showed me a video of a demo they ran recently for a chatbot running on that same phone. He typed in “Suggest some birthday card greetings” and it came back with suggestions in under a second. All running on those Arm CPU cores.

Of course this is just running inference (repeated next-token prediction) based on a prompt. It’s not aiming to support training. It won’t be as fast as a dedicated NPU. It’s not aiming to run big Llama models on-device, though apparently it can seamlessly interoperate with cloud-based deployments to handle such cases. And it will sacrifice some accuracy through aggressive compression. But how important are those limitations?
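
For readers unfamiliar with what repeated next-token prediction looks like in practice, here is a hedged sketch of a greedy decoding loop using a Hugging Face causal language model. The model identifier is a placeholder for a lightweight Llama 3.2 variant, not the specific build Arm demonstrated, and the 50-token limit is arbitrary.

# Greedy next-token decoding loop; model id is a placeholder assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"   # placeholder lightweight model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Suggest some birthday card greetings"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(50):                          # generate up to 50 new tokens
        logits = model(ids).logits               # [1, seq_len, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick most likely token
        ids = torch.cat([ids, next_id], dim=-1)  # append and repeat
        if next_id.item() == tok.eos_token_id:
            break

print(tok.decode(ids[0], skip_special_tokens=True))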

The larger question in mobile AI

We’ve seen unbounded expectations in what AI might be able to do, chased by innovation in foundation models from CNNs to DNNs to transformers to even newer fronts, and innovation in hardware to accelerate those models in the cloud and mobile applications.

While now-conventional neural nets have found real applications in automotive, building security, and other domains, LLM applications in mobile are still looking for a winner. Bigger, faster, better is great in principle, but only if it is useful. Maybe it is time for the pendulum to swing from performance to utility: to explore first, at relatively low cost, which new features will attract growth.

Adding an AI accelerator to a design adds cost, power drain and complexity to system design and support. Arm’s argument for sticking to familiar CPU-based platforms for relatively modest inference tasks (with a path to cloud-based inference if needed) sounds like a sensible low-risk option until we consumers figure out what we find appealing as killer apps.

Not all edge devices are phones, so there will still be opportunity for NPUs at the edge. Predictive maintenance for machines, audio personalization in earbuds, and voice-based control for systems lacking a control surface are examples where product innovators will start with a real-world need in consumer, industrial, office, or hospital applications and then figure out how to apply AI to that need.

Interesting twist to the mobile AI story. You can learn more from Ian’s blog.
