Mobile LLMs Aren’t Just About Technology. Realistic Use Cases Matter
by Bernard Murphy on 10-16-2024 at 6:00 am

Arm has been making noise about running large language models (LLMs) on mobile platforms. At first glance that sounds wildly impractical, other than Arm acting as an intermediary between a phone and a cloud-based LLM. However, Arm has partnered with Meta to run Llama 3.2 on-device or in the cloud, apparently seamlessly. Running in the cloud is not surprising, but running on-device needed more explaining, so I talked to Ian Bratt (VP of ML Technology and Fellow) at Arm to dig deeper.


Start with what’s under the hood

I think we’re conditioned now to expect that every new (hardware) announcement signals a new type of accelerator, but that is not what Arm is claiming. First, they are starting from Llama 3.2 lightweight models built for edge deployment, not just with a smaller parameter count but also with pruning (zeroing parameters that have low impact on result accuracy) and something Meta calls knowledge distillation:

… uses a larger network to impart knowledge on a smaller network, with the idea that a smaller model can achieve better performance using a teacher than it could from scratch.
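
To make those two techniques concrete, here is a minimal PyTorch sketch of generic magnitude pruning and a distillation loss. This is not Meta’s actual Llama 3.2 recipe; the function names, sparsity level, and temperature are illustrative assumptions only.

# Hypothetical sketch: magnitude pruning (zero low-impact weights) and a
# teacher-student distillation loss. Generic illustration, not Meta's recipe.
import torch
import torch.nn.functional as F

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero the smallest-magnitude fraction of weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() <= threshold, torch.zeros_like(weight), weight)

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student token distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

# Toy usage: prune a random weight matrix, compute a distillation loss on random logits.
w_pruned = magnitude_prune(torch.randn(256, 256), sparsity=0.5)
loss = distillation_loss(torch.randn(8, 32000), torch.randn(8, 32000))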

The Arm demonstration platform uses 4 CPU cores on a middle-of-the-road phone. Let me repeat that – 4 CPUs, no added NPU. Arm then put a lot of (repeatable) work into optimization. Starting from a trained model, they heavily compress the weights from BFloat16 down to 4-bit. They compile operations through their hand-optimized Kleidi libraries and run on CPUs with ISA extensions for matrix operations that Arm has had in place for years.
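
As a rough illustration of that compression step, the sketch below quantizes a BFloat16 weight matrix to 4-bit integers with one scale per group of weights. The group size and symmetric scheme are assumptions for illustration; the exact format Arm and Meta use, and the Kleidi kernel details, are not described here.

# Minimal sketch of group-wise symmetric 4-bit weight quantization (assumed scheme).
import torch

def quantize_int4(weights: torch.Tensor, group_size: int = 32):
    """Quantize a BFloat16 weight matrix to int4 values (-8..7) with per-group scales."""
    w = weights.float().reshape(-1, group_size)                 # split into groups
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)  # one scale per group
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, shape):
    """Reconstruct an approximate float weight matrix from int4 values plus scales."""
    return (q.float() * scale).reshape(shape)

w = torch.randn(128, 128, dtype=torch.bfloat16)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)   # approximate reconstruction of w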

No magic other than aggressive optimization, in a way that should be repeatable across applications. Ian showed me a video of a demo they ran recently for a chatbot running on that same phone. He typed in “Suggest some birthday card greetings” and it came back with suggestions in under a second. All running on those Arm CPU cores.

Of course this is just running inference (repeated next-token prediction) based on a prompt. It’s not aiming to support training. It won’t be as fast as a dedicated NPU. It’s not aiming to run big Llama models on-device, though apparently it can seamlessly interoperate with cloud-based deployments to handle such cases. And it will sacrifice some accuracy through aggressive compression. But how important are those limitations?
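
For readers unfamiliar with what repeated next-token prediction looks like in practice, here is a hedged sketch of a greedy decoding loop using a Hugging Face causal language model. The model identifier is a placeholder for a lightweight Llama 3.2 variant, not the specific build Arm demonstrated, and the 50-token limit is arbitrary.

# Greedy next-token decoding loop; model id is a placeholder assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"   # placeholder lightweight model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Suggest some birthday card greetings"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(50):                          # generate up to 50 new tokens
        logits = model(ids).logits               # [1, seq_len, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick most likely token
        ids = torch.cat([ids, next_id], dim=-1)  # append and repeat
        if next_id.item() == tok.eos_token_id:
            break

print(tok.decode(ids[0], skip_special_tokens=True))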

The larger question in mobile AI

We’ve seen unbounded expectations in what AI might be able to do, chased by innovation in foundation models from CNNs to DNNs to transformers to even newer fronts, and innovation in hardware to accelerate those models in the cloud and mobile applications.

While now-conventional neural nets have found real applications in automotive, building security, and other domains, LLM applications in mobile are still looking for a winner. Bigger, faster, better is great in principle, but only if it is useful. Maybe it is time for the pendulum to swing from performance to utility: to explore first, at relatively low cost, which new features will attract growth.

Adding an AI accelerator to a design adds cost, power drain and complexity to system design and support. Arm’s argument for sticking to familiar CPU-based platforms for relatively modest inference tasks (with a path to cloud-based inference if needed) sounds like a sensible low-risk option until we consumers figure out what we find appealing as killer apps.

Not all edge devices are phones, so there will still be opportunity for NPUs at the edge. Predictive maintenance for machines, audio personalization in earbuds, and voice-based control for systems lacking a control surface are examples where product innovators will start with a real-world need in consumer, industrial, office, or hospital applications and then figure out how to apply AI to that need.

Interesting twist to the mobile AI story. You can learn more from Ian’s blog.
