Large language model (LLM) processing dominates many AI discussions today, and broad, rapid adoption brings an urgent need for scalability. GPU devotees are discovering that while a single GPU may execute an LLM well, interconnecting many GPUs often doesn’t scale as hoped: latency piles up, with noticeable effects on user experience. Achronix’s Bill Jenkins, Director of AI Product Marketing, has a better solution for scaling LLMs with FPGA acceleration for generative AI.
Expanding from conversational AI into transformer models
Most of us are familiar with conversational AI as a tool on our smartphones, TVs, or streaming devices, providing voice-based search for simple questions and usually returning the best result or a short list. Requests head to the cloud, where data and search indexing live, and results come back within a few seconds, usually faster than typing the request. Behind the scenes, each query runs through up to three steps, depending on the application: automatic speech recognition (ASR), natural language processing (NLP), and speech synthesis (SS).
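A minimal sketch of that three-step flow, with placeholder stage functions standing in for the real cloud ASR, NLP, and speech-synthesis services (none of these are actual APIs):

```python
# Placeholder three-stage conversational AI pipeline. Each stage is a
# stub; a production system would call cloud-hosted services instead.

def automatic_speech_recognition(audio: bytes) -> str:
    """ASR: convert the spoken request to text (stubbed)."""
    return "what is the tallest mountain"

def natural_language_processing(query: str) -> str:
    """NLP: interpret the query and fetch the best result (stubbed)."""
    return "Mount Everest, at 8,849 meters"

def speech_synthesis(answer: str) -> bytes:
    """SS: render the answer back to audio (stubbed as raw bytes)."""
    return answer.encode("utf-8")

def handle_voice_query(audio: bytes) -> bytes:
    text = automatic_speech_recognition(audio)
    answer = natural_language_processing(text)
    return speech_synthesis(answer)

print(handle_voice_query(b"..."))
```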
Generative AI builds on this concept with more compute-intensive transformer models containing billions of parameters. Complex, multi-level prompts can return thousands of words, seemingly written from research across various short-form and long-form sources. Accelerating ASR, NLP, and text synthesis, using an LLM like ChatGPT, becomes crucial if response times are to stay within reasonable bounds.
A good LLM delivering robust results quickly can draw hundreds of simultaneous users, complicating any hardware solution. One popular approach, cloud-based GPU instances with on-demand elastic resource expansion, avoids the long vendor lead times, allocation policies that can lock out smaller customers, and high capital costs of procuring high-end GPUs. But operating expenses can eat up the apparent advantages of rented GPUs at scale. “Spending millions of dollars in the cloud for GPU-based generative AI processing and still ending up with latency and inefficiency is not for everybody,” observes Jenkins.
FPGA acceleration for generative AI throughput and latency
The solution for LLMs is not bigger GPUs or more of them, because the generative AI latency problem isn’t due to execution unit constraints. “When an AI model fits in a single high-end GPU, it will win in a contest versus an FPGA,” says Jenkins. But as models grow larger, requiring multiple GPUs to increase throughput, the scale tips in favor of Achronix Speedster7t FPGAs thanks to their custom-designed 2D network-on-chip (NoC), which runs at 2 GHz and extends all the way out to the PCIe interfaces. Jenkins indicates they are seeing as much as 20 Tbps of bandwidth across the chip and up to 80 TOPS, essentially wiping out floorplanning issues.
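To see why interconnect bandwidth, rather than raw execution units, dominates at scale, consider a crude latency model. Every number below is an assumption for illustration (an 80 TOPS device, 100 Gbps device-to-device links, a 6,144-byte activation transfer per token); these are not Achronix measurements:

```python
# Crude, illustrative model: sharding a model across N devices divides
# the compute per token step, but adds interconnect traffic for
# activations, so per-step latency is roughly compute + communication.

def step_latency_us(params_per_device: float, tops: float,
                    bytes_moved: float, link_gbps: float) -> float:
    compute_us = 2 * params_per_device / (tops * 1e12) * 1e6  # ~2 ops/param
    comm_us = bytes_moved * 8 / (link_gbps * 1e9) * 1e6
    return compute_us + comm_us

TOTAL_PARAMS = 20e9   # GPT-20B parameter count
ACT_BYTES = 6144      # assumed INT8 activation transfer per token

for n in (1, 2, 4, 8, 16, 32):
    lat = step_latency_us(TOTAL_PARAMS / n, 80.0, ACT_BYTES * (n - 1), 100.0)
    print(f"{n:2d} devices: ~{lat:6.1f} us per token step")
```

At one device, the step is all compute; by 32 devices, the communication term in this toy model is roughly as large as the compute term, which is why faster on- and off-chip interconnect, not more execution units, moves the needle.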
Achronix has been evangelizing that one FPGA accelerator card can replace up to 15 high-end GPUs for speech-to-text applications, reducing latency by 90%. Jenkins decided to study GPT-20B (an LLM named for its 20 billion parameters) to see how the architectures compare in accelerating generative AI applications. We’ll cut to the punchline: at 32 devices, Achronix FPGAs deliver 5 to 6 times better throughput and similarly reduced latency. The contrast is striking at INT8 precision, which also reduces power consumption in an FPGA implementation.
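The INT8 point can be made concrete with a generic example. The sketch below uses stock PyTorch post-training dynamic quantization on a toy model to show the precision reduction involved; it is not the Achronix flow or the GPT-20B deployment itself:

```python
import torch
import torch.nn as nn

# Generic INT8 post-training quantization of a toy model, illustrating
# the precision the study compares at. Not the Achronix GPT-20B flow.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)  # Linear weights stored as INT8

x = torch.randn(1, 1024)
print(model_int8(x).shape)  # same interface, ~4x smaller weight storage
```

Narrower integer math cuts both the bits moved per operation and the energy per multiply-accumulate, which is one reason INT8 also reduces power in an FPGA implementation.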
“Generative AI developers can choose Achronix FPGAs they can actually get their hands on quickly, getting 5x-6x more performance for the same device count, or using fewer parts and saving space and power,” Jenkins emphasizes. He adds that familiarity with high-level libraries has kept many developers on GPUs, but they may not realize how inefficient a GPU-based architecture is until they run into these larger generative AI models. Jenkins worked on the team that developed OpenCL, so he understands programming libraries. He shares that AI compilers and FPGA IP libraries have advanced to the point where developers don’t need intimate knowledge of FPGA hardware details or hand-coding to get the performance advantages.
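The practical hand-off in such compiler flows is usually a framework-neutral model graph. As a generic example (not the Achronix toolchain specifically), many AI compilers, FPGA flows included, can ingest an ONNX export like this:

```python
import torch
import torch.nn as nn

# Generic export of a toy model to ONNX, a common entry format for AI
# compilers. The downstream FPGA mapping step is toolchain-specific
# and not shown here.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
dummy_input = torch.randn(1, 512)

torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=17)
```

From there, the idea is that the vendor toolchain, not the developer, handles the mapping to hardware-level details.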
LLMs are not getting smaller, and high-end GPUs are not getting cheaper (although vendors are working on the lead time problems). As models develop and grow, FPGA acceleration for generative AI will become a more acute need. Achronix stands ready to help teams understand where GPUs become inefficient in generative AI applications, how to deploy FPGAs for scalability in real-world scenarios, and how to keep capital and operating expenses in check.
Learn more about the GPT-20B study in the Achronix blog post:
FPGA-Accelerated Large Language Models Used for ChatGPT
Also Read:
400 GbE SmartNIC IP sets up FPGA-based traffic management