Large language model (LLM) processing dominates many AI discussions today, and broad, rapid adoption brings an urgent need for scalability. GPU devotees are discovering that while a single GPU may execute an LLM well, interconnecting many GPUs often doesn’t scale as hoped: latency piles up, with noticeable effects on user experience. Achronix’s Bill Jenkins, Director of AI Product Marketing, has a better solution for scaling LLMs with FPGA acceleration for generative AI.
Expanding from conversational AI into transformer models
Most of us are familiar with conversational AI as a tool on our smartphones, TVs, or streaming devices, providing voice-based search for simple questions and usually returning the best result or a short list. Requests head to the cloud, where data and search indexing live, and results come back within a few seconds, usually faster than typing the request. Behind the scenes, each query runs through up to three steps, depending on the application: automatic speech recognition (ASR), natural language processing (NLP), and speech synthesis (SS).
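A minimal sketch of that three-step flow, with placeholder stage functions standing in for the real cloud ASR, NLP, and speech-synthesis services (none of these are actual APIs):

```python
# Placeholder three-stage conversational AI pipeline. Each stage is a
# stub; a production system would call cloud-hosted services instead.

def automatic_speech_recognition(audio: bytes) -> str:
    """ASR: convert the spoken request to text (stubbed)."""
    return "what is the tallest mountain"

def natural_language_processing(query: str) -> str:
    """NLP: interpret the query and fetch the best result (stubbed)."""
    return "Mount Everest, at 8,849 meters"

def speech_synthesis(answer: str) -> bytes:
    """SS: render the answer back to audio (stubbed as raw bytes)."""
    return answer.encode("utf-8")

def handle_voice_query(audio: bytes) -> bytes:
    text = automatic_speech_recognition(audio)
    answer = natural_language_processing(text)
    return speech_synthesis(answer)

print(handle_voice_query(b"..."))
```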
Generative AI builds on this concept with more compute-intensive transformer models containing billions of parameters. Complex, multi-level prompts can return thousands of words, seemingly written from research across various short-form and long-form sources. Accelerating ASR, NLP, and text synthesis, using an LLM like ChatGPT, becomes crucial if response times are to stay within reasonable bounds.
A good LLM delivering robust results quickly can draw hundreds of simultaneous users, complicating any hardware solution. One popular approach, cloud-based GPU instances with on-demand elastic resource expansion, avoids the long vendor lead times, allocation policies that can lock out smaller customers, and high capital costs of procuring high-end GPUs. But operating expenses can eat up the apparent advantages of rented GPUs at scale. “Spending millions of dollars in the cloud for GPU-based generative AI processing and still ending up with latency and inefficiency is not for everybody,” observes Jenkins.
FPGA acceleration for generative AI throughput and latency
The solution for LLMs is not bigger GPUs or more of them, because the generative AI latency problem isn’t due to execution unit constraints. “When an AI model fits in a single high-end GPU, it will win in a contest versus an FPGA,” says Jenkins. But as models grow larger, requiring multiple GPUs to increase throughput, the scale tips in favor of Achronix Speedster7t FPGAs thanks to their custom-designed 2D network-on-chip (NoC), which runs at 2 GHz and extends all the way out to the PCIe interfaces. Jenkins indicates they are seeing as much as 20 Tbps of bandwidth across the chip and up to 80 TOPS, essentially wiping out floorplanning issues.
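To see why interconnect bandwidth, rather than raw execution units, dominates at scale, consider a crude latency model. Every number below is an assumption for illustration (an 80 TOPS device, 100 Gbps device-to-device links, a 6,144-byte activation transfer per token); these are not Achronix measurements:

```python
# Crude, illustrative model: sharding a model across N devices divides
# the compute per token step, but adds interconnect traffic for
# activations, so per-step latency is roughly compute + communication.

def step_latency_us(params_per_device: float, tops: float,
                    bytes_moved: float, link_gbps: float) -> float:
    compute_us = 2 * params_per_device / (tops * 1e12) * 1e6  # ~2 ops/param
    comm_us = bytes_moved * 8 / (link_gbps * 1e9) * 1e6
    return compute_us + comm_us

TOTAL_PARAMS = 20e9   # GPT-20B parameter count
ACT_BYTES = 6144      # assumed INT8 activation transfer per token

for n in (1, 2, 4, 8, 16, 32):
    lat = step_latency_us(TOTAL_PARAMS / n, 80.0, ACT_BYTES * (n - 1), 100.0)
    print(f"{n:2d} devices: ~{lat:6.1f} us per token step")
```

At one device, the step is all compute; by 32 devices, the communication term in this toy model is roughly as large as the compute term, which is why faster on- and off-chip interconnect, not more execution units, moves the needle.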
Achronix has been evangelizing that one FPGA accelerator card can replace up to 15 high-end GPUs for speech-to-text applications, reducing latency by 90%. Jenkins decided to study GPT-20B (an LLM named for its 20 billion parameters) to see how the architectures compare in accelerating generative AI applications. We’ll cut to the punchline: at 32 devices, Achronix FPGAs deliver 5 to 6 times better throughput and similarly reduced latency. The contrast is striking at INT8 precision, which also reduces power consumption in an FPGA implementation.
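The INT8 point can be made concrete with a generic example. The sketch below uses stock PyTorch post-training dynamic quantization on a toy model to show the precision reduction involved; it is not the Achronix flow or the GPT-20B deployment itself:

```python
import torch
import torch.nn as nn

# Generic INT8 post-training quantization of a toy model, illustrating
# the precision the study compares at. Not the Achronix GPT-20B flow.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)  # Linear weights stored as INT8

x = torch.randn(1, 1024)
print(model_int8(x).shape)  # same interface, ~4x smaller weight storage
```

Narrower integer math cuts both the bits moved per operation and the energy per multiply-accumulate, which is one reason INT8 also reduces power in an FPGA implementation.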
“Generative AI developers can choose Achronix FPGAs they can actually get their hands on quickly, getting 5x-6x more performance for the same device count, or using fewer parts and saving space and power,” Jenkins emphasizes. He adds that familiarity with high-level libraries has kept many developers on GPUs, but they may not realize how inefficient a GPU-based architecture is until they run into these larger generative AI models. Jenkins worked on the team that developed OpenCL, so he understands programming libraries. He shares that AI compilers and FPGA IP libraries have advanced to the point where developers don’t need intimate knowledge of FPGA hardware details or hand-coding to get the performance advantages.
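The practical hand-off in such compiler flows is usually a framework-neutral model graph. As a generic example (not the Achronix toolchain specifically), many AI compilers, FPGA flows included, can ingest an ONNX export like this:

```python
import torch
import torch.nn as nn

# Generic export of a toy model to ONNX, a common entry format for AI
# compilers. The downstream FPGA mapping step is toolchain-specific
# and not shown here.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
dummy_input = torch.randn(1, 512)

torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=17)
```

From there, the idea is that the vendor toolchain, not the developer, handles the mapping to hardware-level details.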
LLMs are not getting smaller, and high-end GPUs are not getting cheaper (although vendors are working on the lead time problems). As models develop and grow, FPGA acceleration for generative AI will become a more acute need. Achronix stands ready to help teams understand where GPUs become inefficient in generative AI applications, how to deploy FPGAs for scalability in real-world scenarios, and how to keep capital and operating expenses in check.
Learn more about the GPT-20B study in the Achronix blog post:
FPGA-Accelerated Large Language Models Used for ChatGPT
Also Read:
400 GbE SmartNIC IP sets up FPGA-based traffic management