The three-step conversational AI (CAI) process – automatic speech recognition (ASR), natural language processing, and synthesized text-to-speech response – is now deeply embedded in the user experience for smartphones, smart speakers, and other devices. More powerful large language models (LLMs) can answer more queries accurately after appropriate training. However, the computational cost of running complex models against large numbers of real-time ASR streams strains conventional GPU-based solutions. A webinar moderated by Sally Ward-Foxton of EETimes, featuring speakers Bill Jenkins, Director of AI Product Marketing at Achronix, and Julian Mack, Sr. Machine Learning Scientist at Myrtle.ai, discusses how FPGA-accelerated AI speech recognition built on Achronix FPGA technology is taking real-time ASR to new levels.
Deploying an FPGA with a WebSocket API for AI inference
“Conversational AI is interacting with a computer like a human,” says Jenkins. “The idea is it needs to be as real-time as possible so the conversation is fluid – we’ve all called a customer service number and gotten long pauses or discovered it only knows certain words, and that’s what we’re trying to get away from.” Jenkins sees a growing CAI market going beyond customer service to challenges like medical and law enforcement bodycam transcription, where speed and accuracy are essential.
GPUs are almost always the choice for AI training, where there are fewer constraints on time and resources. But in real-time CAI applications, developers are running into the limits of what GPUs can deliver for AI inference, even with racks of equipment and massive amounts of power and cooling, whether in the cloud or on-premises.
FPGA-accelerated AI speech recognition combines the hardware benefits of FPGAs with the software benefits of easier programmability. “Our solution is a real-time appliance running the Myrtle.ai software stack jointly on a server-class CPU and our VectorPath Accelerator Card with a Speedster7t FPGA,” continues Jenkins. “A very simple WebSocket API interface abstracts away the fact that there is an FPGA in the system.”
“A WebSocket API is very similar to sending HTTP ‘get’ requests, except that it creates a stateful connection,” says Mack. “The server and client can continue talking to each other with low latency, even as the number of streams scales.”
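The distinction Mack draws, a stateful streaming connection versus one-shot HTTP requests, can be sketched with a stub session object standing in for an open WebSocket. The session class, message format, and chunking here are illustrative assumptions, not the appliance's actual API:

```python
class StubSession:
    """Stands in for an open WebSocket connection: the server keeps
    state across messages, so each audio chunk extends one running
    transcription instead of starting a fresh request."""
    def __init__(self):
        self._received = []

    def send(self, chunk):
        # No per-message handshake, unlike a series of HTTP GETs.
        self._received.append(chunk)

    def recv(self):
        # Server replies with the running transcript so far (stateful).
        return " ".join(self._received)


def stream_audio(session, chunks):
    """Send audio chunks over one connection, collecting the
    incremental transcription returned after each chunk."""
    partials = []
    for chunk in chunks:  # e.g., one chunk per few tens of ms of audio
        session.send(chunk)
        partials.append(session.recv())
    return partials


partials = stream_audio(StubSession(), ["hello", "world"])
```

Because the connection stays open, latency per chunk is just the round trip plus inference time, which is what lets the stream count scale without handshake overhead.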
Evaluating ASR performance in Achronix’s virtual lab
Achronix and Myrtle.ai have taken FPGA-accelerated AI speech recognition into Achronix’s virtual lab, available to ASR developers by remote access on request, to demonstrate the potential. “On one Speedster7t FPGA, we can run ASR on 1050 streams with 90th percentile latency under 55 milliseconds,” observes Jenkins. “Users can click on any stream, listen to spoken words, and see the real-time transcription.”
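For context, the 90th-percentile figure means roughly 9 in 10 streams see results within 55 milliseconds. A nearest-rank percentile over per-stream latency samples can be computed as follows (a generic sketch, not Achronix's measurement methodology):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest sample that is greater
    than or equal to p percent of all samples."""
    ranked = sorted(latencies_ms)
    # Rank of the p-th percentile, converted to a 0-based index.
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]
```

So a claim of "90th percentile latency under 55 ms" means `percentile(samples, 90) < 55` over the measured streams.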
This performance translates into one Speedster7t FPGA replacing up to 20 servers, each running a conventional CPU-plus-GPU payload, with lower latency and no loss in ASR accuracy. “GPUs are a warp-locked architecture, executing one instruction across a lot of data that has to go back and forth from memory,” says Jenkins. “In our FPGA, we can run functions simultaneously using data from GDDR6 memory with up to 4 Tbps of bandwidth without going to external memory or the host CPU.” A two-dimensional network-on-chip (NoC) speeds data ingress and keeps latency low when transfers do occur.
Efficient number formats are critical to achieving conversational AI performance with the required accuracy. “You want as few bits as possible to represent weights because you win twice, once in data transfers and once in the multiply-accumulate operations that form the backbone of neural network models,” says Mack. “We’re using block floating point 16 (BFP16), which is hard to use on GPUs that don’t support it, but it is natively supported on the Speedster7t FPGA.” Models trained in 32-bit floating point (fp32) quantize to BFP16 with accuracy intact, compared with the int4 or int8 quantization often used in ASIC-based hardware.
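The win Mack describes comes from block floating point's shared exponent: one exponent per block of values, with a narrow integer mantissa per value, so both storage and the multiply-accumulate datapath shrink. A simplified sketch of that quantization step follows; the block size, mantissa width, and rounding used in the actual Speedster7t hardware path may differ:

```python
import math

def quantize_bfp(block, mantissa_bits=8):
    """Quantize one block of floats to block floating point and
    return the dequantized values: a single shared power-of-two
    exponent for the block, one signed integer mantissa per value."""
    max_mag = max(abs(v) for v in block)
    if max_mag == 0.0:
        return [0.0] * len(block)
    # Shared exponent taken from the largest magnitude in the block.
    shared_exp = math.floor(math.log2(max_mag))
    # Scale so every value's mantissa fits the signed integer range.
    scale = 2.0 ** (shared_exp - mantissa_bits + 2)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return [m * scale for m in mantissas]
```

Values near the block maximum survive almost exactly, while much smaller values in the same block lose precision to the shared exponent, which is why training in fp32 and then quantizing carefully preserves accuracy better than dropping straight to int8 or int4.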
Continuing scalability as LLMs grow
Most of this webinar is a conversation between Jenkins, Mack, and Ward-Foxton, with a few slides and a Q&A session at the end. For instance, Ward-Foxton asks what happens when LLMs inevitably get larger. Mack suggests they can fit the 7B parameter Llama 2 model in one Speedster7t FPGA now and should be able to fit the 13B parameter model soon. Jenkins adds their ASR server complex can grow to a 4U solution with eight VectorPath cards. This scalability means FPGA-accelerated AI speech recognition will take on more realistic real-time translation with more languages and broader vocabularies.
For the entire conversation, watch the archived Achronix webinar:
LinkedIn Live: FPGA-Accelerated AI Speech Recognition Revolution