Block floating point (BFP) has been around for a while but is just now starting to be seen as a very useful technique for performing machine learning operations. It’s worth pointing out up front that bfloat is not the same thing. BFP combines the efficiency of fixed point operations and also offers the dynamic range of full floating point. When examining the method used in BFP I am reminded of several ‘tricks’ used for simplifying math problems. The first that came to mind was the so-called Japanese multiplication method, which uses a simple graphical method for determining products. Another, of course, is the once popular yet now nearly forgotten slide rule.
As will be explained in an upcoming webinar, by Mike Fitton senior director of strategy and planning at Achronix, on the topic of using BFP in FPGAs for AI/ML workloads, BFP relies on normalized fixed point mantissas so that a ‘block’ of numbers used in a calculation all have the same exponent value. In the case of multiplication, only a fixed point multiply is needed on the mantissas and a simple addition is performed on the exponents. The surprising thing about BFP is that it offers much higher speed and accuracy with much lower power consumption than traditional floating point operations. Of course, integer operations are more accurate and use slightly lower power, but they lack the dynamic range of BFP. According to Mike BFP offers a sweet spot for AI/ML workloads and the webinar will show supporting data for his conclusions.
The requirements for AI/ML training and inference are very different from what is typically needed in DSPs for signal processing. This applies to memory access and also for math unit implementation. Mike will discuss this in some detail and will show how the new Machine Learning Processor (MLP) unit they built into the Speedster7t has native support for BFP and also supports a wide range of fully configurable integer and floating point precisions. In effect their MLP is ideal for traditional workloads, and also excels at AI/ML, without any area penalty. Each one has up to 32 multipliers per MAC block.
Achronix MLPs have tightly coupled memory that facilitates AI/ML workloads. Each MLP has a local 72K bit block RAM and a 2K bit register file. The MLP’s math blocks can be configured to cascade memory and operands without using FPGA routing resources. Mike will have a full description of the math block’s features during the webinar.
The Speedster7t is also very interesting because of the high data rate Network on Chip (NoC) that can be used to move data between MLPs and/or to other blocks or data interfaces on the chip. The NoC can move data without consuming valuable FPGA resources and avoids bottlenecks inside the FPGA fabric. The NoC has multiple pipes that are 256 bits wide running at 2GHz for a 512G data rate. They can be used to move data directly from the peripherals, like the 400G Ethernet, directly to the GDDR6 memories without requiring the use of any FPGA resources.
Achronix will be making a compelling case for why the native implementation of BFP in their architecture that includes many groundbreaking features is a very attractive choice for AI/ML and a wide range of other more traditional FPGA applications such as data aggregation, IO bridging, compression, encryption, network acceleration, etc. The webinar will include information on real world benchmarks and test cases that highlight the capabilities of the Speedster7t. You can register now to view the webinar replay here.
Share this post via:
Comments
There are no comments yet.
You must register or log in to view/post comments.