RISC-V 64 bit IP for High Performance

by Daniel Payne on 08-30-2023 at 10:00 am

RISC-V as an Instruction Set Architecture (ISA) has grown quickly in commercial importance and relevance since its release to the open community in 2015, attracting many IP vendors that now provide a variety of RTL cores. Roger Espasa, CEO and Founder of Semidynamics, has presented at RISC-V events on how their IP is customized for compute challenges that require high bandwidth and high performance cores with vector units. Semidynamics was founded in 2016, is headquartered in Barcelona, and already has customers in the US and Asia. It offers two customizable RISC-V IPs:

  • Avispado – in-order RV64GCV, supporting AXI and CHI
  • Atrevido – out-of-order RV64GC, supporting AXI and CHI

A typical CPU has a handful of big cores and large caches, making it easy to program, though not high performance for parallel code.

GPUs, by contrast, have many tiny cores that provide high performance for parallel code, but are harder to program and add communication latency through the PCIe bus when data needs to be passed back and forth between the CPU and the GPU.

Figure: CPU, GPU comparison

The approach at Semidynamics is to connect a RISC-V core to compute cores, which makes the design easy to program, delivers higher performance for parallel code, and offers zero communication latency. A CPU plus a vector unit provides the best of both worlds.

Figure: CPU plus Vector unit

The RISC-V vector specification defines 32 vector registers. Inside the vector unit you can add a number of vector cores, along with a connection to your cache.

Figure: Vector Unit

With Semidynamics IP you can customize the number of Vector Cores: 4, 8, 16, or 32. Put another way, the vector unit ranges from 256 bits wide with 4 Vector Cores up to 2,048 bits wide with 32 Vector Cores.
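
The relationship between core count and width can be sketched in a few lines of Python. The 64-bits-per-core constant is an assumption inferred from the two endpoints quoted above (256 bits for 4 cores, 2,048 bits for 32 cores); the function and constant names are illustrative and not part of any Semidynamics tool.

```python
# Assumption: each vector core contributes a 64-bit slice of the vector unit,
# inferred from the article's endpoints (4 cores -> 256 bits, 32 -> 2,048 bits).
BITS_PER_VECTOR_CORE = 64

# Core counts listed in the article.
SUPPORTED_CORE_COUNTS = (4, 8, 16, 32)


def vector_unit_width_bits(vector_cores: int) -> int:
    """Return the vector unit width in bits for a given number of vector cores."""
    if vector_cores not in SUPPORTED_CORE_COUNTS:
        raise ValueError(f"unsupported vector core count: {vector_cores}")
    return vector_cores * BITS_PER_VECTOR_CORE


if __name__ == "__main__":
    for cores in SUPPORTED_CORE_COUNTS:
        print(f"{cores:>2} vector cores -> {vector_unit_width_bits(cores):>5}-bit vector unit")
```

Under that assumption the intermediate options, 8 and 16 cores, would be 512 and 1,024 bits wide.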

IP users also choose which data types to support: FP64, FP32, FP16, BF16, INT64, INT32, INT16, INT8. An AI application might choose FP16 and BF16, while an HPC application could select FP64 and FP32.

The third customization is the Vector Register Length: for more performance and lower power, you can make the vector registers longer than the vector unit is wide.
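
One common way to picture this trade-off is that a register longer than the vector unit is processed in several passes, so a single instruction keeps the unit busy for several cycles. The sketch below models only that idea. The register-to-unit ratios and the 64-bits-per-core figure are assumptions for illustration; the article does not say which ratios the IP supports or how the hardware sequences the passes.

```python
def passes_per_vector_instruction(register_bits: int, unit_bits: int) -> int:
    """Passes through the vector unit needed to cover one full vector register.

    Simplified model (an assumption, not vendor data): the register is an
    integer multiple of the unit width and is processed one slice per pass.
    """
    if unit_bits <= 0 or register_bits < unit_bits:
        raise ValueError("register length must be at least the unit width")
    if register_bits % unit_bits != 0:
        raise ValueError("register length must be a multiple of the unit width")
    return register_bits // unit_bits


if __name__ == "__main__":
    unit_bits = 8 * 64  # 8 vector cores, assuming 64 bits per core
    for ratio in (1, 2, 4):  # hypothetical register-to-unit ratios
        register_bits = unit_bits * ratio
        passes = passes_per_vector_instruction(register_bits, unit_bits)
        print(f"{register_bits}-bit register on a {unit_bits}-bit unit: {passes} pass(es)")
```

In this model a longer register means fewer instructions to fetch and issue for the same amount of data, which is one plausible source of the performance and power benefit mentioned above.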

Here’s the block diagram of the Atrevido 423-V8:

Figure: Atrevido 423 + V8 Vector Unit

The vector unit is fully out of order, which is unique among RISC-V IP vendors. The combination of the vector unit and the Gazzillion unit is capable of streaming data at over 60 bytes/cycle.

Figure: High Bandwidth: Vector + Gazzillion (bytes/cycle)

The purple line shows read performance, which in the L1 cache is 20-60 bytes/cycle. Other machines show a rapid drop in bandwidth after leaving the L1 cache, while this approach keeps going, flattening at 56 bytes/cycle. Even going out to DDR memory shows a bandwidth of 40 bytes/cycle, which at a clock rate of 1.0 GHz makes 40 GB/s.
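
The conversion from bytes/cycle to GB/s is a single multiplication, sketched here in Python. The function name is illustrative, and the 1.0 GHz clock is the example rate used above rather than a stated product frequency.

```python
def bandwidth_gb_per_s(bytes_per_cycle: float, clock_ghz: float) -> float:
    """Convert a per-cycle transfer rate into GB/s.

    bytes/cycle * (clock_ghz * 1e9 cycles/s) / 1e9 bytes-per-GB
    simplifies to bytes_per_cycle * clock_ghz.
    """
    return bytes_per_cycle * clock_ghz


if __name__ == "__main__":
    clock_ghz = 1.0  # example clock rate from the article
    # Read-bandwidth points quoted in the article, in bytes/cycle.
    for label, bytes_per_cycle in (("L1 peak", 60), ("plateau", 56), ("DDR", 40)):
        gb_s = bandwidth_gb_per_s(bytes_per_cycle, clock_ghz)
        print(f"{label:>8}: {bytes_per_cycle} bytes/cycle -> {gb_s:.0f} GB/s")
```

Because the relationship is linear, the same bytes/cycle figures scale directly with whatever clock rate a given implementation reaches.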

IP customers can even add their own RTL code connected to the Vector Unit for their own purposes.

Performance of matrix multiplication is important in AI workloads. On the out-of-order V8 Vector Unit the peak is 16 FP64 FLOPS/cycle, and matrix sizes >= 400 reach 99% of peak. For a small matrix size of 24×24 the performance is 7 FP64 FLOPS/cycle, or just under 50% of peak. Matrix multiplication for FP16 using a Vector Unit with 8 vector cores has a peak of 64 FP16 FLOPS/cycle, and reaches 99% of peak for M >= 600.
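
The quoted peaks are consistent with a simple lane-counting model, sketched below in Python. Two assumptions go into it, and neither is stated in the article: each vector core is 64 bits wide (inferred from 4 cores being 256 bits), and each lane completes one fused multiply-add per cycle, counted as 2 FLOPS. The function names are illustrative.

```python
# Assumption: 64 bits of datapath per vector core (4 cores -> 256 bits).
BITS_PER_VECTOR_CORE = 64
# Assumption: one fused multiply-add per lane per cycle, counted as 2 FLOPS.
FLOPS_PER_FMA = 2


def peak_flops_per_cycle(vector_cores: int, element_bits: int) -> int:
    """Peak FLOPS/cycle for a vector unit under the lane-counting model."""
    lanes = vector_cores * BITS_PER_VECTOR_CORE // element_bits
    return lanes * FLOPS_PER_FMA


def fraction_of_peak(achieved_flops_per_cycle: float, peak: float) -> float:
    """Achieved throughput as a fraction of the peak."""
    return achieved_flops_per_cycle / peak


if __name__ == "__main__":
    fp64_peak = peak_flops_per_cycle(vector_cores=8, element_bits=64)
    fp16_peak = peak_flops_per_cycle(vector_cores=8, element_bits=16)
    print(f"V8 FP64 peak: {fp64_peak} FLOPS/cycle")
    print(f"V8 FP16 peak: {fp16_peak} FLOPS/cycle")
    # The 24x24 FP64 case quoted in the article: 7 FLOPS/cycle.
    print(f"24x24 FP64: {fraction_of_peak(7, fp64_peak):.0%} of peak")
```

With 8 vector cores the model gives 16 FLOPS/cycle for FP64 and 64 FLOPS/cycle for FP16, matching the article's peaks. The 7 FLOPS/cycle figure for the 24×24 case works out to about 44% of peak, which the article rounds to roughly half.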

A real-time object detection benchmark called YOLO (You Only Look Once) was run on the Atrevido 423-V8 platform, and it showed 58% higher performance per vector core than competitors. These results were for video, using a network with 24 layers, 5.56 Gops/frame, and about 9M parameters.

Figure: YOLO Comparison

Summary

Choosing a RISC-V IP vendor is a complicated task, so knowing about vendors like Semidynamics can help you better understand how a customized approach could most efficiently run your specific workloads. With Semidynamics you get to make architectural choices such as in-order or out-of-order, with or without vector units. The reported numbers from this IP vendor look promising, and I look forward to their future announcements.
