
At the 2025 RISC-V Summit North America, Min Hsu, Staff Compiler Engineer at SiFive, presented on enhancing tiling support within SiFive’s AI/ML software stack for the RISC-V Vector-Matrix Extension (VME). This extension aims to boost matrix multiplication efficiency, a cornerstone of AI workloads. SiFive’s VME implementation introduces a large matrix accumulator state for the result matrix C, leveraging existing RISC-V Vector (RVV) registers to supply source operands A and B. This design enables outer-product-style multiplications directly into the C accumulator, with options for “fat” k>1 support to handle narrower input datatypes. Rows or columns of C can be moved to vector registers or loaded/stored from memory, and the C state may be segmented into multiple tiles. By positioning the accumulator near arithmetic units, the matrix engine achieves high throughput, making it ideal for compute-intensive AI tasks.
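The outer-product accumulation style described above can be sketched in NumPy. This is a conceptual model only, not the VME instruction set: the function name and loop structure are illustrative, standing in for hardware that streams one A column and one B row per step into a resident C accumulator.

```python
import numpy as np

def outer_product_matmul(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates,
    mirroring how a matrix engine accumulates column-of-A times
    row-of-B products into an accumulator C that stays resident
    near the arithmetic units. Conceptual model, not the VME ISA."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(k):
        # One rank-1 update per k step: outer product of
        # the i-th column of A with the i-th row of B.
        C += np.outer(A[:, i], B[i, :])
    return C
```

Each iteration touches only one column of A and one row of B while C never leaves the accumulator, which is the data-movement property that makes the outer-product formulation attractive in hardware.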
A key focus was tiled matrix multiplication, illustrated through a Python pseudocode example. The function tiled_matmul decomposes large matrices A (m x k), B (k x n), and C (m x n) into manageable tiles. Outer loops iterate over tile_m, tile_n, and tile_k dimensions, creating views of sub-matrices (e.g., lhs_tile = A[m1:m1+tile_m, k1:k1+tile_k]). Inner loops then apply register-level tiling with tile_m_v, tile_n_v, and tile_k_v, performing the core operation: dst_tile[mv:mv+tile_m_v, nv:nv+tile_n_v] += np.matmul(lhs_tile_v, rhs_tile_v). This hierarchical tiling optimizes data locality—outer tiles fit into caches, inner ones into registers—reducing memory access overhead and enhancing performance for large-scale AI models.
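A runnable reconstruction of that pseudocode might look as follows. The function and variable names (tiled_matmul, lhs_tile, tile_m_v, and so on) follow the slide's naming; the exact loop bounds and the assumption that tile sizes evenly divide the matrix dimensions are simplifications for illustration.

```python
import numpy as np

def tiled_matmul(A, B, C, tile_m, tile_n, tile_k, tile_m_v, tile_n_v, tile_k_v):
    """Two-level tiled matrix multiply: C += A @ B.

    Outer tiles (tile_m, tile_n, tile_k) target the cache hierarchy;
    inner tiles (tile_m_v, tile_n_v, tile_k_v) model register-level
    blocking. Tile sizes are assumed to divide the dimensions evenly.
    """
    m, k = A.shape
    _, n = B.shape

    # Outer (cache-level) tiling: views into the full matrices.
    for m1 in range(0, m, tile_m):
        for n1 in range(0, n, tile_n):
            for k1 in range(0, k, tile_k):
                lhs_tile = A[m1:m1 + tile_m, k1:k1 + tile_k]
                rhs_tile = B[k1:k1 + tile_k, n1:n1 + tile_n]
                dst_tile = C[m1:m1 + tile_m, n1:n1 + tile_n]

                # Inner (register-level) tiling within each outer tile.
                for mv in range(0, tile_m, tile_m_v):
                    for nv in range(0, tile_n, tile_n_v):
                        for kv in range(0, tile_k, tile_k_v):
                            lhs_tile_v = lhs_tile[mv:mv + tile_m_v, kv:kv + tile_k_v]
                            rhs_tile_v = rhs_tile[kv:kv + tile_k_v, nv:nv + tile_n_v]
                            # Accumulate into C through the view.
                            dst_tile[mv:mv + tile_m_v, nv:nv + tile_n_v] += \
                                np.matmul(lhs_tile_v, rhs_tile_v)
```

Because NumPy slices are views, the innermost `+=` writes straight through to C, matching the accumulate-in-place behavior of the hardware's C state.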
SiFive’s AI/ML software stack integrates these hardware features seamlessly, enabling end-to-end execution of high-profile models on SiFive platforms. Central to this is the Intermediate Representation Execution Environment (IREE), an open-source MLIR-based compiler and runtime optimized for SiFive microarchitectures. IREE supports diverse front-ends like PyTorch for LLMs, applying target-specific tiling policies to break down operations. It enables intra-operation parallelization, generates code via SiFive’s tuned LLVM compilers and Scalable Kernel Libraries (SKL), and mixes MLIR codegen with microkernels (ukernels) for efficiency. The runtime handles inter-operation parallelization through asynchronous execution and task scheduling, supporting both Linux and bare-metal environments.
Hsu highlighted advancements in multi-tile matrix multiplication within IREE. Previously, IREE supported only single-tile K-loops, in which sources A0 and B0 were loaded once and a single matmul accumulated into C00. The enhancements allow multi-tile K-loops, loading sources such as A0 and A1 once and distributing accumulations across multiple C tiles (e.g., C00 += A0 * B0, C10 += A1 * B0, then C01 += A0 * B1, C11 += A1 * B1). This reduces redundant loads, improving arithmetic intensity and efficiency, especially for deep neural networks where K dimensions are large.
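The multi-tile access pattern can be sketched in Python. This is a hedged illustration of the reuse idea, not IREE's generated code: the 2x2 tile arrangement, the function name, and the requirement that each dimension be an even multiple of the tile size are all assumptions made for the example.

```python
import numpy as np

def multi_tile_k_loop(A, B, tile):
    """2x2 multi-tile K-loop sketch: each loaded A-tile and B-tile is
    reused by two accumulator tiles, halving source loads per multiply
    versus a single-tile (1x1) K-loop. Assumes A is (2*tile, K),
    B is (K, 2*tile), and tile divides K evenly."""
    k = A.shape[1]
    C = np.zeros((2 * tile, 2 * tile))
    # Views naming the four accumulator tiles held in the C state.
    C00, C01 = C[:tile, :tile], C[:tile, tile:]
    C10, C11 = C[tile:, :tile], C[tile:, tile:]

    for k1 in range(0, k, tile):
        # Load each source tile once per K step...
        A0 = A[:tile, k1:k1 + tile]
        A1 = A[tile:, k1:k1 + tile]
        B0 = B[k1:k1 + tile, :tile]
        B1 = B[k1:k1 + tile, tile:]
        # ...and reuse it across two accumulator tiles each.
        C00 += A0 @ B0
        C10 += A1 @ B0
        C01 += A0 @ B1
        C11 += A1 @ B1
    return C
```

Eight tile loads feed four tile-matmuls per K step here, versus three loads per matmul in the single-tile case, which is the arithmetic-intensity gain Hsu described.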
In takeaways, Hsu emphasized that tiled matrix multiplication is essential for high-performance AI/ML applications, as it maximizes hardware utilization. IREE excels in automating and optimizing these tiling strategies. RISC-V’s VME is purpose-built for such tiled operations, delivering native performance gains. SiFive’s XM series implements VME in a compact, integrated form factor, and the team’s contributions to IREE—particularly multi-tile support—further amplify efficiency. This software-hardware synergy positions SiFive’s stack as a robust solution for AI acceleration on RISC-V, bridging custom extensions with standardized ecosystems to drive innovation in edge and datacenter AI.
Bottom line: The presentation underscores SiFive’s commitment to advancing RISC-V for AI, combining architectural extensions with sophisticated compiler tools to tackle compute bottlenecks effectively.