
Lessons from Hands-On DGEMM Benchmarking
Using cycle-accurate simulation to explore how RISC-V vector extensions accelerate one of computing’s most important workloads
1. Why Vector Performance Matters
While GPUs dominate large-scale model training, CPUs execute a vast amount of matrix math in inference pipelines, data preprocessing, and scientific computing. Basic Linear Algebra Subprograms (BLAS) libraries underpin many scientific and machine learning frameworks, and one of their most important routines is GEMM — General Matrix Multiply. DGEMM (double-precision GEMM) is the gold-standard benchmark for this class of workload: it simultaneously stresses floating-point throughput, vector execution, register utilization, and memory bandwidth, making it a highly representative proxy for real-world compute intensity.
In this article we share practical lessons from hands-on experiments implementing a DGEMM kernel on an Andes AX46MPV near cycle-accurate RISC-V simulator. A cycle-accurate simulator lets us observe performance at the granularity of individual processor cycles and instrument the code with hardware performance monitoring (HPM) counters to track cycle counts, instruction throughput, and memory activity — all without silicon.
Our experiments confirm that the RISC-V Vector Extension (RVV) can dramatically improve performance with modest coding effort. A minimal RVV port delivered a 12× speedup over a scalar baseline; progressive tuning pushed that to over 150×; and enabling the High Bandwidth Vector Memory (HVM) feature unlocked 275×, reaching 92.8% of theoretical peak throughput. These gains stem from RVV’s scalable design, and the lessons learned along the way reveal how vector parameters interact in ways that are not always intuitive.
2. RISC-V Vector Extension (RVV) Primer
Unlike fixed-width vector ISAs such as x86 AVX-512 or Arm NEON, RVV uses a scalable vector model: the hardware determines the physical vector register width and software adapts at runtime. While Arm SVE is also a scalable architecture, RVV exposes more software-controlled tuning, notably the register-grouping multiplier (LMUL) described below. The same binary runs across a wide range of RISC-V hardware without recompilation.
Three parameters govern how vector operations behave:
- VLEN — the physical width of a vector register in bits (e.g. 512 or 1024). A 1024-bit register holds 16 double-precision (FP64) elements.
- SEW (Selected Element Width) — the element size used for each operation, chosen at runtime. For DGEMM, we use SEW=64 (FP64).
- LMUL (Length MULtiplier) — a register-grouping multiplier. LMUL=4 fuses four physical registers into one logical vector, quadrupling capacity but proportionally reducing the number of independent logical registers available. With 32 physical registers, LMUL=4 yields 8 logical register groups.
Experimenting with these “knobs” taught us that increasing vector capacity always involves trade-offs. The interactions between VLEN, LMUL, and loop structure can produce counter-intuitive results, as the experiments below illustrate.
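The relationship between the three parameters is simple arithmetic, and it can help to see it written out. The sketch below is illustrative only: the function names are hypothetical, it considers integer LMUL values only, and it takes the 32-register figure from the text above.

```python
NUM_VECTOR_REGS = 32  # vector registers available, as stated in the text


def elements_per_register(vlen_bits: int, sew_bits: int) -> int:
    """Number of SEW-wide elements that fit in one VLEN-bit register."""
    return vlen_bits // sew_bits


def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements one instruction can process when LMUL registers are grouped."""
    return elements_per_register(vlen_bits, sew_bits) * lmul


def logical_groups(lmul: int) -> int:
    """Independent register groups left once registers are fused LMUL at a time."""
    return NUM_VECTOR_REGS // lmul


if __name__ == "__main__":
    for lmul in (1, 2, 4, 8):
        print(f"LMUL={lmul}: {vlmax(1024, 64, lmul)} FP64 elements per instruction, "
              f"{logical_groups(lmul)} register groups")
```

Each doubling of LMUL doubles the work per instruction and halves the number of independent groups, which is the trade-off the later experiments run into.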
What is HVM?
High Bandwidth Vector Memory (HVM) is a microarchitectural feature that addresses a specific bottleneck: the data path between the cache and the vector register file. In a standard RVV implementation, vector loads travel through a 512-bit wide cache data bus. At VLEN=1024 with SEW=64, each vector register holds 16 FP64 elements (1024 bits), so filling it requires two sequential 512-bit transfers. HVM provides a dedicated 1024-bit wide memory path to the vector register file, wide enough to deliver a full 1024-bit vector register in a single transfer.
HVM is transparent to software: no changes to the RVV binary are required. The same kernel benefits automatically when HVM is enabled. This makes it an especially attractive feature for memory-bandwidth-bound workloads like DGEMM, where vector loads are on the critical path.
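The transfer counts described above follow from dividing the register width by the path width. A minimal sketch of that arithmetic, assuming a hypothetical helper name and ignoring latency, alignment, and cache misses, which the text does not quantify:

```python
CACHE_BUS_BITS = 512   # standard cache data bus width, from the text
HVM_PATH_BITS = 1024   # dedicated HVM path width, from the text


def transfers_per_register(vlen_bits: int, path_bits: int) -> int:
    """Sequential transfers needed to fill one vector register (ceiling division)."""
    return -(-vlen_bits // path_bits)


if __name__ == "__main__":
    for vlen in (512, 1024):
        print(f"VLEN={vlen}: "
              f"{transfers_per_register(vlen, CACHE_BUS_BITS)} transfer(s) via cache bus, "
              f"{transfers_per_register(vlen, HVM_PATH_BITS)} via HVM")
```

This model only counts transfers; it does not predict cycle counts on the real pipeline.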
3. DGEMM: What We Are Measuring
DGEMM computes C = αAB + βC where A is M×K, B is K×N, and C is M×N. The computational cost is 2×M×N×K FLOPs (one multiply and one add per element pair). For our 64×64 test matrices that is approximately 524K floating-point operations per kernel call.
Theoretical peak throughput for the AX46MPV at VLEN=1024, SEW=64 is 64 FLOPs/cycle. This reflects the processor’s dual VFMACC capability: two VFMACC operations issued per cycle, each operating on 16 FP64 elements (2 FLOPs each), giving 2 × 16 × 2 = 64 FLOPs/cycle.
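The two figures above combine into a lower bound on cycle count for the test case. The sketch below reproduces the arithmetic; the function names are hypothetical, and the bound counts only the 2×M×N×K multiply-add work, ignoring loads, stores, and the α and β scaling.

```python
def dgemm_flops(m: int, n: int, k: int) -> int:
    """One multiply and one add per (i, j, p) triple."""
    return 2 * m * n * k


def peak_flops_per_cycle(vlen_bits: int, sew_bits: int = 64, fma_pipes: int = 2) -> int:
    """FMA pipes x elements per register x 2 FLOPs per fused multiply-add."""
    return fma_pipes * (vlen_bits // sew_bits) * 2


def min_cycles(flops: int, peak: int) -> float:
    """Cycle count if every cycle ran at theoretical peak."""
    return flops / peak


if __name__ == "__main__":
    flops = dgemm_flops(64, 64, 64)
    peak = peak_flops_per_cycle(1024)
    print(flops)                    # 524288
    print(peak)                     # 64
    print(min_cycles(flops, peak))  # 8192.0
```

Any measured cycle count for the 64×64 case at VLEN=1024 can be compared against this 8,192-cycle floor to obtain an efficiency figure.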
We instrumented the kernel with HPM counters to capture cycle count, instruction count, and memory activity. The scalar baseline was compared with a naïve RVV version and progressively tuned implementations, systematically varying VLEN, LMUL, and loop-blocking structure.
4. Scalar vs. RVV: The Impact of Vectorization
We implemented four progressively optimized versions of DGEMM. The results are summarized in Table 1.

Even the most naïve RVV port — a straightforward translation of the scalar loop using three RVV primitives — delivered a 12× speedup with minimal effort. The inner loop relies on just three instructions:
- vsetvl — sets the active vector length for the hardware; the same code runs on any VLEN-capable core
- vle64 — loads a vector of FP64 values from memory into a register group
- vfmacc — fused multiply-accumulate across a full vector in a single instruction
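To make the role of the three primitives concrete, here is a plain-Python emulation of the strip-mined loop they form. It is not the kernel used in the experiments and it is not RVV code: the functions are hypothetical stand-ins, `vsetvl` is modelled simply as `min(remaining, VLMAX)`, the multiply-add is not fused as a real vfmacc would be, and the routine computes C += A·B (α = β = 1).

```python
def vsetvl(avl: int, vlmax: int) -> int:
    """Simplified model of vsetvl: grant at most VLMAX elements for this pass."""
    return min(avl, vlmax)


def accumulate_row(c_row, a_scalar, b_row, vlmax):
    """c_row += a_scalar * b_row, processed in strips of up to vlmax elements."""
    n = len(c_row)
    j = 0
    while j < n:
        vl = vsetvl(n - j, vlmax)       # how many elements this pass handles
        vb = b_row[j:j + vl]            # stands in for vle64 (load B strip)
        vc = c_row[j:j + vl]            # stands in for vle64 (load accumulator)
        # Stands in for vfmacc; Python rounds the multiply and add separately.
        vc = [c + a_scalar * b for c, b in zip(vc, vb)]
        c_row[j:j + vl] = vc            # write the strip back
        j += vl
    return c_row


def dgemm_naive(a, b, c, vlmax):
    """C += A @ B for row-major lists of lists, one output row at a time."""
    for i in range(len(a)):
        for p in range(len(b)):
            accumulate_row(c[i], a[i][p], b[p], vlmax)
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]
    c = [[0.0] * 3 for _ in range(2)]
    print(dgemm_naive(a, b, c, vlmax=2))
```

Because the strip length comes from `vsetvl` rather than a compile-time constant, the same loop handles any vector length and any tail, which is the property that lets one binary run across different VLEN values.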
Subsequent tuning — loop blocking and multiple accumulators — delivered a further 13× on top of the naïve RVV result, reaching 152× over scalar. Enabling HVM then added another 1.8× to reach the peak result of 275×. The key insight is that these gains are cumulative and largely independent: vectorize first, tune the register structure second, then exploit microarchitectural features like HVM third.
5. Counter-Intuitive Performance Lessons
Systematic parameter sweeps revealed three lessons that initially seem counterintuitive.
Lesson 1: Vectorize First
Do not assume that a modern optimizing compiler targeting a vector-capable processor will auto-vectorize hot loops. In our build, -O3 with a vector-capable target still produced a scalar binary whose cycle count was indistinguishable from a build with auto-vectorization explicitly disabled.
Writing even a naïve RVV kernel using the three primitives described above immediately yielded 12×. The lesson is that even a straightforward manual RVV port is transformative, and becomes the foundation for all further optimization.
Lesson 2: The Register Budget Cliff
Increasing LMUL raises the number of elements processed per instruction, which sounds unambiguously good. However, LMUL also consumes physical vector registers. With 32 physical registers, LMUL=4 provides 8 logical register groups; LMUL=8 provides only 4. A kernel that maintains multiple independent accumulators — essential for hiding FMA pipeline latency — requires a budget of registers for both the accumulators and the live data vectors.
When that budget is exceeded, the compiler must spill registers to memory and reload them, replacing fast FMA throughput with expensive load/store traffic. Table 2 shows the cliff in practice.

Two observations stand out. First, the performance collapse at LMUL=8 is severe (7.8× slower without HVM and 18.5× slower with HVM), making this one of the most consequential single-parameter choices in the sweep. Second, and importantly, HVM does not rescue LMUL=8. HVM widens the memory path to the vector register file; it does not add registers. The 32-register budget is fixed by the RVV architecture itself; it is not an artifact of how the software was written.
The practical rule: for a 64×64 DGEMM kernel, LMUL=4 with 4–6 accumulator rows is the sweet spot. LMUL=4 provides enough logical registers (8 groups) to sustain high accumulator parallelism while keeping all live vectors within the 32-register budget.
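A rough way to reason about the budget is to count register groups. The sketch below is a simplified model, not a description of the actual kernel: `data_groups` (the number of groups holding live input vectors) is a hypothetical parameter, and the value 2 used in the example is an assumption chosen for illustration, since the text does not state how many the kernel keeps live.

```python
NUM_VECTOR_REGS = 32  # vector registers available, as stated in the text


def register_groups(lmul: int) -> int:
    """Logical register groups available at a given integer LMUL."""
    return NUM_VECTOR_REGS // lmul


def fits_budget(lmul: int, accumulator_rows: int, data_groups: int) -> bool:
    """True if accumulators plus live data vectors fit without spilling."""
    return accumulator_rows + data_groups <= register_groups(lmul)


if __name__ == "__main__":
    ASSUMED_DATA_GROUPS = 2  # assumption for illustration only
    for lmul in (4, 8):
        for rows in range(2, 8):
            ok = fits_budget(lmul, rows, ASSUMED_DATA_GROUPS)
            print(f"LMUL={lmul}, rows={rows}: {'fits' if ok else 'spills'}")
```

Under that assumption, up to six accumulator rows fit at LMUL=4 while LMUL=8 leaves room for at most two, which matches the direction of the results described here. The real kernel's live-register count may differ.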
Lesson 3: HVM and the Importance of Matching VLEN to the Data Bus
HVM provides a substantial further gain when VLEN=1024 and the register budget is managed correctly. Through the cache, only 512 bits of data can be transferred at a time, whereas with HVM each vector load can transfer 1024 bits per cycle. Table 3 illustrates that without HVM, increasing VLEN from 512 to 1024 cuts the cycle count by about a third (23,892 to 15,979 cycles, roughly a 1.5× speedup), whereas with HVM the performance doubles (17,812 to 8,831 cycles).
Table 3 also illustrates why “efficiency relative to theoretical peak” requires careful interpretation. The VLEN=512 no-HVM configuration shows 68.6% efficiency — higher than the VLEN=1024 no-HVM result of 51.3% — yet the VLEN=1024 configuration is faster in absolute terms (15,979 vs. 23,892 cycles). The theoretical peak also doubles with the vector length (VLEN), so a configuration that gains less than 2× when doubling VLEN will show a drop in efficiency percentage even while improving absolute throughput.
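The efficiency percentages can be recomputed from the cycle counts quoted above, which makes the effect easy to check. This is a sketch with hypothetical function names; it assumes the peak at VLEN=512 is half the VLEN=1024 peak (32 versus 64 FLOPs/cycle), as implied by the statement that the peak doubles with VLEN.

```python
FLOPS_64_CUBED = 2 * 64 * 64 * 64  # 524,288 FLOPs for the 64x64 test case


def peak_flops_per_cycle(vlen_bits: int, sew_bits: int = 64, fma_pipes: int = 2) -> int:
    """FMA pipes x elements per register x 2 FLOPs per fused multiply-add."""
    return fma_pipes * (vlen_bits // sew_bits) * 2


def efficiency_pct(cycles: int, vlen_bits: int) -> float:
    """Achieved FLOPs/cycle as a percentage of the peak for that VLEN."""
    return 100.0 * FLOPS_64_CUBED / (cycles * peak_flops_per_cycle(vlen_bits))


# Cycle counts quoted in the text: (VLEN, HVM enabled) -> cycles
MEASURED = {
    (512, False): 23892,
    (1024, False): 15979,
    (512, True): 17812,
    (1024, True): 8831,
}

if __name__ == "__main__":
    for (vlen, hvm), cycles in MEASURED.items():
        label = "HVM" if hvm else "no HVM"
        print(f"VLEN={vlen:4d} {label:6s}: {cycles:6d} cycles, "
              f"{efficiency_pct(cycles, vlen):.1f}% of peak")
    print(f"no-HVM speedup from doubling VLEN: "
          f"{MEASURED[(512, False)] / MEASURED[(1024, False)]:.2f}x")
    print(f"HVM speedup from doubling VLEN:    "
          f"{MEASURED[(512, True)] / MEASURED[(1024, True)]:.2f}x")
```

The no-HVM speedup is below 2×, so efficiency falls even though the cycle count improves; the HVM speedup is about 2×, so efficiency holds steady.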

Finding the Blocking Sweet Spot
With HVM enabled at VLEN=1024, LMUL=4, we swept from 2 to 7 rows to find the accumulator count that best hides FMA latency within the available register budget. Table 4 shows a clean progression with a peak at 6-row blocking.

The slight reversal at 7-row reflects the same register pressure dynamic seen in Lesson 2: adding a seventh accumulator row begins to crowd out the load registers, introducing minor spill overhead. The 6-row optimum represents the point at which the kernel fully hides FMA pipeline latency without exceeding the register budget.
What is row-blocking?
Row-blocking (also called accumulator unrolling or kernel unrolling in GEMM literature) is a specific application of loop unrolling applied to the output rows of a matrix multiply kernel. In a naïve implementation, the inner loop computes one row of the output matrix at a time, loading a single accumulator register and issuing one FMA per iteration. Row-blocking instead computes R output rows simultaneously within the same inner loop body, holding R independent accumulator vectors live across the loop. Here R is the blocking factor, not the matrix dimension N.
Row-blocking specifically targets accumulator independence to keep enough independent work in flight to fully pipeline the FMA units. The optimal blocking factor is a function of FMA latency, issue width, and available register budget.
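The loop structure can be sketched in plain Python. This is an illustration of the blocking pattern only, with hypothetical names: each inner list stands in for one accumulator vector, the code computes C += A·B, and Python itself gains nothing from the independence that the hardware FMA pipeline exploits.

```python
def gemm_row_blocked(a, b, c, rows_per_block):
    """C += A @ B, computing rows_per_block output rows per pass over K."""
    m, k, n = len(a), len(b), len(b[0])
    for i0 in range(0, m, rows_per_block):
        block = range(i0, min(i0 + rows_per_block, m))
        # One independent accumulator vector per output row in the block.
        acc = [list(c[i]) for i in block]
        for p in range(k):
            b_row = b[p]  # loaded once, then reused by every row in the block
            for r, i in enumerate(block):
                a_ip = a[i][p]
                acc_r = acc[r]
                # The updates for different r do not depend on each other.
                for j in range(n):
                    acc_r[j] += a_ip * b_row[j]
        for r, i in enumerate(block):
            c[i] = acc[r]
    return c


if __name__ == "__main__":
    a = [[float(i + p) for p in range(3)] for i in range(5)]
    b = [[float(p * j + 1) for j in range(4)] for p in range(3)]
    c = [[0.0] * 4 for _ in range(5)]
    for row in gemm_row_blocked(a, b, c, rows_per_block=2):
        print(row)
```

Two things change relative to the one-row version: each loaded row of B is reused by several accumulators, and consecutive multiply-accumulates target different accumulators, so one does not have to wait for the previous result.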
6. Conclusions
Our experiments confirm that RVV vectorization is transformative for compute-intensive workloads, and that the path from a scalar baseline to near-peak efficiency is accessible with modest effort. The practical takeaways are:
- Vectorize explicitly and early. In our experiments the compiler did not auto-vectorize the DGEMM kernel, even with -O3 and a vector-capable target. A naïve RVV port using vsetvl, vle64, and vfmacc immediately delivers 12× over scalar. Subsequent tuning then compounds that gain.
- Be aware of the 32-register budget. LMUL and accumulator count jointly consume vector registers. Exceeding the budget triggers spills that can slow execution by 8–18×, a larger penalty than most other single-parameter mistakes. LMUL=4 with 4–6 accumulator rows is typically the safe operating region.
- Consider features like HVM to achieve higher memory bandwidth. Wider vectors are only faster if the memory path can sustain the bandwidth. On the AX46MPV, HVM provides a dedicated 1024-bit data path to match VLEN=1024. At VLEN=512, the standard bus already handles a vector register in a single transfer.
- Track absolute throughput alongside efficiency ratios. A drop in percentage efficiency when increasing VLEN does not mean performance got worse — it may mean the theoretical ceiling scaled faster than you could follow.
- Cycle-accurate simulation with HPM counters is a powerful development tool. All of the above was characterized without silicon, enabling rapid iteration over a large parameter space before tape-out.
RISC-V’s scalable vector model, combined with microarchitectural features like HVM, delivers a compelling path to near-peak floating-point efficiency on demanding workloads. The same binary runs across the full range of VLEN-capable cores; tuning effort is focused on a small number of well-understood parameters; and the gains compound in a predictable way. For teams targeting AI inference, HPC, or any compute-intensive workload on RISC-V, the message is clear: explicit RVV programming is both necessary and highly rewarding.
About the AX46MPV
The Andes AX46MPV is a high-performance RISC-V application processor implementing the RVV 1.0 vector extension with configurable VLEN (up to 1024 bits). The near cycle-accurate simulator used in this study models the processor’s pipeline, vector unit, cache hierarchy, and HPM counters with sufficient fidelity for architectural performance analysis.