Key Takeaways
- VSORA has developed a novel architecture optimized for AI inference, achieving near-theoretical performance in latency, throughput, and energy efficiency.
- The architecture addresses the 'memory wall' issue by using a unified memory stage with a massive SRAM array, facilitating faster data access and eliminating bottlenecks.
- Each processing core in VSORA's architecture features 16 million registers and integrates high-throughput MAC units, enabling flexible tensor operations and high computational efficiency.

VSORA, a pioneering high-tech company, has engineered a novel architecture designed specifically to meet the stringent demands of AI inference—both in datacenters and at the edge. With near-theoretical performance in latency, throughput, and energy efficiency, VSORA’s architecture breaks away from legacy designs optimized for training workloads.
The team behind VSORA has deep roots in the IP business, having spent years designing, testing, and fine-tuning their architecture. Now in its fifth generation, the architecture has been rigorously validated and benchmarked over the past two years in preparation for silicon manufacturing.
Breaking the Memory Wall
The “memory wall” has challenged chip designers since the late 1980s. Traditional architectures attempt to mitigate the performance impact of data movement between external memory and processing units by layering memory hierarchies, such as multi-level caches, scratchpads, and tightly coupled memory, each offering a different tradeoff between speed and capacity.
In AI acceleration, this bottleneck becomes even more pronounced. Generative AI models, especially those based on incremental transformers, must constantly reprocess massive amounts of intermediate state data. Conventional architectures struggle here: every cache miss, or any operation that must reach beyond on-chip memory, can severely degrade performance.
VSORA tackles this head-on by collapsing the traditional memory hierarchy into a single, unified memory stage: a massive SRAM array that behaves like a flat register file. From the perspective of the processing units, any register can be accessed anywhere, at any time, within a single clock. This eliminates costly data transfers and removes the bottlenecks that hamper other designs.
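The benefit of single-clock access to a flat register file is easiest to see with a toy latency model. The hit rates and cycle counts below are assumed purely for illustration; they are not VSORA or competitor figures.

```python
# Toy latency model contrasting a multi-level memory hierarchy with a flat,
# single-cycle register file. All cycle counts and hit rates are assumed
# illustrative values, not VSORA or competitor figures.

def hierarchical_latency(levels):
    """levels: list of (local_hit_rate, latency_cycles); last level catches all."""
    expected, p_reach = 0.0, 1.0
    for hit_rate, latency in levels:
        expected += p_reach * hit_rate * latency
        p_reach *= 1.0 - hit_rate
    return expected

# Assumed L1 / L2 / L3 / DRAM behavior for a cache-based design
cache_based = hierarchical_latency([(0.80, 4), (0.70, 14), (0.60, 40), (1.00, 300)])
flat_sram = 1.0   # the article's single-clock access to any register

print(f"hierarchy: {cache_based:.1f} cycles per access on average")
print(f"flat SRAM: {flat_sram:.1f} cycle per access")
```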
A New AI Processing Paradigm: 16 Million Registers per Core
At the core of VSORA's architecture is a high-throughput computational tile consisting of 16 processing cores. Each core integrates 64K multi-dimensional matrix multiply–accumulate (MAC) units, scalable from 2D to arbitrary N-dimensional tensor operations, alongside eight high-efficiency digital signal processing (DSP) cores. Numerical precision is dynamically configurable on a per-operation basis, ranging from 8-bit fixed-point to 32-bit floating-point formats. Both dense and sparse execution modes are supported, with runtime-selectable sparsity applied independently to weights or activations, enabling fine-grained control of computational efficiency and inference performance.
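Taking the quoted figures at face value, a short back-of-envelope sketch shows what one tile adds up to. Reading "64K" as 64 × 1024 and counting one MAC as two FLOPs are stated assumptions, not published specifications.

```python
# Back-of-envelope on per-tile resources, using only the figures quoted above.
CORES_PER_TILE = 16
MACS_PER_CORE = 64 * 1024              # "64K" read as 64 * 1024
REGISTERS_PER_CORE = 16_000_000        # "16 million registers per core"
FLOPS_PER_MAC = 2                      # one multiply plus one accumulate

macs_per_cycle = CORES_PER_TILE * MACS_PER_CORE
flops_per_cycle = macs_per_cycle * FLOPS_PER_MAC
registers_per_tile = CORES_PER_TILE * REGISTERS_PER_CORE

print(f"{macs_per_cycle:,} MACs issued per clock per tile")            # 1,048,576
print(f"{flops_per_cycle / 1e6:.1f} MFLOP per clock per tile")          # ~2.1
print(f"{registers_per_tile / 1e6:.0f} million registers per tile")     # 256
# Sustained TFLOPS then depends on clock frequency and tile count, which the
# article does not state per tile; the Jotunn8 and Tyr totals appear later.
```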
Each core incorporates an unprecedented 16 million registers, orders of magnitude more than the few hundred to a few thousand typically found in conventional architectures. While such a massive register file would normally overwhelm traditional compiler designs, VSORA overcomes the challenge with two architectural innovations:
- Native Tensor Processing: VSORA’s hardware natively supports vector, tensor, and matrix operations, removing the need to decompose them into scalar instructions. This eliminates the manual implementation of nested loops often required in GPU environments such as CUDA, thereby improving computational efficiency and reducing programming complexity.
- High-Level Abstraction: Developers program at a high level using familiar frameworks, such as PyTorch and ONNX for AI workloads, or Matlab-like functions for DSP, without the need to write low-level code or manage registers directly. This abstraction layer streamlines development, enhances productivity, and maximizes hardware utilization.
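VSORA's toolchain itself is proprietary, but the entry point described above is standard. As a minimal illustration of the kind of artifact such a flow ingests, the snippet below exports a placeholder PyTorch model to ONNX; nothing in it is VSORA-specific, and the model and file names are placeholders.

```python
# Producing the kind of high-level input the article says the flow accepts:
# a PyTorch model exported to ONNX. Placeholder model; no VSORA API involved.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
example_input = torch.randn(1, 1024)

torch.onnx.export(
    model,
    example_input,
    "model.onnx",                                   # hypothetical output path
    input_names=["activations_in"],
    output_names=["activations_out"],
    dynamic_axes={"activations_in": {0: "batch"}},  # allow variable batch size
)
```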
Chiplet-Based Scalability
VSORA’s physical implementation leverages a chiplet architecture, with each chiplet comprising two VSORA computational tiles. By combining VSORA chiplets with high-bandwidth memory (HBM) chiplet stacks, the architecture enables efficient scaling for both cloud and edge inference scenarios.
- Datacenter-Grade Inference. The flagship Jotunn8 configuration pairs eight VSORA chiplets with eight HBM3e chiplets, delivering an impressive 3,200 TFLOPS of compute performance in FP8 dense mode. This configuration is optimized for large-scale inference workloads in datacenters.
- Edge AI Configurations. For edge deployments, where memory requirements are lower, VSORA offers:
- Tyr2: Two VSORA chiplets + one HBM chiplet = 800 TFLOPS
- Tyr4: Four VSORA chiplets + one HBM chiplet = 1,600 TFLOPS
These configurations allow compute and memory resources to be tailored to the constraints of edge applications.
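The three configurations are internally consistent, and a short check makes the scaling rule explicit. The tiles-per-chiplet value comes from the chiplet description above; the TFLOPS figures are the ones listed.

```python
# Compute scales linearly with the number of VSORA chiplets across the listed
# configurations, while HBM capacity is chosen per deployment target.
TILES_PER_CHIPLET = 2    # from the chiplet description above

configs = {  # name: (compute chiplets, HBM chiplets, FP8 dense TFLOPS)
    "Jotunn8": (8, 8, 3200),
    "Tyr4":    (4, 1, 1600),
    "Tyr2":    (2, 1, 800),
}

for name, (chiplets, hbm, tflops) in configs.items():
    per_chiplet = tflops / chiplets
    per_tile = per_chiplet / TILES_PER_CHIPLET
    print(f"{name}: {per_chiplet:.0f} TFLOPS/chiplet, "
          f"{per_tile:.0f} TFLOPS/tile, {hbm} HBM stack(s)")
# Every configuration works out to 400 TFLOPS per chiplet (200 per tile).
```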
Power Efficiency as a Side Effect
The performance gains are evident, but equally remarkable are the advances in processing and power efficiency.
Extensive pre-silicon validation using leading large language models (LLMs) across multiple concurrent workloads demonstrated processing efficiencies exceeding 50%, roughly an order of magnitude higher than state-of-the-art GPU-based designs.
In terms of energy efficiency, the Jotunn8 architecture consistently delivers twice the performance-per-watt of comparable solutions. In practical terms, its power draw is limited to approximately 500 watts, compared to more than one kilowatt for many competing accelerators.
Collectively, these innovations yield multiple times higher effective performance at less than half the power consumption, translating to an overall system-level advantage of 8–10× over conventional implementations.
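Read as arithmetic, the 8–10× figure is the product of an effective-performance advantage and a power advantage. The sketch below reproduces that composition, taking the "multiple times higher effective performance" claim as a 4–5× range for illustration and the power ratio from the roughly 500 W versus more than 1 kW comparison above.

```python
# Composition of the claimed 8-10x system-level advantage: an effective-
# performance ratio multiplied by a power ratio. The 4-5x range is an
# illustrative reading of "multiple times higher effective performance";
# the ~2x power ratio comes from ~500 W vs. >1 kW above.
power_ratio = 1000 / 500          # competitor watts / Jotunn8 watts

for perf_ratio in (4.0, 5.0):
    print(f"{perf_ratio:.0f}x effective performance at {power_ratio:.0f}x lower "
          f"power -> {perf_ratio * power_ratio:.0f}x performance per watt")
```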
CUDA-Free Compilation Simplifies Algorithmic Mapping and Accelerates Deployment
One of the often-overlooked advantages of the VSORA architecture lies in its streamlined and flexible software stack. From a compilation perspective, the flow is dramatically simplified compared to traditional GPU environments like CUDA.
The process begins with a minimal configuration file of just a few lines that defines the target hardware environment. This file enables the same codebase to execute across a wide range of hardware configurations, whether that means distributing workloads across multiple cores, chiplets, full chips, boards, or even across nodes in a local or remote cloud. The only variable is execution speed; the functional behavior remains unchanged. This makes on-premises and localized cloud deployments seamless and scalable.
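VSORA has not published the format of this configuration file, so the sketch below is purely hypothetical: a Python dict with invented field names, intended only to convey the idea of a few lines that describe the target.

```python
# Hypothetical sketch only: invented field names illustrating a "few lines"
# target description. VSORA's real configuration format is not public.
target_config = {
    "device": "jotunn8",       # or "tyr2", "tyr4", a board, or a cloud node
    "chiplets": 8,
    "default_precision": "fp8",
    "deployment": "on_prem",   # the same code could point at a remote cloud
}
# Per the article, swapping this description retargets the identical codebase;
# only execution speed changes, never functional behavior.
```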
A Familiar Flow, Without the Complexity
Unlike CUDA-based compilation processes, the VSORA flow is reassuringly simple, without layers of manual tuning. Traditional GPU environments often require multiple painstaking optimization steps that, when successful, can deliver strong performance but remain fragile and time-consuming. VSORA simplifies this through a more automated and hardware-agnostic compilation approach.
The flow begins by ingesting standard AI inputs, such as models defined in PyTorch. These are processed by VSORA’s proprietary graph compiler, which automatically performs essential transformations such as layer reordering or slicing for optimal execution. It extracts weights and model structure and then outputs an intermediate C++ representation.
This C++ code is then fed into an LLVM-based backend, which identifies the compute-intensive portions of the code and maps them to the VSORA architecture. At this stage, the system becomes hardware-aware, assigning compute operations to the appropriate configuration, whether that is a single VSORA tile, a Tyr4 edge device, a full Jotunn8 datacenter accelerator, a server, a rack, or even multiple racks in different locations.
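As a structural sketch of the flow just described, the stub functions below mirror those stages. The names and return values are invented, since the graph compiler and backend are proprietary; this is not a real VSORA API.

```python
# Structural sketch of the flow described above. The stage functions are
# stubs with invented names; they mirror the text, not a real VSORA API.

def graph_compile(model):
    """Reorder/slice layers, extract weights, emit a C++ intermediate."""
    weights = {"layer0": []}                 # stands in for extracted tensors
    cpp_ir = "// generated C++ kernels"      # stands in for the C++ output
    return weights, cpp_ir

def llvm_backend(cpp_ir, target):
    """Map compute-intensive regions of the C++ onto the chosen hardware."""
    return f"binary for {target}"            # stands in for the final artifact

def compile_for_vsora(model, target="jotunn8"):
    weights, cpp_ir = graph_compile(model)   # proprietary graph compiler stage
    return llvm_backend(cpp_ir, target), weights

binary, weights = compile_for_vsora(model=None, target="tyr4")
```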
Invisible Acceleration for Developers
From a developer’s point of view, the VSORA accelerator is invisible. Code is written as if it targets the main processor. During compilation, the toolchain identifies the code segments best suited for acceleration and transparently handles their transformation and mapping to VSORA hardware. This significantly lowers the barrier to adoption, requiring no low-level register manipulation or specialized programming knowledge.
VSORA’s instruction set is high-level and intuitive, carrying over rich capabilities from its origins in digital signal processing. The architecture supports AI-specific formats such as FP8 and FP16 alongside traditional DSP arithmetic, all handled automatically on a per-layer basis. Switching between modes is instantaneous and requires no manual intervention.
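Per-layer switching can be pictured with a small, invented policy. The heuristic below is not VSORA's actual selection logic, which the toolchain applies automatically; it only illustrates what a per-layer decision looks like.

```python
# Invented per-layer format policy, purely to picture what "handled
# automatically on a per-layer basis" means. Not VSORA's actual logic.
def pick_format(layer_kind):
    # Assumed heuristic: matrix-heavy layers run in FP8, accumulation-
    # sensitive ones in FP16.
    return "fp8" if layer_kind in ("attention", "linear") else "fp16"

layers = ["embedding", "attention", "linear", "layernorm"]
print({name: pick_format(name) for name in layers})
# {'embedding': 'fp16', 'attention': 'fp8', 'linear': 'fp8', 'layernorm': 'fp16'}
```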
Pipeline-Independent Execution and Intelligent Data Retention
A key architectural advantage is pipeline independence—the ability to dynamically insert or remove pipeline stages based on workload needs. This gives the system a unique capacity to “look ahead and behind” within a data stream, identifying which information must be retained for reuse. As a result, data traffic is minimized, and memory access patterns are optimized for maximum performance and efficiency, reaching levels unachievable in conventional AI or DSP systems.
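The "look ahead and behind" idea can be illustrated with a textbook reuse analysis: with visibility into the upcoming access stream, only data that will actually be touched again needs to stay on chip. This is a generic sketch with placeholder tensor names, not VSORA's scheduling algorithm.

```python
# Generic reuse analysis, not VSORA's algorithm: given visibility into the
# upcoming access stream, keep on chip only what will be touched again.
def retained_after(step, access_stream):
    """Tensors worth holding on-chip after `step`: those accessed again later."""
    seen_so_far = set(access_stream[: step + 1])
    upcoming = set(access_stream[step + 1:])
    return seen_so_far & upcoming

stream = ["kv_cache", "w_q", "act0", "kv_cache", "w_q", "act1", "kv_cache"]
print(retained_after(2, stream))   # {'kv_cache', 'w_q'}; 'act0' is never reused
```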
Built-In Functional Safety
To support mission-critical applications such as autonomous driving, VSORA integrates functional safety features at the architectural level. Cores can be configured to operate in lockstep mode or in redundant configurations, enabling compliance with strict safety and reliability requirements.
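A minimal software analogue of lockstep operation is shown below: the same computation runs twice and the results are compared step by step, with any divergence flagged. In VSORA's case this is a hardware mechanism; the snippet only illustrates the principle.

```python
# Software analogue of lockstep execution: redundant runs of the same
# computation are compared, and any divergence raises an error. Purely
# illustrative; the real mechanism is implemented in hardware.
def lockstep(compute, inputs):
    primary = [compute(x) for x in inputs]
    shadow = [compute(x) for x in inputs]    # redundant core's execution
    for i, (a, b) in enumerate(zip(primary, shadow)):
        if a != b:
            raise RuntimeError(f"lockstep mismatch at step {i}")
    return primary

results = lockstep(lambda x: x * x, [1, 2, 3])   # [1, 4, 9]
```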
Conclusion
VSORA is not retrofitting old designs for modern inference needs; instead, it is building from the ground up. With a memory architecture that eliminates traditional bottlenecks, compute units tailored for tensor operations, and unmatched power efficiency, VSORA is setting a new standard for AI inference, whether in the cloud or at the edge.