No, I’m not going to talk about in-memory-compute architectures. There’s interesting work being done there, but here I’m going to talk about mainstream architectures for memory support in machine learning (ML) designs. These are still based on conventional memory components/IP such as caches, register files, SRAM and various flavors of off-chip memory, including the not-yet-quite-conventional high-bandwidth memory (HBM). However, the way these memories are organized, connected and located can vary quite significantly between ML applications.
At the simplest level, think of an accelerator in a general-purpose ML chip designed to power whatever edge applications a creative system designer might dream up (Movidius provides one example). The accelerator itself may be an off-the-shelf IP, perhaps FPGA- or DSP-based. Power may or may not be an important consideration; latency typically is not. The accelerator is embedded in a larger SoC controlled by perhaps an MCU or MCU cluster, along with other functions: probably the usual peripheral interfaces and certainly a communications IP. To reduce off-chip memory accesses (for both power and performance), the design provides on-chip cache. Accesses to that cache can come from both the MCU/MCU cluster and the accelerator, so they must be coherently managed.
Now crank this up a notch, to ADAS applications, where Mobileye is well-known. This is still an edge application, but performance is much more demanding from latency, bandwidth and power-consumption standpoints. Complexity is also higher; you need multiple accelerator types to handle different kinds of sensor and sensor fusion, for example. For scalability in product design, you cluster accelerators in groups, very likely with local scratchpad memory and/or cache; this lets you release a range of products with varying numbers of these groups. As you increase the number and types of accelerators, it makes sense to connect these clusters to the system interconnect through multiple proxy caches, one for each accelerator group. In support of your product strategy, it should then be easy to scale this group count by device variant.
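To make the scalability idea concrete, here’s a minimal sketch of how a product family might be parameterized by accelerator-group count, with one proxy-cache port per group. All names, counts and sizes here are hypothetical illustrations, not an Arteris IP API:

```python
from dataclasses import dataclass

@dataclass
class AcceleratorGroup:
    """One cluster of accelerators sharing local scratchpad/cache and a
    single proxy-cache connection into the coherent interconnect."""
    accelerators: int
    scratchpad_kb: int

def make_variant(name: str, num_groups: int) -> dict:
    """Scale a device variant by replicating accelerator groups;
    each group gets its own proxy-cache port on the NoC."""
    groups = [AcceleratorGroup(accelerators=4, scratchpad_kb=512)
              for _ in range(num_groups)]
    return {"name": name,
            "groups": groups,
            "proxy_cache_ports": num_groups}  # one port per group

# A hypothetical family spanning entry-level to high-end variants
family = [make_variant(n, g) for n, g in
          [("entry", 2), ("mid", 4), ("high", 8)]]
```

The point of the structure is that only `num_groups` changes between variants; the group definition, and therefore most of the verification effort, is reused.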
Arteris IP supports both of these use cases through its Ncore cache-coherent NoC interconnect. Since Ncore must maintain coherence across the NoC, it comes with its own directory/snoop filters. The product also provides proxy caches to interface between the coherent and non-coherent domains; you can have multiple such caches, creating customized clusters of IP blocks that use non-coherent protocols like AXI yet can now communicate as a cluster of equals in the cache-coherent domain. Arteris IP also provides multiple types of last-level cache, including the Ncore Coherent Memory Cache, which is tied into coherency management to provide a final level of caching before traffic must go to main memory. For non-coherent communication, Arteris IP offers a standalone last-level cache integrating through an AXI interface (CodaCache).
These ML edge solutions are already proven in the field: Movidius and Mobileye are two pretty compelling examples (the company will happily share a longer list).
Moving now to datacenter accelerators, memory architectures look quite different, judging by what’s happening in China. I’ve talked before about Baidu and their leading-edge work in this area, so here I’ll introduce a new company: Enflame (Suiyuan) Technology, building high-performance but low-cost chips for major machine-learning frameworks. Enflame is a Tencent-backed startup based in Shanghai with $50M in pre-series A funding, so they’re a serious player in this fast-moving space. And they’re going after the same objective as Cambricon, and as Baidu with their Kunlun chip: the ultimate in ML performance in the datacenter.
I’ve also talked before about how design teams are architecting for this objective – generally a mesh of accelerators to achieve massive parallelism in 2-D image processing. The mesh may be folded over into a ring, or folded twice into a torus, to implement RNNs for processing temporal sequences. The implementation is often tiled, with, say, four processors and local memory per tile; tiles are abutted to build up larger systems, simplifying some aspects of back-end place and route.
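The folding idea can be illustrated with neighbor arithmetic: a ring is a mesh row whose ends wrap, and a torus wraps in both dimensions, so every tile sees identical connectivity. This is my own generic model, not tied to any particular chip:

```python
def ring_neighbors(i: int, n: int) -> tuple[int, int]:
    """Left/right neighbors of node i in an n-node ring (mesh folded once)."""
    return ((i - 1) % n, (i + 1) % n)

def torus_neighbors(x: int, y: int, w: int, h: int) -> list[tuple[int, int]]:
    """Four neighbors of tile (x, y) in a w-by-h torus (mesh folded twice);
    edges wrap, so no tile dead-ends and recurrent traffic can circulate."""
    return [((x - 1) % w, y), ((x + 1) % w, y),
            (x, (y - 1) % h), (x, (y + 1) % h)]

# Corner tile (0, 0) in a 4x4 torus wraps to the opposite edges:
print(ring_neighbors(0, 8))         # (7, 1)
print(torus_neighbors(0, 0, 4, 4))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
```

Uniform connectivity is what makes the abutted-tile layout work: the same tile, with the same port pattern, can be stamped out at any position.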
Designs like this quickly get very big, and they need immediate access to a lot of off-chip working memory, without the latency that can come with mediation through cache-coherency management. There are a couple of options here: HBM2, with high bandwidth but at high cost, versus GDDR6, at lower cost but also lower bandwidth (off-chip memory at the edge is generally LPDDR). Kurt Shuler (VP Marketing at Arteris IP) tells me that GDDR6 is popular in China for cost reasons.
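A back-of-envelope comparison shows why this is a cost/bandwidth trade rather than a clear win for either side. Using typical first-generation per-pin rates (roughly 2 Gb/s over a 1024-bit HBM2 stack interface, and 16 Gb/s over a 32-bit GDDR6 device; actual products vary):

```python
def peak_gb_per_s(pins: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s: pin count times per-pin rate (Gb/s), over 8."""
    return pins * gbit_per_pin / 8

hbm2_stack = peak_gb_per_s(1024, 2.0)   # one HBM2 stack: 256.0 GB/s
gddr6_chip = peak_gb_per_s(32, 16.0)    # one GDDR6 device: 64.0 GB/s

# Matching one HBM2 stack takes ~4 GDDR6 devices, traded against
# HBM2's interposer/packaging cost:
print(hbm2_stack / gddr6_chip)  # 4.0
```

So the GDDR6 route buys its lower cost with more devices, more board area and more controller/PHY instances for the same aggregate bandwidth, which is exactly the trade Kurt describes being made in China.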
Another wrinkle in these mesh/tiled-mesh designs is that memory controllers are placed around the periphery of the mesh to minimize latency between cores in the mesh and the controllers. Traffic through those controllers must then be managed through to channels on the main memory interface (e.g. HBM2). That calls for a lot of interleaving, reordering, traffic aggregation and data-width adjustment between the memory interface and the controllers, while preserving the high throughput these memory standards offer. The Arteris IP AI-package provides the IP and necessary interfacing to manage this need. As for customers, they can already boast Baidu, Cambricon and Enflame at minimum; two of these (that I know of) have already made it through to deployment.
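A simplified model of what interleaving does here: addresses are striped across memory channels at some fixed granule, so a burst of sequential traffic out of the mesh fans out over all channels instead of hammering one. The channel count and granule below are illustrative assumptions, not figures from any Arteris IP product:

```python
def route(addr: int, channels: int = 8, granule: int = 256) -> tuple[int, int]:
    """Map a physical address to (channel, channel-local address)
    by interleaving at a fixed granule size (here 256 B)."""
    block = addr // granule                 # which granule-sized block
    channel = block % channels              # stripe blocks across channels
    local = (block // channels) * granule + (addr % granule)
    return channel, local

# Eight sequential 256 B blocks land on eight different channels:
hits = [route(a)[0] for a in range(0, 8 * 256, 256)]
print(hits)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The reordering and aggregation logic the article mentions sits on top of a mapping like this, keeping each channel’s request stream dense enough to hold the interface near its peak rate.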
Clearly there is more than one way to architect memory and NoC interconnect for ML applications. Kurt tells me that they have been working with ML customers for years, refining these solutions. Since they’re now clearly king of the hill in commercial NoC solutions, I’m guessing they have a bright future.