The classical problem every MBA student studies is manufacturing resource planning (MRP II). It quickly illustrates that at the system level, good throughput is not necessarily the result of combining fast individual tasks when shared bottlenecks and order dependency are involved. Modern SoC architecture, particularly the memory subsystem, presents a similar problem.
Traditional single-board computing architecture – for example, a PC – is organized around a system bus for peripherals and usually a dedicated CPU interface for high-speed memory. Once modern DRAM arrived with reasonable access times, most PC operating systems and applications like spreadsheets were limited more by memory capacity than by memory performance.
Peripherals evolved from register-level polling to streaming data with DMA into shared memory. In systems with a countable number of peripherals, sharing works pretty well, but it starts to degrade about the time one runs out of fingers … or interrupt requests. When systems grow to dozens or hundreds of interrupt sources, predictability suffers.
Cramming all those peripherals with an application processor and maybe a graphics core or DSP core into an SoC seems like a good idea for the integration benefits. It does little to relieve the real-time interrupt problem, and may actually make it bigger as the temptation is to place more on chip. Worse still from a memory standpoint, it takes DRAM that was spread out in many places across chips on a board and for cost, real estate, and pin count reasons concentrates it in a single space attached to the SoC and its memory controller. (I hesitate to call that Unified Memory Architecture since that term has specific GPU implications, but it is definitely sharing.)
Consumer products dictate that memory be cheap. Bandwidth is more important than capacity; however, scaling DRAM bandwidth means more DRAMs and more power. That leads to the idea of multichannel DRAMs, which solves some issues but creates others. There are extra pins to deal with, and issues with prefetching, alignment, and cache misses – mostly due to the diversity of data types stored in DRAM.
Even on a relatively small scale, this means chaos for a shared SoC memory subsystem. Most designs have the potential for high peak bandwidth but achieve relatively poor hit rates and overall utilization. Poorly executed designs, or ones that are just plain overloaded, such as this example from NXP history presented nearly 12 years ago with 90 DRAM consumers, end up being little more than a large bottleneck if everything kicks in.
If everything were homogeneous 32-bit words, life would be a lot simpler – we’d just move to a new memory type and increase the bandwidth. More channels increase bandwidth and help with the real-time stress of a variety of requests. Some say we are nearing the end for the venerable DDR interface, despite the progress in package-on-package (PoP) at Apple and other vendors, and newer interfaces are about to come into play.
These interfaces offer more channels and more bandwidth, but note that the minimum efficient request is still 32 bytes in all cases. Now consider all the different types of traffic involved, and the idea that much of it is order-dependent. The analogy of an MRP system should be getting clearer. A single memory operation hitting a bottleneck may throw the entire chain of operations into a stall. Buffering reduces the problem but costs precious area. Software solutions, given the complexity of multichannel requests, are difficult at best.
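To put a rough number on that 32-byte minimum, here is a minimal sketch of how much of each transfer is actually useful to the requester. The function name and the request sizes are illustrative assumptions, and the model assumes every request starts on a burst boundary; a misaligned request would fare worse.

```python
MIN_BURST_BYTES = 32  # minimum efficient request size quoted in the text


def burst_efficiency(request_bytes, min_burst=MIN_BURST_BYTES):
    """Fraction of the bytes moved on the DRAM interface that were requested.

    Assumption: the request is aligned to a burst boundary, so the only
    waste is the padding up to a whole number of bursts.
    """
    if request_bytes <= 0:
        raise ValueError("request_bytes must be positive")
    bursts = -(-request_bytes // min_burst)  # integer ceiling division
    return request_bytes / (bursts * min_burst)


if __name__ == "__main__":
    # Hypothetical request sizes, e.g. a single word up to a cache line.
    for size in (4, 16, 32, 48, 64):
        print(f"{size:3d}-byte request: {burst_efficiency(size):.1%} useful")
```

A consumer that only ever wants one 32-bit word uses an eighth of what the interface moves, which is one way peak bandwidth and achieved utilization drift apart.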
An ideal memory resource planning system – aka a hardware memory controller – would handle four requirements:
- Optimize DRAM utilization and QoS,
- Balance traffic from large numbers of DRAM consumers across more DRAM channels,
- Eliminate performance loss from ordering dependencies with minimal buffering,
- Support tradeoffs with analysis tooling.
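The second requirement, balancing traffic across channels, is easiest to see with a toy address map. The sketch below assumes a simple modulo interleave on 32-byte granules across four channels; the function names, channel count, and access patterns are all hypothetical, and real controllers typically use more elaborate address hashing.

```python
from collections import Counter


def channel_of(addr, num_channels=4, granule=32):
    """Map a byte address to a channel by interleaving on fixed granules."""
    return (addr // granule) % num_channels


def channel_load(addresses, num_channels=4, granule=32):
    """Count how many requests land on each channel."""
    counts = Counter(channel_of(a, num_channels, granule) for a in addresses)
    return [counts[ch] for ch in range(num_channels)]


if __name__ == "__main__":
    sequential = [i * 32 for i in range(64)]   # a linear stream
    strided = [i * 128 for i in range(64)]     # stride = granule * channels
    print("sequential:", channel_load(sequential))
    print("strided:   ", channel_load(strided))
```

The linear stream spreads evenly, while the strided pattern piles every request onto one channel. With many consumers and mixed data types, patterns like the second one are hard to rule out, which is why the balancing needs to be done deliberately.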
That first bullet may be troubling – utilization and QoS usually move in opposite directions when evaluated individually. That leads to a truism from the study of MRP scenarios: making some requests wait may lead to a faster overall system result, since other types of requests can be stuffed into the gaps. Rather than coarse adjustments, per-request tuning can yield a closer-to-optimal flow of data.
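A toy model shows why waiting can win. The sketch below assumes a single DRAM bank where an access to the already-open row costs 4 cycles and any other access costs 12; those cycle counts, the row names, and the grouping policy are illustrative assumptions, not figures from any real part or from the Sonics scheduler.

```python
HIT_CYCLES = 4    # assumed cost when the requested row is already open
MISS_CYCLES = 12  # assumed cost to close one row and open another


def service(rows):
    """Return the completion time of each request, served in the given order."""
    open_row = None
    now = 0
    done = []
    for row in rows:
        now += HIT_CYCLES if row == open_row else MISS_CYCLES
        open_row = row
        done.append(now)
    return done


def group_by_row(rows):
    """Reorder so requests to the same row are adjacent (first-seen row first)."""
    first_seen = {}
    for index, row in enumerate(rows):
        first_seen.setdefault(row, index)
    return sorted(rows, key=lambda r: first_seen[r])  # sorted() is stable


if __name__ == "__main__":
    arrival = ["A", "B", "A", "B", "A", "B"]  # two consumers ping-ponging rows
    print("in order:", service(arrival))
    print("grouped: ", service(group_by_row(arrival)))
```

Under these assumptions the in-order schedule finishes at cycle 72 and the grouped one at cycle 40. The first request to row B completes later than it would have in arrival order, yet the whole batch is done sooner. A real scheduler would also have to cap how long any one request can be deferred, which is where the QoS side of the tradeoff comes in.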
Sonics has carefully evaluated the DRAM-as-MRP problem, with these and other considerations leading to a suite of solutions. A powerful observation in their approach is that the network-on-chip already knows what the traffic demand looks like, and can be used to prime the DRAM scheduler accordingly.
I’m fascinated by the analogy of an SoC being a small data manufacturing plant, and in fact the 85% utilization figure famous in MRP circles shows up in the Sonics slides. The Sonics presentation at MemCon 2015 won best paper, and we’ll look at their ideas in more detail in our next installment.