
Optimizing memory scheduling at integration-level

by Don Dingee on 04-04-2016 at 4:00 pm

In our previous post on SoC memory resource planning, we shared 4 goals for a solution: optimize utilization and QoS, balance traffic across consumers and channels, eliminate performance loss from ordering dependencies, and analyze and understand tradeoffs. Let’s look at details on how Sonics is achieving this.

How exactly does one optimize utilization and QoS simultaneously? Assuming for this discussion there is a lot of traffic volume and variety, utilization and QoS are usually at odds. Utilization likes long memory bursts, where DRAM operates most efficiently. QoS is more about low latency, typically accomplished by keeping bursts short. Many memory controllers have a knob allowing one or the other to be favored, or a balanced midpoint – sort of a tradeoff between creamy and chunky peanut butter.


A better strategy is to collect the incoming memory requests and their QoS requirements, then evaluate the DRAM bank/page/direction states to pick the transactions that should go first and the ones that can wait. In-flight requests are unstoppable, but holding some requests off waiting for more favorable conditions as others finish may improve overall memory utilization without creating QoS problems.
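The selection strategy just described can be sketched in a few lines. This is an illustrative toy, not Sonics' actual MemMax algorithm: it scores each pending request by whether it hits an already-open DRAM page, whether it continues the current read/write direction (avoiding bus turnaround), and whether its QoS deadline has become urgent. The weights and field names are hypothetical.

```python
# Toy DRAM request picker: weigh page hits and read/write direction
# against QoS urgency. Not Sonics' actual MemMax algorithm.
from dataclasses import dataclass

@dataclass
class Request:
    bank: int
    row: int
    is_write: bool
    deadline: int  # cycles left before the QoS contract is violated

def pick_next(pending, open_rows, last_was_write):
    """Choose one request to issue now; hold the rest for better conditions."""
    def score(req):
        s = 0
        if open_rows.get(req.bank) == req.row:
            s += 10    # page hit: no precharge/activate cost
        if req.is_write == last_was_write:
            s += 3     # same direction: avoids a bus turnaround penalty
        if req.deadline <= 4:
            s += 100   # urgent QoS: efficiency preferences are overridden
        return s
    return max(pending, key=score)
```

With a relaxed deadline, the page-hit request wins; once another requester's deadline becomes urgent, it jumps the queue even though it misses the open page – utilization is traded away only when QoS demands it.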

Other tactics include exploiting spatial locality (often used in the GPU arena), organizing data in memory to stay within an operating system page and reduce page misses. Judicious buffering using compiled SRAM also helps. Rolling all these prioritization and state-tracking tricks together, Sonics has come up with the MemMax memory scheduler IP – a “choice generator,” as Drew Wingard puts it. Operating as a target on the network-on-chip, MemMax can push memory utilization up to around 85%.


The chase for more memory bandwidth has led to new multichannel architectures. Somehow, traffic from potentially hundreds of consumers must be balanced across memory channels to achieve the bandwidth potential. This can theoretically be done in either hardware or software, but those who have tried know how hard it is to achieve both spatial and time domain balance with order-dependent and order-independent requests on more than 2 channels in software.

Realistically, hardware is the only feasible choice to deal with interleaving many memory channels. Intel arrived at that conclusion years ago for PC and server chipsets, and the embedded and mobile SoC communities are coming to the same realization. Hardware can also go after fine-grained interleaving with resolution from 64 to 256 bytes, while still dealing with DRAM page impacts and transaction splitting and reassembly.
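Fine-grained interleaving itself is a simple address mapping: consecutive granules of the physical address space rotate across the channels. The sketch below shows the idea; the granule size and channel count are illustrative parameters (the article cites 64 to 256 byte resolution), not Sonics IMT specifics, and real hardware must still handle the page impacts and transaction splitting the mapping creates.

```python
# Sketch of fine-grained channel interleaving: consecutive granules of
# the address space rotate across channels. Parameters are illustrative.
def route(addr: int, num_channels: int = 4, granule: int = 64):
    """Map a physical address to (channel, channel-local address)."""
    g = addr // granule                  # which granule the address falls in
    channel = g % num_channels           # rotate granules across channels
    local = (g // num_channels) * granule + (addr % granule)
    return channel, local
```

With 64-byte granules on four channels, addresses 0, 64, 128, and 192 land on channels 0 through 3, so a streaming burst naturally spreads its load – the balance that is so hard to achieve in software.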

Sonics again leverages their network-on-chip to help with the load balancing using Interleaved Multichannel Technology (IMT). Transparent to software, IMT overlaps memory channels almost as if they are random targets while minimizing buffer area. IMT also deals with reordering, for instance comprehending AXI IDs with flexible ordering.


It may sound like buffers are starting to add up. But not using buffers leaves a reordering problem: DRAM subsystems achieve their best throughput when unconcerned with order, while interface protocols, interleaving reassembly, and deadlock prevention all mandate dealing with reordering. Placing buffers near the memory subsystem and the scheduler is best, since their depth can be optimized based on concurrency and throughput-versus-latency requirements. Buffers should be compiled as SRAM whenever possible (as they are in MemMax), further reducing area.
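To make the reordering problem concrete, here is a minimal sketch of a per-ID reorder buffer in the spirit of AXI ordering rules: responses for different IDs may interleave freely, but responses sharing an ID must return in request order, so early completions from the memory side are held until their turn. Class and method names are hypothetical, not any Sonics API.

```python
# Minimal per-ID reorder buffer sketch. AXI-style rule: responses for
# different IDs may interleave, but same-ID responses return in order.
from collections import defaultdict, deque

class ReorderBuffer:
    def __init__(self):
        self.expected = defaultdict(deque)  # per-ID queue of outstanding tags
        self.held = {}                      # tag -> data completed early

    def issue(self, axi_id, tag):
        """Record a request in program order for its ID."""
        self.expected[axi_id].append(tag)

    def complete(self, axi_id, tag, data):
        """Memory finished `tag`; return whatever is now releasable in order."""
        self.held[tag] = data
        released = []
        q = self.expected[axi_id]
        while q and q[0] in self.held:      # drain the head while it's ready
            released.append(self.held.pop(q.popleft()))
        return released
```

If the DRAM side completes the second request for an ID first, nothing is released; when the first completes, both come out together, in order. The deeper the buffer, the more freedom the scheduler has to complete out of order – which is exactly why depth should be tuned against concurrency and latency targets.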

Lastly is the analysis challenge. Once set up, the multichannel memory subsystem can do wonders. Analyzing the situation statically is difficult, and often IP blocks are uncharacterized at a detailed level. (That may be even more true for internally designed IP, where the objectives were functional correctness at block-level, not optimization at integration-level.) Testbenches need to be created, traffic generated, an accurate simulation run, and performance analyzed at integration-level to be sure everything not only works but moves data efficiently.

SonicsStudio integrates the traffic analysis tasks into the network-on-chip design flow with a GUI-driven environment supporting schematics, tabular readings, scripting, and more. It allows designers to work in RTL and in cycle-accurate SystemC models that deliver 20x better simulation performance than RTL. (Non-cycle-accurate models border on useless for this type of memory scheduling analysis.) It also brings together reporting and visualization of results into a single tool.

Even on a small scale – two memory channels with five initiators – paying attention to details can deliver substantial memory subsystem improvements. Without reordering, fine-grained interleaving delivers better balance between channels but runs at only half the speed of coarse-grained interleaving. When reordering buffers are added to IMT using 64-byte interleaving, bandwidth jumps 16% thanks to better utilization with balancing.


These concepts become even more important as technologies such as Wide I/O 2 and HBM start supplanting DDR memory. While there are gains to be had in the memory controller itself, I’d go back to the peanut butter analogy – you get what you get by turning a knob on the coarseness scale. The Sonics approach is proactive, using knowledge from the traffic flowing in the NoC across the entire chip to optimize memory scheduling at integration-level.

Our introduction to the DRAM-as-MRP topic for reference:
4 goals of memory resource planning