A Survey On Cache Bypassing Techniques for CPUs, GPUs and CPU-GPU systems


Abstract: With increasing core counts, the cache demand of modern processors has also increased. However, due to strict area/power budgets and the presence of workloads with poor data locality, blindly scaling cache capacity is both infeasible and ineffective. Cache bypassing is a promising technique to increase effective cache capacity without incurring the power/area costs of a larger cache. This paper presents a survey of cache bypassing techniques for CPUs, GPUs, and CPU-GPU heterogeneous systems, and for caches designed with SRAM, non-volatile memory (NVM), and die-stacked DRAM. It covers bypassing techniques for inclusive, non-inclusive, and exclusive cache hierarchies, and studies performed on both simulators and real processors.
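To make the idea concrete, here is a minimal, illustrative sketch (not from the paper; the class, the direct-mapped organization, and the simple "insert only on observed reuse" predictor are all assumptions) of how a bypassing cache can refuse to insert blocks with no reuse history, so streaming data cannot evict hot lines:

```python
class BypassingCache:
    """Toy direct-mapped cache that bypasses blocks with no observed reuse.

    Illustrative only: real bypassing schemes use far more sophisticated
    reuse/dead-block predictors than this per-address access counter.
    """

    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.lines = {}        # set index -> tag currently cached
        self.seen = {}         # block address -> access count (reuse predictor)
        self.hits = self.bypasses = 0

    def access(self, block_addr):
        idx = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        self.seen[block_addr] = self.seen.get(block_addr, 0) + 1
        if self.lines.get(idx) == tag:
            self.hits += 1
            return "hit"
        # Miss: insert only if this block has shown reuse before;
        # otherwise bypass so it cannot evict a potentially hot line.
        if self.seen[block_addr] > 1:
            self.lines[idx] = tag
            return "miss-insert"
        self.bypasses += 1
        return "miss-bypass"


cache = BypassingCache(num_sets=4)
cache.access(0)   # first touch: miss, bypassed (no reuse history)
cache.access(0)   # second touch: miss, now inserted
cache.access(0)   # hit
cache.access(4)   # maps to the same set, but bypassed -> block 0 survives
```

The point of the sketch is the last line: a conventional cache would have evicted block 0 on the conflicting access, while the bypassing cache preserves it.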

PDF is attached. Also available online as "A Survey of Cache Bypassing Techniques" by Sparsh Mittal; the paper was accepted in JLPEA 2016 and reviews ~90 papers.


Law of holes - Wikipedia, the free encyclopedia

The problem is (partially) that we have reused a computing paradigm from the 1980s that requires cache coherency to function. There are other ways around the problem of scalability in parallel computing systems, but a clue to the answer is that there is no awareness of the problem at the programming level (it's a 1970s paradigm).

If you are dedicating more silicon to caches and branch prediction than to actual compute, isn't it time for a rethink?


Another example of cache bypass

At HP we built an MCM with two Itanium CPUs on it and an inclusive L4 cache for use in Superdome. The no-load latency of the L4$ was fine, but its bandwidth could be overwhelmed by the two CPUs (it was data-limited, not tag-limited). The following patent covers the solution. The resulting shipped part was the MX2 (a.k.a. Hondo).

United States Patent 9,405,696

Cache and method for cache bypass functionality

A cache is provided for operatively coupling a processor with a main memory. The cache includes a cache memory and a cache controller operatively coupled with the cache memory. The cache controller is configured to receive memory requests to be satisfied by the cache memory or the main memory. In addition, the cache controller is configured to process cache activity information to cause at least one of the memory requests to bypass the cache memory.
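A rough sketch of the mechanism the patent abstract describes, as I read it (the class, the "outstanding requests" metric, and the threshold are my assumptions, not the patent's actual implementation): the controller tracks cache activity and, when the data array is saturated, routes a request straight to main memory instead of queueing it behind the bandwidth-limited cache.

```python
class ActivityBypassController:
    """Toy cache controller that bypasses the cache under load.

    Illustrative assumption: "cache activity information" is modeled as the
    number of outstanding requests to the cache's data array; the real
    patented mechanism is more involved.
    """

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        self.outstanding = 0

    def route(self, block_addr):
        # If the cache data array is already saturated, send the request
        # directly to main memory rather than adding to the backlog.
        if self.outstanding >= self.max_outstanding:
            return "main_memory"
        self.outstanding += 1
        return "cache"

    def complete(self):
        # Called when a cache-directed request finishes.
        self.outstanding -= 1


ctrl = ActivityBypassController(max_outstanding=2)
ctrl.route(0x100)   # -> cache
ctrl.route(0x140)   # -> cache (now saturated)
ctrl.route(0x180)   # -> main_memory (bypassed)
```

This captures why the scheme helps a bandwidth-limited (rather than capacity-limited) cache like the MX2's L4: under light load everything still flows through the cache, and only the overflow traffic bypasses it.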