
SK Hynix proposes HBM and HBF hybrid for LLM inference

Daniel Nenni

SK Hynix has presented a recent IEEE paper describing an architecture that combines the speed of High-Bandwidth Memory (HBM) with the capacity of High-Bandwidth Flash (HBF) on a single interposer, connecting both to a GPU to accelerate AI model and agent inference.

Current GPUs and the forthcoming Nvidia Rubin device have interposer-connected HBM to supply data at high speed and bandwidth to the GPU cores. However, HBM capacity limits are lengthening AI large language model (LLM) inference times as data has to be accessed from slower local SSDs. HBF is slower to access than HBM, although much faster than a local SSD, and has higher capacity. Placing it on the same interposer as HBM, as in SK Hynix’s H³ design, allows it to be used as a fast-access HBM cache, shortening large model processing time.

This is how we could conceptualize the idea:

[Figure: B&F diagram showing HBM, HBF, and GPU interposer-based connectivity]

Future HBM generations will extend HBM capacity and bandwidth, but they will not arrive soon enough to address current inference latency issues, which leave GPUs memory-bound and waiting for data.

The paper argues that H³ is well-suited to this problem in the KV cache area of inference processing. When an AI model is being used for inference, it stores the context memory sequence – component tokens and vectors – in HBM in what is called a KV (key-value) cache structure. The H³ paper states: "The latest Llama 4 [LLM] supports sequence lengths up to 10 million." This can require a 5.4 TB cache, "requiring dozens of GPUs just to store these values."

Nvidia's ICMSP software extends the KV cache to local NVMe SSDs, allowing processing to complete much faster than if the tokens and vectors had to be recomputed when HBM capacity runs out.

HBF, however, provides a KV cache even closer to the GPU, eliminating the SSD's PCIe bus link time and delivering lower-latency, higher-bandwidth access than a local SSD. The paper states: "The expected advantages of HBF are 1) up to 16x larger capacity than HBM, and 2) similar bandwidth to HBM, and the expected disadvantages are 1) slower access (ns vs. μs), 2) lower write endurance, and 3) up to 4x higher power consumption than HBM."

Because HBF has limited endurance – only supporting approximately 100,000 write cycles – it is best suited to read-intensive workloads. The H³ paper abstract says: "H³-equipped systems can process more requests at once with the same number of GPUs than HBM-only systems, making H³ suitable for gigantic read-only use cases in LLM inference, particularly those employing a shared pre-computed key-value cache."
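The endurance argument is easy to see with a rough lifetime estimate, taking the article's ~100,000 program/erase cycle figure at face value (the write-rate scenarios below are assumptions for illustration):

```python
# Why endurance pushes HBF toward read-mostly data: a back-of-envelope
# device-lifetime estimate under an assumed ~100,000 P/E cycle limit.
CYCLES = 100_000

def lifetime_years(full_device_writes_per_day):
    return CYCLES / full_device_writes_per_day / 365

print(f"{lifetime_years(1):.0f} years")     # ~274 — fine for model weights
print(f"{lifetime_years(1000):.2f} years")  # ~0.27 — hot KV writes wear it out
```

Written once and read constantly, data such as model weights or a shared pre-computed KV cache barely touches the endurance budget; continuously rewritten data would exhaust it quickly.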

Cache-augmented generation (CAG) is such a workload. "When the LLM receives a query, it reads the gigantic shared pre-computed KV cache, performs a computation, and then outputs a response. In other words, the shared precomputed KV cache is inherently read-only."

[Figure: IEEE SK Hynix H³ paper combined HBM and HBF structure diagrams]

The H³ paper diagrams show the concept. D2D is die-to-die transfer. The HBM and HBF controllers are each located on their own base die. Model weights and shared pre-computed KV caches are stored in the HBF. The generated KV caches and other data are stored in the HBM. To compensate for the longer NAND flash latency, a latency hiding buffer (LHB), which is a kind of pre-fetch buffer, is integrated into the base die of the HBM in this diagram.

The H³ design envisages a GPU having HBM stacks attached to its edges (shoreline) with both GPUs and HBM sitting on an interposer (a) in the diagram. The HBM and HBF are connected daisy-chain fashion. "Within the HBM base die, memory access is divided into two paths by address decoder and router: one accessing the HBM and the other accessing the HBF. Consequently, the GPU can directly access the HBF through the HBM base die."

"In other words, both the HBM and HBF serve as the GPU's main memory." The diagram's base global address scheme shows how "the [GPU] host uses the unified address space with divided memory regions when accessing HBM or HBF."
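A minimal sketch of that unified address space, assuming illustrative region sizes (the paper does not specify them), shows the routing decision the address decoder on the HBM base die makes:

```python
# The decoder routes each unified address either to local HBM or, via
# the die-to-die (D2D) path, to the HBF stack behind it. The 192 GiB /
# 3 TiB region sizes are assumptions for illustration only.
HBM_BASE, HBM_SIZE = 0x0, 192 << 30            # 192 GiB of HBM
HBF_BASE, HBF_SIZE = HBM_BASE + HBM_SIZE, 3 << 40   # 3 TiB of HBF

def route(addr):
    """Return (device, device-local offset) for a unified address."""
    if HBM_BASE <= addr < HBM_BASE + HBM_SIZE:
        return "HBM", addr - HBM_BASE
    if HBF_BASE <= addr < HBF_BASE + HBF_SIZE:
        return "HBF", addr - HBF_BASE          # forwarded over the D2D path
    raise ValueError(f"address {addr:#x} outside unified space")

print(route(0x1000))             # ('HBM', 4096)
print(route(HBF_BASE + 0x2000))  # ('HBF', 8192)
```

From the GPU's point of view both regions are ordinary main memory; only the latency differs depending on which side of the boundary an address falls.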

SK Hynix's H³ design simulation used an Nvidia Blackwell GPU (B200) with 8 x HBM3E stacks and 8 x HBF stacks. In tokens per second, H³ delivers 1.25x the throughput of HBM alone with 1 million tokens and 6.14x with 10 million tokens.

The results showed a 2.69x improvement in performance per watt compared to tests with a Blackwell GPU and 8 x HBM stacks but no HBF.

Testing with a 10-million-token KV cache showed that the HBM+HBF setup could process 18.8x more simultaneous queries (its batch size) than an HBM-only configuration. With HBF, workloads that would otherwise require 32 GPUs and their HBM can be processed with only two GPUs, substantially reducing electricity consumption.

Read the H³ paper for more details, particularly of the simulation testing. It costs $36 from the IEEE for non-members.

Bootnote

The IEEE H³ paper abstract says: "Large language model (LLM) inference requires massive memory capacity to process long sequences, posing a challenge due to the capacity limitations of high bandwidth memory (HBM). High bandwidth flash (HBF) is an emerging memory device based on NAND flash that offers HBM comparable bandwidth with much larger capacity, but suffers from disadvantages such as longer access latency, lower write endurance, and higher power consumption. This paper proposes H³, a hybrid architecture designed to effectively utilize both HBM and HBF by leveraging their respective strengths. By storing read-only data in HBF and other data in HBM, H³-equipped systems can process more requests at once with the same number of GPUs than HBM-only systems, making H³ suitable for gigantic read-only use cases in LLM inference, particularly those employing a shared pre-computed key-value cache. Simulation results show that a GPU system with H³ achieves up to 2.69x higher throughput per power compared to a system with HBM-only. This result validates the cost-effectiveness of H³ for handling LLM inference with gigantic read-only data."

 

The Golden Alliance: Chey Tae-won's Historic GTC Debut

The most significant diplomatic event of the conference was the first-ever attendance of SK Group Chairman Chey Tae-won at GTC. His presence at Jensen Huang’s keynote and subsequent joint tour of the exhibition floor provided a powerful visual testament to the "Golden Alliance" between NVIDIA and SK Hynix.

This wasn't merely a business visit; it was a strategic affirmation of SK Hynix’s role as an "Innovation Partner." Following their high-profile "chimaek" (chicken and beer) meeting in February, the two leaders reviewed the integration of SK Hynix’s sixth-generation HBM4 products into NVIDIA’s next-generation Vera Rubin accelerator.

https://www.digitimes.com/news/a20260317VL219/2026-gtc-samsung-hbm-micron.html
 
This means they will have to implement two different types of HBM base die (the bottom die of the HBM stack): one for generic HBM and one for the custom HBM-HBF configuration, plus an HBF-specific NAND flash die and an HBF base die. Those base dies need to be fabricated in a foundry. This could be a new battlefield for foundries, since it offers large volume and is less difficult than GPUs and APs.
 
https://zdnet.co.kr/view/?no=20260317111538

SK Group Chairman Chey Tae-won predicted that the supply shortage in the global memory semiconductor market would continue for another four to five years and announced that he would soon unveil measures to stabilize the market. He also officially announced plans to pursue the listing of SK Hynix’s American Depositary Receipts (ADRs).

Chairman Chey pointed out at NVIDIA's 'GTC 2026', held in San Jose, USA, on the 16th (local time), that "the core cause of the semiconductor supply shortage is the limit of wafer production capacity." Since it takes at least four to five years to bring new wafer facilities online, he predicted that the situation where supply falls about 20 percent short of demand will persist until 2030.

[Photo: SK Group Chairman Chey Tae-won answers questions from reporters at the NVIDIA annual developer conference 'GTC 2026' exhibition hall, San Jose Convention Center, California, on the 16th (local time). Photo: Yonhap News]
Regarding concerns about soaring prices due to supply shortages, Chairman Chey stated, "SK Hynix CEO Kwak No-jung will soon announce a new strategy to stabilize DRAM prices." On the possibility, raised by some, of establishing a factory in the U.S., he drew a line, saying, "Focusing on production facilities in Korea, where the infrastructure is already well-established, is much more efficient and allows for a faster response."

SK Hynix also officially announced plans to pursue the listing of American Depositary Receipts (ADRs) in the United States to leap forward as a global company.

Chairman Chey plans to expand his engagement with global investors through this initiative. Furthermore, he clearly demonstrated his commitment to cooperation within the global AI semiconductor ecosystem, mentioning the possibility of a meeting with NVIDIA CEO Jensen Huang and praising TSMC as an "irreplaceable and valuable partner."
 