
Silent Data Corruption (SDC) represents a critical challenge in modern semiconductor design, particularly in high-performance computing environments like AI data centers. As highlighted in a collaborative presentation by Broadcom Inc. and Siemens EDA at the 2025 TSMC OIP event, SDC occurs when hardware defects cause erroneous computations without triggering detectable errors, leading to subtle yet devastating failures. In one customer experiment involving a 54-day training run on 16,384 GPUs, 419 unexpected interruptions were reported, with 6 attributed directly to SDC. Though rare, accounting for about 1.4% of fails, these incidents can disrupt mission-critical operations, such as AI model training, where reliability is paramount.
The presentation underscores the industry-wide nature of SDC, driven by shrinking process nodes and increasing chip complexity. Defects that evade manufacturing tests may manifest in-field due to aging, voltage fluctuations, or thermal stress. Traditional testing methods fall short here, as they require device removal for diagnostics, which is impractical in deployed systems. To combat this, the teams advocate for in-system testing capabilities that allow periodic checks without downtime. Running ATPG patterns directly in the field detects latent defects that could precipitate SDC, ensuring system integrity. For AI applications, this means integrating test suites that can be executed routinely, preventing costly interruptions. Moreover, new patterns tailored to SDC can be deployed remotely, extending device lifespan without physical intervention.
Siemens’ In-System Test (IST) solution emerges as a key enabler. Built on the Streaming Scan Network (SSN), IST interfaces with embedded deterministic test (EDT) structures to deliver ATPG patterns efficiently. The IST controller drives the SSN’s parallel interface, supporting high-bandwidth data transfer via protocols like APB or AXI. In Broadcom’s implementation, IST was adapted for an EDT-based design with a Streaming Scan Host at the chip level. The controller resides at the top level, loading patterns into local SRAM via an on-chip CPU. Block-level EDT patterns, originally for production testing, are retargeted to IST inputs, allowing selective testing of targeted blocks while maintaining functional operation elsewhere.
Implementation brought several design challenges to the fore. Functional isolation is paramount: “functional” blocks (e.g., CPU subsystems) must remain active to load and execute IST operations, while “targeted” blocks switch to scan mode for testing. This requires isolating scan inputs to prevent interference. All functional block inputs that could disrupt IST, such as interrupts or AXI signals, must be held in a “quiet” state. Outputs from targeted blocks, which toggle during capture, are gated to avoid propagating noise. Broadcom addressed this by inserting isolation blocks and enabling Test Data Registers for control.
Clock splitting posed another hurdle. Broadcom’s methodology places On-Chip Clock controllers (OCC) at the chip top due to custom clocking. Functional blocks need free-running clocks, but targeted ones require OCC activation for scan shifts. Solutions included branching pre-OCC clocks for functional paths or adding secondary OCCs for targeted branches, ensuring synchronized yet independent clock domains.
Verification and Static Timing Analysis added complexity. Typically, STA modes separate functional and Design-for-Test (DFT) paths, but IST demands a hybrid “merged” mode where some blocks are functional and others in DFT. The Siemens tool provides verification collaterals like transaction files, C code, and SystemVerilog tasks for Design Verification (DV) environments. Testing occurs on post-DFT netlists, incorporating boot sequences, which extends runtime. Close collaboration between DV and DFT teams was essential for deliverables and debugging handshakes.
Results from the APB-based IST implementation demonstrate feasibility. With a 32-bit wide subordinate interface and SSN data bus, hardware overhead was modest: the IST Controller (ISTC) added 200 flops and 5,000 normalized combinational logic units, while SSH contributed 1,000 flops and 30,000 units. Five intest modes were run for 2,500 patterns, using 2 MB on-chip SRAM (about 0.5 million 32-bit words). Pattern storage ranged from 165,000 to 260,000 words per mode, with counts of 22-35 patterns. Overall, ~1.9 million 32-bit words were managed, with 4 loads per mode, showcasing efficient compression and bandwidth utilization.
Bottom line: The collaboration between Broadcom and Siemens highlights IST’s role in mitigating SDC through in-field testing. Despite challenges in isolation, clocking, and verification, the solution was successfully implemented and verified in DFT and DV setups. Future efforts will extend to AXI-based IST, promising broader adoption. This approach not only enhances reliability in AI and hyperscale environments but also reduces field failures, underscoring the value of embedded deterministic testing in next-generation silicon.
Also Read:
Podcast EP323: How to Address the Challenges of 3DIC Design with John Ferguson
3D ESD verification: Tackling new challenges in advanced IC design
Signal Integrity Verification Using SPICE and IBIS-AMI
Share this post via:


Comments
There are no comments yet.
You must register or log in to view/post comments.