June 12, 2026 | June 19, 2026 by Daniel Nenni

CEO Interview with Suresh Vasudevan of Clockwork.io

by Daniel Nenni on 06-12-2026 at 6:00 pm
Categories: CEO Interviews

Suresh Vasudevan is CEO of Clockwork.io, pioneering Software-Driven AI Fabrics™ that recover the 60-80% of cluster capacity that today goes completely unutilized. Previously, he led Nimble Storage to IPO and HPE acquisition, and served as CEO of Sysdig. Prior to that, he was at NetApp and McKinsey & Co.

His focus: making every GPU hour count.

Tell us about your company.

Clockwork.io solves a fundamental bottleneck in AI compute – the gap between what organizations pay for and what they get. We deliver a hardware-agnostic layer that provides nanosecond observability, fault tolerance, and performance optimization across any accelerator, network, or deployment model. Our solution, TorchPass, has been benchmarked by SemiAnalysis as the only solution maintaining full throughput during failures.

We were founded in 2018 on Stanford research by Balaji Prabhakar, Yilong Geng, and Chief Scientist Mendel Rosenblum (VMware co-founder), backed by Diane Greene, John Chambers, and Lip-Bu Tan.

What problems are you solving?

The bottleneck in AI isn’t how fast the GPU computes. It’s how well hundreds and thousands of them can work together. The fabric connecting them was never engineered to make that reliable.

AI training uses globally synchronous collective operations where every GPU rank must complete each step before any proceeds. A GPU falling off the bus, a memory XID error, a driver fault, a link flap, a NIC failure, a straggler, or an NCCL hang can crash the entire job. SemiAnalysis research finds the first failure on a new cluster happens within 26 minutes. Meta’s Llama 3 training run logged 466 interruptions over 54 days, each costing 8–24 engineering hours. In a 2,048-GPU cluster, this equates to $6.0M annually.
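The scaling behind this can be sketched in a few lines. This is a rough illustration, assuming independent, exponentially distributed failures per GPU. The 50,000-hour per-GPU figure is a hypothetical placeholder, not a number from the interview; only the 466 interruptions over 54 days come from the text. The function names are illustrative.

```python
def job_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Mean time between job-stopping failures in a synchronous job.

    Assumes any single GPU failure stops the whole job and that failures
    are independent and exponentially distributed (an assumption made
    for illustration only).
    """
    if per_gpu_mtbf_hours <= 0 or num_gpus <= 0:
        raise ValueError("inputs must be positive")
    return per_gpu_mtbf_hours / num_gpus


def observed_interval_hours(interruptions: int, days: float) -> float:
    """Average hours between interruptions over an observation window."""
    if interruptions <= 0 or days <= 0:
        raise ValueError("inputs must be positive")
    return days * 24 / interruptions


if __name__ == "__main__":
    # Hypothetical per-GPU MTBF; the job-level figure shrinks linearly with GPU count.
    for n in (8, 2048, 16384):
        print(f"{n:>6} GPUs -> job MTBF {job_mtbf_hours(50_000, n):.2f} h")

    # Figures quoted in the interview: 466 interruptions over 54 days.
    print(f"observed interval: {observed_interval_hours(466, 54):.2f} h")
```

Under these assumptions the job-level interval falls in direct proportion to GPU count, which is why a component that looks reliable in isolation still produces frequent interruptions at cluster scale.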

The result: most organizations convert only 20–25% of paid GPU capacity into useful work, what SemiAnalysis calls “goodput.” The rest is lost to failures and overhead. This is the problem we solve: helping companies maximize their GPU capacity.

What application areas are your strongest?

Clockwork serves AI builders (hyperscalers, enterprises, and research institutions) as well as GPU cloud operators.

For AI builders, TorchPass handles failures transparently, so teams need to checkpoint less often, enabling larger batch sizes, fewer OOM errors, and faster time to objective. The deeper shift: AI teams care whether their model finishes, not whether individual nodes are up. The meaningful metric isn’t availability percentage. It’s what fraction of failures are resolved without lost training progress.

For GPU cloud operators, Clockwork delivers faster commissioning, stronger SLAs, and zero-downtime maintenance. This means firmware updates while training continues. It enables operators to offer job-level continuity to tenants: committing not to node uptime but to whether training completes without rollback. This is a meaningful differentiator in a commoditizing GPU market.

Customers include Uber, NScale, Nebius, White Fiber, and DCAI (Denmark), which runs the Gefion supercomputer for quantum computing, drug discovery, and weather research.

What keeps your customers up at night?

Job crashes and checkpoint waste. A GPU falling off the bus, GPU stragglers, memory XID errors, driver faults, link flaps, NIC failures, hung ranks, or misconfigurations can crash a multi-week run. Every crash rolls back to the last checkpoint. Frequent saves mean high overhead; infrequent saves mean high rollback risk.
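One common way to reason about the save-frequency tradeoff is Young's first-order approximation for the checkpoint interval, sketched below. This is a textbook formula, not Clockwork's method, and the checkpoint cost and failure interval used here are hypothetical inputs chosen to keep the arithmetic simple.

```python
import math


def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: interval that minimizes first-order waste."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("inputs must be positive")
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)


def wasted_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of the fraction of time lost.

    checkpoint_cost_s / interval_s : time spent writing checkpoints
    interval_s / (2 * mtbf_s)      : expected rework, half an interval per failure
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost_s / interval_s + interval_s / (2 * mtbf_s)


if __name__ == "__main__":
    cost_s, mtbf_s = 50.0, 10_000.0  # hypothetical: 50 s per save, ~2.8 h between failures
    best = young_interval_s(cost_s, mtbf_s)
    for interval in (best / 2, best, best * 2):
        loss = wasted_fraction(interval, cost_s, mtbf_s)
        print(f"save every {interval:7.0f} s -> about {loss:.1%} of time lost")
```

Saving twice as often or half as often as the estimated interval both raise the lost fraction in this model, which is the overhead-versus-rollback tension described above. Shorter times between failures push the best interval down and the unavoidable loss up.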

Opaque infrastructure. When any of these strikes, root-cause analysis requires hours of forensics across networking, compute, and ML teams. This is harder in heterogeneous clusters running RoCE v2, InfiniBand, or new transports like MRC with no unified cross-fabric view. Physical layer failures add to this: mismatched pluggables, firmware drift, and environmental factors produce gray failures invisible to standard checks.

Hardware obsolescence. GPU generations turn fast — Hopper to Blackwell to Vera Rubin — and vendor-specific fabric amplifies re-engineering costs each time.

ROI at scale. At 30% utilization, cost-per-GPU-hour is roughly 3x what it should be. In NVL72 systems, a single NVLink backplane failure takes multiple trays offline with far less observability than scale-out.
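The utilization arithmetic can be checked with a short sketch. It assumes the cost of a useful GPU-hour is simply the paid price divided by the fraction of capacity doing useful work. The $2.00 hourly price is a hypothetical placeholder; the 30% figure and the 20–25% goodput range are the ones quoted in the interview.

```python
def cost_multiplier(utilization: float) -> float:
    """How many paid GPU-hours are needed per useful GPU-hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


def cost_per_useful_gpu_hour(price_per_gpu_hour: float, utilization: float) -> float:
    """Effective price of one useful GPU-hour at a given utilization."""
    return price_per_gpu_hour * cost_multiplier(utilization)


if __name__ == "__main__":
    price = 2.00  # hypothetical price per GPU-hour, in dollars
    for utilization in (0.20, 0.25, 0.30, 1.00):
        effective = cost_per_useful_gpu_hour(price, utilization)
        print(f"{utilization:.0%} utilization -> {cost_multiplier(utilization):.2f}x, "
              f"${effective:.2f} per useful GPU-hour")
```

Because the multiplier is the reciprocal of utilization, gains at the low end matter most: moving from 20% to 30% cuts the effective cost by a third under this simple model.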

What does the competitive landscape look like and how do you differentiate?

AI fabric management is a nascent category with no direct competitor. The closest analog is bespoke solutions built by frontier labs — SpaceX/xAI, Meta, Google — with massive resources and stacks tuned to their topology.

For everyone else, it’s been poor utilization or patching open-source tools. Clockwork.io is hardware-agnostic across NVIDIA and AMD compute, InfiniBand, Ethernet, and RoCE. Broadcom has stated Clockwork helps their platforms “realize their full potential”; AMD has endorsed Clockwork on the MI350X/ROCm stack. SemiAnalysis found TorchPass “the only option that maintains the same training performance as jobs without fault tolerance.”

Clockwork differentiates on three dimensions: hardware neutrality across TorchTitan, Megatron-LM, DeepSpeed, Slurm, and Kubernetes; nanosecond-precision telemetry via Global ClockSync; and stateful fault tolerance via TorchPass — live GPU migration from the exact failed step. Hardware neutrality hedges against inference fragmentation: as KV cache and prefill/decode add RDMA demands, vendor-specific stacks compound with each workload. Fibre Channel gave way to Ethernet for the same reason.

What new features/technology are you working on?

The roadmap targets “autonomic collective communications” – building a fabric that predicts failures, adapts routing, and self-optimizes. Three new FleetIQ capabilities ship this month: Metric-to-Detection Pipeline, Advanced Fleet Monitoring, and Advanced Workload Monitoring. Advanced Fleet Monitoring probes every NIC-to-switch link 100× per second with directional precision, surfacing gray failures that Round Trip Time (RTT) monitoring averages away. Advanced Workload Monitoring instruments at the NCCL layer to identify which rank stalled. TorchPass is evolving into a full training continuity platform — a tiered orchestration layer selecting the least disruptive response to failures under customer-defined policy. The core pillars — ClockSync, State Transfer, and Dynamic Traffic Control — are advancing from reactive to predictive.

How do customers normally engage with your company?

FleetIQ is a 100% software overlay with no hardware changes. Customers begin with a free consultation or POC. SemiAnalysis benchmarking and TCO/Goodput calculators support technical evaluation.

As GPU systems grow denser and costlier, the ROI for software that recovers latent capacity grows with it.

clockwork.io — hello@clockwork.io

Also Read:

Q&A Interview with Mo Steinman, Lightelligence’s Senior Vice President and General Manager, U.S.

CEO Interview with Mike Horton CEO of HYFIX

How llmda.ai Coaxed Me Out of Retirement, an Interview with Kurt Shuler
