
Suresh Vasudevan is CEO of Clockwork.io, pioneering Software-Driven AI Fabrics™ that recover the 60-80% of cluster capacity that today goes completely unutilized. Previously, he led Nimble Storage to IPO and HPE acquisition, and served as CEO of Sysdig. Prior to that, he was at NetApp and McKinsey & Co.
His focus: making every GPU hour count.
Tell us about your company.
Clockwork.io solves a fundamental bottleneck in AI compute: the gap between what organizations pay for and what they get. We deliver a hardware-agnostic layer that provides nanosecond-precision observability, fault tolerance, and performance optimization across any accelerator, network, or deployment model. Our solution, TorchPass, was benchmarked by SemiAnalysis as the only solution that maintains full throughput during failures.
We were founded in 2018 on Stanford research by Balaji Prabhakar, Yilong Geng, and Chief Scientist Mendel Rosenblum (VMware co-founder), backed by Diane Greene, John Chambers, and Lip-Bu Tan.
What problems are you solving?
The bottleneck in AI isn’t how fast the GPU computes. It’s how well hundreds and thousands of them can work together. The fabric connecting them was never engineered to make that reliable.
AI training uses globally synchronous collective operations, where every GPU rank must complete each step before any rank can proceed. A GPU falling off the bus, a memory XID error, a driver fault, a link flap, a NIC failure, a straggler, or an NCCL hang can crash the entire job. SemiAnalysis research finds that the first failure on a new cluster happens within 26 minutes. Meta’s Llama 3 training run logged 466 interruptions over 54 days, each consuming 8–24 engineering hours. In a 2,048-GPU cluster, that adds up to $6.0M annually.
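To make the synchronous-step point concrete, here is a minimal sketch, not Clockwork's or NCCL's implementation. It assumes a step completes only when the slowest rank finishes, and it converts the Llama 3 figures quoted above into a mean time between interruptions. The per-rank timings and the function names are hypothetical, chosen only for illustration.

```python
def step_time(rank_times_ms):
    """A globally synchronous step ends only when the slowest rank ends."""
    return max(rank_times_ms)


def mean_hours_between_interruptions(interruptions, days):
    """Average wall-clock hours between interruptions over a run."""
    return days * 24 / interruptions


# Hypothetical per-rank step times in milliseconds (illustrative, not measured).
healthy = [100.0] * 2048
with_straggler = healthy[:-1] + [250.0]  # one slow rank out of 2,048

print(step_time(healthy))         # 100.0
print(step_time(with_straggler))  # 250.0: every rank waits for the slowest

# Figures quoted in the interview: 466 interruptions over 54 days.
print(round(mean_hours_between_interruptions(466, 54), 2))  # 2.78
```

Under these assumptions, a single straggler sets the pace for all 2,048 ranks, and the quoted interruption count works out to one interruption roughly every 2.8 hours.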
The result: most organizations convert only 20–25% of paid GPU capacity into useful work, which SemiAnalysis calls “goodput.” The rest is lost to failures and overhead. This is the problem we solve: helping companies get the most out of their GPU capacity.
What application areas are your strongest?
Clockwork serves AI builders, including hyperscalers, enterprises, and research institutions – as well as GPU cloud operators.
For AI builders, TorchPass handles failures transparently, so teams need to checkpoint less often, enabling larger batch sizes, fewer OOM errors, and faster time to objective. The deeper shift: AI teams care whether their model finishes, not whether individual nodes are up. The meaningful metric isn’t availability percentage. It’s what fraction of failures are resolved without lost training progress.
For GPU cloud operators, Clockwork delivers faster commissioning, stronger SLAs, and zero-downtime maintenance, such as firmware updates applied while training continues. It enables operators to offer job-level continuity to tenants: committing not to node uptime but to whether training completes without rollback. This is a meaningful differentiator in a commoditizing GPU market.
Customers include Uber, NScale, Nebius, White Fiber, and DCAI (Denmark), which runs the Gefion supercomputer for quantum computing, drug discovery, and weather research.
What keeps your customers up at night?
Job crashes and checkpoint waste. A GPU falling off the bus, GPU stragglers, memory XID errors, driver faults, link flaps, NIC failures, hung ranks, or misconfigurations can crash a multi-week run. Every crash rolls the job back to its last checkpoint. Frequent saves mean high overhead; infrequent saves mean high rollback risk.
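The save-frequency trade-off can be sketched with a standard first-order model (Young's approximation). This is a textbook estimate, not Clockwork's method. It uses the interruption rate quoted earlier in this interview (466 over 54 days); the 5-minute checkpoint cost is a hypothetical value chosen for illustration, as are the function names.

```python
import math


def wasted_fraction(interval_h, checkpoint_cost_h, mtbf_h):
    """First-order estimate of time lost to checkpointing plus rollback.

    save overhead     ~ checkpoint_cost / interval
    expected rollback ~ interval / (2 * MTBF), i.e. half an interval per failure
    """
    return checkpoint_cost_h / interval_h + interval_h / (2 * mtbf_h)


def young_interval(checkpoint_cost_h, mtbf_h):
    """Young's approximation for the interval that minimizes the estimate above."""
    return math.sqrt(2 * checkpoint_cost_h * mtbf_h)


mtbf_h = 54 * 24 / 466     # from the quoted Llama 3 figures
checkpoint_cost_h = 5 / 60  # ASSUMPTION: 5 minutes per save, not a measured value

best = young_interval(checkpoint_cost_h, mtbf_h)
for interval_h in (best / 4, best, best * 4):
    waste = wasted_fraction(interval_h, checkpoint_cost_h, mtbf_h)
    print(f"interval={interval_h * 60:6.1f} min  estimated waste={waste:.1%}")
```

With these assumed inputs, the estimate bottoms out at an interval of roughly 41 minutes. Saving four times more often or four times less often both raise the estimated waste, which is the tension described above.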
Opaque infrastructure. When any of these failures strikes, root-cause analysis takes hours of forensics across networking, compute, and ML teams. This is harder still in heterogeneous clusters running RoCE v2, InfiniBand, or new transports like MRC, where there is no unified cross-fabric view. Physical-layer failures add to this: mismatched pluggables, firmware drift, and environmental factors produce gray failures invisible to standard checks.
Hardware obsolescence. GPU generations turn fast — Hopper to Blackwell to Vera Rubin — and vendor-specific fabric amplifies re-engineering costs each time.
ROI at scale. At 30% utilization, the effective cost per useful GPU-hour is more than 3x what it should be. In NVL72 systems, a single NVLink backplane failure takes multiple trays offline, with far less observability than in the scale-out fabric.
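The utilization arithmetic behind the ROI point is simple division: the cost of each useful GPU-hour is the paid rate divided by the fraction of hours that do useful work. A minimal sketch follows; the $2.00 hourly rate is a hypothetical placeholder, not a quoted price.

```python
def cost_multiplier(utilization):
    """How many paid GPU-hours it takes to get one useful GPU-hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / utilization


def cost_per_useful_hour(paid_rate, utilization):
    """Effective price of one useful GPU-hour at a given utilization."""
    return paid_rate * cost_multiplier(utilization)


PAID_RATE = 2.00  # ASSUMPTION: hypothetical $/GPU-hour, for illustration only

for utilization in (1.00, 0.30, 0.25):
    print(
        f"utilization={utilization:.0%}  "
        f"multiplier={cost_multiplier(utilization):.2f}x  "
        f"effective=${cost_per_useful_hour(PAID_RATE, utilization):.2f}/useful hour"
    )
```

At 30% utilization the multiplier is about 3.3x, and at the 25% goodput figure cited earlier it is exactly 4x.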
What does the competitive landscape look like and how do you differentiate?
AI fabric management is a nascent category with no direct competitor. The closest analogs are the bespoke solutions built by frontier labs (SpaceX/xAI, Meta, Google), which have massive resources and stacks tuned to their own topologies.
For everyone else, the choice has been between poor utilization and patching together open-source tools. Clockwork.io is hardware-agnostic across NVIDIA and AMD compute, InfiniBand, Ethernet, and RoCE. Broadcom has stated Clockwork helps their platforms “realize their full potential”; AMD has endorsed Clockwork on the MI350X/ROCm stack. SemiAnalysis found TorchPass “the only option that maintains the same training performance as jobs without fault tolerance.”
Clockwork differentiates on three dimensions: hardware neutrality, with support for TorchTitan, Megatron-LM, DeepSpeed, Slurm, and Kubernetes; nanosecond-precision telemetry via Global ClockSync; and stateful fault tolerance via TorchPass, which performs live GPU migration from the exact failed step. Hardware neutrality also hedges against inference fragmentation: as KV cache and prefill/decode traffic add RDMA demands, the cost of a vendor-specific stack compounds with each new workload. Fibre Channel gave way to Ethernet for the same reason.
What new features/technology are you working on?
The roadmap targets “autonomic collective communications” – building a fabric that predicts failures, adapts routing, and self-optimizes. Three new FleetIQ capabilities ship this month: Metric-to-Detection Pipeline, Advanced Fleet Monitoring, and Advanced Workload Monitoring. Advanced Fleet Monitoring probes every NIC-to-switch link 100× per second with directional precision, surfacing gray failures that Round Trip Time (RTT) monitoring averages away. Advanced Workload Monitoring instruments at the NCCL layer to identify which rank stalled. TorchPass is evolving into a full training continuity platform — a tiered orchestration layer selecting the least disruptive response to failures under customer-defined policy. The core pillars — ClockSync, State Transfer, and Dynamic Traffic Control — are advancing from reactive to predictive.
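The claim that RTT monitoring averages away directional gray failures can be illustrated with a small sketch. This is not FleetIQ's implementation. It assumes one-way delays can be measured at all, which requires synchronized clocks on both ends. All delay values, the threshold, and the function name are hypothetical.

```python
from statistics import mean

BASELINE_US = 5.0     # ASSUMPTION: healthy one-way delay in microseconds
THRESHOLD_US = 20.0   # ASSUMPTION: per-probe alert threshold

# One second of probes at 100 per second on a single NIC-to-switch link.
forward_us = [BASELINE_US] * 100
for i in (10, 40, 70):        # three probes hit a degraded forward path
    forward_us[i] = 50.0
reverse_us = [BASELINE_US] * 100  # reverse direction stays healthy

# Round-trip time is the sum of the two directions for each probe.
rtt_us = [f + r for f, r in zip(forward_us, reverse_us)]


def flagged(samples, threshold_us):
    """Indices of probes whose delay exceeds the threshold."""
    return [i for i, s in enumerate(samples) if s > threshold_us]


print(f"mean RTT over the window: {mean(rtt_us):.2f} us")
print("forward probes flagged:", flagged(forward_us, THRESHOLD_US))
print("reverse probes flagged:", flagged(reverse_us, THRESHOLD_US))
```

In this toy window the mean RTT moves only from 10 to about 11.35 microseconds and says nothing about direction, while the per-probe one-way view isolates the three bad forward samples and shows the reverse path is clean.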
How do customers normally engage with your company?
FleetIQ is a 100% software overlay that requires no hardware changes. Customers begin with a free consultation or POC. SemiAnalysis benchmarking and TCO/Goodput calculators support technical evaluation.
As GPU systems grow denser and costlier, the ROI for software that recovers latent capacity grows with it.
clockwork.io — hello@clockwork.io