Podcast EP308: How Clockwork Optimizes AI Clusters with Dan Zheng

by Daniel Nenni on 09-26-2025 at 10:00 am

Daniel is joined by Dan Zheng, VP of Partnerships and Operations at Clockwork. Dan was the General Manager for Product and Partnerships at Urban Engines, which was acquired by Google in 2016. He has also held roles at Stanford University and Google.

Dan and Daniel explore the challenges of operating massive AI hardware infrastructure at scale. Running modern GPU clusters efficiently turns out to be quite difficult: communication bottlenecks within and between clusters, stalled pipelines, network issues, and memory issues can all contribute to the problem. Debugging these issues is hard, and Dan explains that restarts from prior checkpoints can happen many times in large AI clusters, each of which can waste many thousands of GPU hours.

Dan also describes Clockwork’s FleetIQ Platform and how it addresses these situations by providing nanosecond-accurate visibility correlated across the stack. The result is more efficient and productive AI clusters that can accomplish far more work. This provides more AI capability with the same hardware, essentially democratizing access to AI.

Contact Clockwork

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.
