After many years of hesitating to jump in with both feet, semiconductor companies are now seriously considering cloud strategies and making the required investments. Their concern, though, is: how much investment will it take? Some of the block-and-tackle challenges they face in implementing a cloud strategy are listed below.
- Performing a cloud candidature assessment to arrive at the best cloud strategy.
- Identifying the best storage technology to efficiently solve the complex riddle of hybrid cloud HPC storage.
- Establishing an EDA harness that seamlessly manages hybrid cloud EDA workflows.
- Laying guardrails around cloud costs and establishing mechanisms to predict EDA operating costs.
This blog discusses the importance of leveraging an analytics platform to drive and shape the cloud strategy and its implementation. A good analytics platform such as TCS NeurEDA™ can help match workloads with existing compute resources and deliver more without additional infrastructure investment. Subsequent blogs will focus on each of the four challenges above and how to overcome them cost-effectively.
The demand for semiconductors has never been greater and is expected to keep increasing due to emerging technologies such as 5G, AI, hardware acceleration, edge computing, and the Internet of Things (IoT). This has sparked a new wave of innovation in the semiconductor industry, and companies that can meet such ASIC/SoC needs will likely rise to the top in future markets. These growth opportunities are driving a significant increase in HPC infrastructure consumption (a CAGR of approximately 17% over the next five years) at a scale not seen before. Much of this HPC compute demand can be attributed to EDA design workloads.
Semiconductor companies rely on on-demand access to large HPC server farms to support their chip design and verification processes. These engineering environments are coming under tremendous stress owing to increasing design complexity and the steady introduction of new chip variants. The need to verify more complex chips is forcing significant increases in infrastructure investment. When faced with aggressive silicon tapeout schedules, the EDA compute infrastructure is expected to support at least 1.5X normal capacity. This poses several operational challenges for the team responsible for EDA infrastructure and logistics, and the situation is further aggravated when that team is asked to support engineering efforts that produce twice the number of chip variants. This is not the time to go through the age-old procurement cycles for compute and storage hardware and software licenses.
When faced with a situation like this, two questions arise. Can one magically do more with less? Is the cloud the magic wand that solves all scalability challenges? Harnessing and analyzing the data from existing HPC server farms can provide deep insights that answer both questions. The analysis yields a view of the workload characteristics of EDA processes across all phases, from front-end runs all the way to tapeout.
Doing More With Less
TCS’ analysis of data from several EDA server farms has provided critical insights into the efficiency of EDA job management and the sufficiency of existing compute resources. These insights reveal that, absent advanced analytics on EDA operations and job management, operations teams typically see capacity augmentation as the only option. In reality, the lack of automated queue management software may mislead teams into believing that the server farms are optimally utilized. For instance, measuring the utilization of servers as a whole, as opposed to in available core hours, may lead to the conclusion that EDA servers are always busy. Imagine reserving a multicore server for a single-threaded job: measured at the server level, the resource appears fully utilized, but when utilization is measured granularly in core hours and analyzed, a different picture emerges.
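The gap between the two views can be made concrete with a small back-of-the-envelope calculation. This is an illustrative sketch with hypothetical numbers (server size, job duration, and core count are assumptions, not figures from the analysis):

```python
# A 32-core server reserved for 10 hours by a single-threaded job.
CORES_PER_SERVER = 32   # assumed server size
WALL_HOURS = 10         # assumed reservation window
CORES_USED = 1          # a single-threaded EDA job

# Server-level view: the server was reserved (hence "busy") the whole time.
server_utilization = 1.0  # 100%

# Core-hour view: only 1 of 32 cores did useful work.
available_core_hours = CORES_PER_SERVER * WALL_HOURS  # 320 core-hours
consumed_core_hours = CORES_USED * WALL_HOURS         # 10 core-hours
core_hour_utilization = consumed_core_hours / available_core_hours

print(f"Server-level utilization: {server_utilization:.0%}")     # 100%
print(f"Core-hour utilization:    {core_hour_utilization:.1%}")  # 3.1%
```

The same reservation reads as 100% busy at the server level but only about 3% utilized in core hours, which is why the granular measure can surface spare capacity that the coarse one hides.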
Workload characterization is critical to identifying the right kind of resources to allocate for optimal utilization. For example, certain PDV regression workloads may be better suited to instances with fewer cores, as these workloads do not scale well across multiple cores or threads.
Is The Cloud The Magic Wand That Solves All Scalability Challenges?
There have been debates in the past on whether semiconductor companies could benefit from the cloud, but there is now a prevailing belief across the industry that moving to the cloud may be a must for handling EDA workloads.
NetApp and TCS offer an EDA transformation software framework that moves EDA teams from infrastructure scarcity to abundance, along with a set of transformational strategies that can lay the path to a modern, digitized EDA environment. Historical server-farm data, available as cluster logs, is a critical source of truth; mined and analyzed in context, it can reveal several opportunities for improvement based on workload characteristics. However, companies struggle to create a foundational platform that continuously collects and analyzes this data without human intervention.
TCS NeurEDA™ Advisor offers a reference architecture for an event-driven EDA analytics platform, on which ML-assisted server-farm analytics models can be developed. The objective of these models is to provide actionable insights by analyzing current utilization, server-farm efficiency, barriers to developer productivity, and so on. These insights are extracted from existing infrastructure, scheduler, and tool logs.
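The first step such a platform performs is aggregating raw scheduler accounting records into utilization metrics. The sketch below is a minimal, hypothetical illustration; the record layout is invented for clarity, and real schedulers (LSF, Slurm, Grid Engine) each have their own accounting formats that a production pipeline would parse instead:

```python
from collections import defaultdict

# Hypothetical, simplified accounting records extracted from scheduler logs:
# (queue, cores_requested, runtime_hours)
records = [
    ("regression", 4, 2.5),
    ("regression", 1, 8.0),
    ("pnr", 16, 12.0),
]

# Aggregate consumed core hours per queue -- the granular currency in which
# utilization should be measured.
core_hours = defaultdict(float)
for queue, cores, hours in records:
    core_hours[queue] += cores * hours

for queue, total in sorted(core_hours.items()):
    print(f"{queue}: {total:.1f} core-hours")
# pnr: 192.0 core-hours
# regression: 18.0 core-hours
```

In a continuously running platform, this aggregation would be triggered by log events rather than run as a batch script, and the resulting per-queue core-hour series would feed the analytics models described above.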
The next blog will discuss how the parameters extracted through this framework help perform a cloud candidature assessment to arrive at the best cloud strategy.