Robert Veltman and Vikash Tyagi of SanDisk Corporation presented at SNUG a few weeks ago on their selection and use of RTDA’s NetworkComputer to manage their workflows.
Like everyone else, SanDisk has a high-performance computing farm (and like everyone else they are coy about how big it is) and lots of licenses for EDA tools, simulation in particular. You probably know that EDA tools use FlexLM to keep track of license use. A load balancer has to direct the workload to suitable execution hosts based on both hardware resource availability and license availability.
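To make the two-sided check concrete, here is a minimal Python sketch of a dispatcher that places a job only when both a host and the required licenses are free. The `Host`, `Job` and `pick_host` names and the `sim` license feature are hypothetical, chosen for illustration; this is not how NetworkComputer or FlexLM is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    free_cores: int
    free_mem_gb: int


@dataclass
class Job:
    name: str
    cores: int
    mem_gb: int
    licenses: dict = field(default_factory=dict)  # feature name -> count needed


def pick_host(job, hosts, free_licenses):
    """Return the first host that fits the job, or None if the job must wait."""
    # Check licenses first: no host can help if a feature is exhausted.
    for feature, count in job.licenses.items():
        if free_licenses.get(feature, 0) < count:
            return None
    # Then look for a host with enough free cores and memory.
    for host in hosts:
        if host.free_cores >= job.cores and host.free_mem_gb >= job.mem_gb:
            return host
    return None


hosts = [Host("h1", 2, 16), Host("h2", 8, 64)]
free_licenses = {"sim": 1}
job = Job("regress_01", cores=4, mem_gb=32, licenses={"sim": 1})

chosen = pick_host(job, hosts, free_licenses)
print(chosen.name if chosen else "queued")
```

A job that fits on a host but finds no free license stays queued, which is exactly the case a hardware-only scheduler gets wrong.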
Several problems can occur. The first is license under-utilization: if there are per-user limits and the number of active users drops, licenses go unused. Users will also bypass the load balancer if the submit-to-execute delay is too long, for example longer than the job's own runtime. And licenses cannot be shared among remote sites.
SanDisk’s requirements for a load balancer were:
- can manage the hardware and software resources
- pre-emption capability (stop a current job and run another)
- high-performance job scheduling (very short submit to execute delay when resources are available)
- flexible in adapting to different organization and business models
- global deployment with central software resource management
The two well-known load balancer tools are SGE (from Oracle) and LSF (these days owned by IBM), but both failed to meet the high-performance job scheduling needs, were not integrated with the FlexLM license manager, and were hard to deploy globally. SanDisk evaluated RTDA’s NetworkComputer (NC), which met all the requirements, and decided to use it.
So what has been their experience with NC?
Pre-emption is the ability to suspend a workload in order to free up hardware and software resources for another workload. In particular, the licenses are taken back from the pre-empted workload; when it is eventually resumed, it has to re-acquire them and carry on as if the pre-emption had never happened.
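The suspend, release, re-acquire cycle can be sketched as a small state machine. This is only an illustration under simplifying assumptions (a single license feature, and a `progress` counter standing in for the job's saved state); the `LicensePool` and `Workload` classes are hypothetical, not NC's API.

```python
class LicensePool:
    """Counts free licenses for a single feature."""

    def __init__(self, total):
        self.free = total

    def acquire(self, n):
        if n > self.free:
            return False
        self.free -= n
        return True

    def release(self, n):
        self.free += n


class Workload:
    def __init__(self, name, licenses_needed):
        self.name = name
        self.licenses_needed = licenses_needed
        self.state = "queued"
        self.progress = 0  # stands in for the job's saved state

    def start(self, pool):
        """Start or resume; either way the licenses must be acquired."""
        if self.state in ("queued", "suspended") and pool.acquire(self.licenses_needed):
            self.state = "running"
            return True
        return False

    def suspend(self, pool):
        """Pre-empt: hand the licenses back but keep the progress."""
        if self.state == "running":
            pool.release(self.licenses_needed)
            self.state = "suspended"

    def finish(self, pool):
        if self.state == "running":
            pool.release(self.licenses_needed)
            self.state = "done"


pool = LicensePool(2)
low, high = Workload("low", 2), Workload("high", 2)

low.start(pool)
low.progress = 40
blocked = high.start(pool)      # False: no licenses left
low.suspend(pool)               # pre-empt the low-priority job
high.start(pool)                # now succeeds
high.finish(pool)
low.start(pool)                 # resumes with progress intact
print(blocked, low.state, low.progress)
```

The point of the sketch is that resuming is just another license acquisition: if the licenses are not free, the suspended workload keeps waiting.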
Fairshare is used for license sharing among users when there is contention. Fast-fairshare uses pre-emption to balance loads immediately: a single user may consume all available licenses, but the load is rebalanced among multiple users as soon as that is required. With no per-user limit, this encourages users to submit workloads sooner rather than later so that they benefit from any slack periods.
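One way to picture this behavior is a max-min fair split followed by a pre-emption step. The sketch below is an assumption about how such a policy could work, not a description of NC's actual algorithm; `fair_targets` and `preemptions` are hypothetical helper names.

```python
def fair_targets(total, demand):
    """Max-min fair split of `total` licenses over per-user demand."""
    targets = {user: 0 for user in demand}
    remaining = total
    active = {u for u, d in demand.items() if d > 0}
    while remaining > 0 and active:
        # Offer each unsatisfied user an equal share of what is left.
        share = max(remaining // len(active), 1)
        for user in sorted(active):
            give = min(share, demand[user] - targets[user], remaining)
            targets[user] += give
            remaining -= give
        active = {u for u in active if targets[u] < demand[u]}
    return targets


def preemptions(running, targets):
    """Licenses to take back from each user running above target."""
    return {
        user: count - targets.get(user, 0)
        for user, count in running.items()
        if count > targets.get(user, 0)
    }


# Alone, alice may use the whole pool.
alone = fair_targets(10, {"alice": 10})

# When bob submits work, the targets rebalance and alice is pre-empted.
shared = fair_targets(10, {"alice": 10, "bob": 6})
to_suspend = preemptions({"alice": 10, "bob": 0}, shared)
print(alone, shared, to_suspend)
```

Without pre-emption, bob would wait for alice's jobs to finish; with it, the excess is reclaimed at once.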
Fast-fairshare is implemented to allocate any excess licenses to sites where it is day (so the users are presumably around) rather than night (where presumably the workload is all queued up and new jobs are unlikely to arrive before morning).
Global license sharing is supported by NC. Each site gets a minimum license allocation and unused licenses go to the site with the highest demand. Pre-emption forces minimum allocation when required. There are some minor gotchas: sites with insufficient hardware can’t benefit from surplus licenses, there is a limit on the number of sites, and so on.
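The allocation rule described here (a guaranteed minimum per site, with the surplus going to the highest demand) can be sketched as follows. The site names, numbers and the `allocate_sites` function are made up for illustration, and the sketch assumes the minimums sum to no more than the total; it ignores the hardware and site-count limits mentioned above.

```python
def allocate_sites(total, minimum, demand):
    """Give each site up to its guaranteed minimum, then hand out the surplus."""
    # A site never holds more than it currently demands.
    alloc = {site: min(minimum[site], demand[site]) for site in minimum}
    surplus = total - sum(alloc.values())
    # Surplus goes to sites in order of highest unmet demand.
    by_unmet = sorted(alloc, key=lambda s: demand[s] - alloc[s], reverse=True)
    for site in by_unmet:
        extra = min(surplus, demand[site] - alloc[site])
        alloc[site] += extra
        surplus -= extra
    return alloc


minimum = {"A": 30, "B": 30, "C": 20}

# Site B is quiet, so site A borrows the unused licenses.
quiet = allocate_sites(100, minimum, {"A": 80, "B": 10, "C": 25})

# Site B's demand returns: A must shrink so that B gets its minimum back.
busy = allocate_sites(100, minimum, {"A": 80, "B": 30, "C": 25})
print(quiet, busy)
```

The drop in site A's allocation between the two calls is where pre-emption comes in: the borrowed licenses have to be taken back from running jobs to honor site B's minimum.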
The results have been good. License utilization has increased since deploying NC. Scheduling is very fast, which eliminates the need for out-of-queue jobs. There is an efficient way to share licenses among different sites. Workloads can be balanced instantly.
Bottom line: using NC’s built-in functions with some in-house developed software automation delivers corporate-wide advanced load balancing.
The SanDisk SNUG presentation and the accompanying white paper, both of which contain a lot more technical details, are here.