EDPS was last Thursday and Friday in Monterey. I think this is a conference that more people would benefit from attending. Unlike some other conferences, it is almost entirely focused on user problems rather than deep dives into topics of limited interest. Most of the presentations are more like survey papers, filling in gaps in areas of EDA and design methodology that you probably feel you ought to know more about but don't.
For example, Tom Spyrou gave an interesting perspective on parallel EDA. He is now at AMD, having spent most of his career in EDA companies, a fox turned gamekeeper if you like. So he gets to AMD and the first thing he notices is that all the multi-thread features he spent the previous few years implementing are actually turned off almost all the time. The reality of how EDA tools are run in a modern semiconductor company makes those features hard to take advantage of.
AMD, for example, has about 20,000 CPUs available in Sunnyvale. They are managed by LSF, and people are encouraged to allocate machines fairly across the different groups. A result of this is that requiring multiple machines simultaneously doesn't work well. Machines need to be used as they become available, and waiting for a whole cohort of machines is not effective. It is also hard to take advantage of the best machine available, rather than one that has precisely the resources requested.
So given these realities, what sort of parallel programming actually makes sense?
The simplest case is non-shared memory and coarse-grained parallelism with separate processes. If you can do this, then do so. DRC and library characterization fit this model.
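A minimal sketch of this share-nothing style, assuming a hypothetical stand-alone char_cell binary that characterizes one library cell per invocation; in a real farm each job would be submitted through LSF rather than forked locally, but the structure is the same:

```cpp
// Sketch: coarse-grained, share-nothing parallelism (one process per job).
// "char_cell" is a hypothetical stand-alone characterization binary.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Each cell is characterized by a completely independent process: no
    // shared memory, no ordering constraints, so each job can run on
    // whatever machine the farm hands out next.
    std::vector<std::string> cells = {"AND2_X1", "NAND2_X1", "DFF_X1"};

    std::vector<pid_t> kids;
    for (const auto& cell : cells) {
        pid_t pid = fork();
        if (pid == 0) {                       // child: replace with the worker
            execlp("char_cell", "char_cell", cell.c_str(), (char*)nullptr);
            std::perror("execlp");            // reached only if exec fails
            _exit(1);
        }
        kids.push_back(pid);
    }
    for (pid_t pid : kids) waitpid(pid, nullptr, 0);
    return 0;
}
```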
The next simplest case is when shared memory is required but access is almost all read-only. A good example of this is doing timing analysis for multiple corners. Most of the data is the netlist and timing arcs. The best way to handle this is to build up all the data and then fork off separate processes to do the corner analysis using copy-on-write. Since most of the pages are never written, most will be shared and the jobs will run without thrashing.
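Here is a rough sketch of that build-then-fork pattern on Linux; Design, load_design() and analyze_corner() are hypothetical stand-ins for the real data and work. Because the children only read the design data, the copy-on-write pages stay shared with the parent:

```cpp
// Sketch: multi-corner timing with copy-on-write sharing (POSIX fork).
// Design, load_design() and analyze_corner() are hypothetical stand-ins.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <string>
#include <vector>

struct Design {
    std::vector<int> arcs;   // stand-in for the netlist and timing arcs
};

Design load_design() {
    return Design{std::vector<int>(1 << 22, 1)};   // large, read-mostly data
}

void analyze_corner(const Design& d, const std::string& corner) {
    long long checked = 0;
    for (int a : d.arcs) checked += a;   // read-only traversal: pages stay shared
    std::printf("corner %s: %lld arcs checked\n", corner.c_str(), checked);
}

int main() {
    // Build all the shared data once, in the parent process.
    Design design = load_design();
    std::vector<std::string> corners = {"ss_125c", "ff_m40c", "tt_25c"};

    std::vector<pid_t> kids;
    for (const auto& corner : corners) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child: the pages holding 'design' are shared copy-on-write;
            // since the child never writes them, no physical copy is made.
            analyze_corner(design, corner);
            _exit(0);
        }
        kids.push_back(pid);
    }
    for (pid_t pid : kids) waitpid(pid, nullptr, 0);
    return 0;
}
```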
The next most complex case is when shared memory is needed for both reading and writing. Most applications are actually I/O-bound and don't, in fact, benefit from this, but some do: thermal analysis, for example, which is floating-point CPU-bound. But don't expect too much: 3X speedup on 8 CPUs is pretty much the state of the art.
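(By Amdahl's law, a 3X speedup on 8 CPUs corresponds to only about three quarters of the runtime being parallelizable.) Below is a toy Jacobi-style grid relaxation standing in for thermal analysis, just to show the partitioned read/write pattern: every thread reads the whole current grid but writes only its own block of rows of the next grid, so no locks are needed within a sweep.

```cpp
// Sketch: shared-memory read/write parallelism (Jacobi-style relaxation as a
// stand-in for thermal analysis). Threads read all of 'cur' but each writes
// only its own rows of 'nxt', so no locking is needed within one sweep.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int N = 512, ITERS = 100;
    const unsigned T = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> cur(N * N, 25.0), nxt(N * N, 25.0);
    for (int i = 0; i < N; ++i) cur[i] = nxt[i] = 100.0;   // hot top boundary

    // Average the four neighbours for rows [r0, r1) of the next grid.
    auto sweep = [&](int r0, int r1) {
        for (int r = std::max(r0, 1); r < std::min(r1, N - 1); ++r)
            for (int c = 1; c < N - 1; ++c)
                nxt[r * N + c] = 0.25 * (cur[(r - 1) * N + c] + cur[(r + 1) * N + c] +
                                         cur[r * N + c - 1] + cur[r * N + c + 1]);
    };

    for (int it = 0; it < ITERS; ++it) {
        std::vector<std::thread> pool;
        int rows = (N + T - 1) / T;
        for (unsigned t = 0; t < T; ++t)
            pool.emplace_back(sweep, t * rows, std::min<int>((t + 1) * rows, N));
        for (auto& th : pool) th.join();   // barrier between sweeps
        std::swap(cur, nxt);
    }
    std::printf("center temperature: %f\n", cur[(N / 2) * N + N / 2]);
    return 0;
}
```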
Finally, there is the possibility of using GPU hardware. Most tools can't actually take advantage of this, but in certain cases the algorithms can be mapped onto the GPU and get a massive speedup. It is obviously hard to code, though, and hard to manage, since it needs special hardware.
Another big issue is that tools are not independent. AMD has a scalable simulator that runs very fast on 4 CPUs provided the other 4 CPUs are not being used (presumably because the simulator needs all the shared cache). On multicore CPUs, how a tool behaves depends on what else is on the machine.
What about the cloud? This is not really in the mindset yet. Potentially there are some big advantages, not just in terms of scalability but in ease of sharing bugs with EDA vendors (which may be the biggest advantage).
Bottom line: in many cases it may not really be worth the investment to make the code parallel; the realities of how server farms are managed make all but the coarsest-grained parallelism a headache to manage.