Jeremy Birch
The problem also varies by tool type. For simulators and other analysis tools there are standard inputs, commands, and a knowable golden result. For synthesis, place and route, and other heuristic tools, a small difference in the input data can change the result greatly, the tool invocation varies widely, and there is generally no knowable golden result. For these types of tool it is harder to know which one is best for your type of design, and it is also not easy to identify metrics by which to judge one tool better or worse than another. For instance, we might assume that all P&R results need to be DRC and LVS clean (although the benchmark data may not allow this to be achieved), but what do you measure after that? A result with more wire and more vias might be bad (longer delays) or might be good (higher yield, lower coupling, etc.). Determining the best result then depends on analysis tools, which may themselves differ in their analysis.
Tools which excel at large square designs with lots of metal layers may be truly awful at long thin designs with 2 or 3 metal layers, and vice versa. Producing meaningful benchmarks that help a wide variety of customers would be pretty tricky.