Reducing Compile Time in Emulation. Innovation in Verification

by Bernard Murphy on 03-25-2021 at 6:00 am

Is there a way to reduce cycle time in mapping large SoCs to an FPGA-based emulator? Paul Cunningham (GM, Verification at Cadence), Jim Hogan (RIP) and I continue our series on research ideas. As always, feedback welcome.


The Innovation

This month’s pick is “Improving FPGA-Based Logic Emulation Systems through Machine Learning.” The paper was published in ACM Transactions on Design Automation of Electronic Systems in 2020. The authors are from Georgia Tech and Synopsys.

FPGA-based emulation, emulating a large SoC design through an array of FPGAs, is one way to model an SoC accurately yet fast enough to run significant software loads. But there’s a challenge: compiling a large design onto said array of FPGAs is not easy. A multi-billion-gate SoC must map onto hundreds of large FPGAs (300+ in some of the authors’ test cases), through a complex partitioning algorithm followed by multiple place and route (P&R) trials on “hard” partitions. P&R runs can be parallelized, but each still takes many hours. If any run fails, you must start over with a new partitioning or new P&R trials.

Because designers use emulation to optimize cycle time through chip verification and debug, it is critical to minimize compile wall-clock time within reasonable compute resources. Figuring out the best partitioning and best P&R strategies requires design know-how and experience from previous designs, which makes this problem an appealing candidate for machine learning (ML) methods. The authors use ML to predict whether a P&R job will be easy or hard, and use this prediction to apply different P&R strategies. They also use ML to estimate the best resourcing to optimize throughput and to fine-tune partitioning. They’re able to show improvements in both total compute and wall-clock time with their methods.
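To make the "predict easy or hard, then pick a strategy" idea concrete, here is a minimal sketch in Python. It is not the authors' model: the nearest-centroid classifier, the three features, the training values and the strategy settings are all hypothetical, chosen only to show how a difficulty prediction could steer P&R resourcing.

```python
from statistics import mean

# Hypothetical per-partition features: (LUT utilization, wires per LUT, clock count).
# All values are invented for illustration, not taken from the paper.
TRAINING = [
    ((0.45, 2.1, 2), "easy"),
    ((0.50, 2.4, 3), "easy"),
    ((0.55, 2.0, 1), "easy"),
    ((0.82, 3.6, 8), "hard"),
    ((0.78, 3.9, 6), "hard"),
    ((0.88, 3.3, 9), "hard"),
]

# Hypothetical strategies: spend more parallel trials and effort on hard jobs.
STRATEGY = {
    "easy": {"parallel_trials": 1, "effort": "default"},
    "hard": {"parallel_trials": 4, "effort": "high"},
}


def fit(samples):
    """Scale each feature by its training maximum, then average per label."""
    columns = list(zip(*(features for features, _ in samples)))
    scale = tuple(max(col) for col in columns)
    grouped = {}
    for features, label in samples:
        scaled = tuple(v / s for v, s in zip(features, scale))
        grouped.setdefault(label, []).append(scaled)
    centroids = {
        label: tuple(mean(col) for col in zip(*rows))
        for label, rows in grouped.items()
    }
    return scale, centroids


def predict(model, features):
    """Return the label whose centroid is closest to the scaled features."""
    scale, centroids = model
    scaled = tuple(v / s for v, s in zip(features, scale))

    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(scaled, centroid))

    return min(centroids, key=lambda label: dist2(centroids[label]))


def plan_pnr(model, features):
    """Predict difficulty for one partition and look up its P&R strategy."""
    label = predict(model, features)
    return label, STRATEGY[label]


model = fit(TRAINING)
print(plan_pnr(model, (0.48, 2.2, 2)))
print(plan_pnr(model, (0.85, 3.7, 7)))
```

A real flow would train on P&R outcomes from earlier compiles and would likely use a stronger classifier; the point here is only the shape of the decision, where the prediction is made before any P&R run is launched.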

Paul’s view

Optimizing throughput in emulation is very relevant today as we continue to chase exponential growth in verification complexity. For typical emulation usage, debug cycle time is critical, so compile wall-clock time is very important. The authors have shown a 20% reduction in P&R wall-clock time, which is very good. This paper is very well written, presenting some strong results based on using ML to predict if a P&R job will be “easy” or “hard” and then using these predictions to optimize partitioning and determine P&R strategies.

As with any ML system, the input feature set chosen is critical. The authors have some great insights here, in particular the use of Lloyd Shapley’s Nobel Prize-winning techniques in game theory for feature importance weighting. One thought I have on possible further improvements would be to consider some local measures of P&R difficulty in their feature set: the features listed all appear to be global measures such as the number of LUTs, wires and clocks. However, a local hotspot on a small subset of a partition can still make P&R difficult, even if these global metrics for the overall partition look easy.
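For readers unfamiliar with Shapley values, the sketch below computes them exactly for a toy case by averaging each feature's marginal contribution over every ordering of the features. The linear "difficulty" score, its weights, the baseline and the partition values are hypothetical stand-ins; the paper's actual model and feature set are not reproduced here.

```python
from itertools import permutations
from math import factorial

# Hypothetical linear difficulty score over three normalized global features.
WEIGHTS = {"luts": 0.5, "wires": 0.3, "clocks": 0.2}
BASELINE = {"luts": 0.6, "wires": 0.5, "clocks": 0.25}   # assumed "average" partition
PARTITION = {"luts": 0.9, "wires": 0.7, "clocks": 0.25}  # assumed partition to explain


def difficulty(known):
    """Score with features in `known` at their real values, others at baseline."""
    return sum(
        WEIGHTS[name] * (PARTITION[name] if name in known else BASELINE[name])
        for name in WEIGHTS
    )


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {name: 0.0 for name in features}
    for order in permutations(features):
        included = set()
        before = value_fn(frozenset(included))
        for name in order:
            included.add(name)
            after = value_fn(frozenset(included))
            totals[name] += after - before
            before = after
    n_orders = factorial(len(features))
    return {name: total / n_orders for name, total in totals.items()}


phi = shapley_values(difficulty, list(WEIGHTS))
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

For a linear score like this one, each feature's Shapley value reduces to its weight times its deviation from the baseline, and the values sum to the gap between the partition's score and the baseline score. Exact enumeration grows factorially with the number of features, so practical tools rely on approximations.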

The paper builds up to a strong headline result: wall-clock time for the overall P&R stage of emulation compile drops from 15 hours to 12 hours, the 20% reduction noted above. Nice.

Jim’s view

Jim, an inspiration to many of us, passed away while we were working on this blog. We miss you dearly, Jim. Wherever you’re watching us from, we hope we’ve correctly captured what you shared with us live on this paper:

This is some impressive progress by Synopsys on FPGA-based emulation compile times. If it was coming from a startup then for sure I’d invest. “Outside” ML to drive smarter P&R strategies makes total sense, not only for emulation, but also ASIC implementation.

My view

I agree with Jim. This method should also be applicable to other forms of implementation: FPGA prototyping as well as FPGA emulation, ASIC implementation, and even the custom emulation processors that Cadence and Mentor have. It should apply equally to prototyping large designs that will go to production in FPGA implementations; I’m thinking of large infrastructure basebands and switches, for example. I’m also tickled that while ML more readily finds a home in implementation rather than verification, here verification builds on that strength in implementation!

We know Jim would want us to continue this blog, as do Paul and I. We’re working to find a new partner to join us for next month. Stay tuned!

 
