ARM is a well-known semiconductor IP provider that often creates reference designs so that SoC companies have a starting point to work with. On the GPU side of its IP, ARM has an architecture called Mali, and a recent webinar hosted by Synopsys reviewed how the physical design area was minimized by using a combination of tools:
- Logic Synthesis – Design Compiler Graphical
- Place/Route – IC Compiler
Design Compiler Graphical promises front-end design engineers several advantages over the standard Design Compiler tool for logic synthesis: improved QoR (up to 10% higher clock frequency), congestion prediction and optimization, floorplan exploration, and physical guidance for IC Compiler that gives 1.5X faster placement.
Pierre-Alexandre Bou-Ach from ARM talked about how the Mali GPUs were designed and optimized for either smallest area or lowest power. The ARM Mali-T820 is a GPU optimized for smallest area. The Implementation Reference Methodology (iRM) for the Mali GPU is based on Synopsys tools and shows how to achieve a specific PPA (Power, Performance, Area) result.
A multitude of front-end and back-end factors affect the silicon area of a GPU.
For an area-centric design the strategy is to continuously track area using multiple metrics:
- Core area
- Die area
- Physical only cells area
- Hard macro area
- Memories area
- Combinational cells area
- Repeaters area
- Sequential standard cells area
- Standard cells area
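To make the idea of continuous tracking concrete, here is a minimal Python sketch that compares a few of these metrics between two implementation runs. It is not part of the iRM flow described in the webinar; the snapshot values (in square microns) and the `area_deltas` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical area snapshots (um^2) from two implementation runs.
# Metric names follow the list above; every number is invented.
baseline = {
    "Combinational cells area": 400000.0,
    "Repeaters area": 100000.0,
    "Sequential standard cells area": 250000.0,
    "Standard cells area": 750000.0,
    "Memories area": 200000.0,
}
candidate = {
    "Combinational cells area": 380000.0,
    "Repeaters area": 90000.0,
    "Sequential standard cells area": 250000.0,
    "Standard cells area": 720000.0,
    "Memories area": 200000.0,
}


def area_deltas(before, after):
    """Return the percentage change of each metric relative to `before`."""
    return {
        name: 100.0 * (after[name] - before[name]) / before[name]
        for name in before
    }


deltas = area_deltas(baseline, candidate)

# Print the metrics that shrank the most first.
for name, pct in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:32s} {pct:+6.2f}%")
```

Tracking several metrics side by side like this makes it easier to see whether a saving came from logic, repeaters, or memories rather than only watching a single total.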
An area Pareto chart shows that the largest area contribution was coming from the combinational cells without repeaters. The grey line is the cumulative area contribution.
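The data behind such a Pareto chart can be sketched in a few lines of Python: sort the categories from largest to smallest and accumulate their share of the total. The category areas below are hypothetical placeholders, not the values shown in the webinar.

```python
def pareto(areas):
    """Return (name, share %, cumulative %) rows, largest contributor first."""
    total = sum(areas.values())
    rows = []
    running = 0.0
    for name, area in sorted(areas.items(), key=lambda kv: kv[1], reverse=True):
        running += area
        rows.append((name, 100.0 * area / total, 100.0 * running / total))
    return rows


# Hypothetical area per category, in arbitrary units.
areas = {
    "Memories": 200.0,
    "Combinational (no repeaters)": 400.0,
    "Physical only": 50.0,
    "Sequential": 250.0,
    "Repeaters": 100.0,
}

rows = pareto(areas)
for name, share, cumulative in rows:
    print(f"{name:30s} {share:5.1f}%  cumulative {cumulative:5.1f}%")
```

The cumulative column plays the role of the grey line: it shows how few categories account for most of the area.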
An analysis of area by design hierarchy was performed so that any change to the RTL could be directly related to an area impact, and the biggest modules were identified during the earliest stages of development. The placement of blocks within the hierarchy was studied to understand how to minimize repeater insertion. The IC Compiler tool helps in area reduction by reporting why any new cells are being inserted. For the shader core, for example, the new cells were added to fix hold time violations.
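As an illustration of area analysis by hierarchy, the following Python sketch rolls up per-cell area to a chosen hierarchy depth. The instance paths, the areas, and the `area_by_hierarchy` helper are all hypothetical; a real flow would take this data from the implementation tool's own area reports.

```python
from collections import defaultdict


def area_by_hierarchy(cell_areas, depth):
    """Sum leaf-cell area per hierarchical module, truncated to `depth` levels.

    `cell_areas` maps a '/'-separated instance path to the cell's area.
    Cells that sit directly at the top level are grouped under '<top>'.
    """
    totals = defaultdict(float)
    for path, area in cell_areas.items():
        parts = path.split("/")
        # The last element is the leaf cell; everything before it is hierarchy.
        module = "/".join(parts[:-1][:depth]) or "<top>"
        totals[module] += area
    return dict(totals)


# Hypothetical instance paths and areas.
cells = {
    "shader_core/alu/u1": 2.0,
    "shader_core/alu/u2": 3.0,
    "shader_core/lsu/u3": 1.5,
    "l2/tag/u4": 4.0,
    "u5": 0.5,
}

top_level = area_by_hierarchy(cells, depth=1)
second_level = area_by_hierarchy(cells, depth=2)

for module, area in sorted(top_level.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{module:12s} {area:5.1f}")
```

Running the same roll-up before and after an RTL change gives a per-module area delta, which is the kind of direct link between RTL edits and area impact described above.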
Some best practices in the iRM flow when using the 28HPM process node:
- Apply dont_use constraints on high drive repeaters and complex cells
- Use memories from the ARM compiler
- Manage the cell density with placer_max_cell_density_threshold 0.80
- Design Compiler Graphical
  - Use the SPG flow
  - Try hierarchy reduction and flattening
  - Increase area priority
  - Set a realistic clock latency
  - Use area recovery
- IC Compiler
  - Control repeater insertion during placement
  - Refine path group control
  - Enable area recovery
  - Apply layer optimizations
Using multibit registers (2-bit and 4-bit cells) versus no multibit registers showed savings of up to 32% with a standard cell implementation. Using ultra high density (UHD) memories where appropriate in the shader core provided a 25.46% area reduction of the memory, while using UHD memories on the top-level L2 gave a 16.37% area reduction. The total area reduction from using UHD memories was 4.57% for the shader core and 6.87% for the top-level L2.
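The UHD memory figures above can be cross-checked with a little arithmetic. If we assume that only the memory area changed and that both percentages are relative to the pre-swap areas (an assumption, since the webinar summary does not spell this out), then the total reduction equals the memory's share of the block multiplied by the memory reduction. The sketch below inverts that relation; `implied_memory_share` is a hypothetical helper name.

```python
def implied_memory_share(memory_reduction_pct, total_reduction_pct):
    """Fraction of block area taken by memories before the swap.

    Assumes total_reduction = memory_share * memory_reduction, i.e. only
    the memories changed and both percentages use pre-swap areas.
    """
    return total_reduction_pct / memory_reduction_pct


# Figures quoted in the article.
shader_core_share = implied_memory_share(25.46, 4.57)
l2_share = implied_memory_share(16.37, 6.87)

print(f"Shader core: memories are about {shader_core_share:.1%} of the area")
print(f"Top-level L2: memories are about {l2_share:.1%} of the area")
```

Under these assumptions the memories would make up roughly 18% of the shader core and roughly 42% of the top-level L2, which is consistent with the L2 seeing the larger total saving despite the smaller per-memory reduction.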
Adding up all of the optimizations, the Mali-T820 GPU team was able to achieve >4% area savings across the total cell area, while at the same time leakage power was reduced by >4%.
ARM has created an iRM flow that provides a reference Mali-T820 design for minimum area when using the Synopsys tools for logic synthesis and place/route. Watch the entire 25-minute archived webinar online here.