Cloud FPGA Optimal Design Closure, Synthesis, and Timing Using Plunify’s AI Strategies
by Camille Kokozaki on 09-25-2018 at 12:00 pm

Plunify delivers cloud-based solutions and optimization software, powered by machine learning, to improve quality of results, productivity and efficiency for FPGA design. Plunify is a software company in the electronic design market with a focus on FPGA. Founded in 2009 and headquartered in Singapore, it is a privately funded company focused on applying machine learning algorithms to FPGA timing and area optimization problems, helping designers achieve better optimized, more efficient designs. Plunify has offices in Singapore, Malaysia, China and Japan, and a sales representation network covering all major markets. Plunify is a member of the Xilinx Alliance Program and the Intel EDA partner program.

What Plunify solves
Complex FPGA designs require highly iterative flows, going from static timing analysis back to RTL code modification to achieve timing closure or the desired results. FPGA designers have long used this traditional approach, consuming a lot of expensive engineering time in the process. FPGA vendor tools like Intel Quartus II and Xilinx Vivado/ISE provide the standard tool flow, and the engineer's time is then spent on (re-)writing RTL source code and constraints to achieve the target results. Plunify saw an opportunity to reduce redundancy and extract the maximum level of optimization from the existing workflow by fully utilizing the optimization directives built into the FPGA tools.

By coupling the emergence of cloud computing with its own machine learning algorithms, Plunify lets designers achieve, or get close to, timing closure in a shorter amount of time. Design cycles complete faster, significantly accelerating the process of getting complex products to market.

Products and Services
InTime is machine learning software that optimizes FPGA timing and performance. FPGA tools such as Vivado, Quartus and ISE provide massive optimization benefits with the right settings and techniques. InTime uses machine learning and built-in intelligence to identify such optimized strategies for synthesis and place-and-route. It actively learns from results to improve over time, extracting more than a 50% increase in design performance from the FPGA tools.

Plunify Cloud is a platform that simplifies the technical and security management aspects of accessing a cloud infrastructure. It enables any FPGA designer to compile and optimize FPGA applications on the cloud without having to be an IT expert. This is achieved with a suite of cloud-enabled tools, such as the FPGA Expansion Pack and AI Lab, that provide easy access to the cloud without the complexity of infrastructure and software configuration and maintenance.

Timeline

  • 2009/2010: First lines of code were written
  • 2011: Plunify receives seed funding from SPRING Singapore
  • 2012: Private cloud prototype, EDAxtend, is released
  • 2013: InTime development begins
  • 2014: InTime is released; first major customer of InTime; official Altera EDA partner
  • 2015: Official Xilinx Alliance partner; adds Macnica and Avnet as representatives in Asia
  • 2016: Investment from KMP and Lanza TechVentures
  • 2017: Released Plunify Cloud platform; AWS selected partner in Asia for F1 instances
  • 2018: Partnership with Xilinx to enable FPGA development in the cloud; released FPGA Expansion Pack and AI Lab

Plunify Costs for Cloud Usage
Plunify Cloud is a platform that simplifies the technical and security management aspects of accessing a cloud infrastructure for FPGA designers. Free client tools are provided for FPGA designers to access the cloud. The platform uses a pre-paid credits system.
Pre-paid credits cost $0.084 each and cover:
1. Cloud servers
2. Cloud data storage
3. Cloud bandwidth
4. Tool licenses, e.g. Vivado, InTime
5. Free use of client tools
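As a hypothetical illustration of how credits map to cost: a compile job that consumes 500 credits would cost 500 × $0.084 = $42, covering the server time, storage, bandwidth and tool licenses used by that run.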

Cloud Enabled Tools

FPGA Expansion Pack

Expansion Pack Webpage

AI Lab

AI Lab provides greater access to Vivado and other Xilinx tools, with no restrictions on the OS: Vivado HLx can run on Macs and even Chromebooks. It converts IT spending into an operating expenditure, eliminating capital expenditure and on-premise maintenance, and allows more accurate forecasts and scaling based on actual demand.

AI Lab Webpage

InTime is machine learning software that optimizes FPGA timing and performance. InTime does this by identifying optimized strategies for synthesis and place-and-route. It actively learns from multiple build results to improve over time, extracting up to a 50% increase in design performance from the FPGA tools.

There are usually three ways to optimize timing and performance:

  • Optimizing RTL and constraints, which requires experience, is risky at a late stage, and can introduce bugs and re-verification delays
  • Optimizing synthesis and place-and-route settings
  • Using faster device types, which impacts cost and design

It makes sense to use all three for improved overall performance.

InTime works by:

  • Generating strategies based on its results database
  • Implementing all strategies and running them in parallel
  • Using machine learning to analyze results
  • Updating the database with new knowledge

Strategies are combinations of synthesis and place-&-route settings, including placement constraints.
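As a purely conceptual sketch of such a learn-and-build loop (this is illustrative Python, not Plunify's actual code or API; the setting names, the fake scoring and the random search below are hypothetical stand-ins for InTime's machine-learning-guided strategy generation):

```python
import random

# Hypothetical synthesis / place-and-route settings that a "strategy" can vary;
# real strategies combine vendor tool settings and placement constraints.
SETTING_SPACE = {
    "synth_retiming":     ["on", "off"],
    "place_effort":       ["normal", "high"],
    "phys_opt_directive": ["Default", "AggressiveExplore"],
}

def generate_strategies(n=8):
    """Propose n candidate strategies (random here; ML-guided in a real flow)."""
    return [{name: random.choice(choices) for name, choices in SETTING_SPACE.items()}
            for _ in range(n)]

def run_build(strategy):
    """Stand-in for a real synthesis + place-and-route run.

    A real flow would launch the vendor tools (locally or on cloud servers)
    and parse the worst negative slack (WNS) from the timing report.
    """
    return -(abs(hash(frozenset(strategy.items()))) % 100) / 100.0  # fake WNS in ns

results_db = []  # every build result is kept; the learning model trains on this
for strategy in generate_strategies():
    results_db.append((strategy, run_build(strategy)))

best_strategy, best_wns = max(results_db, key=lambda r: r[1])
print("Best strategy so far:", best_strategy, "WNS:", best_wns)
```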

InTime's machine learning algorithms provide predictable improvements by finding optimal groups of settings, since the default settings are rarely adequate.

Large deployments in which multiple designs are analyzed have an added advantage: the extra data in the database improves the machine learning inferences.

Summary Overview of the Tools

Resources:


Retro-uC: on a road to low-cost, (ridiculously) low-volume ASICs ?
by Staf_Verhaegen on 09-25-2018 at 7:00 am

I'm a long-time reader of SemiWiki, almost from the start. I've sometimes been a passionate commenter, but as some of you may have noticed, my activity on the forum has been lower lately. One of the reasons is a project I am working on, and I feel honoured to have been invited to present the background here on SemiWiki.

I am currently working at imec in the ASIC design service group. In our department we also have a group providing IC manufacturing services. In our group we also design test chips, so we are in close contact with the IC manufacturing group. That is how I saw that startup costs for current nodes are high, but low for mature/older nodes. One of my hobbies is programming. I learned to use HP-UX at university and moved to Linux and into open source afterwards. One of the things that attracts me in open source is being part of a community where helping each other, rather than competing, is the norm. Chips4Makers is a project that grew out of these experiences; it wants to kickstart the open source community for custom chips. This should be possible by trading off features and speed for low startup costs, so that really low-volume production can be done. It is thus meant for niche markets without enough volume to justify the higher startup costs of more recent nodes.

The goal of Chips4Makers is a real paradigm shift for the microelectronics industry, and so the Retro-uC (Retro-microcontroller) pilot project was born. The purpose is to show the feasibility of low-cost and low-volume ASICs. I wanted something representative of the target market: a niche product with low volume that does not need the state-of-the-art features of current high-volume chips. Retrocomputing is in these days, as is the maker movement, where people build their own systems using things like an Arduino. The Retro-uC combines both use cases in one chip: a microcontroller using some of the venerable instruction sets, the Zilog Z80, MOS 6502 and Motorola 68000.

The original CPUs were developed at a time when the number of transistors that could be put on a chip was limited, so they do not take up much real estate even in a mature node. For the Retro-uC, TSMC 0.35um was chosen, and by using a multi-project wafer (MPW) service the price of the silicon can be kept below $10,000. Besides the silicon cost, design costs are also part of the startup costs. These include engineering, software license and IP license costs. If one had to pay the license fee for a place-and-route tool from one of the EDA vendors, it would cost more than the silicon for this project. I looked at existing open source tools, and although they currently cannot design a chip in a state-of-the-art technology, they can implement a low-complexity, low-speed chip in a mature node. For the RTL code I used existing open source implementations that were used in emulators of old computers, so they are both free and nearly bug-free. The hours I spend on the project myself are not accounted for; in good tradition I see this as hobby time, like most makers do with their own creations.


The Retro-uC crowdfunding campaign pledge levels

I launched the Retro-uC as a crowdfunding project on Crowd Supply. This allows me to test the market before making significant investments. The funding goal of the project is $22,000, which is the break-even point including all production costs as well as processing and handling fees. I hope I'm not the only person in the world who would like to see an open source silicon movement gain traction. If you want to support me, the first thing you can do is back one or more of the pledge levels. If you have suggestions or want to discuss things, don't hesitate to comment on this article or send me a private message. On the Chips4Makers blog you can also find more detailed information on the project.


Data Management for SoCs – Not Optional Anymore
by Alex Tan on 09-24-2018 at 12:00 pm

Design Management (DM) encompasses the business decisions, strategies and processes that enable product innovation. It is the foundation for both effective collaboration and gaining competitive advantage in the industry. This also applies in the high-tech space we are in: sound underlying data management for SoC designs is key to a successful silicon roll-out and its subsequent product support.

Design environment – growth promotes challenges
Looking back, analog circuits were a dominant part of IC products until around the mid-eighties, when the introduction of logic synthesis and CMOS technologies facilitated a design shift towards more digital circuits. With the recent slowdown in the chip performance race and an industry inflection towards emerging applications such as IoT, automotive and 5G, the design landscape is reversing again towards increased analog and mixed-signal content. The following charts show the demand trend in shipped IC products, indicating a richer analog/mixed-signal circuit mix.


Consequently, the infrastructure for SoC design implementation and its related support for mixed-signal/analog IP development has changed. Design environment heterogeneity is one of the primary challenges. It includes process technology (such as diverging process nodes used for analog blocks vs digital), IP (analog-centric and packaging on top of digital IP) and EDA tools (multiple vendors as well as internally developed solutions). We could also throw into the equation potential communication voids, as dispersed design teams attempt to drive the project forward under the constraints of geographical time differences and schedule crunch. This can lead to a lack of visibility into the needed design collateral and IP across the company, trapped in artificial silos.

A DM deployment brings efficiency into such an environment as it helps manage all aspects of the SoC design flow such as in dealing with multiple design processes and their associated collaterals (specification, RTL, schematic, layout, scripts, simulation results, etc). Designers do not need to take care of semantics related to file transfers (such as tar-ing or ftp-ing) across design sites. Moreover, DM is capable of finding differences between different versions of the text, schematics and layout.

Ownership, version control and security

An agile development environment can enhance the SoC build process. One primary challenge in a multi-site, multi-team project is the design data ownership issue. Clear ownership is key to smooth project execution, as deliverables and accountability can be measured, and proper resource load balancing cultivates cohesiveness across the project teams.

A DM solution can provide answers to frequently raised questions such as:
→ Who made these design changes? Who last checked out the file?
→ What happened to my last working edit? What is part of this release update?
→ How come the simulation is failing now when it was OK last night? How is it possible that my timing is off with the same library?
→ Why is it taking longer to compile now?

With more recent cloud-enabled EDA solutions, design data may be scattered not only across multiple platforms such as Windows or Linux but also on the cloud. Supporting such a diverse design environment with seamless access, while maintaining an adequate level of security and design data confidentiality, is quite a daunting task.

An integrated DM environment enables revision control, release management and user access tracking, simplifying both the tapeout checklist process and subsequent ECO (Engineering Change Order) steps.

Targeted archiving and disk space usage
Driven by increased functionality and continuous technology shifts, design specifications are constantly revised and may necessitate a migration to new code and tool solutions. Hence, tool selection in a company can change over several product generations, with each tool bringing its own collateral formats.

To address this, a DM should be capable of capturing only the necessary tool and design data. Selective capture of design metadata prevents wasted disk space and provides better disk utilization. In the end, a structured DM database enables taking design snapshots and allows labeling for easy reference. The data repository is backed up efficiently and design handoffs between teams are seamless.

ClioSoft and DM
ClioSoft SOS7 is a DM solution capable of delivering the previously described features. ClioSoft, the leading developer of SoC DM solutions with over 250 global customers, has architected SOS7 to meet the performance, security, scalability and optimal network storage requirements of the global design industry. On top of the outlined benefits, project management audit trails and certification conformance queries can be easily addressed through the SOS7 platform's integration with the customer's design environment.

Like the honeycomb cells of a beehive, the SOS7 integrated DM environment supports many forms of design development for analog, digital and mixed-signal SoCs. It also provides tight, socket-like integration with many EDA implementation and analysis tools, such as Cadence Virtuoso, Mentor Tanner and Synopsys Custom Designer.

In summary, DM drives methodology across the design organization, including IT, design teams, program management and application/support teams. A robust, efficient and yet user-friendly DM platform is essential in ensuring successful adoption by all design data stakeholders. ClioSoft SOS7 appears to fit those criteria well.

For further info on ClioSoft SOS7, please check HERE.

Also Read

Managing Your Ballooning Network Storage

HCM Is More Than Data Management

ClioSoft and SemiWiki Winning


Highly Modular, AI Specialized, DNA 100 IP Core Target IoT to ADAS
by Eric Esteve on 09-24-2018 at 7:00 am

The Cadence Tensilica DNA 100 DSP IP core is not a one-size-fits-all device. Rather, it is highly modular in order to support AI processing at the edge, scaling from 0.5 TMAC for on-device IoT up to tens or even 100 TMACs for autonomous vehicles (ADAS). If you remember the first talks about IoT and the cloud a couple of years ago, the IoT device was supposed to collect data at the edge and send it to the cloud, through a wireless network or the internet, where the data would be processed and the result sent back to the edge.

But this model turned out to waste energy (sending data back and forth through networks carries a high power consumption cost) and to compromise privacy (especially for consumer applications). The worst problem was probably the impact on latency: can we safely rely on an autonomous car if ADAS data processing depends on cloud access? On top of adding unacceptable latency, sending data through a network clearly depends on the existence of such a network (what about rural areas?).

Fortunately, the industry came back to a reasonable solution: data processing at the edge device! Supporting processing at the edge is not challenge-free: the SoC located at the edge must be as low-cost as possible, and it must be performance-efficient to keep power consumption low. Even if a standard GPU could do the job, an SoC integrating DSP IP is the best solution to meet these constraints.

Cadence is launching the Tensilica DNA 100 to support on-device AI processing in multiple applications. In fact, AI processing is penetrating many market segments and applications. In mobile, the consumer expects face detection and people recognition at video capture rates. On-device AI will support object detection, people recognition, gesture recognition and eye tracking in AR/VR headsets. Surveillance cameras will need on-device AI for family or stranger recognition and anomaly detection. In automotive, for ADAS and AV, on-device AI will be used to recognize pedestrians, cars, signs, lanes, driver alertness, etc.

But these various markets have different performance requirements for on-device AI inferencing. For IoT, 0.5 TMAC is expected to be enough, while for mobile the range is 0.5 to 2 TMACs. AR/VR, at 1 to 4 TMACs, is slightly higher, while smart surveillance needs 2 to 10 TMACs. The autonomous vehicle is clearly the most demanding application, as on-device AI inferencing requires from several tens up to 100 TMACs. The solution is to build a DSP core, the DNA 100 processor, that can be implemented as an array of cores, from one up to as many as the area and power targets allow.

If you look at the DNA 100 block diagram (above picture), you see that the core provides:

  • Bandwidth reduction, thanks to weight and activation compression
  • Compute reduction, as the MACs process non-zero operations only
  • Efficient convolution through a high MAC occupancy rate
  • Pooling, sigmoid, tanh and element-wise add/sub for non-convolution layers

Moreover, as the DNA 100 is programmable, it makes the SoC future-proof and extensible through the addition of custom layers.

Cadence claims the Tensilica DNA 100 processor's performance is up to 4.7 times better than the competition (a CEVA DSP?) thanks to sparse compute and high MAC utilization. The benchmark was run on ResNet-50, with the processor running at 1 GHz and processing 2550 frames per second. The Tensilica DNA 100 processor and the competition are both in a 4 TMAC physical array configuration, and the DNA 100 numbers are with network pruning, assuming 35% sparse weights and 60% sparse activations.

One figure is becoming more and more important as the industry realizes that performance can't be the only criterion: power efficiency. Cadence claims to be 2.3X better than the competition in terms of TMACs per watt, for a DNA 100 processor with network pruning in a 4 TMAC configuration in 16nm (the Tensilica DNA 100 delivers 3.4 TMAC per watt, while the competition only reaches 1.5 TMAC/W).
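(As a quick check of the claimed ratio: 3.4 TMAC/W divided by 1.5 TMAC/W is indeed roughly 2.3.)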

The above figure describes neural network mapping onto the DNA 100 processor. This directs us to look at software support: Cadence proposes the Tensilica Neural Network Compiler and supports the Android Neural Networks API. Dated September 13, 2018, this announcement from Facebook describes Glow, a community-driven approach to AI infrastructure: "Today we are announcing the next steps in Facebook's efforts to build a hardware ecosystem for machine learning through partner support of the Glow compiler. We're pleased to announce that Cadence, Esperanto, Intel, Marvell, and Qualcomm Technologies Inc. have committed to supporting Glow in future silicon products."

According to Cadence, it is "integrating Facebook's Glow, an open-source machine learning compiler based on LLVM (Low Level Virtual Machine), to enable a modular, robust and easily extensible approach".
Modular, robust and extensible, the DNA 100 processor is well positioned to support on-device AI inference, from IoT to the autonomous car.

Follow this link for more information about the DNA 100 DSPs.

By Eric Esteve from IPnest


More Negative Semiconductor News
by Robert Maire on 09-24-2018 at 7:00 am

The amount of negative news and information about the semiconductor industry seems to be increasing at a faster rate. Micron put up a better quarter than expected but, more importantly, guided lower than expected. We are surprised that the street is surprised, as the decline in memory pricing is well known and Micron has been clear about it. It seems like investors and analysts may not be paying attention, or are hoping that reality isn't true. Even if Micron's earnings get cut in half, it's still trading at a low valuation. Investors seem to be pricing in a downside disaster.

There is also a report in the news that Samsung will cut back on memory production in 2019 in order to prop up pricing in the face of slowing demand. Demand is still up, just not up as fast as expected. We have written several reports in the past about our "OMEC" (Organization of Memory Exporting Companies) idea, and Samsung is the Saudi Arabia of the memory industry. It does have the power to prop up pricing by adjusting supply. It may not be a bad thing for memory makers, though not so much for memory users.

The problem is that if Samsung is planning on cutting memory supply growth in 2019, it obviously is also going to cut chip equipment purchases even further, as there is no reason to buy equipment to increase supply further. This also belies the idea of a one-quarter downturn, for the September quarter, that some equipment companies and analysts had stated as fact.

Our recent checks indicate that the December quarter for chip equipment is weaker than suggested last quarter and perhaps even weaker for fabrication equipment than the recent “walk back” by KLAC.

We think that the December quarter for LRCX, AMAT and TEL will likely be down, perhaps another 10% sequentially rather than the “positive trajectory” Lam had called for on their last conference call, which would have supported September as the trough quarter.

We are more firmly of the view that September is not the trough and that there is probably further downside from there. At this point we think it's difficult to call the trough or bottom as the news flow continues to be negative.

LRCX back track?
We think that Lam will likely have to back track on their September trough comment. At best they may be able to pull business into December to make it look flatter, but we think things have deteriorated since their last update. We have been suggesting that the $150s have been a "trough" for the stock, but we could potentially break through that support level depending on the level of capitulation. The recent comments attributed to Samsung pulling back on supply add to the risk, as they may turn the September equipment delay into a cancellation.

Global Semiconductor Alliance Executive Forum
This past week, when we were in Silicon Valley, the GSA held its executive forum of "C level" types from the chip industry. We have heard from several people that the tone and outlook at the conference was much more muted.
It sounds like the consensus is for chip growth to slow from its prior 20% pace to a more leisurely 0% to single-digit rate.

We don't think this is as bad as some investors may react, as memory has been growing at an unsustainable pace and the expectation is still positive, rather than the historical cyclicality which goes more negative in down cycles.
Our sense is that most, if not all, of the cooling has been on the memory side of the industry, with foundry/logic being relatively fine.

Monterey Masks

This past week also saw the SPIE Photomask and EUV Lithography conference in Monterey. It sounds as if things are progressing very well on 7nm. We have heard that there are a lot of 7nm "tape outs" that will likely build a good backlog of leading-edge product. EUV continues to make progress towards a more HVM-like model, though it still has the well-known issues of resists, pellicles, etc.

Lasertec Mask Porn
We heard that Lasertec of Japan showed some racy EUV mask images at SPIE from their mask inspection tool. Mask inspection remains one of the aforementioned issues and it would appear that Lasertec continues to make progress on that front and has come out of “stealth mode” a bit by publicly demonstrating capabilities.

Seen on EBAY- “Two lightly used ASML 3400s, best offer, pick up only”
Now that GloFo has officially canceled their 7nm program, there are some idle, hardly used ASML 3400 EUV scanners collecting dust in Malta. There is also a 3300, an early tool, which is also turned off. It seems a shame to waste these tools and we assume they will find a new home elsewhere. Maybe someone in China will buy them to jump to the head of the line rather than wait for ASML to build a new tool.

This could potentially impact ASML’s delivery schedule or build schedule if they replace tools already in the build queue for another customer. This may reduce the EUV tool count in the near term.

Is Micron dropping out of EUV?
The 3 ASML tools at GloFo may not be the only EUV scanners available. We have heard that Micron may be dropping out of the EUV program and shutting down its ASML 3300. That would bring the number to 4 dead EUV tools.

This may further add to EUV questions but it shouldn’t. We are not surprised as we have never expected Micron to use EUV as memory makers just don’t need it and can’t cost justify it, not today anyway. It was likely a nice R&D program that Micron can shut down to save costs as memory pricing weakens. It just adds another tool to the used tool market. I wonder how much it costs to ship an EUV tool to China? Probably a lot as it takes a $10M crane just to load it into the fab.

The stocks
We don’t see a lot of positive news this past week that suggests a quick bounce back. We think the stocks remain under pressure and could see another down leg after reporting the September quarter. We don’t like being part of the “spear catching” competition in the market and continue to view the downside risk as much larger than the upside potential of most of the stocks.

We still think Micron is cheap and has gotten cheaper but don’t want to put new money to work fighting the tape. KLAC is probably the best defensive play in chip equipment especially after its correction.


Neural Network Efficiency with Embedded FPGA’s
by Tom Dillinger on 09-21-2018 at 12:00 pm

The traditional metrics for evaluating IP are performance, power, and area, commonly abbreviated as PPA. Viewed independently, PPA measures can be difficult to assess. As an example, design constraints that are purely based on performance, without concern for the associated power dissipation and circuit area, are increasingly rare. There is a related set of characteristics of importance, especially given the increasing integration of SoC circuitry associated with deep neural networks (DNN) – namely, the implementation energy and area efficiency, usually represented as a performance per watt measure and a performance per area measure.

The DNN implementation options commonly considered are: a software-programmed (general purpose) microprocessor core, a programmed graphics processing unit (GPU), a field-programmable gate array, and a hard-wired logic block. In 2002, Brodersen and Zhang from UC-Berkeley published a Technical Report that described the efficiency of these different options, targeting digital signal processing algorithms. [1]

The figure below (from a related ISSCC presentation) highlights the energy efficiency of various implementations, with a specific focus on multiprocessing/multi-core and DSP architectures that were emerging at that time:

More recently, Microsoft published an assessment of the efficiency of implementation options for the unique workloads driving the need for “configurable” cloud services. [2] The cloud may provide unique compute resources for accelerating specific workloads, such as executing highly parallel algorithms and/or processing streaming data inputs. In this case, an FPGA option is also highlighted – the relative merits of an FPGA implementation are evident.

The conclusion presented by Microsoft is “specialization with FPGA’s is critical to the future cloud”. (FPGA’s are included in every Azure server, with a unique communication network interface that enables FPGA-to-FPGA messaging without CPU intervention, as depicted above.)

Back to DNN applications, the Architecture, Circuits, and Compilers Group at Harvard University recently presented their "SMIV" design at the Hot Chips conference (link).

The purpose of this design tapeout was to provide hardware-based “PPA+E” metrics for deep neural network implementations, having integrated four major options:

 

  • a programmable ARM Cortex-A53 core
  • programmable accelerators
  • an embedded FPGA block
  • a hard-wired logic accelerator

The Harvard design included programmable accelerators, with a unique interface to the L2 memory cache across an ARM AXI4 interface, in support of specific (fine-grained) algorithms. The hard-wired logic pursued a “near-threshold” circuit implementation, with specific focus on optimizing the power efficiency.

The evaluation data from the Harvard team are summarized below, for representative Deep Neural Network “kernels”.

As with the Microsoft Azure conclusion, the efficiency results for the (embedded) FPGA option are extremely attractive.

I was intrigued by these results, and had the opportunity to ask Geoff Tate, Cheng Wang, and Abhijit Abhyankar of Flex Logix Technologies about their collaboration with the Harvard team. "Their design used a relatively small eFPGA array, with four eFLX tiles – two logic and two DSP-centric tiles," Geoff indicated. (For more details on the tile-based strategy for building eFPGA blocks, including the specific MAC functionality in the DSP tile, please refer to this earlier SemiWiki article – link.)

"The Harvard team's tapeout used the original eFLX DSP tile design, where the MAC functionality is based on wide operators," Cheng indicated. Flex Logix has recently released an alternative tile design targeted at common neural network inference engines, with options for small coefficient bit widths (link).

"We are anticipating even greater efficiency with the use of embedded FPGA tiles specifically developed for AI applications. We are continuing to make engineering enhancements to engine and memory bandwidth tile features," Geoff forecast.

Returning to the Harvard results above, although the PPA+E metrics for the eFPGA are attractive, a hard-wired ASIC-like approach is nonetheless still optimal for power efficiency (especially using a near-threshold library). What these figures don't represent is an intangible characteristic – namely, the flexibility of the deep neural network implementation. Inevitably, DNN algorithms for the inference engine are evolving for many applications, in pursuit of improved classification accuracy. In contrast to the eFPGA and processor core designs, a hard-wired logic network would not readily support the flexibility needed to make neural network changes to the depth and parameter set.

"Our customers consistently tell us that design flexibility associated with eFPGA DNN implementations is a critical requirement – that is part of our fundamental value proposition," Geoff highlighted.

The analysis data from the Harvard SMIV design contrasting processor, programmable logic, and hard-wired DNN implementations corroborates the high-level trends identified by Berkeley and Microsoft.

The traditional PPA (and licensing cost) criteria for evaluating IP need to be expanded for the rapidly evolving application space of neural network inference engines, and must include (quantifiable) Efficiency and (more subjective) Flexibility. The capability to integrate embedded FPGA blocks into SoCs offers a unique PPA+E+F combination; this promises to be an exciting technical area to track closely.

-chipguy

[1] Zhang, N., Brodersen, R.W., "The cost of flexibility in systems on a chip design for signal processing applications," Technical Report, University of California, Berkeley, 2002.

[2] Putnam, A., "The Configurable Cloud — Accelerating Hyperscale Datacenter Services with FPGA's," 2017 IEEE 33rd International Conference on Data Engineering (ICDE),
https://ieeexplore.ieee.org/document/7930129/ .


Systems Design vs Integrated Circuit Design
by Daniel Nenni on 09-21-2018 at 7:00 am

This is the sixteenth in the series of “20 Questions with Wally Rhines”

Electronic design automation (EDA) began and grew with the integrated circuit (IC) design business probably because IC design grew in complexity faster than printed circuit boards. The race for superiority in PCB design evolved in parallel, however, and has become increasingly important as system design moves to more advanced EDA.

Daisy, Mentor and Valid, founded in 1980 and 1981, supported a combination of IC and PCB design. Both technologies required schematic capture and layout but simulation was primarily an IC design technology. Mentor and Daisy targeted both IC and PCB design while Valid specialized in PCB. At the same time, companies like Racal Redac (Europe), Cadence, SciCards (on VAX), Intergraph and others competed for the PCB market. As much as the IC market, competitive advantage in PCB design and layout (and eventually manufacturing) resulted from strategic acquisitions as well as organic technology development.


Computervision, Calma and Applicon were the “Big 3” electronic design environments that preceded the Daisy, Mentor, Valid era. But the GE acquisition of Calma, which had a very strong IC layout capability, demonstrated how large companies can easily mismanage the acquisition of fast moving, small, high tech companies, and the value of Calma was quickly lost. Daisy and Mentor went head to head and Mentor ultimately won the majority of the systems companies (and even owns the remnants of Daisy today through Mentor’s acquisition of Veribest from Intergraph), a historical event that gives Mentor its strength today in system design as systems companies (particularly aerospace, defense and automotive) rarely changed EDA suppliers, even as they adopted IC design tools to complement their PCB tools.

A critical shift occurred in the early 1990’s. Mentor’s PCB capability came from the acquisition of CADI. Cadence had acquired tools as well and both Zuken and Racal Redac had strong positions grown from organically developed tools. In 1990, Cadence and Mentor had approximately equal market shares, with Zuken and Racal Redac making up much of the remainder of the PCB market. Cadence made a very bold move, taking advantage of the fact that Mentor was in a period of weakness due to its struggles with Version 8.0. Cadence acquired Valid, announcing that the overlap between Cadence and Valid PCB design tools would be quickly resolved by eliminating the losers and crowning the winners. This turned out to be a difficult strategy since ALL of the users from both Cadence and Valid lost some portion of their design flow. That forced all the Cadence and Valid users into a competitive re-evaluation of all the alternatives. Zuken gained a little and Mentor gained a lot, while Cadence kept some. The result: By 1999, Mentor had 20% of the PCB market, Cadence 17% and Zuken, who had acquired Racal Redac to complement its Japan strength with a European supplier, had 16%. By this time, the dot com crash was beginning and Zuken reduced investment while Cadence focused on IC design. Mentor, who was still troubled by the Version 8.0 problems, continued a heavy rate of investment in “system design” including PCB, as an area of #1 market strength, and continued to gain market share in PCB, peaking at a market share of about double its nearest competitor.

Over the next two decades, all this history had an effect on strategic evolution. The original companies that needed to move toward EDA standardization in the 1980’s were largely systems companies. They needed standardization in design methodologies, libraries and tools across their disparate divisions. Even though two thirds of Mentor’s revenue ultimately came from IC design, the original adopters of EDA remained as a stable base of customers, particularly those who manufactured cars, planes and trains, or were involved in aerospace and defense. Mentor was able to capitalize upon that large market share and, thanks to some developments along the way, developed a leading position in electronic wiring and embedded software for those kinds of systems.

As much as anything, this systems capability is what made Mentor so attractive to Siemens’ software division as they looked to extend their “digital twin” platform from design, product life cycle management, mechanical CAD and manufacturing simulation to the electronic dimension of the digital twin.

There was another reason that Mentor’s system design businesses succeeded despite the difficulties of the Falcon Version 8.0 transition. Russ Henke, who managed the PCB business at that time, did not believe that Version 8.0 would ever work. So he followed a path, common in many companies, of quiet non-compliance. He instructed his PCB team to develop a “wrapper” to interface to Version 8.0, just in case it worked, and then proceeded to invest in the traditional PCB design business, consistently growing PCB revenue throughout the period of Version 8.0 chaos and into the 1990’s.

There was another beneficial offshoot of the Version 8.0 transition problems. The Mentor sales force had very little to sell after the announcement that Version 7.0 would not be extended but would be replaced by Version 8.0 whenever that environment became available. An innovative sales team working with the “Value Added Services” group sought out new users for the existing products that were not affected by the Version 8.0 transition. PCB schematic capture was one of those products. They found a local customer in Portland, Freightliner, who manufactured trucks and is now owned by Daimler. Convincing them to move from manual wiring design to EDA can’t have been easy but they became the first adopters of a “field-developed” product named “LCable”, a name that reflected its use in the design and verification of cabling and wire harnesses for trucks and cars. Adoption by other automotive and aerospace companies proceeded slowly but, over the decade starting in 1992, the complexity of automotive and aerospace electronics increased so much that the need for EDA became apparent. By year 2000, the business was blossoming but had outgrown its original roots in PCB design and layout. Martin Obrien joined Mentor from Raychem and brought with him a detailed knowledge of how automotive, aerospace and defense companies thought about electrical wiring architectures. That became one of the valuable core businesses of Mentor over time. Today, the “Capital” family of integrated electrical system design products has become the leading system connectivity design environment, extending from concept through simulation, topology, bill of materials, factory form boards for manufacturing and maintenance after the sale. Siemens has become a teaching customer but the Capital family is intensely focused on providing an open environment that can help Siemens’ competitors as much as it helps Siemens.

The 20 Questions with Wally Rhines Series


Apogee Pipelining in Real Time
by Alex Tan on 09-20-2018 at 12:00 pm

Pipelining exploits parallelism among sub-processes to achieve a performance gain that is otherwise not possible. A design technique initially embraced at the CPU micro-architectural level, it overlaps the execution of previously segregated processing steps, commonly referred to as stages or segments. Pipelining for timing fixes has become a mainstream option in the design implementation space, especially when designers have exhausted other timing closure means at the physical design step (such as optimizing wire utilization or resource sharing in the logic cone).

Anatomy of pipelining
Pipelining involves both flip-flop and repeater insertion, although some designers tend to focus on the flip-flop insertion part, assuming that the implementation tools perform repeater insertion by default (such as during the synthesis stage or placement/route optimization).

Ideal pipelining should consist of equal stage latency across the pipeline with no resource sharing between any two stages. The design clock cycle is determined by the time required for the slowest pipe stage. Pipelining does not reduce the time for individual instruction execution. Instead, it increases instruction throughput or bandwidth, which can be characterized by how frequently an instruction exits the pipeline.
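As a rough textbook illustration of those statements (generic pipeline arithmetic, not tied to any particular tool or design):

\[
T_{clk} \;\ge\; \max_i(\text{stage delay}_i) + t_{clk \to q} + t_{setup},
\qquad
\text{throughput} \;\approx\; \frac{1}{T_{clk}}
\]

For a stream of $n$ operations through $k$ reasonably balanced stages, total time is roughly $(k + n - 1)\,T_{clk}$, so the speedup over the unpipelined case approaches $k$ for large $n$, even though each individual operation now takes $k$ cycles of latency.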

Pipelining can be applied to either datapath or control signals and requires monitoring for potential hazards. Ideally, pipelining should be done closer to the micro-architectural step, as adding flip-flops further down the design realization means perturbing many design entities and iterating through the long implementation flow.

SoC Design and Pipelining Challenges
With recent emerging applications such as AI accelerators, IoT, automotive and 5G, the two top challenges encountered by SoC design teams are scalability and heterogeneity. The former drives increased latency in the design, while the latter requires seamless integration of interfaces and models.

In the context of timing closure, there are two entry points for injecting pipelining to manage latency. The first is done post static timing analysis (STA): by identifying large negative slack among logic stages, designers can provide concrete data points to the architecture team (or RTL owner) for pipelining. This reactive step may be costly if done excessively, as the implied RTL-change iteration translates to resetting the design build.

On the other hand, pipelining can also be performed early in the RTL code, during micro-architectural inception. While doing it at this stage provides ample flexibility, code architects tend to be conservative due to the lack of accurate delay estimates and their awareness of the impact of increased flop usage on the overall design PPA. Hence, some designers have adopted a semi-manual method. It involves rule-of-thumb formulas combined with some SPICE simulations and spreadsheet tracking to arrive at a pipeline budget, plus involved placement constraints to manage its implementation. This approach is tedious, iterative and prone to over-design, as it may include guard-banding to cover backend optimization inefficiencies such as detours due to placement congestion or wire resource shortages.
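A minimal sketch of what such rule-of-thumb budgeting can look like (illustrative Python with hypothetical, uncalibrated delay numbers; a real flow would calibrate them against SPICE or extracted data, which is exactly the spreadsheet exercise described above):

```python
import math

# Hypothetical technology numbers -- calibrate per process, library and metal stack.
CLOCK_PERIOD_NS  = 1.0    # 1 GHz target
FLOP_OVERHEAD_NS = 0.15   # clk-to-q plus setup of a pipeline flop
WIRE_DELAY_NS_MM = 0.25   # buffered top-level wire delay per mm (rule of thumb)

def stage_flops_needed(route_length_mm):
    """Estimate how many pipeline (stage) flops a long feedthrough route needs."""
    usable_ns_per_stage = CLOCK_PERIOD_NS - FLOP_OVERHEAD_NS
    reach_mm_per_stage = usable_ns_per_stage / WIRE_DELAY_NS_MM
    segments = math.ceil(route_length_mm / reach_mm_per_stage)
    return max(segments - 1, 0)  # flops sit between segments

for length_mm in (2.0, 5.0, 12.0):
    print(f"{length_mm:5.1f} mm route -> {stage_flops_needed(length_mm)} stage flop(s)")
```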

Avatar and automatic pipelining
At the DAC 2018 designer/IP track poster session, Avatar and eSilicon showcased the outcome of their collaboration: successful pipelining through automatic stage-flop insertion performed by Avatar's Apogee. Apogee is a complete floorplanning tool that enables fast analysis of the design hierarchy and early floorplan exploration. It shares common placement, routing and timing engines with Aprisa, the block-level, complete place-and-route tool (please refer to my earlier blog for more discussion of these tools). Based on customer feedback, Avatar has introduced an automatic pipeline flip-flop insertion feature in its 18.1 release. The feature automatically adds stage flops on feedthrough nets during the floorplanning stage, using Apogee's new insert_stage_flop command.

Delving further into the feature: first, the long top-level nets are routed by the virtual flat global router, taking into account any existing congestion, blockages and macros inside the hierarchical partitions. Next, feedthroughs are assigned to the appropriate partitions and stage flops are added based on a user specification of distance or flop count.

Similar to the mainstream method of pushing buffers into the hierarchy, after insertion the pipeline flops are pushed into their hierarchical partitions with the push_cell command. Subsequently, the module-level netlist is automatically updated with the new hierarchical cell instances, and the corresponding ports are created at this level, as illustrated in Figure 3.

Results and Comparison

Using a large mobile SoC design as a test case, the design was implemented and routed with Apogee's automatic approach. The tabulated results show that 18% fewer stage flops were needed and flip-flop power was reduced by 22%, with minimal DRC and timing violations (significant reductions in both TNS and WNS).

The automated insertion takes about 2 hours, compared to 3 weeks of manual effort and multiple iterations to reach the final flop count. On top of that, timing and routability were challenging with the manual approach. With Apogee, timing- and congestion-aware automatic placement ensures both routability and timing convergence of the design.

In summary, designers can use Apogee's new automatic stage flop insertion feature to reduce iterations and achieve a better stage flop count, leading to lower sequential power. The flow also supports netlist updates and reports that simplify the downstream formal verification process. Avatar Integrated Systems plans to expand the capability to automatically insert or delete pipeline flops at the block-level placement optimization step in Aprisa, to further improve QoR at the block level.

For more details on Avatar’s Apogee please check HERE and Aprisa HERE.


Supporting ASIL-D Through Your Network on Chip
by Bernard Murphy on 09-20-2018 at 7:00 am

The ISO 26262 standard defines four Automotive Safety Integrity Levels (ASILs), from A to D, technically measures of risk rather than safety mechanisms, of which ASIL-D is the highest. ASIL-D represents a failure potentially causing severe or fatal injury in a reasonably common situation over which the driver has little control. Certification to one or more of these levels requires demonstrating that a system can guarantee better than a specified probability of failure, generally requiring increasing levels of failure mitigation, analysis and supporting documentation with each level.

Semiconductor component providers have opted in many cases for certification to component levels below D for cost and/or time-to-market reasons. But this is changing. As electronic content in our cars is increasing, distinctions between what does and does not critically affect safety are blurring. An ECU that might be used in an ASIL-B function today could become interesting for use in an ASIL-D function next year. Consequently, OEMs are extending their demands for ASIL-D certification across more components, to ensure they’re covered no matter what the application.

This ramps up the effort demanded for safety assurance in more designs. Certainly component functions must demonstrate mitigation to an appropriate level, but special care is also needed in integrating those components together, commonly through a NoC interconnect. There are three approaches that designers can take to ASIL-D compliance for that network-on-chip:

  • Complete replication (in this case the NoC), either duplication with comparison checks which will report a failure (where it is sufficient to warn the driver that a function has failed) or triplication with majority voting where a failure in one function can be overridden by outputs from two good functions (where simply knowing a system has failed is not enough). Still, complete duplication or replication of a NoC would be a very expensive option.
  • Another acceptable approach is to provide path diversity. If a component fails in one path, there should be other paths through which system operation can continue to work, possibly after updating routing tables. In effect the system can heal around bad nodes and continue to operate. The challenge here is that the designer must prove that all possible paths have backup paths, again likely to be expensive. A second consequence is that performance impact from possible rerouting has to be characterized, potentially as rigorously as for the good part. And finally updating the routing is a software function which will take time and adds further risk, and that must also be characterized and mitigated.

  • A less expensive and disruptive approach is to replicate only those functions within the NoC that must be replicated, such as control blocks, and to use ECC on interconnect wires to correct single-bit errors and flag 2-bit errors. This approach still meets the ASIL-D requirement, avoids all the complications associated with path diversity and has limited area overhead since control blocks represent a relatively small percentage of the overall NoC area.

As a way to meet these new stringent requirements, the last method above is difficult to beat. Adding 8 bits of ECC to a 64-bit link increases the size of the NoC by a little over 10%, versus doubling the size if duplicating the NoC. The solution is entirely in hardware, and errors are corrected instantaneously with no need for software reconfiguration and no extra latency. Finally, validating fault coverage and building a comprehensive FMEDA for the configured network can be completed with existing tools, compared with a path diversity approach which would require analysis over both hardware and network reconfiguration software.
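For a back-of-the-envelope check of that wire overhead, the standard SECDED (single-error-correct, double-error-detect) check-bit count can be computed directly; this is generic extended-Hamming arithmetic, not a description of Arteris' specific implementation:

```python
def secded_check_bits(data_bits):
    """Check bits for a SECDED (extended Hamming) code over data_bits of payload."""
    r = 0
    while (1 << r) < data_bits + r + 1:  # Hamming condition: 2^r >= data + r + 1
        r += 1
    return r + 1                         # extra overall parity bit enables double-error detection

for width in (32, 64, 128):
    ecc = secded_check_bits(width)
    print(f"{width}-bit link: {ecc} ECC bits ({ecc / width:.1%} extra wires)")
```

For a 64-bit link this gives 8 ECC bits, i.e. 12.5% extra wires, consistent with the "little over 10%" growth quoted for the NoC as a whole once the unchanged control logic is factored in.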

To learn more about Arteris solutions for ASIL-D-compliant NoC interconnect, go HERE.


Semiconductor IP Reality Check
by Daniel Nenni on 09-19-2018 at 12:00 pm

A robust, proven library of IP is a critical enabler for the entire semiconductor ecosystem. Without it, ASIC design is pretty much impossible, given time-to-market pressures. Said another way, designing IP for your next chip simply doesn’t fit the schedule – most teams have barely enough time to integrate and validate pre-existing IP. Without solid IP coverage, new process nodes also become somewhat irrelevant for the same reasons. So, designers and foundries care about IP a lot.

7nm is where a lot of the action is these days regarding IP delivery. Datacenter, networking, AI and 5G infrastructure all have a thirst for the power and performance delivered by this node. So, there are lots of claims out there regarding 7nm IP. “World’s first, industry-leading, silicon-proven, robust” are just some of the words you’ll find in all the marketing material available for 7nm IP. The question is, how do you separate the hype from reality, and more importantly how do you truly reduce risk?

Simulation results, silicon data and the number of tape-outs are all important parts of the homework needed to find IP that is truly "robust". Lately, there is another dimension to the problem worth considering as well. Beyond the IP working in silicon, does all the IP work well together? Before you bet the farm on your next 7nm design project, are you confident that all the IP will play well together? A completely validated library of IP can still cause huge headaches if it all doesn't work together. Integration risks are very real, as are the risks associated with modifying IP to hit the required power, performance or area target.

The concept of IP that works well together and supports customization for a target application makes a lot of sense. Recently, eSilicon announced two such IP offerings for data center and AI chips, which SemiWiki covered here. eSilicon calls the concept an “IP platform”. I’m sure other marketing terms will emerge.

Recently, there have been a couple of announcements from eSilicon that bring these platforms closer to home. It turns out a high-performance SerDes is a critical enabler for both platforms. Last week, eSilicon announced very encouraging results from the silicon validation of its 56G 7nm SerDes. Their press release stated: “… lab measurements confirm that the design is meeting or exceeding the target performance, power and functionality. Based on these results eSilicon has begun to demonstrate its test chip to key customers.” So, contact eSilicon if you want to see what their SerDes can do, first hand.

This week, eSilicon announced that their neuASIC™ platform is available for customer designs. Some details about what's in neuASIC were disclosed:

“The neuASIC IP platform has been through several 7nm tapeouts. The platform includes the following compiled, hardened and verified functions:

  • Configurable multiply-accumulate (MAC) blocks
  • Single-port SRAM
  • Pseudo two-port and pseudo four-port SRAM
  • Ternary content-addressable memory
  • Pitch match memory
  • GIGA memory
  • WAZPS (word all zero power saving) memory
  • Transpose memory
  • Re-mapper – low power cross-bar
  • Convolution engine
  • 56G SerDes
  • HBM2 PHY

The platform also provides a software AI Accelerator Builder function that provides PPA estimates of the chosen ASIC architecture before RTL development starts.”

So, another reason to contact eSilicon if you're considering a 7nm AI ASIC. I would absolutely check out both of these platforms if you want to reduce integration risks.

Also read: eSilicon Announces Silicon Validation of 7nm 56G SerDes