
Using Sequential Testing to Shorten Monte Carlo Simulations

by Tom Simon on 12-27-2017 at 7:00 am

When working on an analog design, after initial design specs have been met, it is useful to determine if the design meets specs out to 3 or 4 sigma based on process variation. This can serve as a useful step before going any further. It might not be a coincidence that foundries base their Cpk on 3-sigma. To refresh, Cpk is the distance from the process mean to the nearer of the upper or lower specification limit, divided by 3 sigma, so a Cpk of 1 works out to meeting process specs at 3 sigma. Higher Cpk’s point toward meeting spec out to a higher sigma – providing better yields. Still, running Monte Carlo analysis on a design across process variations to validate proper performance out to 3 or 4 sigma can be a daunting task.
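To make the Cpk refresher concrete, here is a minimal sketch of the calculation, assuming the parameter is normally distributed. The function names and the amplifier-gain numbers are invented for illustration and do not come from any foundry or tool.

```python
from statistics import NormalDist

def cpk(mean, sigma, lsl, usl):
    """Distance from the mean to the nearer spec limit, in units of 3 sigma."""
    return min(usl - mean, mean - lsl) / (3.0 * sigma)

def yield_within_spec(mean, sigma, lsl, usl):
    """Fraction of parts inside [lsl, usl], assuming a normal distribution."""
    dist = NormalDist(mean, sigma)
    return dist.cdf(usl) - dist.cdf(lsl)

# Hypothetical gain spec of 38.5 to 41.5 dB, centered process (made-up values).
example_cpk = cpk(mean=40.0, sigma=0.5, lsl=38.5, usl=41.5)
example_yield = yield_within_spec(mean=40.0, sigma=0.5, lsl=38.5, usl=41.5)
print(f"Cpk = {example_cpk:.2f}, yield = {example_yield:.4%}")
```

With both limits exactly 3 sigma from the mean, the Cpk comes out at 1 and the two-sided yield is the familiar 99.73% or so.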

At the MunEDA user group meeting in Munich in November, I had the opportunity to hear a presentation on an interesting technique that can potentially reduce the number of Monte Carlo runs needed to reject or qualify a design during Monte Carlo variation analysis. The name of the technique is Sequential Testing. The short version is that it uses the results from a smaller number of samples to determine the likelihood of the final result being above or below thresholds for acceptance or rejection. Let’s break this down a bit.

If you have a jar of 100 randomly mixed black and white marbles and you draw a small number, you will start to get an idea of the composition of the entire jar. Of course, there will be some uncertainty, but if you are willing to accept a range as your answer, you can get a pretty good idea of the percentage of black or white marbles with just a few samples. In essence, we are talking about using a smaller number of samples to get a probability that we meet spec at a specific sigma.

This process works better when the design in question is further away, either better or worse, from the target sigma. Any way you look at it, when you can have confidence that a design is either failing or beating its target sigma, you can save a lot of time running Monte Carlo simulations. So, as you might gather, the key is selecting the right level for acceptance or rejection. These are known as the acceptance quality limit (AQL) and the rejection quality limit (RQL), respectively. Given that we are chip designers and not statisticians, it’s nice that MunEDA offers some help here. The Dynamic Sampling option in their Monte Carlo simulator will help automatically set the percentages for AQL and RQL.

So how does this translate into time savings on Monte Carlo analysis? Their presentation contained some examples of applying this feature in their tool. If we look at a circuit that has a 3 sigma requirement and run a full Monte Carlo we expect to have 5000 runs. However, if the circuit we are analyzing only has a sigma-to-spec robustness of 2.5 sigma, we can expect to learn this after only 192 simulations when we use the sequential testing feature. This results in an impressive 26x speed up. Though we won’t be happy to learn the design fails, at least significant Monte Carlo simulation time is saved.

The same effect would be observed if the circuit exceeded the target sigma by a margin. If the circuit yielded out to 3.5 sigma, this can be predicted with only 318 runs. Still far fewer than 5000. To use the interface, users specify the desired yield and then choose dynamic as the sampling method. Simulations will be run until one of the specs is rejected, or until all specs are accepted.
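The presentation did not spell out the statistics behind Dynamic Sampling, so what follows is only a sketch of one classic sequential test, Wald's sequential probability ratio test, applied to pass/fail samples. It is not MunEDA's implementation. The function name, the error rates `alpha` and `beta`, and the AQL/RQL values (treated here as acceptable and rejectable failure rates) are all assumptions made for illustration.

```python
import math
import random

def sequential_test(sample_fails, aql, rql, alpha=0.05, beta=0.05, max_runs=5000):
    """Wald SPRT on pass/fail samples.

    sample_fails() runs one Monte Carlo sample and returns True if it fails spec.
    aql / rql are the acceptable / rejectable failure rates (aql < rql).
    Returns (decision, runs_used).
    """
    reject_bound = math.log((1 - beta) / alpha)   # evidence needed to reject
    accept_bound = math.log(beta / (1 - alpha))   # evidence needed to accept
    step_fail = math.log(rql / aql)               # each failure pushes toward reject
    step_pass = math.log((1 - rql) / (1 - aql))   # each pass pushes toward accept
    llr = 0.0
    for n in range(1, max_runs + 1):
        llr += step_fail if sample_fails() else step_pass
        if llr >= reject_bound:
            return "reject", n
        if llr <= accept_bound:
            return "accept", n
    return "undecided", max_runs

# Hypothetical stand-in for a simulator: a design that fails 5% of the time.
rng = random.Random(0)
decision, runs = sequential_test(lambda: rng.random() < 0.05, aql=0.001, rql=0.01)
print(decision, runs)
```

The further the true failure rate sits from the AQL/RQL band, the sooner the running sum crosses one of the two bounds, which matches the observation above that designs well above or well below the target sigma are resolved fastest.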

MunEDA offers the sequential testing option in both their WiCkeD Monte Carlo and their BigMC tools. In the WiCkeD tool they offer pass/fail and sigma-to-spec sampling. In BigMC they offer sigma-to-spec sequential testing. Both help with automatic determination of RQL and AQL. In particular, BigMC is interesting because it can handle very large netlists, ~100MB or 500k devices. Overall, MunEDA’s prowess in statistical analysis shows through quite clearly. During the user group meeting in Munich there were many papers presented on diverse topics – from flip-flop optimization to using their worst-case analysis to model a MEMS design. For more information on this and the other topics, I suggest looking at their website.


Neural Networks Leverage New Technology and Mimic Ancient Biological Systems

by Tom Simon on 12-26-2017 at 12:00 pm

Neural networks make it possible to use machine learning for a wide variety of tasks, removing the need to write new code for each new task. Neural networks allow computers to use experiential learning instead of explicit programming to make decisions. The basic concepts related to neural networks were first proposed in the 1940’s, but sufficient technology to implement them did not become available until decades later. We are now living in an era where they are being applied to a multitude of products, most notable of these being autonomous vehicles.

In a presentation from ArterisIP, written by CTO Ty Garibay and Kurt Shuler, they assert that the three key ingredients for making machine learning feasible today are big data, powerful hardware, and a plethora of new NN algorithms. Big data makes available vast amounts of training data, which can be used by the neural networks to create the weights or coefficients for the task at hand. Powerful new hardware is also making it possible to perform processing that is optimized for the algorithms used in machine learning. This hardware includes classic CPU’s, as well as GPU’s, DSP’s, specialized math units, and dedicated special purpose logic. The final ingredients are the algorithms that are used to assemble the NN itself.

The original basis for the design of all machine learning is the human brain, which uses large numbers of computing elements (neurons) connected in exceedingly elaborate and ever-changing ways. Looking at the human brain, it is clear that much more real estate is dedicated to the interconnection of processing elements than to the processing elements themselves. See the amazing image below to compare the regions of gray and white matter.


The ArterisIP presentation offers a dramatic chart showing prominent NN classes as of 2016. Again, we see that the data flow and interconnection between the fundamental processing units, or functions, is the significant characteristic of Neural Networks. In machine learning systems, performance is bounded by architecture and implementation. On the implementation side we see, as noted above, that hardware accelerators are frequently used. Also, implementation of the cache and system memory has a profound effect on performance.


SoC’s for machine learning are being built with IP that supports cache coherency and also with a large number of targeted function accelerators that have no notion of memory coherency. To optimize these SoC’s such that they can achieve the highest performance, it is helpful to add a configurable coherent caching scheme which allows these blocks to communicate efficiently on-chip. ArterisIP, as part of their interconnect IP solution, offers a proxy cache capability that can be custom-configured and added to support non-cache coherent IP.


ArterisIP points out in their presentation that data integrity protection is also needed in many of the applications where NNs are being used. For instance, in automotive systems, the ISO 26262 standard calls for rigorous attention to ensuring reliable data transfer. ArterisIP addresses this requirement with configurable ECC and parity protection for critical sections of an SoC. Also, their IP can duplicate hardware where needed in the interconnect system, in order to dramatically reduce the likelihood of a device failure.

ArterisIP has extensive experience providing interconnect IP to the leading innovators developing Neural Network SoCs. The company recently publicly announced nine customers that are designing machine learning and AI SoC’s. The application areas targeted by ArterisIP’s customers include data centers, automotive, consumer and mobile.

Neural networks will continue to become increasingly important for computing systems. As the need to write application specific code diminishes, the design of the neural network itself will become the new key design challenge, including both the NN software and the specific hardware implementing the system. ArterisIP’s interconnect IP can address many of the design issues that arise in the development of these SoCs.

I found their presentation, which is available on their website, to be very informative, and it provides a unique perspective on the topics relating to efficient and reliable NN systems.


HLS Rising

by Bernard Murphy on 12-26-2017 at 7:00 am

No-one could accuse Badru Agarwala, GM of the Mentor/Siemens Calypto Division, of being tentative about high-level synthesis (HLS). Then again, he and a few others around the industry have been selling this story for quite a while, apparently to a small and not always attentive audience. But times seem to be changing. I’ve written elsewhere about expanding use of HLS. Now Badru has written a white paper which gets very aggressive – you’d better get on-board with HLS if you want to remain competitive.


Now if that was just Badru, I might put this passion down to his entirely understandable belief that his baby (Catapult) is beautiful. But when some very heavy hitting design teams agree, I have to sit up and pay attention. Badru cites public quotes from Qualcomm, Google and NVIDIA, backed up by detailed white papers, to make his case. Naturally these applications center around video, camera, display and related applications. But for those who haven’t been paying attention, these areas represent a sizeable percentage of what’s hot in chip-design today. The references back that up, as do applications in the cloud, gaming, and real-time recognition in ADAS.

NVIDIA was able to cut the schedule of a JPEG encoder/decoder by 5 months while also upgrading, within 2 months, two 8-bit video decoders to 4K 10-bit color for Netflix and YouTube applications. In their view, these objectives would never have made it to design in an RTL-based flow. An important aspect of getting to the new architecture with a high QoR was the ability to quickly and incrementally refine between a high-level functional C model and the synthesizable C++ model, and run thousands of tests to ensure compatibility between these models, something that simply would not have been possible if they had to run on RTL. In an integration with PowerPro, they were also able to cut power by 40%. NVIDIA say that they are no longer HLS skeptics – they plan to use this Catapult-based flow on future video and imaging designs, whether new or re-targeted.

Google are big fans of open-sourcing development, as much in hardware as in software. They are supporting a project called WebM for the distribution of compressed media content across the web and want to provide a royalty-free hardware decoder IP for that content, called the VP9 G2 decoder. Again, this must handle up to 4K resolution playback, on smart TVs, tablets, mobile devices and PCs/laptops. In their view, C++ is a more flexible starting point than RTL for an open-source distribution, allowing users to easily target different technologies and performance points. To give you a sense of how big a deal this is, VP9 support is already available in more than 1 billion endpoints (Chrome, Android, FFmpeg and Firefox). So, hardware acceleration for the standard has a big customer base to target. Starting an open-source design in C++ rather than RTL is likely, as they say, to move the needle. Did I mention that Google uses Catapult as their HLS platform?

Beyond the value of C++ being a more flexible starting point for open-source design, Google observed a number of other design advantages:

  • Total C++ code for the design is about 69k lines. They estimate an RTL-based approach would have required ~300k lines. No matter how you slice it, creation, verification and debug time/effort scales with lines of code. You get to a clean design faster on a smaller line-count.
  • Simulation (in C++) runs about 50X faster than in RTL. They could create, run verification and fix bugs in tight loops all day long rather than spinning their wheels waiting for RTL simulation runs to complete.
  • Using C++ they could use widely available tools and flows to collaborate, share enhancements to the same file and merge when appropriate. You can do this kind of thing in RTL too, but in a system-centric environment there must be a natural pull to using tools and ecosystems used by millions of developers rather than thousands of developers.
  • An HLS run on a block (14 of them in the design) took about an hour, which allowed them to quickly explore different architectures through C++ changes or synthesis options. They believe this got them to the final code they wanted in 6 months rather than a year.

Qualcomm has apparently been using HLS and high-level verification (HLV, based on C++/SystemC testbenches) for several years, on a wide range of video and image processing IP, some of which you will find in Snapdragon devices. They apparently started with HLS in the early 2000’s, partnering with Calypto, the company that created Catapult and is now a part of Mentor/Siemens. A big part of the attraction has been the fast turns that are possible in verifying an architecture-level model; they say they are seeing an even more impressive 100-500X performance improvement over RTL. Qualcomm also emphasizes that they do the bulk of their verification (for these blocks) in the C domain. By the time they get to HLS, they already have a very stable design. Verification at the C level is dramatically faster and proceeds in parallel with design, unlike traditional RTL flows where, no matter how much you shift left, verification always trails design. Verification speed means they can also get to very high coverage much faster. And they can reuse all of that verification infrastructure on the synthesized RTL produced by HLS. The only tweaking they have to do on the RTL is generally at the interface level.


I know we all love our RTL, we have lots of infrastructure and training built up around that standard and we unconsciously believe that RTL must be axiomatic for all hardware design for the rest of time. But we’re starting to see shifts, in important leading applications and in important leading companies. And when that kind of shift happens, doggedly refusing to change, because we’ve always used RTL or because “everyone knows” that C++-based design is never going anywhere, may not be healthy. Might want to check out the white-paper HERE.


China is right: The world doesn’t need Silicon Valley

by Vivek Wadhwa on 12-25-2017 at 7:00 am

Ever since the Chinese Government banned Facebook in 2009, Mark Zuckerberg has been making annual trips there attempting to persuade its leaders to let his company back in. He learned Mandarin and jogged through the smog-filled streets of Beijing to show how much he loved the country. Facebook even created new tools to allow China to do something that goes against Facebook’s founding principles — censor content.

But the Chinese haven’t obliged. They saw no advantages in letting a foreign company dominate their technology industry. China also blocked Google, Twitter, and Netflix and raised enough obstacles to force Uber out.

Chinese technology companies are now amongst the most valuable — and innovative — in the world. Facebook’s Chinese competitor, Tencent, eclipsed it in market capitalization in November, crossing the $500 billion mark. Tencent’s social-media platform, WeChat, enables bill payment, taxi ordering, and hotel booking while chatting with friends; it is so far ahead in innovation that Facebook may be copying its features. Other Chinese companies, such as Alibaba, Baidu, and DJI, are racing ahead in e-commerce, logistics, artificial intelligence, self-driving cars, and drone technologies. These companies are gearing up to challenge Silicon Valley itself.

The protectionism that economists have long decried, which favors domestic supplies of physical goods and services, limits competition and thereby the incentive to innovate and evolve. It creates monopolies, raises costs, and stifles a country’s competitiveness and productivity. But this is not a problem in the Internet world.

Over the Internet, knowledge and ideas spread instantaneously. Entrepreneurs in one country can easily learn about the innovations and business models of another country and duplicate them. Technologies are advancing on exponential curves and becoming faster and cheaper — so every country can afford them. Any technology company in any country that does not innovate risks going out of business, because local startups are constantly emerging that have the ability to challenge it.

Chinese technology protectionism created a fertile ground for local startups by eliminating the fear of foreign predators. And there was plenty of competition — coming from within China.

Silicon Valley’s moguls openly tout the need to build monopolies and gain unfair competitive advantage by dumping capital. They take pride in their position in an economy in which money is the ultimate weapon and winners take all. If tech companies cannot copy a technology, they buy the competitor.

Amazon, for example, has been losing money or earning razor-thin margins for more than two decades. But because it was gaining market share and killing off its brick-and-mortar competition, investors rewarded it with a high stock price. With this inflated capitalization, Amazon raised money at below market interest rates and used it to increase its market share. Uber has used the same strategy to raise billions of dollars to put potential global competitors out of business. It has been unscrupulous and unethical in its business practices.

Though this may sound strange, copying is good for innovation. This is how Chinese technology companies got started: by adapting Silicon Valley’s technologies for Chinese use and improving on them. It’s how Silicon Valley works too.

Steve Jobs built the Macintosh by copying the windowing interface from the Palo Alto Research Center. As he admitted in 1994, “Picasso had a saying, ‘Good artists copy, great artists steal’; and we have always been shameless about stealing great ideas.”

Apple usually lags in innovations so that it can learn from the successes of others. Indeed, almost every Apple product has elements that are copied. The iPod, for example, was invented by British inventor Kane Kramer; iTunes was built on a technology purchased from Soundjam; and the iPhone frequently copies Samsung’s mobile technologies — while Samsung copies Apple’s.

Facebook’s origins also hark back to the ideas that Zuckerberg copied from MySpace and Friendster. And nothing has changed since: Facebook Places is a replica of Foursquare; Messenger video duplicates Skype; Facebook Stories is a clone of Snapchat; and Facebook Live combines the best features of Meerkat and Periscope. Facebook tried mimicking Whatsapp but couldn’t gain market share, so it spent a fortune to buy the company (again acting on the Silicon Valley mantra that if stealing doesn’t work, then buy).

China opened its doors at first to let Silicon Valley companies bring in their ideas to train its entrepreneurs. And then it abruptly locked those companies out so that local business could thrive. It realized that Silicon Valley had such a monetary advantage that local entrepreneurs could never compete.

America doesn’t realize how much things have changed and how rapidly it is losing its competitive edge. With the Trump administration’s constant anti-immigrant rants, foreign-born people are getting a clear message: Go home; we don’t want you. This is a gift to the rest of the world’s nations, because the immigrant exodus is boosting their innovation capabilities. And America’s rising protectionist sentiments provide encouragement to other nations to raise their own walls.

Here is an India-focused version of this article in Hindustan Times.

For more, visit my website: www.wadhwa.com and read my book, The Driver in the Driverless Car: How Our Technology Choices Will Create the Future, a 2017 Financial Times & McKinsey Business Book of the Year and Nature Magazine “Best Science Pick.”


2017 Semiconductors +20%, 2018 slower

by Bill Jewell on 12-24-2017 at 7:00 am

The global semiconductor market in 2017 will finish with annual growth of about 20%. Recent forecasts range from 19.6% to 22%. World Semiconductor Trade Statistics (WSTS) data is finalized through October, thus the final year results will almost certainly be within this range. We at Semiconductor Intelligence have raised our forecast to 21% from 18.5% in September. 2017 will be the highest annual change since 32% in 2010. Memory, specifically DRAM and NAND flash, is the major market driver. WSTS projects the memory market will grow 60% in 2017, while the semiconductor market excluding memory will increase 9%. Memory was 23% of the semiconductor market in 2016 but accounts for two-thirds of the $70 billion change in the 2017 semiconductor market versus 2016.
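The memory numbers above hang together, and a quick back-of-the-envelope check shows how. This sketch uses only the rounded percentages quoted in the paragraph; the variable names are mine, and the implied 2016 market size is derived from those rounded inputs rather than taken from WSTS.

```python
# Rounded inputs quoted above.
memory_share_2016 = 0.23      # memory as a fraction of the 2016 market
memory_growth = 0.60          # projected 2017 memory growth
non_memory_growth = 0.09      # projected 2017 growth excluding memory
total_change_busd = 70.0      # 2017 vs. 2016 change, in $ billions

# Each segment's contribution to total growth, as a fraction of the 2016 market.
memory_contrib = memory_share_2016 * memory_growth
non_memory_contrib = (1 - memory_share_2016) * non_memory_growth

total_growth = memory_contrib + non_memory_contrib
memory_fraction_of_change = memory_contrib / total_growth
implied_2016_market_busd = total_change_busd / total_growth

print(f"total growth:              {total_growth:.1%}")
print(f"memory share of the change: {memory_fraction_of_change:.1%}")
print(f"implied 2016 market:        ${implied_2016_market_busd:.0f}B")
```

Total growth works out to about 20.7%, inside the 19.6% to 22% forecast range, and memory supplies roughly two-thirds of the change even though it was under a quarter of the 2016 market.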

The announced forecasts for 2018 range from Mike Cowan’s 3.2% to Future Horizons’ 15.6%. We at Semiconductor Intelligence raised our 2018 projection to 12% from 10% in September. The assumptions behind our forecast are:

· Steady or improving demand for key electronic equipment
The market for PCs and tablets has been weak, with declines in 2016 and 2017. Gartner expects a slight improvement in 2018 to a roughly flat market at 0.2% change. IDC believes smartphone unit growth will accelerate from 1.4% in 2017 to 3.7% in 2018. IC Insights projects ICs for automotive and internet of things (IoT) applications will be key market drivers over the next few years. These two categories should each show robust increases of 16% in 2018.

| Annual Growth | 2017 | 2018 | Source |
|---|---|---|---|
| PC & Tablet units | -3.2% | 0.2% | Gartner, Oct. 2017 |
| Smartphone units | 1.4% | 3.7% | IDC, Nov. 2017 |
| Automotive IC $ | 22% | 16% | IC Insights, Dec. 2017 |
| Internet of Things IC $ | 14% | 16% | IC Insights, Dec. 2017 |

· Slight improvement in global economic growth
The International Monetary Fund (IMF) October 2017 economic outlook called for global GDP to rise 3.6% in 2017, an acceleration of 0.4 percentage points from 3.2% in 2016. 2018 is expected to show a slight acceleration to 3.7% growth. Advanced economies are projected to decelerate from 2.2% GDP change in 2017 to 2.0% in 2018. Among the advanced economies, an acceleration in U.S. GDP growth is more than offset by slower increases in the Euro area, United Kingdom, and Japan. The acceleration in global GDP in 2018 will be driven by emerging and developing economies, moving from a 4.6% change in 2017 to 4.9% in 2018. A deceleration in China from 6.8% in 2017 to 6.5% in 2018 is more than offset by accelerating change in India, steady growth in the ASEAN-5 (Indonesia, Malaysia, Philippines, Thailand and Vietnam), and a continuing recovery in Latin America.

· Moderating, but continuing strong memory demand
In the last 25 years, there have been four cycles where the memory market has shown at least one year of growth over 40%. These cycles usually ended with a major decline in the memory market, ranging from 13% to 49%. The exception was 2004-2005, when the memory market went from 45% growth to a modest 3% change. The upside of the memory cycles generally lasted two to four years. The exception to this was 2010, where a 55% memory increase was preceded by a 3% decline in 2009 and followed by a 13% decline in 2011. Based on this history, we will most likely see one more year of solid memory growth in 2018 before a decline in 2019.

· Strong quarterly growth set in 2017 drives healthy 2018
The 2017 semiconductor market has exhibited robust gains in each quarter versus a year ago, starting at 18% in 1Q 2017 and peaking at 24% in 2Q 2017. 3Q 2017 grew 22% from a year ago and 10.2% versus the prior quarter. 3Q 2017 was only the second double-digit quarter-to-quarter increase in the last eight years (after 11.6% in 3Q 2016). This quarterly pattern will drive healthy full-year 2018 growth with only modest quarter-to-quarter change in each quarter of 2018. The quarterly forecast below supports our 12% annual target for 2018.

We have not yet finalized our forecast for 2019. The most probable scenario is a low single-digit increase or a slight decrease as memory demand eases. After the 2019 correction, the semiconductor market should recover to moderate growth in 2020.


IEDM 2017 – Intel Versus GLOBALFOUNDRIES at the Leading Edge

by Scotten Jones on 12-22-2017 at 9:00 am

As I have discussed in previous blogs, IEDM is one of the premier conferences to learn about the latest developments in semiconductor technology.



"The Year of the eFPGA" 2017 Recap

by Tom Dillinger on 12-22-2017 at 7:00 am

This past January, I had postulated that 2017 would be the “Year of the Embedded FPGA”, as a compelling IP offering for many SoC designs (link). As the year draws to a close, I thought it would be interesting to see how that prediction turned out.

The criteria that would be appropriate metrics include: increasing capital investment; increasing customer adoption; support for a diverse set of applications; and, an emerging set of standard product offerings to accelerate adoption. To be sure, qualified test vehicles fabricated on multiple foundry process nodes are also crucial, as is a solid methodology flow for design synthesis and physical personalization.

If you have been following eFPGA technology, you have no doubt seen recent press releases highlight the growing investment and the customer endorsements. In addition, previous Semiwiki articles have described how eFPGA features are addressing both high-performance and low-power requirements, as well as the ease with which the IP block is connected to the pervasive AMBA bus protocols (link, link). So far, the prediction is looking pretty good. 🙂

The last metric – the introduction of standard product offerings – has received less attention, perhaps. To gain a better understanding of the eFPGA product strategy, I recently met up with Aparna Ranachandran, Tony Kozaczuk, and Cheng Wang at Flex Logix. I asked how their technology offerings are evolving, as the customer interest grows.

Cheng indicated, “A key requirement is to address the applications where programmable eFPGA functionality also incorporates significant memory storage. Many customers are seeking a product that optimally integrates SRAM within the eFPGA logic tiles. They do not intend to invest a lot of resource in physical implementation – i.e., designing and floorplanning SRAM blocks adjacent to the eFPGA IP. These customers want a flow from their HDL description through synthesis to an off-the-shelf eFPGA product with programmable logic and memory.”

“To that end, we will soon be releasing an integrated design for silicon qualification, as a standard product,” Aparna highlighted.

Tony added, “With lots of customer input, we have selected a combination of programmable logic capacity and array storage that will span a wide range of upcoming customer designs. We are leveraging the existing HDL synthesis flow support that provides block RAM’s in the output netlist, inferring the array topology from the HDL model. Our EFLX compiler maps each BRAM in the synthesis netlist to a corresponding configuration of SRAM macros integrated in the eFPGA IP.”

The use of Block RAM’s is the standard representation for synthesizing and implementing arrays for commercial FPGA products – so, this flow is a natural extension for eFPGA IP. The initial Flex Logix programmable logic + array offering is illustrated below.

Aparna is the lead designer, and provided a description of some of the technical features:

  • eFPGA array macros are based on qualified TSMC bit cells. (The initial process node will be 28nm.)
  • MBIST test controller design logic is provided.
  • The array macros are optimally configured between tiles – specific attention is given to the I/O connections from the tiles to the arrays, without adversely impacting the logic signal routing capacity between tiles.
  • The EFLX placement algorithm will automatically assign the BRAM netlist instances to the integrated SRAM macros, leveraging timing-driven optimization calculations. (Unused array macros are tied to inactive levels.)

The overall flow for realizing the eFPGA logic + memory design is illustrated in the figure below.

The initial front-end EFLX analysis step provides customers with resource estimates, for both the programmable logic LUT usage and the array macro utilization. The subsequent steps complete the physical personalization, including the array macro connectivity.

“Our customers are seeking silicon-proven IP products – this offering will expand the application base to designs requiring integrated storage,” Cheng said. (For specific customers who are interested in a unique integrated configuration, the Flex Logix team would assist them with preparation of the flow input descriptions shown as “optional” in the figure above, as well as the IP physical implementation.)

So, it looks like the eFPGA technology market is indeed expanding to offer customers product(s) that will accelerate adoption, combining complex logic and storage requirements with a well-defined implementation flow. This past year has indeed been the “year of the eFPGA” – it will be interesting to see what 2018 brings.

For more information on the Flex Logix logic + array offering, please follow this link.

Have a Happy Holiday season!

-chipguy


Embedded In-chip Monitoring, Webinar Recap

by Daniel Payne on 12-21-2017 at 12:00 pm

Six years ago I first interviewed Stephen Crosher, CEO and Co-founder of Moortec, when they were in startup mode with some new semiconductor IP for temperature sensing. Earlier this month I attended their webinar on embedded in-chip monitoring to get caught up with their technology and growing success. Ramsay Allen is their VP of Marketing, and he talked about how their business started out in 2005 in the UK as an IP supplier focused on Process, Voltage and Temperature (PVT) sensing.


Stephen Crosher, Ramsay Allen – Moortec

Stephen presented the bulk of the webinar and introduced the need for embedded in-chip monitoring:

  • Can I meet my power consumption requirements?
  • Is my chip operating in a reliable fashion?
  • How are the transient thermal levels within my SoC operating?

FinFET transistors became widespread starting at 22nm and have continued into smaller nodes because, compared to planar CMOS technologies, they offer lower leakage, lower operating voltages, higher silicon density, faster speeds and improved channel control. With the increase in density come new challenges: thermal hot spots, electromigration causing reliability issues, and leakage concerns. Even packaging costs become an issue, as you can spend between $1 and $3 per watt consumed in the SoC.

As supply voltages scale ever downwards, chip engineers need to design for worst-case IR drops and account for increased resistance in interconnect. There is even a high-end industry segment mining for Bitcoin whose chip performance is bound by power delivery and air conditioning costs, so being able to run chips cooler is a big financial benefit.

Smaller process nodes, 28nm and below, have reliability issues to contend with, like NBTI (Negative Bias Temperature Instability), where the Vt value shifts over time, so IC designers need to know how far Vt values have shifted during aging. Timing closure is now complicated by process variations within a single die, where one chip region sits at one PVT corner while another region operates at a different PVT corner.

Mr. Crosher shared a use case from AMD on their Athlon II Quad core CPU, designed at 45nm where they placed thermal sensors in each of the cores and then distributed the workload across the cores based upon the thermal readings from each core, making sure that no one core became too hot, balancing the core reliability.
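To make the idea of thermally steered workload distribution concrete, here is a minimal sketch of a greedy scheduler that always hands the next task to the coolest core. It is not AMD's algorithm, which the webinar did not detail; the function name `assign_tasks`, the per-task heating estimate and the temperature limit are all illustrative assumptions.

```python
from typing import Dict, List


def assign_tasks(core_temps_c: Dict[str, float], tasks: List[str],
                 heat_per_task_c: float = 2.0,
                 max_temp_c: float = 85.0) -> Dict[str, List[str]]:
    """Greedy placement: each task goes to the currently coolest core.

    heat_per_task_c and max_temp_c are made-up illustrative numbers,
    not values from AMD or Moortec.
    """
    temps = dict(core_temps_c)  # work on a copy of the sensor readings
    placement: Dict[str, List[str]] = {core: [] for core in temps}
    for task in tasks:
        coolest = min(temps, key=temps.get)
        if temps[coolest] + heat_per_task_c > max_temp_c:
            raise RuntimeError("no core has thermal headroom for " + task)
        placement[coolest].append(task)
        temps[coolest] += heat_per_task_c  # crude estimate of the added heat
    return placement


if __name__ == "__main__":
    readings = {"core0": 70.0, "core1": 60.0, "core2": 66.0, "core3": 61.0}
    print(assign_tasks(readings, ["t0", "t1", "t2", "t3"], heat_per_task_c=4.0))
```

A real scheduler would re-read the sensors continuously rather than estimate the heating, but the sketch shows why per-core sensors are the enabling piece: without them there is nothing to sort on.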

In the second use case the challenge was to optimize voltage scaling by measuring the power and speed of each IC, then finding the lowest functional voltage possible and saving the unique settings in each device. Moortec even supports Adaptive Voltage Scaling (AVS) in a closed-loop format by placing multiple Voltage Monitors (VM) or Process Monitors on each chip.
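The per-device search for the lowest functional voltage can be sketched as a simple step-down loop. This is only an illustration under stated assumptions: `passes_at` is a hypothetical callback standing in for whatever speed test or process monitor readout decides pass or fail, and the voltage range, step size and guard band are invented numbers, not Moortec settings.

```python
from typing import Callable


def lowest_functional_mv(passes_at: Callable[[int], bool],
                         v_max_mv: int = 1000, v_min_mv: int = 600,
                         step_mv: int = 10, guard_mv: int = 20) -> int:
    """Step the supply down from v_max_mv until the part fails, then
    return the last passing voltage plus a guard band (millivolts)."""
    if not passes_at(v_max_mv):
        raise ValueError("device fails even at the maximum voltage")
    last_pass = v_max_mv
    v = v_max_mv - step_mv
    while v >= v_min_mv and passes_at(v):
        last_pass = v
        v -= step_mv
    # Never recommend more than the maximum supply.
    return min(last_pass + guard_mv, v_max_mv)


if __name__ == "__main__":
    # Hypothetical device that stops working below 735 mV.
    print(lowest_functional_mv(lambda mv: mv >= 735))
```

The returned value is what would be stored in each device as its unique setting; a closed-loop AVS scheme would repeat a check like this in the field as temperature and aging shift the pass/fail point.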

There was even a use case where an enterprise data center used embedded chip monitoring to do real-time temperature monitoring to allow power optimization, provide a failure prediction of devices, and protect each CPU by providing a safety shutoff limit. This is a big deal for data centers because they are such large consumers of power from our electrical grid, and their projected growth is staggering. Today, data centers take about 2% of our total electricity, and with a CAGR of 12% they are projected to produce more greenhouse gas than airlines by 2020.

The actual monitoring IP from Moortec has both hard macros and a soft PVT controller as shown below:

This IP is already used in many nodes: 40nm, 28nm, 16nm, 7nm.

The number of monitors and their placement is dependent on each unique application, so the engineers at Moortec are happy to give you a hand on where to place Process, Voltage and Temperature monitors.

Summary
The challenges in our modern SoC chips can be met through the use of PVT in-chip monitors. You could try to create your own IP to do this, or just task the experts who have been doing this for over a decade and re-use their silicon-proven monitoring IP.

Q&A
Q: How do I test your IP?
A: Thermal – a reference is required to test accuracy, so this is done by probing on die, and we have test chip programs to ensure no self-heating. There’s only a 0.003°C temperature rise from adding a sensor. Yes, we have correlated silicon versus simulation data.

Q: Where do you store info coming out of PVT sensors?
A: Register sets in the control block. You would store output in your own SoC design, not in our IP.

Q: Is the voltage monitor immune to Vdd fluctuations?
A: Our voltage monitor is looking at Vdd supply across its full range, designed to be immune to ripples, and it’s robust.

Webinar Recording
To view the entire 42-minute webinar, visit this link.


Aldec and High-Performance Computing

by Bernard Murphy on 12-21-2017 at 7:00 am

Aldec continues to claim a bigger seat at the table, most recently in their attendance at SC17, the supercomputing conference hosted last month in Denver. I’m really not sure how to categorize Aldec now. EDA company seems to miss the mark by a wide margin. Prototyping company? Perhaps, though they have a much stronger focus on end-applications than a general-purpose prototyping solution, witness also recent attendance at the Trading Show in Chicago this year, where they were showing off platforms to support high-frequency trading (HFT).


In at least some of these applications it isn’t even clear that the Aldec solution is limited to prototyping. In low-volume applications (for example HFT), the Aldec boards may well be the final implementation. This is certainly apparent in some of the solutions they talked about at SC17: a DES code-breaker, a ViBe motion detector and a solution for short reads alignment in genome sequencing, as close to live applications as you can get.

Starting with the DES code breaker: I’m sure Aldec isn’t planning to enable hackers, and DES is no longer considered a secure encryption standard anyway. However, this demo is a good example of using Aldec boards to build accelerators. In this demo, they show off a brute-force code-breaker to crack 6144 56-bit DES instances in ~20 hours using their HES-HPC accelerator with 6 Xilinx UltraScale chips. That’s a pretty powerful demonstration of the level of computation that is possible in an FPGA-based accelerator.
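For a sense of scale, the back-of-the-envelope calculation below works out the key rate needed to sweep a 56-bit key space in 20 hours. Two assumptions are mine, not Aldec's: that the demo amounts to one exhaustive sweep of the key space, and that the 6144 figure refers to DES engines running in parallel.

```python
KEY_BITS = 56    # DES key length, from the article
HOURS = 20       # approximate run time, from the article
ENGINES = 6144   # assumption: read here as parallel DES engines


def required_rate(key_bits: int, hours: float) -> float:
    """Keys per second needed to try every key within the given time."""
    return 2 ** key_bits / (hours * 3600)


if __name__ == "__main__":
    total = required_rate(KEY_BITS, HOURS)
    per_engine = total / ENGINES
    print(f"{total:.2e} keys/s overall, {per_engine:.2e} keys/s per engine")
```

Under those assumptions the sweep needs roughly 10^12 keys per second in total, or on the order of 1.6 x 10^8 per engine, which is the kind of rate a pipelined engine testing one key per clock could plausibly sustain.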


A more directly applicable demo shows off ViBe-based motion detection. ViBe is a popular method to detect and subtract background in video sequences, making it especially important in detecting moving objects in video, for example other cars or pedestrians. In this example, they are processing 1080p video at 39 frames per second and using the same HES-HPC platform to run ViBe background subtraction in real-time. This would naturally be useful in ADAS and autonomous driving applications and would be equally useful in security/surveillance applications and autonomous drone applications as just a few examples.
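For readers who have not met ViBe, the following is a heavily simplified, pure-Python sketch of the per-pixel idea: keep a small set of past samples for each pixel, call the pixel background if enough samples lie close to the new value, and occasionally refresh the samples. It is not Aldec's FPGA implementation, it leaves out ViBe's neighbour propagation and neighbourhood-based initialisation, and the class name `SimpleViBe` and its default parameters are illustrative assumptions.

```python
import random
from typing import List

Frame = List[List[int]]  # grayscale image, pixel values 0..255


class SimpleViBe:
    """Simplified ViBe-style background model (no neighbour propagation)."""

    def __init__(self, first_frame: Frame, n_samples: int = 20,
                 radius: int = 20, min_matches: int = 2,
                 subsample: int = 16, seed: int = 0) -> None:
        self.radius = radius
        self.min_matches = min_matches
        self.subsample = subsample
        self.rng = random.Random(seed)
        # Each pixel keeps n_samples past values, seeded from the first frame.
        self.samples = [[[pixel] * n_samples for pixel in row]
                        for row in first_frame]

    def apply(self, frame: Frame) -> List[List[bool]]:
        """Return a mask that is True where a pixel looks like foreground."""
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, pixel in enumerate(row):
                history = self.samples[y][x]
                matches = sum(1 for s in history
                              if abs(s - pixel) <= self.radius)
                is_background = matches >= self.min_matches
                # Conservative update: only background pixels refresh the
                # model, and only with probability 1/subsample.
                if is_background and self.rng.randrange(self.subsample) == 0:
                    history[self.rng.randrange(len(history))] = pixel
                mask_row.append(not is_background)
            mask.append(mask_row)
        return mask


if __name__ == "__main__":
    background = [[100] * 4 for _ in range(3)]
    model = SimpleViBe(background)
    frame = [row[:] for row in background]
    frame[1][2] = 200  # a bright "object" appears at one pixel
    print(model.apply(frame))
```

Every pixel is handled independently with a handful of comparisons, which hints at why the method maps well onto FPGA fabric for real-time 1080p processing.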


Their third demo is one of the coolest uses of an accelerator I have seen, to accelerate gene sequencing. As I understand it, today sequencing a whole genome in one shot is still a challenging (and expensive) problem. Sequencing methods more widely available for production applications tend to do something called short reads, reading a small set (a few hundred base-pairs) at a time (base pairs being pairs of the famous nucleotides A, C, G and T). These must then be mapped to a reference genome through a process of approximate string matching. This way the sequencer flow can build up a reconstruction of the actual genome sequence.
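As a toy illustration of approximate string matching, the sketch below slides a short read along a reference and reports every position where it fits within a mismatch budget. It handles substitutions only (no insertions or deletions), uses brute force rather than the indexing a real aligner needs, and has nothing to do with how ReneGene works internally; the function name `map_read` is my own.

```python
from typing import List, Tuple


def map_read(reference: str, read: str,
             max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for every placement of the read on
    the reference that has at most max_mismatches substitutions."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mismatches = 0
        for ref_base, read_base in zip(reference[pos:pos + len(read)], read):
            if ref_base != read_base:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # over budget, give up on this position
        else:
            # Loop finished without exceeding the budget.
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    print(map_read("ACGTACGTTAGC", "ACGTT"))
```

The brute-force scan costs time proportional to reference length times read length for every read, which is hopeless at genome scale with millions of reads, and is exactly why acceleration matters here.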

Of course, there are several challenges in this task. First, human DNA (as an immediately interesting application) has about 3 billion base pairs. Second, you don’t expect an exact match to the reference genome. Mutations of various kinds are part of what makes us different and contribute to many ailments. There are also repeats/redundancies in the genome. Matching has to take account of all of these potential differences. But at the same time, it has to be super-accurate. Human genomes are 99.9% similar across all humans, so there’s really very little room for error.

ReneLife, a faculty enterprise of the Indian Institute of Science in Bangalore, has developed a solution (ReneGene) to sequencing short reads that is faster, more accurate and significantly more cost effective than existing solutions and they have done so building on an HES-HPC platform. They compare an earlier software version of their solution with existing solutions and show it is more accurate and faster when running on a supercomputer cluster supported by a GPU cluster. OK, but hardly scalable to mass usage (at an estimated cost of $400k/year). Then they ported their solution to an HES-HPC implementation, running at an annual cost of less than 1% of the supercomputing solution, and it runs faster still. That sounds like a very compelling option for mass-market deployment.

I have to believe there are many more applications that could benefit from massive acceleration, for which the economics of an ASIC solution (and the skill-sets required) don’t make sense. FPGAs are a perfect fit in these cases and ready-made accelerator boards are even better (qv Raspberry Pi, Adafruit, etc.). Aldec seems to align very well with these needs. Perhaps we should call their products application-specific accelerator platforms. ASAP – not bad and certainly closer than EDA to the mission that is apparent in their trade-show and customer footprint.


Test Compression for Mission Critical SoCs

by Mitch Heins on 12-20-2017 at 12:00 pm

With the advent of the Internet-of-Things (IoT), Industry 4.0, Cognitive Computing, and autonomous vehicles and robots, we are seeing an unprecedented number of systems-on-a-chip (SoCs) going into mission-critical applications. To handle the complexity of these applications, SoCs are being manufactured in leading-edge processes where manufacturing tolerances are being pushed to their limits. Not only are the devices more complex, but the processes required to manufacture them have more subtle defect mechanisms than in the past. All of this has led to designs with exploding test data volumes and associated testing costs that could threaten their viability.

The test and electronic design automation (EDA) industries have done a brilliant job so far to keep up with Moore’s Law through the introduction of testing technologies like SCAN synthesis, ATPG, Built-in-Self-Test (BIST), and Embedded Deterministic Test (EDT). EDT has scaled to well beyond the 100X range for data compression, but the demand for more compression continues as test pattern data volumes are driven by more complex design structures, greater combinatorial depth, more complicated clocking schemes, and the use of new fault models appearing at advanced nodes. Mentor, a Siemens business, recently released a new white paper that touches on these points and introduces their next generation of test compression dubbed Tessent VersaPoint Test Point Technology.

VersaPoint is a hybrid of Mentor’s Tessent TestKompress (TK) and Tessent LogicBIST methodologies that combines EDT and LBIST. EDT provides the high-quality test needed for mission-critical requirements while LBIST is used for in-system testing. Both these technologies (EDT and LBIST) use something known as ‘test points’ to improve the overall testability of a circuit. If you remember back to IC-Test 101, to be able to detect a fault on a given node, you must be able to both control and observe that node. Test points are extra logic inserted into a circuit that lets you do just that. See figure for examples of typical logic used for both control-type and observation-type test points.
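To see why a control-type test point helps, consider a node driven by a wide AND gate: it is 1 for only one input pattern out of all possible ones, so random patterns almost never set it. The sketch below uses the textbook forms of control points (an AND-type gate to force 0, an OR-type gate to force 1) and counts how often the node can be set to 1 with and without the test point enabled. These are generic illustrations, not Mentor's actual VersaPoint logic, and all function names are my own.

```python
from itertools import product
from typing import Sequence


def and_control_point(node: int, enable: int) -> int:
    """Control-0 test point: when enable is 1 the node is forced to 0."""
    return node & (1 - enable)


def or_control_point(node: int, enable: int) -> int:
    """Control-1 test point: when enable is 1 the node is forced to 1."""
    return node | enable


def wide_and(inputs: Sequence[int]) -> int:
    """A hard-to-control node: 1 only when every input is 1."""
    return int(all(inputs))


def count_ones(n_inputs: int, enable: int) -> int:
    """Count input patterns for which the test-pointed node evaluates to 1."""
    return sum(or_control_point(wide_and(bits), enable)
               for bits in product((0, 1), repeat=n_inputs))


if __name__ == "__main__":
    print("patterns giving 1, test point off:", count_ones(8, 0))
    print("patterns giving 1, test point on: ", count_ones(8, 1))
```

An observation-type test point is simpler still: it taps a hard-to-observe node into a scan cell so its value can be shifted out, which is why it is not modeled separately here.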

Before VersaPoint, inserting test points for both EDT and LBIST required a two-step process. VersaPoint enables a one-pass process in which test points for both types of tests are added concurrently. Both the analysis and the insertion steps for the test points can be performed on a gate-level netlist either before or after scan insertion. While this makes for a simpler test flow, more importantly it also provides better test compression results.

As evidence of the better compression achieved, Mentor added the following table of 15 designs ranging in size from 1.4M gates up to 23.3M gates, with an average size of 7.3M gates. The average compression ratio for these designs when using standard EDT methodologies is 46X. With the new VersaPoint test points, the average pattern count reduction achieved for these designs is 5.2X, versus only 3.9X when using EDT test points. This may not seem like much, but when you look at the overall compression achieved with VersaPoint test points you get a whopping 240X, compared with 46X from standard EDT alone. That’s a lot!
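The quoted figures line up if you assume the overall compression is simply the baseline EDT compression multiplied by the pattern count reduction. That multiplicative model is my reading of the numbers, not something stated in the white paper, and the function name is illustrative.

```python
def overall_compression(base_compression: float,
                        pattern_reduction: float) -> float:
    """Assumed model: fewer patterns multiply the baseline compression."""
    return base_compression * pattern_reduction


if __name__ == "__main__":
    edt_baseline = 46.0  # average EDT compression quoted in the article
    print("VersaPoint test points:", round(overall_compression(edt_baseline, 5.2)))
    print("EDT test points:       ", round(overall_compression(edt_baseline, 3.9)))
```

Under that assumption, 46X times 5.2X comes to about 239X, which matches the roughly 240X quoted, while the 3.9X reduction from EDT test points would correspond to about 179X.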


One key aspect of adding test points is to minimize any negative effects of the test points on circuit timing closure. Mentor has added several features to support this including the ability to exclude test points from any false and multi-cycle paths using a functional SDC file. Test points can also be excluded from critical paths extracted from static timing analysis and the number of control-type test points added to a single path can be limited to a specific value.

While VersaPoint test points give great results for Stuck-At Faults (SAF), they are also effective for all other types of fault models. The white paper has tables showing similar if not better results when using VersaPoint test points with Transition Delay Fault (TDF) models. VersaPoint test points also help when used with so-called ‘low-power ATPG’ algorithms that try to reduce switching activity. These low-power ATPG algorithms tend to create higher pattern counts, which can be mitigated by using VersaPoint test points.

All in all, the new VersaPoint capability seems impressive, and with the new era of mission-critical SoCs coming to market, it’s good news that Mentor continues to attack the test problem. Making sure SoCs for these types of designs are free of manufacturing defects has got to be high on a system design company’s list of priorities, which bodes well for Mentor’s Tessent family of products.

See also:
White Paper: Improving Test Pattern Compression with Tessent VersaPoint Test Point Technology
Mentor Tessent Products web page