GLOBALFOUNDRIES Ready for IPO in 2022?
by Daniel Nenni on 09-28-2019 at 6:00 am

Hard to believe, but it’s the 10th anniversary of GLOBALFOUNDRIES. What a journey this has been. It truly has been an honor to work with GF over the years as they invested many billions of dollars in the fabless semiconductor ecosystem and added a colorful chapter to semiconductor history, absolutely.

We have written hundreds of articles about GF that have been viewed by more than one million people. GF also has a chapter in our first book “Fabless: The Transformation of the Semiconductor Industry” which, in the 2019 update, documents the appropriately named GF pivot of 2018.

GF CEO Dr. Thomas Caulfield keynoted this year’s “Future of Innovation” event. Today GF has more than 16,000 employees and $6B in revenue, making them the second-largest pure-play foundry and the largest “specialty foundry”.

Tom made some interesting points in his opening:

  • World economy $85T
  • Electronics Industry $2T
  • Semiconductor Industry $475B
  • Semiconductor Foundry $63B

It really is interesting to know that five semiconductor foundries support the majority of an $85T world economy. Seriously, take away semiconductors and what do we have besides fire and the wheel?

It is also interesting to know that (according to LinkedIn) there are only 521,816 people worldwide who list themselves as “semiconductor related” professionals. So, congratulations to all of the hardworking semiconductor people like myself who made this miracle we call modern electronics possible.

Tom rightfully pointed out that 75% of the semiconductor devices shipping today are based on mature technologies (12nm and above) which is where the GF pivot has focused them. Tom also highlighted that the current GF output of 2.3M wafers ($6B in revenue) can be easily expanded to 3M wafers with an expected revenue of $8B. This is not a giant leap, in fact, I think that revenues will be even higher based on the platform strategy that was outlined in the presentation.

Please note that the equation in the figure above is a product (x) versus a sum (+), meaning that if any one of the variables is 0 the result is zero. This plays directly to the fabless systems companies, which are the richest customer segment for the foundries.

Tom mentioned 15 different platforms utilizing 14 application features and 1,000s of silicon proven IP which will result in thousands of specialized application solutions. Again, the target here is electronics systems companies that are making their own chips.

The most interesting news out of the conference however was that GF is planning a public offering in 2022.  I’m a big fan of disruptive moves and while the GF pivot of 2018 was not what I would call disruptive, this IPO certainly will be.

A technology company IPO is definitely a rite of passage into corporate adulthood as it comes with a healthy level of transparency. Given the open media (fake news) we have today, it is far too easy to become delusional from PR gone wild. Wall Street, however, is less easily fooled if you are playing by their rules, absolutely.

The big swing here is the legal action GLOBALFOUNDRIES filed against TSMC and some of their top customers. If the Wall Street bankers can make a silk purse out of this sow’s ear, some serious IPO money will change hands and Abu Dhabi can finally put this one in the win column.

The semiconductor industry is full of incredibly smart people and it is an honor to work amongst them. One thing I can tell you is that the moves GF has made since Tom took the helm have been rock solid so I would not bet against him, not today.


Chapter Twelve – The Future
by Wally Rhines on 09-27-2019 at 6:00 am

The content of this book has focused upon the predictability of trends in the semiconductor industry based upon past trends, experience, and ratios.  What about newly emerging applications of semiconductors?  After all, the entire history of the semiconductor industry has been driven by the emergence of new applications.

Artificial Intelligence
One of the most exciting new applications affecting semiconductor technology is the broad adoption of AI related analytics. AI is not a new technology.  Figure 1 is the cover of High Technology magazine in July 1986.  I am the person on the left and George Heilmeier, former head of DARPA, is the one on the right.  We tried hard in the 1980s but the infrastructure had not developed to a level where AI would provide profitable opportunities.

Figure 1. Artificial Intelligence technology heavyweights of the 1980s

What’s different today? In the 1980s, we lacked the computing power to handle big data.   We didn’t have very much big data to analyze partly because there was no internet of things. More sophisticated algorithms were needed.  Most of all, there were no obvious near term ways to make money using AI.

Today we have overcome all these limitations.  AI and machine learning have taken on a life of their own.  They have become limited, however, by the processing power available.  Traditional von Neumann general purpose computing architectures are inadequate to handle the complex AI algorithms. The result: a new generation of computer architectures is evolving.

Figure 2 shows the trend in venture capital funded fabless semiconductor companies in recent years. In 2018, a new record of $3.4 billion total investment was set, far above the previous record of $2.5 billion in the year 2000.

Figure 2.  Venture capital funded fabless semiconductor startups

Venture capitalists have been focused on social media and software companies over the last twenty years.  Where is all this new money going?  The answer can be seen in Figure 3. AI and machine learning have dominated the fabless semiconductor industry investment by venture capitalists since 2012 with $1.9 billion invested.

What kind of chips are being funded?  The largest share is in the area of pattern recognition.  Chinese investments in facial recognition chips developed at companies like Sensetime and Face++ constitute a very large share.  There are seventy-five other companies developing chips for pattern recognition that have been funded by venture capital.  These include companies focused on pattern recognition for audio, smell and other applications.

Figure 3.  Venture funded startups since 2012 by application. AI and machine learning constitute the majority of applications

Beyond pattern recognition, the largest share of new fabless semiconductor companies is being funded for data center analytics or edge computing.

Edge Computing
Intelligence historically flows downhill (Figure 4). In the 1960s, mainframe computers dominated our computing capacity.  Dumb terminals connected us to our mainframe computing power.  By the 1980s, minicomputers were well established as an intermediate computing layer between the user and the mainframe.  Twenty years later, the personal computer became the local computing resource.  In another twenty years, the current environment has evolved.  Large cloud-based server resources handle the heavy computing but in between us and the cloud is the fog made up of gateways that collect, aggregate and locally process data.  Beneath that layer are the edge nodes in the mist, collecting and pre-processing the data.  As time passes, the lower layers will inevitably gain more intelligence as semiconductor technology allows us to build more intelligence into the local nodes. Those nodes will become increasingly complex as they incorporate disparate technologies – analog, digital, RF, MEMs, etc.  (Figure 5).  This creates major design and verification challenges.  Most of EDA history is focused upon digital logic and memory.  Edge nodes may require mixed technologies.  Simulating digital logic connected to analog, RF and other technologies is not easy.  A whole new family of EDA tools is required.

Figure 4.  Intelligence flows downhill

5G Wireless Technology
In the next decade, wireless communication will move to the next generation of technology, 5G.  This transition is more significant than past generations.  It affects a wide variety of the infrastructure beyond consumer communications.  Significant impact will be felt in applications involving industrial control, non-real time automotive analytics, urban infrastructure and much more.

One of the great opportunities for the semiconductor industry is the increased number of base stations required to support the infrastructure of 5G and the larger number of antennas in a phone. The number of semiconductor components required will grow dramatically as the world builds a 5G infrastructure. Connectivity to the cloud makes a wide variety of capabilities possible, especially in the factories of the world.  Gateways, which already generate more than three percent of worldwide semiconductor revenue, will be needed.

This connected world will be dependent upon more semiconductors for communications and computing.  For many years the semiconductor industry measured its revenue from the computing and communications industries which were each about 35% of the total.  Now the two are indistinguishable.  Seventy percent of the revenue in the semiconductor industry comes from one or the other or a unique combination of both.

Figure 5.  Diverse technologies like digital, analog, RF and MEMs will be required as edge nodes become more intelligent

Automotive Applications
During the last ten years, sales of semiconductors for automotive applications have increased to about 12% of the total semiconductor market as the electronic content of vehicles increased.  Some traditional electronic functions like engine control will not be needed in electric vehicles, but there will be new requirements as well as the continued growth of infotainment and advanced driver assistance systems (ADAS) that require electronic controls.

Figure 6.  As of June 2019, 463 companies have announced intent to introduce electric cars or light trucks.  211 companies have announced autonomous drive programs

The number of companies planning to build electric cars or light trucks has now grown to 463, more than half of which are based in China (Figure 6).  Two hundred eleven companies have announced autonomous driving programs.  This number of suppliers is not needed and many, or even most, will drop out as we move closer to production. Meanwhile, one would expect an incredible bubble in demand for automotive ICs followed by a rapid decline.

It’s likely that no more than a dozen companies will lead the way in supplying the complex image processing subsystems required for autonomous vehicles.  It’s difficult to predict which ones will succeed but likely that companies that have not been traditional automotive OEMs will make up most of the total.

Other Predictable Futures
Lots of other technologies offer promise for growth.  Quantum computing is interesting because it has some capabilities, like encryption, that are not solved easily through other means.  The time lag for technologies like this tends to be longer than for the evolutionary ones, but they will eventually emerge in some form.

The history of the semiconductor industry is driven by major new applications.  Waves of growth were initiated by defense electronics in the 1950s, mainframe computers in the 60s, minicomputers in the 70s, personal computers in the 80s, laptops in the 90s and wireless communications in the most recent two decades. Each wave has been accompanied by the emergence of new semiconductor competitors followed by a shakeout that leaves one supplier dominant and shuffles the top ten ranking of companies by revenue (Figure 10 in Chapter 5).

At the same time, the semiconductor industry, like most industries, moves back and forth between standardized and customized solutions.  This has been referred to as “Makimoto’s Wave” after Tsugio Makimoto, former CEO of Hitachi Semiconductor, who popularized the phenomenon.  As we move into the third decade of the twenty-first century, the semiconductor industry is moving into a customization wave.  Standard von Neumann computer architectures that operate on a string of standard instructions have dominated the computer and semiconductor industries.  Architectures like the Intel 808X and ARM RISC will continue.  Domain-specific architectures tailored for specific tasks like facial recognition are emerging. There will be dozens more as AI and machine learning usher in new opportunities.

What should we consider as the future possibilities for the semiconductor industry? As we showed in Chapter 4, the semiconductor industry is likely to grow through evolutionary means through about 2040, or when demands for lower power or higher performance usher in a new technology for information “switching”.  Carbon nanotubes, bio-switches, or many other possibilities could fill in the switching learning curve of Figure 5, Chapter 3.  Chances are that this “switch” will happen gradually as the need arises for a new application.  In addition, non-silicon materials like Gallium Nitride, Silicon Carbide and others will take on increasingly important roles driven by the need for characteristics like larger band gaps, in roles like power switching and microwave communications as well as existing ones like solid-state lighting.

Just as steel is still a primary material for construction one hundred fifty years after the booming growth of the steel industry, semiconductors will be at the foundation of business and technology growth for a long time.  Those of us who participated in the last fifty years of exciting growth of semiconductors are still surprised when we see our “mature” industry generate another wave of growth to accompany an emerging application.  I’m confident that there will be many more to come.


AI Hardware Summit, Report #1: Doing More to Cost Less
by Randy Smith on 09-26-2019 at 10:00 am

I recently had the pleasure of attending the AI Hardware Summit at the Computer History Museum in Mountain View, CA. This two-day conference brought together many companies involved in building artificial intelligence solutions. Though the focus was on building the hardware for this area, there was naturally much discussion around software and applications as well. The first session I want to summarize was presented by Dr. Carlos Macián, Senior Director, AI Strategy and Products at eSilicon.

When I saw an eSilicon presentation on the agenda, naturally I assumed it would be about their recently announced neuASIC™ IP platform. If you don’t know about that yet, you may want to read about their AI IP platform first. Instead, we were treated to a much broader presentation on controlling the total cost of ownership (TCO) of an AI hardware solution. The presentation was quite insightful and showcased just how much depth and experience eSilicon has when it comes to building these types of ASIC products.

TCO is an important concept. When deciding how to address the challenges of building a hardware solution for a specific AI application, one needs to understand how each decision affects the total cost of the product. Some decisions carry more cost in area (die cost), yield (die cost), effort (person-hours), quality (sales, reputation, returns, etc.), power (packaging and other costs) and so many other factors. The list of traits and their associated costs is quite long. Given that most companies should have a grasp of the common TCO drivers, this presentation focused on the key items to consider for state-of-the-art AI products.

From the slide above, you can see that AI designs for data centers have some familiar drivers that are exacerbated by the need to move to massive parallelism – hyperscale. Hyperscale computing refers to the systems and architecture in distributed computing environments that must efficiently scale from a few servers to thousands of servers. Hyperscale computing is used in environments such as big data and cloud computing – today’s massive data centers.

Carlos clearly explained the biggest challenges to AI hyperscale implementation, along with the enabling technologies that have been rolled out at several companies now. Recent announcements, such as Intel’s announcement at HotChips of their Lakefield processor built using Foveros 3D technology, are a clear sign that these technologies are available now. The challenge is to find a partner who understands all of these enabling technologies, something that eSilicon has already demonstrated.

The presentation then went on to focus on an example of solving these AI design challenges by utilizing one of the enabling technologies – 3D memory overlays. The presentation demonstrated that if you stack parts of the solution vertically (e.g., xRAM, SRAM+IO, and compute) on different die in the same package, there are huge efficiencies to be gained. One dramatic gain is yield. Manufacturing several smaller die that can be stacked increases yield dramatically. In the example shown at the event, yield improved from 15.7% to 68.6%. This yield improvement provides a tremendous decrease in the cost of production and therefore a dramatic improvement in the TCO.
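
As a rough illustration of why splitting one large die into several smaller stacked die improves yield, here is a minimal sketch using a simple Poisson defect-density model; the area and defect-density numbers are illustrative assumptions, not the inputs behind the figures quoted at the event.

```python
import math

def die_yield(area_cm2, defect_density_per_cm2):
    """Poisson yield model: probability that a die of a given area has zero random defects."""
    return math.exp(-area_cm2 * defect_density_per_cm2)

# Illustrative numbers only -- not the actual areas or defect densities behind
# the 15.7% -> 68.6% example shown at the event.
D0 = 0.5                    # defects per cm^2
monolithic_area = 3.7       # cm^2, single large die

print(f"monolithic die yield : {die_yield(monolithic_area, D0):.1%}")   # ~15.7%

# Partition the same function across four smaller die stacked in one package.
small_area = monolithic_area / 4
print(f"per-small-die yield  : {die_yield(small_area, D0):.1%}")        # ~63%

# With known-good-die testing before stacking, only passing die are assembled,
# so the effective silicon yield tracks the much higher per-die number (minus a
# small assembly/bonding loss), which is where the production-cost and TCO gain comes from.
```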

Despite the difficulties some will encounter in getting these hyperscale AI designs to function at a reasonable cost, I think eSilicon has shown it has the expertise to get them across the finish line. They also disclosed that they are already working with suppliers on the next set of challenges as the degree of scaling increases – new die bonding technologies, vertical signal density, thermal density, combined yield, and many others. I will be anxious to hear more on these items when eSilicon is ready to discuss them.

eSilicon seems well prepared to deliver AI hardware designs. You can learn more about their neuASIC AI capabilities here. You can learn more about their 2.5D/HBM2 packaging solutions here. As I have mentioned before, as an IP vendor I referred my licensees to eSilicon, and their success led to our clients getting to volume quickly. That is why I recommend them highly.


Virtually Verifying SSD Controllers
by Bernard Murphy on 09-26-2019 at 5:00 am

Datacenter storage

Solid State Drives (SSDs) are rapidly gaining popularity for storage in many applications, from gigabytes of storage in lightweight laptops to tens or hundreds of terabytes per drive in datacenters. SSDs are intrinsically faster, quieter and lower-power than their hard disk drive (HDD) equivalents, with roughly similar lifetimes, though SSDs are (currently) more expensive. All appealing characteristics in a datacenter, perhaps in some suitable mix with cheaper HDDs. However, there are other challenges with SSDs which make them in some ways more difficult to manage.

The memory cells inside an SSD wear out through repeated read/write/erase actions. Also writes to an SSD are at a page or block level (I’ll use block from here on for simplicity). You can’t just update one word; if you want to update a block already containing data you have to copy the block to SRAM, make the update, write the updated block to a new empty block and mark the original block for deletion. So far no problem, but those marked blocks have to be deleted so they can be recycled back into the system. That’s handled by garbage collection, which the controller will run in the background to avoid slowing down host reads and writes.
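
To make the out-of-place write and garbage-collection behavior concrete, here is a toy sketch of a flash translation layer; the block granularity, data handling, and collection policy are deliberately simplified illustrations, not how any particular controller is implemented.

```python
class ToySSD:
    """Minimal flash-translation-layer sketch: logical blocks map to physical blocks,
    updates go to a fresh block, and stale blocks wait for garbage collection
    before they can be reused."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # erased blocks ready to write
        self.map = {}                                  # logical block -> physical block
        self.stale = []                                # marked-for-erase, not yet reclaimed

    def write(self, logical_block, data):
        if not self.free:
            raise RuntimeError("write stalls: waiting for garbage collection")
        new_pb = self.free.pop()
        old_pb = self.map.get(logical_block)
        if old_pb is not None:
            self.stale.append(old_pb)   # the old copy cannot be overwritten in place
        self.map[logical_block] = new_pb
        # (a real controller would program `data` into new_pb here)

    def garbage_collect(self):
        # Background task: erase stale blocks and recycle them into the free pool.
        while self.stale:
            self.free.append(self.stale.pop())

ssd = ToySSD(num_physical_blocks=4)
ssd.write(0, "v1")
ssd.write(0, "v2")    # update: consumes a new physical block, old one becomes stale
ssd.garbage_collect()
print(len(ssd.free), len(ssd.stale))   # free blocks recovered once GC runs
```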

There’s an obvious challenge here when write-traffic becomes significant and scattered. Demand for new blocks to write can exceed the pace at which marked blocks are being deleted, in which case writing stalls waiting for garbage collection to catch up. And the supposed fantastic performance of the SSD takes a hit until the backlog is cleared. Which is not great for XaaS providers who want to claim reliably superior throughput.

In managing this problem, storage technologists have come up with a concept of predictable latency through which long tails in this distribution can be limited or even eliminated. Characterizing this latency for a new controller design under a wide range of demands obviously requires running with a host model which will faithfully represent realistic datacenter traffic as a driver. Here Mentor has further extended its VirtuaLAB concept for Veloce emulation to provide an SSD reference lab, providing all the necessary virtualized operation components, allowing for a host OS such as QEMU, along with debug interfaces. The controller model runs on the Veloce emulator.

What I find particularly interesting about this is the natural fit of a virtualized version of the production storage interface working together with the emulator based DUT model. In the right contexts I’m a fan of ICE-based modeling, where you connect the emulator to real hardware to get all the real-life variability and odd behavior you will have to accommodate. But dealing with massively complex system loads by building large hardware test systems is impractical and inevitably very incomplete. Here virtualized modeling seems a better solution, given easier scalability to a wide range of applications. This is similar, I think, to the VirtuaLAB interface Mentor has with Ixia/Keysight for network testing under a wide range of possible loads.

None of which means you’re going to get everything right pre-silicon in this kind of modeling. I’m not sure the old “first-silicon” goal applies any more in complex system devices. But you can shake out a lot of problems to ensure that validation with that first silicon build will be catching real-life corner-cases and not problems you should have caught in design.

You can read Mentor’s paper on this capability HERE.


High-Speed PHY IP for Hyperscale Data Centers
by Tom Dillinger on 09-25-2019 at 10:00 am

A new designation has recently entered the vernacular of the computing industry – a hyperscale data center.  The adjective hyperscale implies the ability of a computing resource to scale corresponding to increased workload, to maintain an appropriate quality of service.

The traditional enterprise data center is often characterized as a back room warehouse of data processing and storage resources, with components of varying capacity and performance.  Customers commonly request resource allocations.  There is typically a long leadtime for hardware upgrades and resource growth.

Conversely, the hyperscale data center is by nature modularized and distributed.  The large cloud computing service providers are the models usually associated with hyperscale data centers, yet any IT operation with the following characteristics would apply:

  • modular facilities for power and cooling delivery

An analogy for the modularity of a hyperscale data center would be the construction of a housing development, where the overall facilities infrastructure is divided into phases, each consisting of individual building lots.

  • workload balancing

The footprint of the hyperscale data center assumes a typical thermal dissipation, to provide the facility cooling – planning for cooling to support maximal dissipation throughout the center would be cost-ineffective.  Balancing the utilization of resources involves thermal monitoring and support for workload relocation.

  • high availability

Hyperscale architectures include the capability to replicate/restart workloads across servers, in case of a failure.

The modularity in hyperscale data center architectures is associated with the ubiquitous server rack, as depicted in the figure below.

Figure 1.  Common server rack hardware configuration, illustrating optical module or Direct Attach Copper connectivity to the Top-of-Rack switch.  (Source:  Synopsys)

Top of Rack (ToR) is a common position for the network switch hardware.  The figure above also indicates the increasing network switch bandwidth required – e.g., 25.6 Tb/sec and 51.2 Tb/sec – and the network interface card technologies used in these rack configurations.

When describing the connection bandwidth, the key parameters are:

  • serial (SerDes) data rate and the number of SerDes lanes

The effective datarate is reduced (slightly) from the SerDes specification due to the additional bits added to the payload as part of the data encoding algorithm.

  • insertion loss and crosstalk loss of the connection medium, and the range of the connection

The key overall specification to achieve is the bit error rate (BER), which is determined by a number of factors – e.g., Tx equalization, Rx adaptation to optimize signal sampling time, and especially, the frequency-dependent insertion loss and crosstalk interference of the connection.

For these very high-speed data rates, individual specifications for these losses are often provided (the loss acceptance mask versus frequency), for different configurations – e.g., chip-to-chip (short reach); backplane with 2 connectors (~1m), and Direct Attach Copper cable (~3m).  Increasingly, above a 100 Gbps serial rate, copper cabling in the rack may be displaced by low-cost optics and/or a transition of the network switch to a middle-of-rack (“MoR”) position.
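
To put numbers on the encoding-overhead point above, here is a quick back-of-the-envelope sketch; the 64b/66b-style coding factor is an illustrative assumption, since the exact overhead (line coding plus FEC) depends on which IEEE 802.3 variant is implemented.

```python
def effective_throughput_gbps(lane_rate_gbps, num_lanes, coding_overhead=66/64):
    """Aggregate payload throughput after line-coding overhead.
    coding_overhead > 1 means extra bits are sent on the wire per payload bit."""
    return lane_rate_gbps * num_lanes / coding_overhead

# Example: a 400G-class port built from 4 lanes of 112G PAM-4 SerDes, assuming
# (for illustration) 64b/66b-style coding; real 400GbE variants add RS-FEC and
# other framing overheads on top of this.
print(f"raw lane bandwidth: {112 * 4} Gb/s")
print(f"effective payload : {effective_throughput_gbps(112, 4):.1f} Gb/s")
```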

I had the opportunity to chat with Manmeet Walia, Senior Marketing Manager for High-Speed PHY Development at Synopsys, about the characteristics of the hyperscale data center, the increased data communications bandwidth, and the ramifications of these trends on hardware design.

“There are several key trends emerging.”, Manmeet indicated.  “For improved efficiency, Smart network interface cards (“SmartNIC”) are being offered, with additional capabilities for network packet processing to off-load the host.”  

“The intra-rack bandwidth requirements are increasing – 56G and 112G Ethernet are required.”, Manmeet said.   The figure below highlights how these IP are used in support of various aggregate Ethernet speeds, using multi-lane configurations.  The targets for bandwidth between data centers are also shown below.

Figure 2.  Evolution of Ethernet speeds, targets for DC-to-DC bandwidth.  (Source:  Synopsys)

“Switch designs are integrating electro-optical conversion and optical fiber connectivity for the Ethernet physical layer even in medium- and short-range configurations.  Inter-rack and data center-to-data center bandwidth must also increase to accommodate the network traffic.”

Manmeet provided the figure below to illustrate how electro-optical conversion is transitioning from a distinct network card module to an integral part of advanced packages, with optical fiber used locally.  (The electrical SerDes signal conditioning retiming functionality required at high data rates is thereby eliminated.)

Figure 3.  Electro-optical conversion transition from a module to an integrated function.  (Source:  Synopsys)

He continued, “The 56G and 112G Ethernet communications requires PAM-4 signaling – conventional NRZ signal transitions for these networking applications maxes out at 28G.”

Figure 4.  PAM-4 Signaling Eye Diagram and Test Challenges.   (Source:  Teledyne LeCroy)

Briefly, PAM-4 signaling implies there are 4 different possible voltage levels to be sensed at the center of the signal eye.  The PAM-4 signal sensing window is therefore 1/3rd of the NRZ (PAM-2) height.  The linearity of the 4 signal levels is a critical parameter.
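
A minimal sketch of the Gray-coded symbol mapping and the resulting per-eye opening may help; the level values below are normalized illustrations, not a specific transmitter’s specification.

```python
# Gray-coded PAM-4: two bits per symbol, four normalized voltage levels.
PAM4_LEVELS = {"00": -3, "01": -1, "11": +1, "10": +3}

def pam4_symbols(bits):
    """Map an even-length bit string onto a sequence of PAM-4 levels."""
    return [PAM4_LEVELS[bits[i:i + 2]] for i in range(0, len(bits), 2)]

full_swing = max(PAM4_LEVELS.values()) - min(PAM4_LEVELS.values())   # 6 units
eye_height = full_swing / 3      # three stacked eyes between adjacent levels

print(pam4_symbols("0011100100"))                                    # [-3, 1, 3, -1, -3]
print(f"PAM-4 eye height is {eye_height / full_swing:.0%} of the NRZ eye")   # 33%
```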

As with an NRZ serial signal, minimizing crosstalk noise is crucial, especially with the reduced voltage sense differences with PAM-4.

The time sampling in the center of the eye opening for the PAM-4 signal is more complex, as well.  The jitter at the edges of the eye is magnified in PAM-4, due to the varying transitions between individual levels in successive unit intervals.  Separate Tx and Rx clock sources are used for training and auto-negotiation, potentially on a per-lane basis.  Additionally, there is a common requirement to support varying speed settings, again potentially for each lane.

The IEEE 802.3cd working group is establishing 56G/112G Tx/Rx signal specifications and PAM-4 standards for various network topologies, from ultra-short to long-reach, and for shielded/balanced copper and optical fiber cables.

“What’s new in the Synopsys PHY group?”, I asked.

Manmeet replied, “At the TSMC Open Innovation Platform symposium, we are highlighting our N7 56G and 112G PHY IP.  We are also providing reference design evaluation hardware and software evaluation platforms for customers.”

Manmeet included the following roadmap for large network switch SoCs, integrating 256 lanes of the 56G and 112G PHY’s.

Figure 5.  Example configuration of high-speed PHY’s, for large network switch SoC designs.  (Source:  Synopsys)

“The 56G PHY IP is provided in an X4 lane increment.  The DesignWare Physical Coding Sublayer (PCS) enables the networking protocol to span a wide range of data rates.  The 112G PHY is offered in an X1 lane unit, with similar PCS flexibility.”

Figure 6.  Synopsys 56G and 112G Ethernet PHY implementation architecture.  (Source:  Synopsys)

Manmeet added, “Synopsys will be providing customers with additional design materials.  SoC physical layout PHY tiling and power delivery recommendations are included, based on the results of our package escape studies.  IBIS-AMI models are provided.  A software toolkit enables evaluation of Tx and Rx settings and signal eye characteristics.  A test chip evaluation card is also available.”

 

Figure 7.  Design kit materials for 56G and 112G IP evaluation:  IBIS-AMI model, software toolkit for lab evaluation, reference card.   (Source:  Synopsys)

 

Several trends are clear:

  • Computing models are rapidly adopting the characteristics of flexible “hyperscale” data centers.

  • The volume of network traffic demands an increase in bandwidth to 400G, enabled by the availability of 56G and 112G Ethernet PHY IP utilizing PAM-4 signaling, whether for short-reach or long-reach configurations, utilizing copper or optical cable physical layer interconnects. (For ultra-short reach interfaces, these Synopsys PHYs also include options to optimize power for low-loss channels.)

  • The complexities of integrating a large number of high-speed 56G/112G Ethernet PAM-4 SerDes lanes to optimize signal losses and crosstalk require more than just “silicon-proven” test chips from the IP provider. A strong collaboration between customers and the IP provider is needed to adapt to the SoC metallization stack and to leverage the available software/hardware reference materials.

Here are links to additional information that may be of interest.

Synopsys 112G Ethernet PHY IP press release –  link.

Article:  “Shift from NRZ to PAM-4 Signaling for 400G Ethernet” – link.

Youtube video on 7nm 56G Ethernet PHY IP performance results – link.

Synopsys DesignWare 112G Ethernet PHY IP – link.

 

-chipguy

 


How Should I Cache Thee? Let Me Count the Ways
by Bernard Murphy on 09-25-2019 at 5:00 am

cache hierarchy

Caching intent largely hasn’t changed since we started using the concept – to reduce average latency in memory accesses and to reduce average power consumption in off-chip reads and writes. The architecture started out simple enough, a small memory close to a processor, holding most-recently accessed instructions and data at some level of granularity (e.g. a page). Caching is a statistical bet; typical locality of reference in the program and data will ensure that multiple reads and writes can be made very quickly to that nearby cache memory before a reference is made outside that range. When a reference is out-of-range, the cache must be updated by a slower access to off-chip main memory. On average a program runs faster because, on average, the locality of reference bet pays off.

For performance, that cache memory has to be close to the processor and therefore small; it can only hold a limited number of instruction or data words. As programs and data sizes got bigger, added levels of caching (still on-chip) became important. Each holds a larger amount of data at the price of progressively slower access, but still much faster than main memory access. These levels of cache were named, imaginatively, L1 (for level 1, closest to the processor), through L2, L3 and in some cases L4. Whichever of these is closest to the external memory interface is called the last-level cache.
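
A quick way to see why each added level still pays off is the standard average memory access time (AMAT) calculation; the latencies and miss rates below are round illustrative numbers, not measurements of any particular processor.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.
    `levels` is a list of (hit_latency_cycles, miss_rate) ordered from L1 outward;
    an access that misses a level pays that level's latency and falls through to the next."""
    total, p_reach = 0.0, 1.0
    for hit_latency, miss_rate in levels:
        total += p_reach * hit_latency
        p_reach *= miss_rate
    return total + p_reach * memory_latency

hierarchy = [
    (4,  0.10),   # L1: 4-cycle hit, 10% of accesses miss
    (12, 0.40),   # L2: 12-cycle hit, 40% of L1 misses also miss here
    (40, 0.50),   # L3 (last-level cache): 40-cycle hit
]
print(f"AMAT with 3 cache levels: {amat(hierarchy, memory_latency=200):.1f} cycles")   # ~10.8
print("AMAT with no caches     : 200.0 cycles")
```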

Then we started to see multi-processor compute. Each processor wants its own cache for performance but they all need to work with the same main memory, introducing a potential coherence problem. If processor A and processor B read/write to the same memory address, you don’t want them getting out of sync because actually they’re each working with a local copy in their own caches.

To avoid chaos, all accesses still have to be synchronized with the central main memory address space. Delivering this assurance of coherence between caches depends on a lot of behind-the-scenes activity to snoop on locations written/read in each cache, triggering corrective action before a mismatch might occur.

For SoCs, Arm provides a cache-coherent communications solution between their processor IPs called the DynamIQ Shared Unit (DSU), solving the problem for fixed clusters of Arm CPUs. But for the rest of the SoC you need a different solution. Think of an AI accelerator for example, used to recognize objects in an image. Ultimately the image (or more likely stream of images) is going to come through the same memory path for processing by AI and non-AI functionality alike.

So AI accelerator accesses must be made coherent with the rest of the coherent subsystem. Arteris IP provides a solution for this through their Ncore cache coherent NoC interconnect, which provides proxy caches to coherently connect non-coherent AI accelerators with the SoC coherent subsystem. You can do this with multiple accelerators, even multiple AI accelerators, a configuration which apparently is becoming popular in a number of devices.

Next consider that a number of these accelerators are becoming pretty elaborate and pretty big in their own right. Now the Ncore interconnect is not just a way to connect the accelerator to the SoC cache subsystem but a full coherent interconnect supporting multiple caches inside the accelerator.

This is needed because AI accelerators depend even more heavily on localized memory for throughput; a grid-based accelerator might have local cache associated with each processing element or group of processing elements. This coherent interconnect performs a similar function to the Arm coherent DSU but inside a non-Arm subsystem and using a NoC architecture which can more practically extend over long distances.

Then think further on those long distances and the fact that some of these accelerators are now reticle sized. Maintaining coherence directly over that scope would be near-impossible without significantly compromising the performance advantages of caching. What does any good engineer do at that point? They manage the problem hierarchically with multiple domains of local coherence connecting to higher-level coherent domains.

Finally and more back down to earth, what about the last-level cache, the one right before you have to give up and go out to main memory? Arteris IP provides the CodaCache solution here which can sit right by the memory controller. This is highly configurable for size, partitioning, associativity and even scratchpad memory to allow AI users to optimize tuning to the dataflow they expect to have with their AI app, perhaps to pre-fetch data they know they will need soon.
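
To illustrate the kind of knobs being described – total size, associativity, and scratchpad partitioning – here is a minimal address-mapping sketch; the parameter names and the way-based scratchpad split are hypothetical illustrations, not CodaCache’s actual configuration interface.

```python
def cache_geometry(total_bytes, line_bytes, ways, scratchpad_ways=0):
    """Derive the set count for a set-associative cache, optionally reserving some
    ways as software-managed scratchpad (a hypothetical way-partitioning scheme)."""
    sets = total_bytes // (line_bytes * ways)
    return {
        "sets": sets,
        "cache_ways": ways - scratchpad_ways,
        "scratchpad_bytes": scratchpad_ways * sets * line_bytes,
    }

def locate(address, line_bytes, sets):
    """Split a physical address into (tag, set index, line offset)."""
    offset = address % line_bytes
    index = (address // line_bytes) % sets
    tag = address // (line_bytes * sets)
    return tag, index, offset

geom = cache_geometry(total_bytes=8 * 1024 * 1024, line_bytes=64, ways=16, scratchpad_ways=4)
print(geom)                                    # 8192 sets, 12 cache ways, 2 MB scratchpad
print(locate(0x4A3F1240, 64, geom["sets"]))    # where one example address would land
```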

Caching has come a long way from those early applications. Arteris IP is working with customers in each of these areas. You can learn more about their Ncore cache coherent interconnect HERE and CodaCache HERE.


Semiconductors back to growth in 2020
by Bill Jewell on 09-24-2019 at 11:30 am

Semiconductor market forecasts 2019 to 2021

The global semiconductor market is headed for the largest decline in 18 years. The market dropped 32% in 2001 when the Internet bubble burst. The 2019 decline should be around 15%, the third largest annual drop after 2001 and a 17% drop in 1985. The current weakness is largely due to excess memory capacity (DRAM and NAND flash) relative to demand. WSTS expects the memory market will decline 31% in 2019 while semiconductors excluding memory will decline only 4%. The market weakness is also due to an uncertain global economy and weakness in key demand drivers.

The second half of 2019 is showing signs of a turnaround. Below are the 2nd quarter 2019 revenues reported by major semiconductor companies and their guidance for 3rd quarter 2019. The non-memory companies showed revenue growth in 2Q 2019 versus 1Q 2019, ranging from 0.3% for Qualcomm to 16.2% for Nvidia. Most memory companies (SK Hynix, Micron and Toshiba) experienced revenue declines; Samsung was the exception, growing 11.2%. A few companies expect healthy growth in 3Q 2019 revenues, ranging from 9.1% for Intel to 15.3% for STMicroelectronics. TI, Infineon and NXP project low single digit growth. Micron and Qualcomm expect revenue declines.

Recent semiconductor market forecasts call for 2019 to decline about 13% to 15%. We at Semiconductor Intelligence are sticking with our June forecast of a 15% decrease. Forecasts for 2020 are in a relatively narrow range, from 4.8% by WSTS to 8% by Semiconductor Intelligence. For 2021, IHS Markit projects accelerating growth to 10% from 6% in 2020. Our Semiconductor Intelligence forecast is for slightly slower growth in 2021 of 7% compared to 8% in 2020.

What are the drivers of the semiconductor forecast? One key is gross domestic product (GDP) which measures overall economic activity. The September forecast from Euromonitor shows a slowing of World GDP growth from 3.7% in 2018 to 3.1% in 2019. GDP growth picks up slightly to 3.3% in 2020 and 2021. The advanced economies (including the U.S., Euro area, UK, Canada and Japan) decelerate from 2.2% growth in 2018 to 1.7% in 2019 and 1.5% in 2020 and 2021. Two major factors behind the 2019 slowdown are the trade dispute between the U.S. and China and uncertainty over the UK’s exit from the European Union (Brexit). Growth is stronger in the emerging economies (including China, India, Russia, Southeast Asia and Latin America) at 4.3% in 2019, picking up to 4.6% in 2020 and 2021. Slowing growth in China is offset by accelerating growth in India, Southeast Asia and Latin America.

The two largest applications for semiconductors are smartphones and PCs/tablets. IDC projects smartphone units will decline 2.2% in 2019 after falling 3.4% in 2018. Growth should turn positive in 2020 and 2021 as 5G smartphones enter the market. Gartner expects the combination of PC and tablet units to decline 1.5% in 2019 after a 2.5% decline in 2018. The rate of decline will slow to 1.4% in 2020 and 0.6% in 2021.

Against this backdrop, the semiconductor market is unlikely to show strong growth in the next few years. Our Semiconductor Intelligence forecast of 8% semiconductor market growth in 2020 is largely due to a bounce back from the 15% decline in 2019. We expect growth to moderate to 7% in 2021 due to continued economic uncertainty and lackluster end equipment market growth.


Webinar of Recent NCTU CDM/ESD Keynote Talk by Dundar Dumlugol – Thursday September 26th
by Daniel Nenni on 09-24-2019 at 10:00 am

With many design teams still searching for an effective means of identifying Charged Device Model (CDM) issues early in the design process, it comes as no surprise that events on this topic generate a lot of interest and are well attended. In July Magwel’s CEO Dr. Dundar Dumlugol had the honor of being invited by Professor Ming-Dou Ker to speak at Taiwan’s National Chiao-Tung University (NCTU) on the topic of ESD and CDM event simulation. NCTU’s esteemed electrical engineering department serves as a center of expertise on the topic of ESD.

The session was well attended (myself included) and provided an excellent overview of the larger scope of ESD in general and on CDM specifically. Dundar reviewed the existing methods of analyzing an IC design for weaknesses that could lead to CDM related failures. He also discussed Magwel’s approach to solving these challenging design problems. During the talk he showed several examples of CDM discharge scenarios and what tools should be looking for in order to properly simulate the events.

The July seminar was recorded and can be replayed HERE. This is a good opportunity to learn about the most effective ways to prevent CDM failures in finished designs. Magwel has applied their expertise in solver-based extractors and dynamic simulation to develop their CDMi tool. CDMi simulates current flows during high speed CDM pulses and determines which ESD devices are triggered. Overvoltage or overcurrent violations are reported through an intuitive UI.

About Us
Magwel® offers 3D field solver and simulation based analysis and design solutions for digital, analog/mixed-signal, power management, automotive, and RF semiconductors. Magwel® software products address power device design with Rdson extraction and electro-migration analysis, ESD protection network simulation/analysis, latch-up analysis and power distribution network integrity with EMIR and thermal analysis. Leading semiconductor vendors use Magwel’s tools to improve productivity, avoid redesign, respins and field failures. Magwel is privately held and is headquartered in Leuven, Belgium. Further information on Magwel can be found at www.magwel.com


Cadence Celsius Heats Up 3D System-Level Electro-Thermal Modeling Market
by Tom Simon on 09-24-2019 at 6:00 am

A few years back people were saying that the “EDA” problem was solved and that design tools had become a commodity. At the same time people hailed ADAS, smart homes, mobile communication and AI as the frontiers of electronics.  Perhaps it could be said that layout tools, routers, placers, and circuit simulators had largely matured and often could be used interchangeably to successfully complete electronic designs. However, there were new challenges arising due to the increasing complexity of electronic systems – like the very ones used in cars, smart homes, mobile and AI.

At more advanced nodes and with increasing computational requirements, ICs have become bigger consumers of power. ICs are also being packaged tightly together in configurations using interposers, 3D ICs, etc. Heat buildup in these systems is a very real concern. IC performance and behavior are affected by temperature, and of course too much heat can lead to failure over time or rapidly in some cases. We all have experienced laptops heating up on our laps or cell phones warming up in our pockets. If the exterior is getting hot, what is happening on the inside, from silicon, to package, board, heat sink and chassis?

Cadence Celsius

System level thermal performance cannot be left to chance; it needs to be simulated and well understood during the design process. Up until now there has not been a solution for simulating thermal behavior at the system level. Cadence has just announced a system level electro-thermal simulator that includes multi-physics and interfaces with other design tools to effectively solve this problem. IC designs are in Virtuoso, packages are in SiP and boards are in Allegro. 3D assemblies come from 3D CAD packages. Celsius can import the full system design and then model thermal behavior.

Power information is obtained from Cadence Voltus, then steady state or transient thermal results can be obtained through Celsius. With accurate temperature information, Voltus can then provide more accurate power information, improving accuracy of the final results. Celsius uses Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD) to model internal thermal flux and airflow inside of the fully assembled system.
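
The power-temperature coupling that this Voltus/Celsius loop resolves can be sketched as a simple fixed-point iteration; the leakage growth factor and thermal resistance below are toy assumptions for illustration, not what the tools actually compute.

```python
def leakage_power(temp_c, p_leak_25c=2.0, growth_per_10c=1.3):
    """Toy leakage model: leakage power grows geometrically with temperature."""
    return p_leak_25c * growth_per_10c ** ((temp_c - 25.0) / 10.0)

def co_simulate(p_dynamic=8.0, theta_ja=2.5, ambient_c=25.0, tol=0.01):
    """Iterate power <-> temperature until the two are self-consistent.
    theta_ja is a lumped junction-to-ambient thermal resistance in C/W (toy value)."""
    temp = ambient_c
    for _ in range(100):
        power = p_dynamic + leakage_power(temp)    # electrical step: power at this temperature
        new_temp = ambient_c + theta_ja * power    # thermal step: temperature at this power
        if abs(new_temp - temp) < tol:
            return power, new_temp
        temp = new_temp
    raise RuntimeError("did not converge: a sign of possible thermal runaway")

power, temp = co_simulate()
print(f"converged operating point: {power:.2f} W at {temp:.1f} C")
```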

At smaller scales the problem of electro-thermal analysis has been tractable, but with the sizes of current and future generations of electronics systems, there has not been a feasible solution. Cadence Celsius is using massively parallel execution to increase capacity and boost performance. Interestingly, they are using a similar capability for their Clarity EM solver for system level analysis. Celsius can add hundreds of CPUs to effectively increase performance. An example provided by Cadence shows that with 320 CPUs, runtime is 10.2X faster than legacy solvers running on 40 CPUs. Even with the baseline 40 CPUs, Celsius is 2.5X faster.

Building today’s advanced electronic products and systems requires integration of many disciplines at once to ensure good results. For instance, an electronic system might appear to have adequate heat dissipation capacity, but transitions from one operating mode to another or changes in power consumption due to elevated temperatures may lead to exceeding allowable design parameters. Simulation and analysis that includes both thermal and electronic domains is necessary to avoid costly late cycle iterations to fix design problems.

With Celsius, Cadence is delivering its second system-level high-capacity and high-performance analysis tool. Both Clarity and Celsius use highly scalable solver technology to boost capacity so real-world systems can be fully analyzed. They also both accept inputs from 3D modeling, board, package, and IC design flows, to permit full integration of all design elements. What they are doing today is a far cry from the era when EDA tools were siloed and only looked at a small part of the overall problem. With this move, Cadence is showing how EDA vendors, with the right vision, can leave behind the notion that EDA is a “solved problem” that offers little differentiation. Companies that adopt these tools will have a significant advantage in their design process that will lead to more successful products. Full information on Celsius is available on the Cadence website.


AI Inference at the Edge – Architecture and Design
by Tom Dillinger on 09-23-2019 at 10:00 am

In the old days, product architects would throw a functional block diagram “over the wall” to the design team, who would plan the physical implementation, analyze the timing of estimated critical paths, and forecast the signal switching activity on representative benchmarks.  A common reply back to the architects was, “We’ve evaluated the preliminary results against the power, performance, and cost targets – pick two.”  Today, this traditional “silo-based” division of tasks is insufficient.  A close collaboration between architecture and design addressing all facets of product development is required.  This is perhaps best exemplified by AI inference engines at the edge, where there is a huge demand for analysis of image data, specifically object recognition and object classification (typically in a dense visual field).

The requirements for object classification at the edge differ from a more general product application.  Raw performance data – e.g., maximum tera operations per second (TOPS) – is less meaningful.  Designers seek to optimize the inference engine frames per second (fps), frames per watt (fpW), and frames per dollar (fp$).  Correspondingly, architects must address the convolutional neural network (CNN) topologies that achieve high classification accuracy, while meeting the fps, fpW, and fp$ product goals.  This activity is further complicated by the rapidly evolving results of CNN research – the architecture must also be sufficiently extendible to support advances in CNN technology.

Background

AI inference for analysis of image data consists of two primary steps – please refer to the figure below.

Feature extraction commonly utilizes a (three-dimensional) CNN, supporting the two dimensions of the pixel image plus the (RGB) color intensity.  A convolutional filter is applied to strides of pixels in the image.  The filter matrix is a set of learned weights.  The size of the filter coefficient array is larger than the stride to incorporate data from surrounding pixels, to identify distinct feature characteristics.  (The perimeter of the image is “padded” with data values, for the filter calculation at edge pixels.)  The simplest example would use a 3×3 filter with a stride of 1 pixel.  Multiple convolution filters are commonly applied (with different feature extraction properties), resulting in multiple feature maps for the image, to be combined as the output of the convolution layer.  This output is then provided to a non-linear function, such as Rectified Linear Unit (ReLU), sigmoid, tanh, softmax, etc.

The dimensionality of the data for subsequent layers in the neural network is reduced by pooling – various approaches are used to select the stride and mathematical algorithm to reduce the dataset to a smaller dimension.
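
A minimal NumPy sketch of those steps – a 3×3 filter with stride 1 and padding, the ReLU non-linearity, and 2×2 max pooling – applied to a single channel; a real CNN layer operates on three input channels with many learned filters, so treat this purely as an illustration of the data flow.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=1):
    """Slide a small filter over a zero-padded single-channel image."""
    img = np.pad(image, pad)
    kh, kw = kernel.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = img[y*stride:y*stride+kh, x*stride:x*stride+kw]
            out[y, x] = np.sum(patch * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(feature_map, size=2):
    """Reduce dimensionality by keeping the max of each size x size window."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % size, :w - w % size]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.random.rand(8, 8)                     # stand-in for one channel of an image
edge_filter = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)   # one 3x3 filter with learned-style weights

feature_map = max_pool(relu(conv2d(image, edge_filter)))
print(feature_map.shape)                         # (4, 4): 8x8 conv output pooled 2x2
```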

Feature extraction is followed by object classification.  The data from the CNN layers is flattened and input to a conventional, fully-connected neural network, whose output identifies the presence and location of the objects in the image, from the set of predefined classes used during training.

Training of a complex CNN is similar to that of a conventional neural network.  A set of images with identified object classes is provided to the network.  The architect selects the number and size of filters and the stride for each CNN layer.  Optimization algorithms work backward from the final classification results on the image set, to update the CNN filter values.

Note that the image set used for training is often quite complex.  (For example, check out the Microsoft COCO, PASCAL VOC, and Google Open Images datasets.)  The number of object classes is large.  Objects may be scaled in unusual aspect ratios.  Objects may be partially obscured or truncated.  Objects may be present in varied viewpoints and diverse lighting conditions.  Object classification research strives to achieve greater accuracy on these training sets.

Early CNN approaches applied “sliding windows” across the image for analysis.  Iterations of the algorithm would adaptively size the windows to isolate objects in bounding boxes for classification.  Although high in accuracy, the computational effort of this method is extremely high and the throughput is low – not suitable for inference at the edge.  The main thrust for image analysis requiring real-time fps throughput is to use full-image, one-step CNN networks, as was depicted in the figure above.  A single convolutional network simultaneously predicts multiple bounding boxes and object class probabilities for those boxes.  Higher resolution images are needed to approach the accuracy of region-based classifiers.  (Note that a higher resolution image also improves extraction for small objects.)

Edge Inference Architecture and Design

I recently had the opportunity to chat with the team at Flex Logix, who recently announced a product for AI inference at the edge, InferX X1.  Geoff Tate, Flex Logix CEO, shared his insights into how the architecture and design teams collaborated on achieving the product goals.

“We were focused on optimizing inferences per watt and inferences per dollar for high-density images, such as one to four megapixels.”, Geoff said.  “Inference engine chip cost is strongly dependent on the amount of local SRAM, to store weight coefficients and intermediate layer results – that SRAM capacity is balanced against the power associated with transferring weights and layer data to/from external DRAM.  For power optimization, the goal is to maximize MAC utilization for the neural network – the remaining data movement is overhead.”

For a depiction of the memory requirements associated with different CNN’s and image sizes, Geoff shared the chart below.  The graph shows the maximum memory to store weights and data for network layers n and n+1.

The figure highlights the amount of die area required to integrate sufficient local SRAM for the n/n+1 activation layer evaluation — e.g., 160MB for YOLOv3 and a 2MP image, with the largest activation layer storage of ~67MB.  This memory requirement would not be feasible for a low-cost edge inference solution.
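
For a sense of where numbers like that come from, here is the simple sizing arithmetic for one activation tensor; the layer dimensions and 8-bit precision below are illustrative assumptions, not the actual YOLOv3 layer behind the ~67MB figure.

```python
def activation_bytes(height, width, channels, bytes_per_value=1):
    """Memory needed to hold one layer's output feature maps (activations)."""
    return height * width * channels * bytes_per_value

# Illustrative only: a feature map at full 2-megapixel resolution with 32 channels,
# stored as 8-bit values. (The real YOLOv3 layer shapes differ.)
mb = activation_bytes(1080, 1920, 32, bytes_per_value=1) / 2**20
print(f"one activation tensor: {mb:.0f} MB")   # ~63 MB

# While evaluating layer n -> n+1, both layers' activations (plus layer n+1's weights)
# must be resident somewhere, on-chip SRAM or external DRAM, which is exactly the
# cost/power tradeoff the chart illustrates.
```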

Geoff continued, “We evaluated many design tradeoffs with regards to the storage architecture, such as the amount of local SRAM (improved fps, higher cost) versus the requisite external LPDDR4 capacity and bandwidth (reduced fps, higher power).  We evaluated the MAC architecture appropriate for edge inference – an “ASIC-like” MAC implementation is needed at the edge, rather than a general-purpose GPU/NPU.  For edge inference on complex CNN’s, the time to reconfigure the hardware is also critical.  Throughout these optimizations, the focus was on supporting advanced one-step algorithms with high pixel count images.”

The final InferX X1 architecture and physical design implementation selected by the Flex Logix team is illustrated in the figure below.

The on-chip SRAM is present in two hierarchies:  (1) distributed locally with clusters of the MAC’s (providing 8×8 or 16×8 computation) to store current weights and layer calculations, and (2) outside the MAC array for future weights.

The MAC’s are clustered into groups of 64.  “The design builds upon the expertise in MAC implementations from our embedded FPGA IP development.”, Geoff indicated.  “However, edge inference is a different calculation, with deterministic data movement patterns known at compile time.  The granularity of general eFPGA support is not required, enabling the architecture to integrate the MAC’s into clusters with distributed SRAM.  Also, with this architecture, the internal MAC array can be reconfigured in less than 2 microseconds.”

Multiple smaller layers can be “fused” into the MAC array for pipelined evaluation, reducing the external DRAM bandwidth – the figure below illustrates two layers in the array (with a ReLU function after the convolution).  Similarly, the external DRAM data is loaded into the chip for the next layer in the background, while the current layer is evaluated.

Geoff provided benchmark data for the InferX X1 design, and shared a screen shot from the InferX X1 compiler (TensorFlowLite), providing detailed performance calculations – see the figure below.

“The compiler and performance modeling tools are available now.”, Geoff indicated.  “Customer samples of InferX X1 and our evaluation board will be available late Q1’2020.”

 

I learned a lot about inference requirements at the edge from my discussion with Geoff.  There’s a plethora of CNN benchmark data available, but edge inference requires laser focus by architecture and design teams on fps, fpW, fp$, and accuracy goals, for one-step algorithms with high pixel images.  As CNN research continues, designs must also be extendible to accommodate new approaches.  The Flex Logix InferX X1 architecture and design teams are addressing those goals.

For more info on InferX X1, please refer to the following link.

-chipguy