Exponential Innovation: HFSS
by Matt Commens on 02-21-2023 at 10:00 am

The old adage “If it ain’t broke, don’t fix it” is as offensive to innovators as it is to grammarians. Just because something works well doesn’t mean it cannot work better. As times change and technology advances, you either move forward or get left behind.

If you haven’t upgraded to the latest Ansys HFSS electromagnetic simulation software, you don’t know what you’re missing. Imagine if you could solve huge, complete electromagnetic designs while preserving the accuracy and reliability HFSS provides. How would that change your design methodology? How much faster would you get to market? How much better would the products you deliver be?

Electromagnetic Simulation Evolves

The need for speed and capacity continues to increase significantly, and HFSS has kept pace throughout its three-plus decades of history. Today, the evolution of hardware, with its exponentially higher performance and design specifications, has driven the need to solve staggeringly large and complex designs that were inconceivable only three years ago.

As simulation demands have evolved, HFSS high-performance computing (HPC) technology has evolved right along with them. Desktop computers with multiple processors were introduced in the late 1990s. With this innovation, HFSS delivered Matrix Multiprocessing (MP), enabling HFSS users to simulate faster and reach market sooner.

Next came the groundbreaking Domain Decomposition Method (DDM) technology in 2010. This allowed a single HFSS design to be solved on elastic hardware across distributed memory, resulting in an order-of-magnitude increase in problem size. As is always the case with HFSS, it was achieved in an uncompromised fashion with respect to solving a fully coupled electromagnetic system matrix. Beware of other solutions claiming parallel DDM, as they could be secretly coupling the so-called “domains” via internal ports and thereby risking the rigor and accuracy needed for cutting-edge designs. If they are only benchmarking simple transmission-line-centric models, you should be curious and concerned.

Matrix multiprocessing is not limited to a single machine. In 2015 the HFSS distributed memory matrix (DMM) solver was introduced, providing access to more memory on elastic hardware without compromising rigor. This enables the greatest accuracy, lowest noise floor, and best efficiency for extremely large, many-port models.

We continue to refine DMM in HFSS. As a result of continuous innovations such as HFSS Mesh Fusion, introduced in 2020, the capacity increase in HFSS has been exponential, growing from 10,000 unknowns in 1990 to over 800 million unknowns in 2022, and we anticipate crossing the one-billion threshold soon.

Figure 1 – The Evolution of HFSS Electromagnetic Simulation Capacity

Three recent innovations that contribute to such impressive speed boosts are IC Mode and meshing, a new distributed Mesh Fusion solver option in HFSS 3D Layout, and the integration of ECADXplorer capability into 3D Layout, improving the capacity and ease of use of GDS-based simulation flows.

We have also sped up the frequency sweeps. Introduced in the early 2000s, the Spectral Decomposition Method (SDM) allows the points in a frequency sweep to be solved in parallel on both shared and elastic hardware. Since SDM, we have continuously improved the algorithms and introduced new innovations, such as the S-Parameters Only (SPO) matrix solve. By giving each frequency sweep point a smaller memory footprint, we achieve a speedup at every solution point. This memory reduction pays off further by letting you solve more frequency points in parallel with the freed-up memory, resulting in faster frequency sweeps without compromising accuracy.
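To make the parallelism concrete, here is a minimal Python sketch of the embarrassingly parallel structure that SDM exploits: each frequency point is an independent solve, so points can be farmed out across cores or cluster nodes. The solve_frequency_point function is a hypothetical stand-in for a full-wave matrix solve, not an Ansys API.

```python
# Conceptual sketch only; solve_frequency_point() stands in for a full-wave
# matrix solve at one frequency and is not part of any Ansys product.
from concurrent.futures import ProcessPoolExecutor

def solve_frequency_point(freq_ghz):
    """Independent solve at one frequency; returns a dummy S-parameter."""
    s11 = complex(1.0 / (1.0 + freq_ghz), -0.1 * freq_ghz)
    return freq_ghz, s11

if __name__ == "__main__":
    sweep = [0.5 * n for n in range(1, 41)]  # 0.5 GHz to 20 GHz
    # Points are independent, so freed-up memory translates directly
    # into more points solved at once.
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = dict(pool.map(solve_frequency_point, sweep))
    print(f"solved {len(results)} frequency points in parallel")
```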

Ansys continually innovates in HFSS. The technological breakthroughs of MP, SDM, DDM, DMM, and SPO along with Mesh Fusion demonstrate an Ansys commitment to continued improvement in capacity and performance, all without compromising accuracy. HFSS workflow and solver technology now enables massive system-scale capacity; IC plus package plus PCB, fully coupled and uncompromised, is now doable and routine. HFSS elastic compute solves problems eight times larger than just two years ago, and 40 times larger than the competition. Together, these leading capabilities in Computational Electromagnetic simulation are enabling today’s most cutting-edge design work, ranging from 3D-IC to MIMO and phased antenna array designs for 5G/6G. This is in fact why top semiconductor companies universally rely on HFSS to verify their designs. If you haven’t used the latest HFSS, you don’t know what you’re missing.

Come see the latest capabilities of HFSS in this webinar on March 3rd: Ansys 2023 R1: Ansys HFSS What’s New | Ansys

Also Read:

IDEAS Online Technical Conference Features Intel, Qualcomm, Nvidia, IBM, Samsung, and More Discussing Chip Design Experiences

Whatever Happened to the Big 5G Airport Controversy? Plus A Look To The Future

Ansys’ Emergence as a Tier 1 EDA Player— and What That Means for 3D-IC


Hardware Security in Medical Devices has not been a Priority — But it Should Be
by Andreas Kuehlmann on 02-21-2023 at 6:00 am

Picture of medical monitors inside the ICU

Rapid advances in medical technology prolong patients’ lives and provide them with a higher standard of living than years past. But the increased interconnectivity of those devices and their dependence on wired and wireless networks leave them susceptible to cyberattacks that could have severe consequences.

Whether it’s an implantable defibrillator that transmits data to a cardiologist, an infusion pump that allows a nurse to monitor a patient’s vital signs, or even a smartwatch that logs wellness routines, these instruments broaden the attack surface bad actors can exploit.

In 2017, hospitals around the world were victimized during the large-scale WannaCry ransomware attack. The National Health Service assessed that across England and Scotland, 19,500 appointments were canceled, 600 computers were frozen, and five hospitals had to divert ambulances. In August 2022, a French hospital was subject to a ransomware attack on its medical imaging and patient admission systems, and a similar ploy targeted another nearby hospital a few months later. HIPAA Journal, citing data from Check Point Research, reported in November that an average of 1,426 cyberattacks on the healthcare industry took place each month in 2022 — a 60% increase year over year.

The medical devices themselves are rarely of interest to cyber criminals, who use them as a way to access network infrastructure and install malware or obtain data. They have also recognized that software isn’t the only way in: The hardware that powers all devices, semiconductor chips, is drawing increased attention due to security vulnerabilities that can be remotely accessed and exploited.  Vulnerabilities in software can be patched, but it’s more complex and costly to fix hardware issues.

Limited oversight of the cybersecurity of medical devices has created an environment that’s ripe for exploitation. We must begin asking questions that lead to proactively addressing these hardware vulnerabilities and develop ways to overcome the complications associated with securing a vast array of instruments before something dramatic happens.

Range of devices, shared networks among security issues

The Food and Drug Administration (FDA) has periodically released reports on the importance of securing medical devices — including in March 2020, when it raised awareness of a vulnerability discovered in semiconductor chips that transmit data via Bluetooth Low Energy. But the modern patient care environment is so heavily reliant upon interconnectivity that minimizing cybersecurity risks can be a monumental task.

In its warning, the FDA urged the seven companies that manufactured the chips to talk to providers and patients about how they can lessen the risks tied to that vulnerability. It also acknowledged that any repairs wouldn’t be simple because the affected chips appear in pacemakers, blood glucose monitors, insulin pumps, electrocardiograms, and ultrasound devices.

According to a report issued by Palo Alto Networks’ Unit 42 cyber threat research department, medical instruments and IT devices share 72% of health care networks, meaning malware can spread between computers and imaging machines — or any combination of electronics — rather seamlessly.

Medical devices’ long lifecycles can also make securing them challenging. Although they can still function as intended, they may run on an outdated operating system (OS) that can be costly to upgrade. Scanners such as MRI and CT machines are targeted because of their outdated OS; according to the Unit 42 report, only 16% of the medical devices connected to networks were imaging systems, but they were the gateway for 51% of attacks. The Conficker virus, first detected in 2008, infected mammography machines at a hospital in 2020 because those devices were running on Windows XP — an OS that no longer received mainstream support from Microsoft as of 2014.

And, because of their seemingly niche functions, many medical devices weren’t constructed with cybersecurity in mind. Few security scanning tools exist for instruments that run on a proprietary OS, making them ripe for attacks. In September, the FBI issued a warning to healthcare facilities about the dangers associated with using outdated medical devices. It highlighted research from cybersecurity firms that showed that 53% of connected medical devices have known critical weaknesses stemming from hardware design and software management. Each susceptible instrument has an average of 6.2 vulnerabilities.

When we consider the number of devices in use around the world, the way they are used, and the varying platforms they operate on, it’s apparent that such a broad attack surface presents a significant threat.

Documenting vulnerabilities offers a path forward

Fixing hardware flaws is complicated. Replacing affected semiconductor chips, if even possible given the age and functionality of the device, takes considerable resources and can lead to a disruption in treatment.

Hospitals and other patient care centers aren’t often prepared to defend the broad attack surface created by their use of hundreds of medical devices. Guidance from organizations such as the FDA — the latest of which was released in April, two months before a bipartisan bill that mandated the organization update its recommendations more frequently was introduced in the Senate — only goes so far. Manufacturers must prioritize the security of the semiconductor chips used in medical devices, and consumers throughout the supply chain must ask questions about vulnerabilities to ensure greater consideration is being put into the chips’ design and large-scale production.

A hardware bill of materials (HBOM), which records and tracks the security vulnerabilities of semiconductor chips from development through circulation, is an emerging solution. It can help ensure defective or compromised chips aren’t used — and if they are, as in the case of Apple’s newest M1 chips, which have noted design flaws, it allows the weaknesses and repercussions to be thoroughly documented. Plus, even if a vulnerability is identified in the future, manufacturers can undertake a forensic review of the semiconductor chip’s design to determine which devices are susceptible to certain attacks.

By knowing the specific weaknesses in the hardware, you can prevent it from being exploited by cyber criminals and causing devastation across medical facilities.

Risks, outcomes show a high level of urgency

Emerging technology has gotten in the way of the safe operation of medical devices before. In 1998, the installation of digital television transmitters caused interference with medical devices at a nearby hospital because the frequencies they used overlapped. What’s different today, however, is that outside actors can target the power they exert over these instruments — but it’s preventable.

The increasing potential of attacks on semiconductor chips in networked medical devices demonstrates how savvy cyber criminals are becoming. Although advances in technology have made these devices a routine part of care around the globe, they’re also introducing security vulnerabilities given their interconnected nature. Patients can be exposed to serious safety and cybersecurity risks, and we must act now to shore up those vulnerabilities before something catastrophic occurs.

Also Read:

ASIL B Certification on an Industry-Class Root of Trust IP

Validating NoC Security. Innovation in Verification

NIST Standardizes PQShield Algorithms for International Post-Quantum Cryptography


AMAT- Flat is better than down-Trailing tool strength offsets memory- backlog up
by Robert Maire on 02-20-2023 at 10:00 am

-Strength in trailing tools offsets weak memory resulting in flat
-Order book very volatile but backlog surprisingly still grew
-Trailing edge VS Leading edge = 50/50 – Foundry/logic over 2/3
-Not nearly as bad as Lam but not as good as ASML

AMAT posts good quarter & guide – Flat for three quarters

Applied Materials reported revenue of $6.74B and EPS of $2.03, more or less flat with last quarter, versus street estimates of $6.23B and $1.93. Guidance is for $6.4B ±$400M and EPS of $1.84 ±$0.18, versus current street estimates of $5.86B and $1.75 in EPS.

At this point in the industry turbulence, being flattish or slightly down is good performance. We would not complain.

Trailing edge to the rescue

The product mix between leading and trailing edge was roughly 50/50 as continued strength in trailing edge offset what is clearly a very sharp drop in memory as evidenced by Lam Research.

Applied has done a good job of predicting the need for trailing edge tools, as it had previously created the ICAPS group to focus on the non-leading edge. They noted particular strength in implant, as we have seen with Axcelis. We would be slightly concerned that Applied may make more headway in trailing edge implant, where Axcelis has done very well because it faced less competition than at the leading edge. With Applied’s renewed focus we could see the competition heat up, and Axcelis’ share may be vulnerable.

In a perverse way, China sales in the trailing edge are somewhat safe from government embargoes. Though we are a little bit curious about how much is truly leading edge, as Applied had previously talked about reclassifying some sales to China to trailing edge to escape the embargo.

Backlog grew overall- fueled by trailing edge

The backlog was clearly very volatile, as we had suggested, with a lot of puts and takes. Takes from memory, replaced by puts in trailing edge. Whereas some others may live off of and reduce their backlog using it as a buffer during rainy season, Applied will also eventually reduce backlog by catching up to customer demand and re-orienting to trailing products. We would expect that there is likely a lot of disturbance in Applied supply chain given the large shift from leading edge to trailing edge.

No handle on recovery timing

As we have heard with others, management was not willing to be specific about any recovery timing other than a vague thought about DRAM improving in the later part of the year.

The memory market is still clearly in a downward trend which doesn’t seem to be slowing much. We are still stuck with excess inventory and production creating a poor pricing environment which is the worst we have seen in a very long time.

Management also said that foundry/logic is also weak but clearly not nearly as much as memory. It kind of feels to us like memory could easily be down 40% or so while foundry logic may be down closer to 10% overall. Foundry/logic was over 2/3 of Applied business which is a good mix to have when memory is off as much as it is now.

Patterning product at SPIE

The company mentioned several times a patterning product announcement at the upcoming SPIE show, which we will be attending at the end of this month in San Jose. Our guess is that it’s a new reticle inspection tool to try to resuscitate their flagging sales in this area. Both Applied and KLA have been hurt by Lasertec in reticle inspection, and Applied has lost a number of customers in the space.

MKS a $250M hit to Applied

Management pointed out that the impact on Applied of the data breach at MKS may be as high as $250M, which will be made up over time. Probably not very meaningful to Applied, but obviously a black eye on an otherwise well run MKS. We would imagine that all tool makers are probably going to look for stricter controls on their major critical suppliers.

The stocks

While Applied was weak during the daily session, it was up about 1% in the aftermarket as the news of flatness was received as better than the down experiences of others.

Obviously 2023 will be a down year overall for the industry, but less so for Applied, and that’s not bad. It doesn’t make us want to go out and buy the stock, but it may limit potential future downside.

The overall semiconductor rally has run out of steam a bit as the reality of earnings has set in. As Applied reports a month behind others, there may not be a lot of appetite left to go out and buy a semiconductor stock right now given overall sentiment.

Clearly ASML remains the best performer of the group, with Applied and KLA somewhat in the middle and Lam, the memory poster child, as the laggard for obvious reasons.

We don’t expect any relief on the China embargo, and the CHIPS Act is very slow to start; even though everyone has announced projects, there are many years between project announcements and actual spend.

So we don’t see any factors to rescue 2023 from being negative. The bigger question not yet in focus is what 2024 will look like and so far we have no clue nor are any companies guessing.

About Semiconductor Advisors LLC
Semiconductor Advisors is an RIA (Registered Investment Advisor) specializing in technology companies with particular emphasis on semiconductor and semiconductor equipment companies. We have been covering the space longer and been involved with more transactions than any other financial professional in the space. We provide research, consulting and advisory services on strategic and financial matters to both industry participants as well as investors. We offer expert, intelligent, balanced research and advice. Our opinions are very direct and honest and offer an unbiased view as compared to other sources.

Also Read:

KLAC- Weak Guide-2023 will “drift down”-Not just memory weak, China & logic too

Hynix historic loss confirms memory meltdown-getting worse – AMD a bright spot

Samsung- full capex speed ahead, damn the downturn- Has Micron in its crosshairs


IEDM 2023 – 2D Materials – Intel and TSMC
by Scotten Jones on 02-20-2023 at 6:00 am

Intel and TSMC make up two of the three leading-edge logic companies. At IEDM, held in December 2022, Intel presented one paper on 2D materials and TSMC presented six. Clearly 2D materials are of great interest to at least two of the three leading-edge logic companies. Before diving into the papers, some background context is needed.

Logic Scaling

Logic designs are made up of standard cells; if you are going to scale logic to increase density, the standard cells must shrink.

The height of a standard cell is typically characterized as the Metal-2 Pitch (M2P) multiplied by the number of tracks. While this is a useful metric, it glosses over the fact that the cell height must also encompass the devices that make up the cell. Figure 1 illustrates a 7.5-track standard cell, showing the M2P and tracks on the left of the cell and, to the right of the cell, a cross-sectional view of the corresponding device structure.

The width of a standard cell is made up of some number of Contacted Poly Pitches (CPP), with the number depending on the cell type and how the diffusion breaks at the edges of the cell are handled. Once again, CPP encompasses a device structure that must shrink when CPP shrinks. Figure 1 illustrates CPP and, at the bottom, has a cross-sectional view of the device structure.
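As a back-of-the-envelope illustration of these two relationships, the sketch below computes a cell footprint from M2P, track count, CPP, and cell width in CPPs. The dimensions are hypothetical round numbers, not values from any of the papers.

```python
# Illustrative arithmetic only; all dimensions below are hypothetical.
def cell_area_um2(m2p_nm, tracks, cpp_nm, n_cpp):
    """Standard-cell area: height = M2P x tracks, width = n x CPP."""
    height_nm = m2p_nm * tracks
    width_nm = cpp_nm * n_cpp
    return height_nm * width_nm * 1e-6  # nm^2 -> um^2

# Example: a 7.5-track cell with a 32nm M2P and a 48nm CPP, 3 CPPs wide.
print(f"{cell_area_um2(32, 7.5, 48, 3):.4f} um^2")  # ~0.0346 um^2
```

Shrinking either M2P or CPP shrinks the cell, which is why the rest of this article focuses on what limits each term.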

Figure 1. Standard Cell.

Intel, Samsung, and TSMC have all made the switch from planar devices to FinFETs and are now in the beginning stages of the transition to Horizontal Nano-Sheets (HNS). Samsung is in production with HNS now, and Intel and TSMC have announced HNS production targets of 2024 and 2025, respectively.

Figure 2 illustrates the device structure and dimensions that make up cell height.

Figure 2. Standard Cell Height.

The changeover to HNS offers multiple opportunities to shrink cell height. HNS can replace multi-fin nFET and pFET devices with single nano-sheet stacks, shrinking the height impact of the devices. Forksheet and CFET enhancements to HNS can reduce or even eliminate n-p spacing.

CPP is made up of Gate Length (Lg), Spacer Thickness (Tsp), and Contact Width (Wc); see figure 3.

Figure 3. Contacted Poly Pitch.

CPP can be scaled down by reducing Lg, Tsp, or Wc, or any combination of the three. Lg is limited by the device’s ability to provide acceptable leakage. Figure 4 illustrates Lg for various devices.

Figure 4. Gate Length Scaling.

As figure 4 shows, constraining the channel thickness and/or increasing the number of gates enables a shorter Lg.

So-called 2D materials are monolayers of material less than 1nm thick, improving gate control over the channel and enabling Lg down to ~5nm. At these dimensions silicon has poor mobility, so other materials are used that have higher mobility and a higher band gap, further reducing leakage. The ability to scale Lg to ~5nm enables a significant shrink of CPP and therefore smaller standard cells.
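Per figure 3, CPP decomposes as Lg plus two spacers plus the contact width, so a shorter Lg translates directly into a smaller pitch. The sketch below uses hypothetical dimensions purely to show the scale of the benefit.

```python
# Hypothetical dimensions for illustration; per figure 3, CPP = Lg + 2*Tsp + Wc.
def cpp_nm(lg, tsp, wc):
    """Contacted poly pitch from gate length, spacer thickness, contact width."""
    return lg + 2.0 * tsp + wc

baseline = cpp_nm(lg=16, tsp=5, wc=14)  # silicon-channel device: 40nm
scaled = cpp_nm(lg=5, tsp=5, wc=14)     # 2D channel with Lg ~5nm: 29nm
print(f"CPP shrinks {baseline:.0f}nm -> {scaled:.0f}nm "
      f"({100 * (1 - scaled / baseline):.0f}% reduction)")
```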

2D Material Challenges

Transition Metal Dichalcogenides (TMDs) such as MoS2, WS2, or WSe2 have been identified as materials of interest with high mobility at monolayer thicknesses (silicon has poor mobility at these dimensions). There are several challenges/questions that need to be addressed for practical use of these materials, and they are explored in the 7 papers discussed below:

  1. Device performance – do devices fabricated with these materials really provide good drive current and low leakage at short Lg?
  2. Contacts – 2D TMD films are atomically smooth, making good low-resistance contacts hard to form.
  3. Film formation – currently MOCVD at high temperature on a sapphire substrate is used to form the 2D films, and the resulting film is then transferred to a 300mm silicon wafer for further processing. This is not a practical production process.

Presented Results

In paper 7.5, “Gate length scaling beyond Si: Mono-layer 2D Channel FETs Robust to Short Channel Effects,” C. J. Dorow, et al., of Intel, explored device performance.

The ultimate goal for 2D-material-based devices is a stack of 2D layers similar to the HNS stacks, but with each channel thinner, enabling shorter Lg and more layers in a stack. Figure 5 illustrates the difference.

Figure 5. HNS Versus 2D Stack.

Intel did a wet transfer of an MBE-grown MoS2 film over a back gate and then evaluated the device with the back gate alone and with an added front gate, down to a source-drain distance of 25nm. Figure 6 illustrates the device structure.

Figure 6. Intel 2D Device Structure.

Intel encountered some delamination issues in their experiments but was able to experimentally confirm their modeling results and conclude that a double-gated device should be able to scale down to at least 10nm with low leakage; see figure 7.

Figure 7. Experimental Results (left side) and Simulation Results (right side).

In paper 28.4, “Comprehensive Physics Based TCAD Model for 2D MX2 Channel Transistors,” D. Mahaveer Sathaiya, et al., of TSMC, discussed a comprehensive simulation model of 2D devices and calibrated the model against 3 datasets. Having the ability to model 2D devices accurately will be key to the further development of the technology.

In paper 28.1, “Computational Screening and Multiscale Simulation of Barrier-Free Contacts for 2D Semiconductor pFETs,” Ning Yang, et al., of TSMC, used ab initio calculations to screen contact materials for 2D devices.

The best reported experimental results for contact resistance to WSe2 are 950 Ω·μm, and in this work Co3Sn2S2 is projected to be able to achieve 20 Ω·μm, approaching the quantum limit. Furthermore, simulated devices are projected to produce ~2 mA/μm on-state current. Sputtering on a sapphire substrate followed by a high-temperature (800°C) annealing process was shown to produce Co3Sn2S2 with the expected chemical composition and crystalline structure.

In paper 7.2, “High-Performance Monolayer WSe2 p/n FETs via Antimony-Platinum Modulated Contact Technology towards 2D CMOS Electronics,” Ang-Sheng Chou, et al., of TSMC, presented experimental results for Sb-Pt modulated contacts that achieve record contact resistances of 750 Ω·μm for pFET and 1,800 Ω·μm for nFET on WSe2. An on current of ~150 μA/μm was achieved. These results are not as good as the projections from paper 28.1 but represent experimental results versus simulations.

In paper 7.3, “pMOSFET with CVD-grown 2D semiconductor channel enabled by ultra-thin and fab-compatible spacer doping,” Terry Y.T. Hung, et al., of TSMC, work toward a production-type pFET is presented. A lot of 2D material work is done on Schottky diodes, but MOSFETs have lower access resistance. In order to create practical MOSFETs, a CVD-grown channel is needed with doped spacers. In this paper, broken-bandgap doped spacers are created by treating WSe2 with O2 plasma to create WOx as a dopant. The process is self-aligned and self-limiting, as illustrated in figure 8.

Figure 8. Self-aligned and Self-limited doped spacer formation.

The CVD-grown 2D layers are still grown separately and then transferred, but other parts of the process are production compatible. The devices achieved one of the lowest contact resistances (Rc ~1,000 Ω·μm) among transistors with WSe2 channels and a relatively high Ion > 10⁻⁵ A/μm at a good S.S. < 80 mV/dec.

In paper 7.4, “Nearly Ideal Subthreshold Swing in Monolayer MoS2 Top-Gate nFETs with Scaled EOT of 1 nm,” Tsung-En Lee, et al., of TSMC, showed an ALD-grown Hf-based gate oxide of ~1nm EOT on CVD-grown MoS2 with a top gate, achieving low leakage and a nearly ideal subthreshold swing of 68 mV/dec. Pinhole-free oxides on TMD materials are very difficult to achieve, and this work showed excellent results.

The final paper is 34.5, “First Demonstration of GAA Monolayer-MoS2 Nanosheet nFET with 410 μA/μm ID at 1V VD at 40nm gate length,” in which Yun-Yan Chung, et al., of TSMC, showed an MoS2 device with good performance fabricated with an integrated process flow.

Figure 9 illustrates a simulation of the process flow for a two-layer device stack.

Figure 9. Simulated two layer device process.

Although further research is still needed, in this paper 2- and 4-layer stacks of TMD and sacrificial material were sequentially deposited.

Figure 10 shows TEM images of the resulting stacks.

Figure 10. TEM of the deposited 2D/sacrificial material stacks.

Sequential deposition of the 2D materials and sacrificial layers is a far more production-ready process than film transfer and is likely to be lower cost as well.

The resulting stacks were then etched into fins using a metal hard mask. Figure 11 illustrates the “fin” formation results.

Figure 11. 2D Material stack fins.

As is the case with horizontal nanosheet stacks, an inner spacer is needed to reduce capacitance. To form the inner spacer, an additional sacrificial material is needed to prevent collapse of the 2D layers. Figure 12 illustrates the inner spacer process.

Figure 12. Inner Spacer formation.

Finally, metal edge contacts are formed, and the channels are released. Figure 13 illustrates the metal edge contacts.

Figure 13. Metal Edge Contacts.

The resulting devices have high contact resistance due to the lack of doping in the contacts and extension regions. A one-layer device with 40nm Lg was demonstrated, with a Vth of ~0.8 volts, an SS of ~250 mV/dec, and a drive current of 410 μA/μm.

Conclusion

These 7 papers illustrate both the excellent progress being made toward 2D devices and the level of interest at two of the leading-edge device producers. Some recent projections I have completed suggest that 2D CFETs can achieve 5x the logic density of the densest production standard cells today. 2D CFETs are likely a technology for the 2030s as opposed to the 2020s and illustrate that logic scaling is nowhere near its end.

Also Read:

IEDM 2022 – Imec 4 Track Cell

IEDM 2022 – TSMC 3nm

IEDM 2022 – Ann Kelleher of Intel – Plenary Talk


Podcast EP144: How Andes Supplies RISC-V Cores to the World with Frankwell Lin
by Daniel Nenni on 02-17-2023 at 10:00 am

Dan is joined by Frankwell Lin. Frank co-founded Andes Technology in 2005 and served as President from 2006. He became Chairman and CEO in 2021. Under his leadership, Andes is recognized as a top supplier of embedded CPU IP in the semiconductor industry.

Dan explores with Frank how Andes became such a strong supplier of RISC-V cores. Frank explains how Andes chose the RISC-V architecture and describes the vast array of applications that Andes supports with high-quality, proven IP. Dan discusses the future with Frank as well. Where will Andes take its portfolio and expertise next?

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


CEO Interview: Axel Kloth of Abacus
by Daniel Nenni on 02-17-2023 at 6:00 am

A physicist by training, Axel is used to the need for large-scale compute. He discovered over 30 years ago that scalability of processor performance was paramount for solving any computational problem. That necessitated a new paradigm in computer architecture. At Parimics, SSRLabs and Axiado he was able to show that new thinking was needed, and what novel practical solutions could look like. Axel is now repeating that approach with Abacus Semi.

What is Abacus Semiconductor Corporation’s vision?
Abacus Semi envisions a future in which supercomputers can be built with Lego-like building blocks – mix and match any combination of processors, accelerators and smart multi-homed memories. We believe that supercomputers today do not fulfill the requirements of the users. They do not scale nearly linearly. Oftentimes, the 100,000 servers making up a supercomputer can be found to provide just 5,000 times the performance of a single server. That is largely because today’s supercomputers are in essence commercial off-the-shelf (COTS) devices, built without any consideration of communication between those servers for instruction- and data-sharing at low latency and high bandwidth. Another drawback is that accelerators for special-purpose applications do not integrate easily into supercomputers. We have a different view on the basic building blocks – very similar to Legos. The programmable elements such as processors are used for orchestration of the workload, accelerators carry out the work, data comes in and exits through dedicated I/O nodes, and large-scale smart multi-homed memory subsystems keep the intermediate data at hand at all times.

How did Abacus Semiconductor Corporation begin?
Axel is a physicist and computer scientist by training, and as such has used supercomputers for decades and was frustrated by the complexity of deploying and using them, by the lack of linear scaling, and by the enormous cost associated with them. As a result, he set out to fix what could be fixed, always assuming a few basic fundamentals. He started on this journey with Parimics, a vision processor company in 2004, and then with Scalable Systems Research Labs, Inc (SSRLabs) in 2011, with a short detour to a secure processor startup, and now to Abacus Semiconductor Corporation in 2020.

A modern supercomputer should allow the easy integration of accelerators in both hardware and software, it should be able to provide very large memory configurations in both exclusive and shared memory partitions, and it should be on par in cost with COTS-based systems while keeping operating costs down. The integration of accelerators for numerically intensive applications – for matrix and tensor math, and for Artificial Intelligence (AI) and Machine Learning (ML) – as well as the need for very large cache-coherent memory shared across many processors have proven to be good, future-proof calls, as today’s requirements for GPT-3 and ChatGPT call for memory arrays of sizes that are not supported in today’s processors.

As a computer scientist, it was clear to Axel that fixed-function devices provide a vastly superior performance, use less power and less silicon real estate than programmable elements, and as such a modern supercomputer should allow for the integration of all kinds of accelerators while keeping the programmability of a processor at hand for orchestration of workloads and for executing those tasks for which no hardware exists.

You mentioned you have some recent developments to share. What are they?
We are very excited to let you know that we have assessed all of the code and the building blocks that we have created over more than a decade, and our requirements are all met. With our Server-on-a-Chip, our smart multi-homed memory subsystems, and our math and database accelerators, we have shown in simulations that we will achieve a vastly better linearity of scale-out. For most applications and configurations, it seems that we will hit an 80% scale-out factor, i.e. a supercomputer consisting of 100,000 servers should provide roughly 80,000 times the performance of a single one. Our interface will provide enough bandwidth per pin to allow for over 3.2 TB/s of bandwidth into and out of our accelerators and processors. The smart multi-homed memory subsystem will provide nearly 1 TB/s of bandwidth into and out of the chip. The security and coherency domains can be set for each memory subsystem. We have made progress in building our team – both engineering and management – and we have a term sheet in hand. We are still assessing the validity and veracity of this term sheet, but at this point in time the conditions look good.
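To put those scale-out figures side by side, here is a trivial model that applies a constant per-node efficiency factor, which is the simplest possible reading of the claim; the 5% and 80% factors come straight from the numbers quoted in this interview.

```python
# Simplest-possible scale-out model: aggregate performance in
# single-server equivalents under a constant efficiency factor.
def effective_servers(n_servers, scale_out_factor):
    return n_servers * scale_out_factor

print(effective_servers(100_000, 0.05))  # ~5,000x: the COTS cluster cited above
print(effective_servers(100_000, 0.80))  # ~80,000x: the claimed 80% scale-out
```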

Tell us about these new chips you are building?
As stated before, we believe that in order to build a new generation of supercomputers, new processors, accelerators and smart multi-homed memories are needed. We also touched on the fact that today’s cores are incredibly good, and that the problem in supercomputers is not the processor cores but nearly everything around them. We are using RISC-V processor cores that we modified as the basic programmable building element. Doing that allows us to partake in the growth of the ecosystem around RISC-V, which I believe shows the fastest growth of any processor that I have seen in my career. We removed all of the performance-limiting factors around RISC-V, added hardware support for virtualization and hypervisors, optimized the cache interfaces, and made sure it can connect to our internal processor information superhighway. We are also using accelerators for all I/O and legacy interfaces, and because we do this in a Lego-like fashion, these blocks are being reused in our Server-on-a-Chip, in our integer database processor, and in the orchestration processor, the latter two being in fact the same hardware with different firmware. The Lego-like principles extend to our smart multi-homed memory subsystem as well. As such, our development effort is relatively low compared to other companies that focus on processor design and supercomputers. Due to our philosophy of parallelism instead of cranking up clock frequencies, we do not need to spend tons of money on the old cat-and-mouse game of physical design, with dynamic timing closure going through multiple iterative rounds to squeeze out one more hertz of clock frequency. All of that simplifies code and building-block reuse, and that is why we try to build our own IP in-house and keep it that way.

What are the chips in the Abacus Semi family?
The chips we are designing are the Server-on-a-Chip that effectively combines an entire server onto one processor, the identical Supercomputer I/O Frontend, an Orchestration Processor and an Integer Database Processor (both of which deploy the same hardware but use different firmware), a math accelerator, and a set of smart multi-homed memories.

How are the Abacus Semi chips programmed?
Since we use a RISC-V processor as the underlying programmable element, we can call on the existing ecosystem. Our Server-on-a-Chip, the integer database processor and the orchestration processor are all fully RISC-V Instruction Set Architecture compatible. In other words, they all run Linux and FreeBSD, with GCC and LLVM/CLANG as compilers available for a while now. In fact, the entire LAMP (Linux/Apache/mySQL/PHP) and FAMP (FreeBSD/Apache/mySQL/PHP) stack is available for them, and as such, any PHP and Perl application runs on them unchanged. Due to the fact that we use a DPU-plus approach to networking, we have a piece of firmware available for our processors that acts like a filtering Network Interface Card (NIC) with offload capabilities and with DMA and Remote DMA functions, as well as with direct memory access to the applications processors. A similar offload for mass storage is available and offloads the applications processors from mass storage tasks, thereby making more of the applications processors’ time available for the user applications, with or without a hypervisor. Since the Server-on-a-Chip doubles as an I/O frontend for supercomputers, the supercomputer core does not need to carry out I/O or legacy interface functions; these are all relegated to the Server-on-a-Chip. That allows the users of a supercomputer to deploy the core in a bare-metal fashion, if so desired. The math accelerator for matrix and tensor math as well as for transforms uses openACC and openCL as outward-facing APIs, but we have a translation layer available that converts CUDA into our native command set.

Can you tell us more about your technology behind the scale-out improvement?
We believe that communication is key in scale-out – more importantly, low-latency and high-bandwidth communication. As a result, we reviewed everything we had built for unnecessary layers of communication hierarchy introduced through bridges, interface adapters and interface converters. We removed as many of them as possible. As a result, the communication between any two or more elements in our architecture provides the highest possible bandwidth given the restrictions in bump and ball count and the need to traverse Printed Circuit Boards (PCBs), which necessitates CML-type high-speed serial links. However, we use the shortest possible FLITs and commensurate encoding, both of which enable optical and electrical communication. The interface that we have designed is available for broader adoption by anyone who is interested in using it, for a nominal licensing fee. It is wide enough to provide class-leading bandwidth while allowing resilience and error-detection features for system availability in the six-nines region. It is also a smart interface in that it can autonomously recognize the topology of the network up to three levels deep in the hierarchy, and it is designed to live on its own chiplet in case we find a partner that wants it but cannot design it into their own designs.

When will the Abacus Semi chips be available?
We are working with customers and partners to ensure a prototype tapeout in Q3 of 2025, with a volume-production version set for FCS in Q1 of 2026.

Also Read:

CTO Interview: John R. Cary of Tech-X Corporation

Semiwiki CEO Interview: Matt Genovese of Planorama Design

CEO Interview: Dr. Chris Eliasmith and Peter Suma, of Applied Brain Research Inc.


Speeding up Chiplet-Based Design Through Hardware Emulation
by Kalar Rajendiran on 02-16-2023 at 10:00 am

Barriers on the Continuum to SiP

The first chiplets-focused summit took place last month. Many accomplished speakers gave keynote talks on the direction the chiplet ecosystem should and would take. Corigine presented a keynote on how hardware emulation should and would evolve to speed up chiplet-based designs. During a pre-conference tutorial session, Corigine shared customer-based case studies to highlight how Corigine’s MimicPro prototyping and emulation solutions addressed challenges introduced by chiplet-based designs.

The Chiplet Summit introduced a new tag line, “Chiplets Make Huge Chips Happen.” With large monolithic SoCs losing favor in the face of Moore’s Law slowing down, the new tag line highlights how chiplets make large SoCs possible. Of course, tag lines by themselves don’t make things happen. It takes an ecosystem, the companies within the ecosystem and the people at these companies that make things happen. One of those companies is Corigine. Corigine is a fabless semiconductor company that designs and delivers leading edge EDA tools.

Corigine presented insightful thoughts and discussed their innovative solutions during various sessions at the conference. If you missed these sessions, the following is a synthesis of the salient points from those sessions.

Chiplet-based Design Benefits, Challenges and Solutions

Aside from the economic benefit, derived from a yield perspective, compared to a large monolithic SoC, chiplets bring many additional benefits to the table: architectural partitioning, enabling of re-use, faster time-to-market, and product-family scalability. Of course, there are many challenges too. The following diagram shows the continuum of barriers when implementing a chiplet-based chip.

With Corigine’s focus on addressing the front-end barriers, the following are its lessons learned during the course of its chiplet-based data processing unit (DPU) chip development work.

Chiplets-based Chip Development and Emulation Requirements

A key consideration for a chiplet is the decision on where to place its various I/O ports. This of course is driven by system requirements such as machine learning (ML) processing functionality and datapath SIMD or MIMD organization. With an effective architectural decomposition of the system, the next set of requirements revolves around the interconnect’s attributes. The interconnects should be open, extensible and backwards compatible. For example, as UCIe is being driven as a standard for D2D interconnects, the evolving UCIe standard must remain backwards compatible: UCIe V2.0 should also support V1.0-based chiplets.

With the interconnects addressed, the next requirement is a pre-tapeout platform to support integration and verification of heterogeneous chiplets. The platform should be able to support a very large number of transistors and ensure IP protection and segmentation. Finally, none of the above matter if silicon and software co-development cannot be accomplished rapidly and successfully. The co-development platform must provide built-in logic analyzers with complex trigger mechanisms to capture waveforms during software debug.

Corigine’s MimicPro Prototyping and Emulation Solutions

To address the co-development platform, Corigine developed a series of FPGA-based prototyping and emulation platforms by working with the silicon and software teams developing their own chiplet-based DPU chip. These platforms are essentially combined prototyping and emulation systems that can provide faster software turnaround time. They include functionality for collecting and analyzing data and introducing design-for-test and design-for-manufacturing features, thereby enabling software verification before tapeout.

The MimicPro solutions deliver an order-of-magnitude performance improvement over traditional emulators of similar class. Corigine’s patented distributed routing and fine-grain multi-user clocking enable linear performance scaling irrespective of the size of a block being emulated. The dedicated scalable clock/routing infrastructure enables higher utilization of resources for logic emulation.

Corigine MimicPro was initially optimized for performance and scalability, then enhanced with visibility, portability and security. It essentially combines rich debugging features, confidential-information protection, and the 10-100MHz performance of prototyping. It continues to grow with Corigine’s in-house SmartNIC / data processing unit chiplet design.

The following chart showcases the resource utilization efficiency of a MimicPro system in real-life use on a SmartNIC design.

The following is what Corigine is addressing for chiplets with its MimicPro solutions.

MIMIC Product Information

MimicPro™ 32

The Corigine MimicPro Prototyping System provides performance and speed for ASIC and software development for both enterprise and cloud operation, with utmost security and scalability. The MimicPro solution scales from 4 to 32 FPGAs. The system also provides easy upgradeability to the latest available FPGAs. The Corigine MimicPro system is the industry’s next-generation platform for automating prototyping, including previously manual partitioning operations, while providing a system-level view for optimum partitioning and performance. In addition, the MimicPro system adds deep local debug capabilities, providing much greater visibility and faster elimination of bugs. Thus, the MimicPro system reduces overall development time and cost-effectively accelerates software development without dependence on costly emulation.

For more detailed MimicPro™-32 information, you can refer to Corigine’s product page.

MimicTurbo™ GT Card

Corigine’s MimicTurbo GT card, based on the UltraScale+™ VU19P FPGA, is designed to simplify the deployment of FPGA-based prototyping at the desktop. Each card can support up to 48 million ASIC gates, has onboard DDR4 component memory, and can be configured to operate with additional connected MimicTurbo GT cards. The card supports 64 GTY transceivers (16 Quads) along with the essential I/O interfaces.

The Corigine MimicTurbo GT board is available from the Xilinx website. You can find more detailed product information on the AMD/Xilinx FPGA-based Corigine MimicTurbo GT card on this page.

Corigine at DVCon US 2023

Corigine is at DVCon demonstrating its MimicPro-32 this month.
Time: February 27th – March 1st
Location: DoubleTree by Hilton Hotel San Jose
Registration: https://dvcon.org/registration/

Also Read:

Alphawave Semi at the Chiplet Summit

Who will Win in the Chiplet War?

The Era of Chiplets and Heterogeneous Integration: Challenges and Emerging Solutions to Support 2.5D and 3D Advanced Packaging


ML-Based Coverage Acceleration. Innovation in Verification
by Bernard Murphy on 02-16-2023 at 6:00 am

We looked at another paper on ML-based coverage acceleration back in April 2022. Here is a different angle from IBM. Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome. And don’t forget to come see us at DVCon, first panel (8am) on March 1st 2023 in San Jose!

The Innovation

This month’s pick is Using DNNs and Smart Sampling for Coverage Closure Acceleration. The authors presented the paper at the 2020 MLCAD Workshop and are from IBM Research in Haifa and the University of British Columbia in Canada.

The authors’ intent is to improve coverage of events that have been hit only rarely. They demonstrate their method on a CPU design, based on refining instruction set (IS) test templates for an IS simulator. Especially interesting in this paper is how they manage optimization on very noisy, low-statistics data where conventional gradient-based comparisons are problematic. They suggest several methods to overcome this challenge.

Paul’s view

Here is another paper on using DNNs to improve random instruction generators in CPU verification which, given the rise of Arm-based servers and RISC-V, is becoming an increasingly hot topic in our industry.

The paper begins by documenting a baseline non-DNN method to improve random instruction coverage. This method works by randomly tweaking instruction generator parameters and banking the tweaks if they improve coverage. The tweaking process is based on a gradient-free numerical method called implicit filtering (see here for a good summary), which works kind of like zoom out-then-in search: start with big parameter tweaks and zoom in to smaller parameter tweaks if the big tweaks don’t improve coverage.
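For readers who want the flavor of implicit filtering, here is a minimal, self-contained Python sketch. The coverage function is a noisy toy objective standing in for running simulations and counting event hits; it is not the paper’s actual harness.

```python
# Minimal sketch of implicit filtering: a gradient-free stencil search that
# shrinks its step size ("zooms in") whenever no tweak at the current scale
# improves the objective. coverage() is a hypothetical noisy stand-in.
import random

def coverage(params):
    """Toy objective: higher is better (think rare-event hit counts)."""
    return -sum((p - 0.3) ** 2 for p in params) + random.gauss(0, 0.01)

def implicit_filtering(x, step=0.5, min_step=0.01, shrink=0.5, max_iters=200):
    best = coverage(x)
    for _ in range(max_iters):
        if step <= min_step:
            break
        improved = False
        for i in range(len(x)):            # stencil: tweak each parameter
            for delta in (+step, -step):
                trial = list(x)
                trial[i] = min(1.0, max(0.0, trial[i] + delta))
                score = coverage(trial)
                if score > best:           # bank the tweak if coverage improves
                    x, best, improved = trial, score, True
        if not improved:
            step *= shrink                 # zoom in: big tweaks stopped helping
    return x, best

print(implicit_filtering([0.9, 0.1, 0.5]))
```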

The authors then accelerate their baseline method using a DNN to assess if the parameter tweaks will improve coverage before going ahead with costly real simulations to precisely measure the coverage. The DNN is re-trained after each batch of real simulations, so it is continuously improving.

The paper is well written, and the formal justification for their method is clearly explained. Results are presented on two arithmetic pipes of the IBM NorthStar processor (5 instructions and 8 registers). It’s a simple testcase and sims are run for only 100 clock cycles measuring only 185 cover points. Nevertheless, the results do show that the DNN-based method is able to hit all the cover points with half as many sims as the baseline implicit filtering method.  Nice result.

Raúl’s view

As Paul says, we are revisiting a topic we have covered before. In April 2022 we reviewed a paper by Google which incorporated a Control-Data-Flow-Graph into a neural network. Back in December 2021 we reviewed a paper from U. Gainesville using Concolic (Concrete-Symbolic) testing to cover hard-to-reach branches. This month’s paper introduces a new algorithm for coverage-directed test generation combining test templates, random sampling, and implicit filtering (IF) with a deep neural network (DNN) model. The idea is as follows:

As is common in coverage directed generation, the approach uses test templates, vectors of weights on a set of test parameters that guide random test generation. Implicit filtering (IF) is an optimization algorithm based on grid search techniques around an initial guess to maximize chances to hit a particular event. To cover multiple events, the IF process is simply repeated for each event, called the parameter-after-parameter approach (PP). To speed up the IF process, the data collected during the IF process is used to train a DNN, which approximates the simulator and is much faster than simulating every test vector.
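A compact way to see the combined loop: train a cheap surrogate on the (template, coverage) pairs gathered so far, screen candidate templates with it, and spend costly real simulations only on the most promising ones, retraining after every batch. The sketch below uses scikit-learn’s MLPRegressor as a stand-in for the paper’s DNN and a toy simulate function; it is illustrative, not the authors’ code.

```python
# Surrogate-screened test generation sketch; simulate() is a hypothetical
# stand-in for the real (expensive) simulator, not IBM's environment.
import random
from sklearn.neural_network import MLPRegressor

def simulate(template):
    """Costly ground truth: measured coverage for one test template."""
    return -sum((t - 0.3) ** 2 for t in template) + random.gauss(0, 0.02)

history_x, history_y = [], []
surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000)

for batch in range(10):
    # Propose many random template tweaks, but simulate only the ones
    # the surrogate predicts will raise coverage.
    candidates = [[random.random() for _ in range(4)] for _ in range(200)]
    if len(history_x) >= 20:
        scores = surrogate.predict(candidates)
        ranked = sorted(zip(scores, candidates), key=lambda s: -s[0])
        chosen = [c for _, c in ranked[:10]]   # screened by the DNN
    else:
        chosen = candidates[:10]               # cold start: no model yet
    for c in chosen:                           # the expensive part
        history_x.append(c)
        history_y.append(simulate(c))
    surrogate.fit(history_x, history_y)        # retrain after each batch

print(f"best coverage proxy seen: {max(history_y):.3f}")
```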

The effectiveness of the algorithms is evaluated employing an abstract high-level simulator of part of the NorthStar processor. Four algorithms are compared: Random sampling, PP, DNN and combining IF and DNN. The results of three experiments are reported:

  1. Running the algorithms with a fixed number of test templates, up to 400 runs. Combining IF and DNN is superior, missing only up to 1/3 of the hard-to-hit events
  2. Running the algorithms until all hard-to-hit events are covered. IF and DNN converges with half the number of test templates
  3. Running the last algorithm (IF and DNN) 5 times. All runs converge with a similar number of test templates, even the worst using ~30% fewer test templates than the other algorithms

This is a well-written paper on a relevant problem in the field. It is (almost) self-contained, it is easy to follow, and the algorithms employed are reproducible. The results show a reduction of “the number of simulations by a factor of 2 or so” over implicit filtering. These results are based on one relatively simple experiment, NorthStar. I would have liked to see additional experimentation and results; some can be found in other publications by the authors.


The State of FPGA Functional Verification
by Daniel Payne on 02-15-2023 at 10:00 am

Earlier I blogged about IC and ASIC functional verification, so today it’s time to round that out with the state of FPGA functional verification. The Wilson Research Group has been compiling an FPGA report every two years since 2018, so this marks the third time they’ve focused on this design segment. At $5.8 billion the FPGA market is sizable, and it is forecast to grow to $8.1 billion by 2025. FPGAs started out in 1984 with limited gate capacity and have now grown to include millions of gates, processors and standardized data protocols.

Low-volume applications benefit from the minimal NRE of FPGA devices, and engineers can quickly prototype their designs by verifying and validating at speed. FPGAs now include processors, as in the Xilinx Zynq UltraScale+, Intel Stratix and Microchip SmartFusion families. Among the 980 participants in the functional verification study, the FPGA and programmable SoC FPGA design styles are the most popular.

Design Styles

As the size of FPGAs has increased recently, the chance of a bug-free production release has dropped to just 17%, which is even worse than the 30% first-silicon success rate for IC and ASIC projects. Clearly, we need better functional verification for complex FPGA systems.

FPGA bug escapes into production

The types of bugs found in production fall into several categories:

  • 53% – Logic or Functional
  • 31% – Firmware
  • 29% – Clocking
  • 28% – Timing, path too slow
  • 21% – Timing, path too fast
  • 18% – Mixed-signal interface
  • 9% – Safety feature
  • 8% – Security feature

Zooming into the largest category of failure, logic or functional, there are five root causes.

Root Causes

FPGA projects mostly didn’t complete on time, once again because of the larger size of the systems, the complexity of the logic, and even the verification methods being used.

FPGA Design Schedules

Engineers on an FPGA team can have distinct titles like design engineer or verification engineer, yet on 22% of projects there were no verification engineers – meaning that the design engineers did double-duty and verified their own IP. Over the past 10 years there’s been a 38% increase in the number of verification engineers on an FPGA project, so that’s progress towards bug-free production.

Number of engineers

Verification engineers on FPGA projects spent most of their time on debug tasks at 47%:

  • 47% – Debug
  • 19% – Creating test and running simulation
  • 17% – Testbench development
  • 11% – Test Planning
  • 6% – Other

The number of embedded processors has steadily grown over time, and 65% of FPGA designs now have one or more processor cores, increasing the verification effort across hardware/software interfaces and on-chip networks.

Embedded Processors

The ever-popular RISC-V processor is embedded in 22% of FPGAs, and AI accelerators are used in 23% of projects. FPGA designs average 3-4 clock domains, which require gate-level timing simulations plus static Clock Domain Crossing (CDC) tools for verification.

Security features are added to 49% of FPGA designs to protect sensitive data, and 42% of FPGA projects adhere to safety-critical standards or guidelines. On SemiWiki we’ve often blogged about the ISO 26262 and DO-254 standards. Functional Safety (FuSa) design efforts take between 25% and 50% of the overall project time.

Safety Critical Standards

The top three verification languages are VHDL, SystemVerilog and Verilog, but also notice the recent jumps in the Python and C/C++ languages.

Verification Languages

The most popular FPGA methodologies and testbench base-class libraries are Accellera UVM, OSVVM and UVVM. The Python-based cocotb was even added as a new category for 2022.
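For readers unfamiliar with cocotb, a testbench is ordinary Python driving the simulator through coroutines. Below is a minimal sketch; the DUT port names (clk, valid, data) are hypothetical, chosen only to show the style.

```python
# Minimal cocotb testbench sketch; assumes a DUT with clk, valid and data ports.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge

@cocotb.test()
async def smoke_test(dut):
    """Start a clock, drive a value, and check a simple expectation."""
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())  # 100 MHz clock
    dut.valid.value = 0
    await RisingEdge(dut.clk)
    dut.data.value = 0xA5
    dut.valid.value = 1
    await RisingEdge(dut.clk)
    assert dut.data.value == 0xA5, "data did not hold its driven value"
```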

Verification Methodologies

Assertion languages are led by SystemVerilog Assertions (SVA) at 45%, followed by Accellera Open Verification Library (OVL) at 13% and PSL at 11%. FPGA designs may combine VHDL for RTL design along with SVA for assertions.

Formal property checking is growing amongst FPGA projects, especially as more automatic formal apps have been introduced by EDA vendors.

Formal Techniques

Simulation-based verification approaches over the past 10 years show steady adoption, listed in order of relevance: code coverage, functional coverage, assertions, constrained random.

Summary

The low 17% bug-free number for FPGA projects that made it into production in 2022 was the most surprising number to me, as recalling or re-programming a device in the field is expensive and time-consuming. A more robust functional verification approach should lead to fewer bug escapes into production, and dividing the study participants into two groups does show the benefit.

Verification Adoption

Read the complete 18-page white paper here.

Related Blogs


Area-optimized AI inference for cost-sensitive applications
by Don Dingee on 02-15-2023 at 6:00 am

Expedera uses packet-centric scalability to move up and down in AI inference performance while maintaining efficiency

Often, AI inference brings to mind more complex applications hungry for more processing power. At the other end of the spectrum, applications like home appliances and doorbell cameras can offer limited AI-enabled features but must be narrowly scoped to keep costs to a minimum. New area-optimized AI inference technology from Expedera is taking on this challenge, targeting 1 TOPS performance in the smallest possible chip area.

Optimized for one model, but maybe not for others

Fitting into an embedded device brings constraints and trade-offs. For example, many teams concentrate on developing the inference model for an application using a GPU-based implementation, only to discover that no amount of optimization will get them anywhere near the required power-performance-area (PPA) envelope.

A newer approach uses a neural processing unit (NPU) to handle AI inference workloads more efficiently, delivering the required throughput in less die area and power consumption. NPU hardware typically scales up or down to meet throughput requirements, often measured in tera operations per second (TOPS). In addition, compiler software can translate models developed in popular AI modeling frameworks like PyTorch, TensorFlow, and ONNX into run-time code for the NPU.
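The usual hand-off between such frameworks and an NPU compiler is an exported graph, most commonly in ONNX form. As a hedged illustration (vendor flows differ), a PyTorch model can be exported like this, with MobileNetV2 chosen purely as a convenient example:

```python
# Sketch of the framework-to-compiler hand-off: export a PyTorch model to ONNX.
# The file name and model choice are illustrative only.
import torch
import torchvision

model = torchvision.models.mobilenet_v2(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)  # NCHW input shape the compiler will assume

torch.onnx.export(
    model, dummy, "mobilenet_v2.onnx",
    input_names=["input"], output_names=["logits"],
    opset_version=13,
)
print("wrote mobilenet_v2.onnx")  # hand this to the vendor's NPU compiler
```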

Following a long-held principle of embedded design, there’s a strong temptation for designers to optimize the NPU hardware for their application, wringing out every last cent of cost and milliwatt of power. If only a few AI inference models are in play, it might be possible to optimize hardware tightly using a deep understanding of model internals.

Model parameters manifest as operations, weights, and activations, varying considerably from model to model. Below is a graphic comparing several popular lower-end neural network models.

On top of these differences sits the neural network topology – how execution units interconnect in layers – adding to the variation. Supporting different models for additional features or modes leads to overdesigning with a one-size-fits-all NPU big enough to cover performance in all cases. However, living with the resulting cost and power inefficiencies may be untenable.

NPU co-design solves optimization challenges

It may seem futile to optimize AI inference in cost-sensitive devices where models are unknown when the project starts, or where more than one model must run for different modes. But is it possible to tailor an NPU more closely to a use case without enormous investments in design time or the risk of an AI inference model changing later?

Here’s where Expedera’s NPU co-design philosophy shines. The key is not hardcoding models in hardware but instead using software to map models to hardware resources efficiently. Expedera does this with a unique work sequencing engine, breaking operations down into metadata sent to execution units as a packet stream. As a result, layer organization becomes virtual, operations order efficiently, and hardware utilization increases to 80% or more.
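As a purely conceptual illustration of the packet idea (not Expedera’s architecture or API), the sketch below flattens a layered model into a stream of operation packets that any idle execution unit can consume, which is why the layer organization can stay virtual:

```python
# Conceptual sketch only: layers flattened into a packet stream consumed by
# whichever execution unit is free. All names here are hypothetical.
from collections import deque
from dataclasses import dataclass

@dataclass
class Packet:
    op: str        # e.g. "conv", "add", "pool"
    layer: int     # logical layer the op belongs to (layers stay virtual)

def sequence(model_layers):
    """Flatten a layered model description into one packet stream."""
    stream = deque()
    for layer_idx, ops in enumerate(model_layers):
        for op in ops:
            stream.append(Packet(op=op, layer=layer_idx))
    return stream

stream = sequence([["conv", "add"], ["conv", "pool"], ["matmul"]])
units_busy = [0, 0]  # two execution units, each tracking cycles remaining
while stream:
    for u in range(len(units_busy)):
        if stream and units_busy[u] == 0:
            stream.popleft()      # next packet goes to any idle unit
            units_busy[u] = 1     # pretend every op takes one cycle
    units_busy = [max(0, b - 1) for b in units_busy]
print("all packets consumed; utilization depends only on packet supply")
```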

In some contexts, packet-centric scalability unlocks higher performance, but in Expedera’s area-optimized NPU technology, packets can also help scale performance down for the smallest chip area.

Smallest possible NPU for simple models

Customers say a smaller NPU that matches requirements and keeps costs to a minimum can make the difference between having AI inference or not in cost-sensitive applications. On the other hand, a general-purpose NPU might have to be overdesigned by as much as 3x, driving up die size, power requirements, and additional costs until a design is no longer economically feasible.

Starting with its Origin NPU architecture, fielded in over 8 million devices, Expedera tuned its engine for a set of low to mid-complexity neural networks, including MobileNet, EfficientNet, NanoDet, Tiny YOLOv3, and others. The results are the new Origin E1 edge AI processors, putting area-optimized 1 TOPS AI inference performance in soft NPU IP ready for any process technology.

“The focus of the Origin E1 is to deliver the ideal combination of small size and lower power consumption for 1 TOPS needs, all within an easy-to-deploy IP,” says Paul Karazuba, VP of Marketing for Expedera. “As Expedera has already done the optimization engineering required, we deliver time-to-market and risk-reduction benefits for our customers.”

Seeing a company invest in more than just simple throughput criteria to satisfy challenging embedded device requirements is refreshing. For more details on the area-optimized AI inference approach, please visit Expedera’s website.

Blog post: Sometimes Less is More—Introducing the New Origin E1 Edge AI Processor

NPU IP product page: Expedera Origin E1