Scaling AI as a Service Demands New Server Hardware
by Bernard Murphy on 03-14-2023 at 6:00 am


While I usually talk about AI inference on edge devices, for ADAS or the IoT, in this blog I want to talk about inference in the cloud or an on-premises datacenter (I’ll use “cloud” below as a shorthand to cover both possibilities). Inference throughput in the cloud is much higher today than at the edge. Think about support in financial services for fraud detection, or recommender systems for e-commerce and streaming services. Or the hot topic of our time – natural language systems driving chatbots and intelligent virtual assistants, such as ChatGPT. We know that inference, like training, runs on specialized systems: deep learning accelerators (DLAs), built on GPUs, DSPs or other custom hardware based on ASICs or FPGAs. Now we need that capability to serve very high demand, from banks, streaming services, chatbots and many other applications. But current cloud infrastructure is not ready.

Why is Cloud Inference not Ready?

There are already multiple DLA technologies in production use, each offering different strengths. How does a cloud provider simplify user access to a selection of options? DLAs aren’t conventional computers, ready to just plug into a cloud network. They need special care and feeding to maximize throughput and value, and to minimize cost and complexity in user experience. Cloud clients also don’t want to staff expert teams in the arcane details of managing DLA technologies. Providers must hide that complexity behind dedicated front-end servers to manage the interface to those devices.

Scalability is a very important consideration. Big training tasks run on big AI hardware. Multiple training tasks can be serialized, or the provider can add more hardware if there is enough demand and clients are willing to factor the training costs into their R&D budgets. That reasoning doesn’t work for high volume, high throughput inferencing. Inference based on technologies like ChatGPT is aimed at high demand, low margin services, accounted for in client cost of sales. Training supercomputers can’t meet that need given their high capital and operating costs. Supercomputers are not an affordable starting point for inference clients.

High volume demand for general software is handled by virtualization on multi-core CPUs. Jobs are queued and served in parallel to the capacity of the system. Cloud providers like AWS now offer a similar capability for access to resources such as GPUs. You can schedule a virtual GPU, managed through a conventional server offering virtualization to multiple GPUs. Here we’re talking about using these virtual GPUs as DLAs so the server must run a big stack of software to handle all complexities of interfacing between the cloud and the inference back-end. This CPU-based server solution works but also proves expensive to scale.

Why CPU-Based Servers Don’t Scale for Inference

Think about the steps a CPU-based server must run to provide inference as-a-service. A remote client will initiate a request. The server must add that job to the queue and schedule the corresponding task for the next available virtual machine, all managed through a hypervisor.

When a task starts, the virtual machine will first download a trained model because each new task needs a different trained model. A target DLA will be selected; the model will then be mapped to the appropriate DLA command primitives. Some of these will be supported directly by the accelerator; some may need to be mapped to software library functions. That translated model is then downloaded onto the DLA.

Next, large data files or streaming data – images, audio data or text – must be pipelined through to drive inferencing. Image and audio data often must be pre-processed through appropriate codecs. Further, DLAs have finite capacity so pipelining is essential to feed data through in digestible chunks. Results produced by the DLA will commonly require post-processing to stitch together a finished inference per frame, audio segment or text block.
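
To make that overhead concrete, here is a minimal Python sketch of the per-task flow described above. It is purely illustrative: every function and name below is a hypothetical placeholder, not NeuReality's or any other vendor's actual API.

```python
# Minimal sketch of the per-task software flow a CPU-based inference server
# runs; every name here is a hypothetical placeholder, not a real API.
from queue import Queue

def download_model(model_id):
    return f"trained-model-{model_id}"          # each task may need a different model

def compile_for_dla(model, dla):
    # Map the model to DLA command primitives; ops the accelerator cannot run
    # directly fall back to software library functions.
    return {"native_ops": model, "sw_fallback_ops": [], "target": dla}

def preprocess(chunk):                           # e.g. codec decode, resize, tokenize
    return chunk

def run_on_dla(commands, chunk):                 # dispatch one digestible chunk
    return f"raw-result({chunk})"

def postprocess(result):                         # stitch a finished inference per frame/segment
    return result

def serve_request(request, dla):
    model = download_model(request["model_id"])
    commands = compile_for_dla(model, dla)
    results = []
    # DLAs have finite capacity, so data is pipelined through in chunks.
    for chunk in (request["data"][i:i + 4] for i in range(0, len(request["data"]), 4)):
        results.append(postprocess(run_on_dla(commands, preprocess(chunk))))
    return results

# Hypervisor-managed queue feeding virtual machines, simplified to a plain queue.
job_queue = Queue()
job_queue.put({"model_id": "chatbot-intent", "data": list(range(10))})
while not job_queue.empty():
    print(serve_request(job_queue.get(), dla="virtual-dla-0"))
```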

Production software stacks to serve inferencing are very capable but demand a lot of software activity per task to initiate and feed a DLA, and to feed results back. The overhead per virtual machine per task is high and will become higher still under heavy loads. Worse yet, inferencing traffic is expected to be high throughput with low turn-around times from request to result. High demand puts higher loads on shared services, such as the hypervisor, which becomes visible in progressively slower response times as more tasks are pushed through.

An Architecture for a Dedicated AI Inference Server

A truly competitive alternative to a general-purpose server solution must offer significantly higher throughput/lower latency, lower power, and lower cost, through the path from network interface to the sequencing function, codecs and offload to AI engines. We know how to do that – convert from a largely software centric implementation to a much more hardware centric implementation. Function for function, hardware runs faster and can parallelize much more than software. A dedicated hardware-based server should be higher throughput, more responsive, lower power and lower cost than a CPU-based server.

NeuReality, a startup based in Israel, has developed such an AI server-on-a-chip solution, realized in their NR1 network-attached processing unit (NAPU). This hosts a network interface, an AI-hypervisor handling the sequencing, hardware-based queue management, scheduling, dispatching, and pre- and post-processing, all through embedded heterogeneous compute engines. These couple to a PCIe root-complex (host) with 16-lane support to DLA endpoints. The NAPU comes with a full hardware-accelerated software stack: to execute the inference model on a DLA, for media processing and to interface to the larger datacenter environment. The NR1-M module is available in multiple form factors, including full-height single-width and full-height double-width PCIe cards containing an NR1 NAPU system-on-chip connecting to a DLA. NR1-S provides a rack-mount system hosting 10 NR1-M cards and 10 DLA slots to provide disaggregated AI service at scale.

NeuReality has measured performance for NR1, with IBM Research guidance, for a variety of natural language processing applications: online chatbots with intent detection, offline sentiment analysis in documents, and online extractive Q&A. Tests were run under realistic heavy demand loads requiring fast model switching, comparing CPU-centric platforms with NR-based platforms. They have measured 10X better performance/$ and 10X better performance/Watt than comparable CPU-server-based solutions, directly lowering CAPEX and OPEX for the cloud provider and therefore increasing affordability for client inference services.

These are the kind of performance improvements we need to see to make inference as a service scalable. There’s a lot more to share, but this short review should give you a taste of what NeuReality has to offer. They already have partnerships with IBM, AMD, Lenovo, Arm, and Samsung. Even more impressive, they only recently closed their series A round! Definitely a company to watch.


MIPI D-PHY IP brings images on-chip for AI inference
by Don Dingee on 03-13-2023 at 10:00 am


Edge AI inference is getting more and more attention as demand grows for AI processing across an increasing number of diverse applications, including those requiring low-power chips in a wide range of consumer and enterprise-class devices. Much of the focus has been on optimizing the neural network processing engine for these smaller parts and the models they need to run – but optimization has a broader meaning in many contexts. In an image recognition use case, the images must come from somewhere, usually from a sensor with a MIPI interface. So, it makes sense to see Perceive integrating low-power MIPI D-PHY IP from Mixel on its latest Ergo 2 Edge AI Processor, bringing images on-chip for AI inference.

Resolutions and frame rates on the rise

AI processors have beefed up to the point where they can now handle larger images off high-resolution sensors at impressive frame rates. It’s crucial to be able to run inferences and make decisions quickly, keeping ahead of real-time changes in scenes. In view of this, Perceive has put considerable emphasis on the image processing pipeline in the Ergo 2.

Ergo 2 Edge AI Processor system diagram, courtesy Perceive

Large images with a lot of pixels present a fascinating challenge for device developers. In a sense, image recognition is a misnomer. Most use cases where AI inference adds value call for looking at a region of interest, or a few of them, with relatively few pixels wrapped inside a much larger image filled with mostly uninteresting pixels. Spotting those regions of interest sooner and more accurately determines how well the application runs.

The Ergo 2 image processing unit has dual, simultaneous pipelines that can isolate interesting pixels, making it easier for AI models to handle perception. The first pipeline supports four regions of interest in a max image size of 4672 x 3506 pixels at 24 frames per second (fps). The second pipeline can target a single region in a 2048 x 1536 pixel image coming in at 60 fps. The IPU also handles image-wide tasks like scaling, range compression, rotation, distortion and lens shading correction, and more.

Lost frames can throw off perception

Excessive noise or jitter in these fast, high-resolution images can lead to frame loss due to data errors. Lost frames in an image stream can impact the accuracy of inference operations, leading to missed or incorrect perceptions. Reliable image transfer that holds up to challenging environments is a necessity for accurate perception at the edge.

A defining feature of the Mixel MIPI D-PHY IP is its clock-forwarded synchronous link that provides high noise immunity and high jitter tolerance. In the Ergo 2, three different MIPI IP solutions are at work: a four-lane CSI-2 TX, a two-lane CSI-2 RX, and a four-lane CSI-2 RX. Each IP block integrates a transmitter or receiver and a 32-bit CSI-2 controller core. Links run up to 2.5 Gbps, with a typical eye pattern shown next.
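
As a rough sanity check, the sketch below estimates the raw pixel bandwidth of the larger imaging pipeline against the aggregate link rate quoted above. The 10-bit pixel depth and the omission of blanking and CSI-2 packet overhead are my simplifying assumptions, not Mixel or Perceive figures.

```python
# Back-of-envelope link budget: can a 4-lane, 2.5 Gbps/lane CSI-2 receiver
# carry the Ergo 2's larger imaging pipeline? Assumes 10-bit raw pixels and
# ignores blanking and CSI-2 packet overhead (both are simplifications).
width, height, fps = 4672, 3506, 24      # larger pipeline, from the article
bits_per_pixel = 10                      # assumed RAW10 sensor output
payload_gbps = width * height * fps * bits_per_pixel / 1e9
link_gbps = 4 * 2.5                      # four lanes at up to 2.5 Gbps each
print(f"pixel payload ≈ {payload_gbps:.1f} Gbps, raw link rate ≈ {link_gbps:.1f} Gbps")
# ≈ 3.9 Gbps of pixel data against 10 Gbps of lane capacity, leaving headroom
# for protocol overhead and the second 2048 x 1536 @ 60 fps stream (~1.9 Gbps).
```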

First-pass success makes or breaks smaller chips

A flaw appearing in a large SoC isn’t fun, and a redesign can be expensive. However, a bigger SoC project tends to have a bigger design team, a longer schedule, and a bigger budget. On a smaller chip, a bust can stop a project in its tracks, with debug and re-spin costs quickly escalating to more than the initial development cost.

Although first-pass success isn’t a given in the semiconductor business, Perceive was able to achieve it with the Mixel IP. Mixel supported Perceive with compliance testing, enabling the fully integrated design to undergo rigorous MIPI interface characterization before the SoC moved to high-volume production. Mixel MIPI D-PHY IP contains pre-driver and post-driver loopback and built-in self-test features for exercising transmit and receive interfaces.

The result for Perceive of integrating Mixel’s MIPI D-PHY IP was hitting power, performance, and cost targets for the Ergo 2. Perceive’s customers, in turn, can implement Ergo 2 in smaller, power-constrained devices where battery life is a key metric, but AI inference performance has to be uncompromised. It’s a good example where bringing images on-chip for AI inference with carefully crafted integration contributes to savings at the small-system level.

For more information:

Perceive: Ergo 2 AI processor

Mixel: MIPI D-PHY IP core

Also Read:

MIPI bridging DSI-2 and CSI-2 Interfaces with an FPGA

MIPI in the Car – Transport From Sensors to Compute

A MIPI CSI-2/MIPI D-PHY Solution for AI Edge Devices


SPIE Advanced Lithography Conference 2023 – AMAT Sculpta® Announcement
by Scotten Jones on 03-13-2023 at 8:00 am


The SPIE Advanced Lithography Conference is the semiconductor industry’s premier conference on lithography. The 2023 conference was held the week of February 27th, and at the conference Applied Materials announced their Sculpta® pattern shaping tool. Last week I had an opportunity to interview Steven Sherman, Managing Director and General Manager of the Advanced Products Group, and discuss the new tool.

Introduction

The resolution of an exposure system is given by the Rayleigh Criterion:

R = k1λ/NA

where k1 is a process-related factor, λ is the exposure wavelength, and NA is the numerical aperture of the optical system.
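
As a worked example, plugging in today’s EUV numbers with an assumed k1 of 0.5 (a fairly typical single-exposure value, my assumption rather than a quoted figure) reproduces the roughly 20nm limit discussed below:

```python
# Worked example of the Rayleigh Criterion for today's 0.33 NA EUV scanners.
# Wavelength and NA are the standard EUV numbers; k1 = 0.5 is an assumed,
# fairly typical single-exposure process factor.
wavelength_nm = 13.5   # EUV exposure wavelength
na = 0.33              # numerical aperture of current EUV optics
k1 = 0.5               # assumed process-related factor
resolution_nm = k1 * wavelength_nm / na
print(f"R = {k1} x {wavelength_nm} / {na} ≈ {resolution_nm:.1f} nm")   # ≈ 20 nm
```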

The semiconductor industry is continually driving to smaller dimensions to enable greater transistor/bit density. With EUV delayed for many years, DUV was extended by a variety of techniques where multiple exposures were combined to create a higher resolution pattern than a single exposure could produce. Once EUV entered production, multi patterning was in many cases replaced by a single EUV exposure.

From the Rayleigh Criterion the ultimate resolution for the current 0.33 NA EUV systems should be approximately 20nm, but we are currently far from realizing that. ASML tests EUV systems at the factory to 26nm on a flat wafer with a simple one-dimensional pattern, but in production 30nm is the current practical limit and even then, there are circumstances where tight tip-to-tip requirements can require an extra cut or block mask. With a single EUV exposure, tip-to-tip spacings are currently limited to approximately 25 to 30nm and a second EUV mask is required to get to a 15 to 20nm tip-to-tip. Next generation processes will require several EUV multi patterning layers. The Sculpta® tool is designed to address this situation.

Applied Materials Presentation

In their presentation Applied Materials describes two common cases where two EUV masks are used to create a pattern. The first is where dense interconnect lines are formed by an EUV mask and then a second EUV mask is used to cut or block the pattern to achieve tight tip-to-tip in the orthogonal direction. The second case is where one EUV mask is used to create an array of elongated contacts and then a second EUV mask is used to create a second array of contacts with tight tip-to-tip spacing relative to the first array. Elongated contacts are desirable for reduced sensitivity to placement errors versus lines.

In their presentation Applied Materials illustrates a simplified Litho-Etch Litho-Etch process flow that uses two EUV masks combined with two etches to create a pattern, see figure 1.

Figure 1. EUV Double Patterning Process Flow.

In figure 1, two litho-etch passes are illustrated. In each pass a deposition step deposits a film, the film is planarized, a lithography pattern is formed and measured, the pattern is etched into the film and then cleaned and measured again. Applied Materials characterizes each litho-etch pass as costing approximately $70.

I write for SemiWiki as a sideline to my “day job” building cost models for the semiconductor industry. A single EUV exposure costs about 2x the $70 listed for the entire litho-etch pass, so the overall litho-etch pass cost is several times what Applied Materials is conservatively estimating at $70. Eliminating an EUV exposure and its associated processing has a lot of value.
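
To illustrate the direction of the savings, here is a deliberately rough per-wafer sketch. Only the $70 pass figure comes from Applied Materials; the ~2x exposure multiplier reflects my comment above, and the added-processing and Sculpta® shaping-pass costs are hypothetical placeholders, not quoted numbers.

```python
# Deliberately rough per-wafer illustration of why eliminating a litho-etch
# pass matters. Only the $70 pass figure is Applied Materials'; the exposure
# multiplier is the author's ~2x estimate, and the remaining numbers are
# purely hypothetical placeholders.
amat_le_pass_estimate = 70.0                       # Applied Materials' per-pass estimate, $/wafer
euv_exposure_alone = 2 * amat_le_pass_estimate     # exposure alone is ~2x that figure
other_pass_steps = 70.0                            # dep/CMP/etch/clean/metrology, placeholder
realistic_le_pass = euv_exposure_alone + other_pass_steps
sculpta_shaping_pass = 30.0                        # hypothetical shaping-pass cost, $/wafer

le_le_flow = 2 * realistic_le_pass                 # two full EUV litho-etch passes
le_plus_shaping_flow = realistic_le_pass + sculpta_shaping_pass
print(f"LE-LE ≈ ${le_le_flow:.0f}/wafer vs LE + shaping ≈ ${le_plus_shaping_flow:.0f}/wafer")
```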

The Sculpta® tool is an etch tool built on the Applied Materials Centura® platform that uses an angled reactive ribbon beam to elongate a pattern in a hard mask. The two examples discussed were:

  1. Form a grating of lines and spaces with a relatively wide tip-to-tip spacing, then use the Sculpta® tool to shrink the tip-to-tip spacing, see figure 2.
  2. Form a dense array of round contact holes and then use the Sculpta® tool to elongate the contact holes, see figure 3.

In both cases a Litho-Etch Litho-Etch process with two EUV exposures and associated processing is reduced to a single EUV Litho-Etch process followed by a Sculpta® tool shaping process.

Figure 2. Pattern Shaping of Interconnect Lines.

Figure 3. Pattern Shaping of Contacts.

Figure 4 illustrates the Sculpta® tool; it is a Centura® cluster tool with four ribbon beam etch chambers.

Figure 4. Centura® Sculpta® tool.

Applied Materials stated that the Sculpta® tool is the tool of record for multiple layers at a leading-edge logic customer, and they included positive quotes from Intel and Samsung in their press release.

Analysis

The first thing to understand is that the Sculpta® tool addresses tip-to-tip spacing and is not a general resolution enhancement solution. If you need resolution below 30nm today, you would still be looking at EUV multi patterning, for example EUV based SADP. Sculpta® does not “fix” the fundamental 0.33 NA EUV limitations and does not eliminate the need for High-NA EUV tools in the future. It can, however, eliminate some EUV masks used to create tight tip-to-tip spacings, which can help alleviate the EUV exposure system shortage and save on cost, process complexity and possibly environmental impact.

This brings up the question of cost savings. A standard 4-chamber cluster tool etcher should cost in the ten-million-dollar range. The Sculpta® tool may have specialized chambers that add cost, but I would be surprised if it costs more than $15 million (Applied Materials did not provide any guidance on this). For 2022 the average ASP for an EUV system from ASML was nearly $200 million, based on ASML’s financial reports. Add to that deposition, etch, CMP, cleaning, inspection and metrology equipment, compare that to a Sculpta® tool plus some inspection and metrology tools and possibly a cleaning tool, and the capital cost saving should be substantial. The key question is the throughput of the Sculpta® tool. I asked Applied Materials about this and was told it depends on the amount of shaping and the hard mask material being used (the pattern shaping is done to the hard mask before the pattern is etched into the final film). Due to the required precision I wouldn’t be surprised if the etch times are relatively long and therefore the tool throughput is relatively low, but it would have to be incredibly slow for the Sculpta® tool not to be a much less expensive option than an EUV litho-etch loop. The other question is what the practical process limits of the technique are in terms of where it can be applied. The fact that it has already been adopted for multiple layers at at least one major logic producer argues that it is a promising solution.

Conclusion

In conclusion I see this as a useful addition to the lithographer’s tool set. It is probably not revolutionary but will nicely augment the capability of EUV tools and could see wide adoption for leading edge logic and DRAM fabrication.

Also Read:

IEDM 2023 – 2D Materials – Intel and TSMC

IEDM 2022 – Imec 4 Track Cell

IEDM 2022 – TSMC 3nm

IEDM 2022 – Ann Kelleher of Intel – Plenary Talk


Podcast EP147: CacheQ’s Harnessing of Heterogeneous Compute with Clay Johnson
by Daniel Nenni on 03-10-2023 at 10:00 am

Dan is joined by Clay Johnson, CEO and co-founder of CacheQ Systems. Clay has more than 25 years of executive management experience across a broad spectrum of technologies including computing, security, semiconductors and EDA tools.

Dan discusses the CacheQ QCC development platform with Clay. This platform enables software developers to deploy and orchestrate applications using new compute architectures, such as multicore devices and heterogeneous distributed compute. The result is a significant increase in performance, reduced power and dramatically reduced development time.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


Cadence Hosts ESD Alliance Seminar on New Export Regulations Affecting EDA and SIP March 28
by Bob Smith on 03-10-2023 at 6:00 am


Anyone interested in learning about general trade compliance concepts or how export control and sanction regulations affect the electronic systems design ecosystem will want to attend the upcoming ESD Alliance export seminar. It will be hosted by Ada Loo, chair of the ESD Alliance Export Committee and Cadence’s Group Director and Associate General Counsel, and held March 28 at Cadence’s corporate headquarters.

The seminar will feature the Cadence Government and Trade Group, including Ada and William Duffy, Cadence’s Corporate Counsel. They will discuss why and how governments implement trade controls, what “exports” are and how they take place in different business contexts. The seminar will cover common due diligence methods, such as customer screening that U.S. companies use to incorporate regulatory compliance into their business processes. It will also highlight recent regulatory updates that address current issues such as U.S.-China trade relations and anticipated effects of those regulations on the U.S. semiconductor design ecosystem.

Attendees will get an overview of relevant regulators and regulations, including a high-level summary of the U.S. government’s publicly stated policy positions and how they influence the development of export controls that affect the EDA industry. The seminar will focus on the Export Administration Regulations (EAR) with a discussion on the International Trafficking in Arms Regulations (ITAR) and Office of Foreign Assets Control (OFAC) sanctions programs. It will look at what is and is not regulated under the EAR, what constitutes an “export,” “reexport,” “release,” and “transfer,” how tangible and intangible exports differ, and scenarios where exports can take place inadvertently.

General prohibitions and regulated activities under the EAR will be covered, as will the facts compliance officers should almost always know, or be willing to ask about, for any given transaction. Business implications that they must consider when evaluating transactions and instituting effective processes to backstop compliance will also be addressed.

Another topic will be how business operations can and should be structured to accommodate export compliance regulations, including questions that compliance personnel should frequently ask themselves and business partners. The seminar will examine where export compliance considerations affect sales, operations, customer support and product development and evaluate effective compliance programs as described by the Bureau of Industry and Security (BIS) compliance guidelines. It will offer ways to identify red flags in customer behavior and show how companies can protect themselves by asking the right questions, having the right contract clauses, and asking for the right assurances from uncertain customers. The implications of falling short in compliance and how to investigate potential escapes internally and with outside counsel will be discussed.

The breakfast meeting will be held Tuesday, March 28, from 8:30am until 11:30am at Cadence’s corporate headquarters in San Jose.

Tickets are $100 for members and $125 for non-members. Member pricing is offered to individuals and companies that are active SEMI members. Visit the ESD Alliance website to register: www.esd-alliance.org

Please contact me if your company is interested in learning about the ESD Alliance, a SEMI Technology Community, and the range of our programs. We represent members in the electronic system and semiconductor design ecosystem and address technical, marketing, economic and legislative issues affecting the entire industry. We act as the central voice to communicate and promote the value of the semiconductor design ecosystem as a vital component of the global electronics industry. I can be reached at bsmith@semi.org.

The Electronic System Design Alliance (ESD Alliance), a SEMI Technology Community, an international association of companies providing goods and services throughout the semiconductor design ecosystem, is a forum to address technical, marketing, economic and legislative issues affecting the entire industry. It acts as the central voice to communicate and promote the value of the semiconductor design industry as a vital component of the global electronics industry.

Follow SEMI ESD Alliance:

www.esd-alliance.org

ESD Alliance Bridging the Frontier blog

Twitter: @ESDAlliance

LinkedIn

Facebook

Also Read:

2022 Phil Kaufman Award Ceremony and Banquet Honoring Dr. Giovanni De Micheli

ESDA Reports Double-Digit Q3 2021 YOY Growth and EDA Finally Gets the Respect it Deserves

ESD Alliance Reports Double-Digit Growth – The Hits Just Keep Coming


AAA Hypes Self-Driving Car Fears
by Roger C. Lanctot on 03-09-2023 at 10:00 am


The AAA (U.S. auto club) must have AGHD (attention-getting deficit disorder). The headline from the organization’s latest research is: “Fear of Self-Driving Cars is on the Rise.” That should straighten things out, right?

The survey was conducted across a representative sample of U.S. households, according to the reported methodology. This was an effective way to enhance the “fear factor” by not focusing the survey specifically on the drivers themselves or whether they own vehicles or intend to buy a vehicle soon. The “fear” that the AAA is targeting is presumably some sort of widespread anxiety among the general population.

Fear is an effective emotion to grab the attention of the press and consumers. There isn’t much else that turns our heads these days, and all forms of media – broadcast, print, online – routinely turn to fear to increase ratings and viewership and to stimulate reactions: likes, shares, thumbs up.

Fear also has a deadening or muffling effect. It is overwhelming and obscuring. Fear means different things to different people and fear isn’t always the right word for how people really feel.

I just returned from the Mobile World Congress in Barcelona. Was I afraid of being pickpocketed or otherwise becoming a victim of some petty crime? Not really afraid. No. “Concerned” would be a better term or “aware” of my surroundings – as I would be in any urban environment. Those feelings would be slightly elevated due to Barcelona’s reputation. Definitely not “afraid.”

AAA could have performed a much more useful public service if it had surveyed specific consumer groups – which, of course, would have been more complicated and expensive. Car makers and the general public would probably appreciate an education as to how current car owners feel about self-driving cars or how likely car-buying intenders feel about them.

Saying consumers, in general, are afraid of self-driving cars reminds me of an old Monty Python’s Flying Circus animation of “killer cars” terrorizing London. The cars in the animation eat pedestrians – it seemed funny at the time.

By hyping the fear factor, though, AAA is missing the bigger picture. A more accurate description of the state of consumer attitudes toward cars generally and self-driving in particular would reflect intense curiosity, some concern, and, in some circles, broad-based enthusiasm.

TechInsights has conducted this research and found skepticism on the decline and interest in self-driving technology on the rise. The latest study, conducted in 2022 by TechInsights, found that at least 20% of car owners in China would pay more for any automated driving or parking feature, a figure that falls to 10% in Western Europe and 15% in the U.S.

A third of customers in China would pay more for fully autonomous driving capabilities. This figure does not exceed 18% in the West. This is certainly not what one would describe as fear.

Chinese consumers are most interested in automated parking. U.S. and Western European customers are more likely to pay more for parking assistance, followed by fully autonomous driving.

Missing this more nuanced part of the story is the greatest failing of the AAA study. In fact, enthusiasm for and interest in self-driving have not only helped Tesla sell hundreds of thousands of cars equipped with its Autopilot technology (which still requires driver engagement), but have also fueled widespread consumer willingness to pay as much as $15,000 for so-called Full Self-Driving technology, which is being blamed and investigated for crashes and mocked for its limitations.

In essence, AAA appears to be trying to stimulate fear rather than simply measure it. That, ultimately, is the weakness of the AAA study. Instead of lighting a candle, AAA has turned on the high beams, blinding news consumers to the realities of the evolving driver assistance landscape. Consumers are actually quite interested in cars enhanced with driver assistance technology. Don’t let the fear mongers at AAA fool you.

Also Read:

IoT in Distress at MWC 2023

Modern Automotive Electronics System Design Challenges and Solutions

Maintaining Vehicles of the Future Using Deep Data Analytics

ASIL B Certification on an Industry-Class Root of Trust IP


Deep thinking on compute-in-memory in AI inference
by Don Dingee on 03-09-2023 at 6:00 am

Compute-in-memory for AI inference uses an analog matrix to instantaneously multiply an incoming data word

Neural network models are advancing rapidly and becoming more complex. Application developers using these new models need faster AI inference but typically can’t afford more power, space, or cooling. Researchers have put forth various strategies in efforts to wring out more performance from AI inference architectures, most notably placing parallelized execution and memory resources closer together. Compute-in-memory takes memory locality to the extreme by combining memory accesses and execution in a single operation – however, it comes with the risks and complexity of analog design and fabrication. Expedera’s new white paper examines compute-in-memory in AI inference and explores the tradeoffs versus its all-digital compute-near-memory solution optimizing all neural network processing elements.

Motivation for compute-in-memory technology

What’s the most expensive operation for a CPU? Moving data. It keeps operations from running and forms nasty bottlenecks if too much flows simultaneously. Manycore solutions like GPUs help multiplier performance but don’t eliminate the data movement penalties. Neither CPUs nor GPUs are efficient for AI inference, opening a door for the NPU (neural processing unit), seeking balance for efficient parallel execution with effective data access.

Instead of placing execution cores over here, a pile of memory over there, and a bus in between, one path to faster data access is locality – for which three basic strategies exist:

  • Compute-near-memory spreads smaller blocks of memory near blocks of execution units, though not necessarily partitioned in a one-to-one relationship.
  • Compute-at-memory tightly couples an execution unit with its supporting memory, forming a co-processor unit that can be stepped and replicated for scale.
  • Compute-in-memory changes the memory structure by physically embedding a multiplier unit in a customized memory array, so a memory read fetches a pre-multiplied result.

Saying compute-in-memory “changes the structure” may be understated. Both compute-near-memory and compute-at-memory use a conventional digital multiplier. Compute-in-memory relies on an incoming digital word triggering analog technology for multiplication and current summing. An analog-to-digital converter is required to get the result back to digital.

Conceptually, the neural network weight coefficients stay put in a compute-in-memory scheme and are always at hand for rapid multiplication with the data of interest. Cutting out data movements and streamlining operations trims AI inference time, and fewer transistors in action can mean lower power consumption. Those are good outcomes, right?
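
Functionally, what a compute-in-memory tile does is a weight-stationary matrix-vector multiply. The Python sketch below is a digital stand-in for that idea only, with coarse quantization loosely mimicking limited analog/ADC precision; it does not model real analog noise or variation, and it does not reflect Expedera's or any vendor's implementation.

```python
# Digital stand-in for the weight-stationary idea behind compute-in-memory:
# weights are programmed once into an array and stay put, activations stream
# in, and a "read" returns the whole dot product. The 4-bit quantization
# loosely mimics limited analog/ADC precision; real analog noise and
# variation are not modeled here.
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits=4):
    # Map values onto a signed integer grid, as a low-precision DAC/ADC would.
    levels = 2 ** (bits - 1) - 1
    scale = float(np.max(np.abs(x))) / levels
    scale = scale if scale > 0 else 1.0
    return np.round(x / scale).astype(np.int32), scale

# Weights "programmed" once into the array; they stay resident across inferences.
weights = rng.standard_normal((64, 128)).astype(np.float32)
w_q, w_scale = quantize(weights)

def cim_tile_matvec(activations):
    a_q, a_scale = quantize(activations)
    acc = w_q @ a_q                      # analog current summing, done digitally here
    return acc * (w_scale * a_scale)     # rescale the integer result to real units

x = rng.standard_normal(128).astype(np.float32)
print("cim :", np.round(cim_tile_matvec(x)[:4], 2))
print("ref :", np.round(weights @ x, 2)[:4])   # close, but quantization costs accuracy
```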

Throughput is only one factor for compute-in-memory in AI inference

Unsurprisingly, some tradeoffs arise. Some issues are inherent to the compute-in-memory structure, and some relate to the overall design and fabrication cycle.

A significant advantage of digital circuitry is noise immunity. Moving operations back into analog reintroduces noise factors and analog variation that limit realizable bit precision; researchers seem confident in 4-bit implementations. Lower AI inference precision raises demands on AI training, with more cycles required for reliable inference. Achieving higher precision, say int8, also introduces problems. The analog transistor array becomes much more complex, as does the analog-to-digital converter needed. Area and power consumption savings are offset, and the chance for bit errors rises as analog step width shrinks.
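
The arithmetic behind that precision tradeoff is simple: each extra bit doubles the number of analog levels the array and its ADC must distinguish, halving the step width left as noise margin. A quick sketch (the 1 V full-scale swing is an arbitrary illustrative number, not a figure from any real design):

```python
# Each additional bit doubles the number of analog levels the array and its
# ADC must resolve, halving the voltage step available as noise margin.
# The 1 V full-scale swing is an arbitrary illustrative number.
full_scale_v = 1.0
for bits in (4, 6, 8):
    levels = 2 ** bits
    step_mv = full_scale_v / levels * 1e3
    print(f"int{bits}: {levels:3d} levels, ~{step_mv:.1f} mV per step")
# int4:  16 levels, ~62.5 mV per step
# int8: 256 levels, ~3.9 mV per step  -> far less headroom before bit errors
```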

It’s certainly possible to size, interconnect, and optimize a compute-in-memory array for a particular inference model or group of similar complexity models. But, as we’ve discussed elsewhere, that removes flexibility, and the risk grows if a different model is needed. Other hits on flexibility are locking into specific, transistor-level modifications in a selected memory technology, and placement and routing constraints may appear at the system-on-chip level.

Moving into analog also means moving into a mixed-signal foundry process. Advanced nodes may be entirely off the table for some time. It also means analog expertise and tools are required, and it is difficult to scale layouts with analog circuitry.

Achieving better performance gains in all-digital NPU IP

Substantial engineering effort is often necessary to wring the last percentage points of performance out of an application unless conditions align perfectly where a bit more is possible. Use cases for compute-in-memory in AI inference will probably look like this:

  • Models with large numbers of weight coefficients, fully connected layers, and sparse activations
  • Int4 precisions where extended AI training is feasible and analog noise immunity is better
  • Scale compatible with mature mixed-signal process technology and wafer costs

Still, we’re not talking about something like a 4x performance gain with compute-in-memory compared to all-digital NPU IP. Remember that compute-in-memory is only one small part of a complete NPU solution, and performance improvements are available at other points around the core multiply operation. But are they worth the risks and costs of the analog environment?

Expedera’s value proposition is straightforward. Using high-efficiency all-digital NPU IP, combining scalable hardware with packet-based sequencing directed by compiler software, Expedera teams working with OEMs deliver better performance gains in a complete AI inference solution. As Expedera’s IP evolves and process technology advances, it gets faster, smaller, and more power efficient, and it can be customized for OEM requirements.

Compute-in-memory and Expedera NPU IP are independent of each other – both will exist in the market once compute-in-memory gains more adoption. There’s more to read about both approaches in the full Expedera white paper; simple registration gets a download.

Architectural Considerations for Compute-in-Memory in AI Inference

Also Read:

Deep thinking on compute-in-memory in AI inference

Area-optimized AI inference for cost-sensitive applications

Ultra-efficient heterogeneous SoCs for Level 5 self-driving

CEO Interview: Da Chuang of Expedera


DSP Innovation Promises to Boost Virtual RAN Efficiency
by Bernard Murphy on 03-08-2023 at 6:00 am


5G is already real, though some of us are wondering why our phone connections aren’t faster. That perspective misses the real intent of 5G – to extend high throughput (and low latency) communication to a vast number and variety of edge devices beyond our phones. One notable application is Fixed Wireless Access (FWA), promising to replace fiber with wireless broadband for last mile connectivity. Consumers are already cutting their (phone) landlines; with FWA they may also be able to cut their cable connections. Businesses can take this further, installing FWA base stations around factories, offices, hospitals, etc., to support many more smart devices in the enterprise.

An essential business requirement to enable this level of scale-out is much more cost-effective wireless network infrastructure. Open Radio Access Network (Open RAN) and virtualized RAN (vRAN) are two complementary efforts to support this objective. Open RAN standardizes interfaces across the network, encouraging competition between network component suppliers. vRAN improves throughput within a component by more efficiently exploiting a fixed hardware resource for multiple independent channels. We know how to do this with standard multi-core processor platforms, through dispatching tasks to separate cores or through multi-threading. Important functions in the RAN now run on DSPs, which also support multi-core but not multi-threading. Is DSP innovation possible to overcome this drawback?

What’s the solution?

Existing RAN infrastructure components – specifically processors used in baseband and backhaul – support virtualization / multithreading and are well established for 4G and early 5G. Surely network operators should stick to tried-and-true solutions for Open RAN and vRAN?

Unfortunately, existing components are not going to work as well for the scale-out we need for full 5G. They are expensive, power hungry (hurting operating cost), competition in components is very limited, and these devices are not optimized for the signal processing aspects of the RAN. Operators and equipment makers have enthusiastically switched to DSP-based ASICs to overcome those issues, especially as they get closer to the radio interface and user equipment, where the RAN must offer massive MIMO support.

A better solution would be to continue to leverage the demonstrated advantages of DSP-based platforms, where appropriate, while innovating to manage increasing high-volume traffic more efficiently in a fixed DSP footprint.

Higher throughput, fewer DSPs

Multi-core DSP systems are already available. But any one of those DSP cores is handling just one channel at a time. A more efficient solution would also allow for multi-threading within a core. Commonly, it is possible to split a core to handle two or more channels at one time, but this fixed threading is a static assignment. What limits more flexibility is the vector compute unit (VCU) in each DSP core. VCUs are major differentiators between DSPs and general-purpose CPUs, handling all the signal-intensive computation – beamforming, FFT, channel aggregation and much more – in the RAN processing path between infrastructure and edge devices. VCUs consume significant footprint in DSP cores, an important consideration in multi-core systems during times when software is executing scalar operations and the VCU must idle.

Utilization can be improved significantly through the dynamic vector threading architecture illustrated in the figure above. Within one DSP core, two scalar processors support 2 channels in parallel; this does not add significantly to the footprint. The VCU is common to both processors and provides vector compute functions and a vector register file for each channel. So far this looks like the static split solution described earlier. However, when only one channel needs vector computation at a given time, that calculation can span across both compute units and register files, doubling throughput for that channel. This is dynamic vector threading, allowing two channels to use vector resource in parallel when needed, or allowing one channel to process a double-wide vector with higher effective throughput when vector need on the other channel is inactive. Naturally the solution can be extended to more than two threads with obvious hardware extensions.
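
A toy scheduler-level model helps illustrate the idea. This is purely my own sketch of the behavior described above, under the simplifying assumption of one vector decision per cycle; it is not CEVA's architecture or API.

```python
# Toy model of dynamic vector threading: two channels share one VCU, and a
# channel with vector work can claim the full double-wide datapath whenever
# the other channel is busy with scalar work. Purely illustrative; this is
# not CEVA's architecture or API.
BASE_VECTOR_WIDTH = 1.0   # vector throughput of one half of the shared VCU

def vcu_cycle(ch0_vector, ch1_vector):
    """Vector throughput delivered to (ch0, ch1) in one cycle."""
    if ch0_vector and ch1_vector:
        return BASE_VECTOR_WIDTH, BASE_VECTOR_WIDTH       # both active: split the VCU
    if ch0_vector:
        return 2 * BASE_VECTOR_WIDTH, 0.0                 # ch0 alone: double-wide vectors
    if ch1_vector:
        return 0.0, 2 * BASE_VECTOR_WIDTH                 # ch1 alone: double-wide vectors
    return 0.0, 0.0                                       # both scalar: VCU idles

# Sub-peak load: channel 0 always has vector work, channel 1 only half the time.
trace = [(True, cycle % 2 == 0) for cycle in range(8)]
dynamic = sum(sum(vcu_cycle(a, b)) for a, b in trace)
static = sum((1.0 if a else 0.0) + (1.0 if b else 0.0) for a, b in trace)
print(f"vector work delivered: dynamic threading {dynamic}, static split {static}")
```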

Bottom line, such a system can both process with multiple cores and dynamically multi-thread vector compute within each core. At absolute peak load the system will still deliver effective throughput. During more common sub-peak loads it will deliver higher throughput for a fixed number of cores than a traditional multi-core system. Network operators, businesses and consumers will be able to get more out of installed hardware for longer, before needing to upgrade.

Talk to CEVA

CEVA have been working for many years with the big names in infrastructure hardware, consumer and business edge products. They tell me they have been actively guided by those customers towards this vector multi-threading capability, suggesting dynamic vector threading is likely to debut in products within the next few years. You can learn more about CEVA’s XC-20 family architecture offering dynamic vector threading HERE.


Multi-Die Systems Key to Next Wave of Systems Innovations
by Kalar Rajendiran on 03-07-2023 at 10:00 am

Shift to Multi Die Systems is Happening Now

These days, the term chiplets is referenced everywhere you look, in anything you read and in whatever you hear. Rightly so, because the chiplets or die integration wave is taking off. Generally speaking, the tipping point that kicked off the move happened around the 16nm process technology node, when large monolithic SoCs started facing yield issues. This obviously translated to an economic issue and highlighted the fact that the Moore’s Law benefit, which had stood the test of time for more than five decades, had started to flatten. While this is certainly true, there are a lot more benefits to moving to chiplets-based designs. This “More than Moore” aspect is what will drive chiplets adoption even faster.

I recently had a conversation on this topic with Shekhar Kapoor, senior product line director at Synopsys. Shekhar shared Synopsys’ view on what is behind the move to multi-die systems, the challenges to overcome and the solutions that are needed to successfully support this wave.

More Than Just The Yield Benefit

Multi-die systems can accelerate the scaling of system functionality and performance. They can help lower system power consumption while increasing throughput. By allowing re-use of proven designs/dies as part of a system implementation, they help reduce product risk and time-to-market. And, they help create new product variants rapidly and enable strategic development and management of a company’s product portfolio.

“SysMoore Era” Calls for Multi-Die Systems

Until recent years, Moore’s Law benefits delivered at the chip level translated well to satisfy the performance demands of systems. But as Moore’s Law benefits started to slow down, system performance demands have started to grow by leaps and bounds. Systems have been hitting the processing, memory and connectivity walls. Synopsys has coined the term “SysMoore Era” to refer to the future.

Take, for example, the tremendous growth in artificial intelligence (AI) driven systems and advances in deep learning neural network models. The compute demand on systems has been growing at incredible rates every couple of years. As an extreme example, OpenAI’s ChatGPT application is powered by a Generative Pre-trained Transformer (GPT) model with 175 billion parameters. That is the current version (GPT3) and the next version (GPT4) is supposed to handle 100 trillion parameters. Just imagine the compute demand of such a system.

On average, the Transformer models have been growing in complexity by 750x over a two-year period and systems are hitting the processing wall. Domain Specific Architectures are being adopted to close the gap on performance demand. Multi-die systems are becoming essential to address the system demands of the “SysMoore Era.”

Multi-Die System Challenges

Heterogeneous die integration introduces a number of challenges. Die-to-die connectivity is at the heart of multi-die systems as different components need to properly communicate with each other. System pathfinding is a closely related challenge that involves determining the best data path between components in the system. Multi-die systems must be designed to also ensure that each component is supplied with adequate power and cooling, while also minimizing system level power consumption. Memory utilization and coherency are also important challenges as these systems must be designed to ensure efficient memory utilization and coherency across the different components. Software development and validation at the system level is yet another challenge as each component may have its own software stack. Design implementation has to be done for efficient die/package co-design with system signoff as the goal. And all this should be accomplished with cost-effective manufacturability and long-term reliability in mind.

Multi-Die System Solutions

Just as Design Technology Co-Optimization (DTCO) is very important in a monolithic SoC scenario, System Technology Co-Optimization (STCO) is imperative in a multi-die system scenario. To address multi-die system challenges, the solutions can be broadly categorized into the following areas of focus.

Architecture Exploration

A system level tool that allows early exploration and system partitioning for optimizing thermal, power and performance is imperative. Just as a chip-level platform architect tool was critical for a monolithic SoC scenario, so is a system-level platform architect tool for a multi-die system, if not more critical.

Software Development and Validation

High-capacity emulation and prototyping solutions are essential to support rapid software development and validation for the various components of a multi-die system.

Design Implementation

Access to robust and secure die-to-die IP and a unified exploration-to-signoff platform are key to an effective and efficient die/package co-design of the various components of a multi-die system.

Manufacturing & Reliability

Multi-die system hierarchical test, diagnostics and repair and holistic test capabilities are essential for manufacturability and long-term system reliability. Environmental, structural and functional monitoring are needed to enhance the operational metrics of a multi-die system. The solution comprises silicon IP, EDA software and analytics insights for the In-Design, In-Ramp, In-Production and In-Field phases of the product lifecycle.

Summary

As a leader in “Silicon to Software” solutions to enable system innovations, Synopsys offers a complete solution to design, manufacture and deploy multi-die systems. For solution-specific details, refer to their multi-die system page.

Also Read:

PCIe 6.0: Challenges of Achieving 64GT/s with PAM4 in Lossy, HVM Channels

Synopsys Design Space Optimization Hits a Milestone

Webinar: Achieving Consistent RTL Power Accuracy