
Flex Logix InferX X1 Optimizes Edge Inference at Linley Processor Conference

by Camille Kokozaki on 04-18-2019 at 12:00 pm

Dr. Cheng Wang, Co-Founder and SVP Engineering at Flex Logix, presented the second talk in the ‘AI at the Edge’ session at the just-concluded Linley Spring Processor Conference, highlighting the InferX X1 Inference Co-Processor’s high throughput, low cost, and low power. He opened by pointing out that existing inference solutions are not optimized for edge requirements: high-end server solutions exist, but the edge needs to process images one at a time, within fixed power budgets, on larger images and larger models, with higher prediction accuracy. Since cameras at the edge see one image at a time, batching is not practical, and even high-end devices perform less well at low batch sizes.

Flex Logix started off with embedded FPGA and programmable interconnect technology and is now using it as the foundation for their technology stack. The nnMAX technology utilizes embedded FPGA integrated into SoCs, with density and performance comparable to leading FPGAs, built on the XFLX™, ArrayLINX™, and RAMLINX™ interconnect technologies.

Flex Logix Technology Stack consists of:

  • Hardware

    • InferX™ PCIe Cards
    • InferX Edge Inference co-processor ICs
    • nnMAX™ Inference IP
    • eFPGA/Interconnect Technology
  • Software

    • TensorFlow Lite, ONNX
    • Software driver
    • InferX/nnMAX Inference Compiler
    • eFPGA place and route back-end

Inferencing customers needed large numbers of EFLX DSP MACs, so a 1K-MAC nnMAX tile was developed. A detailed look at the 1K configurable MAC inference tile shows the following architecture and features.


Winograd acceleration¹ for INT8 provides a 2.25x performance gain for applicable layers and is invoked automatically by the nnMAX compiler. The tile is also programmed via TensorFlow Lite/ONNX, with multiple models running simultaneously. The 1K tiles can be combined in any array size, with configurable L2 SRAM supporting 1-4 MB per tile and variable DRAM bandwidth through reconfigurable I/Os, typically connecting x32 or x64 LPDDR4. The key advantage is the ability to reconfigure the ports and controls of the data path for each layer; once configured, the tile runs with ASIC-like performance, with routing to memory and interconnect and with localized data access and compute.

Winograd acceleration speeds up 3×3 convolutions with a stride of 1 by a factor of 2.25x. Although the transformation creates 2x larger weights and a more complex data path, nnMAX performs the transformations on the fly, removing the weight penalty. From the user’s perspective Winograd is essentially free: there is no penalty in DRAM power or precision. It is not quite free in hardware, requiring some additional bits in the multipliers.
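The 2.25x figure comes from the standard F(2x2, 3x3) Winograd transform, which produces a 2x2 output tile with 16 elementwise multiplies instead of the 36 a direct 3x3 convolution needs (36/16 = 2.25). nnMAX's on-chip implementation is not public, but the underlying arithmetic can be sketched in a few lines of NumPy:

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices (Lavin & Gray formulation).
# B^T transforms a 4x4 input tile, G transforms the 3x3 filter,
# A^T reduces the 4x4 elementwise product to the 2x2 output.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(tile, filt):
    """2x2 output of a 3x3 convolution over a 4x4 input tile,
    using 16 elementwise multiplies instead of 36 (2.25x fewer)."""
    U = G @ filt @ G.T          # 4x4 transformed filter
    V = BT @ tile @ BT.T        # 4x4 transformed input
    return AT @ (U * V) @ AT.T  # 2x2 output

def direct_3x3(tile, filt):
    """Reference: direct 3x3 convolution (CNN-style correlation)."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(tile[i:i+3, j:j+3] * filt)
    return out

tile = np.arange(16, dtype=float).reshape(4, 4)
filt = np.array([[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]])
assert np.allclose(winograd_f2x2_3x3(tile, filt), direct_3x3(tile, filt))
```

Note that the transformed filter U is 4x4 where the original was 3x3, which is exactly the "2x larger weights" penalty the article mentions; nnMAX avoids storing U by computing it on the fly.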


Some layers have large intermediate frame sizes that may not fit in on-chip SRAM (e.g., YOLOv3 layer 0 outputs 64MB), forcing DRAM writes and re-reads, straining DRAM bandwidth and potentially stalling the pipeline when those layers are processed. To address this, multiple layers are run in parallel. In the YOLOv3 case, layers 0 and 1 run simultaneously, avoiding the need to store the 64MB intermediate: layer 0 streams its output directly into the nnMAX clusters processing layer 1.
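The streaming idea can be illustrated with a toy two-layer pipeline (hypothetical pointwise layers, not the actual YOLOv3 convolutions): rather than materializing layer 0's full output, each row is handed to layer 1 as soon as it is produced, so peak intermediate storage is one row instead of a whole feature map.

```python
import numpy as np

def layer0(row):
    """Hypothetical pointwise layer 0: scale and shift."""
    return 2.0 * row + 1.0

def layer1(row):
    """Hypothetical pointwise layer 1: ReLU."""
    return np.maximum(row, 0.0)

def staged(x):
    """Baseline: buffer layer 0's entire output, then run layer 1.
    The full intermediate feature map lives in memory (the 64MB problem)."""
    intermediate = layer0(x)
    return layer1(intermediate)

def fused(x):
    """Streamed: each row flows from layer 0 straight into layer 1.
    Peak intermediate storage is a single row, never the full map."""
    return np.stack([layer1(layer0(row)) for row in x])

x = np.random.default_rng(0).normal(size=(8, 8))
assert np.allclose(staged(x), fused(x))
```

Real convolutions need a few rows of halo rather than exactly one, but the memory saving is the same in spirit: the intermediate never touches DRAM.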

InferX X1 applications include edge devices such as surveillance cameras, robots, and set-top boxes, as well as edge servers such as edge gateways and low-end edge servers.

The InferX X1 Edge Inference co-processor, which runs at 1.067GHz on TSMC16FFC, is scheduled for Q3 2019 tape-out with 8.5 TOPS from 4K MACs, 8MB SRAM, x32 LPDDR4 DRAM, and x4 PCIe Gen 3/4 lanes. Total worst-case dynamic power on the PCIe card for YOLOv3, the most demanding model, including DRAM and regulators, is 9.6W; typical power is 2.2 Watts on ResNet-50 and varies by model. InferX X1 silicon and PCIe cards will sample by the end of 2019.

InferX X1 throughput is 3 to 11 times that of existing edge inference ICs, and chips can be chained for higher inference throughput. The performance gain is greater on large models such as YOLOv2 and v3. Furthermore, its throughput/Watt is 4 to 26 times better, allowing edge devices to stay within their power budgets.
The nnMAX compiler front-end flow translates the neural network model to soft logic; the back-end flow performs place-and-route, retiming, pipelining, and binary generation. The compiler first estimates performance, then takes the X1 floorplan and a TensorFlow Lite (soon ONNX) model, automatically partitions the model across multi-layer configurations, and computes performance, latency, MAC utilization, and DRAM bandwidth per layer and per model.
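Such per-layer estimates usually boil down to a roofline-style calculation: a layer is either compute-bound or bandwidth-bound, whichever takes longer. Here is a toy sketch of that idea; the 4K MACs and 1.067 GHz come from the article, but the DRAM bandwidth figure and layer sizes are purely illustrative, not X1 specs.

```python
# Toy roofline estimate of the kind an inference compiler produces per layer.
PEAK_MACS_PER_S = 4096 * 1.067e9   # 4K MACs at 1.067 GHz (from the article)
DRAM_BYTES_PER_S = 8.5e9           # assumed LPDDR4 bandwidth, illustrative only

def layer_estimate(macs, dram_bytes):
    """Return (latency_s, mac_utilization) for one layer: it runs at the
    slower of its compute time and its DRAM transfer time."""
    t_compute = macs / PEAK_MACS_PER_S
    t_memory = dram_bytes / DRAM_BYTES_PER_S
    latency = max(t_compute, t_memory)
    return latency, t_compute / latency

# A compute-heavy layer vs. a bandwidth-heavy layer (made-up sizes).
lat1, util1 = layer_estimate(macs=1e9, dram_bytes=1e6)
lat2, util2 = layer_estimate(macs=1e7, dram_bytes=1e8)
print(f"layer 1: {lat1*1e3:.3f} ms latency, {util1:.0%} MAC utilization")
print(f"layer 2: {lat2*1e3:.3f} ms latency, {util2:.2%} MAC utilization")
```

The bandwidth-bound second layer is exactly the situation the layer-fusion trick above is designed to avoid: its MACs sit idle waiting on DRAM.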

During the panel discussion Dr. Wang was asked about the perceived complexity; he stressed that the architecture is simple and that verification is no more difficult than verifying an FPGA design. Asked about memory requirements, he stated that since there is no way to store all parameters in SRAM on edge devices, you can train the model to be sparse or partition it over multiple devices; certain aspects can also be serialized across multiple devices.

Asked how long it takes to reconfigure the fabric, the answer was about 1 microsecond per layer. For video at 30 frames per second with a model having 100 configurations, the fabric cycles through 3,000 configurations per second, so the user will not experience a drop in performance, with, as designed, an acceptable hardware impact and complexity.

Asked how one addresses models other than CNNs, he said that the embedded FPGA runs a lookup table from anything to anything. Most such functions are not throughput-intensive, and FPGAs handle them beautifully: the activation function lives entirely in the lookup table, while the throughput-heavy work comes down to matrix multiplication and data movement, which the architecture is optimized for. Delivering enough bandwidth at the edge, where GDDR or HBM are not in the cards, is the reason the architecture was designed the way it was, so that not much DRAM bandwidth is required.
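The reconfiguration arithmetic is worth spelling out, using the figures quoted in the talk (1 microsecond per layer switch, 30 fps, 100 configurations per model):

```python
# Reconfiguration overhead for the quoted figures.
reconfig_s = 1e-6        # seconds per layer reconfiguration
fps = 30
configs_per_frame = 100  # configurations cycled through per model/frame

reconfigs_per_second = fps * configs_per_frame
overhead_per_frame_s = configs_per_frame * reconfig_s
frame_budget_s = 1.0 / fps
overhead_fraction = overhead_per_frame_s / frame_budget_s

print(reconfigs_per_second)  # 3000 reconfigurations per second
print(f"{overhead_fraction:.2%} of each ~33 ms frame spent reconfiguring")
```

At 0.30% of the frame budget, reconfiguration time is effectively invisible to the user, which is the point Dr. Wang was making.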

Edge applications are all about efficiency: how much throughput one can get for a given amount of power and cost. The goal is to extract as much performance as possible, and their solution comes as close to data-center performance as one can while staying in the edge space. FPGAs typically have a capacity problem because they try to map everything at once. Flex Logix addresses this with multiple configurations: a certain amount of resources is required to map a model, and the compiler decides how to multiplex data and what degree of parallelism to use based on the resources available and what the model requires.

Geoffrey Tate, Flex Logix CEO, emphasized how reusing their FPGA technology delivers very high inference throughput for the more challenging models customers want to run at low power and low cost. The chip customer programs ONNX or TensorFlow Lite models, and Flex Logix software takes care of the FPGA internals. The interconnect technology can reconfigurably program non-blocking paths from on-chip memory through hardware units such as the MACs and back to memory, giving much more efficient computation than other processor architectures.
___
1 Unlike the conventional convolution algorithm, the Winograd algorithm uses fewer computing resources but puts more pressure on memory bandwidth. The Flex Logix architecture mitigates that.


ML and Memories: A Complex Relationship

by Bernard Murphy on 04-18-2019 at 7:00 am

No, I’m not going to talk about in-memory-compute architectures. There’s interesting work being done there, but here I’m going to talk about mainstream architectures for memory support in Machine Learning (ML) designs. These are still based on conventional memory components/IP such as cache, register files, SRAM and various flavors of off-chip memory, including the not-yet-“conventional” high-bandwidth memory (HBM). However, the way these memories are organized, connected and located can vary quite significantly between ML applications.

At the simplest level, think of an accelerator in a general-purpose ML chip designed to power whatever edge applications a creative system designer might dream up (Movidius provides one example). The accelerator itself may be an off-the-shelf IP, perhaps FPGA or DSP-based. Power may or may not be an important consideration; latency typically is not. The accelerator is embedded in a larger SoC controlled by maybe an MCU or MCU cluster along with other functions, perhaps the usual peripheral interfaces and certainly a communications IP. To reduce off-chip memory accesses (for power and performance), the design provides on-chip cache. Accesses to that cache can come from both the MCU/MCU cluster and from the accelerator, so these must be coherently managed.

Now crank this up a notch, to ADAS applications, where Mobileye is well-known. This is still an edge application, but performance is much more demanding from latency, bandwidth and power consumption standpoints. Complexity is also higher; you need multiple accelerator types to handle different kinds of sensor, and sensor fusion, for example. For scalability in product design, you cluster accelerators in groups, very likely with local scratchpad memory and/or cache; this enables you to release a range of products with varying numbers of these groups. As you increase the numbers and types of accelerators, it makes sense to cluster them together using multiple proxy cache connections to the system interconnect, one for each accelerator group. In support of your product strategy, it should then be easy to scale this number by device variant.

Arteris IP supports both of these use-cases through their Ncore cache-coherent NoC interconnect. Since this must maintain coherence across the NoC, it comes with its own directory/snoop filters. The product also provides proxy caches to interface between the coherent domain and non-coherent domains, and you can have multiple such caches to create customized clusters of IP blocks that use non-coherent protocols like AXI, but can now communicate as a cluster of equals in the cache coherent domain. Arteris IP also provides multiple types of last-level cache including the Ncore Coherent Memory Cache, which is also tied into coherency management to provide a final level of caching before needing to go to main memory. For non-coherent communications, Arteris IP also provides a standalone last-level cache integrating through an AXI interface (CodaCache).

These ML edge solutions are already proven in the field: Movidius and Mobileye are two pretty compelling examples (the company will happily share a longer list).

Moving now to datacenter accelerators, memory architectures look quite different based on what’s happening in China. I’ve talked before about Baidu and their leading-edge work in this area, so here I’ll introduce a new company: Enflame (Suiyuan) Technology, building high-performance but low-cost chips for major machine-learning frameworks. Enflame is a Tencent-backed startup based in Shanghai with $50M in pre-series A funding, so they’re a serious player in this fast-moving space. And they’re going after the same objective as Cambricon, and Baidu with their Kunlun chip – the ultimate in ML performance in the datacenter.

I’ve also talked before about how design teams are architecting for this objective – generally a mesh of accelerators to achieve massive parallelism in 2-D image processing. The mesh may be folded over into a ring, or folded twice into a torus, to implement RNNs for processing temporal sequences. The implementation is often tiled, with say 4 processors plus local memory per tile; tiles are abutted to build up larger systems, simplifying some aspects of back-end place and route.

Designs like this quickly get very big and they need immediate access to a lot of off-chip working memory, without the latency that can come with mediation through cache coherency management. There are a couple of options here: HBM2 at high-bandwidth but at high cost, versus GDDR6 at lower cost but also lower bandwidth (off-chip memory on the edge is generally LPDDR). Kurt Shuler (VP Marketing at Arteris IP) tells me that GDDR6 is popular in China for cost reasons.
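The HBM2/GDDR6/LPDDR tradeoff is mostly interface-width times per-pin rate. The configurations below are representative of typical parts (a single x32 GDDR6 device at 16 Gb/s per pin, one 1024-bit HBM2 stack at 2 Gb/s per pin, an x32 LPDDR4 channel), not any specific product:

```python
def interface_gbytes_per_s(pins, gbits_per_pin):
    """Peak bandwidth of a DRAM interface in GB/s."""
    return pins * gbits_per_pin / 8  # 8 bits per byte

# Representative (not product-specific) configurations:
gddr6 = interface_gbytes_per_s(pins=32, gbits_per_pin=16)      # one GDDR6 device
hbm2 = interface_gbytes_per_s(pins=1024, gbits_per_pin=2)      # one HBM2 stack
lpddr4 = interface_gbytes_per_s(pins=32, gbits_per_pin=4.267)  # x32 LPDDR4 (edge)

print(f"GDDR6 x32 @ 16 Gb/s/pin:   {gddr6:.0f} GB/s")   # 64 GB/s
print(f"HBM2 x1024 @ 2 Gb/s/pin:   {hbm2:.0f} GB/s")    # 256 GB/s
print(f"LPDDR4 x32 @ 4.267 Gb/s/pin: {lpddr4:.1f} GB/s")
```

The wide-but-slow HBM2 stack wins on raw bandwidth but requires a silicon interposer, which is where the cost gap that favors GDDR6 comes from.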

Another wrinkle in these mesh/tiled-mesh designs is that memory controllers are placed around the periphery of the mesh to minimize latency between cores in the mesh and controllers. Traffic through those controllers must then be managed through to channels on the main memory interface (e.g. HBM2). That calls for a lot of interleaving, reordering, traffic aggregation and data-width adjustment between the memory interface and the controllers, while preserving the high throughput these memory standards offer. The Arteris IP AI-package provides the IP and necessary interfacing to manage this need. Among customers, they can already boast Baidu, Cambricon and Enflame at a minimum; two of these (that I know of) have already made it through to deployment.

Clearly there is more than one way to architect memory and NoC interconnect for ML applications. Kurt tells me that they have been working with ML customers for years, refining these solutions. Since they’re now clearly king of the hill in commercial NoC solutions, I’m guessing they have a bright future.


TechInsights Gives Memory Update at IEDM18 DRAM and Emerging Memories

by BHD on 04-17-2019 at 12:00 pm

On the Sunday evening at IEDM last year, TechInsights held a reception in which Arabinda Das and Jeongdong Choe gave presentations that attracted a roomful of conference attendees.

This is the second part of the review of Jeongdong’s talk; we covered NAND flash technology in the last post. Jeongdong is a Senior Technical Fellow at TechInsights and their subject-matter expert for memory technology. Before joining the company, he worked as a team lead in R&D for SK Hynix and Samsung advancing next-generation memory devices, so he knows whereof he speaks.
Continue reading “TechInsights Gives Memory Update at IEDM18 DRAM and Emerging Memories”


Hogan Fireside Chat with Paul Cunningham at ESDA

by Bernard Murphy on 04-17-2019 at 7:00 am

If you’re in verification and you don’t know who Paul Cunningham is, this is a guy you need to have on your radar. Paul has risen through the Cadence ranks fast, first in synthesis and now running the verification group, responsible for about a third of Cadence revenue and a hefty percentage of verification tooling in the semiconductor industry. Since he was honored as one of the outstanding innovators under 40 at DAC 2017, you should realize he really is on the fast track and is likely to significantly influence how you will be verifying in the future. The ESD Alliance hosted an event recently at which Jim Hogan interviewed Paul, to help us learn more about this rising star and his entrepreneurial journey.

Paul is a fellow Brit/ex-Brit; there are a lot of us around (at least 5 at the ESDA meeting). He took his first degree (CS) at Cambridge, also rowed for the university, then stayed at Cambridge to get his Ph.D. in formal verification of asynchronous circuits. He was quite open about his journey of discovery in async circuits, saying he originally drank the Kool-Aid, believed this design style would conquer the world and decided he wanted to start a company to build compilers for self-timed chips.

Together with a co-founder, Paul started Azuro in Cambridge, raising ~$100k. Talking to prospects, they got a quick reality check on the gap between what is academically interesting and what can make serious money. They found that prospects weren’t interested in self-timed circuits but were very interested in better clock gating and useful skew. Paul/Azuro reworked their PowerPoint to reflect this reality and started doing deals with well-known companies. That woke up the big VCs; ultimately Benchmark Capital, who have a branch in London, put in $4M. Benchmark required, unsurprisingly, that Azuro move their HQ to the Bay Area (though there’s still an R&D operation in Cambridge, now driving clocks for Cadence).

Jim asked Paul what he learned from being a CEO. Pay attention here, would-be CEOs. He said that intense customer focus and agility to meet customer needs are primary. At the same time there’s a need for balance and a broad set of skills. No one, not even a CEO, has everything it takes, so it’s important to build a strong team, to fill gaps in expertise and ensure priorities are balanced. One of the gaps was marketing. In the early stages some wins were self-marketing; new prospects called them. Azuro got to escape velocity, but generally you can’t assume technology alone will get you there. If he were to do it over again, he’d be a lot more vocal, even shameless: not over-optimize the pitch, but pump up the volume and ensure that everyone knew the name. Jim added that now social media has to be a part of the strategy.

Charlie Huang, back then running strategy in Cadence, called Paul in 2010-2011. At that time, smartphones were really taking off and the ARM A9 had caught the wave. ARM were using aggressive clock gating and useful skew, giving them a 10% advantage in PPA. That’s massive in this business; Charlie (whose background was in timing) wanted it to be exclusive to Cadence. Paul had no ambitions to take Azuro public and Charlie saw the opportunity to have a powerful differentiator and grow market share. They just had to do the deal.

Again, for would-be CEOs, if you’re lucky enough to get there, this is one of the most painful stages in a startup; Paul said the due-diligence process was brutal. For several weeks they were gathering and assembling legal and financial docs (NDAs, patents, patent searches, customer contracts, audits, …), a very stressful, sleep-deprived time when the technologists are in a holding pattern while the lawyers and accountants do their thing. Even after that part is done, the transition from a small, tightly-knit startup group to being one group among many in a large enterprise is also traumatic. But Paul never regretted it, or the immense leverage it has enabled for the technology and for him personally.

At Cadence, Paul applied the Azuro technology to clock tree synthesis, then quickly took on a broader portfolio managing the digital back-end products. Logic synthesis, these days tightly coupled to implementation, is a solid pier in Cadence’s pretty clearly dominant implementation solution.

Not bad, but Paul wanted more. He saw one of the IBS charts at a kickoff event, the chart that shows growing investment in various phases of design. What stands out for everyone is that system and software verification dominate everything else. In his view if he wanted to make a real dent that was where he had to focus. Anirudh asked him about 15 months ago to run verification; Paul said this wasn’t a hard decision.

He believes the opportunity is boundless if Cadence can deliver new and compelling approaches. This starts with what I find to be a differentiated top-line goal – throughput. By this he means bugs found per dollar per day. He’s very single-minded about this goal; objective by objective, he asks does this move the throughput needle or not? I consider this goal to be an important new direction. When I look at verification pitches over the last 10+ years, it can be difficult to isolate a unifying metric or philosophy other than run faster! ease of use! more features! Laudable goals of course, but how do specific advances affect customer success and profitability? Implementation flows and teams don’t have this problem – they’re always optimizing for PPA. There’s no confusion about the right metric. Verification needs the same singular objective. That’s what I see in this direction.

Of course execution has to be broken down into sub-goals. For Paul this starts with the underlying bare-metal verification hardware – today x86 (Intel/AMD) and ARM-based servers for simulation, then emulation and FPGA prototyping. He sees hardware platforms as a variable; they will continue to evolve. Above the bare metal, he sees a heterogeneous compute layer, a hybrid mix of platforms to optimize throughput versus accuracy and bug-finding visibility. On top of that, smart analysis – isolating bugs faster and more intelligently in the always exponentially huge state-space.

Jim asked about compliance, safety and security. Paul likes Simon Segars’ (ARM) view, that all of us in the ecosystem enabling and building these transformational products have a responsibility to ensure these solutions are safe and secure. Verification has a big part to play in this, but for Paul this must be guided by the Lip-Bu/Anirudh philosophy of having the right to win. If you don’t have proven domain expertise, you need to work with people who do, which is why he’s so excited about the partnership with Green Hills, a company with proven leadership in automotive and in high-level security solutions.

For me this was a wake-up discussion, the first time in quite a while that I’ve seen someone who’s going to re-engineer the verification tooling business and move it onto a new level. I’m looking forward to hearing more.


Samsung 5nm and TSMC 6nm Update

by Daniel Nenni on 04-16-2019 at 12:00 pm

TSMC and Samsung continue to raise the competitive bar for FinFET foundry market share with dueling announcements this week. As I mentioned previously in the blog Semiconductor Foundry Landscape Update 2019, FinFETs are the market to watch with the coming onslaught of 5G and AI chips on the edge, in the cloud, and in our autonomous cars.

Yesterday Samsung announced that 5nm EUV is ready to go with PDKs, EDA Tools, IP, and MPWs. Samsung already has 14nm, 11nm, 10nm, 8nm, 7nm EUV, and 6nm EUV production ready. Samsung’s 5nm FinFET process technology provides up to a 25% increase in logic area efficiency with 20% lower power consumption or 10% higher performance over their 7nm process.

“In successful completion of our 5nm development, we’ve proven our capabilities in EUV-based nodes,” said Charlie Bae, Executive Vice President of Foundry Business at Samsung Electronics. “In response to customers’ surging demand for advanced process technologies to differentiate their next-generation products, we continue our commitment to accelerating the volume production of EUV-based technologies.”

Samsung foundry’s EUV-based process technologies are currently being manufactured at the S3-line in Hwaseong, Korea. Additionally, Samsung will expand its EUV capacity to a new EUV line in Hwaseong, which is expected to be completed within the second half of 2019 and will start production ramp-up for next year.

Mr. Bae continued, “Considering the various benefits including PPA and IP, Samsung’s EUV-based advanced nodes are expected to be in high demand for new and innovative applications such as 5G, artificial intelligence (AI), high performance computing (HPC), and automotive. Leveraging our robust technology competitiveness including our leadership in EUV lithography, Samsung will continue to deliver the most advanced technologies and solutions to customers.”

Not to be outdone, TSMC today announced 6nm EUV which fills out their FinFET offering of 16nm, 12nm, 10nm, 7nm, 7nm EUV, 6nm EUV, and 5nm EUV. TSMC 6nm offers an 18% density advantage over 7nm.

TSMC announced 5nm ecosystem completion last week, offering 1.8X logic density and a 15% speed gain versus 7nm. N6’s design rules, meanwhile, are fully compatible with TSMC’s proven 7nm technology.

“TSMC N6 technology will further extend our leadership in delivering product benefits with higher performance and cost advantage beyond the current N7,” said Dr. Kevin Zhang, TSMC Vice President of Business Development. “Building upon the broad success of our 7nm technology, we’re confident that our customers will be able to quickly extract even higher product value from the new offering by leveraging a well-established design eco-system today.”

This is great news: we now have a legitimate two-horse race for our FinFET design starts. The question is where all of the IP is going to come from for these new nodes. There are thousands of silicon-proven FinFET-based IPs in the ecosystem that will need to be tuned and verified for each and every node. It certainly is a good time to be a semiconductor IP or IP-management software company, absolutely.

About TSMC
TSMC pioneered the pure-play foundry business model when it was founded in 1987, and has been the world’s largest dedicated semiconductor foundry ever since. The company supports a thriving ecosystem of global customers and partners with the industry’s leading process technology and portfolio of design enablement solutions to unleash innovation for the global semiconductor industry. TSMC serves its customers with annual capacity of about 12 million 12-inch equivalent wafers in 2019 from fabs in Taiwan, the United States, and China, and provides the broadest range of technologies from 0.5 micron all the way to the foundry’s most advanced process, which is 7-nanometer today. TSMC is the first foundry to provide 7-nanometer production capabilities, and is headquartered in Hsinchu, Taiwan. For more information about TSMC please visit http://www.tsmc.com.

About Samsung Electronics Co., Ltd.
Samsung inspires the world and shapes the future with transformative ideas and technologies. The company is redefining the worlds of TVs, smartphones, wearable devices, tablets, digital appliances, network systems, and memory, system LSI and foundry. For the latest news, please visit the Samsung Newsroom at http://news.samsung.com.


Using ML to Build Efficient Low Power Platforms for Augmented Vision

by Tom Simon on 04-16-2019 at 7:00 am

We are all pretty familiar with augmented reality, where real world images are overlaid with computer generated images, graphics and even audio. Of course, our first exposure to augmented reality might have been images of heads up displays in fighter jets or perhaps in the movie The Terminator. Augmented reality is moving rapidly towards mobile devices, handhelds and also automotive applications. Developing augmented reality systems on these new platforms requires crucial decisions about the image processing implementation.

For system-level and embedded vision processing, we have several choices, including GPUs, FPGAs, vision DSPs, or vision DSPs combined with neural networks. The considerations include cost, energy efficiency and performance.

At Embedded World 2019 Synopsys presented a paper on implementing augmented reality in low-power systems. The presenter was Gordon Cooper, product marketing manager for processor IP at Synopsys. One focus of his presentation was simultaneous localization and mapping (SLAM), the technique used to determine the actual 3-D location of objects relative to the camera. To do this in real time, a 3-D model of the environment must be produced and the location of the camera must be determined. Often a two-camera technique is used, relying on stereoscopic vision to determine distances. However, many new systems have only a single camera, so techniques are needed to determine distance using monocular SLAM.

Using a single camera requires more complicated algorithms. Depth cannot be directly inferred from a single image because it can be difficult to determine absolute scale. So Synopsys asked whether neural networks can be used to improve depth maps for monocular SLAM. They found that this has been studied and can be an attractive approach, as shown in the paper Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture by David Eigen and Rob Fergus (2015). That paper shows that from just a two-dimensional RGB image it is possible to output a depth map as well as determine surface normals.

For real-time processing, frame rates above 25 frames per second are necessary, which leaves just 30 to 40ms of total processing time to avoid significant latency. Coming back to the question of implementation, GPUs can be useful for running neural networks but may consume too much power for automotive applications. CPUs may be able to perform SLAM, but again there are questions of performance, power and area. Synopsys’ solution is to combine these functions into a single embedded processor containing a 32-bit scalar unit, a vector unit and a neural network engine.

Software is the other important element of the complete solution. The software must perform feature extraction and feature matching between frames to determine camera motion. Additionally, there must be support for a variety of different neural network types. Synopsys offers its OpenVX framework which includes C/C++, OpenCL C, OpenCV, OpenVX libraries, and CNN mapping tools. With this their customers can develop user applications that address their specific requirements.

Synopsys also supports a number of optimizations including feature map compression in hardware, which offers runtime compression and decompression, use of simplified Huffman encoding, and CNN DMA with a hardware compression mode. In addition, there is coefficient pruning, where coefficients with zero value are skipped/counted, leading to dramatic reductions in the number of required operations.
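The payoff from coefficient pruning is easy to see in a sketch. This is a toy scalar loop, not Synopsys's hardware: multiplies whose coefficient is zero are simply skipped and counted, so a 70%-sparse filter does roughly 30% of the work.

```python
import numpy as np

def pruned_dot(coeffs, activations):
    """Dot product that skips zero coefficients, returning the result
    and the number of multiplies actually performed."""
    result, mults = 0.0, 0
    for c, a in zip(coeffs, activations):
        if c != 0.0:          # zero coefficients are skipped, only counted
            result += c * a
            mults += 1
    return result, mults

rng = np.random.default_rng(1)
coeffs = rng.normal(size=1000)
coeffs[rng.random(1000) < 0.7] = 0.0   # prune roughly 70% of coefficients
acts = rng.normal(size=1000)

result, mults = pruned_dot(coeffs, acts)
assert np.isclose(result, np.dot(coeffs, acts))
print(f"{mults} of 1000 multiplies performed")
```

In hardware the same idea removes operations (and the associated energy) rather than loop iterations, which is where the "dramatic reductions" come from.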

Synopsys believes that properly implemented vision processors that include neural networks can help improve SLAM accuracy in depth perception and scaling, while also improving performance and lowering power consumption, which many of these applications will need. Although we do not necessarily see augmented reality used day to day yet, it will probably be one of those things that gathers momentum and soon becomes something we rely on for numerous daily activities, such as driving, learning new skills or appreciating the world around us. More information on Synopsys vision processing solutions for SLAM and CNN can be found on their website.


TechInsights Gives Memory Update at IEDM18 NAND Flash

by BHD on 04-15-2019 at 12:00 pm

On the Sunday evening at IEDM last year, TechInsights held a reception in which Arabinda Das and Jeongdong Choe gave presentations that attracted a roomful of conference attendees. Arabinda was first up, giving a talk on the “10-year Journey of Apple’s iPhone and Innovations in Semiconductor Technology”, followed by Jeongdong discussing “Memory Process, Design and Architecture: Today and Tomorrow”.
Continue reading “TechInsights Gives Memory Update at IEDM18 NAND Flash”


From Wild West to Modern Life the Semiconductor Evolution

by Daniel Nenni on 04-15-2019 at 7:00 am

What started as blogs, or vignettes as Wally calls them, posted on SemiWiki is now a free PDF eBook. The journey starts with his school days at Stanford through 20+ years at TI and 24+ years at Mentor Graphics. Wally has traveled millions of miles meeting with every customer imaginable while presenting hundreds of different keynotes on the semiconductor industry. If we ever came out with an AI smart speaker on all things semiconductor it would be called Wally.

The crowning achievement of Wally’s career (my opinion) and one of the most disruptive moves in EDA history is the acquisition of Mentor Graphics by Siemens in 2017 for $4.5B (representing a 21% stock premium). Acquisition rumors had been flying around the fabless semiconductor ecosystem, but no one would have guessed the buyer would be the largest industrial manufacturing company in Europe. At first the rumors were that Siemens would break up Mentor and sell off everything outside the Siemens core business; specifically, that they would sell the Mentor IC Group. Those rumors were flatly denied at the following Design Automation Conference during a CEO roundtable with Wally. Now Mentor, including the IC group, is an integral part of the Siemens corporate strategy. Thank you again, Dr. Walden Rhines; EDA would not have been as fun without you, absolutely.


From LinkedIn:
Wally Rhines is widely recognized as an expert in business value creation and technology for the semiconductor and electronic design automation (EDA) industries. Dr. Rhines was CEO of Mentor Graphics (a “Big Three” EDA company with $1.3B+ revenue) for 24 years, has served on the boards of four public companies, managed the semiconductor business of Texas Instruments (TI), and is a spokesperson, writer and highly-sought-after speaker for the high-tech industry delivering more than twenty keynotes per year.

Dr. Rhines currently serves as CEO Emeritus of Mentor, a Siemens Business, consults for investors, corporations and the U.S. government on strategic directions, value creation and technology, serves on public and private boards, and supervises a $20M foundation. Business achievements include major turnarounds, both at Texas Instruments, through his creation and management of the digital signal processing business, and at Mentor, where he managed more than 3X growth in revenue and a 10X increase in enterprise value before acquisition by Siemens AG.

Dr. Rhines’ technical expertise includes semiconductor design, process engineering and manufacturing as well as financial modeling of trends and value creation. He has been deeply involved in global business development including projects in China and India.

As CEO and Director, he has managed businesses through difficulties including unfriendly takeover attempts, favorable outcomes for both the company and the activists, with three of the world’s leading activist investors, and volatile economic and business cycles. He continues to seek new opportunities to grow businesses, particularly through private equity, consulting and personal investing.

From the book:

From Wild West to Modern Life the Semiconductor Evolution

Foreword

In 1968, Texas Instruments, Motorola, and Fairchild dominated the emerging semiconductor business with 66% combined market share. Over the next fifty years, the industry de-consolidated: dozens of new semiconductor companies emerged, creating a more dynamic market that altered the list of the top ten largest companies. During the same period, an ecosystem of companies emerged to grow the materials, develop the manufacturing equipment, design the software, and create all the other capabilities needed to support what has become one of the most strategic industries in the world. Much of this evolution was driven by relatively young, inexperienced individuals operating in a totally unregulated, free-market, worldwide business environment. I was privileged to work with many of these people and to be involved in some of the revolutionary innovations. Many people, including Daniel Nenni, have asked me to relate some of the stories of game-changing programs and people with whom I was involved, including the dynamics of growth of the Electronic Design Automation (EDA) industry. I’ve put this off for a long time, but Daniel is persistent. So I started writing some short vignettes during long airline flights. This activity required that I contact other people who were involved in this history, some of whom I hadn’t seen for decades, to verify the accuracy of my recollections. I hope this collection of essays provides some feeling for the remarkable history of the growth of an industry as well as insights into its future evolution.

Walden Rhines
March 2019


An old IP theft gets a new Chinese label

by Robert Maire on 04-14-2019 at 7:00 am

The Dutch financial newspaper Financieele Dagblad (FD) reported on the past theft of ASML technology after doing some investigative digging. It now appears that a number of ASML employees, Chinese nationals working in ASML’s Santa Clara office, stole key technology back in 2015. Though ASML talked about it at the time, little was said, and no further information was disseminated.

Brion technology was stolen
Brion technology, which ASML acquired back in 2006, makes software that optimizes the results of the mask and scanner working together to produce better and finer lithographic images. It is critical technology, as it greatly improves the performance of scanners and allows sharper and finer images to be printed, thus enabling Moore’s Law.

The technology was passed on to XTAL, a Chinese company with clear Chinese government connections. ASML apparently figured it out when XTAL started stealing customers away from ASML (obviously with ASML’s own software…). XTAL was started by two former ASML/Brion employees; other co-conspirators had worked at D2S, KLA, Hermes (bought by ASML), Mentor Graphics, and Synopsys. It’s hard to estimate how much money ASML lost, but it was obviously significant.

USB thumb drives are spy tradecraft
As we saw in the Jinhua/UMC/Micron case, employees simply walk out the door of the victim company with all the trade secrets they can copy onto a USB drive. Probably the only limit to the theft is the size of the drive. It’s very hard to prevent this sort of theft, and we are very sure it occurs in every semiconductor company of size, every day of the week.

It is also clear that China is helping, encouraging, and probably directly acting to obtain any and all technical information in the semiconductor industry, through any and all means, to reach its Made in China 2025 goal.

US and others have been unable to stop the loss
So far we have not only been powerless to stop the covert loss through spying but we have also been unable to stop the overt loss through the required “technology sharing” enforced by the Chinese government. Key technology is exiting the back door in USB drives and out the front door in “technology sharing” arrangements.

ASML swept it under the rug to avoid riling China & regulators
ASML kept the theft quiet for years until the newspaper FD dug it up. ASML also publicly denied the clear case of Chinese spying by saying:

“The suggestion that we were somehow victim of a national conspiracy is wrong,” CEO Peter Wennink said. “We resent any suggestion that this event should have any implication for ASML conducting business in China. Some of the individuals (involved) happened to be Chinese nationals,” he added.

The corporate thieves just “happened to be” Chinese, who “happened to be” working for a Chinese company (XTAL is a subsidiary of Dongfang Jingyuan), which “happened to be” financially sponsored by the Chinese government. That’s a lot of happenstance.

Saying that the Chinese stole from ASML wouldn’t get the stolen property back and would only tick off the Chinese government. We saw what happened to Micron when it accused a Chinese-related company of theft: it got shut down in China. In addition, if ASML were to broadcast the fact that China ripped them off, it would give the US government and European regulators even more reason to shut down sales of ASML products in China… best to make the problem go away.

XTAL is gone but the problem lives on
After the $223M judgment against XTAL, the company will likely go belly up. God only knows where the stolen software wound up; you can’t put the toothpaste back in the tube. Micron’s memory process is probably for sale on the dark web as well.

Scanners are too big to steal
Obviously, stealing the plans for an ASML EUV scanner would be useless, as it would be virtually impossible to duplicate one given the very specialized components. Stolen software, on the other hand, can be easily transported, hidden, and used. All semiconductor processes and tools involve a lot of software, and many tools can be easily copied. AMAT and Lam have had problems in the past and will likely have more. KLA has high software content in its tools that is vulnerable.

The industry needs to increase its vigilance significantly as the problem won’t go away with a trade agreement (if we ever get one). We are sure the problem is only getting worse as no concrete steps have been taken to prevent it.

We told you so…
We gave a talk about China’s aspirations and potential issues with IP at Semicon West in 2018. This is a link to the presentation in which we spoke about these issues:

“China Chips- Semicon West 2018”

More Information on ASML/XTAL

XTAL’s website

Legal Summary of ASML v. XTAL

XTAL backgrounds

The Stocks
The ASML “news” is just old news rehashed: it happened years ago, is not relevant today, and is not impactful to ASML’s financial model. ASML took the correct course of action and was not wrong in any way. The information does not add to or detract from existing China-related risks as we see them. In essence it is much ado about nothing, and as such should have no impact on the stocks other than “headline risk”.

The only takeaway is the ongoing risk that all technology firms face in protecting their technology, and that those risks are higher today with higher software content and USB drives that can hold hundreds of gigabytes of data in something the size of a key. The technology industry is more competitive than ever, and this competition is a proxy for the competition for global political dominance.


Real Time Object Recognition for Automotive Applications

by Tom Simon on 04-12-2019 at 7:00 am

The basic principles used for neural networks have been understood for decades; what has changed to make them so successful in recent years is increased processing power, storage, and training data. Layered on top of this is continued improvement in algorithms, often enabled by dramatic hardware performance gains. There was a time not all that long ago when classifying objects in a still picture was impressive, and this was often done with training and classification running on large servers. The growing demand for autonomous vehicles has raised the bar: what is needed is the ability to perform real-time detection and recognition of objects at high frame rates within the power, size, and reliability constraints of automotive systems.

Recently in San Jose, the Autonomous Hardware Summit brought together innovators in this field to discuss the latest technology trends. Autonomous vehicles must be able to identify and classify multiple objects in each frame of high-resolution video, and frame rates must be high enough to keep up with high vehicle speeds. One new neural network type is extremely good at this: You Only Look Once (YOLOv3). It avoids a problem of older approaches, which had to break each frame up into separate identification tasks based on the detection of potential objects in various parts of the image. In those earlier techniques, each candidate region needed to be run through a separate recognition step to determine what, if anything, it contained.
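The single-pass idea can be illustrated with a minimal sketch (my own illustration, not reference YOLO code): YOLO divides the image into a grid, and the cell containing an object’s center is responsible for predicting that object, so every cell is evaluated in one forward pass with no separate per-region recognition step.

```python
# Minimal illustration of YOLO-style grid assignment (hypothetical helper,
# not reference YOLO code): the grid cell containing an object's center
# is the one responsible for predicting it.
def yolo_cell_for_object(cx, cy, img_w, img_h, grid=13):
    """Return the (col, row) grid cell responsible for an object
    whose center pixel is (cx, cy) in an img_w x img_h frame."""
    col = int(cx / img_w * grid)
    row = int(cy / img_h * grid)
    return col, row

# A car centered at (960, 540) in a 1920x1080 frame lands in one cell of a
# 13x13 grid; all cells are predicted simultaneously in a single pass.
print(yolo_cell_for_object(960, 540, 1920, 1080))
```

This is why YOLO's cost is one network evaluation per frame, rather than one evaluation per candidate region as in two-stage detectors.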

YOLO works on the whole image at once, locating and recognizing objects much faster. Of course, its processing power and memory demands make this approach more difficult to implement. At the summit, Dr. Cheng C. Wang, Co-Founder & Senior VP Engineering/Software at Flex Logix Technologies, outlined their approach to tackling this challenge. At a resolution of 1920 x 1080, YOLOv3 can require over 100 billion operations per frame. Flex Logix offers modular neural inference building blocks called nnMAX 1K tiles. They can be added to their EFLX embeddable FPGA to create specialized silicon hardware configurations that maximize performance.

YOLOv3 is made up of over 100 layers, often requiring over 200 billion MAC operations. The Flex Logix nnMAX 1K tile contains 1024 MACs in clusters of 64 with weights stored locally in L0 SRAM. It supports a wide range of data types with optimizations such as Winograd Acceleration when appropriate. nnMAX tiles can be reconfigured rapidly at runtime between layers to optimize data movement. For instance, layer 0 and layer 1 operations can be combined so the intermediate data stays in SRAM. In fact, their nnMAX compiler will automatically combine layers in this manner if there are enough resources.
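As a rough illustration of where those hundreds of billions of MACs come from, a back-of-the-envelope sketch of my own (the layer shapes below are illustrative, not the actual YOLOv3 topology): each convolution layer's MAC count is the product of output resolution, input and output channel counts, and kernel area.

```python
# Back-of-the-envelope MAC count for convolution layers (my own sketch;
# the layer shapes below are illustrative, not the real YOLOv3 network).
def conv_macs(h, w, c_in, c_out, k):
    """MACs for one conv layer: each of h*w*c_out output activations
    accumulates a k x k window across all c_in input channels."""
    return h * w * c_out * c_in * k * k

# Three hypothetical 3x3 layers at decreasing resolution:
layers = [(208, 208, 32, 64, 3), (104, 104, 64, 128, 3), (52, 52, 128, 256, 3)]
total = sum(conv_macs(*shape) for shape in layers)
print(f"{total / 1e9:.2f} GMACs")  # a few GMACs for just three layers
```

With over 100 such layers, it is easy to see how a full-resolution YOLOv3 frame reaches the 200-billion-MAC range, and why keeping intermediate data in on-chip SRAM between fused layers matters so much.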

Flex Logix’s ArrayLINX is used to perform the interconnect remapping of thousands of wires between nnMAX tiles. nnMAX can also connect to 1, 2, or 4 MB of SRAM depending on how it is configured. Optimizing SRAM configurations allows for nnMAX arrays that perform up to 100 TOPS.

One of Dr. Wang’s main points is that when running YOLOv3 on nnMAX tiles, adding resources scales performance effectively. nnMAX can be configured in a wide variety of array sizes. His performance example is a 2-megapixel-per-frame video stream. With nnMAX 4K and 8 MB SRAM, it can handle 10 fps. Going to nnMAX 8K with 32 MB SRAM yields 24 fps. And an impressive 48 fps can be reached with nnMAX 16K and 64 MB SRAM.
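Those figures are consistent with a simple throughput model (my own sketch with assumed clock and utilization values, not Flex Logix’s published model): frames per second is roughly the sustained MAC rate divided by the MACs needed per frame.

```python
# Simple throughput model (my own sketch, not Flex Logix's compiler):
# fps = sustained MAC rate / MACs needed per frame.
def estimate_fps(num_macs, clock_hz, utilization, macs_per_frame):
    """Frames per second an array of num_macs MAC units can sustain."""
    return num_macs * clock_hz * utilization / macs_per_frame

# Assumptions: ~200 GMACs per 2MP frame (from the talk's figures), a
# 1 GHz clock, and illustrative utilization values I chose myself.
for array_size, util in [(4096, 0.5), (8192, 0.6), (16384, 0.6)]:
    fps = estimate_fps(array_size, 1e9, util, 200e9)
    print(f"nnMAX {array_size // 1024}K: ~{fps:.0f} fps")
```

Under these assumptions the model lands in the same range as the quoted 10/24/48 fps, which suggests the reported scaling is close to linear in MAC count once SRAM is sized to keep the array fed.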

Flex Logix also supplies a complete software development environment to accompany the embedded FPGA and nnMAX tiles. The nnMAX compiler maps neural networks described in TensorFlow Lite or ONNX onto the hardware. Given any neural network, their software can output performance metrics based on nnMAX provisioning, the amount of SRAM, and DRAM bandwidth. This makes it easy to understand MAC utilization and the overall efficiency of the proposed architecture.

This new development from Flex Logix looks very exciting for the highly demanding automotive market. YOLO has been a game changer and is rapidly becoming a favorite for real-time image processing. YOLOv3 running on silicon designed with Flex Logix IP should provide an effective solution for meeting the demanding requirements of autonomous vehicle hardware. Their presentation from the Autonomous Vehicle Hardware Summit can be found on the Flex Logix website.