Virtually Verifying SSD Controllers

by Bernard Murphy on 09-26-2019 at 5:00 am

Datacenter storage

Solid State Drives (SSDs) are rapidly gaining popularity for storage in many applications, from gigabytes of storage in lightweight laptops to tens or hundreds of terabytes in datacenter drives. SSDs are intrinsically faster, quieter and lower-power than their hard disk drive (HDD) equivalents, with roughly similar lifetimes, though SSDs are (currently) more expensive. All appealing characteristics in a datacenter, perhaps in some suitable mix with cheaper HDDs. However there are other challenges with SSDs which make them in some ways more difficult to manage.

The memory cells inside an SSD wear out through repeated read/write/erase actions. Also writes to an SSD are at a page or block level (I’ll use block from here on for simplicity). You can’t just update one word; if you want to update a block already containing data you have to copy the block to SRAM, make the update, write the updated block to a new empty block and mark the original block for deletion. So far no problem, but those marked blocks have to be deleted so they can be recycled back into the system. That’s handled by garbage collection, which the controller will run in the background to avoid slowing down host reads and writes.
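The update-by-copy sequence described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `TinyFlash`, the four-word block size and the list-based free pool are all hypothetical, and a real controller's flash translation layer is far more involved.

```python
class TinyFlash:
    """Toy flash translation layer: logical blocks map to physical blocks."""

    def __init__(self, num_physical):
        self.data = [None] * num_physical      # contents of each physical block
        self.free = list(range(num_physical))  # erased blocks ready for writing
        self.stale = []                        # blocks marked for deletion
        self.mapping = {}                      # logical block -> physical block

    def write(self, logical, offset, value, block_size=4):
        """Update one word by rewriting the whole block into a new location."""
        old = self.mapping.get(logical)
        # Copy the existing block into a working buffer (the "SRAM" step).
        buffer = list(self.data[old]) if old is not None else [0] * block_size
        buffer[offset] = value
        new = self.free.pop(0)                 # IndexError if no block is free
        self.data[new] = buffer
        self.mapping[logical] = new
        if old is not None:
            self.stale.append(old)             # mark the original for deletion

    def garbage_collect(self):
        """Erase stale blocks and recycle them into the free pool."""
        reclaimed = len(self.stale)
        for block in self.stale:
            self.data[block] = None
        self.free.extend(self.stale)
        self.stale = []
        return reclaimed

    def read(self, logical):
        return self.data[self.mapping[logical]]
```

Two successive updates to the same logical block consume two physical blocks; the first only becomes reusable after `garbage_collect()` runs.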

There’s an obvious challenge here when write-traffic becomes significant and scattered. Demand for new blocks to write can exceed the pace at which marked blocks are being deleted, in which case writing stalls waiting for garbage collection to catch up. And the supposed fantastic performance of the SSD takes a hit until the backlog is cleared. Which is not great for XaaS providers who want to claim reliably superior throughput.
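To make the stall mechanism concrete, here is a minimal discrete-time sketch. It assumes every write is an update that leaves exactly one stale block behind, and that garbage collection reclaims a fixed number of blocks per tick; the function name and all the rates are illustrative, not measurements from any drive.

```python
def simulate_stalls(free_blocks, writes_per_tick, gc_per_tick, ticks):
    """Count host writes that stall because no erased block is available."""
    stale = 0
    stalled = 0
    for _ in range(ticks):
        served = min(writes_per_tick, free_blocks)
        stalled += writes_per_tick - served   # demand that could not be met
        free_blocks -= served
        stale += served                       # each update leaves a stale block
        reclaimed = min(gc_per_tick, stale)   # background garbage collection
        stale -= reclaimed
        free_blocks += reclaimed
    return stalled


# Write demand at twice the reclaim rate eventually drains the free pool.
heavy = simulate_stalls(free_blocks=10, writes_per_tick=4, gc_per_tick=2, ticks=10)
# Write demand matched to the reclaim rate never stalls.
matched = simulate_stalls(free_blocks=10, writes_per_tick=2, gc_per_tick=2, ticks=10)
```

In the heavy case the free pool absorbs the first few ticks, then throughput drops to the garbage collection rate, which is the long latency tail the next paragraph refers to.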

In managing this problem, storage technologists have come up with a concept of predictable latency through which long tails in the latency distribution can be limited or even eliminated. Characterizing this latency for a new controller design under a wide range of demands obviously requires running with a host model which will faithfully represent realistic datacenter traffic as a driver. Here Mentor have further extended their VirtuaLAB concept for Veloce emulation to provide an SSD reference lab, providing all the necessary virtualized operation components, allowing for a virtualized host such as QEMU, along with debug interfaces. The controller model runs on the Veloce emulator.

What I find particularly interesting about this is the natural fit of a virtualized version of the production storage interface working together with the emulator-based DUT model. In the right contexts I’m a fan of ICE-based modeling, where you connect the emulator to real hardware to get all the real-life variability and odd behavior you will have to accommodate. But dealing with massively complex system loads by building large hardware test systems is impractical and inevitably very incomplete. Here virtualized modeling seems a better solution, given easier scalability to a wide range of applications. This is similar, I think, to the VirtuaLAB interface work Mentor have done with Ixia/Keysight for network testing under a wide range of possible loads.

None of which means you’re going to get everything right pre-silicon in this kind of modeling. I’m not sure the old “first-silicon” goal applies any more in complex system devices. But you can shake out a lot of problems to ensure that validation with that first silicon build will be catching real-life corner-cases and not problems you should have caught in design.

You can read the Mentor paper on this capability HERE.


High-Speed PHY IP for Hyperscale Data Centers

by Tom Dillinger on 09-25-2019 at 10:00 am

A new designation has recently entered the vernacular of the computing industry – a hyperscale data center.  The adjective hyperscale implies the ability of a computing resource to scale corresponding to increased workload, to maintain an appropriate quality of service.

The traditional enterprise data center is often characterized as a back room warehouse of data processing and storage resources, with components of varying capacity and performance.  Customers commonly request resource allocations.  There is typically a long lead time for hardware upgrades and resource growth.

Conversely, the hyperscale data center is by nature modularized and distributed.  The large cloud computing service providers are the models usually associated with hyperscale data centers, yet any IT operation with the following characteristics would qualify:

  • modular facilities for power and cooling delivery

An analogy for the modularity of a hyperscale data center would be the construction of a housing development, where the overall facilities infrastructure is divided into phases, each consisting of individual building lots.

  • workload balancing

The footprint of the hyperscale data center assumes a typical thermal dissipation, to provide the facility cooling – planning for cooling to support maximal dissipation throughout the center would be cost-ineffective.  Balancing the utilization of resources involves thermal monitoring and support for workload relocation.

  • high availability

Hyperscale architectures include the capability to replicate/restart workloads across servers, in case of a failure.

The modularity in hyperscale data center architectures is associated with the ubiquitous server rack, as depicted in the figure below.

Figure 1.  Common server rack hardware configuration, illustrating optical module or Direct Attach Copper connectivity to the Top-of-Rack switch.  (Source:  Synopsys)

Top of Rack (ToR) is a common position for the network switch hardware.  The figure above also indicates the increasing network switch bandwidth required – e.g., 25.6 Tb/sec and 51.2 Tb/sec – and the network interface card technologies used in these rack configurations.

When describing the connection bandwidth, the key parameters are:

  • serial (SerDes) data rate and the number of SerDes lanes

The effective data rate is reduced (slightly) from the SerDes specification due to the additional bits added to the payload as part of the data encoding algorithm.

  • insertion loss and crosstalk loss of the connection medium, and the range of the connection

The key overall specification to achieve is the bit error rate (BER), which is determined by a number of factors – e.g., Tx equalization, Rx adaptation to optimize signal sampling time, and especially, the frequency-dependent insertion loss and crosstalk interference of the connection.
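As a rough illustration of the encoding overhead mentioned above for the SerDes data rate, the sketch below computes payload bandwidth from the line rate. The article does not name a specific encoding; 64b/66b is assumed here only because it is a familiar scheme, and the function name is hypothetical.

```python
def effective_rate_gbps(line_rate_gbps, lanes, payload_bits, coded_bits):
    """Payload bandwidth after line-coding overhead, summed over all lanes."""
    return line_rate_gbps * lanes * payload_bits / coded_bits


# Assumed example: four lanes at 25.78125 Gb/s with 64b/66b coding
# (64 payload bits carried in every 66 transmitted bits).
payload = effective_rate_gbps(25.78125, lanes=4, payload_bits=64, coded_bits=66)
```

With these assumed numbers the four lanes carry 100 Gb/s of payload out of 103.125 Gb/s on the wire, a reduction of about 3%.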

For these very high-speed data rates, individual specifications for these losses are often provided (the loss acceptance mask versus frequency), for different configurations – e.g., chip-to-chip (short reach); backplane with 2 connectors (~1m), and Direct Attach Copper cable (~3m).  Increasingly, above a 100 Gbps serial rate, copper cabling in the rack may be displaced by low-cost optics and/or a transition of the network switch to a middle-of-rack (“MoR”) position.

I had the opportunity to chat with Manmeet Walia, Senior Marketing Manager for High-Speed PHY Development at Synopsys, about the characteristics of the hyperscale data center, the increased data communications bandwidth, and the ramifications of these trends on hardware design.

“There are several key trends emerging.”, Manmeet indicated.  “For improved efficiency, Smart network interface cards (“SmartNIC”) are being offered, with additional capabilities for network packet processing to off-load the host.”  

“The intra-rack bandwidth requirements are increasing – 56G and 112G Ethernet are required.”, Manmeet said.   The figure below highlights how these IP are used in support of various aggregate Ethernet speeds, using multi-lane configurations.  The targets for bandwidth between data centers are also shown below.

Figure 2.  Evolution of Ethernet speeds, targets for DC-to-DC bandwidth.  (Source:  Synopsys)

“Switch designs are integrating electro-optical conversion and optical fiber connectivity for the Ethernet physical layer even in medium- and short-range configurations.  Inter-rack and data center-to-data center bandwidth must also increase to accommodate the network traffic.”

Manmeet provided the figure below to illustrate how electro-optical conversion is transitioning from a distinct network card module to an integral part of advanced packages, with optical fiber used locally.  (The electrical SerDes signal conditioning retiming functionality required at high data rates is thereby eliminated.)

Figure 3.  Electro-optical conversion transition from a module to an integrated function.  (Source:  Synopsys)

He continued, “The 56G and 112G Ethernet communications require PAM-4 signaling – conventional NRZ signaling for these networking applications maxes out at 28G.”

Figure 4.  PAM-4 Signaling Eye Diagram and Test Challenges.   (Source:  Teledyne LeCroy)

Briefly, PAM-4 signaling implies there are 4 different possible voltage levels to be sensed at the center of the signal eye.  The PAM-4 signal sensing window is therefore 1/3rd of the NRZ (PAM-2) height.  The linearity of the 4 signal levels is a critical parameter.
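A small sketch may help with the arithmetic behind these statements. It assumes a Gray-coded mapping of bit pairs to the four levels (a common choice, though the article does not specify one) and idealized, evenly spaced levels; the function names are illustrative.

```python
import math

# Assumed Gray-coded mapping: adjacent levels differ in only one bit.
GRAY_PAM4 = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def pam4_encode(bits):
    """Map a bit sequence to PAM-4 symbols, two bits per symbol."""
    if len(bits) % 2:
        raise ValueError("PAM-4 needs an even number of bits")
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def symbol_rate_gbaud(bit_rate_gbps, levels):
    """Symbols per second needed to carry a given bit rate."""
    return bit_rate_gbps / int(math.log2(levels))


def eye_height_fraction(levels):
    """Height of one eye relative to the full swing, for evenly spaced levels."""
    return 1 / (levels - 1)
```

Under these assumptions `symbol_rate_gbaud(56, 4)` is 28, so a 56G PAM-4 lane toggles at the same symbol rate as a 28G NRZ lane, while `eye_height_fraction(4)` gives the one-third eye height noted above.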

As with an NRZ serial signal, minimizing crosstalk noise is crucial, especially with the reduced voltage sense differences with PAM-4.

The time sampling in the center of the eye opening for the PAM-4 signal is more complex, as well.  The jitter at the edges of the eye is magnified in PAM-4, due to the varying transitions between individual levels in successive unit intervals.  Separate Tx and Rx clock sources are used for training and auto-negotiation, potentially on a per-lane basis.  Additionally, there is a common requirement to support varying speed settings, again potentially for each lane.

The IEEE 802.3cd working group is establishing 56G/112G Tx/Rx signal specifications and PAM-4 standards for various network topologies, from ultra-short to long-reach, and for shielded/balanced copper and optical fiber cables.

“What’s new in the Synopsys PHY group?”, I asked.

Manmeet replied, “At the TSMC Open Innovation Platform symposium, we are highlighting our N7 56G and 112G PHY IP.  We are also providing reference design evaluation hardware and software evaluation platforms for customers.”

Manmeet included the following roadmap for large network switch SoCs, integrating 256 lanes of the 56G and 112G PHYs.

Figure 5.  Example configuration of high-speed PHYs, for large network switch SoC designs.  (Source:  Synopsys)

“The 56G PHY IP is provided in an X4 lane increment.  The DesignWare Physical Coding Sublayer (PCS) enables the networking protocol to span a wide range of data rates.  The 112G PHY is offered in an X1 lane unit, with similar PCS flexibility.”

Figure 6.  Synopsys 56G and 112G Ethernet PHY implementation architecture.  (Source:  Synopsys)

Manmeet added, “Synopsys will be providing customers with additional design materials.  SoC physical layout PHY tiling and power delivery recommendations are included, based on the results of our package escape studies.  IBIS-AMI models are provided.  A software toolkit enables evaluation of Tx and Rx settings and signal eye characteristics.  A test chip evaluation card is also available.”

 

Figure 7.  Design kit materials for 56G and 112G IP evaluation:  IBIS-AMI model, software toolkit for lab evaluation, reference card.   (Source:  Synopsys)

 

Several trends are clear:

  • Computing models are rapidly adopting the characteristics of flexible “hyperscale” data centers.

 

  • The volume of network traffic demands an increase in bandwidth to 400G, enabled by the availability of 56G and 112G Ethernet PHY IP utilizing PAM-4 signaling, whether for short-reach or long-reach configurations, utilizing copper or optical cable physical layer interconnects. (For ultra-short reach interfaces, these Synopsys PHYs also include options to optimize power for low-loss channels.)

 

  • The complexities of integrating a large number of high-speed 56G/112G Ethernet PAM-4 SerDes lanes to optimize signal losses and crosstalk require more than just “silicon-proven” test chips from the IP provider. A strong collaboration between customers and the IP provider is needed to adapt to the SoC metallization stack and to leverage the available software/hardware reference materials.

 

Here are links to additional information that may be of interest.

Synopsys 112G Ethernet PHY IP press release –  link.

Article:  “Shift from NRZ to PAM-4 Signaling for 400G Ethernet” – link.

Youtube video on 7nm 56G Ethernet PHY IP performance results – link.

Synopsys DesignWare 112G Ethernet PHY IP – link.

 

-chipguy

 


How Should I Cache Thee? Let Me Count the Ways

by Bernard Murphy on 09-25-2019 at 5:00 am

cache hierarchy

Caching intent largely hasn’t changed since we started using the concept – to reduce average latency in memory accesses and to reduce average power consumption in off-chip reads and writes. The architecture started out simple enough, a small memory close to a processor, holding most-recently accessed instructions and data at some level of granularity (e.g. a page). Caching is a statistical bet; typical locality of reference in the program and data will ensure that multiple reads and writes can be made very quickly to that nearby cache memory before a reference is made outside that range. When a reference is out-of-range, the cache must be updated by a slower access to off-chip main memory. On average a program runs faster because, on average, the locality of reference bet pays off.

For performance, that cache memory has to be close to the processor and therefore small; it can only hold a limited number of instruction or data words. As programs and data sizes grew, added levels of caching (still on-chip) became important. Each holds a larger amount of data at the price of progressively slower accesses, but still much faster than main memory access. These levels of cache were named, imaginatively, L1 (for level 1, closest to the processor), L2, L3 and in some cases L4. Whichever of these is closest to the external memory interface is called the last-level cache.
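The statistical bet across cache levels can be put into numbers with the standard average-access-time calculation. The sketch below is illustrative only: the latencies and hit rates are made-up values, and the function name is hypothetical.

```python
def average_access_time(levels, memory_latency):
    """Average latency for a cache hierarchy.

    levels: list of (access_latency, hit_rate) pairs, ordered from L1 outward.
    Every access that reaches a level pays that level's latency; the
    fraction that misses moves on to the next level, and finally to memory.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * latency
        reach *= 1.0 - hit_rate
    return total + reach * memory_latency


# Assumed numbers: L1 at 1 cycle with 90% hits, L2 at 10 cycles with 80% hits,
# main memory at 100 cycles.
cycles = average_access_time([(1, 0.9), (10, 0.8)], memory_latency=100)
```

With these assumed values the average comes out at roughly 4 cycles, against 100 cycles if every access went to main memory.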

Then we started to see multi-processor compute. Each processor wants its own cache for performance but they all need to work with the same main memory, introducing a potential coherence problem. If processor A and processor B read/write to the same memory address, you don’t want them getting out of sync because actually they’re each working with a local copy in their own caches.

To avoid chaos, all accesses still have to be synchronized with the central main memory address space. Delivering this assurance of coherence between caches depends on a lot of behind-the-scenes activity to snoop on locations written/read in each cache, triggering corrective action before a mismatch might occur.
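A deliberately simplified sketch of the idea: before a write lands, copies of that address in every other cache are invalidated, so the next read there has to fetch the fresh value. This is a write-through, write-invalidate toy with hypothetical names. It is not the protocol used by any particular interconnect, which would track line states and avoid writing through to memory on every store.

```python
class CoherentSystem:
    """Toy write-invalidate coherence across per-processor caches."""

    def __init__(self, num_caches):
        self.memory = {}
        self.caches = [{} for _ in range(num_caches)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:
            # Miss: fetch the current value from main memory.
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        # Snoop step: drop every other cache's copy of this address.
        for other, cache in enumerate(self.caches):
            if other != cpu:
                cache.pop(addr, None)
        self.caches[cpu][addr] = value
        self.memory[addr] = value  # write-through keeps the sketch simple


system = CoherentSystem(num_caches=2)
system.read(0, 0x10)       # processor A caches the location
system.read(1, 0x10)       # processor B caches it too
system.write(0, 0x10, 5)   # A writes; B's copy is invalidated
latest = system.read(1, 0x10)
```

Without the invalidation loop, processor B would keep returning its stale local copy after A's write.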

For SoCs, Arm provides a cache-coherent communications solution between their processor IPs called the DynamIQ Shared Unit (DSU), solving the problem for fixed clusters of Arm CPUs. But for the rest of the SoC you need a different solution. Think of an AI accelerator for example, used to recognize objects in an image. Ultimately the image (or more likely stream of images) is going to come through the same memory path for processing by AI and non-AI functionality alike.

So AI accelerator accesses must be made coherent with the rest of the coherent subsystem. Arteris IP provides a solution for this through their Ncore cache coherent NoC interconnect, which provides proxy caches to coherently connect non-coherent AI accelerators with the SoC coherent subsystem. You can do this with multiple accelerators, even multiple AI accelerators, a configuration which apparently is becoming popular in a number of devices.

Next consider that a number of these accelerators are becoming pretty elaborate and pretty big in their own right. Now the Ncore interconnect is not just a way to connect the accelerator to the SoC cache subsystem but a full coherent interconnect supporting multiple caches inside the accelerator.

This is needed because AI accelerators depend even more heavily on localized memory for throughput; a grid-based accelerator might have local cache associated with each processing element or group of processing elements. This coherent interconnect performs a similar function to the Arm coherent DSU but inside a non-Arm subsystem and using a NoC architecture which can more practically extend over long distances.

Then think further on those long distances and the fact that some of these accelerators are now reticle sized. Maintaining coherence directly over that scope would be near-impossible without significantly compromising the performance advantages of caching. What does any good engineer do at that point? They manage the problem hierarchically with multiple domains of local coherence connecting to higher-level coherent domains.

Finally, and coming back down to earth, what about the last-level cache, the one right before you have to give up and go out to main memory? Arteris IP provides the CodaCache solution here which can sit right by the memory controller. This is highly configurable for size, partitioning, associativity and even scratchpad memory to allow AI users to optimize tuning to the dataflow they expect to have with their AI app, perhaps to pre-fetch data they know they will need soon.

Caching has come a long way from those early applications. Arteris IP is working with customers in each of these areas. You can learn more about their Ncore cache coherent interconnect HERE and CodaCache HERE.


Semiconductors back to growth in 2020

by Bill Jewell on 09-24-2019 at 11:30 am

Semiconductor market forecasts 2019 to 2021

The global semiconductor market is headed for the largest decline in 18 years. The market dropped 32% in 2001 when the Internet bubble burst. The 2019 decline should be around 15%, the third largest annual drop after 2001 and a 17% drop in 1985. The current weakness is largely due to excess memory capacity (DRAM and NAND flash) relative to demand. WSTS expects the memory market will decline 31% in 2019 while semiconductors excluding memory will decline only 4%. The market weakness is also due to an uncertain global economy and weakness in key demand drivers.

The second half of 2019 is showing signs of a turnaround. Below are the 2nd quarter 2019 revenues reported by major semiconductor companies and their guidance for 3rd quarter 2019. The non-memory companies showed revenue growth in 2Q 2019 versus 1Q 2019, ranging from 0.3% from Qualcomm to 16.2% from Nvidia. The memory companies (SK Hynix, Micron and Toshiba) experienced revenue declines, except for Samsung which grew 11.2%. A few companies expect healthy growth in 3Q 2019 revenues, ranging from 9.1% from Intel to 15.3% from STMicroelectronics. TI, Infineon and NXP project low single digit growth. Micron and Qualcomm expect revenue declines.

Recent semiconductor market forecasts call for 2019 to decline about 13% to 15%. We at Semiconductor Intelligence are sticking with our June forecast of a 15% decrease. Forecasts for 2020 are in a relatively narrow range, from 4.8% by WSTS to 8% by Semiconductor Intelligence. For 2021, IHS Markit projects accelerating growth to 10% from 6% in 2020. Our Semiconductor Intelligence forecast is for slightly slower growth in 2021 of 7% compared to 8% in 2020.

What are the drivers of the semiconductor forecast? One key is gross domestic product (GDP) which measures overall economic activity. The September forecast from Euromonitor shows a slowing of World GDP growth from 3.7% in 2018 to 3.1% in 2019. GDP growth picks up slightly to 3.3% in 2020 and 2021. The advanced economies (including the U.S., Euro area, UK, Canada and Japan) decelerate from 2.2% growth in 2018 to 1.7% in 2019 and 1.5% in 2020 and 2021. Two major factors behind the 2019 slowdown are the trade dispute between the U.S. and China and uncertainty over the UK’s exit from the European Union (Brexit). Growth is stronger in the emerging economies (including China, India, Russia, Southeast Asia and Latin America) at 4.3% in 2019, picking up to 4.6% in 2020 and 2021. Slowing growth in China is offset by accelerating growth in India, Southeast Asia and Latin America.

The two largest applications for semiconductors are smartphones and PCs/tablets. IDC projects smartphone units will decline 2.2% in 2019 after falling 3.4% in 2018. Growth should turn positive in 2020 and 2021 as 5G smartphones enter the market. Gartner expects the combination of PC and tablet units to decline 1.5% in 2019 after a 2.5% decline in 2018. The rate of decline will slow to 1.4% in 2020 and 0.6% in 2021.

Against this backdrop, the semiconductor market is unlikely to show strong growth in the next few years. Our Semiconductor Intelligence forecast of 8% semiconductor market growth in 2020 is largely due to a bounce back from the 15% decline in 2019. We expect growth to moderate to 7% in 2021 due to continued economic uncertainty and lackluster end equipment market growth.
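The "bounce back" point is easy to check by compounding the forecast percentages. The sketch below indexes the 2018 market at 100 and applies the Semiconductor Intelligence figures quoted above; the function name is just for illustration.

```python
def market_index(changes_pct, base=100.0):
    """Compound a series of annual percentage changes from a base index."""
    index = base
    path = []
    for pct in changes_pct:
        index *= 1 + pct / 100.0
        path.append(index)
    return path


# 2019: -15%, 2020: +8%, 2021: +7%, with 2018 indexed at 100.
index_2019, index_2020, index_2021 = market_index([-15, 8, 7])
```

The index runs to about 85, 91.8 and 98.2, so even after two years of growth at the forecast rates the market would still sit slightly below its 2018 level.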


Webinar of Recent NCTU CDM/ESD Keynote Talk by Dundar Dumlugol – Thursday September 26th

by Daniel Nenni on 09-24-2019 at 10:00 am

With many design teams still searching for an effective means of identifying Charged Device Model (CDM) issues early in the design process, it comes as no surprise that events on this topic generate a lot of interest and are well attended. In July Magwel’s CEO Dr. Dundar Dumlugol had the honor of being invited by Professor Ming-Dou Ker to speak at Taiwan’s National Chiao-Tung University (NCTU) on the topic of ESD and CDM event simulation. NCTU’s esteemed electrical engineering department serves as a center of expertise on the topic of ESD.

The session was well attended (myself included) and provided an excellent overview of the larger scope of ESD in general and on CDM specifically. Dundar reviewed the existing methods of analyzing an IC design for weaknesses that could lead to CDM related failures. He also discussed Magwel’s approach to solving these challenging design problems. During the talk he showed several examples of CDM discharge scenarios and what tools should be looking for in order to properly simulate the events.

The July seminar was recorded and is replayed HERE. This is a good opportunity to learn about the most effective ways to prevent CDM failures in finished designs. Magwel has applied their expertise in solver-based extractors and dynamic simulation to develop their CDMi tool. CDMi simulates current flows during high speed CDM pulses and determines which ESD devices are triggered. Overvoltage or overcurrent violations are reported through an intuitive UI.

About Us
Magwel® offers 3D field solver and simulation based analysis and design solutions for digital, analog/mixed-signal, power management, automotive, and RF semiconductors. Magwel® software products address power device design with Rdson extraction and electro-migration analysis, ESD protection network simulation/analysis, latch-up analysis and power distribution network integrity with EMIR and thermal analysis. Leading semiconductor vendors use Magwel’s tools to improve productivity, avoid redesign, respins and field failures. Magwel is privately held and is headquartered in Leuven, Belgium. Further information on Magwel can be found at www.magwel.com


Cadence Celsius Heats Up 3D System-Level Electro-Thermal Modeling Market

by Tom Simon on 09-24-2019 at 6:00 am

A few years back people were saying that the “EDA” problem was solved and that design tools had become commodity. At the same time people hailed ADAS, smart homes, mobile communication and AI as the frontiers of electronics.  Perhaps it could be said that layout tools, routers, placers, and circuit simulators had largely matured and often could be used interchangeably to successfully complete electronic designs. However, there were new challenges arising due to the increasing complexity of electronic systems – like the very ones used in cars, smart homes, mobile and AI.

At more advanced nodes and with increasing computational requirements, ICs have become bigger consumers of power. ICs are also being packaged tightly together in configurations using interposers, 3D ICs, etc. Heat buildup in these systems is a very real concern. IC performance and behavior are affected by temperature, and of course too much heat can lead to failure over time or rapidly in some cases. We all have experienced laptops heating up on our laps or cell phones warming up in our pockets. If the exterior is getting hot, what is happening on the inside, from silicon, to package, board, heat sink and chassis?

Cadence Celsius

System level thermal performance cannot be left to chance; it needs to be simulated and well understood during the design process. Up until now there has not been a solution for system level behavior. Cadence has just announced a system level electro-thermal simulator that includes multi-physics and interfaces with other design tools to effectively solve this problem. IC designs are in Virtuoso, packages are in SiP and boards are in Allegro. 3D assemblies come from 3D CAD packages. Celsius can import the full system design and then model thermal behavior.

Power information is obtained from Cadence Voltus, then steady state or transient thermal results can be obtained through Celsius. With accurate temperature information, Voltus can then provide more accurate power information, improving accuracy of the final results. Celsius uses Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD) to model internal thermal flux and airflow inside of the fully assembled system.
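The feedback between power and temperature described above is essentially a fixed-point problem: power sets temperature, and temperature changes power. The sketch below is not how Voltus or Celsius work internally. It assumes a single lumped thermal resistance and a leakage term that grows linearly with temperature (real leakage is closer to exponential), and every parameter name and value is hypothetical.

```python
def electro_thermal_fixed_point(dynamic_w, leak_ref_w, leak_per_degc,
                                theta_c_per_w, ambient_c,
                                ref_c=25.0, tol=1e-6, max_iter=100):
    """Iterate power -> temperature -> power until the temperature settles."""
    temp = ambient_c
    for iteration in range(1, max_iter + 1):
        # Power step: leakage rises with the current temperature estimate.
        power = dynamic_w + leak_ref_w * (1 + leak_per_degc * (temp - ref_c))
        # Thermal step: lumped model, temperature rise = resistance * power.
        new_temp = ambient_c + theta_c_per_w * power
        if abs(new_temp - temp) < tol:
            return new_temp, power, iteration
        temp = new_temp
    raise RuntimeError("did not converge (possible thermal runaway)")


temp_c, power_w, steps = electro_thermal_fixed_point(
    dynamic_w=2.0, leak_ref_w=0.5, leak_per_degc=0.02,
    theta_c_per_w=10.0, ambient_c=25.0)
```

With these assumed values the loop settles near 52.8 °C and 2.78 W. A single pass using the leakage at ambient temperature would have reported 2.5 W and 50 °C, which is the kind of error the iteration removes.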

At smaller scales the problem of electro-thermal analysis has been tractable, but with the sizes of current and future generations of electronics systems, there has not been a feasible solution. Cadence Celsius is using massively parallel execution to increase capacity and boost performance. Interestingly, they are using a similar capability for their Clarity EM solver for system level analysis. Celsius can add hundreds of CPUs to effectively increase performance. An example provided by Cadence shows that with 320 CPUs, runtime is 10.2X faster than legacy solvers running on 40 CPUs. Even with the baseline 40 CPUs, Celsius is 2.5X faster.

Building today’s advanced electronic products and systems requires integration of many disciplines at once to ensure good results. For instance, an electronic system might appear to have adequate heat dissipation capacity, but transitions from one operating mode to another or changes in power consumption due to elevated temperatures may lead to exceeding allowable design parameters. Simulation and analysis that includes both thermal and electronic domains is necessary to avoid costly late cycle iterations to fix design problems.

With Celsius, Cadence is delivering its second system-level high-capacity and high-performance analysis tool. Both Clarity and Celsius use highly scalable solver technology to boost capacity so real-world systems can be fully analyzed. They also both accept inputs from 3D modeling, board, package and IC design flows, to permit full integration of all design elements. What they are doing today is a far cry from the era when EDA tools were siloed and only looked at a small part of the overall problem. With this move, Cadence is showing how EDA vendors, with the right vision, can leave behind the notion that EDA is a “solved problem” that offers little differentiation. Companies that adopt these tools will have a significant advantage in their design process that will lead to more successful products. Full information on Celsius is available on the Cadence website.


AI Inference at the Edge – Architecture and Design

by Tom Dillinger on 09-23-2019 at 10:00 am

In the old days, product architects would throw a functional block diagram “over the wall” to the design team, who would plan the physical implementation, analyze the timing of estimated critical paths, and forecast the signal switching activity on representative benchmarks.  A common reply back to the architects was, “We’ve evaluated the preliminary results against the power, performance, and cost targets – pick two.”  Today, this traditional “silo-based” division of tasks is insufficient.  A close collaboration between architecture and design addressing all facets of product development is required.  This is perhaps best exemplified by AI inference engines at the edge, where there is a huge demand for analysis of image data, specifically object recognition and object classification (typically in a dense visual field).

The requirements for object classification at the edge differ from a more general product application.  Raw performance data – e.g., peak tera-operations per second (TOPS) – is less meaningful.  Designers seek to optimize the inference engine frames per second (fps), frames per watt (fpW), and frames per dollar (fp$).  Correspondingly, architects must address the convolutional neural network (CNN) topologies that achieve high classification accuracy, while meeting the fps, fpW, and fp$ product goals.  This activity is further complicated by the rapidly evolving results of CNN research – the architecture must also be sufficiently extendible to support advances in CNN technology.

Background

AI inference for analysis of image data consists of two primary steps – please refer to the figure below.

Feature extraction commonly utilizes a (three-dimensional) CNN, supporting the two dimensions of the pixel image plus the (RGB) color intensity.  A convolutional filter is applied to strides of pixels in the image.  The filter matrix is a set of learned weights.  The size of the filter coefficient array is larger than the stride to incorporate data from surrounding pixels, to identify distinct feature characteristics.  (The perimeter of the image is “padded” with data values, for the filter calculation at edge pixels.)  The simplest example would use a 3×3 filter with a stride of 1 pixel.  Multiple convolution filters are commonly applied (with different feature extraction properties), resulting in multiple feature maps for the image, to be combined as the output of the convolution layer.  This output is then provided to a non-linear function, such as Rectified Linear Unit (ReLU), sigmoid, tanh, softmax, etc.
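For readers who prefer code, here is a minimal single-channel sketch of the convolution step: zero padding at the perimeter, a square filter slid across the image at a given stride, and ReLU applied to each result. It is a plain-Python illustration under simplifying assumptions. A real CNN layer sums over all input channels (e.g., RGB), adds a bias, and uses learned weights rather than the all-ones filter below.

```python
def conv2d_relu(image, kernel, stride=1):
    """Zero-padded 2D convolution (CNN style, no kernel flip) followed by ReLU."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    # Pad the perimeter with zeros so the filter can be centered on edge pixels.
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]
    output = []
    for r in range(0, h, stride):
        row = []
        for c in range(0, w, stride):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += kernel[i][j] * padded[r + i][c + j]
            row.append(max(0.0, acc))  # ReLU: negative responses become zero
        output.append(row)
    return output


ones_image = [[1.0] * 3 for _ in range(3)]
ones_kernel = [[1.0] * 3 for _ in range(3)]
feature_map = conv2d_relu(ones_image, ones_kernel, stride=1)
```

With the all-ones inputs, `feature_map` is `[[4, 6, 4], [6, 9, 6], [4, 6, 4]]` (as floats): the center sees nine real pixels, while the corners see only four because the rest of their window is padding.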

The dimensionality of the data for subsequent layers in the neural network is reduced by pooling – various approaches are used to select the stride and mathematical algorithm to reduce the dataset to a smaller dimension.
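Max pooling is one common choice for this reduction; the sketch below assumes a 2×2 window with a stride of 2, which halves each spatial dimension. The function name and the sample values are illustrative only.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, w - size + 1, stride)
        ]
        for r in range(0, h - size + 1, stride)
    ]


sample = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool(sample)
```

Here the 4×4 input reduces to `[[6, 8], [14, 16]]`, a quarter of the original data volume for the following layer.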

Feature extraction is followed by object classification.  The data from the CNN layers is flattened and input to a conventional, fully-connected neural network, whose output identifies the presence and location of the objects in the image, from the set of predefined classes used during training.

Training of a complex CNN is similar to that of a conventional neural network.  A set of images with identified object classes is provided to the network.  The architect selects the number and size of filters and the stride for each CNN layer.  Optimization algorithms work backward from the final classification results on the image set, to update the CNN filter values.

Note that the image set used for training is often quite complex.  (For example, check out the Microsoft COCO, PASCAL VOC, and Google Open Images datasets.)  The number of object classes is large.  Objects may be scaled in unusual aspect ratios.  Objects may be partially obscured or truncated.  Objects may be present in varied viewpoints and diverse lighting conditions.  Object classification research strives to achieve greater accuracy on these training sets.

Early CNN approaches applied “sliding windows” across the image for analysis.  Iterations of the algorithm would adaptively size the windows to isolate objects in bounding boxes for classification.  Although high in accuracy, the computational effort of this method is extremely high and the throughput is low – not suitable for inference at the edge.  The main thrust for image analysis requiring real-time fps throughput is to use full-image, one-step CNN networks, as was depicted in the figure above.  A single convolutional network simultaneously predicts multiple bounding boxes and object class probabilities for those boxes.  Higher resolution images are needed to approach the accuracy of region-based classifiers.  (Note that a higher resolution image also improves extraction for small objects.)

Edge Inference Architecture and Design

I had the opportunity to chat with the team at Flex Logix, who recently announced a product for AI inference at the edge, InferX X1.  Geoff Tate, Flex Logix CEO, shared his insights into how the architecture and design teams collaborated on achieving the product goals.

“We were focused on optimizing inferences per watt and inferences per dollar for high-density images, such as one to four megapixels,” Geoff said.  “Inference engine chip cost is strongly dependent on the amount of local SRAM, to store weight coefficients and intermediate layer results – that SRAM capacity is balanced against the power associated with transferring weights and layer data to/from external DRAM.  For power optimization, the goal is to maximize MAC utilization for the neural network – the remaining data movement is overhead.”

For a depiction of the memory requirements associated with different CNNs and image sizes, Geoff shared the chart below.  The graph shows the maximum memory to store weights and data for network layers n and n+1.

The figure highlights the amount of die area required to integrate sufficient local SRAM for the n/n+1 activation layer evaluation — e.g., 160MB for YOLOv3 and a 2MP image, with the largest activation layer storage of ~67MB.  This memory requirement would not be feasible for a low-cost edge inference solution.
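A back-of-the-envelope calculation shows why activation storage grows so quickly with image size. The sketch below assumes 8-bit (one byte) activations and two hypothetical layer shapes for a roughly 2MP (1920×1080) input; these shapes are illustrative assumptions, not values read from the chart.

```python
def activation_bytes(height, width, channels, bytes_per_value=1):
    """Storage for one layer's output: one value per pixel per channel."""
    return height * width * channels * bytes_per_value


# Hypothetical layer n: full-resolution output with 32 feature maps.
layer_n = activation_bytes(1080, 1920, 32)
# Hypothetical layer n+1: half resolution in each dimension, 64 feature maps.
layer_n1 = activation_bytes(540, 960, 64)

# Both layers must be resident while layer n+1 is being computed.
both_layers_mb = (layer_n + layer_n1) / 1e6

print(f"layer n:   {layer_n / 1e6:.1f} MB")
print(f"layer n+1: {layer_n1 / 1e6:.1f} MB")
print(f"n and n+1: {both_layers_mb:.1f} MB")
```

Under these assumptions a single full-resolution layer already needs about 66MB, the same order of magnitude as the largest activation layer quoted above, and that is before any weight storage is counted.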

Geoff continued, “We evaluated many design tradeoffs with regard to the storage architecture, such as the amount of local SRAM (improved fps, higher cost) versus the requisite external LPDDR4 capacity and bandwidth (reduced fps, higher power).  We evaluated the MAC architecture appropriate for edge inference – an ‘ASIC-like’ MAC implementation is needed at the edge, rather than a general-purpose GPU/NPU.  For edge inference on complex CNNs, the time to reconfigure the hardware is also critical.  Throughout these optimizations, the focus was on supporting advanced one-step algorithms with high pixel count images.”

The final InferX X1 architecture and physical design implementation selected by the Flex Logix team is illustrated in the figure below.

The on-chip SRAM is present in two hierarchies:  (1) distributed locally with clusters of the MACs (providing 8×8 or 16×8 computation) to store current weights and layer calculations, and (2) outside the MAC array for future weights.

The MACs are clustered into groups of 64.  “The design builds upon the expertise in MAC implementations from our embedded FPGA IP development,” Geoff indicated.  “However, edge inference is a different calculation, with deterministic data movement patterns known at compile time.  The granularity of general eFPGA support is not required, enabling the architecture to integrate the MACs into clusters with distributed SRAM.  Also, with this architecture, the internal MAC array can be reconfigured in less than 2 microseconds.”

Multiple smaller layers can be “fused” into the MAC array for pipelined evaluation, reducing the external DRAM bandwidth – the figure below illustrates two layers in the array (with a ReLU function after the convolution).  Similarly, the external DRAM data is loaded into the chip for the next layer in the background, while the current layer is evaluated.
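The benefit of loading the next layer's data in the background can be illustrated with a simple timing model. This is a sketch under stated assumptions, not a model of the InferX X1 itself: the per-layer load and compute times are made-up numbers, and the model assumes the load for layer i+1 overlaps perfectly with the compute for layer i.

```python
def serial_time(load_us, compute_us):
    """No overlap: every layer waits for its own DRAM load to finish."""
    return sum(load_us) + sum(compute_us)


def overlapped_time(load_us, compute_us):
    """Load layer i+1 in the background while layer i is computing."""
    total = load_us[0]  # the first load cannot be hidden
    n = len(compute_us)
    for i in range(n):
        next_load = load_us[i + 1] if i + 1 < n else 0
        # The slower of the two activities sets the pace for this step.
        total += max(compute_us[i], next_load)
    return total


# Hypothetical per-layer times in microseconds.
load = [40, 30, 50]
compute = [100, 20, 60]

serial = serial_time(load, compute)          # 300
overlapped = overlapped_time(load, compute)  # 250
```

In this toy example the second layer is load-bound (its compute finishes before the third layer's data arrives), so only part of the DRAM transfer time is hidden.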

Geoff provided benchmark data for the InferX X1 design, and shared a screen shot from the InferX X1 compiler (TensorFlowLite), providing detailed performance calculations – see the figure below.

“The compiler and performance modeling tools are available now,” Geoff indicated.  “Customer samples of InferX X1 and our evaluation board will be available late Q1 2020.”

I learned a lot about inference requirements at the edge from my discussion with Geoff.  There’s a plethora of CNN benchmark data available, but edge inference requires laser focus by architecture and design teams on fps, fpW, fp$, and accuracy goals, for one-step algorithms with high pixel count images.  As CNN research continues, designs must also be extendible to accommodate new approaches.  The Flex Logix InferX X1 architecture and design teams are addressing those goals.

For more info on InferX X1, please refer to the following link.

-chipguy

 


Actel Goes Public – The IPO

by John East on 09-23-2019 at 6:00 am

In 1990 Xilinx notified us that they believed Actel was infringing a patent that had just been issued to them. My immediate thought: the patent system is all screwed up! Actel had been developing our product for five years. We had been shipping it for a year and a half. During all that time, we were totally unaware that there was or ever would be such a patent – there was no way to know that Xilinx had filed for a patent or what was in the filing. Then, when the patent finally issued – BAM!! We’d been blindsided!! The “normal” way to handle this situation is to take a license – to agree to pay a royalty to the owner of the patent. There was a problem with that. Bernie Vonderschmitt (the CEO of Xilinx) didn’t want to give any licenses. I can’t really say that I blame him for wanting to keep the FPGA market to himself. It was hard to get mad at Bernie. He was a classy guy. Yet, if it had gone to court and Xilinx had won, they would have been completely within their legal rights to enjoin Actel from shipping our FPGAs. That would have forced us to close our doors – to shut Actel down. I spent the better part of three years trying to solve this potentially very, very serious problem.

In those days the rule of thumb was that a start-up could go public once they had achieved two consecutive profitable quarters. We reached profitability in 1990 and were hot to go public. We selected bankers – Goldman Sachs was the bulge-bracket bank – and started preparation. Then, Goldman’s lawyers started to get cold feet. What if the patent battle couldn’t be settled? If the worst case came to pass – that is, we went public, took money from new shareholders, and then had to shut our doors because we had lost the patent suit – the lawsuits would be everywhere. Eventually we had to put the IPO on hold. (IPO = Initial Public Offering, AKA “going public”.)

I’m not sure that we could have settled the Xilinx patent suit.  Bernie hadn’t shown any interest in doing so.  But then a good break came our way.  Khalid el Ayat found that the early Xilinx parts didn’t use what we called segmented routing but that their later family (the 4000 family) did.  One of our original patents had some claims dealing with segmented routing techniques (thanks Abbas El Gamal,  John Green and team).  That allowed us to file a countersuit against Xilinx.  Calmer minds prevailed.  We settled the suits peacefully.

Note:  Xilinx later sued Altera for infringing that same patent.  They weren’t able to settle and eventually went to trial.  My analysis was that Altera did infringe, but that the patent wasn’t valid over prior art.  The jury eventually ruled that the patent was indeed valid but that Altera didn’t infringe.  In my mind, wrong on both counts.   That’s the reason that you never want to go to a jury trial with a technical issue.  The jury won’t understand it and has an excellent chance of getting it wrong.  This case was very technical.  I doubt that the jury members understood 10% of what they were being told.

Once the Xilinx suit was settled,  our IPO efforts cranked up in full force.  Everything was looking good.  Then we hit another snag.  We had been selling 1.2 micron material, but we needed 1.0 micron product if we were to be cost competitive.  The process was ready to go,  or so we thought,  but the first production runs to come out had terrible yields.  If we couldn’t run the 1.0 process,  we couldn’t compete well cost-wise.  That would have to be disclosed in the IPO documents.  We had to put the IPO on hold for a second time.  Then, we cracked the code.  We figured out the problem and fixed it.  Thanks Steve Chiang!  Thanks Esmat. The IPO was back on.

IPOs are hard work.  Huge amounts of effort go into putting together the documents.  Everything that could end up mattering to a prospective shareholder must be disclosed and explained.  The explanations had to be right.  If the stock went down for any reason other than the market’s fickle nature,  you could expect a lawsuit.  The point of the suit would be that you hadn’t properly disclosed the risk of some aspect of the company and that if you had, the shareholders wouldn’t have bought the stock and hence wouldn’t have lost their money.  I spent many, many, many hours writing those documents. It took a long time and a lot of effort by a lot of people to get the documents ready!  Those documents are a big, big deal!

Once you have the documents done they must be approved by the Securities and Exchange Commission. The SEC folks are tough. They invariably find fault and demand improvements. Once the SEC has given their approval, you can start the “road show”. The point of the road show is to meet with the people who are about to invest and teach them the basics of the company’s business. The road show is grueling. You present the company pitch over and over and over again. Breakfast presentations. Morning one-on-ones. Lunch presentations. Afternoon one-on-ones. Dinner presentations. Then, you head to the airport, fly to a new city, and do it again. One day we had a breakfast meeting in Minneapolis, a lunch meeting in Chicago, a dinner meeting in Madison, and then flew to the east coast where a full day was set up for the following day. When we landed, Dave Courtney (our Goldman Sachs guy) had a message that our afternoon meeting the next day had been canceled. That would mean that we had a couple of hours on our hands. I suggested that we go see the Liberty Bell – I had never seen it. Dave thought that I was really stupid! “John, the Liberty Bell is in Philadelphia. We’re in Boston.” Actually, I knew where the Liberty Bell was. I just didn’t know where I was. All the cities had become a blur by then.

A couple of days later in New York we got on a plane at Kennedy Airport and prepared to head home. We were done!!! It was a Friday afternoon. We would celebrate over the weekend – then, Monday morning when NASDAQ opened, there would be a new listing: ACTL. Has a nice ring to it, don’t you think? ACTL!!!!! Just before I boarded the plane, someone from Goldman brought me a fax that had arrived a few minutes earlier. I took a sip from the glass of wine that I’d already treated myself to and opened the envelope. It was a letter from a struggling start-up accusing us of infringing a patent of theirs. I almost spit out the wine!!! When I took a quick look at the patent, it wasn’t clear that it really applied to us. But their timing was clearly calculated to get some quick cash from us. The patent would constitute a new “risk factor” that should be disclosed to all potential investors. To do that, though, would mean putting the IPO on hold for a third time – redrafting and reprinting all of the documents, waiting a month or so for the new info to disseminate, going through SEC approval again and then doing the road show all over again. Devilishly clever timing, don’t you think? But also, in my mind, completely without class. We were supposed to “price” and open the market Monday morning. The weekend when I planned to relax and celebrate was shot. We sorted it out that weekend. I don’t remember any details, but I’m sure it involved a tidy payment to the very clever offended party. What I do remember is being really relieved that we had sorted it out.

Monday morning, August 3, 1993 when they rang the opening bell*, Actel stock was trading publicly right along with the likes of Apple and Intel!! We opened at $9.50 per share. During the day we ran up to $13.00. We were a public company.

Next week:  The Mars Rovers.

*Actually they don’t ring a bell at NASDAQ.  They just push a button.  It’s the NYSE that rings a bell.  I found that out in 2003 when I was invited to open the market as we celebrated our tenth anniversary of being a public company.  I gave a little speech,  waited a couple of minutes, and when they gave the sign, reached out and pushed a small black button.  Six years in college for that?!!

Pictured:  NASDAQ building at 42nd and Broadway in New York City on the celebration of our tenth year of being listed on NASDAQ.

See the entire John East series HERE.



Criminals Luring in Bitcoin Sellers to Launder Money

by Matthew Rosenquist on 09-22-2019 at 10:00 am

Cybercriminals are luring in bitcoin holders with the promise of easy money if they become a mule to convert stolen assets into clean currency. The reality is these volunteers will just join the ranks of other victims. But that is not stopping people from joining up to replace other mules who have paid the price.

Criminals are selling cash for pennies on the dollar to buyers willing to trade it for Bitcoin, according to the new Q3 2019 Black Market Report from Armor’s Threat Resistance Unit. A recent Cointelegraph article outlines how $800 in Bitcoin buys $10,000 in cash on the Dark Web.

Digital cybercrime is a huge business, accounting for an estimated global cost of $3 trillion in 2015 and predicted to rise to $6 trillion by 2021 according to research from herjavecgroup.com.  A subset of that is monies stolen from victims which criminals want to launder so it cannot be traced back to them.

One of the most popular ways of laundering money is to use a 3rd party to transfer the money through their clean accounts.

For years, criminal outfits have been recruiting dim-witted volunteers to conduct such activities by giving them a percentage of the money they launder. This initially looks like a great opportunity for these volunteers but it is likely the worst decision of their lives.

Mules are a throw-away resource for organized criminals. They are the most exposed point to law enforcement and are often the ones caught. As the mules are purposely insulated from the criminals, they can’t negotiate a lesser sentence by informing who they work for. They are stuck with the burden of the full legal repercussions. The criminals simply recruit new mules to replace the old and leave a trail of discarded minions.

As for the mules who are caught, their future is pretty dark. First, they get arrested for a number of felonies. Supporting organized crime and transporting funds across borders are serious offenses. They should expect prison time. This gives them a lifelong criminal record of financial crime, which will then haunt them any time they seek employment at a company. What business would want to hire a thief? They may be put on travel watch lists, for fear of physically moving cash. The government will seize their assets and they will have to surrender or pay back all illicit profits to the government. Then, the IRS will come for them. Yes, they will still have to pay taxes on all the money they moved, not just their profits. This can be much more than they made being a mule. Those taxes also incur late fees.

Simply put, choosing to become a money mule for easy cash can ruin someone’s life. Doubtful the criminals tell all this in the job recruiting brochure.

It may be tempting, but don’t let yourself, family, or your colleagues be lured into this disaster by the promise of quick cash. Money laundering only helps the criminals and in turn, they will continue to victimize more people, including the mules.


Forget about 5G

by Roger C. Lanctot on 09-20-2019 at 10:00 am

There is a vast amount of hubbub in the automotive industry regarding the onset of 5G technology. Industry excitement is manifest in the 5G Automotive Association (5GAA), which is facilitating collaboration (among 120+ members) between the automotive industry and the wireless industry, possibly for the first time ever.

For years, cars and connectivity have gone together like fish and bicycles – in other words, not at all. General Motors’ OnStar automatic crash notification was a brilliant idea, but the arrival of the smartphone has almost completely erased any consumer enthusiasm for this application.

But cellular skeptics will be wise to arrest their tut-tutting about the future of connected cars because pre-5G LTE-based connectivity in the form of C-V2X is about to change everyone’s understanding of the value of connecting cars. More than 6,000 pedestrians were killed in traffic-related incidents in 2018 in the U.S., a 40% increase from 2008 and now representing 16% of all U.S. highway fatalities. That’s likely to get worse with the proliferation of micromobility. C-V2X is about to change that grim reality forever.

When Ford Motor Company announced its plans to adopt built-in C-V2X technology in all or most of its cars by 2022, the automotive industry shrugged. Ford had long trailed GM in its implementation of embedded connectivity – relying instead on connected smartphones with its clever SYNC solution created in partnership with Microsoft. Ford even found a way to enable automatic crash notifications from connected smartphones – i.e. 911 assist.

SYNC has since evolved, but so has Ford’s thinking about connectivity. The company’s singular embrace of C-V2X is likely to put the company in the vanguard of a movement to leverage vehicle-to-vehicle technology (enabled by C-V2X) and vehicle-to-pedestrian technology to reduce collisions between cars and between cars and pedestrians.

Enabling cars to “see” pedestrians is something of an industry Holy Grail. We’ve all seen the LiDAR images of pedestrians and even the camera-based and night-vision pedestrian detection systems. All of these modalities hint at the potential to use sensors to avoid collisions with pedestrians.

C-V2X, however, will enable so-equipped cars to detect the presence of pedestrians by “detecting” their C-V2X (or 5G enabled) smartphones. As C-V2X proliferates through mobile devices and, eventually, in cars, vehicles will increasingly be able to detect other vehicles and pedestrians.

This ability will initially manifest as driver alerts in dashboard-based systems such as the instrument cluster or infotainment system or even in head-up displays. Eventually, in-vehicle systems will be able to respond automatically, braking or taking other evasive action to avoid collisions.

Qualcomm will be holding an Automotive Showcase next Tuesday, Sept. 24, at the NextEnergy Center in Detroit, Mich., to demonstrate the capabilities inherent in C-V2X technology and explore new life-saving applications. There is no doubt that 5G, when it arrives in the U.S., will have its own transformative impact on vehicle technology development and deployment, but C-V2X will arrive even sooner in cars and in mobile devices.

Gil Scott-Heron was wrong. This revolution WILL be televised…in your dashboard.

Register for the Qualcomm Automotive Showcase here: https://tinyurl.com/yxawjuse