The lofty rise of the lowly FPGA

by Tom Simon on 01-10-2018 at 7:00 am

FPGA programmable logic has served in many capacities since it was introduced back in the early ’80s. Recently, with designers looking for innovative ways to boost system performance, FPGAs have moved front and center. This initiative has taken on new urgency as process-node-based performance gains slow down. The search has moved to algorithmic and architectural innovations that can push performance forward to meet the needs of big data, cloud computing, mobile, networking and other domains.

The new applications for FPGAs are a far cry from the glue-logic uses that they first fulfilled. FPGAs have been moving up the semiconductor food chain for some time, though. They were applied to networking applications by Cisco and others back in the ’90s, as they entered their second decade. Most recently, a major shift occurred when FPGAs were paired with CPUs to facilitate compute-intensive operations. FPGAs cannot adapt to new tasks as quickly as a general-purpose CPU, but they excel at repetitive operations that involve high throughput.

Microsoft has embraced this approach for its cloud and search engine operations after assessing its feasibility in its Catapult project. Another big mover in this space is Intel, with its $16B acquisition of Altera. Long gone are the days when FPGAs were a poor man’s alternative to ASICs. Commercial FPGAs are routinely built on leading-edge process nodes – to wit, Altera going to Intel 22nm for its first FinFET design. FPGAs have become quite efficient, and they come with a bevy of ancillary IP and high-performance I/Os.

In a recent white paper, Achronix argues that the pairing of CPUs and FPGAs was inevitable and in many ways obvious. However, for FPGAs to be effectively paired with CPUs, several further optimizations are required. For one, the FPGA needs cache-coherent access to system memory. Another point that Achronix makes is that data transfer between the FPGA and system memory should operate as fast as possible. They also posit that board area ought to be reduced and that unused or unnecessary IP blocks or modules should be eliminated, to save cost and avoid wasted silicon area.

The Achronix white paper touches on the CCIX group’s work to create a high-speed standard for cache-coherent memory that can be used by heterogeneous processors, I/O devices and accelerators. Recent news on CCIX shows that 25Gb/sec has been demonstrated over PCIe 4.0. However, there is usually a price to pay when going off chip for any data, and especially for cache-coherent data. The solution is to integrate FPGA fabrics into SoCs so they gain the efficiencies of being on-chip.

Achronix has a successful line of FPGA chips, the Speedster22i, but their latest move is shaking up the FPGA market. By taking their proven FPGA technology and embedding it, system designers can reap significant benefits. General-purpose FPGA chips often have resources that are not optimally aligned with the target application. For instance, the off-the-shelf configuration might include IP, embedded memories or LUTs that are not needed. In contrast, Achronix eFPGA offers designers the ability to tailor the FPGA fabric tightly to the system requirements. Also, bypassing the need to go off-chip reduces the I/O pad/ring overhead on both sides, while saving power and improving speed and reliability.

The Achronix white paper covers the history of FPGAs up to the new era of embeddable FPGA fabric, while articulating the advantages of this new approach. Additionally, they provide an overview of how they engage with customers to ensure design success. FPGAs have always been a game changer, and with advances in technology their importance in system design has grown. With the switch over to embedded FPGA technology, an even higher level of performance and efficiency is possible. In some ways, this represents a fundamental shift in SoC design, one that will certainly create new opportunities in many of today’s leading application areas.


French Tech at CES, 2nd country after USA with 274 Start-Up at Eureka Park!

by Eric Esteve on 01-09-2018 at 12:00 pm

France’s presence will be very strong at CES in Las Vegas this year: it is the 3rd country with 365 companies, behind the USA and China. If you take only the start-up count into account, 274 French start-ups will be present, just behind the USA with 280 start-ups! Looking back, it’s a great jump compared with 2017 (178 start-ups) and even greater compared with 2015, when only 66 French start-ups were exhibiting at CES.

If I write about this topic, it’s not just to say “Cocorico” (or “Cock-a-doodle-doo”), but to highlight a deep change of mindset in France, and in the way France is perceived outside the country. Just take two examples, one in pure business (investment in start-ups), the other in media coverage (France has been named “country of the year 2017” by British magazine The Economist). In 2015, it was a major surprise to hear John Chambers say that “France is the next new thing” to explain why Cisco would invest $100 million in French startups over the next three years. For information, it was Cisco’s largest investment in Europe, larger than in the UK for example…

The second example is more recent, dated December 22, 2017, when the British magazine The Economist named France “country of the year 2017” (The title, awarded since 2013, goes to a country that has notably changed for the better over the past 12 months, or “made the world brighter”). If you consider the way business people in the UK were regarding France just a couple of years back, that’s a massive change!

So, what’s new in France to justify the way the country is perceived as a dynamic country at business level?

There is an obvious answer linked with Emmanuel Macron’s election in May 2017 as the new French “président de la République”. He is not simply the youngest French president, he is also a politician who knows more than politics, as he has worked in the banking industry (where he made a lot of money in a couple of years, by the way). In the US, such a CV may not even be mentioned, as many politicians come from industry, but in France that’s a real change: most politicians went into politics straight out of university!

This information is not just cosmetic; it helps explain why Macron grasps the new dynamic of the industry, mostly based on the infinite potential offered by the Internet. In this new industry, algorithms have replaced oil and coal, and cars will become electric before becoming driverless.

If we need proof of Macron’s conviction that startups will provide the new jobs of the future, just remember that he (or his staff) organized the French startups’ presence in Las Vegas in 2016. In fact, he did such an excellent job, running “French Tech Night” (see the above picture) to invite the top tech businessmen from around the world, that some media complained about the cost of the party (€380K), leading to legal investigations… In some ways, France remains old-style France, in the sense that some people absolutely don’t understand the power of marketing. But Macron does, and, by the way, he speaks perfect English!

Just to make it clear, this blog is not a political hagiography of Emmanuel Macron. I simply want to explain what has changed in France, and how it can shape the potentially brilliant future of French startups.

If you want to look at the list of startups, you can go here. Let’s just mention some examples of the various industry segments, from health and agriculture to smart home, smart cities, wine or FinTech.

If you are interested in smart home innovation, you could meet with some of the 70 exhibitors addressing this segment in Vegas this year. The health industry is also well covered, with almost 50 startups, as are services to enterprises (35) and transportation (31).

To conclude, I recall my first job in the high-tech industry in 1985. At that time I had to register with a certain organization, and to do so I had to select a code describing my profile (IC design engineer). In fact, I couldn’t find any code matching my job: it didn’t exist!

But in the 1990s, you could find very good research or design engineers in France, like those who designed OMAP for TI, the first Network-on-Chip (NoC) for Arteris (founded in France in 2005), or the EVE emulation platform (EVE was founded in France in 2000 and is now part of Synopsys).

There are many good, very good designers and software developers in France. But most of the high-tech companies had poor marketing vision (basically limited to issuing a press release). This massive presence at CES 2018 is certainly a good sign of change!

From Eric Esteve from IPNEST


System Level Formal

by Bernard Murphy on 01-09-2018 at 7:00 am

Two recently announced vulnerabilities in major processor platforms should remind us that bugs don’t organize themselves to appear only in domains we know how to test comprehensively. Both Meltdown and Spectre (the announced problems) are potential hardware system-level issues allowed by interactions between speculative execution and cache behavior under specialized circumstances. Finding hardware weaknesses among highly complex interactions is where formal-proving excels, but common belief is that formal analysis on hardware systems of this complexity is beyond the reach of today’s tools, which are typically bounded to block/IP-level proving.


Or so goes conventional wisdom. But not knowing how to test for complex issues doesn’t make them go away, so experienced teams look for creative approaches to formally verify at the system level. Qualcomm (Mandar Munishwar, Sr. Staff Eng. at Qualcomm/Atheros, who also teaches a class at UCSC on system-level assertions and formal) and Vigyan Singhal (CEO of Oski) co-presented on this topic at Oski’s most recent Decoding Formal event.

Mandar kicked off by showing a real example where he was looking for potential system-level deadlocks in a wireless PHY layer sub-system, a controller for a next-gen WiFi core. The central block in this system sequences operations of ten immediately surrounding blocks, together acting as 11 complex intercommunicating state machines. The total system is deep inside the design, therefore difficult to control/observe directly, and the complexity of interactions with other blocks is sufficiently high that deadlocks are possible. All of which fits my earlier characterization – conceptually a perfect fit for formal, but far too big.

When a design is too big for formal, the logical next step is to abstract. You replace memories or datapath elements (for example) in the design with reduced models which exhibit the important features of the replaced component for the purposes of the proof. This works well when just a handful of components stand between you and a feasible proof, but what do you do if all the components are too big? At this point you have to start thinking about architectural formal verification, where you abstract everything in the design.

Mandar explained that each block in this proof, including the central block, is abstracted down to an appropriate FSM. In essence, none of the original RTL remains except for the top-level connectivity, much of which may be ignored as not relevant to the deadlock verification. At this point, some of you will object (in the strongest possible terms). What are you verifying if hardly any of the original RTL remains? There were some attendees who seemed to be wrestling with that objection; I also have had the same concern, but listening to a previous talk on using this method for cache-coherence verification and this talk where abstraction was taken to the limit, along with Vigyan’s explanations both times, I believe I have come to terms with the value and the validity of the method.


This starts with the process. Since what you are verifying is not the design RTL, you need strong links between that RTL and the proof and also, incidentally, the original architectural spec. That last part is a nice plus in these proofs. You derive architectural models as simple FSMs for each block, starting in each case with the architectural spec for the block rather than the RTL. In the abstracted model, you preserve only those details considered relevant to the proof. In this case, Mandar was primarily concerned with handshakes passing between the blocks because he was looking for deadlocks, but he still allowed for timing variability, e.g., non-deterministic latencies.


The top-level design remains the same with architectural models (AMs) replacing the blocks under that top, and block signals irrelevant to the proof are stubbed or left open as appropriate. Also adjustments are made to time-frames to scale multi-microsecond transactions to ~200 cycles for reasonable proof times. To give us a sense of the complexity of sequencing through these intercommunicating FSMs, Mandar showed an example control flow. Proofs are run on this top-level architectural model to find potential deadlocks.
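To make the idea of hunting for deadlocks among intercommunicating abstract FSMs a little more concrete, here is a minimal sketch in Python. It is not the Qualcomm/Oski flow and not how a formal tool works internally: it simply enumerates the reachable joint states of two made-up handshaking state machines and reports any state from which no handshake can fire. The block names, signal names (`req_a`, `ack_a`, `done_a`) and the injected bug are all hypothetical, and the model ignores the non-deterministic latencies mentioned above.

```python
from collections import deque
from itertools import permutations

# Hypothetical abstract models: state -> list of (action, next_state).
# "!x" offers handshake x and "?x" accepts it; a step needs both sides ready.
CONTROLLER = {
    "idle": [("!req_a", "wait_a")],
    "wait_a": [("?ack_a", "idle")],
}
BLOCK_A = {
    "idle": [("?req_a", "busy")],
    "busy": [("!ack_a", "idle")],
}
# Same block with an injected spec bug: it answers on the wrong handshake.
BLOCK_A_BUGGY = {
    "idle": [("?req_a", "busy")],
    "busy": [("!done_a", "idle")],
}


def successors(blocks, state):
    """Yield every joint state reachable by one matched handshake."""
    for i, j in permutations(range(len(blocks)), 2):
        for act_i, nxt_i in blocks[i].get(state[i], []):
            if not act_i.startswith("!"):
                continue  # only block i's offers are considered here
            for act_j, nxt_j in blocks[j].get(state[j], []):
                if act_j == "?" + act_i[1:]:
                    nxt = list(state)
                    nxt[i], nxt[j] = nxt_i, nxt_j
                    yield tuple(nxt)


def find_deadlocks(blocks, initial):
    """Breadth-first search of the product state space; return stuck states."""
    seen, queue, stuck = {initial}, deque([initial]), []
    while queue:
        state = queue.popleft()
        nexts = list(successors(blocks, state))
        if not nexts:
            stuck.append(state)  # no handshake can fire from here
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return stuck


print(find_deadlocks([CONTROLLER, BLOCK_A], ("idle", "idle")))
print(find_deadlocks([CONTROLLER, BLOCK_A_BUGGY], ("idle", "idle")))
```

The first call finds no stuck state; the second reports the joint state in which the controller waits for an acknowledge that the buggy block never sends. With 11 blocks the product state space grows quickly, which is one reason real proofs rely on formal engines rather than brute-force enumeration like this.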

Finally, the AMs are validated against the corresponding RTL. This can be done in simulation or formally. Per Vigyan, the simulation approach works well for relatively simple AMs, where obviously all you want to look at is the handshaking behavior. Formal is used for the more complex AMs, where you validate between the RTL and the assumptions (constraints) made when constructing the AM.

What did Qualcomm get out of this? First, they found bugs in the architectural specs – try doing that with any kind of RTL verification. This is a human step, not automated, but simply the exercise of generating an AM exposes potential problems. Pretty valuable. Then system-level verification exposed bugs. And finally, in AM versus RTL verification they found corner case bugs in the RTL. Altogether they uncovered 9 hard-to-find bugs which could easily have gone through to production or at least a silicon re-spin.

I can understand why verification teams might feel queasy with this approach; what you are proving feels multiple steps removed from what you are building. Vigyan himself told me this approach isn’t for everyone. And yet problems at this level happen and can be difficult even to conceive, much less figure out how to test. Just ask the processor guys scrambling to deal with Meltdown and Spectre. I can’t tell if the Qualcomm approach is necessarily the best way to catch Meltdown/Spectre-type problems, but I do believe formal needs to play a bigger role at this level and it has to work with some kind of architectural spec level because proofs based directly on the RTL are impractical. You can check out Oski’s Decoding Formal videos HERE.


FinFET ASICs for Networking, Data Center, AI and 5G

by Daniel Nenni on 01-08-2018 at 12:00 pm

On the heels of successful seminars in Tokyo and Shanghai, eSilicon is starting the new year back in the cloud with a webinar version of the live events for those, like myself, who could not attend. The webinar will compress the three-hour live event into 60 minutes, which will provide a great place to start a conversation on your next chip for networking, data center, artificial intelligence, and 5G applications.

In talking to my paisan Mike Gianfagna, eSilicon presented a complete ecosystem to address the requirements of advanced ASICs for markets such as high-performance computing, networking, deep learning and 5G. Samsung Memory presented their HBM2 solutions, Samsung Foundry talked about their advanced 14nm FinFET solutions, ASE Group reviewed their advanced 2.5D packaging solutions, eSilicon presented ASIC and 2.5D design/implementation and IP solutions, Rambus detailed their high-performance SerDes solutions and Northwest Logic presented their HBM2 controller solutions.

Mike is one of the few people I work with that has more semiconductor experience than I do. In fact, I worked for Mike many many many years ago at Zycad and he is still one of my favorite bosses, which is saying something because I have worked for a LOT of people in the past 33 years.

The most common lunch conversation Mike and I have is “what’s next” for semiconductors, him on the ASIC side and me on the SemiWiki side. Based on his experience from the live seminars and my end of year wrap up on SemiWiki you will be seeing a lot more advanced ASICs for HPC, networking, AI, and the beginnings of 5G infrastructure. Not all of these chips will come from the top semiconductor companies of course, which brings us to the mighty ASIC business model that I have written about many times.

According to Mike, and I concur, many markets are initially enabled with semiconductor devices in the form of application-specific standard products (ASSPs) or field-programmable gate arrays (FPGAs). The recent rise of deep learning applications has created another ASSP-like option, the graphics processing unit (GPU). All of these technologies offer a predictable unit cost with a specific set of features. This usually works well until one of two things happens:

  • Volume and cost pressures grow to the point where buying a standard product where perhaps only 60 percent of the chip is used becomes a financial problem.
  • Competition in the market becomes so strong that differentiating with the same chip everyone else uses becomes a marketing problem.

When one or both of the above situations occur, traditional and non-traditional chip companies turn to ASICs as a way to reduce unit cost (the chip does exactly what’s needed with no wasted silicon) or as a way to increase differentiation (you can optimize the chip for your application). The result being a more competitive chip, right?

Mike makes a good point in regards to markets that are already high growth and highly competitive. Add to that the fact that the design cost of an ASIC is typically a very small part of the unit cost of the end system which makes ASICs a viable option if not a requirement, absolutely.

Bottom line: Mike’s prediction and SemiWiki analytics predict a robust ASIC business for the HPC, networking, AI, and 5G infrastructure markets in 2018 which brings us to the upcoming webinar:

FinFET ASICs for Networking, Data Center, AI and 5G Using 14nm 2.5D/HBM2 and SerDes
The huge amount of data generated, moved, stored and analyzed around the world today has changed custom IC development forever. The massive, FinFET-class ASICs used in today’s networking, data center, artificial intelligence (AI) and 5G infrastructure applications require doing things in silicon that have never been done before.

High-bandwidth memory (HBM2) addresses performance, power and size issues, but requires a 2.5D implementation. How do you find an ASIC ecosystem you can trust to build a 2.5D chip?

We will present real case studies and roadmaps for high-performance FinFET ASICs. We will introduce a complete ecosystem with proven delivery of high-performance ASICs and 2.5D/HBM2 systems.

Attendees will receive a new 2.5D/HBM2 white paper written by the HBM2 ecosystem, including presenting companies.

Register for the 8:00-9:00 AM or 6:00-7:00 PM Pacific Time webinar.

Agenda
· Samsung: HBM2 memory solutions
· Samsung: Foundry solutions including 14nm FinFET technology
· ASE Group: advanced 2.5D packaging
· eSilicon: ASIC and 2.5D design and implementation, HBM2 PHY, high-speed memories
· Rambus: high-performance SerDes
· Northwest Logic: HBM2 controller

About eSilicon
eSilicon is an independent provider of complex FinFET-class ASIC design, custom IP and advanced 2.5D packaging solutions. Our ASIC+IP synergies include complete, silicon-proven 2.5D/HBM2 and TCAM platforms for FinFET technology at 14/16nm. Supported by patented knowledge base and optimization technology, eSilicon delivers a transparent, collaborative, flexible customer experience to serve the high-bandwidth networking, high-performance computing, artificial intelligence (AI) and 5G infrastructure markets. www.esilicon.com


CEVA ClearVox Simplifies Voice Pickup

by Bernard Murphy on 01-08-2018 at 7:00 am

Voice-based control is arguably becoming another killer app, or killer app-enabler in the very significant shifts we are seeing in automation. After a bumpy start in car feature control (for navigation, phone calls, etc) and early smartphone “intelligent” assistants, voice-based interfaces now seem to be maturing into a genuinely useful capability. As I have said before, it’s about time. Buttons and keyboards, real or virtual, and byzantine menu hierarchies are clumsy, distracting, sometimes dangerous and anyway are a poor substitute for the way we humans would ideally like to communicate with our machines.


AI tends to dominate our thinking in this area – speech recognition and natural language processing being obvious examples – but there’s a rather important step before that: picking up the voice (or voices) and passing clean signals to those algorithms. This isn’t just a question of using a high-fidelity analog path from microphones to the eventual digitized output. Voice-based systems these days frequently use multiple microphones for direction discrimination. You have to deal with (acoustic) reflections from walls and other surfaces, you need to be able to capture commands in noisy environments, and increasingly there is a trend to identifying who is speaking among multiple potential speakers. Handling all of this falls to the front-end of voice-processing, also known as voice pickup.

CEVA is already very active in this space with their CEVA-TeakLite-4 (ultra-low-power) and CEVA-X2 (high-performance) embedded DSP platforms. Now they have introduced a software suite they call ClearVox, bundling the input voice-processing algorithms optimized to drive these hardware platforms, rounding out a complete voice-pickup solution package for voice-enabled system builders. This spans from voice activation (e.g., “Hello Google”) to tracking speakers, beamforming, echo cancellation and noise suppression.


I already talked about voice activation in an earlier blog (Active Voice), based on some very neat technology supporting always-on listening at ultra-low power, allowing everything else to be powered down until the trigger word/phrase is detected.


ClearVox also includes beamforming software which is absolutely essential for directional discrimination and speaker tracking in far-field applications (smart speakers like Amazon Echo and Google Home for example). Multiple omnidirectional microphones, from as few as 2 to as many as 13, provide inputs to the (DSP) software which can then use weighting, filtering and summing to extract a strongly directed signal, which could also be used to track a speaker (whose direction may change).

Another benefit in beamforming is significant noise reduction in the signal. By focusing only on the speaker, input from other directions is very largely suppressed. ClearVox algorithms support circular array technologies, familiar in smart speakers, and also linear technologies, expected to become more popular in smart TVs.
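As a rough illustration of the weighting-and-summing idea, here is a minimal delay-and-sum beamformer sketch in Python. It is a textbook technique under simplifying assumptions (integer sample delays, a single tone, a made-up 4-microphone linear array), not the ClearVox algorithm; practical far-field systems typically add per-channel filtering and fractional delays, which are omitted here.

```python
import math


def delay_and_sum(channels, delays, weights=None):
    """Align each microphone channel by its integer sample delay, then sum.

    channels: list of equal-length sample lists, one per microphone
    delays:   samples by which each channel lags the reference microphone
    weights:  optional per-microphone gains (uniform average if omitted)
    """
    n_mics, n = len(channels), len(channels[0])
    weights = weights or [1.0 / n_mics] * n_mics
    out = [0.0] * n
    for ch, d, w in zip(channels, delays, weights):
        for t in range(n - d):
            out[t] += w * ch[t + d]  # advance the lagging channel by d samples
    return out


def tone(n, period, lag):
    """Sine wave that reaches a microphone `lag` samples late."""
    return [math.sin(2 * math.pi * (t - lag) / period) for t in range(n)]


def peak(samples):
    return max(abs(v) for v in samples)


# Hypothetical array: the talker's wavefront reaches each successive
# microphone 2 samples later than the previous one.
arrival_lags = [0, 2, 4, 6]
channels = [tone(64, 8, lag) for lag in arrival_lags]

toward_talker = delay_and_sum(channels, arrival_lags)  # steered at the talker
elsewhere = delay_and_sum(channels, [0, 0, 0, 0])      # steered broadside

print(peak(toward_talker), peak(elsewhere))
```

Steering at the talker keeps the tone at essentially full amplitude, while the mis-steered beam nearly cancels it. That cancellation is so complete only because this toy geometry puts the four channels a quarter period apart, but it shows why sound from other directions is suppressed.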


Acoustic echo cancellation (AEC) is another essential feature in all voice-activated systems (even in speakerphones). Sound waves from a speaker and other sources in any enclosed area (living room, conference room, car) will reflect back from hard surfaces (walls, tables, etc), resulting in multiple delayed inputs to the microphones on your voice-based system and adding further noise to the signal. But echoes can be recognized and removed given sufficient sophistication in the DSP software. Again, ClearVox provides this capability.

As a part of the AEC algorithms, ClearVox also provides support for barge-in. This matters in voice-driven systems, where the speaker (you) must take priority over any music that may be playing or responses from the personal assistant. Asking your assistant to turn the music volume down isn’t going to work very well if said assistant can’t hear you over the music it is playing, or if it is preoccupied with answering your last question.
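For readers who want a feel for how echoes can be recognized and removed, here is a minimal sketch of a normalized LMS (NLMS) adaptive filter in Python. NLMS is a common textbook approach to echo cancellation; whether ClearVox uses it is not stated, so treat this purely as an illustration. The echo path `room`, the tap count and the step size `mu` are made-up values, and the demo has no near-end talker, so everything the microphone picks up is echo.

```python
import random


def nlms_echo_canceller(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`."""
    w = [0.0] * taps        # current estimate of the echo path
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_est = sum(wi * hi for wi, hi in zip(w, history))
        err = d - echo_est  # mic signal with the estimated echo removed
        norm = sum(h * h for h in history) + eps
        # Nudge the estimate in the direction that shrinks the residual.
        w = [wi + mu * err * hi / norm for wi, hi in zip(w, history)]
        cleaned.append(err)
    return cleaned, w


random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]  # loudspeaker signal
room = [0.0, 0.6, 0.0, -0.3]                                # made-up echo path
mic = [
    sum(room[k] * far_end[t - k] for k in range(len(room)) if t - k >= 0)
    for t in range(len(far_end))
]

cleaned, learned = nlms_echo_canceller(far_end, mic)
residual = sum(e * e for e in cleaned[-500:]) / 500
print(residual, [round(wi, 3) for wi in learned])
```

Once the filter has adapted, the learned taps approximate the echo path and the residual power becomes negligible. Barge-in is the harder case: when you speak over the music, the adaptation has to avoid treating your voice as echo, which this sketch does not attempt.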

Naturally since this is CEVA software running on CEVA hardware (and it only runs with CEVA hardware, if you were wondering) it is optimized for performance and power. You don’t have to sweat those details when you’re building your solution. That said, it is configurable and modular so you can optimize the platform to best suit your needs in your system.

ClearVox is available in two configurations: for near-field applications (headsets, earbuds, hearables and wearables) and far-field applications (smart speakers, smart home, voice enabled IoT, mobile phones). The product is available to lead customers today and will be generally available in Q2 (2018).

CEVA provides a reference design showing use of ClearVox, which you can see this week at CES. Again, this is hot off the presses, so check out the website for more details.


Getting Started with RISC-V

by Daniel Nenni on 01-06-2018 at 4:00 pm

As I mentioned before, SiFive and RISC-V are trending topics on SemiWiki.com which makes complete sense since we have been covering semiconductor IP and ARM since we first went online in January of 2011.

In total we have published 707 IP related blogs that earned 3,565,140 views (5,043 views per blog average). Out of that, 254 are ARM related blogs which earned 1,525,641 views (6,006 average). We also wrote a book on ARM, “Mobile Unleashed”, that has been downloaded more than 100,000 times so that probably helps as well. Thus far we have done 7 RISC-V related blogs which have earned a whopping 117,666 views (16,806 average).

Bottom line: SemiWiki is the place to read about Semiconductor IP from experts, absolutely!

SiFive is just finishing up a three part series of webinars on RISC-V. In fact the third one is next week so you still have time to watch the first two and register for the third. If you are interested in joining the RISC-V revolution this is a must see series. Even if you are a hard core ARM enthusiast, this series will give you a peek into the future of open-source semiconductor IP.

Getting started with SiFive IP
Three part series introducing Engineers to RISC-V and RISC-V Core IP

Part I: RISC-V 101

This one-hour webinar took place on Sep 12, 2017

This webinar provided an introduction to RISC-V, covering areas such as the Register File, Instruction Types, Modes, Interrupts, and Control and Status Registers. Prior knowledge of RISC-V is not necessary, but having a basic understanding of Computer Architecture would be beneficial.

Hosted by:

Drew Barbier; Field Engineer at SiFive, Inc.
Drew has worked in the Semiconductor industry for over 10 years in various engineering and customer facing roles. At SiFive Drew is responsible for a variety of tasks including customer support, software and development tools, ecosystem development, documentation, and whatever makes the customer experience great.

Krste Asanovic, Chief Architect at SiFive, Inc.
In addition to serving as Chief Architect at SiFive, Krste is a Professor in the EECS dept. at the U. of California, Berkeley, where he also serves as Director of the ASPIRE Lab. Krste leads the RISC-V ISA project at Berkeley, and is Chairman of the RISC-V Foundation. He is an ACM Distinguished Scientist and an IEEE Fellow. Krste received a PhD from UC Berkeley and a BA from the U. of Cambridge.

Post Webinar Materials

Part II: Introduction to SiFive RISC-V Core IP
This one-hour webinar took place on Oct 17, 2017

This webinar focused on Embedded Developers who are interested in learning more about the RISC-V architecture. Part two introduced the SiFive RISC-V Core IP Products; the E31 RISC-V Core IP and the E51 RISC-V Core IP.

Hosted by:

Drew Barbier; Field Engineer at SiFive, Inc.
Drew has worked in the Semiconductor industry for over 10 years in various engineering and customer facing roles. At SiFive Drew is responsible for a variety of tasks including customer support, software and development tools, ecosystem development, documentation, and whatever makes the customer experience great.

Jack Kang; VP of Product and Business Development at SiFive, Inc.
Jack has held a variety of senior business development, management, and product marketing roles at both NVIDIA and Marvell, with a track record of successful, large scale design wins. Jack started his career as a frontend design engineer, focusing on CPU architecture and design. Jack received his BS degree in Electrical Eng. and Computer Science from UC Berkeley.

Post Webinar Materials

Part III: Evaluating SiFive RISC-V Core IP

Part 3 of the Getting Started with SiFive IP webinar series will demonstrate how to use the evaluation versions of the E31 and E51 RISC-V Core IP. In this webinar we will download one of the evaluation images, program an Arty FPGA board with it, and use Freedom Studio to load and debug a program.

REGISTER HERE

This one-hour webinar will take place on January 17th, 2018.

About SiFive
SiFive is the first fabless provider of customized semiconductors based on the free and open RISC-V instruction set architecture. Founded by RISC-V inventors Andrew Waterman, Yunsup Lee and Krste Asanovic, SiFive democratizes access to custom silicon by helping system designers reduce time-to-market and realize cost savings with customized RISC-V based semiconductors. SiFive is located in Silicon Valley and has venture backing from Sutter Hill Ventures, Spark Capital and Osage University Partners. For more information, visit www.sifive.com.


CEVA Ups the Ante for Edge-Based AI

by Bernard Murphy on 01-05-2018 at 6:00 am

AI is quickly becoming the new killer app and everyone is piling on board as fast as they can. But there are multiple challenges for any would-be AI entrepreneur:

  • Forget about conventional software development; neural nets require a completely different infrastructure and skill-sets
  • More and more of the interesting opportunity is moving to the edge (phones, IoT, ADAS, self-driving cars). The cloud-based AI we routinely hear about is great for training, not so much for the edge where inference is primary, must be very fast, very low power and can’t rely on big iron or (quite often) a communication link
  • The hardware on which these systems run is becoming increasingly specialized. Forget about CPUs. The game starts with GPUs which have become very popular for training but are generally viewed as too slow and power-hungry for the edge. Next up are DSPs, faster and lower-power. Then you get to specialized hardware, faster and lower-power still. Obviously, this is the place to be for the most competitive neural net (NN) solutions.


The best-known example of specialized hardware is the Google TPU, which is sucking up all kinds of AI workloads on the cloud. But that doesn’t help for edge-based AI – too big, designed for datacenters, not small form-factor devices and anyway Google isn’t selling them. But now CEVA is entering this field with their family of embedded NeuPro processors designed specifically for edge applications.

You probably know that CEVA for some time has been active in supporting AI applications on the edge through their CEVA-XM family of embedded DSPs. In fact they’ve built up quite a portfolio of products, applications, support software, partnerships and customers, so they already have significant credibility in this space. Now, after 25 years of developing and selling DSP-based solutions in connectivity, imaging, speech-related technologies and AI, they have added their first non-DSP family of solutions directly targeting neural nets (NNs) to their lineup, pursuing this same trend towards specialized AI hardware.


The solution, and it is a solution in the true sense, centers around a new processor platform containing a NeuPro engine and a NeuPro vector processing unit (VPU). The engine is an NN-specific system. It supports matrix multiplication, convolution, activation and pooling layers on-the-fly, so it is very fast for the fundamental operations you will have in any NN-based product. Of course NN technology is advancing rapidly, so you need the ability to add and configure specialized layers; this is supported through the VPU and builds on the mature CEVA-XM architecture. Notice that the engine and the VPU are tightly interconnected in this self-contained system, so there can be seamless handoff between layers.

So what does this do for you on the edge? One thing it does is to deliver pretty impressive performance for the real-time applications that will be common in those environments. The product family offers from 2 TOPS to 12.5 TOPS, depending on the configuration you choose. On the ResNet-50 benchmark, CEVA has been able to show more than an order of magnitude performance improvement over their XM4 solution. And since operations run faster, net energy consumed (e.g. for battery drain) can be much lower.

Another very interesting thing I learned when talking to CEVA, and something for which they provide great support, concerns precision. Low-power NNs use fixed-point arithmetic so there’s a question of what precision is optimal. There has been quite a bit of debate around how inferencing can effectively use very short word-lengths (4 bits or lower). Which is great if you only need to do inferencing. But Liran Bar (Dir PM at CEVA) told me there are some edge applications where local re-training, potentially without access to the cloud, is needed. Think about a driver-monitoring system (DMS) which uses face ID to determine if you are allowed to start the car. You’re out in the middle of nowhere and you want your wife to drive, but she isn’t yet set up to be recognized by the DMS. So the system needs to support re-training. This is not something you can do with 4-bit fixed-point arithmetic; you need to go to higher precision. But even more interesting, this doesn’t necessarily require a blanket word-size increase across all layers. Individual layers can be configured to be either 8-bit or 16-bit to optimize accuracy along with performance, depending on the application. CEVA supports modeling to help you optimize this before committing to an implementation, through their CDNN (CEVA Deep Neural Net) simulation package.
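To get a feel for why word length matters, here is a minimal sketch of symmetric fixed-point quantization at 4, 8 and 16 bits. It is not CEVA's CDNN tooling or anything NeuPro-specific; the function names and the weight values are made up for illustration, and it only shows how rounding error shrinks as bits are added, not how re-training behaves.

```python
def quantize(values, bits, max_abs=1.0):
    """Round values to a signed fixed-point grid with `bits` bits, then map back.

    Assumes a symmetric range [-max_abs, max_abs]; real toolchains pick the
    range per layer, which this sketch does not attempt.
    """
    levels = 2 ** (bits - 1) - 1           # 7 for 4-bit, 127 for 8-bit, 32767 for 16-bit
    step = max_abs / levels                # real value of one integer step
    result = []
    for v in values:
        q = round(v / step)
        q = max(-levels, min(levels, q))   # saturate instead of wrapping around
        result.append(q * step)
    return result


def max_error(values, bits, max_abs=1.0):
    """Largest absolute rounding error introduced by quantizing `values`."""
    quantized = quantize(values, bits, max_abs)
    return max(abs(v - q) for v, q in zip(values, quantized))


# Hypothetical layer weights, chosen only to exercise the code.
weights = [0.8123, -0.3307, 0.0042, -0.9991, 0.5]

for bits in (4, 8, 16):
    print(f"{bits:2d}-bit worst-case error: {max_error(weights, bits):.6f}")
```

The worst-case error is bounded by half a step, so each extra bit roughly halves it. Mixing 8-bit and 16-bit layers, as described above, amounts to choosing `bits` per layer rather than once for the whole network.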


You should know also (if you don’t know already) that CEVA has been making waves with CDNN, including the CEVA Network Generator, which can take trained networks developed in any of over 120 NN styles and map them to implementation nets on their embedded platforms. That’s a big deal. You typically do training (in many contexts) in the cloud, but those trained networks don’t just drop onto edge NNs. They have to be mapped and optimized to fit those more compact, low-power inference networks. This stuff is pretty robust – they’ve been supporting it with the XM family for quite a while and they’ve won several awards for this product. Naturally, the same system (no doubt with added tuning) is available with NeuPro.


So this isn’t just hardware, or hardware supporting a few standard NN platforms. It’s a complete edge-based solution, which should enable all those AI-on-the-edge entrepreneurs to deliver highest performance, lowest power/energy solutions as fast as possible, leveraging all the investment they have already made or plan to make in cloud-based NN training. NeuPro is offered as a family of options to support a wide range of applications, from IoT all the way up to self-driving cars, with precision options at 8-bit and 16-bit. Availability for lead customers will be in Q2, and for general release in Q3 this year. This is hot off the press, so see CEVA at CES next week or check out the website.


IEDM 2017 – imec Charting the Future of Logic

by Scotten Jones on 01-04-2018 at 12:00 pm

At IEDM 2017, imec held an imec technology forum and presented several papers. I also had the opportunity to interview Anda Mocuta, director of technology solutions and enablement. In this article I will summarize the key points of what I learned about the future of logic. I will follow this up with a later article covering memory.

Imec is one of the premier semiconductor research organizations in the world today and their work, and the papers and forums describing it, are always interesting.

An Steegen
An Steegen, executive VP of semiconductor technology and systems, gave an overview presentation at the imec technology forum. Looking out five years, some of the key developments she expects, by application segment, are summarized in figure 1.


Figure 1. The next five years.

imec is doing a lot of work on nanowires/nanosheets and on when and how to replace FinFETs, and I will discuss that further below. Foundries will likely scale FinFETs to 5nm; beyond 5nm, nanosheets appear to be emerging as the replacement technology of choice. Beyond nanosheets, imec is looking at vertical FETs and complementary FETs (n and p nanosheets stacked on top of each other). Vertical FETs look particularly attractive for SRAM.

imec is also putting a lot of effort into EUV, specifically photoresist and smoothing techniques for lower doses, and lower absorption pellicles. I will be speaking about EUV at the ISS conference in January and I have been spending a lot of time looking at EUV readiness. Further improvements in pellicle transmission and low dose photoresist with acceptable LER are essential for successful EUV introduction to high volume production, particularly for 5nm foundry logic processes.

Anda Mocuta
Anda Mocuta, director of technology solutions and enablement, followed An and focused on logic device scaling.

Traditional scaling provides a 50% area improvement for each new node. The foundries are having difficulty achieving a 50% area improvement just from contacted gate pitch (CGP, which I call contacted poly pitch or CPP) and metal pitch (MP) scaling. Foundries have turned to track height scaling and design technology co-optimization (DTCO) as another scaling option. Figure 2 illustrates this scaling trend.


Figure 2. Scaling and track heights.

Author’s note: both TSMC and GLOBALFOUNDRIES have 6 track cells at 7nm.

As you scale track height, fin depopulation is required with 4 fins for 9 track cells, 3 fins for 7.5 track cells, 2 fins for 6.5 to 5.5 track cells and eventually 1 fin for 4.5 track cells. Fewer fins mean less drive current unless other improvements are made such as taller fin heights. For 1 fin cells nanosheets become very important.
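The arithmetic behind track height scaling can be sketched in a few lines. This is a rough illustration, assuming cell width scales with CPP and cell height equals track count times metal pitch; the 0.8x pitch shrinks are hypothetical numbers, not figures from the imec talk. The fin counts are the ones listed above.

```python
# Fins per cell at each track height, as listed in the text.
FINS_BY_TRACK_HEIGHT = {9.0: 4, 7.5: 3, 6.5: 2, 5.5: 2, 4.5: 1}


def relative_cell_area(cpp_scale, mp_scale, tracks_old, tracks_new):
    """Area of the new standard cell relative to the old one.

    Assumes width is proportional to CPP and height = tracks * metal pitch.
    """
    return cpp_scale * mp_scale * (tracks_new / tracks_old)


# Hypothetical node where CPP and MP each shrink by only 0.8x.
pitch_only = relative_cell_area(0.8, 0.8, tracks_old=7.5, tracks_new=7.5)
with_tracks = relative_cell_area(0.8, 0.8, tracks_old=7.5, tracks_new=6.0)

print(f"pitch scaling only:      {pitch_only:.2f}x area")
print(f"plus 7.5 -> 6.0 tracks:  {with_tracks:.2f}x area")
print(f"fins at 7.5 tracks: {FINS_BY_TRACK_HEIGHT[7.5]}, at 5.5 tracks: {FINS_BY_TRACK_HEIGHT[5.5]}")
```

With these assumed numbers, pitch scaling alone leaves the cell at about 0.64x its old area, short of the traditional 0.5x, and the track height reduction brings it to roughly 0.51x. The price is fewer fins, which is why drive current per fin becomes the next problem.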

There are many scaling boosters that are being investigated:

  • Self-aligned gate contacts – for example, Intel has used this on their 10nm technology to enable contact over gate instead of contact over isolation as is typically done. Author’s note: Intel claims this provides a 10% area improvement.
  • Single diffusion breaks reduce the cell-to-cell spacing and width. Author’s note: this has the potential to reduce cell width by 33%, but in actual designs the gain may be less.
  • Supervias – ordinary vias connect an interconnect layer to the layer directly above or below it, so interconnect layer n is connected to n+1 or n-1. Supervias skip over the adjacent layers to connect to n+2 or n-2. Author’s note: TSMC has implemented supervias on their 10nm process.
  • Buried power rails “bury” the power rail in the substrate, reducing the area taken up by interconnect.

I also interviewed Anda and she highlighted some key points from papers imec presented at IEDM.

Monte Carlo Benchmark of In0.53Ga0.47As- and Silicon-FinFETs
This paper looked at Ion/Ioff performance for InGaAs versus silicon. InGaAs has had a lot of interest as an alternative to silicon due to the much higher bulk electron mobility. In theory that should result in better performance.

What imec found is that when you consider contact resistance and traps, the advantage is much less. Also, at narrow fin widths, confinement reduces on-current significantly. One of the big challenges is traps in the gate stack. Current state-of-the-art silicon is better, but with gate stack optimization this could possibly be overcome. The bottom line is that the window for InGaAs is small and closing as we move to smaller linewidths. Author’s note: this conclusion is similar to work Morov presented at ISPSD in 2016.

Power Aware FinFET and Lateral Nanosheet FET Targeting for 3nm CMOS Technology
In this paper FinFETs were compared to nanosheets for a 3nm technology with a 42nm CPP, 21nm minimum metal pitch (MMP) and 21nm fin pitch (FP). The 21nm fin pitch was done with self-aligned-quadruple-patterning (SAQP) for both technologies and a 5.5 track height was used with 2 fins.

This new methodology is used to analyze device performance versus device type and parameters. Generally, power requirements are harder to achieve than performance.

When scaling down to a 5nm technology, if you achieve a 40% power savings, a 35% speed increase comes for free. A further 5% power savings requires significant process complexity. Furthermore, as you scale below 5nm parasitics don’t scale well.

At 3nm you can still find a FinFET solution, but it requires high stress, low contact resistance, air gap spacers and other enhancements such as a SiGe PMOS channel. Every performance element is pushed to the extreme. Nanosheets can also meet the requirements at 3nm and, under equivalent stress and doping, can match FinFET performance and relax some parameters. Another interesting property of nanosheets is that by varying the width of the sheets, density and performance can be traded off. This is easier to do within a design than varying fin heights.

26nm wide nanosheets with 2 sheets can provide sufficient drive current but need high stress levels induced using strain relaxed buffers (SRB). There is some question as to whether the SRB stress will propagate all the way up the nanosheet stack. Nanosheets currently look very promising but there is a lot of work still to be done.
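The width trade-off is easy to see with the usual geometric estimate of effective channel width for a gate-all-around sheet, where the gate wraps the full perimeter. This is a back-of-the-envelope sketch, not the paper's model: the 5nm sheet thickness is an assumption I have picked for illustration, and only the 26nm width and the 2-sheet stack come from the text.

```python
def nanosheet_effective_width_nm(sheet_width_nm, sheet_thickness_nm, num_sheets):
    """Gated perimeter of a stack of gate-all-around sheets.

    Each sheet contributes top + bottom + two sidewalls = 2 * (width + thickness).
    """
    return num_sheets * 2 * (sheet_width_nm + sheet_thickness_nm)


ASSUMED_THICKNESS_NM = 5   # hypothetical value, not from the paper

for width_nm in (15, 26, 40):
    weff = nanosheet_effective_width_nm(width_nm, ASSUMED_THICKNESS_NM, num_sheets=2)
    print(f"{width_nm}nm wide, 2 sheets -> effective width {weff}nm")
```

Wider sheets buy more effective width, and so more drive current, at the cost of footprint. A designer can set that width per device by layout, whereas fin height is fixed by the process.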

Stacked nanosheet fork architecture for SRAM design and device co-optimization toward 3nm
This paper presented a novel implementation of nanosheets where the gate is left off of one side, providing space for more effective width. It is like turning a FinFET on its side and provides improved mismatch and power. Cutting the gate off one end does impact electrostatic control, reducing Ion by a few percent to ~15% at the same Ioff, but that can be more than offset by a wider sheet. Figure 3 illustrates the nanosheet fork versus standard nanosheets and FinFETs.

Figure 3. Fork nanosheets.

Ion and threshold mismatch are better than a FinFET or standard gate-all-around (GAA) at the same footprint and 2 nanosheets with a fork design can provide equivalent performance to 3 sheets in a standard GAA configuration. An SRAM in a fork sheet can be 20% smaller and have 2x the pull down of a standard GAA configuration.

Comprehensive study of Ga Activation in Si, SiGe and Ge with 5 × 10⁻¹⁰ Ω·cm² Contact Resistivity Achieved on Ga doped Ge using Nanosecond Laser Activation
In this paper imec combined gallium and boron implants with laser annealing to lower PMOS contact resistance. Boron and gallium each have their own activation, so you can achieve more total active dopants than with either dopant alone. The net result is a contact resistivity of ~5E-10 Ω·cm², which meets the requirements for 3nm. Contact resistance is a major parasitic element in leading edge technologies and is an area that has needed more attention.
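To see why a number like 5E-10 Ω·cm² matters, it helps to convert contact resistivity into the resistance of a single contact. The sketch below uses the simplest lumped relation, resistance = resistivity / area, and ignores current crowding and other real-world effects. The 10nm x 10nm contact size and the 1E-9 comparison value are hypothetical, chosen only to make the scaling visible.

```python
NM_TO_CM = 1e-7


def contact_resistance_ohms(rho_c_ohm_cm2, length_nm, width_nm):
    """Lumped contact resistance: specific contact resistivity divided by area."""
    area_cm2 = (length_nm * NM_TO_CM) * (width_nm * NM_TO_CM)
    return rho_c_ohm_cm2 / area_cm2


# Hypothetical 10nm x 10nm contact.
for rho_c in (1e-9, 5e-10):
    r = contact_resistance_ohms(rho_c, length_nm=10, width_nm=10)
    print(f"rho_c = {rho_c:.0e} ohm.cm2 -> about {r:.0f} ohms per contact")
```

Under these assumptions, halving the contact resistivity halves the resistance of every contact, and because the area shrinks with each node, the same resistivity gets more expensive every generation.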

Conclusion
The presentations from An Steegen and Anda Mocuta provide promising options to continue logic scaling beyond FinFETs well into the next decade.


How Deep Learning Works, Maybe

by Bernard Murphy on 01-04-2018 at 7:00 am

Deep learning, modeled (loosely) on the way living neurons interact, has achieved amazing success in automating recognition tasks, from recognizing images more accurately in some cases than we or even experts can, to recognizing speech and written text. The engineering behind this technology revolution continues to advance at a blistering pace, so much so that there are now bidding wars between the giants (Google, FB, Amazon, Microsoft et al) for AI experts commanding superstar paychecks.


It might seem surprising then that we don’t really have a deep understanding of how deep learning works. I’m not talking about what you might call a mechanical understanding of neural nets; that we have down pretty well and we continue to improve through more hidden layers and techniques like sharpening and pooling. We understand how layers recognize features and how together these ultimately lead to recognition of objects. But we don’t have a good understanding of how recognition evolves in training and why ultimately it works as well as it does.

On reflection, this should not be surprising. Whenever technology advances rapidly, theory lags behind and catches up only as technology advances moderate. Some might wonder why we even need theory. We need it because all sustainable major advances eventually need a solid basis of theory if they are to have predictive power. Without that power, figuring out how to build even better solutions and knowing where the limits lie would all depend on trial and error, quickly becoming prohibitively expensive and undependable. Theoretical predictions still have to be tested (and adjusted) in practice but at least you know where to start.

Naftali Tishby of the Hebrew University of Jerusalem has developed an information theory of deep learning as a contribution to this domain, which seems like a pretty reasonable place to start. He makes the point that classical information theory is concerned only with accurate communication without an understanding of the semantics of what is communicated, whereas deep learning is all about the semantics (is this a dog or not a dog?). So an effective theory for deep learning, while following somewhat similar lines to Shannon’s theory, needs to look at loss of “relevant” information rather than loss of any information.

The theory details get quite technical, but what is more immediately accessible are implications for how deep learning evolves, especially as exposed by this team’s work in studying many training experiments on a variety of networks. They mapped the current state of their information metric by planes in a network and looked at how this evolves by epoch (a complete pass through the training data; multiple passes are typically made until error rate is acceptable). Before the first epoch there is high information in the first (labeled) layer and very little in final layers.

As epochs proceed, information with respect to labelling rises rapidly by layer (fitting), until this reaches a transition. At this transition, the network has minimized error in classifying the training examples seen so far, but the interesting part happens in subsequent training. Here detection accuracy does not improve but the number of bits in their input information metric (by plane) begins to drop. Tishby calls this compression; in effect, layers in the network are starting to drop information which is not relevant to the recognition problem. Put another way, during this phase, the network is learning to generalize, ignoring features in training examples which are not relevant to the object of interest.
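The two quantities being tracked are mutual information between a layer's representation and the input, and between that representation and the label. A toy sketch can make "compression without loss of accuracy" concrete. This is my own illustration, not Tishby's experiment or code: the four-value input, the label rule and the two hand-written "representations" are all invented for the example.

```python
from math import log2


def mutual_information(joint):
    """I(A;B) in bits from a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                total += p * log2(p / (p_a[a] * p_b[b]))
    return total


def joint_table(pairs, n_a, n_b):
    """Joint distribution over (a, b) when every listed pair is equally likely."""
    table = [[0.0] * n_b for _ in range(n_a)]
    for a, b in pairs:
        table[a][b] += 1.0 / len(pairs)
    return table


def label(x):
    return x >> 1           # toy rule: the label is the high bit of the input


def keep_everything(x):
    return x                # representation before compression


def keep_relevant_bit(x):
    return x >> 1           # representation after compression


inputs = [0, 1, 2, 3]       # four equally likely inputs
results = {}
for name, rep, n_t in (("before", keep_everything, 4), ("after", keep_relevant_bit, 2)):
    i_input = mutual_information(joint_table([(x, rep(x)) for x in inputs], 4, n_t))
    i_label = mutual_information(joint_table([(rep(x), label(x)) for x in inputs], n_t, 2))
    results[name] = (i_input, i_label)
    print(f"{name}: {i_input:.1f} bits about the input, {i_label:.1f} bits about the label")
```

In this toy case the compressed representation carries half as many bits about the input but exactly as much about the label, which is the pattern described above: information irrelevant to the task is discarded while classification ability is unchanged.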

The theory promises value not only in understanding this evolution but also in quantifying how much hidden layers accelerate the compression phase and in setting bounds on accuracy, both of which are important in understanding how far this technique can be pushed and into what domains.

This is obviously not the last word on theory for deep learning (it still has to explain unsupervised learning, for example) but it seems like an interesting start. A number of other researchers of note find this work at minimum intriguing and quite possibly an important breakthrough. Others are not so sure. In any case, it is by efforts like this that deeper understanding progresses, and we can certainly use more of that in this field. You can read more on this topic in this Wired article.


Qualcomm’s New Spectra ISP and Camera Modules Enable Nextgen AR

by Patrick Moorhead on 01-03-2018 at 12:00 pm

The mobile VR and AR space has been evolving rapidly, with many different players innovating in recent months. Companies like Apple and Google have been innovating on the software and hardware front, but others have been working diligently to support some of these efforts as well. One of those companies is Qualcomm with their support of technologies like Google’s Tango platform. However, Tango has struggled to take off, with the complexity of its hardware requirements making it hard for OEMs to ship in volume.

In the past, Qualcomm had a camera module program co-branded with their ISP (image signal processor), which processes the image data from the cameras. This program was designed to make it easier for Qualcomm’s customers, the smartphone OEMs, to quickly and reliably implement dual camera setups with a wide angle and telescopic zoom camera. That was what the market needed in the past, but now the market needs new capabilities which Qualcomm’s new Spectra Module Program and ISP will support.

The new Spectra Camera modules include an Iris Authentication Module that has latency as low as 40ms and features an Omnivision 1080P IR sensor for high resolution iris image capture. The real focus with these new camera modules, however, comes from the computer vision capabilities which include both passive and active depth sensing capabilities. The entry-level solution for value-tier devices will feature two cameras and passively calculate depth, allowing for more coarse measuring and lower cost. The high-end active depth sensing camera solution will feature three cameras including an IR emitter and IR camera for high resolution depth sensing at distances up to 4 meters and 0.1 mm accuracy.
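The article does not say how the two-camera module computes depth, but the textbook relation for a passive stereo pair shows why that approach is described as coarser. The sketch below uses the standard pinhole model, depth = focal length x baseline / disparity. The focal length and baseline are hypothetical values, not Qualcomm specifications.

```python
def stereo_depth_m(focal_px, baseline_m, disparity_px):
    """Depth of a point from the pixel shift between two cameras (pinhole model)."""
    return focal_px * baseline_m / disparity_px


def depth_error_m(focal_px, baseline_m, depth_m, disparity_error_px=1.0):
    """Approximate depth uncertainty for a given disparity matching error."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


FOCAL_PX = 1400.0     # assumed focal length in pixels
BASELINE_M = 0.02     # assumed 20mm spacing between the two cameras

print(f"7px disparity -> {stereo_depth_m(FOCAL_PX, BASELINE_M, 7):.1f} m")
for depth in (0.5, 4.0):
    err = depth_error_m(FOCAL_PX, BASELINE_M, depth)
    print(f"one-pixel matching error at {depth} m -> about {err:.3f} m depth error")
```

Because the error grows with the square of the distance, a small-baseline passive pair loses resolution quickly at range, which is one reason a high-end module would add an IR emitter and IR camera.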

The Spectra Camera modules and some of their capabilities will be supported by the newest generation of Spectra ISP inside of the next generation Snapdragon SoC. This new ISP will support features like multi-frame noise reduction in hardware, which is like what Google already does in software with their HDR+ algorithm, and makes it available to anyone that uses their ISP. Qualcomm’s new Spectra ISP also supports motion compensated temporal filtering (MCTF) and accelerated EIS (electronic image stabilization), which clean up the noise in low-light video and help to sharpen the image as well, making low light video quality significantly better. Last but certainly not least is the implementation and support for 6-DoF and SLAM with 16ms motion to photon latency for inside-out tracking at room scale and collision avoidance. This, paired with the new Spectra Camera module, will enable highly precise AR solutions that can rely on the sub-mm accuracy point cloud generated by the Spectra Camera module and Spectra ISP inside of the Snapdragon.
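The basic idea behind multi-frame noise reduction is that averaging several captures of the same scene cuts random noise by roughly the square root of the frame count. The sketch below shows only that principle. It is not the Spectra ISP's method or Google's HDR+ algorithm: it assumes a perfectly static, already aligned scene, and the scene values and noise level are invented.

```python
import random
import statistics


def noisy_frame(scene, sigma, rng):
    """One capture of the scene with independent Gaussian noise on each pixel."""
    return [pixel + rng.gauss(0.0, sigma) for pixel in scene]


def average_frames(frames):
    """Pixel-by-pixel mean across frames (assumes the frames are aligned)."""
    return [sum(pixels) / len(pixels) for pixels in zip(*frames)]


def noise_std(frame, scene):
    """Standard deviation of the difference between a frame and the true scene."""
    return statistics.pstdev([f - s for f, s in zip(frame, scene)])


rng = random.Random(0)                 # fixed seed so the run is repeatable
scene = [100.0] * 2000                 # hypothetical flat grey patch
single = noisy_frame(scene, sigma=10.0, rng=rng)
merged = average_frames([noisy_frame(scene, sigma=10.0, rng=rng) for _ in range(8)])

single_noise = noise_std(single, scene)
merged_noise = noise_std(merged, scene)
print(f"single frame noise: {single_noise:.2f}")
print(f"8-frame average:    {merged_noise:.2f}")
```

Real pipelines have to align frames and reject moving content before merging, which is where motion compensation comes in and why doing this in hardware is attractive.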

Qualcomm’s new camera modules, especially their active depth sensing camera module, seem like no-brainers for Android OEMs to adopt to compete with what many expect that Apple will implement in the new iPhone 8. Qualcomm’s new ISP features are also sure to elevate the camera experience for Android smartphone users and can help raise the tide that lifts all boats. Right now, the more integrated and simple an AR solution, the more likely OEMs and developers are to pick it up and build towards it. I believe that this new camera module helps to bring the Android ecosystem closer to where we expect Apple will be with the iPhone 8.