
Innovation in IoT

by Bernard Murphy on 07-24-2018 at 7:00 am

There is some interesting work reported this month in the Communications of the ACM on novel sensing, multi-purpose uses for existing sensors, and new ideas in agricultural IoT. The article opens with a method called HitchHike that uses backscatter techniques for communication; I confess this doesn’t interest me so much, so I won’t spend any time on it (my article, my rules).

VitalRadio is a novel approach from MIT to monitor breathing and heart rate without the need for any wearable device. Your smart home can check up on you, even while you’re sleeping. I think that’s very cool. Having to wear or carry a device can be problematic for very young or very old people (e.g. forgetting to wear the thing); devices can be uncomfortable (if you’ve ever used a medically-approved sleep monitor, you’ll know what I mean); and you can easily forget to charge them, not such a problem for a fitness monitor but more problematic for a medically-required device.

VitalRadio monitors your breathing and heart rate through phase variations in radio reflection. It is able to distinguish reflections from different objects through a method borrowed from radar detection; using this it can separate multiple mostly-stationary subjects, separated by at least 1-2m, and can easily extract phase variations due to inhalation and exhalation, per subject. Heartbeats appear as a smaller modulation on top of this extracted wave. The method has limits of course – it can’t distinguish pets from humans and it doesn’t work so well when a subject is moving.
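The extraction step is easy to illustrate: breathing shows up as a large, slow oscillation in the reflected signal’s phase, and the heartbeat as a small, faster ripple on top, so spectral peaks in two physiological bands recover both rates. Here is a minimal sketch on synthetic data (not MIT’s implementation; the band limits, rates and noise level are illustrative):

```python
import numpy as np

def estimate_vitals(phase, fs):
    """Estimate (breaths/min, beats/min) from a reflected-signal phase trace
    by locating spectral peaks in physiological frequency bands."""
    spectrum = np.abs(np.fft.rfft(phase - phase.mean()))
    freqs = np.fft.rfftfreq(len(phase), d=1.0 / fs)

    def peak_in(lo, hi):
        band = (freqs >= lo) & (freqs <= hi)
        return freqs[band][np.argmax(spectrum[band])] * 60.0

    breathing_bpm = peak_in(0.1, 0.5)   # ~6-30 breaths per minute
    heart_bpm = peak_in(0.8, 2.0)       # ~48-120 beats per minute
    return breathing_bpm, heart_bpm

# Synthetic phase trace: a large breathing oscillation plus a small
# heartbeat ripple, as the article describes.
rng = np.random.default_rng(0)
fs = 30.0                               # samples per second
t = np.arange(0, 60, 1.0 / fs)          # one minute of data
phase = (1.0 * np.sin(2 * np.pi * 0.25 * t)      # 15 breaths/min
         + 0.05 * np.sin(2 * np.pi * 1.1 * t)    # 66 beats/min
         + 0.01 * rng.standard_normal(len(t)))

breathing, heart = estimate_vitals(phase, fs)
```

The radar-style range gating that separates multiple subjects is the hard part the paper actually contributes; this sketch only shows why the per-subject phase signal is so easy to read once isolated.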

Another idea called Caraoke builds on the existing e-toll transponder in your car. These are already widely-used for access to toll roads, some bridges and express lanes, and now there is interest in using them to pay at fast-food drive-throughs and parking garages. The Caraoke folks see potential in readers being deployed more widely, to track cars at intersections for adaptive traffic light control, on street lights to track speeders, for automatic parking billing for street parking and more.

A challenge in this approach is that existing transponders are apparently quite simple, assuming only one communication at a time (hence the directional antennae); they have no MAC protocol to manage multiple potential requests. An obvious solution would be to replace all of this infrastructure with more sophisticated comms, but that would be expensive. Caraoke provides an easier path. It allows continuing use of existing transponders (no need for car owners to replace these) while requiring that the readers be upgraded to more intelligently handle potential collisions and extract more from the information they gather – counting cars, localizing them and estimating speed.
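The collision problem itself is easy to illustrate. Caraoke’s actual decoding is more sophisticated (it extracts information from the collisions themselves), but a toy slotted-ALOHA-style simulation shows why a protocol-free channel fails with more than one car in range, and how randomizing replies over slots lets a smarter reader recover most of them (the slot counts and function names here are illustrative assumptions, not Caraoke’s design):

```python
import random

def decodable_replies(n_transponders, n_slots, trials=2000, seed=1):
    """Fraction of transponders a reader can decode per query round when each
    transponder answers in a uniformly random slot. A slot is decodable only
    if exactly one transponder picked it (slotted-ALOHA style)."""
    rng = random.Random(seed)
    decoded = 0
    for _ in range(trials):
        slots = [0] * n_slots
        for _ in range(n_transponders):
            slots[rng.randrange(n_slots)] += 1
        decoded += sum(1 for s in slots if s == 1)
    return decoded / (trials * n_transponders)

# With no protocol (one shared slot), two cars always collide:
no_protocol = decodable_replies(2, 1)     # 0.0
# Randomizing over 16 slots recovers most replies:
with_slots = decodable_replies(2, 16)     # expected fraction (15/16) ~ 0.94
```

The expected decodable fraction is (1 − 1/k)^(n−1) for n transponders and k slots, which is why even modest randomization at the reader side goes a long way.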

Clever idea. Upgrading everything to much more effective LTE or Wi-Fi communication would of course solve the technical problems and allow for all kinds of monitoring and services. But in the real world, technology alone is often not enough. Practical solutions have to bow to economics and latencies in changing infrastructure. Solutions that can build on existing tech have a definite appeal.

Finally, FarmBeats introduces a system for farm-wide data-driven optimization. IoT use in agriculture is not a new idea but this seems to be a more integrated solution than I have seen and certainly has some serious backing (Microsoft, MIT, UW, Purdue, UCSD and 6-month deployments in two active farms). The system gathers data from fixed cameras and drones, soil sensors and temperature and humidity monitoring in food storage and animal shelters.

A big part of FarmBeats seems to be ensuring reliable communications between IoT outposts and a central gateway at the farm. Some of this is intelligently duty-cycling components of the base station to allow for cloudy weather (yeah, real clouds). Farmers are increasingly using solar power, which is obviously a variable resource under these conditions. So FarmBeats builds weather forecasting into planning when power demand can be accommodated from different parts of the system.
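The planning idea can be sketched with a simple battery-walk over an hourly solar forecast: schedule a power-hungry task (say, a camera upload burst) only in hours where running it still leaves a charge reserve. This is my own illustrative model, not the FarmBeats scheduler; all numbers and the 20% reserve rule are assumptions:

```python
def plan_duty_cycle(solar_forecast_wh, base_load_wh, burst_load_wh, battery_wh):
    """Pick the hours in which a high-power task can run, walking a simple
    battery model across an hourly solar harvest forecast (Wh per hour).
    Returns (scheduled_hours, final_battery_charge_wh)."""
    charge = battery_wh / 2          # assume the battery starts half full
    scheduled = []
    for hour, harvest in enumerate(solar_forecast_wh):
        # Harvest (capped at capacity), then pay the always-on base load.
        charge = min(battery_wh, charge + harvest) - base_load_wh
        # Run the burst only if a 20% charge reserve survives it.
        if charge - burst_load_wh >= 0.2 * battery_wh:
            charge -= burst_load_wh
            scheduled.append(hour)
    return scheduled, charge

# A cloudy morning, sunny afternoon forecast (Wh harvested per hour):
forecast = [0, 0, 1, 2, 8, 10, 9, 4]
hours, charge = plan_duty_cycle(forecast, base_load_wh=1, burst_load_wh=5,
                                battery_wh=20)
```

Feeding a weather forecast into `solar_forecast_wh` is the whole trick: the schedule shifts the heavy work into the sunny hours without ever draining the battery below its reserve.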

It also controls flight paths for UAVs based on wind patterns (which can vary widely over a farm), which makes the batteries last longer and requires less frequent maintenance. They have some nice pictures integrating camera images of a piece of farmland with soil moisture sensing, pH sensing and ground temperature sensing (based on a surprisingly sparse set of sensors). Each piece is perhaps commonplace individually, but integrated together at this level, this starts to look like a total solution for a farmer.


Keeping Pace With 5nm Heartbeat

by Alex Tan on 07-23-2018 at 12:00 pm

A Phase-Locked Loop (PLL) gives a design its heartbeat. Despite its minute footprint, it serves many purposes: as part of the clock generation circuits, as an on-chip digital temperature sensor, for process control monitoring in the scribe line, or as baseline circuitry to facilitate effective measurement of the design’s power delivery network (PDN).


FDSOI Status and Roadmap

by Scotten Jones on 07-23-2018 at 7:00 am

FDSOI is gaining traction in the marketplace. At their foundry forum in May, Samsung announced they have 17 FDSOI products in high-volume manufacturing (you can read Tom Dillinger’s write-up of the Samsung Foundry Forum here). At SEMICON West in July, GLOBALFOUNDRIES (GF) announced FDSOI design wins worth $2 billion in revenue, with $1 billion booked in 2017 and another $1 billion booked in the first half of 2018. With the emergence of FDSOI I thought it would be useful to review who the players are in the market and what their current and planned processes look like (I recently did a similar leading-edge analysis for FinFETs, available here).

FDSOI Ecosystem
The FDSOI Ecosystem is illustrated in figure 1.

CEA Leti has served as the key research group in the development of FDSOI working with ST Micro on 28nm and 14nm processes and working with GF on 22nm and 12nm processes.

FDSOI requires engineered substrates with very thin single-crystal silicon layers on buried insulator layers to ensure the channel region is fully depleted. The primary supplier of FDSOI substrates is Soitec, with SEH as a second source (I have written more on Soitec and their FDSOI substrates here and have another article on Soitec due to be published shortly).

The companies producing FDSOI processes are ST Micro as an IDM with 28nm in production, Samsung foundry with 28nm in production and 18nm planned, and GF foundry with 22nm in production and 12nm planned.

Figure 1. FDSOI Ecosystem

ST Micro
ST Micro introduced 28nm FDSOI in 2012; it is produced in their Crolles II 300mm wafer fab. The 28nm FDSOI process offers a 32% to 84% improvement in performance over ST Micro’s 28nm bulk process. ST Micro also developed a 14nm process with CEA Leti, but it is not in production. ST Micro has reportedly begun working with GF on GF’s 22FDX FDSOI process, so long term ST Micro may not continue to produce their own FDSOI and may move to a fabless model for this technology. Crolles II is a relatively low-capacity 300mm fab and ST Micro makes other products there, so FDSOI volumes are likely not large.

Samsung
Samsung licensed ST Micro’s 28nm FDSOI process and used it to create Samsung’s 28FDS process. 28FDS entered production in 2015 and is producing 17 high volume products as previously mentioned. An 18nm follow-on process is in development and due next year.

28FDS provides fmax >400GHz for RF applications, embedded MRAM nonvolatile memory and is automotive qualified. 28FDS has a 1.0 volt Vdd.

18FDS is planned for 2019. It features a back end taken from Samsung’s mature 14nm FinFET technology and provides a 35% area reduction from 28FDS, along with a 22% performance improvement and a 37% power reduction. The Vdd for 18FDS is 0.8 volts.

Samsung has significant foundry capacity and can ramp FDSOI to very high volumes as needed.

GLOBALFOUNDRIES (GF)
GF’s 22FDX process entered production in 2017 and offers a 400GHz fmax, embedded MRAM nonvolatile memory and automotive qualification. 22FDX can operate down to 0.4 volts for low-power applications. There are four versions available, offering low power, high performance, low leakage, or RF & analog. 22FDX is based on ST Micro’s 14nm process for the front end, while the back end is optimized for cost with 2 double-patterned layers and the balance of the layers single patterned.

A follow-on 12FDX process was originally due in 2019, but GF is holding off introducing it because customers are just now designing and ramping up products on 22FDX. Development of 12FDX is proceeding well and it will be introduced when needed; we estimate this will be around 2020. 12FDX will offer a 20% performance improvement over 22FDX.

GF is producing 22FDX in their Dresden fab and has significant capacity in place. A fab being brought up in China will also become a source for FDSOI capacity in the future.

Comparison
Figure 2 compares the process density metrics for GF, Samsung and ST Micro. In terms of the current FDSOI offerings, GF’s 22FDX is the clear leader in density and also offers the lowest operating voltage. Samsung’s planned 18FDS process will likely be slightly denser than GF’s current 22FDX process, but GF’s planned 12FDX process will once again establish GF as the clear FDSOI density leader.

Figure 2. FDSOI process comparison

One thing I have a hard time understanding is why Samsung isn’t more aggressive on operating voltage. Dynamic power consumption is proportional to the operating voltage squared, and FDSOI is targeted at many low-power applications. GF has a clear lead in low-power operation with their 0.4 volt Vdd.
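The squared dependence here comes from the standard dynamic-power relation P ≈ C·Vdd²·f: at fixed capacitance and frequency, halving Vdd quarters dynamic power. A two-line check using the two Vdd figures quoted in this article:

```python
def dynamic_power_ratio(vdd_a, vdd_b):
    """Dynamic CMOS power scales as C * Vdd^2 * f, so at equal capacitance
    and clock frequency the ratio between two supplies is (Vdd_a / Vdd_b)^2."""
    return (vdd_a / vdd_b) ** 2

# Samsung's 0.8 V Vdd (18FDS) vs. GF's 0.4 V low-power corner (22FDX):
ratio = dynamic_power_ratio(0.8, 0.4)   # 4.0 -> 4x the dynamic power at 0.8 V
```

This ignores leakage and the frequency loss at low voltage, but it shows why the 0.4 V operating point matters so much for the battery-powered applications FDSOI targets.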

Discussion
FDSOI is being positioned as a lower-cost alternative to FinFETs for IOT, automotive and mobile applications. The specific FDSOI process choices in terms of density and number of interconnect layers position them to be less expensive than the denser FinFET processes. FinFET processes are also typically not well suited to analog and RF applications. We believe that at the same node and number of metal layers, FinFET and FDSOI processes are similar in cost, but once again the FDSOI processes are positioned differently. For example, GF offers 22FDX with 8 metal layers as a lower-cost alternative to their 14nm FinFET process, which has 11 or more metal layers. 22FDX has a lower mask count than 14nm FinFET and a lower cost per wafer. The 14nm FinFET process is denser and better suited for large, high-performance designs, but 22FDX offers lower cost, nearly as good digital performance, and better analog and RF performance at lower power.

FDSOI also offers the unique capability for back biasing to set threshold voltages and tune performance and power consumption. Accessing the back gate for back biasing only requires a 1% area penalty while delivering a unique and useful capability not available in other processes.

FDSOI also offers lower design costs than FinFETs with 28FDS and 22FDX offering similar design costs to 28nm bulk whereas 14nm FinFET processes have design costs that are roughly 2x the design costs for 28nm bulk. 7nm FinFET design costs are expected to be even higher than 14nm design costs.

Conclusion
We believe that FDSOI is well positioned to capture market share in IOT, 5G, and automotive applications. FinFETs will continue to be the technology of choice for applications with a lot of digital logic that require the highest possible performance. After many years of development, FDSOI is poised to become a mainstream alternative.


SEMICON West – Soitec is becoming a key enabler

by Scotten Jones on 07-22-2018 at 7:00 am

A variety of growing and emerging segments of the semiconductor industry rely on Silicon-On-Insulator (SOI) wafers. Soitec is the primary source for SOI wafers, particularly on 300mm. On Tuesday at SEMICON I got to sit down with Bernard Aspar, Soitec’s Executive Vice President, Communication & Power BU, and Christophe Maleville, Soitec’s Executive Vice President, Digital Electronics BU, to discuss what is going on at Soitec.

During SEMICON, GLOBALFOUNDRIES announced they have reached $2 billion of design wins on their 22FDX FDSOI platform, which relies on wafers from Soitec, and Soitec was clearly pleased by this development.

I started the interview by asking about Soitec’s financial health. A few years ago they were struggling. Soitec built significant 300mm capacity to support IBM’s partially depleted SOI (PDSOI) business. At one time all three major game console manufacturers relied on IBM’s PDSOI process for their main processor chips. Unfortunately, the value proposition of PDSOI wasn’t very good and all three console manufacturers moved away from it, taking away the major driver of 300mm SOI. Last year at SEMICON West Soitec said they had returned to profitability, and that has continued into this year. Over the last three years Soitec has refocused on their core business and they are now growing nicely (4% and 31% the last two years); in fact, there are reports in the industry of SOI shortages.

I asked Soitec about their current capacity situation and they said the 200mm line is full and 300mm is significantly loaded. Soitec has 200mm and 300mm lines in France, a 300mm line in Singapore and a 200mm line at Simgui in China. Capacity is currently tight but Soitec is investing in expanding it. They are contracting with their customers, and contract customers aren’t suffering, but if someone new wants large capacity it would take time. 300mm has grown 2x in the last two years. Soitec has >1.5 million wafers per year of 300mm shell capacity that is >50% equipped. Singapore will ramp up 300mm to follow market demand. 200mm capacity is >1 million wafers per year.

RFSOI
IBM’s Burlington fab is the leader in RF SOI for antenna tuning and switching in the front end of cell phones, replacing more expensive GaAs solutions. With the acquisition of IBM’s semiconductor operations by GLOBALFOUNDRIES, more attention is being paid to this business. Each generation of cell phones (2G, 3G and 4G) required a new and more complex front end while maintaining the front-end module from the previous generation for backwards compatibility. With 5G on the horizon, even more complex 5G front-end modules will be added to phones alongside the 2G, 3G and 4G modules, see figure 1.

Figure 1. Mobile phone front ends.

GLOBALFOUNDRIES has recently repurposed their 300mm East Fishkill fab for RF and Silicon Photonics and is also introducing RF into their 300mm Singapore Fab. This provides Soitec with a nice long-term growth driver for RF SOI wafers in 200mm and 300mm.

FDSOI
The recent design wins announcement from GLOBALFOUNDRIES is another example of the growing acceptance of FDSOI. Where PDSOI married an expensive SOI substrate with a complex process and provided only moderate performance improvements, FDSOI offers a greatly simplified process to offset the expensive starting substrate. FDSOI also combines good logic density and performance with low power and excellent RF and analog performance. Both Samsung and GLOBALFOUNDRIES have FDSOI foundry processes currently available and next generation processes in the works (I will be writing more about this shortly). 5G and Internet Of Things (IOT) are two emerging applications where FDSOI is expected to be very successful. FDSOI is made on SOI wafers with very thin silicon layers of ~6nm on ~20nm buried oxide layers. Soitec has significant intellectual property in producing these demanding specifications with the required uniformity.

Silicon Photonics
Silicon Photonics has been in development for many years and is starting to gain traction. The need for very high speed/energy efficient data transport in datacenters is driving an effort to move optical interconnect down to the blade level creating significant unit volume. This is an emerging opportunity for Silicon Photonics (I will be writing more about Silicon Photonics in the near future). 5G is another emerging application for silicon photonics. Silicon Photonics processes are fabricated on SOI wafers with ~500nm silicon layers on 1 to 3 micron buried oxides.

Piezo Layers
Soitec’s expertise in wafer bonding and thinning is also being applied to creating thin piezoelectric material layers on insulator (POI) for filter applications. The emerging 5G standard has tight signal specifications that need well controlled stable filters providing another application for Soitec’s core expertise.

Other Applications
Soitec’s basic toolbox of bonding and controlled layer thickness can also be applied to thicker SOI layers for power electronics. Bonding of novel III-V materials is another area where Soitec can apply their expertise to produce novel engineered substrates. For example, they can produce thin InGaAs layers on sapphire for micro-display applications.

Figure 2 illustrates the different Soitec substrates to support the applications discussed above and figure 3 illustrates how the various substrate types support 5G.

Figure 2. Soitec substrate options.

Figure 3. Soitec engineered substrates 5G usage.

Soitec – Leti Partnership
During SEMICON West Soitec and Leti announced the creation of a Substrate Innovation Center at Leti. Leti is a long-time pioneer in the development of SOI. The new center will bring together Soitec and Leti’s expertise with equipment vendors to drive further innovation with a prototyping line that can explore new processes that can’t be developed on active production lines.

Cost
Cost was at one time a big stumbling block for SOI adoption, and I asked if SOI prices are still coming down. They replied that price has come down to the point where it is no longer a blocker for foundry adoption. New products come in at a price in line with the value added. The standard 15 years ago was a GaAs front end for mobile phones, but then RF SOI provided a 2x cost reduction. FDSOI substrates are more expensive but enable simpler processes and solutions, so they are cost competitive. SOI price is not the severe pressure point and blocker it once was.

Conclusion
Where Soitec was at one time reliant on PDSOI going into game machines, the company now has a portfolio of products addressing automotive, wearables, mobile and cloud. With Soitec’s engineered substrates becoming a key enabler in these segments, Soitec is well positioned for sustainable growth.


Maximize Bandwidth in your Massively Parallel AI SoCs?

by Daniel Nenni on 07-20-2018 at 12:00 pm

Artificial Intelligence is one of the most talked about topics on the conference circuit this year and I don’t expect that to change anytime soon. AI is also one of the trending topics on SemiWiki with organic search bringing us a wealth of new viewers. You may also have noticed that AI is a hot topic for webinars like the one I am writing about now.

We have been working with NetSpeed for 3 years now and have published blogs covering a wide range of topics. You can see their landing page here. NetSpeed has done some of the best and most widely viewed webinars that we have been involved with and I expect this one will be the same.

How do you maximize bandwidth in your massively parallel AI SoCs?

Tue, Jul 24, 2018 8:30 AM – 9:00 AM PDT

When designing an SoC for AI applications, you are faced with a system using thousands of cores in a massively parallel architecture. Performance, bandwidth and quality of service (QoS) are critical requirements, and the challenges of meeting them are very different for SoCs used for AI. This 30-minute webinar lays out the challenges and the solutions that have empowered multiple leaders in the AI space.

John Bainbridge, Principal Application Architect at NetSpeed Systems, will be presenting. Before joining NetSpeed, John worked for Qualcomm on the Snapdragon chips, so John knows SoCs, absolutely.

I have an advance copy of the slides and they are definitely worth a look. Here is a quick outline:

Breaking down the AI workflow:

  • How it happens
  • What matters
  • Critical use cases
  • SoC Data Flow

Architectural Challenges:

  • Large number of cores
  • Extremely high bandwidth
  • Peer-peer traffic and multicast
  • Sophisticated QoS

Bottom line: Traditional approaches are inadequate for AI SoCs.

John then goes into the NetSpeed approach, technology, and QoS support. This is why I like webinars: you get to hear from and interact with the experts. Not as good as live but definitely the next best thing. Register even if you can’t make the live event, so you automatically get a link to the replay. I hope to see you there!

About NetSpeed
NetSpeed Systems provides scalable, coherent on-chip network IPs to SoC designers for a wide range of markets from mobile to high-performance computing and networking. NetSpeed’s on-chip network platform delivers significant time-to-market advantages through a system-level approach, a high level of user-driven automation and state-of-the-art algorithms. NetSpeed Systems was founded in 2011 and is led by seasoned executives from the semiconductor and networking industries. The company is funded by top-tier investors from Silicon Valley. It is based in San Jose, California and has additional research and development facilities in Asia. For more information, visit www.netspeedsystems.com.


TI Patent Priorities

by Daniel Nenni on 07-20-2018 at 7:00 am

This is the seventh in the series of “20 Questions with Wally Rhines”

Probably the most innovative person I met at Texas Instruments, other than Jack Kilby, was Ken Bean. Ken had a list of patents that would impress even the most skeptical. He started his career at Eagle Picher and came to TI in the mid 1960s. He was a warm, delightful and modest person but very innovative when it came to finding solutions for silicon manufacturing problems. He worked in Semiconductor Group Product Divisions as well as research labs over his TI career, as did Mike Cochran, a topic that I’ll address later.

Ken Bean almost never saw a semiconductor manufacturing problem that he couldn’t solve. When TI had problems introducing the “thermal printer” used in the “Silent 700”, Ken had a solution that made the silicon print heads manufacturable. One of the most innovative patents Ken filed was the patent on the slicing of silicon wafers. Easy, don’t you think? No. Ken addressed a problem with DUF (diffusion under film) in bipolar integrated circuits. “Pattern shift” was a problem that occurred because early bipolar integrated circuits used wafers that were oriented exactly to the crystal planes. As a result, subsequent layers of deposition “shifted” modestly as the epitaxial layer grew in the direction of the crystal orientation, causing a shift in the alignment of subsequent photomasks. Not a problem for Ken. He was called in to solve the problem and he did. Why not slice the wafers a few degrees off the perfect orientation? Then the DUF layer wouldn’t follow the crystalline orientation. It worked. Subsequently, wafers for bipolar integrated circuits were sliced slightly away from perfect orientation.

In the early 1970s Monsanto decided to get out of the semiconductor wafer business and showed up at TI with a list of patents for which they hoped to claim royalties (since TI still manufactured its own polysilicon and silicon wafers). After Monsanto showed their patents, TI lawyers passed Ken Bean’s patent to them, showing why wafers used for bipolar semiconductors are sliced a few degrees away from the perfect orientation (https://patents.google.com/patent/US3379584A/en). The story goes that the Monsanto lawyers looked at the patent and closed their briefcases. That was the last the TI lawyers saw of them. It was truly a fundamental patent in the early days of semiconductor history. I loved my interaction with Ken and he loved our family. Until his death, he stayed in touch, kept our Christmas cards on his refrigerator and delighted in the success that TI ultimately achieved.

One of the things that Ken taught me was the importance of customer interaction in the innovation process. Ken had assignments in the Semiconductor Group and in the Central Research Labs as well as the Semiconductor Research and Development Lab. Interestingly, he generated patents at approximately the same rate per year regardless of where he was working. The same was true of Mike Cochran, who worked in a variety of organizations in TI, including both semiconductor product groups and research laboratories (and is partially responsible for the Cochran-Boone patents on the microprocessor). I decided to analyze the patent productivity of the truly great patent generators like Ken and Mike. Fortunately, TI had a system that helped me. After the TI DRAM lawsuits, TI management decided that patents were a very important source of royalty revenue, much to the dismay of many TI engineers who had been taught that patents should only be used defensively, to allow TI to enter new markets. So TI created a special segment of the annual performance review process that rewarded the creators of the most valuable patents. Those lawyers who negotiated the patent cross licenses voted on the most valuable patents. The result: I now had a list of the most “valuable” patents.

The result of the analysis amazed me although I wasn’t allowed to publish the results. But the conclusion was clear. People like Ken Bean and Mike Cochran generated about the same number of patents per year. But the ones that they generated when they were in product groups turned out to be much more valuable than those they generated when they worked in research organizations. Why? I concluded that, because the patents they filed when they were in product groups were developed in response to a customer problem, they grew in value as more competitors adopted similar solutions to the same type of problems. The other patents sounded great; they just weren’t as valuable because they were generated by innovative ideas rather than customer problem solving.

The 20 Questions with Wally Rhines Series


Aprisa and Apogee – The New Avatars

by Alex Tan on 07-19-2018 at 12:00 pm

Early physical optimization impacts a design’s QoR gain and can disclose potential hurdles in dealing with unknown design variants, such as new IP inclusion or new process node issues. Along the RTL-to-GDS2 implementation continuum, a left-shift move requires robust modeling and proper context capture in order to produce meaningful outcomes.

Aside from synthesis, floorplanning, placement and routing are the three major optimization segments that largely shape the final design footprint and determine the feasibility of the design’s targeted performance. Although each has a unique set of prerequisites and its own optimization context, close alignment among them is crucial.

The floorplanning step involves applying an optimal strategy for top-level placement within the given budgeted area without incurring complications for downstream implementation steps. For example, a robust IP or macro placement that honors a uniform data flow and accommodates adequate repeater staging area or track allocation should provide a better chance of placement and route convergence, as it will prevent congestion risk and unmanageable area increase.

Similarly, during the place and route step, preserving the optimization intent achieved in earlier synthesis and driving further gate-level QoR (Quality of Results) is key to ensuring predictability in design closure convergence.

Avatar Physical Implementation Solution
Aprisa and Apogee are two physical design products from Avatar Integrated Systems (previously known as ATopTech). Aprisa is a complete P&R system which includes placement, clock tree synthesis, optimization, global routing and detailed routing. It has embedded analysis engines that correlate with foundry-approved sign-off tools, and it supports standard formats for its collaterals (Verilog, LEF/DEF, Liberty, SDC and GDS2). It has been certified for 16/14/10nm and 7nm.

Apogee is a top-down hierarchical prototyping floor planning and chip assembly tool. It enables fast analysis of design hierarchy and automates many manual tasks such as macro placement and blockage creation to ensure faster convergence to an optimal floorplan.

Both tools share a common analysis engine and database that ensures tight correlation between block- and top-level timing. Avatar’s In-Hierarchy Optimization (iHO), intended to help top-level timing closure without either traditional black-box modeling or a flattening step, is one example of the many patented technologies incorporated in both tools, specifically developed to address the increased design challenges of advanced process nodes.

Floorplanner and Placement
Apogee handles complex floorplan criteria such as rectilinear regions, multi-height cells and mixed/overlapping sites. Both channel-based and channel-less floorplans are supported. It has an automatic placement-blockage generator and a macro placer with grouping and legalization capabilities. To aid data flow analysis, the GUI has both hierarchical flyline analysis and logic-vs-layout cross-probing features. With its unified architecture and hierarchical data model, it comfortably handles multi-million-gate designs and eases hierarchical ECOs, with a continued 2x to 3x runtime improvement per major version refresh.

Aprisa’s placement technology is a timing- and congestion-driven analytical placer, which keeps track of real-time TNS and congestion overflow to automatically adjust timing and congestion parameters for an optimal QoR/runtime tradeoff. Aside from the standard cost factors (such as wire length, area, leakage power, etc.), the adaptive placement and optimization engines take into account critical dependencies such as pin accessibility. It has an automatic neighbor rule, as well as a user-controlled version, to allocate cell spacing for pin access.

Aprisa features a partition-based optimization mode that allows intelligent path clustering based on timing criticality. This method is intended to achieve more efficient thread allocation and multi-thread scaling for subsequent optimizations. Its power optimization features include switching-activity optimization, OCV-aware placement, useful-skew handling and always-on buffering/retention-cell placement. Both UPF and CPF constraints are supported for low-power-driven optimization.

MCMM Analysis, CTS and Timing Analysis
Aprisa has a native and adaptive MCMM (Multi-Corner Multi-Mode) approach that automatically groups scenarios and analyzes them in mixed sequential/multi-threaded mode, yielding an optimal balance of memory usage and run time. It supports various on-chip variation methods (AOCV, LOCV, POCV and LVF).

Aprisa’s progressive MCMM CTS handles various scenarios for complex designs: skew-group-based, slack-driven and multi-point trees; mesh and H-tree structures; power-aware clock tree optimization with useful skew; cluster-based clock trees or meshes; automatic clock-gate cloning/decloning; and more. The CTS engine balances clock trees across all modes and corners while allowing the flexibility to leverage skew-group information to further optimize the clock tree. The route-based clock tree optimization minimizes the use of buffers, automatically creates special routing constraints (such as double width/spacing/via, shielding and route layer) and matches user-specified latency targets for any pin. It is accompanied by a GUI for visualization, cross-probing and real-time intervention, such as resizing clock buffers or moving clock buffers or leaf cells to different levels of the hierarchy.

Aprisa includes a fast timing analysis engine with many advanced features. According to Avatar, it takes about 5 minutes per million instances. It supports SDC parsing, native OCV timing analysis, CRPR (Clock Reconvergence Pessimism Removal) and a timing browser.
Routing
Addressing first-order effects of SI (Signal Integrity), EM (Electromigration) and metal-fill emulation with near detail-route-level accuracy during the global route stage is key to delivering the targeted detail-route outcomes. Aprisa’s fast global route engine is rated to route millions of nets in minutes. It supports multi-threading and routes 250K instances in about 5 minutes on an 8-CPU machine. The global route includes track assignment, which facilitates delay and signal integrity assessment.

Aprisa’s detailed router is a hybrid technology that supports both gridded routing and off-grid pin routing when needed. According to Avatar, unlike routers that handle DRC as an afterthought, Aprisa handles all DRC violations during route optimization. This includes complex design rules (such as EOL spacing or extension and minimum enclosure), special routing rules (such as double spacing, shielding and double vias) and DFM-related issues (such as wire spreading and double vias).

For advanced nodes, Avatar’s router uses its own patented color-aware DPT routing technology to enable DPT-compliant routing, and it also supports a CM (Cut-Metal) routing methodology. As resistance becomes more dominant at advanced process nodes, Aprisa’s router can minimize jogging to lower via usage, and it accounts for high-resistance layer usage in pre-route RC estimation as well as in detailed routing for better timing. It automatically promotes critical nets to higher metal layers, leveraging low-resistance wires for long net connections to reduce buffer usage. Tight timing correlation is maintained throughout the process, including between pre-route and post-detail-route steps.

The resurgence of Avatar’s physical design solution has added color to the IC physical implementation landscape. An integrated solution that aligns the optimization and analysis engines while managing proper contexts through a unified data model can deliver enhanced QoR. Aprisa and Apogee seem to have demonstrated such leverage.

For further details please check here: Aprisa or Apogee.


A Last-Level Cache for SoCs

A Last-Level Cache for SoCs
by Bernard Murphy on 07-19-2018 at 7:00 am

We tend to think of cache primarily as an adjunct to processors to improve performance. Reading and writing main memory (DRAM) is very slow thanks to all the package and board impedance between chips. If you can fetch blocks of contiguous memory from the DRAM to a local on-chip memory, locality of reference in most code ensures much faster access for many subsequent operations, which will frequently find the data/addresses they need in these cached copies. This greatly improves overall performance, despite the need at times to update cache contents from a different location (and perhaps store back to DRAM whatever must be evicted from the cache).
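As a toy illustration of why locality of reference makes caching pay off, here is a minimal direct-mapped cache model; the block size and line count are illustrative, not any particular chip's:

```python
# Toy direct-mapped cache model: with sequential access, every access after
# the first in each block hits the cached copy instead of going to DRAM.
BLOCK = 64          # cache line size in bytes
NUM_LINES = 256     # number of cache lines

def hit_rate(addresses):
    lines = {}      # line index -> tag currently stored there
    hits = 0
    for addr in addresses:
        tag, line = divmod(addr // BLOCK, NUM_LINES)
        if lines.get(line) == tag:
            hits += 1
        else:
            lines[line] = tag   # miss: fetch the block, evicting the old tag
    return hits / len(addresses)

sequential = list(range(0, 64 * 1024, 4))   # walk 64KB, 4 bytes at a time
print(hit_rate(sequential))                 # 15 of every 16 accesses hit
```

With a 64-byte line and 4-byte accesses, only the first access to each block misses, so the hit rate is 15/16 even for this simplest possible cache.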

Good ideas generally get pushed harder, and so it is with cache; a hierarchy of on-chip caches further reduces the frequency of off-chip memory accesses and further increases overall performance. This extends from level-1 (L1), small, fast and really close to the CPU, through L2, all the way up to (potentially) L3, the caches at each level being larger and slower. The last level, unsurprisingly, is called the last-level cache (LLC) and is generally shared in multi-processor systems, which is why cache coherency has become a big topic. Caches are a trick to improve performance but must still maintain a common logical view of the off-chip memory space. If you work with Arteris IP, you’ll use their Ncore cache coherent interconnect for communication between IP in the coherent domain to manage that coherency. Ncore also provides proxy caches to synchronize IP in the non-coherent domain with the coherent domain; I wrote about this earlier.

However, a lot of logic in an SoC does not sit in the coherent domain; after all, there’s more to an SoC than the CPUs. There’s a human interface (perhaps graphics, audio, voice control), communications, external interfaces, security management, accelerators and sensor interfaces. At least some of these components also need to access memory extensively, so they too can benefit from cache support. This is the need Arteris IP’s CodaCache aims to address – as a cache for an individual IP in the non-coherent world, as an LLC for the non-coherent system as a whole, or both.

Let’s address an obvious question first: these caches are operating in a non-coherent domain, so how do you avoid coherency problems without syncing back into the coherent domain? No magic here – in the same ways you avoid such problems in any context. Address map separation is one choice; each IP writes to and reads from its own address space and there are no overlaps between those spaces.
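A minimal sketch of that address-map-separation idea as a check you might run on a memory map; the IP names and address windows here are purely illustrative:

```python
# Hypothetical check that non-coherent IPs use disjoint address windows,
# so their cached copies can never alias each other. The names and ranges
# are made up for illustration, not a real SoC memory map.
ip_windows = {
    "gpu":    (0x8000_0000, 0x8FFF_FFFF),
    "dsp":    (0x9000_0000, 0x93FF_FFFF),
    "camera": (0x9400_0000, 0x95FF_FFFF),
}

def windows_disjoint(windows):
    spans = sorted(windows.values())
    # each window must end before the next one begins
    return all(prev_hi < lo for (_, prev_hi), (lo, _) in zip(spans, spans[1:]))

assert windows_disjoint(ip_windows)   # no overlap -> no coherency hazard
```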

JP Loison, Senior Corporate Application Architect at Arteris IP, told me that the cache is very configurable. It can be used purely as a cache of course, or part of it can be configured (even at runtime) to be used as a scratchpad, or the whole thing can be used as a scratchpad. This is a handy feature for those targeting multiple markets with one device, e.g. low-cost IoT not needing external memory but where you do need fast on-board memory, all the way up to high-performance image processing where you need all the performance advantage of caching. Interestingly, while the cache can sit on the Arteris IP FlexNoC (non-coherent) bus fabric, it doesn’t have to. You can connect it directly to any AXI bus and use it independently of other Arteris IP products.

Another clever thing JP mentioned you could do with CodaCache is partition the cache to alleviate congestion. Rather than having, say, one big 4MB cache block tying up routing resources around that block, you can split the cache into multiple sub-blocks, say 1MB each, which can settle around the floorplan, spreading routing demand more evenly.

JP also mentioned support for what he called “way partitioning”, a method to reserve cache ways for specific IDs, giving them higher priority and therefore higher performance than other accesses. For example, one ID could reserve ways 6 and 7 in the cache for high-priority real-time tasks, another could reserve way 5 for medium-priority tasks, and all other IDs would have to fight it out over the remaining ways. That’s a pretty detailed level of configurability.
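A hedged sketch of how such way reservation might work in an 8-way set-associative cache; the ID names, way assignments and LRU victim policy are illustrative, not CodaCache's actual interface:

```python
# Sketch of "way partitioning": each requester ID may only allocate into
# its reserved subset of ways, so high-priority traffic keeps its lines.
WAYS = 8
way_map = {
    "realtime": {6, 7},            # high-priority ID reserves ways 6 and 7
    "medium":   {5},               # medium-priority ID reserves way 5
    "default":  {0, 1, 2, 3, 4},   # everyone else shares the rest
}

def allowed_ways(requester_id):
    return way_map.get(requester_id, way_map["default"])

def choose_victim(set_state, requester_id):
    """Pick the least-recently-used way among the requester's reserved ways."""
    candidates = allowed_ways(requester_id)
    return min(candidates, key=lambda w: set_state[w]["lru"])
```

The point of the sketch is simply that an allocation decision consults the requester's ID before picking a victim way, which is what protects the reserved ways from eviction by lower-priority traffic.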

You can learn more about CodaCache HERE. The product was released only last month and is now in production. It has already been proven with multiple customers, per JP.


Machine Learning and Embedded FPGA IP

Machine Learning and Embedded FPGA IP
by Tom Dillinger on 07-18-2018 at 12:00 pm

Machine learning-based applications have become prevalent across consumer, medical, and automotive markets. Still, the underlying architecture(s) and implementations are evolving rapidly, to best fit the throughput, latency, and power efficiency requirements of an ever-increasing application space. Although ML is often associated with the unique nature of (many parallel) compute engines in GPU hardware, the opportunities for ML designs extend to cost-sensitive, low-power markets. The implementation of an ML inference engine on an SoC is a great fit for these applications – this article (very briefly) reviews ML basics, and then highlights what the embedded FPGA team at Flex Logix is pursuing in this area.

Introduction
Machine learning refers to the capability of an electronic system to:


  • receive an existing dataset of input values (“features”) and corresponding output responses
  • develop an algorithm to compute the output responses with low error (“training”) and,
  • deploy that algorithm to accept new inputs and calculate new outputs, with comparable accuracy to the training dataset (“inference”)

    The common hardware implementation of the ML algorithm is a neural network – loosely based on our understanding of the electrochemical interactions in the brain among a neuron cell nucleus, its dendrites, and the axons/synapses sending electrical impulses from other neurons to the dendrites. The figure below illustrates a “fully-connected, feed-forward” neural network, a set of nodes comprising:

     

    • an input layer (the “features” of the data)
    • additional computation layers (zero or more “hidden” layers)
    • an output layer

    In the fully-connected (acyclic graph) architecture, the computed value at the output of each node is an input to all nodes in the next layer.


    An expanded view of each network node is shown in the figure below. The computed input values each have an associated multiplicative “weight” factor. The node calculates the sum of the weighted inputs – in vector algebra terms, the “dot product”. A “bias” value may also be used in the summation, as part of the node calculation.
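The node calculation described above is just a weighted sum of the inputs plus a bias; as a sketch:

```python
# A single neuron: dot product of inputs and weights, plus a bias term.
def node_output(inputs, weights, bias):
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# 1*0.5 + 2*(-0.25) + 3*0.1 + 0.2 = 0.5
print(node_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], 0.2))
```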


    There are two important (interrelated) characteristics of note in the neural network – “normalization” and “activation”. The numerical range of individual input features could vary widely – for example, one input could range from (-10,10), while another spans (0,1). The neural network designer needs to assess the relative importance of each feature, and decide to what extent the range should be normalized in the input layer. Similarly, this architectural decision extends to the activation function within the node, as part of the output calculation. A variety of (linear and non-linear) activation functions are in common use – a few examples are shown below, including functions that normalize the output to a specific range (e.g., (0,1), (-1,1)).
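A few of the common activation functions can be written directly; note how sigmoid and tanh also normalize the node output, to (0,1) and (-1,1) respectively, while ReLU is unbounded above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # squashes any input into (0, 1)

def tanh(z):
    return math.tanh(z)                  # squashes any input into (-1, 1)

def relu(z):
    return max(0.0, z)                   # zero below threshold, linear above
```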


    Some activation functions include a “threshold”, such that the output is truncated (to zero) if the dot product result is below the threshold value. (The axon and endpoint synapses that connect a neuron output to the dendrites of other neurons are also capable of complex electrical filtering – the brain is indeed a very unique system.)

    At the output layer, the activation function is a fundamental aspect of the neural network design. The desired output result could be a numeric value, or could be “classified” into (two or more) “labels”. The simplest classification would be a binary 0/1 (pass or fail, match or no_match), based upon comparisons to the threshold ranges defining each label.

    Training/Test and Inference
    The selected neural net architecture needs to be “trained”. A subset of the given input dataset records is selected, and feature values applied using the existing weights and biases at each node. The network output values are compared to the corresponding “known” output values for each input record. An error measure is calculated, which serves as the optimization target. Any of a number of error models can be used – two common examples are depicted in the figure below.


    The training phase then adjusts the network weights and biases, re-submits the input training dataset, and re-calculates the error. Sophisticated algorithms are used during optimization to derive the (multi-dimensional) “surface gradient” of the error as a function of the weights and biases, typically working backwards from the output layer. The training phase iterates through multiple input data applications, error calculations, and weight/bias adjustments, until an error minimum is reached. Special techniques are employed to avoid stopping on a “local minimum” of the error response.
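The iterate-adjust-recalculate loop above can be sketched with the simplest possible network: a single linear node trained by gradient descent on mean squared error. The learning rate and epoch count are arbitrary illustrative choices:

```python
# Minimal gradient-descent training loop for one linear node (w*x + b),
# minimizing mean squared error -- a toy stand-in for backpropagation.
data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]   # "known" outputs: y = 2x + 1
w, b, lr = 0.0, 0.0, 0.01

for epoch in range(2000):
    dw = db = 0.0
    for x, y in data:
        err = (w * x + b) - y              # forward pass and error
        dw += 2 * err * x / len(data)      # gradient of MSE w.r.t. w
        db += 2 * err / len(data)          # gradient of MSE w.r.t. b
    w -= lr * dw                           # adjust weights against the gradient
    b -= lr * db

print(round(w, 2), round(b, 2))            # converges near w=2.0, b=1.0
```

A real training run differs mainly in scale: many weights, many layers, and gradients propagated backwards through the network rather than computed directly.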

    Once the training phase completes, the remaining dataset records serve as a separate “test” sample. These test records are applied to the network with the final training weights/biases, and an “accuracy” measure derived. (Accuracy is perhaps best understood for classification-based outputs – did each classified result for each test record match the given label? Also, considerable ML research is being pursued to select “good” training/test subsets, as well as identify “noisy” input data that may not be representative of the final environment.)

    Once a neural network with suitable accuracy has been derived, the design implementation is ready to be deployed as an “inference engine” for general purpose use.

    Numeric Resolution
    A key finding from ongoing ML research relates to the resolution of the weights, bias values, and activation calculations. During the training phase, high resolution calculations are needed at all layers – e.g., 32-bit floating point (fp32). However, once the network is ready to use for inference calculations, a reduction in resolution may result in minimal loss in accuracy, with corresponding improvements in power/area/cost. For example, weights and biases could be transformed to fp16 or 8-bit fixed point representations at some/all layers of the network, while maintaining comparable accuracy (link) – that is a game-changer.
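As a sketch of the idea, here is the simplest symmetric fp32-to-int8 quantization with a single scale factor; real frameworks use richer schemes (per-channel scales, zero points), so treat this purely as an illustration:

```python
# Post-training quantization sketch: map float weights onto int8 with one
# scale factor, then check the round-trip (dequantization) error.
weights = [0.42, -1.3, 0.07, 0.9, -0.55]      # illustrative fp32 weights

scale = max(abs(w) for w in weights) / 127.0  # int8 range is [-127, 127]
q = [round(w / scale) for w in weights]       # quantize to 8-bit integers
deq = [qi * scale for qi in q]                # dequantize for comparison

max_err = max(abs(w - d) for w, d in zip(weights, deq))
assert max_err <= scale / 2                   # error bounded by half a step
```

Eight-bit storage quarters the memory footprint versus fp32, and int8 MACs are far cheaper in silicon, which is exactly why inference-oriented hardware concentrates on them.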

    ML and Flex Logix eFPGA tiles
    I had an opportunity to chat with Geoff Tate and Cheng Wang at Flex Logix about their initiatives into supporting inference engines within an embedded FPGA implementation.

    Cheng indicated, “As you may recall, our eFPGA designs utilize modular, abutted tiles, allowing customers to build the IP in the capacity and configuration best suited for their application. In addition to the logic-centric tile (comprised of programmable LUTs), we offer a DSP-centric tile with a rich mix of multiply-accumulate functions. ML customers are seeking high MAC density, optimal throughput, and power efficiency – we have prepared an ML-centric tile with a concentration of programmable int8 MACs, ideally suited for many ML applications.” (This ML tile is similar to, yet simpler than, the DSP offering. Like the DSP tile, it can be readily incorporated into a larger eFPGA block. Also, the MACs can be configured as 8×16, 16×8, and 16×16.)


    Cheng continued, “We are engaging with customers seeking a variety of network options – e.g., even smaller bit resolutions, unique memory interfacing for faster retrieval of weights and biases.”

    An increasing area of ML development relates to network partitioning. For architectures larger than the physical implementation, a set of successive partial calculations is needed, with partition weights/biases updated prior to each evaluation. The overall throughput is thus a strong function of the time to load new weights and biases. The figure below illustrates how block partitioning applies to matrix multiplication (from linear algebra).
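Block partitioning of a matrix multiply can be sketched as follows; each k-block in the inner loop is analogous to loading one partition's weights, accumulating its partial products, then moving on to the next partition:

```python
# Block-partitioned matrix multiply: C = A @ B computed one bs-by-bs
# sub-block at a time, accumulating partial products per partition.
def matmul_blocked(A, B, bs):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, p, bs):
            for k0 in range(0, m, bs):          # one "partition" of weights
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, p)):
                        for k in range(k0, min(k0 + bs, m)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The result is identical to an unblocked multiply; what changes is the order in which the operands are touched, and hence how often a new partition of weights must be fetched.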


    For ML implementations targeting IoT edge devices (with input patterns representing sensor data), network partitioning may involve dividing the overall calculation between edge and host. In these cases, a detailed tradeoff assessment is made between throughput and power efficiency/cost.

    Geoff added, “Many customers are seeking an embedded FPGA solution with an ML-optimized MAC resolution. Our implementation style enables us to offer a tailored solution for a specific process and architecture within 6-8 months. Also, we realize there are a number of ML coding libraries used to define the neural network architecture – e.g., Caffe, TensorFlow. (link – also, see Footnote) A software toolset to establish a flow from the ML code to our eFLX compiler can be made available.”

    The attractiveness of a high throughput, power-efficient, and low cost embedded SoC inference engine implementation using an eFPGA optimized for the specific resolution requirements will no doubt greatly expand the breadth of ML applications. For more information on the Flex Logix ML tile specifications, please follow this link.

    –chipguy

    Footnote: The link provided is a YouTube video of a Stanford University CS lecture describing Caffe and (especially) TensorFlow ML software libraries. The most popular class in many CS departments is no longer “Introduction to Object-Oriented Programming”, but rather “Introduction to Machine Learning”. 😀

    PS. The introductory description above depicted a fully-connected, acyclic, two-dimensional neural network graph, with a set of one-dimensional vectors for weights and biases. ML research has also pursued many network topologies more complex than depicted above, including graphs with feedback connections between layers. Also, the training phase was “supervised”, in that output values/labels were assumed to be provided for each input record. “Unsupervised” training algorithms are used when the inputs do not include corresponding output data – this represents a significantly more complex facet of ML, as the “pre-training” phase attempts to identify (higher-level) features from correlations among the detailed (lower-level) inputs.


SEMICON West Intel 10nm and GF 7nm Update

SEMICON West Intel 10nm and GF 7nm Update
by Daniel Nenni on 07-18-2018 at 7:00 am

SEMICON West seemed a little slow last week, but maybe it was just me. I’m sure SEMI will come out with record-breaking numbers, but I did not see it in the exhibit hall (see the video). What I did see was hundreds of exhibitors, but I had no idea what they did. San Francisco again was very congested and smelly. I talked to a friend who is in public works and he said drugs are relatively cheap and plentiful, so SF is the place to be when you run out of prescription opioids, and it is getting worse.

SEMICON West 2018 Wrap (video)

Bottom line: I am no longer in favor of SF as a destination for DAC or any other conference. Either do it in San Jose or Santa Clara or get out of Northern California! San Francisco is not going to fix this problem if we keep ignoring it.

Robert Maire, Scotten Jones, and I attended SEMICON and had very productive meetings which gave us a pretty good outlook for 2018/2019. Robert has already published, Scott will add more, and this is mine. Click on the Events tab in the navigation bar to see them all.

We met with GlobalFoundries (Gary Patton, Erica McGill, and Jean-Baptiste Laloe). Interesting story about Erica. She hosted Scott and me in Malta a couple of years back, where we enjoyed a cleanroom tour. Erica is the first communications person I have seen do the tour, partly because you cannot wear make-up, perfume, hair products, heels, etc., but also because it is highly technical, which excludes most semiconductor communications people.

The nice thing about meeting with Gary and Erica is that they know that we know more than we should, so we are treated differently than the mainstream media. Questions are answered on and off the record, and we can fill in the blanks if there are any. My interest was 7nm; Scott will cover FD-SOI.

GF took a different path to 7nm than most expected. In June of 2015 the IBM/SUNY Alliance unveiled the first 7nm silicon using SiGe and EUV, which was expected to be production worthy in 2017. GlobalFoundries, however, chose a much more “TSMC-like” path to 7nm. It is not plug-compatible, but it is close enough that customers can move from one process to the other with relative ease. AMD is already doing this (straddling TSMC and GlobalFoundries) and others will follow, absolutely. In my opinion 7nm will be a very long node. If history repeats, 5nm will be a half node that is mostly skipped, like 20nm and 10nm, in favor of a more aggressive 3nm.

Another issue is the current political problems around the world, which Robert Maire spoke to at SEMICON (standing room only). Having a leading-edge fab in the United States is becoming more and more favorable, and having a “TSMC-like” process in the United States puts GF in a unique position. According to Gary, 7nm is on track for early-2019 production, which is in line with the other foundries except TSMC and Apple of course. Apple is always first to production with TSMC to make the fall iPhone launch.

Intel is on the same track with 10nm. According to people who actually know, 10nm yield is steadily improving and should be at Intel-acceptable levels by the end of the year for early-2019 mass production. Remember, Intel went through a similar exercise at 14nm. Yield delays were standard practice in the history of the semiconductor industry up until Apple joined our ranks. TSMC and Apple work jointly on a customized process that must be in production in time for the yearly iPhone launch. The trade-off made, of course, is performance for yield. Intel, on the other hand, will not sacrifice performance for yield, thus the 10nm delay.

The result is that Intel’s processes are faster and denser than the same-named foundry processes. You can see this by comparing FPGAs from Xilinx (TSMC 16nm) and Intel/Altera (Intel 14nm). The S2C FPGA prototyping boards support both Xilinx and Intel FPGAs, and we see a significant performance and density advantage with Intel 14nm. I do not expect to see Intel 10nm and Xilinx 7nm FPGAs for a year or so, but my guess is that we will see a similar performance and density advantage for Intel.

The other interesting Intel news is that they may skip 7nm and move directly to 5nm. This is of course a marketing move, since Intel’s process names are off by a node or two relative to the foundries: the Intel 14nm process rivals TSMC 10nm and the Intel 10nm process rivals TSMC 7nm. Scotten Jones is the expert, so I will defer to him if you need further convincing. Personally I think it is a great idea and will support Intel 100% on this one, absolutely.