
Sondrel Explains One of the Secrets of Its Success – NoC Design

by Mike Gianfagna on 03-15-2021 at 8:00 am

Seventeen horizontal layers of a complex digital chip design showing the interconnection layouts for each layer

Sondrel is an interesting and unique company. They are a supplier of turnkey services from system to silicon supply. So far, not that unique, as there are a lot of companies with this mission. What is unique is their focus on complex designs: the company takes on chips whose design would need teams of engineers working for a year, with the aim of providing economies of scale. I’ve spent some time at the leading edge of custom chip design, and I can tell you it’s not for the faint of heart. This stuff is very, very difficult, and those who can help are rare and quite valuable. There are lots of ways to address the daunting challenges of complex custom chip design, so I was quite excited to get some of the backstory from a key member of the Sondrel team. Read on to learn how Sondrel explains one of the secrets of its success – NoC design.

First, some of the basics. If you want to learn more about this unique company, you can read an in-depth interview Daniel Nenni did with Sondrel’s CEO here. Next, a bit about NoC, or network-on-chip, technology. If you think of the typical IP building blocks of an SoC as the electrical fixtures in a home, the NoC is the wiring. It’s the interconnect backbone that delivers the right data to the right location at the speed required to make the whole system work. Doing something as complex as interconnecting the elements in an SoC, and even beyond to external devices such as memory, really benefits from a structured approach. That’s what a NoC delivers. This technology can offer the margin of victory if done correctly.

I was able to catch up with Dr. Anne-Françoise Brenton, Sondrel’s NoC expert. Anne-Françoise has extensive design experience from ST Micro, Thomson Consumer Electronics and Thomson Multi-Media, and TI before joining Sondrel over seven years ago. Anne-Françoise offered some great insights into why a NoC is so critical to complex chip design and how Sondrel approaches its design.

She began by explaining that in an ideal design all the sections that need high speed, high data flow between them would be located as close together as possible. That is, memory in the middle of the chip next to the blocks of IP that need memory access. In reality, apart from cache, memory is located off chip on dedicated memory chips, which use state-of-the-art memory technologies so that access points to memory are located on the perimeter of the chip. As a result, a complex network of interconnections is needed to route the data traffic between blocks and to and from off-chip memory. On a big chip design, there could be seventeen layers of horizontal interconnections plus a number of vertical connections between these layers. The graphic at the top of this post illustrates such a case.

In her words, “It’s rather like designing a massive, multi-level office block where you have to design it to allow for optimal movement of people between areas and floors. Where a lot of people need to move rapidly between two locations, you need a wide fast corridor and the length of it affects the timing of the arrival of people. Similarly, an infrequently used, non-urgent route can be long and narrow, and therefore slow. The analogy continues with the vertical interconnects being lifts with big capacity, lifts that just connect two specific floors to provide a dedicated route for high-speed connections, and lifts that stop at all floors that are slower but connect a lot of locations. On top of this is the arbitration that dynamically controls the data flow through the NoC with buffering to smooth and optimize as demand changes, for example when two IP blocks are sharing and accessing the same memory.”

Is your head hurting yet? Mine was. Anne-Françoise went on to explain that designing a NoC is an iterative collaboration between the front-end, back-end and NoC design teams throughout the entire chip design process.

She added that one of the challenges in NoC design is that third-party IP blocks can be black-box solutions, with very little data provided on their demands for data flow because the vendor wants to protect the exact workings of its IP. This is overcome as the whole design matures: timing analysis and performance modeling help ensure that the NoC delivers data as required, arbitrating the pathways according to pre-assigned priorities so that there are no bottlenecks.

Anne-Françoise concluded our discussion by explaining that “NoC design is a constantly changing juggling act. Change one parameter and several other things could change. It’s as intellectually challenging as playing several games of chess simultaneously and it is immensely rewarding.”

The holistic perspective that Anne-Françoise offered was quite refreshing. Sondrel seems to have its act together when tackling near-impossible high-end design. If you need help doing the impossible, I would strongly recommend you contact Sondrel, now that you’ve seen how Sondrel explains one of the secrets of its success – NoC design. You can learn more about Sondrel here.

Also Read:

SoC Application Usecase Capture For System Architecture Exploration

CEO interview: Graham Curren of Sondrel

Sondrel explains the 10 steps to model and design a complex SoC


All-Digital In-Memory Computing

by Tom Dillinger on 03-15-2021 at 6:00 am

NOR gate

Research pursuing in-memory computing architectures is extremely active.  At the recent International Solid-State Circuits Conference (ISSCC 2021), multiple technical sessions were dedicated to novel memory array technologies to support the computational demand of machine learning algorithms.

The inefficiencies associated with moving data and weight values from memory to a processing unit, then storing intermediate results back to memory, are substantial.  The information transfer not only adds to the computational latency; the associated power dissipation is also a major issue.  The “no value add” data movement is a significant percentage of the dissipated energy, potentially even greater than for the “value add” computation, as illustrated below. [1]  Note that the actual computational energy dissipation is a small fraction of the energy associated with data and weight transfer to the computation unit.  The goal of in-memory computing is to reduce these inefficiencies, which is especially critical for the implementation of machine learning inference systems at the edge.

The primary focus of in-memory computing for machine learning applications is to optimize the vector multiply-accumulate (MAC) operation associated with each neural network node.  The figure below illustrates the calculation for the (trained) network – the product of each data input times weight value is summed, then provided to a bias and activation function.
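As a concrete illustration (not code from the cited paper), the per-node calculation can be sketched in a few lines; `node_output`, `relu`, and the sample values are hypothetical:

```python
def node_output(data, weights, bias, activation):
    """MAC for one neural-network node: the sum of data*weight products,
    plus a bias, passed through the activation function."""
    acc = 0
    for d, w in zip(data, weights):
        acc += d * w              # multiply-accumulate over all input arcs
    return activation(acc + bias)

relu = lambda x: max(0, x)
node_output([1, 2, 3], [4, -1, 2], bias=1, activation=relu)  # 4 - 2 + 6 + 1 = 9
```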

For a general network, the data and weights are typically multi-bit quantities.  The weight vector (for a trained, edge AI network) could use a signed, unsigned, or two’s complement integer bit representation.  For in-memory computing, the final MAC output is realized by the addition of partial multiplication products.  The bit width of each (data * weight) arc into the node is well-defined – e.g., the product of two n-bit unsigned integers is covered by a 2n-bit vector.  Yet, the accumulation of (data * weight) products for all arcs into a highly-connected network could require significantly more bits to accurately represent the MAC result.
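To make the bit-growth point concrete, here is a small sketch (the function name is illustrative) that computes the width needed for an exact unsigned MAC result:

```python
import math

def mac_result_bits(n_data, n_weight, num_inputs):
    """Bits needed to exactly represent the sum of num_inputs products of
    an n_data-bit unsigned value and an n_weight-bit unsigned value."""
    product_bits = n_data + n_weight            # width of each partial product
    growth = math.ceil(math.log2(num_inputs))   # accumulation bit growth
    return product_bits + growth

mac_result_bits(8, 8, 256)  # 16 + 8 = 24 bits for a 256-input node
```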

One area of emphasis of the in-memory computing research has been to implement a bitline current-sense measurement using resistive RAM (ReRAM) bitcells.  The product of the data input (as the active memory row wordline) and weight value stored in the ReRAM cell generates a distinguishable bitline current applied to charge a reference capacitance.  A subsequent analog-to-digital converter (ADC) translates this capacitor voltage into the equivalent binary value for subsequent MAC shift-add accumulation.  Although the ReRAM-based implementation of the (data * weight) product is area-efficient, it also has its drawbacks:

  • the accuracy of the analog bitline current sense and ADC is limited, due to limited voltage range, noise, and PVT variations
  • the write cycle time for the ReRAM array is long
  • the endurance of the ReRAM array severely limits the applicability as a general memory storage array
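A rough behavioral model of the bitline current-sense scheme described above may help fix ideas; the names and the linear ADC here are assumptions for illustration, not a circuit-accurate description:

```python
def bitline_mac(data_bits, conductances, adc_levels, i_full_scale):
    """Each active wordline (data bit = 1) adds its cell's current
    (proportional to the stored conductance) onto the shared bitline;
    an ADC then quantizes the summed current into a few discrete codes."""
    i_total = sum(d * g for d, g in zip(data_bits, conductances))
    i_clamped = min(i_total, i_full_scale)        # limited voltage range
    # coarse quantization is the accuracy bottleneck noted above
    return round(i_clamped / i_full_scale * (adc_levels - 1))

bitline_mac([1, 0, 1, 1], [1, 1, 1, 1], adc_levels=5, i_full_scale=4)  # 3
```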

These issues all lead to the same conclusion.  For a relatively small inference neural network, where all the weights can be loaded in the memory array, and the data vector representation is limited – e.g., 8 bits or less – a ReRAM-based implementation will offer area benefits.

However, for a machine learning application requiring a network larger than stored in the array and/or a workload requiring reconfigurability, updating weight values frequently precludes the use of a ReRAM current sense approach.  The same issue applies where the data precision requirements are high, necessitating a larger input vector.

An alternative for an in-memory computing architecture is to utilize an enhanced SRAM array to support (data * weight) computation, rather than a novel memory technology.  This allows a much richer set of machine learning networks to be supported.  If the number of layers is large, the input and weight values can be loaded into the SRAM array for node computation, output values saved, and subsequent layer values retrieved.  The energy dissipation associated with the data and weight transfers is reduced over a general-purpose computing solution, and the issue with ReRAM endurance is eliminated.
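Functionally, this layer-at-a-time reuse of the array can be sketched as a plain Python model (names are my own; in hardware, each weight matrix would be written into the SRAM before its layer's MACs run):

```python
def run_network(layers, x):
    """Evaluate a network layer by layer. In hardware, each weight
    matrix W is first loaded into the SRAM array, the MACs run
    in-array, and the outputs are saved as inputs to the next layer."""
    for W in layers:                       # W: weight matrix of one layer
        x = [sum(xi * wij for xi, wij in zip(x, row)) for row in W]
    return x

run_network([[[1, 2], [3, 4]]], [1, 1])  # one 2x2 layer -> [3, 7]
```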

In-Memory Computing using an Extended SRAM Design

At the recent ISSCC, researchers from TSMC presented a modified digital-based SRAM design for in-memory computing, supporting larger neural networks.[2]

The figure above illustrates the extended SRAM array configuration used by TSMC for their test vehicle – a slice of the array is circled.  Each slice has 256 data inputs, which connect to the ‘X’ logic (more on this logic shortly).  Consecutive bits of the data input vector are provided in successive clock cycles to the ‘X’ gate.  Each slice stores 256 4-bit weight segments, one weight nibble per data input; these weight bits use conventional SRAM cells, as they could be updated frequently.  The value stored in each weight bit connects to the other input of the ‘X’ logic.

The figure below illustrates how this logic is integrated into the SRAM.

The ‘X’ is a 2-input NOR gate, with a data input and a weight bit as inputs.  (The multiplicative product of two one-bit values is realized by an AND gate;  by using inverted signal values and DeMorgan’s Theorem, the 2-input NOR gate is both area- and power-efficient.)  Between each slice, an adder tree plus partial sum accumulator logic is integrated, as illustrated below.
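The DeMorgan identity behind the NOR choice is easy to check exhaustively; this tiny sketch (function names are mine) confirms that a NOR of the inverted operands reproduces the 1-bit multiply (AND):

```python
def nor(a, b):
    return 1 - (a | b)

def one_bit_multiply(data_bar, weight_bar):
    """Inputs are the *inverted* data and weight bits; by DeMorgan's
    Theorem, NOR(NOT a, NOT b) == a AND b, so a single 2-input NOR
    gate realizes the 1-bit product."""
    return nor(data_bar, weight_bar)

# check all four input combinations against the AND truth table
all(one_bit_multiply(1 - d, 1 - w) == (d & w) for d in (0, 1) for w in (0, 1))
```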

Note that the weight bit storage in the figure above uses a conventional SRAM topology – the weight bit word lines and bit lines are connected as usual, for a 6T bitcell.  The stored value at each cell fans out to one input of the NOR gate.

The output of each slice represents a partial product and sum for a nibble of each weight vector.  Additional logic outside the extended array provides shift-and-add computations, to enable wider weight value representations.  For example, a (signed or unsigned integer) 16-bit weight would combine the accumulator results from four slices.
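The shift-and-add recombination can be sketched as follows; `combine_nibble_slices` is an illustrative name, and the weight is assumed unsigned here (a signed representation needs extra handling of the top nibble):

```python
def combine_nibble_slices(slice_accums):
    """slice_accums[k] holds the accumulated (data * weight-nibble) sum
    for weight bits [4k+3 : 4k]; shifting each partial result by 4k bits
    and adding reconstructs the full-width MAC value."""
    total = 0
    for k, partial in enumerate(slice_accums):
        total += partial << (4 * k)   # each nibble weighs 16x the previous
    return total

# single data bit = 1 against a 16-bit weight 0x1234: the per-slice sums
# are just the nibbles, low to high
combine_nibble_slices([0x4, 0x3, 0x2, 0x1])  # 0x1234 == 4660
```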

Testsite results

A micrograph of the TSMC all-digital SRAM-based test vehicle is shown below, highlighting the 256-input, 16 slice (4-bit weight nibble) macro design.

Note that one of the key specifications for the SRAM-based Compute-in-Memory macro is the efficiency with which new weights can be updated in the array.

The measured performance (TOPS) and power efficiency (TOPS/W) versus supply voltage are illustrated below.   Note that the use of a digital logic-based MAC provides functionality over a wide range of supply voltage.

(Parenthetically, the TOPS/W figure-of-merit commonly used to describe the power efficiency of a neural network implementation can be a misleading measure – it is strongly dependent upon the “density” of the weights in the array, and the toggle rate of the data inputs.  There is also a figure below that illustrates how this measure depends upon the input toggle rate, assuming a 50% ratio of ‘1’ values in the weight vectors.)

Although this in-memory computing testsite was fabricated in an older 22nm process, the TSMC researchers provided preliminary area and power efficiency estimates when extending this design to the 5nm node.

Summary

There is a great deal of research activity underway to support in-memory computing for machine learning, to reduce the inefficiencies of data transfer in von Neumann architectures.  One facet of the research is seeking to use new memory storage technology, such as ReRAM.  The limited endurance of ReRAM limits the scope of this approach to applications where weight values will not be updated frequently.  The limited accuracy of bitline current sense also constrains the data input vector width.

TSMC has demonstrated how a conventional SRAM array could be extended to support in-memory computing, for large and/or reconfigurable networks, with frequent writes of weight values.  The insertion of 2-input NOR gates and adder tree logic among the SRAM rows and columns provides an area- and power-efficient approach.

-chipguy

 

References

[1]  https://energyestimation.mit.edu

[2]  Chih, Yu-Der, et al., “An 89TOPS/W and 16.3TOPS/mm² All-Digital SRAM-Based Full-Precision Compute-in-Memory Macro in 22nm for Machine-Learning Applications”, ISSCC 2021, paper 16.4.

 


Honda Asserts Automated Driving Leadership

by Roger C. Lanctot on 03-14-2021 at 10:00 am


When Honda Motor Co. tied up with General Motors almost a year ago to collaborate on vehicle propulsion technology, connected car tech, and assisted driving, observers might have been forgiven for thinking Honda was surrendering its independence to catch up in the EV race. Honda reasserted its independence last week with the launch of the semi-autonomous Honda Legend in Japan – a leasable fleet vehicle equipped with the world’s first Level 3 semi-autonomous driving system.

The announcement was momentous for validating SAE’s Level 3 classification. Industry insiders and experts frequently dismiss these classifications as artificial and unhelpful with Level 3 being the most controversial. Level 3 automated driving allows hands-off, eyes-off driving with the expectation and understanding that the driver must be available – presumably in the driver seat – to take back driving control under appropriate conditions.

SOURCE: SAE

Level 3 semi-autonomous driving is the mode that makes airline pilots simply shake their heads – “It will never work.” Audi announced its intention to introduce Level 3 semi-autonomous driving on the A8 in Europe and then withdrew the announcement due to the lack of local regulatory support.

Honda’s announcement demonstrates that auto makers will distinguish between their autonomous driving development work for so-called robotaxis and shuttles and the technology they bring to potentially mass market vehicles. Honda was happy to invest billions in GM’s Cruise operation, but continued to pursue its own proprietary assisted driving platform – not unlike GM’s work on Super Cruise. In fact, most auto makers have similar in-house solutions in development.

According to AutoExpress: “Honda’s Sensing Elite system builds on the brand’s existing Sensing safety technology, but uses a more accurate global positioning system, more detailed three-dimensional maps and several sensors which give the ECU a 360-degree view of the car’s surroundings.

“The most impressive part of the new Sensing Elite system is the hands-off driving mode, which uses Honda’s existing adaptive cruise control and lane-keeping assist systems to assume total control over the car when driving on the motorway.” The car is also capable of automatically overtaking slower vehicles as well as detecting an inattentive or unresponsive driver – automatically moving to the hard shoulder, bringing the car to a halt, and flashing its hazards and sounding the horn to warn other road users.

The Sensing Elite-equipped Legend became available in the form of 100 leased fleet vehicles in Japan each of which will cost approximately $100K due in part to the incorporation of five LiDAR sensors. The driver has up to 30 seconds to take over the driving task when alerted by the system – and is not responsible for anything that might occur during that 30 seconds.

The introduction of the car is reminiscent of GM’s launch of the electrified EV-1 as a lease. GM later called back and crushed all of the EV-1s.

No immediate plans to bring the car and the new system to the U.S., U.K., or E.U. were announced. The arrival of the Honda Legend was enabled by new regulations in Japan that opened the door to Level 3-type operation.

The regulatory changes included a legal adjustment and adoption of WP.29 Automated Lane Keeping System rule. The legal change, adopted in April of 2020, allows for “the car, not the driver, (to be) responsible for the driving,” according to a report in Nikkei Asia. Japan then adopted the WP.29 ALKS regulation which establishes strict requirements for Automated Lane Keeping Systems (ALKS) for passenger cars which, once activated, are in primary control of the vehicle.

The UNECE states: “ALKS can be activated under certain conditions on roads where pedestrians and cyclists are prohibited and which, by design, are equipped with a physical separation that divides the traffic moving in opposite directions. In its current form, the Regulation limits the operational speed of ALKS systems to a maximum of 60 km/h.”

More details on the UN ALKS regulation: https://unece.org/transport/press/un-regulation-automated-lane-keeping-systems-milestone-safe-introduction-automated

Honda committed $2.75B (over 12 years) to GM’s Cruise autonomous vehicle operation, including a $750M equity investment and has announced further plans to bring Cruise’s Origin robotaxi to Japan for testing. From these announcements it is clear Honda wants a defined and limited commitment, but nevertheless a substantial stake, in the autonomous vehicle business.

When it comes to brand-defining mass-market semi-autonomous tech, though, Honda clearly prefers to keep its assets separate from GM. Honda may adopt GM’s electric propulsion tech and the associated Bolt connectivity and driver assist platforms to catch up in EVs. But GM’s Super Cruise will not displace Honda’s Sensing Elite system. Most important of all, Honda’s introduction of the Sensing Elite system on the Legend has served notice to the world that Japan – long silent in the world of autonomous driving – will be a leader.


A Funny Thing Happened on the Way to 5G Cars

by Roger C. Lanctot on 03-14-2021 at 8:00 am


Prior to the arrival of the COVID-19 pandemic car makers were inching toward putting their 5G connectivity plans into action.  There was a growing recognition, at the time, that no one wanted to miss the 5G car boat – and no car maker would want to be stuck selling 4G cars in a 5G world.

The onset of the pandemic was a shock to the system.  Advanced connectivity plans suddenly seemed secondary to basic survival.  The 5G road maps were shelved; autonomous vehicle development was frozen; and the industry held its breath for two critical months.

When car dealerships and factories re-opened and customers returned, those 5G plans now looked a little too ambitious.  A closer look at 5G revealed higher costs and some delayed network deployments.  Maybe 5G in cars could wait.

The arrival of 2021 has changed all that.  Car makers are back to the 5G drawing board and 5G deployments are beginning to trickle into premium vehicle RFQs with market introductions just a year or two away.  Still, the return of 5G planning has remained out of the headlines.

Car makers are battling over batteries and prattling about robotaxis, but no one is talking about 5G.  That is, no one but the Chinese.

Chinese auto makers such as BAIC, BYD, GAC, SAIC, Great Wall, Nio, WM Motors, and DFM have all announced their 5G plans.  Others – FAW, SAIC, Ford, and Human Horizons – have announced C-V2X plans.  In the U.S. and the E.U., there is radio silence.

“Western” auto makers have, for the most part, clammed up regarding their C-V2X and 5G plans due to pending disputes with regulators over the standards and spectrum allocations for enabling inter-vehicle communications.  The dreaded Wi-Fi-based DSRC (dedicated short-range communication) technology proposed 20 years ago for vehicle-to-vehicle communications remains a sticking point on both sides of the Atlantic.

In the U.S., the Federal Communications Commission split up the spectrum that had originally been preserved for V2V (in fact, V2-everything) communications in an effort to open up more spectrum for unlicensed Wi-Fi applications.  This was intended to be the “last word” on the subject – the end of DSRC in the U.S.  But the devotees fight on – thereby freezing OEMs leery of adopting a technology in flux.

In the E.U., the European Commission is working under the guidance of the ITS Directive to define a regulatory regime enabling interoperable transportation connectivity.  The Commission is known to still favor ITS-G5 (i.e. DSRC) technology in spite of some claims of being technology agnostic.

Further complicating matters in the E.U. is Volkswagen’s launch of the DSRC-equipped Golf and a Euro NCAP (New Car Assessment Program) which includes V2X – with the relevant technology yet to be defined.  As in the U.S., E.U. auto makers are loath to lay out their plans not knowing which way the regulatory winds will blow.

As noted, none of this 5G hesitancy applies in China where top down regulatory alignment has all but ruled out DSRC technology for most V2X applications.  The clarity is no doubt welcomed by auto makers with their 3-4-year development cycles and adherence to standards.

One can only hope that the DSRC fever will finally pass and auto makers in the U.S. and Europe can get on with the business of making safer, better connected cars.  The promise of C-V2X and 5G wireless technology is enhanced vehicle situational awareness and collision avoidance – something sorely needed in the U.S. where annual highway fatalities are once again on the rise.  It would be a shame, and not very funny, if more lives were lost for the sake of DSRC nostalgia.


Podcast EP11: Semiconductor Shortages and the CHIPS Act Explained

by Daniel Nenni on 03-12-2021 at 10:00 am

Dan and Mike are joined by Terry Daly for a thoughtful and informative overview of global semiconductor supply challenges and an excellent overview of the CHIPS for America Act, an ambitious piece of US legislation aimed at establishing investments and incentives to support U.S. semiconductor manufacturing, research and development, and supply chain security.

Terry is a Senior Fellow – Council on Emerging Market Enterprises at The Fletcher School of Law and Diplomacy. He is a long-time veteran of the semiconductor industry. His prior experience includes SVP at GLOBALFOUNDRIES. There, he served as head of strategy and corporate development, chief of staff to the CEO, and head of corporate program management. He had instrumental roles in global strategic alliances and M&A, including the acquisition of the microelectronics business from IBM.

There is a lot of very useful information in this podcast. As a result, we exceed our 30-minute maximum length rule by two minutes.

The views, thoughts, and opinions expressed in these podcasts belong solely to the speaker, and not to the speaker’s employer, organization, committee or any other group or individual.


CEO Interview: R.K. Patil of Vayavya Labs

by Daniel Nenni on 03-12-2021 at 6:00 am


RK has over 25 years of industry experience in the domains of telecom, embedded software and semiconductors. Before co-founding Vayavya, he was a co-founder of Smart Yantra Technologies, where he held various positions in engineering, marketing and management. At Vayavya, RK is responsible for overall management and strategic decisions related to engineering and business activities.

Vayavya Labs is different from most EDA companies as it provides solutions for software engineers to validate their software for SoCs. How has this journey been so far?

We started Vayavya Labs in 2006, with the single-minded focus of addressing the evolving needs of the embedded software industry and helping companies adapt to the different software environments needed for their SoCs.  By then, highly programmable SoCs had already started making inroads into a number of different devices and systems in consumer electronics, automotive, communication and industrial applications. The programmability of these devices and the associated software environments (operating systems and software architectures) had compelled semiconductor companies to build software engineering teams, which often outnumbered the ASIC engineering teams.

Our approach to addressing this problem in the industry was to provide a set of tools and a methodology that would enable code synthesis/generation from a high-level golden specification. This was rather a niche concept and probably a bit ahead of its time. A number of people appreciated the technology but were unwilling to adopt a tool from a small company based out of India.

In 2008, we started an embedded software services unit to help sustain the company and fund our R&D efforts. Over the past decade we have grown steadily, become profitable and have sharpened our focus to address the demanding requirements of SoC verification from a system, software and a hardware perspective.

Today, we are in the rather unique situation of having the expertise and the solutions to address hardware and software verification requirements at the architectural, pre-silicon and post-silicon stages of the SoC design flow, ensuring shorter development cycles.

With the number of people working on SoC designs constantly increasing as system companies start developing their own SoCs, what challenges do you envisage Vayavya Labs solving?

One of the biggest challenges facing designers developing today’s complex SoCs is the need to verify the IC design’s functionality not only against the specification, but also in an end-system context. With shorter time-to-market windows, it is also becoming a necessity for design houses to ensure that the system and SoCs adhere to the many different industry-specific standards and requirements prior to tape-out. Verification for an SoC today no longer implies merely hardware verification but also software verification, and it requires intimate knowledge of embedded systems to ensure system compliance.

As a contributing member of the Accellera committee for portable stimulus (PSS), we have contributed extensively to this standard, and our contributions are referred to as Hardware-Software Interfaces (HSI). To enable design companies to easily verify their SoCs from a system software perspective, we have launched an open initiative called OpenHSI™, where designers can draw on a ready-made library of device drivers and middleware stacks. We strongly believe that PSS needs to leverage the full potential of HSI/OpenHSI™ in order to realize software-driven verification.

At Vayavya Labs, we help companies meet the challenges of verification in three ways:

  • We provide virtual platforms by creating models, which can be used to verify the architecture, explore the performance and then subsequently verify the IP/SoC at both pre-silicon and post-silicon stages. In addition, the virtual platforms can also be used to validate the software using emulation.
  • We provide software tools for generating bare metal and operating system specific device drivers, from a hardware-software interface specification, making it easier to validate the hardware-software interface and the SoC functionality from a software perspective.
  • We enable companies to validate their SoCs by building PSS models to automate verification test cases. Additionally, we help them realize these tests by providing the necessary software drivers and stacks to validate the SoC across all platforms – virtual models, simulation, emulation, etc.

In the post-COVID era, the digital twin has taken on growing importance. A number of companies are now adopting virtualization for their IPs and SoCs. What solutions does Vayavya Labs provide in this area?

The notion of a digital twin is very pertinent these days, as it provides an alternative for software and hardware design teams dispersed across the world to rapidly test, debug and refine their systems. Virtualization, as the name implies, removes the designers’ dependency on physical hardware. It gives design teams an opportunity to explore performance requirements early in the design cycle, and it ensures consistency between hardware and software verification results for common tests, eliminating the potential for misinterpretation between the two teams. In addition to enabling concurrent development of software and hardware, virtualization also lets design teams robustly test the hardware-software combination to mitigate post-silicon issues.

Unfortunately, developing virtual models is quite challenging, as it requires varied technical skills: knowledge of the modeling languages, insight into the hardware architecture, embedded software expertise and domain knowledge.

Vayavya Labs provides two types of solutions to address the need for virtual platforms. The first includes ready-made generic modeling libraries to jump-start virtual platform development, plus software for automatic generation of device drivers. The second includes custom development services, such as developing custom SystemC, QEMU, Simics and SimNow models for IPs and SoC peripherals, in addition to custom development of bare-metal software, OS bring-up, OS ports and software device drivers.

Vayavya Labs has been part of the Accellera committee for portable stimulus (PSS) and has contributed significantly in the area of the hardware-software interface. There is also the OpenHSI™ initiative, which Vayavya Labs launched to promote PSS usage. Can you elaborate on it?

The benefits of using PSS are significant, as it enables design teams to define the test intent in the PSS domain-specific language (DSL) and use it across all platforms, such as simulation, FPGAs, emulation, etc. It also gives hardware and software engineers an inherent ability to test the SoC from a system and software perspective. However, one of the biggest impediments to PSS adoption is realizing the tests defined in the DSL, as this requires APIs, device drivers and middleware/protocol stacks to be exercised completely by the software. All of this falls in the realm of embedded software, the nuances of which most hardware design engineers are uncomfortable with. Consequently, to promote the usage of PSS across semiconductor companies, Vayavya Labs launched the OpenHSI™ initiative, which includes commonly used APIs, device drivers and middleware stacks for use by engineering teams to jumpstart their system-level verification.

Your team has been working on the device driver generator (DDGen) tool for some time now, something that would be of immense value as design companies struggle to verify their SoCs from a system perspective. Can you provide any insights about it?

Thanks for bringing up this question. Our efforts to develop a software tool (DDGen) that automatically generates device drivers from a specification continue to evolve as we learn and adapt to changing industry trends, such as safety requirements in automotive and consumer electronics. DDGen is now a stable product and is currently in use with a few customers. With the current emphasis on system validation, we expect more customers to adopt DDGen.

Our current offering of DDGen also helps automotive ECU developers with MCAL automation. The MCAL drivers generated by DDGen can easily be integrated with any AUTOSAR stack provider's offering to enable rapid development of ECU software for automotive applications.

In addition to creating device drivers, DDGen also plays a vital role in PSS adoption for SoC system-level verification, as it generates consistent register-access APIs and bare-metal drivers for use by all hardware and software teams.

What do the next 12 months have in store for Vayavya Labs?

We have grown at a modest rate over the past decade, building up the necessary expertise, domain knowledge and credibility with customers while maintaining profitability. We are now well positioned for growth in the automotive, consumer electronics and wireless/5G verticals. We are also continuing to expand our presence in North America and Europe. With strong fundamentals, we look forward to investing in new areas alongside that growth.

Also Read:

CEO Interview: Dr. Shafy Eltoukhy of OpenFive 

CEO interview: Graham Curren of Sondrel

CEO Interview: Mark Williams of Pulsic


VersionVault brings SCM/DM capabilities to EDA World – with Cadence Virtuoso Integration

VersionVault brings SCM/DM capabilities to EDA World – with Cadence Virtuoso Integration
by manish_virmani on 03-11-2021 at 10:00 am

VersionVault Cadence Integration

HCL VersionVault is a secure enterprise solution for version control and configuration management. With the HCL VersionVault – Cadence Virtuoso Integration, VersionVault brings its enterprise configuration management capabilities to analog and mixed-signal designers. The integration enables designers to take advantage of VersionVault's core capabilities without leaving their familiar design environment, allowing custom chip designers to complete VersionVault actions from within Cadence Virtuoso.

  • Salient Features:

The VersionVault Cadence integration offers an advanced set of capabilities that makes it a good fit for IC designers.

Figure 1: Integration Capabilities

  • Instant Workspace Creation

With dynamic views, designers can create workspaces based on a desired configuration instantaneously, irrespective of the size of the design libraries (which can run into gigabytes). No client-side downloading of content is needed.

  • Rich Graphical & Command-line support

The integration supports all prominent design management use cases from Cadence Virtuoso's graphical interfaces, i.e., the Library Manager and Cell View Editors. It also provides a dedicated command-line interface for all major design management operations.

  • Library Manager:

Figure 2: DM Operations via Context Menus in the Library Manager

  • Cell View Editors:

Figure 3: DM Operations via CVE

  • Command Line

Figure 4: Command Line Interface

  • Interactive Graphical Schematic Diff

The schematic diff tool enables designers to graphically browse through and review changes made across versions of the same schematic design. It provides a means to navigate through any addition, deletion or modification that has taken place between the schematic versions being compared. During navigation, the tool also highlights the deltas on the schematic editor when they are part of a visible design component.

Figure 5: Graphical Schematic Diff

  • Hierarchical Design Management

The Hierarchy Manager GUI provides a powerful mechanism for examining and traversing a design hierarchy. On the Specification tab, designers can specify various descent controls with advanced filtering capabilities; the Cell Views tab shows the corresponding results, on which designers can perform various DM operations.

Figure 6: Hierarchy Manager

  • VersionVault Work Area Manager (WAM)

The VersionVault Work Area Manager is one of the key highlights of this integration. It offers designers an interface to perform advanced design management operations, besides presenting additional VersionVault-specific information for designs, including their mastership.

Figure 7: WorkArea Manager

  • Hardware and Software Co-development

As an enterprise-level source and design management solution, VersionVault helps software/firmware developers and chip designers enjoy server-class security for SW and HW artifacts, maintain history, share components/artifacts across local and/or distributed teams, and adhere to a common process management framework put in place by the organization. Common tooling ensures common training for SW and HW teams and thus reduces administration costs.

Will this solution help you and your organization?

  • Do you have a robust enterprise level Design Management solution in place?
  • Does your current Design Management solution enable designers to create workspaces based on a specific configuration instantaneously?
  • Does your current Design Management solution support co-development of hardware and firmware, common labeling across components, adherence to a common process, and fine-grained access control across SW/firmware and HW teams?

If not, VersionVault – Cadence Virtuoso Integration can help you and your organization fill in these gaps.

 


Webinar: How to Protect Sensitive Data with Silicon Fingerprints

Webinar: How to Protect Sensitive Data with Silicon Fingerprints
by Daniel Nenni on 03-11-2021 at 8:00 am

Webinar How to Protect Sensitive Data with Silicon Fingerprints

Data protection is on everyone's mind these days. The news cycle seems to contain a story about hacking, intrusion or cyber-terrorism on a regular basis. The cloud, our hyperconnected devices and the growing reliance on AI-assisted hardware to manage more and more mission-critical functions all around us make data protection a front-of-mind item for many. There are many approaches to address data security, some hardware-based and some software-based, with many blending both. All of them share a common liability: the cryptographic key that unlocks data access. Like an impenetrable vault, the protection is neutralized the moment someone holds the key. An upcoming webinar outlines a way to implement this all-important key in a unique way, one that doesn't require storing the key at all. Let's explore how to protect sensitive data with silicon fingerprints.

First, a bit about the company holding the webinar. Intrinsic ID is a unique company that focuses on security IP. Their stated mission is to make it easy to secure any smart device and make the connected world safer. It’s hard to argue with that. At the core of their strategy is something called a physical unclonable function, or PUF technology. This is where the silicon fingerprint comes in. I’ll get back to that in a moment.  If you want more background on the company you can see my recent interview with their CEO, Pim Tuyls.

Back to silicon fingerprints. The concept is to use the innate and unique characteristics of each semiconductor device to create a PUF. A special SRAM cell is used to manifest this capability. It turns out every SRAM cell has its own preferred state each time the SRAM is powered on, resulting from random differences in transistor threshold voltages. When an uninitialized SRAM is powered up, its response is a unique, random pattern of 0s and 1s. This pattern is the chip's fingerprint, since it is unique to a particular SRAM on a particular chip.

If this sounds too easy, it is. The SRAM response is a noisy fingerprint and turning it into a high-quality and secure key requires special processing. This is done with the Intrinsic ID IP.  With this approach, it is possible to reconstruct exactly the same cryptographic key every time and under all environmental conditions.  This approach has some significant advantages. The key is not permanently stored anywhere and so it’s not present when the device is inactive (no key at rest). Hackers who open the device to compromise memory come up empty-handed. There is a lot more to this process. You’ll need to attend the webinar to learn more.

Beyond the basics of how silicon fingerprints work, there are many more moving parts in building an actual secure system. The webinar covers all of these steps, including how to:

  • Create a PUF root key from a chip’s silicon fingerprint
  • Derive device-unique cryptographic keys for different purposes, applications and users
  • Create a secure vault

This webinar covers a lot of ground. To give you a preview, here are some of the specific topics that you’ll learn about:

  • The need for keys in IoT – many keys are needed, where do they come from?
  • How to keep your (root) keys secure
  • The SRAM PUF and how it creates the root key
  • SRAM PUFs vs. traditional methods
  • Protecting all keys with a key vault
  • Information about the widespread use of these methods

If you're concerned about protecting data in your next design, you should absolutely attend this webinar. You'll learn about methods to lock your data with a key that is never stored anywhere. This is how to protect sensitive data with silicon fingerprints. The webinar will be broadcast on Wednesday, March 24, 2021 at 10:00 AM PDT. You can register for the webinar here. You'll be glad you attended.

Also Read:

CEO Interview: Pim Tuyls of Intrinsic ID

IDT Invests in IoT Security

IoT Devices Can Kill and What Chip Makers Need to Do Now


Using IP Interfaces to Reduce HPC Latency and Accelerate the Cloud

Using IP Interfaces to Reduce HPC Latency and Accelerate the Cloud
by Scott Durrant and Gary Ruggles on 03-11-2021 at 6:00 am


IDC has forecasted that over the next five years, the Global Datasphere — the amount of data that’s created, transferred over the network and stored each year — will increase by over 3X to 175 zettabytes (Figure 1). Much of this is driven by the Internet of Things (IoT), video applications (including video streaming, social media, online gaming, augmented and virtual reality applications), and unified communications for video conferencing, text/chat and online voice communications.

Figure 1: Dramatic increase in the amount of network data that’s created, transferred, and stored

All of this data growth is driving the need for more compute power to process data in the cloud and high-performance computing (HPC) systems. To deliver the best experience at the endpoint, systems need faster interfaces to move data from point A to point B, an efficient and high performance storage infrastructure to store and retrieve data, and artificial intelligence (AI) and graphics accelerators to extract meaning from all of this data. High-performance IP can accelerate the design of chips that address these challenges.

Every HPC and cloud application has its own level of latency sensitivity, but they share three major sources of latency.

Latency Source 1: Network Latency

The first major source of latency is the network itself, including the time to move data between two points. Network latency is impacted by the distance that data must move. For example, with all else being equal, it’s much faster to move data between two nearby buildings than to move it across a continent.

Network latency is also impacted by the number of hops or network devices that the data has to traverse (which is typically directly related to the distance travelled). Minimizing the network distance and the number of hops can help to reduce network latency. To this end, cloud, telecom, and co-location service providers have recently established partnerships to put the power of cloud computing at the edge of the network, closer to the user and to end-user devices.

This helps to minimize latency and converge the data and services closer to the point of use for a much more responsive experience. It delivers smoother and more realistic experiences in applications like video streaming, augmented and virtual reality, and online gaming. (See How AI in Edge Computing Drives 5G and the IoT for a case study on this topic.)

In addition, moving cloud computing closer to the edge accelerates the response time for control system applications. In an automotive application, for example, a car moving at 60 miles per hour travels nearly 9 feet in 100 milliseconds – a blink of the eye. Any delay in data moving from the car to and from the cloud can be life-threatening. Offering nearly instantaneous response times gives the control system greater precision for increased safety.
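The arithmetic behind that automotive example is a useful back-of-envelope check. The helper below is hypothetical, written just to show how a latency budget translates into distance travelled:

```python
MPH_TO_FT_PER_S = 5280 / 3600  # 1 mph is about 1.467 ft/s

def distance_travelled_ft(speed_mph: float, latency_ms: float) -> float:
    """Feet a vehicle covers while waiting out a given latency."""
    return speed_mph * MPH_TO_FT_PER_S * latency_ms / 1000

print(round(distance_travelled_ft(60, 100), 1))  # 8.8 ft at 60 mph, 100 ms
print(round(distance_travelled_ft(60, 10), 2))   # 0.88 ft if edge compute cuts latency to 10 ms
```

Cutting round-trip latency by an order of magnitude shrinks the "blind" distance proportionally, which is the motivation for pushing cloud resources to the network edge.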

Latency Source 2: Storage Latency

A second source of latency is the storage and retrieval of data, including the access time of the media. Historically, magnetic hard disk drives (HDDs) were the primary long-term data storage medium. HDDs had access times that were measured in milliseconds. But as solid state drives (SSDs) and persistent memory proliferate, media access time is measured in hundreds of nanoseconds, resulting in a 10,000X improvement in responsiveness (Figure 2).

Figure 2: As applications move from HDDs to persistent memory, systems see a 10,000x improvement in storage latency
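The 10,000X figure follows directly from the units. A quick sanity check, using assumed order-of-magnitude access times rather than measured figures:

```python
# Assumed order-of-magnitude access times, not measured figures.
HDD_ACCESS_S = 5e-3      # magnetic HDD: several milliseconds
PMEM_ACCESS_S = 500e-9   # persistent memory: hundreds of nanoseconds

improvement = HDD_ACCESS_S / PMEM_ACCESS_S
print(f"{improvement:,.0f}x faster media access")  # 10,000x faster media access
```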

The tremendous improvement in storage access times has resulted in network performance becoming the limiting factor for latency in a storage area network. Moving the storage closer to the CPU helps, as does using architectures such as persistent memory and innovative protocols like remote direct memory access (RDMA) to help accelerate storage transactions.

Another emerging technology in the storage space is computational storage, which combines compute capabilities with storage systems to offload computation from application servers. Computational storage allows processing to happen within the storage itself, reducing traffic on the storage network and providing faster responses in certain applications.

Finally, smart network interface cards (NICs) are being adopted to reduce the load on application processors as data is transferred across the storage network. By offloading data transfer protocols, security protocols, and network management tasks from the application processor, smart NICs improve overall system performance for networked applications.

Latency Source 3: Compute Latency

The third source of latency is the actual compute time associated with data processing. The compute cycles and the movement of data between compute modules, and between memory and the compute device, all impact data processing time. To address processing latency, designers need to address the amount of bandwidth available and the speed of the data transfer protocols.

Figure 3 shows an example of two chips: a cloud server system-on-chip (SoC), which provides the application compute processing, and a graphics accelerator chip. The graphics accelerator uses HBM memory, and the cloud server chip uses traditional DDR memory. By utilizing a cache coherent interface between these two devices, the memory can be pooled into what we call a "converged memory pool," and the devices can share memory space without actually having to copy data from one processor's domain to the other's. This type of connection benefits from high-performance interface IP such as PCI Express 5.0, Compute Express Link (CXL), and Cache Coherent Interconnect for Accelerators (CCIX).

Figure 3: Cache coherent interfaces reduce compute latency

PCIe 5.0, CXL, or CCIX for Lowest Latency & Right Feature Set?

While low latency is the goal in cloud environments for fast processing of complex workloads, each protocol provides additional unique features and functionality that best fit the needs of the target application. Traditionally, servers relied on CPUs and storage for compute resources, which is no longer sufficient for today's large hyperscale data centers with AI accelerators. A cloud server with a certain amount of memory, AI acceleration, GPUs, and networking capabilities may require two CPUs and four storage devices, or one storage device and two CPUs, to process a particular workload. Each of these scenarios poses a different server configuration requirement for flexibility and scalability while continuing to focus on the goal of low latency. Let's now examine the crowded field of low-latency, cache coherent protocols to make it easier for designers to select the technology that best addresses their unique design needs.

While the market is preparing for PCIe 6.0, which is expected to be introduced in 2021, the shift from 16 GT/s PCIe 4.0 to PCIe 5.0 operating at 32 GT/s is quickly ramping up. A quick survey of our current Synopsys DesignWare® IP users shows many designs have already adopted the 32 GT/s PCIe 5.0 interface for their HPC applications. However, with the use of AI accelerators requiring more efficient memory performance, cache coherency combined with high bandwidth has become a critical demand. The CXL and CCIX protocols address this demand by reducing the amount of back and forth copying of data from the memory to processors and accelerators, dramatically lowering latency.

To fully optimize a system, selecting the right interface becomes critical to making the necessary tradeoffs between bandwidth, latency, memory access, topology, and implementation.

PCI Express

PCIe is the de-facto standard for chip-to-chip connectivity between the host and device. A simplified PCIe implementation can be between a PCIe root port (or root complex), and a PCIe endpoint through a four-lane (x4) link. A typical chip-to-chip PCIe implementation is expandable and hierarchical with embedded switches or switch chips that allow one root port to interface with multiple endpoints. Such an implementation is seen in laptops or even servers, allowing connectivity with different endpoints like Ethernet cards, display drivers, disk drives and other storage devices. However, the limitation of this implementation is seen in large systems with isolated memory pools that require heterogeneous computing where the processor and accelerator share the same data and memory space in a single 64-bit address space. In other words, the lack of a cache coherency mechanism in PCIe makes memory performance inefficient and latency less than acceptable as compared to some of the newer protocols like CXL and CCIX.

It is possible to leverage PCIe with what can be referred to as private links to enable data centers with servers that require chip-to-chip communication for multi-processing or between a processor and multiple accelerators. Private PCIe links can be used when both ends of a chip-to-chip link are owned by the same vendor: parts of a typical PCIe data stream can be co-opted to route information from chip to chip, outside of the PCIe protocol itself. Overloading the PCIe header and adding flexible new packets via vendor-defined messages enables messages to reach the intended chip in the chain. While this is not a typical implementation, many Synopsys users have adopted it.

CCIX

When CCIX was announced, it offered 20 GT/s and 25 GT/s data rates, which at the time was higher than PCIe 4.0 at 16 GT/s, and the protocol added coherency capabilities. Today, CCIX v1.1 offers data rates up to 32 GT/s and supports cache coherency, enabling multiple chips to share memory via a virtual memory space. Components that are connected in a single system become part of a large memory pool, eliminating the need to transfer large amounts of data between the processor and accelerator. CCIX enables heterogeneous computing with the ability to support mesh architectures where many CPUs or accelerators are interconnected and share data coherently.

While a CCIX implementation is very similar to PCIe, it implements two virtual channels (VCs): one each for the coherent and non-coherent traffic, resulting in latency on the order of PCI Express or slightly higher, which may not be appealing for HPC applications. Since CCIX is a symmetric protocol, every device in a CCIX implementation behaves the same and leverages a Home Agent where caching is managed. Due to the inherent symmetry, a coherency issue in any device can be detrimental to the entire system and not just the SoC.

CXL

CXL is ideal for host-to-device heterogeneous computing with support anticipated from all four CPU providers – Intel, IBM, Arm, and AMD. Unlike CCIX, CXL is an asymmetric protocol giving the host exclusive control of memory coherency and memory access. The advantages are a much simpler implementation of CXL devices, without the need for the Home Agent, which means that any mishandling of memory by a device will not cause system failure.

CXL runs across the PCIe physical layer, which is currently the PCIe 5.0 protocol operating at 32 GT/s. It uses a flexible processor port that can auto-negotiate a high-bandwidth CXL link, for example a x16 link, seamlessly plugging into either a PCIe or CXL card. Merging IO (.io), cache (.cache), and memory (.mem) protocols into one, CXL enables high bandwidth with an extremely low-latency interface, allowing the processor and accelerator to leverage a converged memory space. A converged memory space allows different memories such as HBM for the accelerator and DDR for the processor to be shared coherently. The required CXL.io protocol is effectively a PCIe link, and is used for discovery, register access, configuration of the link, and link bring up, while the .cache and .mem protocols are used for low-latency coherent data exchange, and one or both must be implemented to create a complete CXL link.

CXL delivers much lower latency than PCIe and CCIX by implementing the SerDes architecture in the newest PIPE specification, essentially moving the PCS layer, and its associated latency, from inside the PHY to the controller and allowing the CXL.cache and CXL.mem traffic to split from the CXL.io traffic very early in the stack. This combines with the inherent low latency of the CXL stack to give CXL lower latency than either PCIe or CCIX.

The three CXL protocols can be combined to create three distinct device types. Since the CXL.io protocol is mandatory it is implemented in all device types.

A Type 1 device implements CXL.io and CXL.cache protocols to allow attached devices like accelerators and smart NICs to cache and coherently access the host cache memory.

A Type 2 device implements all three protocols: CXL.io, CXL.cache, and CXL.mem to process the coherent data between the host and device-attached memory to optimize performance for a given task, allowing the Device to cache the Host memory and the Host to access attached device memory within a unified memory space.

Type 3 devices, such as memory expanders, are a very interesting implementation for HPC applications leveraging CXL.io and CXL.mem to allow the Host processor to access attached Device memory as if it were part of its own memory space.
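The three device types reduce to a mapping from type to required sub-protocols. The sketch below is a hypothetical helper, not code from the CXL specification, encoding the rules just described:

```python
# Hypothetical helper, not from the CXL specification: which CXL
# sub-protocols each device type implements. CXL.io is mandatory everywhere.
CXL_DEVICE_TYPES = {
    1: {"io", "cache"},          # accelerators/smart NICs caching host memory
    2: {"io", "cache", "mem"},   # accelerators with their own (e.g., HBM) memory
    3: {"io", "mem"},            # memory expanders
}

def protocols(device_type: int) -> set:
    """Return the sub-protocol set for a CXL device type."""
    protos = CXL_DEVICE_TYPES[device_type]
    assert "io" in protos, "CXL.io is required for every device type"
    return protos

print(sorted(protocols(2)))  # ['cache', 'io', 'mem']
```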

Use cases for CXL Type 1 and 2 devices are applications that leverage accelerators, graphics, and computational storage. Use cases for Type 3 devices are applications that require storage class memory (persistent memory) and DDR that potentially will work over CXL. Replacing DDR controllers with CXL links is a new use case which the industry is exploring, leveraging the coherent memory access capabilities of CXL to make the SoC and board design less complex, versus using additional DDR memory. Another emerging application for CXL is the use of the CXS interface as an alternative to the separate CXL.cache and CXL.mem protocols. This approach can enable things like CCIX over CXL, potentially allowing support for a mesh network architecture and symmetric operation using the CCIX protocol, but over the low latency CXL link. This CCIX over CXL approach, using the CXS interface, enables links between multiple SoCs using CCIX while benefiting from the extremely low-latency provided by CXL.

Comparing the Three Interfaces

The PCIe interface is the de-facto standard for external connectivity in a wide range of applications including HPC. The ecosystem has developed and adopted new alternative protocols such as CCIX and CXL that leverage the PCIe physical layer and add several additional benefits like cache coherency and low latency. When selecting the right protocol, designers must make several trade-offs to best fit the needs of their target applications. Table 1 summarizes the unique characteristics of each protocol.

Table 1: Unique characteristics of PCIe, CXL, and CCIX best suited for HPC designs

While maximum bandwidth is the same across the three protocols, CXL offers the best latency at lower than 40 nanoseconds round trip by implementing the SerDes architecture and a CXL design from the ground up. Because CCIX is a symmetric protocol with support for mesh architecture, it adds connectivity for multiple accelerators. PCIe typically transfers a large block of data through a direct memory access (DMA) mechanism whereas CXL uses a dedicated CXL.mem protocol for short data exchanges and extremely low latency. Very similar to PCIe, CCIX uses a dedicated memory mechanism through two channels – coherent channel through VC1 and non-coherent channel through VC0.

The ecosystem has adopted PCIe successfully for a long time and understands its complexity and ways to manage it. CCIX adds the complexity of requiring a controller that supports two VCs and a Home Agent implemented in every CCIX SoC; however, it offers slightly lower latency than PCIe and support for cache coherency. CXL adds the complexity of requiring a new controller, more interfaces, and more pins; however, it offers even lower latency than PCIe and CCIX in addition to cache coherency. PCIe, across five generations and with PCIe 6.0 in the near future, has been proven and interoperated with third-party products. The newest interface, CXL, is being rapidly adopted by the industry, with products expected in 2021. Intel has already announced a future Xeon Scalable processor with CXL support. CCIX, while introduced ahead of CXL, has been on a slower adoption path due to CXL's more efficient memory access mechanism and lower latency.

Conclusion

While each HPC SoC and cloud system has its own challenges and requirements, they all face compute latency, storage latency, and network latency. Understanding the latest interface IP standards that are available, along with their benefits and tradeoffs, can help designers minimize latency while integrating features that make their SoCs and systems stand above the competition.

Synopsys has delivered PCIe IP solutions to thousands of successful designs across the five generations of standards. For example, Synopsys recently announced and demonstrated the industry’s first DesignWare PCI Express 5.0 IP Interoperability with Intel’s Future Xeon Scalable Processor. In addition, Synopsys’ collaboration with Intel on CXL allowed us to deliver the industry’s first DesignWare CXL IP solution, including controller and 32GT/s PHY. We are working with other CPU vendors to support new applications using DesignWare CCIX and CXL IP for latency-optimized cloud and HPC solutions.

For more information:

DesignWare IP for HPC & Cloud Computing SoCs

DesignWare IP for PCIe

DesignWare CCIX IP

DesignWare CXL IP

Authors:
Scott Durrant, Strategic Marketing Manager, and Gary Ruggles, Sr. Product Marketing Manager, Synopsys

Also Read:

USB 3.2 Helps Deliver on Type-C Connector Performance Potential

Synopsys is Enabling the Cloud Computing Revolution

Synopsys Delivers a Brief History of AI chips and Specialty AI IP


Register File Design at the 5nm Node

Register File Design at the 5nm Node
by Tom Dillinger on 03-10-2021 at 2:00 pm

lowVt bitcell

“What are the tradeoffs when designing a register file?”  Engineering graduates pursuing a career in microelectronics might expect to be asked this question during a job interview.  (I was.)

On the surface, one might reply, “Well, a register file is just like any other memory array – address inputs, data inputs and outputs, read/write operation cycles.  Maybe some bit masking functionality to write a subset of the data inputs.  I’ll just use the SRAM compiler for the foundry technology.”  Alas, that answer will likely not receive any kudos from the interviewer.

At the recent International Solid-State Circuits Conference (ISSCC 2021), TSMC gave an insightful technical presentation on their unique approach to register file implementation for the 5nm process node. [1]

The rest of this article highlights their design decisions and implementation tradeoffs.  I would encourage SemiWiki readers to obtain a copy of the paper and delve more deeply into this topic (particularly before a job interview).

Register File Bitcell Implementation Options

There are three general alternatives for selecting the register file bit cell design:

  • an array of standard-cell flip-flops, with standard cell logic circuitry for row decode and column mux selection

The figure above illustrates n registers built from flip-flops, with standard logic to control the write and read cycles (shown separately above) – one write port and two read ports are shown.

  • a conventional 6T SRAM bitcell

The figure above illustrates an SRAM embedded within a stdcell logic block, where the supply voltage domains are likely separate.  Additional area around the SRAM is required, to accommodate the difference between the conventional cell layout rules and the “pushed” rules for (large) SRAM arrays.

  • a unique bitcell design, optimized for register file operation

For the 5nm register file compiler, TSMC chose the third option using the bitcell illustrated above, based on the considerations described below.  Note that the 16-transistor cell includes additional support for masked bit-level write, using the additional CL/CLB inputs.  The TSMC team highlighted that this specific bit-write cell design reduces the concern with cell stability for adjacent bitcells on the active wordline that are not being written – the “half-select” failure issue (wordline selected, bit column not selected).

Bitcell Layout

The foundry SRAM compiler bitcell typically uses unique (aggressive) layout design rules, optimized for array density.  Yet, there are specific layout spacing and dummy shape transition rules between designated SRAM macros and adjacent standard cell logic – given the large number of register files typically present in an SoC architecture, this required transition area is inefficient.

Flip-flops use the conventional standard cell design layout rules, with fewer adjacency restrictions to adjacent logic.

For the TSMC 5nm register file bitcell, standard cell digital layout rules were also used.

Peripheral Circuitry

A major design tradeoff for optimal register file PPA is the required peripheral circuitry around the bitcell array.  There are several facets to this tradeoff:

  • complexity of the read/write access cycle

The flip-flop implementation shown above is perhaps the simplest.  All flip-flop outputs are separate signals, routed to multiplexing logic to select “column” outputs for a read cycle.  Yet, the wiring demand/congestion and peripheral logic depth grows quickly with the number of register file rows.

The SRAM uses dotted bitcell inputs and outputs along the bitline column;  only the cells in the row selected by the decoded address are active on the bitline.  A single peripheral write driver and differential read sense circuit supports the entire column.

The TSMC register file bitcell also adopts a dotted connection for the column, but separates the write and read bit lines.  The additional transistors comprising the read driver in the cell (P6, N6, P7, and N7 in the bitcell figure above) offer specific advantages:

  • the read output is full-swing, and static (while the pass gate N7/P7 is enabled)

No SRAM differential bitline precharge/discharge read access cycle is needed, saving power.  The read operation does not disturb the internal, cross-coupled nodes of the bitcell.

  • the read and write operations are independent

The use of separate WWL and RWL controls allows a concurrent write operation and read operation to the same (“write-through”) or different row.
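The independent WWL/RWL port behavior can be sketched with a minimal 1W1R behavioral model.  This is an illustrative sketch under the assumption that a same-cycle read of the row being written observes the newly written data (the "write-through" case the paper permits); the class and method names are hypothetical:

```python
class RegFile1W1R:
    """Behavioral sketch of a 1-write/1-read port register file with
    write-through semantics: a read of the row being written in the
    same cycle observes the newly written data.  Illustrative only."""

    def __init__(self, rows: int, width: int = 32):
        self.mem = [0] * rows
        self.width = width

    def cycle(self, we: bool, waddr: int, wdata: int, raddr: int) -> int:
        # The write port (WWL path) updates the addressed row first...
        if we:
            self.mem[waddr] = wdata & ((1 << self.width) - 1)
        # ...so the read port (RWL path) sees new data on a same-row access.
        return self.mem[raddr]

rf = RegFile1W1R(rows=32)
rdata = rf.cycle(we=True, waddr=5, wdata=0x1234, raddr=5)  # write-through read
```

Note that a conventional single-port 6T SRAM macro would require two separate access cycles to provide the same behavior.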

Although based on digital standard cell design rules, the peripheral circuitry for the TSMC register file design needs some special consideration.  The read output transfer gate circuit presents a diffusion node at the bitcell boundary, with multiple dotted bitcell rows.  This node is extremely sensitive to switching noise, and requires detailed analysis.

Vt Selection

The choice of standard cell design rules also allows greater flexibility for the TSMC register file bitcell.  For example, low Vt devices could be selectively used in the read buffer for improved performance, with a minor impact on bitcell leakage current, as illustrated below.

VDD Operation

Perhaps the greatest register file implementation tradeoff pertains to the potential range of operating supply voltages available to foundry customers.  At advanced process nodes, the range of supply voltages needed for different target markets has increased.  Specifically, very low power applications require aggressive reductions in VDDmin – e.g., for the 5nm process node, logic functionality down to ~0.4-0.5V (from the nominal VDD=0.75V) is being pursued.

The use of standard cell design rules enables the register file implementation to scale the supply voltage with the logic library – indeed, the embedded register file can be readily integrated with other logic in the block in a single power domain.

Conversely, the traditional SRAM cell design at advanced nodes increasingly requires a “boost” during the write operation, to ensure sufficient design margin across a large number of memory bitcells, using aggressive design rules.  This write assist cycle enables a reduction in the static SRAM supply voltage, reducing the SRAM leakage current.  Yet, it also introduces considerable complexity to the access cycle with the charge-pump boost precursor (possibly even requiring a read-after-write operation to confirm the written data).

Write Power

Another comparison to a conventional SRAM bitcell worth mentioning is that the feedback loop in the TSMC register file bitcell is broken during the write operation.  (Most flip-flop circuits also use this technique.)  In contrast, the write current overdrive needed to flip the state of a conventional SRAM bitcell against its cross-coupled inverters dissipates greater power during this cycle.

Testsite and Measurement Data

The first figure below shows the 5nm register file testsite photomicrograph, with two array configurations highlighted.  The second figure illustrates the measured performance data for 4kb and 8kb register file macros, across VDD and temperature ranges.  Note that the selection of digital design rules enables functional operation down to a very low VDDmin.

(Astute observers will note the nature of temperature inversion in the figure – operation at 0°C is more limited than at 100°C.)

The testsite macros also included DFT and BIST support circuitry – the test strategy (and circuit overhead) is definitely part of the register file implementation tradeoff decision.

Summary:  The Final Tradeoff

Like all tradeoffs, there is a range of applicability which must be taken into account.  For the case of register file implementation using either flip-flops, conventional SRAM bitcells, or a unique bitcell as developed by TSMC for the 5nm node, the considerations are:

  • area:  dense 6T SRAM cells with complex peripheral circuitry versus larger area cells (using digital design rules)
  • VDDmin support (power) and VDDmax capabilities (performance, reliability)
  • masked bit-write requirements
  • test methodology (e.g., BIST versus a simple scan chain through flip-flops)
  • and, last but certainly not least, the number of register file access ports (including concurrent read/write operation requirements)

The TSMC focus for their ISSCC presentation was on a 1W, 1R port architecture.  If more register file ports are needed, the other tradeoff assessments listed above change considerably.

The figure below illustrates the area tradeoff between an SRAM bitcell and the 5nm bitcell, indicating a “cross-over” point at ~40 rows (for 256 columns).  The 4kb (32×128) and 8kb (32×256) register file macros shown earlier fit within the preferred window for the fully digital bitcell design.
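The shape of this area tradeoff can be captured with a simple first-order model: total area is (rows × columns × cell area) plus a fixed peripheral overhead, where the SRAM has the smaller cell but far larger periphery.  The area values below are made up, chosen only to reproduce a ~40-row crossover at 256 columns; they are not TSMC's numbers:

```python
def crossover_rows(cols: int,
                   sram_cell_area: float, digital_cell_area: float,
                   sram_periph_area: float, digital_periph_area: float) -> float:
    """Row count at which the two implementations occupy equal total area.

        total_area = rows * cols * cell_area + peripheral_area

    Below the crossover, the larger digital-rule cell wins because its
    peripheral overhead is far smaller; above it, the dense SRAM cell
    amortizes its complex periphery.  All inputs are illustrative.
    """
    per_row_delta = cols * (digital_cell_area - sram_cell_area)
    return (sram_periph_area - digital_periph_area) / per_row_delta

# Illustrative (assumed) values in arbitrary area units:
rows = crossover_rows(cols=256,
                      sram_cell_area=0.03, digital_cell_area=0.10,
                      sram_periph_area=760.0, digital_periph_area=40.0)
# -> roughly 40 rows, matching the shape of the published crossover
```

The usefulness of the model is in the sensitivity: the crossover moves linearly with the peripheral-area difference, and inversely with both the column count and the per-cell area penalty of the digital-rule bitcell.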

For reference, TSMC also shared this tradeoff for their previous 7nm register file design, as shown below (1W1R ports). [2]  Note that this figure also includes the lower range, where a flip-flop-based implementation is attractive.

Yet, as current SoC architectures demand larger on-die local storage, the unique bitcell design in 5nm supporting optimum 4kb and 8kb macros hits the sweet spot.

Hopefully, this article will help you nail the register file design job interview question.   🙂

I would encourage you to read the TSMC papers describing their design approach and tradeoff assessments on 5nm (and 7nm) register file implementations.

-chipguy

References

[1]  Fujiwara, H., et al., “A 5nm 5.7GHz@1.0V and 1.3GHz@0.5V 4kb Standard-Cell-Based Two-Port Register File with a 16T Bitcell with No Half-Selection Issue”, ISSCC 2021, paper 24.4.

[2]  Sinangil, M., et al., “A 290mV Ultra-Low Voltage One-Port SRAM Compiler Design Using a 12T Write Contention and Read Upset Free Bitcell in 7nm FinFET Technology”, VLSI Symposium 2018.