
VersionVault brings SCM/DM capabilities to EDA World – with Cadence Virtuoso Integration
by manish_virmani on 03-11-2021 at 10:00 am


HCL VersionVault is a secure enterprise solution for version control and configuration management. With the HCL VersionVault – Cadence Virtuoso Integration, VersionVault brings its enterprise configuration management capabilities to analog and mixed-signal designers, who can take advantage of VersionVault’s core capabilities without leaving their familiar design environment. Custom chip designers can complete VersionVault actions from within Cadence Virtuoso.

  • Salient Features:

The VersionVault Cadence integration offers an advanced set of capabilities that makes it a good fit for IC designers.

Figure 1: Integration Capabilities

  • Instant Workspace Creation

With dynamic views, designers can create workspaces based on a desired configuration instantaneously, irrespective of the size of the design libraries (which can run into gigabytes). No client-side downloading of content is needed.

  • Rich Graphical & Command-line support

The integration supports all prominent design management use cases from Cadence Virtuoso’s graphical interfaces, i.e., the Library Manager and Cell View Editors. It also provides a dedicated command-line interface for all major design management operations.

  • Library Manager:

Figure 2: DM Operations via Context Menus in the Library Manager

  • Cell View Editors:

Figure 3: DM Operations via CVE

  • Command Line

Figure 4: Command Line Interface

  • Interactive Graphical Schematic Diff

The schematic diff tool enables designers to graphically browse through and review changes made across versions of the same schematic design. It provides a means to navigate through any addition, deletion, or modification that has taken place between the schematic versions being compared. During navigation, the tool also highlights the deltas on the schematic editor when they are part of a visible design component.

Figure 5: Graphical Schematic Diff
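To make the idea concrete, a schematic diff conceptually classifies every instance as added, deleted, or modified between two versions. The toy Python sketch below uses a hypothetical data model (the real tool diffs Virtuoso cellviews graphically, not Python dicts) to show just that classification step:

```python
def schematic_diff(old, new):
    """Toy schematic diff: each version maps an instance name to a
    (cell, pin-to-net connections) tuple. Returns the added, deleted,
    and modified instances that a designer would then navigate."""
    added    = sorted(new.keys() - old.keys())
    deleted  = sorted(old.keys() - new.keys())
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, deleted, modified

v1 = {"M1": ("nmos", {"G": "in", "D": "out"}),
      "R1": ("res", {"A": "out", "B": "vdd"})}
v2 = {"M1": ("nmos", {"G": "in", "D": "mid"}),   # M1 rewired: modified
      "C1": ("cap", {"A": "mid", "B": "gnd"})}   # C1 new, R1 gone
print(schematic_diff(v1, v2))  # (['C1'], ['R1'], ['M1'])
```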

  • Hierarchical Design Management

The Hierarchy Manager GUI provides a powerful mechanism for examining and traversing a design hierarchy. On the Specification tab, the designer can specify various descent controls, supported by advanced filtering capabilities. The Cell Views tab shows the corresponding results, on which the designer can perform various DM operations.

Figure 6: Hierarchy Manager

  • VersionVault Work Area Manager (WAM)

The VersionVault Work Area Manager is one of the key highlights of this integration. It offers designers an interface for performing advanced design management operations, besides presenting additional VersionVault-specific information for designs, including their mastership.

Figure 7: WorkArea Manager

  • Hardware and Software Co-development

As an enterprise-level source and design management solution, VersionVault helps software/firmware developers and chip designers enjoy server-class security for SW and HW artifacts, maintain history, share components/artifacts across local and/or distributed teams, and adhere to a common process management framework put in place by the organization. Common tooling ensures common training for SW and HW teams, which in turn reduces administration costs.

Will this solution help you and your organization?

  • Do you have a robust enterprise level Design Management solution in place?
  • Does your current Design Management solution enable designers to create workspaces based on a specific configuration instantaneously?
  • Does your current Design Management solution support co-development of Hardware & Firmware, common labeling across components, adherence to a common process, and fine-grained access control across SW/Firmware and HW teams?

If not, VersionVault – Cadence Virtuoso Integration can help you and your organization fill in these gaps.

 


Webinar: How to Protect Sensitive Data with Silicon Fingerprints
by Daniel Nenni on 03-11-2021 at 8:00 am


Data protection is on everyone’s mind these days. The news cycle seems to contain a story about hacking, intrusion or cyber-terrorism on a regular basis. The cloud, our hyperconnected devices and the growing reliance on AI-assisted hardware to manage more and more mission-critical functions all around us make data protection a front-of-mind item for many. There are many approaches to data security, some hardware-based and some software-based, with many blending both. All of them have a common liability – the cryptographic key that unlocks data access. Just as with an impenetrable vault, whoever holds the key neutralizes its protection. An upcoming webinar outlines a way to implement this all-important key in a unique way, one that doesn’t require storing the key at all. Let’s explore how to protect sensitive data with silicon fingerprints.

First, a bit about the company holding the webinar. Intrinsic ID is a unique company that focuses on security IP. Their stated mission is to make it easy to secure any smart device and make the connected world safer. It’s hard to argue with that. At the core of their strategy is something called a physical unclonable function, or PUF technology. This is where the silicon fingerprint comes in. I’ll get back to that in a moment.  If you want more background on the company you can see my recent interview with their CEO, Pim Tuyls.

Back to silicon fingerprints. The concept is to use the innate and unique characteristics of each semiconductor device to create a PUF. A special SRAM cell is used to manifest this capability. It turns out every SRAM cell has its own preferred state each time the SRAM is powered on, resulting from random differences in threshold voltages. Starting with an uninitialized SRAM memory, its power-up response yields a unique and random pattern of 0s and 1s. This pattern is the chip’s fingerprint, since it is unique to a particular SRAM on a particular chip.

If this sounds too easy, it is. The SRAM response is a noisy fingerprint, and turning it into a high-quality and secure key requires special processing, which is done with the Intrinsic ID IP. With this approach, it is possible to reconstruct exactly the same cryptographic key every time and under all environmental conditions. This approach has some significant advantages. The key is not permanently stored anywhere, so it is not present when the device is inactive (no key at rest). Hackers who open the device to compromise memory come up empty-handed. There is a lot more to this process; you’ll need to attend the webinar to learn more.
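To illustrate the concept only – Intrinsic ID’s actual IP uses helper data and error-correcting codes, not the naive scheme below – here is a minimal Python sketch of a noisy SRAM power-up response and key reconstruction by majority voting over repeated reads:

```python
import random

def sram_powerup(fingerprint, flip_prob=0.05):
    """Simulate one noisy power-up: each cell settles to its preferred
    value, but a small fraction flips due to noise."""
    return [b ^ (random.random() < flip_prob) for b in fingerprint]

random.seed(1)
# The chip's "silicon fingerprint": each cell's preferred power-up state.
fingerprint = [random.randint(0, 1) for _ in range(256)]

# Reconstruct a stable key by majority voting across several power-ups
# (a stand-in for the fuzzy-extractor/ECC processing used in practice).
readings = [sram_powerup(fingerprint) for _ in range(9)]
key_bits = [int(sum(col) > len(readings) // 2) for col in zip(*readings)]
print("key matches fingerprint:", key_bits == fingerprint)  # almost surely True
```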

Beyond the basics of how silicon fingerprints work, there are many more moving parts in building an actual secure system. The webinar covers all these steps, including how to:

  • Create a PUF root key from a chip’s silicon fingerprint
  • Derive device-unique cryptographic keys for different purposes, applications and users
  • Create a secure vault

This webinar covers a lot of ground. To give you a preview, here are some of the specific topics that you’ll learn about:

  • The need for keys in IoT – many keys are needed, where do they come from?
  • How to keep your (root) keys secure
  • The SRAM PUF and how it creates the root key
  • SRAM PUFs vs. traditional methods
  • Protecting all keys with a key vault
  • Information about the widespread use of these methods

If you’re concerned about protecting data in your next design, you absolutely need to attend this webinar. You’ll learn about methods to lock your data with a key that is never stored anywhere. This is how to protect sensitive data with silicon fingerprints. The webinar will be broadcast on Wednesday, March 24, 2021 at 10:00 AM PDT. You can register for the webinar here. You’ll be glad you attended.

Also Read:

CEO Interview: Pim Tuyls of Intrinsic ID

IDT Invests in IoT Security

IoT Devices Can Kill and What Chip Makers Need to Do Now


Using IP Interfaces to Reduce HPC Latency and Accelerate the Cloud
by Scott Durrant and Gary Ruggles on 03-11-2021 at 6:00 am


IDC forecasts that over the next five years, the Global Datasphere – the amount of data that’s created, transferred over the network, and stored each year – will more than triple to 175 zettabytes (Figure 1). Much of this growth is driven by the Internet of Things (IoT), video applications (including video streaming, social media, online gaming, and augmented and virtual reality applications), and unified communications for video conferencing, text/chat and online voice communications.

Figure 1: Dramatic increase in the amount of network data that’s created, transferred, and stored

All of this data growth is driving the need for more compute power to process data in the cloud and high-performance computing (HPC) systems. To deliver the best experience at the endpoint, systems need faster interfaces to move data from point A to point B, an efficient and high performance storage infrastructure to store and retrieve data, and artificial intelligence (AI) and graphics accelerators to extract meaning from all of this data. High-performance IP can accelerate the design of chips that address these challenges.

Every HPC and cloud application has its own level of latency sensitivity, but they all share three major sources of latency.

Latency Source 1: Network Latency

The first major source of latency is the network itself, including the time to move data between two points. Network latency is impacted by the distance that data must move. For example, with all else being equal, it’s much faster to move data between two nearby buildings than to move it across a continent.

Network latency is also impacted by the number of hops or network devices that the data has to traverse (which is typically directly related to the distance travelled). Minimizing the network distance and the number of hops can help to reduce network latency. To this end, cloud, telecom, and co-location service providers have recently established partnerships to put the power of cloud computing at the edge of the network, closer to the user and to end-user devices.

This helps to minimize latency and converge the data and services closer to the point of use for a much more responsive experience. It delivers smoother and more realistic experiences in applications like video streaming, augmented and virtual reality, and online gaming. (See How AI in Edge Computing Drives 5G and the IoT for a case study on this topic.)

In addition, moving cloud computing closer to the edge accelerates the response time for control system applications. In an automotive application, for example, a car moving at 60 miles per hour travels nearly 9 feet in 100 milliseconds – a blink of the eye. Any delay in data moving between the car and the cloud can be life-threatening. Nearly instantaneous response times give the control system greater precision for increased safety.
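The arithmetic behind that example is worth making explicit. In the sketch below, the shorter round-trip latencies are illustrative assumptions for edge deployments, not measured figures:

```python
MPH_TO_FT_PER_S = 5280 / 3600            # 60 mph = 88 ft/s

def blind_distance_ft(speed_mph, latency_ms):
    """Distance a vehicle travels while waiting on one round trip."""
    return speed_mph * MPH_TO_FT_PER_S * latency_ms / 1000

for latency_ms in (100, 20, 5):          # distant cloud vs. nearer edge tiers
    print(f"{latency_ms:>3} ms -> {blind_distance_ft(60, latency_ms):.1f} ft")
# 100 ms -> 8.8 ft, 20 ms -> 1.8 ft, 5 ms -> 0.4 ft
```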

Latency Source 2: Storage Latency

A second source of latency is the storage and retrieval of data, including the access time of the media. Historically, magnetic hard disk drives (HDDs) were the primary long-term data storage medium. HDDs had access times that were measured in milliseconds. But as solid state drives (SSDs) and persistent memory proliferate, media access time is measured in hundreds of nanoseconds, resulting in a 10,000X improvement in responsiveness (Figure 2).

Figure 2: As applications move from HDDs to persistent memory, systems see a 10,000x improvement in storage latency

The tremendous improvement in storage access times has resulted in network performance becoming the limiting factor for latency in a storage area network. Moving the storage closer to the CPU helps, as does using architectures such as persistent memory and innovative protocols like remote direct memory access (RDMA) to help accelerate storage transactions.

Another emerging technology in the storage space is computational storage. Computational storage combines compute capabilities with storage systems to offload computation or consumption of compute cycles from application servers. Computational storage allows processing to happen within the storage itself, reducing network traffic on the storage network and providing faster responses in certain applications.

Finally, smart network interface cards (NICs) are being adopted to reduce the load on application processors as data is transferred across the storage network. By offloading data transfer protocols, security protocols, and network management tasks from the application processor, smart NICs improve overall system performance for networked applications.

Latency Source 3: Compute Latency

The third source of latency is the actual compute time associated with data processing. The compute cycles and the movement of data between compute modules–between memory and the compute device–all impact data processing time. To address processing latency, designers need to address the amount of bandwidth available and the speed of the data transfer protocols.

Figure 3 shows an example of two chips: a cloud server system-on-chip (SoC), which provides the application compute processing, and a graphics accelerator chip. The graphics accelerator uses HBM memory, and the cloud server chip uses traditional DDR memory. By utilizing a cache coherent interface between these two devices, the memory can be pooled in what we call a “converged memory pool,” and the devices can share memory space without actually having to copy data from one processor’s domain to the other. This type of connection benefits from high-performance interface IP such as PCI Express 5.0, Compute Express Link (CXL), and Cache Coherent Interconnect for Accelerators (CCIX).

Figure 3: Cache coherent interfaces reduce compute latency

PCIe 5.0, CXL, or CCIX for Lowest Latency & Right Feature Set?

While low latency is the goal in cloud environments for fast processing of complex workloads, each protocol provides unique features and functionality that best fit the needs of the target application. Traditionally, servers relied on CPUs and storage for compute resources, which is no longer sufficient for today’s large hyperscale data centers with AI accelerators. A cloud server with a certain amount of memory, AI acceleration, GPUs, and networking capabilities may require, say, two CPUs and four storage devices or two CPUs and one storage device to process a particular workload. Each of these scenarios poses a different server configuration requirement for flexibility and scalability while continuing to focus on the goal of low latency. Let’s now examine the crowded field of low-latency and cache coherent protocols to make it easier for designers to select the technology that best addresses their unique design needs.

While the market is preparing for PCIe 6.0, which is expected to be introduced in 2021, the shift from 16 GT/s PCIe 4.0 to PCIe 5.0 operating at 32 GT/s is quickly ramping up. A quick survey of our current Synopsys DesignWare® IP users shows many designs have already adopted the 32 GT/s PCIe 5.0 interface for their HPC applications. However, with the use of AI accelerators requiring more efficient memory performance, cache coherency combined with high bandwidth has become a critical demand. The CXL and CCIX protocols address this demand by reducing the amount of back and forth copying of data from the memory to processors and accelerators, dramatically lowering latency.

To fully optimize a system, selecting the right interface becomes critical to making the necessary tradeoffs between bandwidth, latency, memory access, topology, and implementation.

PCI Express

PCIe is the de facto standard for chip-to-chip connectivity between a host and a device. A simplified PCIe implementation can consist of a PCIe root port (or root complex) and a PCIe endpoint connected through a four-lane (x4) link. A typical chip-to-chip PCIe implementation is expandable and hierarchical, with embedded switches or switch chips that allow one root port to interface with multiple endpoints. Such an implementation is seen in laptops or even servers, allowing connectivity with different endpoints like Ethernet cards, display drivers, disk drives and other storage devices. However, the limitation of this implementation is seen in large systems with isolated memory pools that require heterogeneous computing, where the processor and accelerator share the same data and memory space in a single 64-bit address space. In other words, the lack of a cache coherency mechanism in PCIe makes memory performance inefficient and latency less than acceptable compared to some of the newer protocols like CXL and CCIX.
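For a sense of the raw numbers involved, the back-of-envelope sketch below computes per-direction link bandwidth from the transfer rate, lane count, and the 128b/130b line encoding used since PCIe 3.0. It deliberately ignores packet and flow-control overhead, which reduce delivered throughput further:

```python
def pcie_bandwidth_gbytes(gt_per_s, lanes, encoding_efficiency=128 / 130):
    """Raw per-direction link bandwidth in GB/s, before packet overhead."""
    return gt_per_s * lanes * encoding_efficiency / 8

for gen, rate in (("4.0", 16), ("5.0", 32)):
    bw = pcie_bandwidth_gbytes(rate, lanes=16)
    print(f"PCIe {gen} x16: ~{bw:.0f} GB/s per direction")
# PCIe 4.0 x16: ~32 GB/s, PCIe 5.0 x16: ~63 GB/s
```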

It is possible to leverage PCIe with what can be referred to as private links to enable data centers with servers that require chip-to-chip communication for multi-processing, or between a processor and multiple accelerators. Private PCIe links can be used when both ends of a chip-to-chip link are owned by the same vendor, as parts of a typical PCIe data stream can be co-opted to help route information from chip to chip, outside of the PCIe protocol itself. Overloading the PCIe header and defining new packets via vendor-defined messages enables messages to reach the intended chip in the chain. While this is not a typical implementation, many Synopsys users have adopted it.

CCIX

When CCIX was announced, it offered 20 GT/s and 25 GT/s data rates, which at the time were higher than PCIe 4.0 at 16 GT/s, and the protocol added coherency capabilities. Today, CCIX v1.1 offers data rates up to 32 GT/s and supports cache coherency, enabling multiple chips to share memory via a virtual memory space. Components that are connected in a single system become part of a large memory pool, eliminating the need to transfer large amounts of data between the processor and accelerator. CCIX enables heterogeneous computing with the ability to support mesh architectures where many CPUs or accelerators are interconnected and share data coherently.

While a CCIX implementation is very similar to PCIe, it implements two virtual channels (VCs), one each for coherent and non-coherent traffic, resulting in latency on the order of PCI Express or slightly higher, which may not be appealing for HPC applications. Since CCIX is a symmetric protocol, every device in a CCIX implementation behaves the same and leverages a Home Agent where caching is managed. Due to this inherent symmetry, a coherency issue in any device can be detrimental to the entire system, not just the SoC.

CXL

CXL is ideal for host-to-device heterogeneous computing, with support anticipated from all four CPU providers – Intel, IBM, Arm, and AMD. Unlike CCIX, CXL is an asymmetric protocol, giving the host exclusive control of memory coherency and memory access. The advantage is a much simpler implementation for CXL devices, without the need for a Home Agent, which means that any mishandling of memory by a device will not cause a system failure.

CXL runs across the PCIe physical layer, which is currently the PCIe 5.0 protocol operating at 32 GT/s. It uses a flexible processor port that can auto-negotiate a high-bandwidth CXL link, for example a x16 link, seamlessly plugging into either a PCIe or CXL card. Merging IO (.io), cache (.cache), and memory (.mem) protocols into one, CXL enables high bandwidth with an extremely low-latency interface, allowing the processor and accelerator to leverage a converged memory space. A converged memory space allows different memories such as HBM for the accelerator and DDR for the processor to be shared coherently. The required CXL.io protocol is effectively a PCIe link, and is used for discovery, register access, configuration of the link, and link bring up, while the .cache and .mem protocols are used for low-latency coherent data exchange, and one or both must be implemented to create a complete CXL link.

CXL delivers much lower latency than PCIe and CCIX by implementing the SerDes architecture in the newest PIPE specification, essentially moving the PCS layer, and its associated latency, from inside the PHY to the controller and allowing the CXL.cache and CXL.mem traffic to split from the CXL.io traffic very early in the stack. This combines with the inherent low latency of the CXL stack to give CXL lower latency than either PCIe or CCIX.

The three CXL protocols can be combined to create three distinct device types. Since the CXL.io protocol is mandatory it is implemented in all device types.

A Type 1 device implements CXL.io and CXL.cache protocols to allow attached devices like accelerators and smart NICs to cache and coherently access the host cache memory.

A Type 2 device implements all three protocols: CXL.io, CXL.cache, and CXL.mem to process the coherent data between the host and device-attached memory to optimize performance for a given task, allowing the Device to cache the Host memory and the Host to access attached device memory within a unified memory space.

Type 3 devices, such as memory expanders, are a very interesting implementation for HPC applications leveraging CXL.io and CXL.mem to allow the Host processor to access attached Device memory as if it were part of its own memory space.

Use cases for CXL Type 1 and 2 devices are applications that leverage accelerators, graphics, and computational storage. Use cases for Type 3 devices are applications that require storage class memory (persistent memory) and DDR that potentially will work over CXL. Replacing DDR controllers with CXL links is a new use case which the industry is exploring, leveraging the coherent memory access capabilities of CXL to make the SoC and board design less complex, versus using additional DDR memory. Another emerging application for CXL is the use of the CXS interface as an alternative to the separate CXL.cache and CXL.mem protocols. This approach can enable things like CCIX over CXL, potentially allowing support for a mesh network architecture and symmetric operation using the CCIX protocol, but over the low latency CXL link. This CCIX over CXL approach, using the CXS interface, enables links between multiple SoCs using CCIX while benefiting from the extremely low-latency provided by CXL.

Comparing the Three Interfaces

The PCIe interface is the de facto standard for external connectivity in a wide range of applications, including HPC. The ecosystem has developed and adopted new alternative protocols such as CCIX and CXL that leverage the PCIe physical layer and add several benefits such as cache coherency and low latency. When selecting the right protocol, designers must make several trade-offs to best fit the needs of their target applications. Table 1 summarizes the unique characteristics of each protocol.

Table 1: Unique characteristics of PCIe, CXL, and CCIX best suited for HPC designs

While maximum bandwidth is the same across the three protocols, CXL offers the best latency at under 40 nanoseconds round trip, by implementing the SerDes architecture and a CXL design from the ground up. Because CCIX is a symmetric protocol with support for mesh architectures, it adds connectivity for multiple accelerators. PCIe typically transfers a large block of data through a direct memory access (DMA) mechanism, whereas CXL uses a dedicated CXL.mem protocol for short data exchanges and extremely low latency. Very similar to PCIe, CCIX uses a dedicated memory mechanism through two channels – a coherent channel through VC1 and a non-coherent channel through VC0.

The ecosystem has successfully used PCIe for a long time and understands its complexity and how to manage it. CCIX adds the complexity of requiring a controller that supports two VCs and the implementation of a Home Agent in every CCIX SoC; however, it offers slightly lower latency than PCIe plus support for cache coherency. CXL adds the complexity of requiring a new controller, more interfaces, and more pins; however, it offers even lower latency than PCIe and CCIX in addition to cache coherency. PCIe, over five generations and with PCIe 6.0 in the near future, has been proven and interoperated with third-party products. The newest interface, CXL, is being rapidly adopted by the industry, with products expected in 2021. Intel has already announced its future Xeon Scalable processor with CXL support. CCIX, while introduced ahead of CXL, has been on a slower adoption path due to CXL’s more efficient memory access mechanism and lower latency.

Conclusion

While each HPC SoC and cloud system has its own challenges and requirements, they all face compute latency, storage latency, and network latency. Understanding the latest interface IP standards that are available, along with their benefits and tradeoffs, can help designers minimize latency while integrating features that make their SoCs and systems stand above the competition.

Synopsys has delivered PCIe IP solutions to thousands of successful designs across the five generations of standards. For example, Synopsys recently announced and demonstrated the industry’s first DesignWare PCI Express 5.0 IP Interoperability with Intel’s Future Xeon Scalable Processor. In addition, Synopsys’ collaboration with Intel on CXL allowed us to deliver the industry’s first DesignWare CXL IP solution, including controller and 32GT/s PHY. We are working with other CPU vendors to support new applications using DesignWare CCIX and CXL IP for latency-optimized cloud and HPC solutions.

For more information:

DesignWare IP for HPC & Cloud Computing SoCs

DesignWare IP for PCIe

DesignWare CCIX IP

DesignWare CXL IP

Authors:
Scott Durrant, Strategic Marketing Manager, and Gary Ruggles, Sr. Product Marketing Manager, Synopsys

Also Read:

USB 3.2 Helps Deliver on Type-C Connector Performance Potential

Synopsys is Enabling the Cloud Computing Revolution

Synopsys Delivers a Brief History of AI chips and Specialty AI IP


Register File Design at the 5nm Node
by Tom Dillinger on 03-10-2021 at 2:00 pm


“What are the tradeoffs when designing a register file?”  Engineering graduates pursuing a career in microelectronics might expect to be asked this question during a job interview.  (I was.)

On the surface, one might reply, “Well, a register file is just like any other memory array – address inputs, data inputs and outputs, read/write operation cycles.  Maybe some bit masking functionality to write a subset of the data inputs.  I’ll just use the SRAM compiler for the foundry technology.”  Alas, that answer will likely not receive any kudos from the interviewer.

At the recent International Solid-State Circuits Conference (ISSCC 2021), TSMC gave an insightful technical presentation on their unique approach to register file implementation for the 5nm process node. [1]

The rest of this article provides some of the highlights of their design decisions and implementation tradeoffs.  I would encourage SemiWiki readers to obtain a copy of the paper and delve more deeply into this topic (particularly before a job interview).

Register File Bitcell Implementation Options

There are three general alternatives for selecting the register file bitcell design:

  • an array of standard-cell flip-flops, with standard cell logic circuitry for row decode and column mux selection

The figure above illustrates n registers built from flip-flops, with standard logic to control the write and read cycles (shown separately above) – one write port and two read ports are shown.

  • a conventional 6T SRAM bitcell

The figure above illustrates an SRAM embedded within a stdcell logic block, where the supply voltage domains are likely separate.  Additional area around the SRAM is required, to accommodate the difference between the conventional cell layout rules and the “pushed” rules for (large) SRAM arrays.

  • a unique bitcell design, optimized for register file operation

For the 5nm register file compiler, TSMC chose the third option, using the bitcell illustrated above, based on the considerations described below.  Note that the 16-transistor cell includes additional support for masked bit-level writes, using the additional CL/CLB inputs.  The TSMC team highlighted that this bit-write cell design reduces the concern over cell stability for adjacent bitcells on the active wordline that are not being written – the “half-select” failure issue (wordline selected, bit column not selected).

Bitcell Layout

The foundry SRAM compiler bitcell typically uses unique (aggressive) layout design rules, optimized for array density.  Yet, there are specific layout spacing and dummy shape transition rules between designated SRAM macros and adjacent standard cell logic – given the large number of register files typically present in an SoC architecture, this required transition area is inefficient.

Flip-flops use the conventional standard cell layout rules, with fewer restrictions relative to adjacent logic.

For the TSMC 5nm register file bitcell, standard cell digital layout rules were also used.

Peripheral Circuitry

A major design tradeoff for optimal register file PPA is the required peripheral circuitry around the bitcell array.  There are several facets to this tradeoff:

  • complexity of the read/write access cycle

The flip-flop implementation shown above is perhaps the simplest.  All flip-flop outputs are separate signals, routed to multiplexing logic to select “column” outputs for a read cycle.  Yet the wiring demand/congestion and peripheral logic depth grow quickly with the number of register file rows.

The SRAM uses dotted bitcell inputs and outputs along the bitline column;  the decoded row address is the only active circuit on the bitline.  A single peripheral write driver and differential read sense circuit supports the entire column.

The TSMC register file bitcell also adopts a dotted connection for the column, but separates the write and read bit lines.  The additional transistors comprising the read driver in the cell (P6, N6, P7, and N7 in the bitcell figure above) offer specific advantages:

  • the read output is full-swing, and static (while the pass gate N7/P7 is enabled)

No SRAM differential bitline precharge/discharge read access cycle is needed, saving power.  The read operation does not disturb the internal, cross-coupled nodes of the bitcell.

  • the read and write operations are independent

The use of separate WWL and RWL controls allows a concurrent write operation and read operation to the same (“write-through”) or different row.
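A behavioral sketch helps pin down these port semantics.  The minimal Python model below is an illustration of the 1W/1R behavior described above (not TSMC’s implementation), including the same-row “write-through” case:

```python
class RegFile1W1R:
    """Behavioral 1W/1R register file: separate write (WWL) and read
    (RWL) word lines allow a concurrent write and read each cycle, and
    a read of the row being written returns the new data."""
    def __init__(self, rows, bits):
        self.mem = [[0] * bits for _ in range(rows)]

    def cycle(self, waddr=None, wdata=None, raddr=None):
        if waddr is not None:
            self.mem[waddr] = list(wdata)   # write port
        if raddr is not None:
            return list(self.mem[raddr])    # read port (sees same-cycle write)

rf = RegFile1W1R(rows=32, bits=128)
out = rf.cycle(waddr=5, wdata=[1] * 128, raddr=5)
print(out[:4])  # write-through: [1, 1, 1, 1]
```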

Although based on digital standard cell design rules, the peripheral circuitry for the TSMC register file design needs some special consideration.  The read output transfer gate circuit presents a diffusion node at the bitcell boundary, with multiple dotted bitcell rows.  This node is extremely sensitive to switching noise and requires detailed analysis.

Vt Selection

The choice of standard cell design rules also allows greater flexibility for the TSMC register file bitcell.  For example, low Vt devices could be selectively used in the read buffer for improved performance, with a minor impact on bitcell leakage current, as illustrated below.

VDD Operation

Perhaps the greatest register file implementation tradeoff pertains to the potential range of operating supply voltages available to foundry customers.  At advanced process nodes, the range of supply voltages needed for different target markets has increased.  Specifically, very low power applications require aggressive reductions in VDDmin – e.g., for the 5nm process node, logic functionality down to ~0.4-0.5V (from the nominal VDD=0.75V) is being pursued.

The use of standard cell design rules enables the register file implementation to scale the supply voltage with the logic library – indeed, the embedded register file can be readily integrated with other logic in the block in a single power domain.

Conversely, the traditional SRAM cell design at advanced nodes increasingly requires a “boost” during the write operation, to ensure sufficient design margin across a large number of memory bitcells, using aggressive design rules.  This write assist cycle enables a reduction in the static SRAM supply voltage, reducing the SRAM leakage current.  Yet, it also introduces considerable complexity to the access cycle with the charge-pump boost precursor (possibly even requiring a read-after-write operation to confirm the written data).

Write Power

Another comparison to a conventional SRAM bitcell worth mentioning is that the feedback loop in the TSMC register file bitcell is broken during the write operation.  (Most flip-flop circuits also use this technique.)  The write current overdrive needed to flip the state of an SRAM bitcell with cross-coupled inverters dissipates greater power during the write cycle.

Testsite and Measurement Data

The first figure below shows the 5nm register file testsite photomicrograph, with two array configurations highlighted.  The second figure illustrates the measured performance data for 4kb and 8kb register file macros, across VDD and temperature ranges.  Note that the selection of a digital design approach enables functional operation down to a very low VDDmin.

(Astute observers will note the temperature inversion in the figure – operation at 0°C is more limited than at 100°C.)

The testsite macros also included DFT and BIST support circuitry – the test strategy (and circuit overhead) is definitely part of the register file implementation tradeoff decision.

Summary:  The Final Tradeoff

Like all tradeoffs, there is a range of applicability which must be taken into account.  For the case of register file implementation using either flip-flops, conventional SRAM bitcells, or a unique bitcell as developed by TSMC for the 5nm node, the considerations are:

  • area:  dense 6T SRAM cells with complex peripheral circuitry versus larger area cells (using digital design rules)
  • VDDmin support (power) and VDDmax capabilities (performance, reliability)
  • masked bit-write requirements
  • test methodology (e.g., BIST versus a simple scan chain through flip-flops)
  • and, last but certainly not least,
  • number of register file access ports (including concurrent read/write operation requirements)

The TSMC focus for their ISSCC presentation was on a 1W, 1R port architecture.  If more register file ports are needed, the other tradeoff assessments listed above change considerably.

The figure below illustrates the area tradeoff between an SRAM bitcell and the 5nm register file bitcell, indicating a “cross-over” point at ~40 rows (for 256 columns).  The 4kb (32×128) and 8kb (32×256) register file macros shown earlier fit within the preferred window for the fully digital bitcell design.
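The shape of that crossover is easy to reproduce with a toy model: the SRAM macro pays a large fixed periphery and transition-area overhead but has a smaller bitcell, while the digital-rule register file pays a larger per-bit cost with lighter periphery.  All numbers below are invented for illustration and merely tuned to echo, not reproduce, the ~40-row crossover in the paper:

```python
SRAM_BITCELL = 1.0       # relative area of a pushed-rule 6T SRAM bitcell
RF_BITCELL = 2.4         # relative area of a digital-rule 16T bitcell
SRAM_PERIPHERY = 15000   # sense amps, write assist, stdcell transition area
RF_PERIPHERY = 1200      # lighter periphery for the full-swing static read

def macro_area(rows, cols, bitcell, periphery):
    return rows * cols * bitcell + periphery

for rows in (8, 16, 32, 64, 128):
    rf = macro_area(rows, 256, RF_BITCELL, RF_PERIPHERY)
    sram = macro_area(rows, 256, SRAM_BITCELL, SRAM_PERIPHERY)
    print(f"{rows:>3} rows: RF/SRAM area ratio = {rf / sram:.2f}")
# ratio < 1 (register file smaller) below ~40 rows, > 1 above it
```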

For reference, TSMC also shared this tradeoff for their previous 7nm register file design, as shown below (1W1R ports). [2]  Note that this figure also includes the lower range, where a flip-flop-based implementation is attractive.

Yet, as current SoC architectures demand larger on-die local storage, the unique 5nm bitcell design supporting optimum 4kb and 8kb macros hits the sweet spot.

Hopefully, this article will help you nail the register file design job interview question.   🙂

I would encourage you to read the TSMC papers describing their design approach and tradeoff assessments on 5nm (and 7nm) register file implementations.

-chipguy

References

[1]  Fujiwara, H., et al., “A 5nm 5.7GHz@1.0V and 1.3GHz@0.5V 4kb Standard-Cell-Based Two-Port Register File with a 16T Bitcell with No Half-Selection Issue”, ISSCC 2021, paper 24.4.

[2]  Sinangil, M., et al., “A 290mV Ultra-Low Voltage One-Port SRAM Compiler Design Using a 12T Write Contention and Read Upset Free Bitcell in 7nm FinFET Technology”, VLSI Symposium 2018.


TSMC Plans Six Wafer Fabs in Arizona
by Scotten Jones on 03-10-2021 at 10:00 am

Figure: Rendering of TSMC Fab 18 in Taiwan

There are reports in the media that TSMC is now planning six fabs in Arizona (the image above is Fab 18 in Taiwan). The original post I saw referred to a “Megafab” and claimed six fabs with 100,000 wafers per month (wpm) of capacity for $35 billion. The report further claimed it would be larger than TSMC’s fabs in Taiwan.

This report struck me as unreliable, given that TSMC refers to their large fab clusters as Gigafabs, not Megafabs, and that TSMC’s Fab 12, Fab 14, and Fab 15 each have capacity of around 300,000 wpm, with Fab 18 ramping to over 200,000 wpm.

Now similar reports are being repeated by more reputable sources; notably, today I saw a report in EE News Europe that stated:

  • The site would be a Gigafab (correct terminology).
  • Filings with the city of Phoenix describe three phases of building.
  • TSMC has reportedly offered to double employee salaries to move to the US.

I am still not sure about the six-fab part; the Phoenix documents are reported to describe three phases, although I suppose each phase could be two fabs. The other issue I have is that 100,000 wpm across six fabs is just under 17,000 wpm per fab; those are smaller fabs than TSMC typically builds and would be suboptimal from a cost perspective.

What I think is more likely is three fabs of just over 30,000 wpm each, for a total of 100,000 wpm. Maybe they will build three fabs initially for 100,000 wpm and then have the option to build three more fabs later for an additional 100,000 wpm. Fab 18 in Taiwan has three fabs, P1, P2 and P3, that are running 5nm with an original capacity of just under 30,000 wpm each, although they are now being expanded to 40,000 wpm each, for 120,000 wpm total. There are also P4, P5, and P6 under construction for 3nm that will likely be around 30,000 wpm each initially, bringing the site to around 200,000 wpm.

The $35 billion price tag is high for 100,000 wpm of 5nm, but would make sense if it also included some preparation for additional phases or 3nm capability. I should also point out that the initial budget number for fabs is often an estimate and can increase or decrease as the fab is built, depending on final capacity and how many fab phases are included in the initial amount. I believe TSMC has spent more money on phases 1, 2 and 3 of Fab 18 for 5nm than they originally announced, and will also spend more on phases 4, 5 and 6 for 3nm than originally announced.

My best guess as of today is that the fab will have three phases initially producing 100,000 wpm total, with the option to add three more phases in the future to reach 200,000 wpm; that would be more consistent with TSMC Fab 18 in Taiwan.

However the specifics work out, it does appear that TSMC is now looking at building a full-scale Gigafab in the US instead of the small fab originally planned. I see this as good news for the global semiconductor supply, given the high risk presented by having so much of the world’s leading-edge logic capacity concentrated in Taiwan. This is especially concerning with Taiwan being located on an active fault line, the view in China that Taiwan is a rogue province that must be brought back under Chinese control, and the resource limits of a small island.

 


Cadence Underlines Verification Throughput at DVCon
by Bernard Murphy on 03-10-2021 at 6:00 am


Paul Cunningham, CVP and GM of the System Verification Group at Cadence, gave the Tuesday afternoon keynote at DVCon and doubled down on his verification-throughput message. At the end of the day, what matters most in verification is the number of bugs found and fixed per dollar per day. You can’t really argue with that message. This is the ultimate metric for semiconductor product verification. Cycles per second, debug features, this engine versus that engine – these are ultimately mechanisms for delivering that outcome.

Throughput starts with best in class engines

That’s not to say that cycles per second and so on are unimportant. Paul has a very grounded engineering viewpoint. The horsepower underneath this throughput needs to be the best of the best in each instance. But on top of that, what Paul (and Anirudh) call logistics – the most effective use of these resources to meet that ultimate goal – has become just as important. This view draws on an analogy with package delivery in our increasingly online world. Planes are used for long-distance transportation, long-haul trucks for plane-to-warehouse transportation, and vans for last-mile delivery/pickup. Each mode has strengths and weaknesses: speed versus setup time versus reach. Logistics is about combining these effectively to maximize throughput.

The same applies in verification. Simulation is the last-mile van; emulation is the long-haul truck; and FPGA prototyping is the plane, with the fastest throughput but the longest setup time. Paul added that no analogy is perfect; in verification we also have formal, which plays an important role in this throughput story. Maybe the physical logistics people need to find a parallel to up their game!

Logistics and machine learning

But having fast planes, trucks, and vans is not enough to maximize throughput. FedEx, UPS and others put huge investments into scheduling and routing traffic to meet their throughput goals. The same principle must be applied to verification, for which you need logistics management on top of the engines. Xcelium-ML provides an example: it leverages machine learning (ML) to reduce regression runs while maintaining the same level of coverage. We all know that some number of regression cycles are low value because they’re simply proving, over and over again, that “even though this code didn’t change, we checked it anyway and it still works.” This is particularly important in randomized simulations, where many randomizations may be very low value. The trick is knowing which tests you can drop without missing some subtle but fatal error. That’s where machine learning comes in.
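As a cartoon of the underlying idea (Cadence’s actual ML approach is proprietary and far more sophisticated), even a greedy coverage-based pass shows how redundant randomized tests get dropped:

```python
def prune_regression(tests):
    """Greedy sketch of regression pruning: keep only tests that add
    new coverage. `tests` maps a test name to the set of coverage bins
    it hits; an ML flow would instead learn which stimulus is redundant."""
    covered, keep = set(), []
    # Consider high-coverage tests first so fewer tests are kept overall.
    for name, bins in sorted(tests.items(), key=lambda t: -len(t[1])):
        if not bins <= covered:          # hits at least one new bin
            keep.append(name)
            covered |= bins
    return keep

tests = {
    "smoke":   {"reset", "boot"},
    "rand_01": {"boot", "dma", "irq"},
    "rand_02": {"dma", "irq"},           # fully redundant with rand_01
    "rand_03": {"irq", "pcie_err"},
}
print(prune_regression(tests))  # ['rand_01', 'smoke', 'rand_03']
```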

Another area where ML can play a significant role is in formal proofs and regressions. We often think of formal as hard because there are many different proof engines and methodologies, and you may need several of these to find your way to a proof. Which method to use on which problem has in the past been seen as a question requiring a lot of deep expertise in the domain. The Jasper team has captured much of that expertise through ML methods, to find the best engines and methodologies to quickly arrive at a proof, or to navigate through an optimum chain of alternatives.

Logistics between emulation and prototyping

Better logistics is not always about ML. Cadence has optimized Palladium emulation and Protium FPGA prototyping for better logistics between the engines, through a unified compile front-end, unified transactor support and unified hardware bridge support. When you want to run high-performance emulations with maximum debuggability, you use Palladium. When you want even higher performance for embedded software debug, you switch to Protium – with a minimum of fuss. Run into a problem during Protium debug? Switch back to Palladium for better debug insight. Logistics again: you can switch from truck to plane and back to truck as needs demand.

Optimizing engines for better throughput will always be a priority. Optimizing logistics between regression runs and between engines is what will squeeze out the maximum in bugs found per dollar per day – which is ultimately what we have to care about.

For more information, visit the Cadence verification portal HERE.

Also Read

TECHTALK: Hierarchical PI Analysis of Large Designs with Voltus Solution

Finding Large Coverage Holes. Innovation in Verification

2020 Retrospective. Innovation in Verification


Hearables: From Earbuds to Life Augmentation and Beyond
by Kalar Rajendiran on 03-09-2021 at 10:00 am

Figure: Hearables market players (Source: IDTechEx Research)

As the months of 2020 passed by, I started noticing more and more people sporting what looked like fashionable ear accessories. I’m of course referring to True Wireless Stereo (TWS) earbuds. With the rapid increase in online meetings due to social distancing requirements, adoption of TWS earbuds appeared to be even faster than in prior years. It was hard to believe that just a little more than four years ago, people had pushed back when Apple dropped support for the headphone jack in favor of their branded version of wireless earbuds (marketed as AirPods). My curiosity was piqued. I wanted to learn what was in store not just for the earbuds market but for the broader product category called Hearables, under which earbuds fall.

The term Hearables was introduced in April 2014 simultaneously by Apple, in the context of their acquisition of Beats Electronics, and by product designer and wireless applications specialist Nick Hunn, in a blog post about a wearable technologies internet platform. Interestingly, the initial description of hearable technology seems to have come from a company called Valencell back in 2006. Valencell described it as a wearable ear-worn multimedia platform for health monitoring, entertainment, guidance and cloud-based communications.

Following is a summary of what I learned about the Hearables market – its size, projected growth, players and trends – as well as the opportunities that exist for semiconductor companies to offer valuable solutions to this market.

Market Size: According to Allied Market Research, the global hearables market size was valued at $21.90 billion in 2018 and is projected to reach $93.90 billion by 2026, growing at a CAGR of 17.2% from 2019 to 2026.

There are many players not captured in the above chart, some of whom are:

Starkey, Bragi, Doppler, Miracle-Ear, Valencell, Earin AB, Eargo, AKG, Audio-Technica, Edifier, Xiaomi, Amoi, QCY, and Anker Innovations.

Market Trends: An article titled “Hear come the Hearables,” published in IEEE Spectrum magazine, is a very interesting read and provides scientific insights into the foundational elements of the next wave of devices. The author, Poppy Crum, is Chief Scientist at Dolby Laboratories and an adjunct professor at Stanford University. In her article, she explains that the following data can be effectively accessed through the ears.

  • Heart rate
  • Blood oxygen levels
  • Movement
  • Temperature
  • Eye movements
  • Skin resistance
  • Stress hormone levels
  • Brain electrical activity
  • Vagus nerve stimulation

Note: The vagus nerve is the tenth cranial nerve, extending from its origin in the brainstem through the neck and the thorax down to the abdomen. It carries an extensive range of signals between the digestive system and organs and the brain.

The market can be expected to offer human-friendly solutions that incorporate augmented reality (AR) and IoT as applicable, immersing the user and delivering a level of personal experience that has not been possible before. The next wave will include advanced devices that fit in our ears and leverage artificial intelligence (AI), robotics and IoT, all without interfering with our usual daily activities. These devices will be capable of recognizing one’s physiological, physical and emotional status and proposing or triggering actions in response to that status.

Concerns:

  • Adverse long-term effects on hearing ability due to daily, prolonged use
  • EMF radiation exposure due to daily, prolonged use

The above types of concerns are not new. Will the next wave of devices increase these health risks?

Challenges: Maximizing battery life of these devices.

Opportunities for Semiconductor Companies:

The market opportunity for Hearables devices is big, and so is the opportunity for semiconductor solutions that help implement these devices. But given the highly competitive device market with so many players, there are tremendous time-to-market and cost pressures.

Semiconductor companies that can provide cost-effective, ultra-low power solutions to enable these devices stand to gain a large market share. The Hearables market was an early adopter of near-threshold voltage (NTV) design techniques for their promise of ultra-low power operation. But NTV designs have historically been difficult to implement for reliable operation.

Opportunity #1: Provide an easy and cost-effective way to implement NTV designs for reliable operation. One approach may be via semiconductor IP blocks and supporting software drivers. The IP should be able to adapt the chip’s power usage based on real-time performance needs. The solution should be programmable to the minimum energy point and still be able to step up to process user input at real-time speeds, as sketched below.
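As a sketch of what “programmable to the minimum energy point” could look like in a supporting driver, the Python fragment below picks the lowest-voltage operating point that still meets a real-time deadline. The operating points and workloads are entirely hypothetical:

```python
# Dynamic energy per cycle scales roughly with V^2, so the lowest
# voltage that meets the deadline is (roughly) the minimum energy point.
OPERATING_POINTS = [        # hypothetical (voltage V, frequency MHz) pairs
    (0.40, 20), (0.55, 80), (0.75, 300),
]

def pick_point(cycles_needed, deadline_ms):
    """Lowest-energy operating point that still meets the deadline."""
    for volts, mhz in sorted(OPERATING_POINTS):   # lowest voltage first
        if cycles_needed / (mhz * 1e3) <= deadline_ms:
            return volts, mhz
    return OPERATING_POINTS[-1]                   # saturate at max performance

print(pick_point(cycles_needed=50_000, deadline_ms=10))    # idle hum -> (0.4, 20)
print(pick_point(cycles_needed=600_000, deadline_ms=10))   # burst -> (0.55, 80)
```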

Opportunity #2: Leverage NTV technology for energy harvesting by converting motion energy into electrical energy, thereby prolonging battery life of the hearable device.

ARM capitalized on the mobile market with its low-power RISC processor cores. Similarly, an entity that enables an ultra-low power solution could capitalize on the Hearables market, which is projected to rapidly grow to ~$94 billion in just a few years.

 


A Review of Clock Generation and Distribution for Off-Chip Interfacing
by Tom Dillinger on 03-09-2021 at 6:00 am


At the recent ISSCC conference, Mozhgan Mansuri from Intel gave an enlightening (extended) short course presentation on all things related to clocking, for both wireline and wireless interface design. [1]  The presentation was extremely thorough, ranging from a review of basic clocking principles to unique circuit design strategies for synthesizing and distributing clocked signals.

Personally, I found her talk to be both an excellent refresher and a source of lots of new information (for me, at least) – I thought the highlights of her talk might be of interest to SemiWiki readers.  There was a plethora of topics covered – I’ll focus on the wireline-based design considerations.  I would encourage you to review her ISSCC short course material, covering both wireline and wireless clocking features.

Wireline Datarate Trends

A graph depicting the progress in wireline “per lane datarates” is shown below, for several interface standards.

The PPA benefits of Moore’s Law are paralleled by interface datarate enhancements, doubling every ~2-3 years.  Yet, as wirelines span silicon, packaging, board interconnect, connectors, and cables, silicon technology scaling alone does not account for all of the datarate enhancement.  Improvements in package/PCB materials and simulation tool advances have certainly helped.

The key to this growth has been the ongoing interface circuit enhancements supporting the Tx and Rx ends of the lane.  The associated clock generation (and Rx clock recovery) techniques have been at the heart of those circuit innovations as depicted below, showing both embedded clock in data and forwarded clock options.

 

Clock Definitions

The basic clock definitions are shown below:

  • clock period
  • (50/50) duty cycle
  • clock skew (static duty cycle error, the difference between the half cycle durations)
  • jitter between cycles (dynamic;  both deterministic (e.g., due to supply voltage variations) and random (e.g., due to thermal and flicker noise in devices))

 

 

Note in the last figure above that jitter may accumulate over time, as depicted for the odd-inverter, free-running oscillator clock source.
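The accumulation behavior is easy to see in a small behavioral model: give a free-running oscillator an independent random timing error each period, and the absolute (accumulated) jitter grows roughly as the square root of the number of cycles, even though cycle-to-cycle jitter stays bounded.  The numbers below are arbitrary, for illustration only, and model white noise alone (no flicker noise or deterministic jitter):

```python
import random

def edge_times(n_cycles, period_ps, rj_sigma_ps, seed=0):
    """Edge timestamps of a free-running oscillator: each period gets an
    independent Gaussian error, so absolute jitter random-walks."""
    rng = random.Random(seed)
    t, edges = 0.0, []
    for _ in range(n_cycles):
        t += period_ps + rng.gauss(0.0, rj_sigma_ps)
        edges.append(t)
    return edges

edges = edge_times(n_cycles=10_000, period_ps=100.0, rj_sigma_ps=0.5)
for n in (1, 100, 10_000):
    drift = edges[n - 1] - n * 100.0    # deviation from the ideal edge time
    print(f"edge {n:>6}: accumulated jitter = {drift:+.1f} ps")
# drift grows ~sqrt(N): ~0.5 ps scale at N=1, ~5 ps at N=100, ~50 ps at N=10,000
```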

The figure below illustrates two key measurements (and specs) for clock distribution.  The first half of the figure illustrates the frequency response of a circuit to the jitter frequency content;  the second illustrates the “tolerance” of the Rx clock recovery circuitry to jitter.

 

The figures include a typical specification “mask” over frequency.  The “ideal” jitter transfer curve depicted above provides a “0 dB, no jitter amplification” target mask through a clock distribution component.  The jitter tolerance mask spec enables designers to develop the Rx clock recovery circuitry, subsequently ensuring that the Tx jitter sources do not exceed the mask limits.

Clock Synthesis Circuitry

To generate high-frequency clocks on-chip, the common method is to employ one of two main circuit types – a phase-locked loop (PLL) or a delay-locked loop (DLL).  Their principal function is to provide a “multiplied” clock output derived from a lower-frequency (high-quality) reference clock, as described below.  Another key clock synthesis configuration, the “injection locked oscillator” (ILO), is used to phase-align individual clocks tapped from an on-die oscillator.

  • PLL

 

The PLL consists of:

  • a voltage-controlled oscillator – e.g., a free-running oscillator with adaptive response to an input voltage signal that modulates the oscillator loop delay (examples given shortly)
  • a divide-by-N counter (the multiplicative factor of the PLL)
  • a phase detector, that provides an output signal proportional to the leading/lagging phase difference between the reference and divided VCO clocks (example shortly)
  • a low-pass filter that effectively blocks short-duration signals from the phase detector from influencing the control input to the VCO

The frequency bandwidth response of the PLL defines the jitter response, a key design tradeoff.  For example, a lower bandwidth will reduce the sensitivity to jitter in the reference clock input.  A higher bandwidth will reduce the sensitivity to VCO jitter.
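A tiny discrete-time model ties these pieces together: the phase detector measures the error between the reference and the divided VCO clock, a proportional-plus-integral loop filter converts it to a control value, and the VCO frequency follows.  This is a behavioral sketch with made-up gains, not a circuit design; raising kp/ki raises the loop bandwidth, with exactly the jitter tradeoff described above:

```python
def pll_step_response(f_center, f_ref, n_div, kp, ki, steps=100):
    """Behavioral PLL: returns the VCO frequency and residual phase
    error after `steps` reference cycles. Gains are normalized to the
    target frequency f_ref * n_div."""
    f_target = f_ref * n_div
    phase_err = integ = 0.0
    f_vco = f_center
    for _ in range(steps):
        phase_err += 1.0 - f_vco / f_target     # phase detector (ref cycles)
        integ += ki * phase_err * f_target      # integral path (loop filter)
        f_vco = f_center + kp * phase_err * f_target + integ   # VCO control
    return f_vco, phase_err

f, e = pll_step_response(f_center=1.45e9, f_ref=100e6, n_div=16, kp=0.5, ki=0.05)
print(f"VCO settles at {f/1e9:.3f} GHz, residual phase error {e:+.4f} cycles")
# -> locks at ~1.600 GHz (100 MHz x 16); the integral term is what drives
#    the static phase error toward zero
```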

  • DLL

 

The figure above illustrates the principles underlying a (multiplying) delay-locked loop (DLL).  The free-running VCO of the PLL is replaced by a delay line whose individual delay elements are controlled by the phase detector and low-pass filter output – in the figure, a simple inverter delay chain is shown.  The jitter in the DLL clock output is “reset” by using the reference clock edge every N cycles, with a multiplexer providing the delay chain input – see the timing diagram in the figure.

  • Injection Locked Oscillator

Another option for clock synthesis is the use of injection current into an oscillating system to provide output clock phase adjust control.

A high-level block diagram of the ILO is shown below. [2]  There are three components of note:

  • an oscillator (depicted simply as an nFET and inverting amplifier)
  • a tuned tank circuit
  • the injection current source

Recall the physics experiment where multiple metronomes of (nominally) the same time period are loosely-coupled – over time, they will synchronize (YouTube video link).

An injection current of frequency f will similarly synchronize the output voltage of the combined system to this frequency.  However, due to the relative impedances of the three components, there will be a resulting phase shift between the system output voltage and the constituent currents I_tank, I_osc, and I_inj, as depicted below.

In short, Vout = (Z_tank * I_tank), where I_tank = (I_osc + I_inj).  These are complex quantities with both magnitude and phase.  The key feature of the ILO is that the magnitude of the injected current adjusts the phase of the output voltage.

The ILO is thus an ideal method to align (or “rotate”) the phase of a clock output, relative to a reference – the phase difference detector increases/decreases the magnitude of the injected current accordingly.

Consider the case where it is desirable to generate clocks from multiple internal stages of an oscillator, each clock shifted/aligned by a specific phase.  The example below shows 4 clocks of the same frequency, each phase shifted by 90 degrees.

Logical operations on these shifted clocks derive unique pulses – e.g., clock_0 AND clock_270.  When presented with data training patterns whose transitions correspond to logical operations of these shifted clocks, phase differences between the data and the clock pulses can be detected and aligned using the injection-lock current.  Once aligned, the clocks can then be used to transmit/receive data at a high datarate – 4X the reference clock frequency in the example above.
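The quadrature-pulse idea can be sanity-checked in a few lines of Python.  The textual waveforms below (a behavioral illustration only) show four 90-degree-shifted clocks and the quarter-period pulse produced by clock_0 AND clock_270:

```python
def shifted_clock(phase_deg, t, period=1.0):
    """Ideal 50/50 square wave, high for the first half of each period."""
    return ((t / period - phase_deg / 360.0) % 1.0) < 0.5

samples = 16    # sample one reference period at 16 points
waves = {f"clk_{p}": [shifted_clock(p, k / samples) for k in range(samples)]
         for p in (0, 90, 180, 270)}
waves["clk_0 AND clk_270"] = [a and d for a, d in
                              zip(waves["clk_0"], waves["clk_270"])]
for name, wave in waves.items():
    print(f"{name:>17}: " + "".join("#" if v else "." for v in wave))
# The AND term pulses high for one quarter of each period; the four such
# pulses from the four clock pairs tile the period, enabling 4X-rate operation.
```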

The previous discussion referred to the block diagrams of the clock generation circuitry – Mozhgan elaborated on these units in her presentation.

  • VCO

Examples of voltage-controlled oscillators from her talk are shown in the figure below.

The first example is a simple (odd-numbered) loop of inverters, providing a free-running oscillation – the delay of each stage is modified by the voltage control signal.  (Other means of introducing delay control are also frequently used – e.g., adding a variable capacitive load to each stage using a varactor;  using “current-starved” inverters with an additional series nFET/pFET in the pulldown/pullup stack, whose device gates provide the voltage control input.)  A disadvantage of this free-running topology is its sensitivity to noise on the supply/control input.

The second example shown above includes an operational amplifier/regulator as a low-pass filter to improve the supply noise rejection.
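Returning to the first (ring oscillator) example, the free-running frequency follows directly from the per-stage delay: f = 1 / (2 * N * t_stage).  The sketch below uses assumed delay values, purely to show the voltage-to-frequency control relationship.

```python
# Back-of-envelope ring-oscillator sketch: a loop of N inverters oscillates
# at f = 1 / (2 * N * t_stage).  The control voltage tunes t_stage; the
# delay values below are assumptions for illustration.
def ring_osc_freq(n_stages, t_stage_s):
    return 1.0 / (2 * n_stages * t_stage_s)

N_STAGES = 5
for t_stage_ps in (10, 12, 15):          # per-stage delay vs. control voltage
    f = ring_osc_freq(N_STAGES, t_stage_ps * 1e-12)
    print(f"t_stage = {t_stage_ps} ps -> f_osc = {f/1e9:.2f} GHz")
```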

  • Phase Difference Detector

The clock generation circuits that compare a reference to a (divided) clock use a phase difference detector to provide the control signal(s) to the VCO.  There are numerous detector topologies in common use – a simple (digital) example implementation is shown below. [3, 4]

This topology fits with oscillator control circuits that use two inputs – “UP” and “DOWN” – to represent a lagging/leading phase difference between the reference and generated clocks.  (A low-pass filter is needed to remove any spurious flop output pulses between the rising clock edge and the asynchronous reset input.)
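A behavioral sketch of the two-flop detector (my simplification of the circuit described in [3], ignoring the finite reset pulse width) captures the UP/DOWN behavior: the earlier rising edge sets its flop, the later edge triggers the asynchronous reset, and the surviving pulse width is proportional to the phase error.

```python
# Behavioral sketch of the two-flop phase-frequency detector: the earlier
# rising edge raises its output; the later edge raises the other, and the
# AND of both asynchronously resets the pair.  The pulse on the "winning"
# output is proportional to the phase error.  Edge times are assumed (ns).
def pfd_pulses(ref_edges, fb_edges):
    pulses = []
    for t_ref, t_fb in zip(ref_edges, fb_edges):
        if t_ref < t_fb:
            pulses.append(("UP", round(t_fb - t_ref, 3)))    # ref leads: speed up VCO
        else:
            pulses.append(("DOWN", round(t_ref - t_fb, 3)))  # fb leads: slow down VCO
    return pulses

ref = [0.0, 10.0, 20.0]      # reference clock rising edges
fb  = [1.5, 10.8, 19.6]      # divided-VCO rising edges
print(pfd_pulses(ref, fb))   # [('UP', 1.5), ('UP', 0.8), ('DOWN', 0.4)]
```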

Clock Distribution

Mozhgan presented some of the common design topologies for distributing an on-die generated clock to the (Tx or Rx) fanout.  The figure below depicts three examples, for the case where a single (global) clock spans a considerable distance before being tapped to a series of sinks:

  • a (differential, low-swing signaling) repeaterless topology, treating the interconnect as an LC transmission line
  • an inverter repowering chain
  • a chain driven by (differential) current-mode logic inverters

(The differential methods require additional circuitry at the clock sinks.)  These topologies present different tradeoffs relative to jitter, phase skew, slew-rate degradation from bandwidth losses, power dissipation, and power supply noise rejection.  Clock distribution planning is clearly an integral part of developing a Tx or Rx interface solution.

Mozhgan’s presentation covered a wealth of additional topics not highlighted here – e.g., wireline Rx clock-data alignment strategies (for both forwarded-clock and embedded SerDes clock interfaces), clock generation for wireless transmitters/receivers, and clock power optimization.  Hopefully, the few topics presented here have whetted your appetite to learn more about the unique characteristics of Tx/Rx clocking.  I would encourage you to review Mozhgan’s ISSCC presentation.

-chipguy

 

References

[1]  Mozhgan Mansuri, “Clocking, clock distribution, and clock management in wireline and wireless subsystems”, ISSCC 2021, Short Course SC-3.

[2]  http://rfic.eecs.berkeley.edu/ee242/pdf/Module_7_4_IL.pdf

[3]  https://analog.intgckts.com/phase-locked-loop/phase-frequency-detector/

[4]  https://www.electronics-tutorials.ws/filter/filter_5.html

 

 


USB 3.2 Helps Deliver on Type-C Connector Performance Potential

USB 3.2 Helps Deliver on Type-C Connector Performance Potential
by Tom Simon on 03-08-2021 at 10:00 am

USB 3.2 Lane Usage

Despite sounding like a minor version increment, USB 3.2 introduces many important changes to the USB specification. To see where USB has come from and where it is going, it is essential to look at what USB 3.2 contains. The other salient point is that the Type-C connector has now split out from the underlying USB specification and taken on a life of its own. Additionally, it is important to understand USB 3.2 because it plays a key role in the USB4 standard.

Synopsys, as a major provider of USB IP and contributor to the standards, has published an informative white paper that clearly explains what is new in the USB 3.2 specification and how all the elements, including the Type-C connector, work together. The paper, written by Morten Christiansen, Technical Marketing Manager at Synopsys, is titled “USB 3.2: A USB Type-C Challenge for SoC Designers”.

USB 3.0 was fine for mechanical disk drives with spinning platters, but Flash-based SSDs easily exceed the available bandwidth. USB 3.2 offers up to 20Gbps, four times the throughput of USB 3.0. USB 3.2 also allows more flexibility for connected display devices. In addition to supporting longer cables for video, it allows an alternate mode for higher-bandwidth video that uses all of the additional lanes to carry display data.

This brings us to why Type-C connectors are so important to USB 3.2 and beyond. There are four sets of differential pairs on the Type-C connector. Previously, with 3.0 and 3.1, only one TX and RX pair was used; the actual pairs used depended on the orientation of the connector. The nomenclature for USB 3.2 connection speeds is noted as Gen X x Y, where X denotes the lane speed and Y denotes the number of lanes used. Gen 1 is 5Gbps, and Gen 2 is 10Gbps. Thus, Gen 2×1 is one lane at 10G, and Gen 1×2 is two lanes at 5G each, for a total of 10G. Consumer-facing information on speeds will focus on the resultant speed and not on the internal mechanics or version numbers.
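The naming arithmetic is easy to capture in a few lines; this sketch uses only the per-lane speeds quoted above.

```python
# USB 3.2 "Gen X x Y" naming: X selects the per-lane speed, Y the lane count.
GEN_LANE_GBPS = {1: 5, 2: 10}    # Gen 1 = 5 Gbps/lane, Gen 2 = 10 Gbps/lane

def usb32_throughput_gbps(gen, lanes):
    return GEN_LANE_GBPS[gen] * lanes

for gen, lanes in [(1, 1), (2, 1), (1, 2), (2, 2)]:
    print(f"Gen {gen}x{lanes}: {usb32_throughput_gbps(gen, lanes)} Gbps")
# Gen 2x2 yields the headline 20 Gbps figure.
```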

USB 3.2 Lane Usage

Higher data rates open up some interesting options for using USB in new ways. Synopsys suggests that using existing USB ports for debug data can save on extra hardware ports and allow for much better tracing and debug. Synopsys USB device controllers support External Buffer Control (EBC) for efficient movement of debug data through USB ports. The automotive market will also see benefits from USB 3.2 due to longer permitted cable lengths. Higher data rates here will also help speed infotainment system firmware and application updates. These might include maps and navigation data, etc.

The Synopsys white paper does an excellent job of describing the lane bonding and data striping used to increase transfer rates. The paper also covers the changes required in the USB controller to handle the Ordered Sets for USB 3.2 and the encoding it uses. The paper points out that the higher 20Gbps data rate can reveal issues in existing device-controller CPU/memory configurations or software stacks, even though the previous software stack is compatible with USB 3.2.  In the PHY, it is essential to move to two independent RX/TX lane pairs and a digital crossbar, instead of relying on analog multiplexers as was sufficient at the older Gen 1 data rates.

At the end of the paper, the author discusses the methods Synopsys uses to prototype and test their IP and silicon. They use HAPS-80 FPGA-based prototyping systems for their USB 3.2 controller IP development. For example, they are able to set up systems with both prototyped USB 3.2 Host and Device controllers, and then run the xHCI software stack on a connected Windows system.

Synopsys includes links to USB 3.2 resources for those interested in digging deeper. Their paper does a good job of spelling out the important points needed to better understand USB 3.2 and how it fits into the entire USB roadmap. As mentioned before, they touch on how USB 3.2 fits into USB4 and will continue to play an important role as USB moves forward. The paper is available for download at the Synopsys website.

Also Read:

Synopsys is Enabling the Cloud Computing Revolution

Synopsys Delivers a Brief History of AI chips and Specialty AI IP

The Heart of Trust in the Cloud. Hardware Security IP


Features of Short-Reach Interface IP Design

Features of Short-Reach Interface IP Design
by Tom Dillinger on 03-08-2021 at 6:00 am

eye diagram

The emergence of advanced packaging technologies has led to the introduction of new types of data communication interfaces.  There are a number of topologies defined by the IEEE 802.3 standard, as well as by the Optical Internetworking Forum (OIF) Common Electrical I/O (CEI) standard. [1,2]  (Many of the configurations of interest involve connectivity between chips and electro-optical conversion modules for fiber communications.)

The figures below depict (some of) the classifications used to distinguish the physical characteristics of these interfaces.

The acronyms in the figures above are:  Long Reach = LR;  Medium Reach = MR;  Very Short Reach = VSR;  Extra Short Reach = XSR;  Ultra Short Reach = USR.

For any interface, designers need to address the data throughput, power, area, and cost tradeoffs between implementations using parallel data bus and high-speed serial connections.

At the recent International Solid State Circuits Conference (ISSCC), researchers at Cadence presented their 7nm process node IP design solution for short-reach, die-to-die transceiver communications. [3]  The remainder of this article summarizes the highlights of their presentation, and the unique features incorporated into their design.

Data Lanes, Data Rates, and Beachfront FOM

The SerDes differential signal-pair topology is widely used for long-reach distances, but the overhead of the embedded clock-data recovery circuitry in each receiver would be extremely inefficient for wide data interfaces over short distances.

For short-reach communications, specifically die-to-die interfaces on multi-die packages, parallel interfaces are used to provide the requisite composite data transfer rate.  When designing the parallel interface, architects need to address the tradeoffs between the achievable transmitted data rate, signal losses at the receiver, routing resources required, and sources of (static) skew and (dynamic) jitter.

Specifically, individual pins/lanes in the parallel interface are grouped into a link, which includes data encoding — more on that shortly.  A forwarded clock is sent with the data.

For the Cadence short-reach IP design, a link on the die consists of 7 Tx and 7 Rx data lanes, as illustrated in the top figure above.  The bottom figure provides a Tx-to-Rx block diagram.  The full IP macro is designed to support a 192-bit interface — the data is divided into 6 groups, 32 bits each.  The Tx sends 6 bits of parallel data over the link, serialized over 32 cycles.

The metrics used to describe the implementation are:

  • raw datarate per lane/pin

The Cadence short-reach IP provides 40Gb/sec/lane – more on how this is transmitted shortly.

  • “beachfront”:  expressed as the effective datarate per mm of die edge

The Cadence short-reach IP provides 480Gb/sec/mm.

  • power dissipation per bit:  for example, 1.7pJ/bit
  • signaling levels

The Cadence IP design chose single-ended NRZ signaling:  the transmitted signal does not return to zero between bits, so successive identical data values remain at the same voltage level.  The implications of using a single-ended connection for the data, rather than a differential signal pair, were addressed using a clever data encoding method.
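Tying the quoted figures together – 7 lanes per link, 40Gb/sec/lane, and 6 data bits carried per 7 transmitted bits (the 6/7b encoding detailed in the next section) – gives a quick sense of the link throughput.  The per-link payload number below is my own arithmetic from those figures, not a quoted spec.

```python
# Back-of-envelope link throughput from the figures quoted in the article.
lanes_per_link = 7        # Tx data lanes per link
raw_gbps_lane  = 40       # raw datarate per lane
code_rate      = 6 / 7    # 6/7b encoding: 6 data bits per 7 transmitted bits

raw_link     = lanes_per_link * raw_gbps_lane   # 280 Gb/s raw
payload_link = raw_link * code_rate             # 240 Gb/s of data
print(f"per-link: {raw_link} Gb/s raw, {payload_link:.0f} Gb/s payload")
```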

6/7b Data Encoding

In long-reach serial communications, there are numerous design issues to address, including:

  • insertion and reflection losses, resulting in signal amplitude degradation
  • (inductive) signal noise due to supply/GND switching current transients
  • crosstalk between adjacent lanes

To address these issues, SerDes designers use:

  • balanced routing, with focus on impedance matching for traces and vias
  • differential signaling
  • data encoding (e.g., 8/10b)

The first design criterion above focuses on improving the signal fidelity at the receiver.  Differential signaling reduces the noise on the signals by balancing the supply/GND current distribution.  To further enhance the temporal balance of the current distribution, eight bits of data are encoded into ten serially-transmitted bits, then decoded at the receiver.  The encoding maintains a (near-)equal number of transmitted ‘1’ and ‘0’ values, bounding the DC imbalance of the stream.

For the short-range parallel IP, Cadence designers also focused on defining:

  • insertion loss/crosstalk ratio guidelines
  • module trace layout and layer selection guidelines
  • a 6/7b encoding for the parallel link

Like the 8/10b SerDes approach for (temporal) balance, the encoding of the data in each link improves the (spatial) balance in switching current transitions.  The figure below illustrates the characteristics of the link encoding used.

The balance between transmitted ‘0’ and ‘1’ values in the parallel link 6/7b encoding provides “differential-like” switching, significantly reducing the magnitude of the ground current loops at both the Tx and Rx ends of the physical link.
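A simple counting argument (my own sketch, not the actual Cadence code table) shows why a balanced 6-to-7-bit mapping is possible at all: there are more near-balanced 7-bit words than the 64 needed to encode all 6-bit data values.

```python
# There are C(7,3) + C(7,4) = 70 seven-bit words containing three or four
# ones -- more than the 64 needed to encode all six-bit data values -- so
# every codeword can be chosen to be (nearly) balanced between 0s and 1s.
from itertools import product

balanced = [w for w in product((0, 1), repeat=7) if sum(w) in (3, 4)]
print(f"{len(balanced)} balanced 7-bit codewords available; 2**6 = 64 needed")
```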

Clock Generation and Data Calibration

A key feature of the Cadence IP is the generation of the 40Gbps lane datarate from an internal 10GHz PLL.  Each data unit interval is one-fourth of the clock cycle.  It is therefore necessary to phase-align 4 separate clock taps derived from the 10GHz reference.

The Cadence design employs a unique strategy.  The figure above illustrates how a set of pre-defined data patterns can be used to correct the duty cycle of each clock and the skew between clocks, each nominally shifted by one unit interval (90 degrees).  For example, a repeating NRZ transmit pattern of ‘….00110011….’ should align with the two edges of a single clock.  The transmit pattern ‘….00111100….’ should align across edges of different clocks.  The highest-rate transmit pattern, ‘….10101010….’, should align with the edges of successive clocks.  For each of these patterns, the overall ratio of 1’s and 0’s is equal.

As depicted above, the transmit output is fed back into the Tx circuitry and sub-sampled at a lower frequency (the DIV clock generator in the figure).  Each calibration step involves adjusting the duty cycle and clock phases until the sub-sampled ones-density is 50%, for each pre-defined pattern.  Seven unique data-pattern calibration steps are applied, as illustrated below.  The figure also illustrates the equivalent logic for transmitting the 40Gbps data.
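A sketch of the measurement principle (an assumed model of mine, not the actual calibration hardware): sub-sample the fed-back waveform at a rate incommensurate with the pattern period, and the fraction of ‘1’ samples tracks the waveform’s high-time.

```python
# Assumed model of the sub-sampled duty-cycle measurement: sample a
# repeating two-level waveform at an incommensurate low rate; the fraction
# of '1' samples tracks the waveform's high-time, so the calibration loop
# can adjust duty cycle / phase until the density reads exactly 50%.
def sampled_ones_density(duty, n_samples=100_000, step=0.137):
    ones, phase = 0, 0.0
    for _ in range(n_samples):
        phase = (phase + step) % 1.0       # sub-sampling walks the period
        ones += phase < duty               # reads '1' in the high portion
    return ones / n_samples

for duty in (0.45, 0.50, 0.55):            # duty-cycle error vs. density
    print(f"duty = {duty:.2f} -> ones-density = "
          f"{sampled_ones_density(duty):.3f}")
```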

For the Rx lane, the calibration steps are slightly more intricate, involving a sequence of communications with the Tx.  The initial step adjusts the reference voltage at an input of the receiver data comparator, aligning the switching threshold to the midpoint voltage of the transmitted data – a digital-to-analog converter (DAC) driven by the Rx controller establishes this voltage.  The DAC output voltage to the comparator is dynamically adjusted during operation to maintain the (vertical eye center) voltage threshold over supply and temperature variations.
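A sketch of the idea behind that first step – a behavioral model with assumed signal levels and deliberately exaggerated noise, not the actual controller algorithm – is to search the DAC code until the sampled ones-density reads 50%, which centers the threshold between the data levels.

```python
# Assumed behavioral model of the Rx threshold calibration: received bits
# are an equal mix of low/high levels plus Gaussian noise; the controller
# binary-searches the comparator-threshold DAC until half the samples read
# '1', centering the threshold at the midpoint of the data levels.
import math

V_LOW, V_HIGH, SIGMA = 0.1, 0.5, 0.2   # assumed levels (V); noise exaggerated

def ones_density(v_ref):
    q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))   # P(sample reads '1')
    return 0.5 * (q((v_ref - V_HIGH) / SIGMA) + q((v_ref - V_LOW) / SIGMA))

lo, hi = 0.0, 1.0                      # assumed DAC output range (V)
for _ in range(16):                    # 16-step successive approximation
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if ones_density(mid) > 0.5 else (lo, mid)

print(f"calibrated threshold ~ {(lo + hi) / 2:.3f} V "
      f"(midpoint of levels = {(V_LOW + V_HIGH) / 2:.3f} V)")
```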

The second step is similar to the Tx clock phase alignment described above.  The pre-defined set of Tx patterns for each lane captured at the receiver are used to adjust the phases of the Rx sampling clocks, derived from the forwarded clock in the link.

Testsite Results

A micrograph of the IP testsite, the eye diagram, and the specs for the IP are shown in the figure below.

Note the improved eye diagram with the 6/7b spatial data encoding.  The bit-error-rate (BER) curve is illustrated below.

Summary

The researchers at Cadence recently described their short-range transceiver design for the 7nm process node.  The design includes several features of note:

  • single-ended NRZ signaling
  • a 6/7b data encoding scheme in the link, to minimize current switching noise
  • a Tx/Rx clock phase alignment method using a set of pre-defined Tx data patterns

The resulting beachfront throughput and power efficiency are impressive.  I would encourage you to peruse their ISSCC presentation.

Appendix

The emergence of advanced multi-die packaging – aka, “chiplet” packaging – has introduced a new class of high-speed (parallel) interface design, for SR/XSR/USR topologies.  There has been quite a bit of industry commentary about the need for short-reach interface standards, to enable a broader adoption of chiplets from multiple sources.  The Open Domain Specific Architecture (ODSA) consortium has taken on the challenge of driving standardization in this area. (link)  The CHIPS Alliance has also been working on developing specifications for chiplet interfaces. (link)

Yet, as I was reading the Cadence ISSCC technical paper, it struck me that the unique features and the complex link-training protocol (with dynamic alignment over voltage and temperature) are a distinct differentiator.  These features require a complete, end-to-end Tx-to-Rx IP implementation between chiplets.  The ODSA and CHIPS Alliance certainly have their work cut out for them, to enable Tx and Rx transceiver IP implementations from different sources.  (The PCI-SIG has been able to climb this hurdle – it will be interesting to see how this evolves for chiplet interfaces.)

-chipguy

References

[1]  https://www.ieee802.org/3/

[2]  https://www.oiforum.com/technical-work/current-work/

[3]  McCollough, K., et al., “A 480Gb/s/mm 1.7pJ/b Short-Reach Wireline Transceiver Using Single-Ended NRZ for Die-to-Die Applications”, ISSCC 2021, paper 11.3.

Also Read:

112G/56G SerDes – Select the Right PAM4 SerDes for Your Application

Lip-Bu Hyperscaler Cast Kicks off CadenceLIVE

How does TensorFlow Lite on Tensilica HiFi DSP IP Sound?