
Analog Bits at TSMC OIP – A Complete On-Die Clock Subsystem for PCIe Gen 5

by Mike Gianfagna on 09-09-2020 at 10:00 am


This is another installment covering TSMC’s very popular Open Innovation Platform (OIP) event, held on August 25. This event presents a diverse and high-impact series of presentations describing how the members of TSMC’s vast ecosystem collaborate with each other and with TSMC. The talk covered here focuses on a complete on-die clock subsystem for PCIe Gen 5. Alan Rogers, president and CTO of Analog Bits, provided detail and motivation regarding the need for on-chip clock support to facilitate high-performance communication subsystems.

A memorable quote from Alan was “we will discuss how to synchronize the serial data interfaces which can move tens of billions of bits of data each and every second on and off a chip through a pair of wires.”

Alan began with a historical perspective on data communications. He pointed out that historically a SerDes would be supplied with a discrete clock chip, provided on the PCB. This worked fine when the SerDes had a limited number of high-end lanes that were not cost constrained. He went on to explain that today, there are a very high number of SerDes lanes and clock synchronization off-chip becomes very difficult to achieve. Performance demands are simply inconsistent with inter-chip transmission of remote clock sources. The system constraints in terms of dollars, power or pin count preclude a non-integrated solution.

With this motivation, Alan explored some of the on-chip options offered by Analog Bits. He provided an example to illustrate the scope and flexibility that can be achieved. The example, shown on the right, provides autonomous clocking of multiple SerDes protocols integrated on TSMC 16FFC technology with a combined clock unit. This example illustrates multiple rows of Analog Bits SerDes and seven independent LC PLLs (visible as inductor coils), each capable of driving a different frequency with a different spectral modulation to any of tens to hundreds of SerDes.

Central to a high frequency clock generator is a high performance PLL. Alan provided several examples of the types of PLLs supported by Analog Bits. These technologies provide a range of jitter, speed, power and area trade-offs for various applications. Some, such as the LC oscillator design, work best at high frequencies. The figure below summarizes the various options.

Alan then illustrated several examples of Analog Bits clocking solutions for various applications in several technologies. These included:

  • A ring-based chip-to-chip clock generator in TSMC N7
  • A PLL for cost-sensitive applications supporting PCIe Gen 2/3 data rates
  • A PCIe Gen 5 reference clock in TSMC N7/6

Silicon measurements of spread spectrum PLLs in TSMC 16FFC were presented. The data showed very good correlation to simulation results. Data for best-case conditions (FF wafers, high BW, high VCO amp) and worst-case conditions (SS wafers, low BW, low VCO amp) was presented. Closed-loop transient noise vs. silicon data was also reviewed.

Regarding collaboration with TSMC, Alan described an N6 test chip that taped out in June 2020. On board was a large complement of Analog Bits IP, including:

  • Ring OSC PLL
  • LC PLL
  • Bandgap
  • OSC pads
  • RC oscillator
  • TX/RX I/Os

Alan described the customizable architectures available from Analog Bits to clock numerous protocols, including:

  • 16FFC: PCIe 3/4, SATA, SAS3/4, XFI, 10G-KR
  • N7/N6: PCIe 3/4/5 and can be expanded to other protocols
  • N5: Available soon for PCIe 4/5, SATA, Ethernet

Clearly, Analog Bits provides a complete on-die clock subsystem for PCIe Gen 5 and beyond. Alan concluded by stating that Analog Bits has been a long-term partner of TSMC, providing a wide range of popular mixed signal IP for many applications. You can learn more at https://www.analogbits.com.

Also Read:

Cerebras and Analog Bits at TSMC OIP – Collaboration on the Largest and Most Powerful AI Chip in the World

AI processing requirements reveal weaknesses in current methods

7nm SERDES Design and Qualification Challenges!


Highlights of the TSMC Technology Symposium – Part 3

by Tom Dillinger on 09-09-2020 at 8:00 am


Recently, TSMC held their 26th annual Technology Symposium, which was conducted virtually for the first time.  This article is the last of three that attempts to summarize the highlights of the presentations.  This article focuses on the technology design enablement roadmap, as described by Cliff Hou, SVP, R&D.

Key Takeaways

  • Design enablement is available for N7, N6, N5, and N3, both EDA reference flows and Foundation IP.
  • N3 “specialty” IP is in development, in collaboration with the IP Partners.
  • Automotive (AEC-Q100) Grade 1 qualification is progressing for N7, offering an attractive PPA migration from N16 (available 4Q20).
  • EDA tool support is available for leading 2.5D/3D package technologies:  SoIC, InFO, CoWoS.  New EDA flow support required for (>1X reticle size) packages will be available 4Q20 (e.g., package warpage analysis).

Introduction

It is no secret that a major factor in TSMC’s foundry success has been the investment in the design enablement ecosystem, which spans the collaboration between TSMC and:

EDA partners

  • enhancing tool algorithms for new process node requirements, from place-and-route to physical design layout verification
  • collaborating with TSMC on implementation of trailblazing designs, from process bring-up memory array testsites to advanced Arm cores
  • preparing an integrated (and qualified) “reference flow” for a new process node

IP providers

  • developing critical IP functionality in a new node to complement TSMC’s Foundation IP
  • qualifying test silicon in the new node for the various TSMC platforms – IoT, mobile, HPC, and (the most demanding) automotive

Design Center Alliance (DCA) service providers

  • offering a range of front-end design resources, back-end implementation skills, custom design support, and DFT services

Value Chain Aggregator (VCA) providers

  • offering a broad range of support, throughout the IC “value chain”, extending all the way from product architecture definition to final wafer assembly/test/qualification services

and, the most recent addition to the Open Innovation Platform (OIP) ecosystem,

Cloud Alliance partners

  • collaborating with TSMC and EDA partners to provide a secure, scalable cloud compute environment for some (i.e., burst demand) or all of the IC design flow

The heart of the Open Innovation Platform is the TSMC Design Enablement (DE) organization.  Cliff provided an update on the enablement status for the upcoming advanced process nodes and packaging technologies, across the various design platforms.

Tool Certification

It should be noted that EDA tool certification at a new node is far more complex than simply running a set of SPICE circuit simulations and updating the runsets used for DRC/LVS/ERC physical verification.  Each node transition commonly introduces new, complex layout design rules, often requiring significant algorithm development by the EDA partner to provide the functionality and language commands needed to code the runset.  Multi-patterning, forbidden pitches, run-length dependent rules, line cut rules, and specific fill requirements across multiple mask levels all have been introduced at recent nodes.  For block composition flows at successive nodes, each cell library may have rules that define new constraints on cell placement, pin access routing, and power distribution/gating.  Reaching tool/flow production certification is no mean feat.
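
To give a concrete flavor of such rules, here is a minimal sketch of a run-length dependent spacing check, written in Python with invented thresholds; production runsets are coded in the vendors’ runset languages, and none of these numbers come from a real Design Rule Manual.

    # Run-length dependent spacing: the longer two wires run in parallel,
    # the larger the spacing they must keep. Thresholds are illustrative.
    RUN_LENGTH_SPACING_NM = [
        (0, 40),       # any parallel run: >= 40 nm spacing
        (500, 60),     # runs >= 500 nm:  >= 60 nm spacing
        (1500, 80),    # runs >= 1500 nm: >= 80 nm spacing
    ]

    def required_spacing_nm(parallel_run_nm: float) -> float:
        """Minimum spacing required for a given parallel run length."""
        spacing = RUN_LENGTH_SPACING_NM[0][1]
        for threshold_nm, spacing_nm in RUN_LENGTH_SPACING_NM:
            if parallel_run_nm >= threshold_nm:
                spacing = spacing_nm
        return spacing

    def pair_is_clean(parallel_run_nm: float, spacing_nm: float) -> bool:
        """True if a wire pair satisfies the run-length dependent rule."""
        return spacing_nm >= required_spacing_nm(parallel_run_nm)

    print(pair_is_clean(400, 45))    # True: short run, the 40 nm rule applies
    print(pair_is_clean(2000, 70))   # False: a 1500+ nm run needs 80 nm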

Additionally, new process nodes (and their application markets) may necessitate the introduction of completely new flows:

  • an “aging flow” that integrates the effects of NBTI, PBTI, and HCI into a measure of performance degradation over time, using new device aging models
  • a local heating flow that reflects how the unique thermal dissipation paths in FinFET-based designs impact chip failure mechanisms (especially electromigration)

N7/N6/N5/N3

  • full EDA tool certification, for both custom IP design and cell-based block composition, for all nodes (N5:  v0.9 PDK;  N3:  v0.1 PDK)
  • EDA “utility” certification (e.g., fill algorithms)

(Cliff’s certification charts focus on tool offerings from the major EDA Partners.)

N6 is a variant of N7, offering a yield improvement (fewer mask layers) and the ability to achieve a logic block density improvement using an optimized N6 high-density cell library.

  • N7 automotive platform flows and IP ready (AEC-Q100 Grade 1)
  • N5 automotive platform in 2022 (Grade 1)

Note that there are two common reliability qualification designations for the AEC-Q100 automotive platform, both based on zero fails after 1K hours HTOL stress test on sampled lots, plus HAST and temperature cycling endurance tests:   Grade 1:  -40C to 125C;  Grade 0:  -40C to 150C  (for “under the hood” applications).

When describing the (Grade 1) qualification activity for N7 and N5, Cliff highlighted some of the additional design enablement considerations for the automotive platform:

  • a “low DPPM” Design Rule Manual and DRC runset
  • aging model qualified for the automotive part lifetime and operating temperature
  • automotive platform-specific EM rules
  • automotive platform-specific latchup and ESD design rules
  • soft error upset analysis

Since the automotive “defect parts per million” shipped criterion is stringent, a specific set of DRC rules at the node is employed.

The demand for high-throughput, low power computation in the vehicles of the future is great, and must also meet the AEC-Q100 qualification criteria (Grade 1).  The TSMC design enablement team is extending the technology definition, design rules, models, and Foundation IP evaluation to provide this support at advanced process nodes.

N12e

At the Symposium, TSMC introduced a new ultra low power N12FFC+ variant, denoted as N12e.  This process is specifically designed for IoT (and AIoT, or AI at the edge) applications, offering a transition from N22ULL (planar) to N12e (FinFET).

  • N12e EDA tools certified (major new features added, listed below)

The design enablement for N12e is faced with the challenges of:

  • analyzing and modeling layout dependent effects (LDE), where device impacts are magnified at low VDD
  • developing SPICE models valid for VDD=0.4V
  • providing statistical device model support valid for low VDD operation
  • providing cell characterization, delay calculation, and static timing analysis support valid for low VDD operation;  specific focus is required for flip-flop setup/hold measures at low VDD

(At low supply voltage, the cell delay arc statistical variation is decidedly non-Gaussian, due to the “near Vt” operation.)
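
A small Monte Carlo sketch illustrates the effect. It assumes a simple alpha-power delay model, delay = VDD/(VDD − Vt)^alpha, with Gaussian threshold-voltage variation; every parameter below is invented for illustration and is not foundry data.

    import random
    import statistics

    ALPHA = 1.3                       # illustrative alpha-power exponent
    VT_MEAN, VT_SIGMA = 0.30, 0.02    # threshold voltage, volts (assumed)

    def gate_delay(vdd: float, vt: float) -> float:
        """Alpha-power-law gate delay, arbitrary units."""
        return vdd / (vdd - vt) ** ALPHA

    def delay_stats(vdd: float, trials: int = 100_000):
        samples = []
        for _ in range(trials):
            vt = random.gauss(VT_MEAN, VT_SIGMA)
            if vt < vdd - 0.01:       # keep the device (barely) on
                samples.append(gate_delay(vdd, vt))
        mean = statistics.mean(samples)
        sigma = statistics.stdev(samples)
        skew = statistics.mean(((s - mean) / sigma) ** 3 for s in samples)
        return mean, sigma, skew

    # Skew grows sharply as VDD approaches Vt: the distribution becomes
    # increasingly one-sided toward long delays (non-Gaussian).
    for vdd in (0.8, 0.6, 0.4):
        mean, sigma, skew = delay_stats(vdd)
        print(f"VDD={vdd:.1f}V  sigma/mean={sigma/mean:5.1%}  skew={skew:+.2f}")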

Advanced Packaging:  SoIC, InFO, CoWoS  (3D Fabric) 

With the rapid growth of 2.5D and 3D packaging options, the TSMC Design Enablement team has expanded their scope to include the appropriate physical verification and electrical/thermal analysis EDA flow support:

  • redistribution layer (RDL) routing and through via routing rules (through CoWoS silicon interposer or InFO wafer compound)
  • routed interconnect impedance matching and shielding requirement (e.g., on a CoWoS interposer, to support wide bus width connectivity to HBM stacks)
  • die-to-die bond rules (SoIC)
  • LVS verification throughout the 2.5D/3D package connectivity
  • RC and RLC parasitic extraction for a complex package geometry – especially, inter-die coupling capacitance for SoIC
  • IR and EM analysis of the power distribution network throughout the package assembly
  • signal integrity analysis
  • thermal analysis – especially, through 3D stacked die
  • ESD analysis

EDA tools are ready for SoIC (3D), InFO and CoWoS (both 2.5D), with the following exceptions, as new flows need to be certified:

  • large (>>1X max reticle size) multi-die floorplan package “warpage analysis”  (available for InFO and CoWoS in 4Q20)
  • static timing analysis for stacked die in an SoIC, with temperature/voltage distribution and “multi-corner” process variation between die (available 4Q20)

 

The TSMC Design Enablement team continues to provide EDA tool and reference flow support for the challenges introduced by advanced process nodes, ranging from new aging models to timing/electrical analysis at low VDD operation.  The 2.5D and 3D package technology offerings require a close collaboration between TSMC and EDA developers to address new requirements – e.g., unique package interconnect/via design rules, stacked die timing analysis.

As mentioned above, TSMC’s focus on design enablement distinguishes their process and package technology offerings.

For more information on the TSMC Design Enablement support for the OIP Partners and platforms, please follow these links – OIP and Technology Platforms.

-chipguy

Highlights of the TSMC Technology Symposium – Part 1

Highlights of the TSMC Technology Symposium – Part 2

 


Blue Cheetah Technology Catalyzes Chiplet Ecosystem

by Tom Simon on 09-09-2020 at 6:00 am


There are many reasons today for dividing up large monolithic SoCs into chiplets that are connected together inside a single package. Let’s look at just some of these reasons. Many SoCs share a common processing core with application-specific interfaces and specialized processing engines. Using chiplets makes it easier to reuse the main processing core and ancillary blocks to build special-purpose ICs for various markets. Mixing analog and digital functions on a single die can be difficult, especially at more advanced nodes; it is often more cost effective and simpler to build separate analog chiplets and interface them with digital die. Yield is also an issue. The larger a die, the more likely it is to contain a defect or fabrication failure, which is especially painful when the entire chip has to be rejected. A failed chiplet can be discarded without adversely affecting the other parts of a large chiplet-based IC design. The list of advantages goes on, but I think you get the idea.
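
The yield argument can be made concrete with the classic Poisson defect model, yield = exp(−area × defect density). A minimal sketch, with a defect density assumed purely for illustration:

    import math

    D0 = 0.1              # assumed defect density, defects/cm^2 (illustrative)
    BIG_DIE_CM2 = 6.0     # one large monolithic die
    CHIPLET_CM2 = 1.5     # four chiplets of the same total area

    def poisson_yield(area_cm2: float, d0: float = D0) -> float:
        """Fraction of good die under a Poisson defect model."""
        return math.exp(-area_cm2 * d0)

    print(f"monolithic die yield: {poisson_yield(BIG_DIE_CM2):.1%}")   # ~54.9%
    print(f"per-chiplet yield:    {poisson_yield(CHIPLET_CM2):.1%}")   # ~86.1%
    # A failed chiplet wastes only a quarter of the silicon, and known-good
    # chiplets can still be assembled, so cost per good system drops.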

Of course, chiplet based designs introduce new requirements and have some drawbacks. However, it has been pointed out that Gordon Moore saw the potential advantages of this approach back in 1965 when he said, “It may prove to be more economical to build large systems out of smaller functions, which are separately packaged and interconnected.” I recently saw this quote in a presentation from DAC by Intel, CHIPS Alliance and Blue Cheetah. The first issue that needs to be dealt with to make chiplets effective and practical is providing a method for interfacing them to each other.

To address this need, CHIPS Alliance developed the Advanced Interface Bus (AIB), which offers a standardized high-bandwidth interface between chiplets. It uses wide parallel connections with dense microbump arrays. It can work at modest clock rates and can transfer massive amounts of data. AIB and its features are a topic for a whole other discussion from the DAC presentation. What Intel, Blue Cheetah and the CHIPS Alliance wanted to talk about is how to rapidly implement the analog AIB PHY for each technology node used in all the chiplets found in a design.

This, so to speak, is where the rubber meets the road. When dividing monolithic SoCs into chiplets, each one will need an AIB PHY implemented as an analog block – traditionally a difficult and time-consuming specialized design task. This is where Blue Cheetah comes in. They have been extending the BAG Framework, initially developed at UC Berkeley, to create generators that produce fully automated analog designs, from schematics and layouts to test benches, LEFs, LIBs, and behavioral models. While not all their generators are open source, the AIB PHY is freely available on GitHub. The BAG framework is also open source, making this a revolutionary proposition.

We are accustomed to open source software, development libraries and even tools. However, in the world of hardware design, open source has been a long time coming, though recently we have seen the emergence of RISC-V, an open source processor ISA. From the looks of it, Blue Cheetah and BAG technology will change how we think of analog blocks in terms of reuse and retargeting. Add to that the notion of open source generators for blocks that are needed to catalyze innovation, and it seems that things are about to get very interesting.


Of course, not all of the generators that run in the Blue Cheetah offering are open source. This makes it useful for companies that want the advantages of generators without having to surrender rights to them. Blue Cheetah can even help with this internal development. There is a lot to digest here, and even more in the details of the AIB PHY generator operation and capabilities offered in the presentation. A key point is that this is not just layout generation. Rather, the AIB custom block generator is configurable and produces signoff quality schematics and layouts along with industry standard integration views.

The layout flow is equally sophisticated. The BAG API lets the generator developer specify a high level floorplan, which can be used to produce designs across many target technologies and parameters.

The test vehicle for the AIB PHY using BAG has already been delivered and tested. It was a 2-channel test chip taped out on Intel 22FFL in January 2020. The silicon met the required specifications, showing zero error loopback results at 2Gbps.

Clearly the Blue Cheetah Ecosystem is very helpful for designers who are dividing up large monolithic SoCs into chiplets. The presentation points to the RTL and the generator for the AIB PHY on GitHub. The age of chiplets has arrived, leading to new kinds of designs and further innovation. While generator development requires expertise, once they are implemented a larger community of developers can leverage them to boost productivity. This is no different from how open source works. Next we can look forward to AIB 2.0, which will offer improved bandwidth and density.

Also Read:

Analog Design Acceleration for Chiplet Interface IP


S2C Announces 300 Million Gate Prototyping System with Intel® Stratix® 10 GX 10M FPGAs

by Daniel Nenni on 09-08-2020 at 10:00 am


In 2016 SemiWiki published a book “Prototypical: The Emergence of FPGA-Based Prototyping for SoC Design”. Today we are writing Prototypical II since a LOT of prototyping innovation has happened in the last four years, absolutely.

For example:

Quad 10M Prodigy™ Logic System extends the capacity leadership to simplify today’s innovative SoC/ASIC design and verification

San Jose, CA – September 7, 2020 – S2C, a world leader in FPGA-based prototyping solutions for accelerated SoC verification, today introduces the new Quad 10M Prodigy Logic System equipped with four Stratix 10 GX 10M FPGA devices. The Stratix 10 GX 10M is the world’s largest-capacity FPGA device, with 10.2M logic elements, 253Mb of M20K memory and 3,456 DSP blocks. The Quad 10M Prodigy Logic System is an astounding 300-million-equivalent-ASIC-gate prototyping solution with attractive cost-per-gate pricing.

Quad 10M Prodigy Logic System Highlights:

  • Large capacity and scalability with 40.8M Logic Elements, 1,012Mb memory and 13,824 DSP blocks
  • 4,608 high-performance I/Os for inter FPGA connections and daughter cards
  • 160 high-speed transceivers that can run up to 16Gbps
  • Compatible with 90+ Prodigy Prototype Ready IPs
  • Integrated Multi-Debug Module
  • Compact, sleek, all-in-one chassis for a clean, portable, and well-organized work environment
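
The headline figures are simply four times the per-device Stratix 10 GX 10M resources quoted above; a quick arithmetic check:

    # Per-device resources of the Stratix 10 GX 10M, from the release above.
    LOGIC_ELEMENTS = 10_200_000   # 10.2M logic elements
    M20K_MEMORY_MB = 253          # megabits of M20K memory
    DSP_BLOCKS = 3_456
    FPGAS = 4                     # Quad 10M Prodigy Logic System

    print(f"{FPGAS * LOGIC_ELEMENTS / 1e6:.1f}M logic elements")   # 40.8M
    print(f"{FPGAS * M20K_MEMORY_MB} Mb memory")                   # 1012 Mb
    print(f"{FPGAS * DSP_BLOCKS} DSP blocks")                      # 13824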

The increasing scale of SoC designs demands greater FPGA prototyping capacity for pre-silicon verification. The Quad 10M Prodigy Logic System is equipped with four Intel 10M FPGA devices in a single chassis with a unified power and control module. The newly designed control module has built-in debug hardware to enable high-performance deep-trace capability for multiple FPGAs without extra peripherals. Enhanced partitioning tools can perform automatic intra-FPGA partitioning with DIB insertion between 10M dies, and inter-FPGA partitioning using pin multiplexing over multiple FPGAs. The elegant system design combines this capacity with ease of use.

The Quad 10M Prodigy Logic Systems work seamlessly with other Prodigy prototyping components such as Prodigy Player Pro™ software, Prodigy Multi-Debug Module and Prodigy ProtoBridge™ to provide unrivaled configuration, partitioning, deep-trace debug and co-modeling capabilities.

“We continue to deliver the highest capacity and easiest-to-use rapid prototyping solutions,” commented Toshio Nakama, CEO of S2C. “We are pleased to introduce Quad 10M Prodigy Logic System, the largest capacity ever in our Prodigy product family, to address the silicon development for data center, 5G wireless communication and autonomous driving.”

Availability
The Quad 10M Prodigy Logic System is available for purchase now. For more information, please contact your local S2C sales representative, or visit www.s2cinc.com.

About S2C
S2C is a global leader in FPGA prototyping solutions for today’s innovative SoC/ASIC designs. S2C has been successfully delivering rapid SoC prototyping solutions since 2003. With over 500 customers and more than 3,000 systems installed, our highly qualified engineering team and customer-centric sales team understand our users’ SoC development needs. S2C has offices and sales representatives in the US, Europe, China, Korea, Japan and Taiwan. For more information please visit www.s2cinc.com.

Intel® and Stratix® are trademarks of Intel Corporation.

S2C, S2C logo, Prodigy, ProtoBridge, and Player Pro are trademarks or registered trademarks of S2C. All other trademarks are the property of their respective owners.

Media Contact
Aki Huang
MARCOM Specialist
Email: marketing@s2cinc.com

Also Read:

Webinar: Hyperscale SoC Validation with Cloud-based Hardware Simulation Framework

WEBINAR: Prototyping With Intel’s New 80M Gate FPGA

S2C Delivers FPGA Prototyping Solutions with the Industry’s Highest Capacity FPGA from Intel!


Combo Wireless. I Want it All, I Want it Now

by Bernard Murphy on 09-08-2020 at 6:00 am


When we think of wireless it is natural to wonder “which one – cellular, Wi-Fi, BLE?” Our phones support everything, but those are pricey devices. What if we wanted the same combo wireless option in a low-cost IoT device, maybe something that only needs to send a small amount of data periodically? Logistics applications are a good example. NB-IoT is an ideal low-energy wireless protocol for this application, especially now that we’re starting to see the beginnings of satellite support, promising worldwide coverage. Logistics also needs location support, which can now be provided by GNSS, the superset of GPS including services like GLONASS and Galileo. OK, that’s not a wireless protocol, but it can be handled in the same unit that handles the baseband. Which then allows us to track our package at sea, in the air, wherever it may be.

Wi-Fi for positioning

But what happens when the package disappears inside a warehouse? There’s no cellular signal and the IoT device can’t see positioning satellites. Then it can be useful to have Wi-Fi support since there are almost certainly Wi-Fi access points around the warehouse. You could re-establish communication that way, though Wi-Fi is not especially low power. More importantly you can get a coarse sense of positioning – a radius around the access point with strongest reception.
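
To get a feel for how coarse that radius is, the standard log-distance path-loss model maps received signal strength to an approximate distance. The path-loss exponent and 1 m reference level below are typical indoor assumptions, not figures from the article.

    PATH_LOSS_EXPONENT = 2.5     # typical indoor value (assumed)
    RSSI_AT_1M_DBM = -40.0       # reference level at 1 m (assumed)

    def distance_from_rssi(rssi_dbm: float) -> float:
        """Rough distance (m) to the access point with strongest reception."""
        return 10 ** ((RSSI_AT_1M_DBM - rssi_dbm) / (10 * PATH_LOSS_EXPONENT))

    for rssi in (-50, -60, -70):
        print(f"RSSI {rssi} dBm -> roughly {distance_from_rssi(rssi):.0f} m radius")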

Bluetooth for positioning and provisioning

Add to that Bluetooth, especially BLE, which is low power. With Bluetooth mesh you can achieve significant range for communication, certainly across a large storage facility. You also have some interesting positioning options using the angle-of-arrival and angle-of-departure features in the Bluetooth 5.1 standard and onwards. Whether this level of accuracy is necessary or not will depend very much on the application.

There’s another important benefit to Bluetooth – provisioning. NB-IoT is not a good way to transfer large amounts of data, the kind of data you may need to transfer when you’re first setting up a device or when you’re updating software. BLE is a much better way to handle those use-cases, providing suitable bandwidth at low power. Just bring your smartphone near the device and do the update.

Bluetooth sensor support in smart homes

Sensors for smart home applications are most likely to be designed with Bluetooth support. Smoke sensors, open windows, doors, that sort of thing. They’ll connect to a gateway which can connect to the cloud via NB-IoT. After all, these sensors don’t have a lot of data to transfer.

At this point you need support in your device for NB-IoT, BLE, Wi-Fi (because not all warehouses will have BLE) and GNSS. It has to be small, low cost and low power. Think about logistics tags; you want these to be vanishingly small and almost maintenance-free.

Combo wireless in IoT is happening

This is not an academic possibility. CEVA tells me they already have a customer doing almost exactly this, minus the BLE component. NB-IoT communication in the great outdoors, GNSS location positioning and approximate positioning indoors using Wi-Fi. All running on one core. The wireless guys (Paddy McWilliams, Franz Dugand and Tal Shalev) added that they’re now looking at incorporating BLE on the same core. Tal said that there’s no technical limitation here; they just hadn’t got around to this yet.

So if you want it all, communication and positioning options in one low-profile, value-priced, low-energy package, you can have most of that right now and should be able to have all of it pretty soon. You can learn more HERE.

Also Read:

Wi-Fi Bulks Up

5G Infrastructure Opens Up

Using IMUS and SENSOR FUSION to Effectively Navigate Consumer Robotics


Dolphin Design – Delivering High-Performance Audio Processing with TSMC’s 22ULL Process

by Mike Gianfagna on 09-07-2020 at 10:00 am


TSMC held their very popular Open Innovation Platform (OIP) event on August 25. The event was virtual of course and was packed with great presentations from TSMC’s vast ecosystem. One very interesting and relevant presentation was from Dolphin Design, discussing the delivery of high-performance audio processing using TSMC’s 22ULL process through their computing platforms and subsystems.

The OIP event followed TSMC’s Technology Symposium, which was held the day before. I’ve heard from more than one person that these virtual events were well produced, easy to follow and had the added advantage of not needing to get up at the crack of dawn to get a parking spot and a good seat. Virtual events are clearly the new normal.

Dolphin’s presentation began by discussing the business trends for AI applications in audio markets. This was followed by a discussion of ultra-low power (uLP) audio processing, an application use case and an overview of Dolphin’s platforms for audio processing. I’ll provide some highlights of each section of their presentation here.

Business Trends in AI Audio Markets

This section began by pointing out that voice is the easiest form of user interface, with the following properties:

  • Intuitive
  • Quick and accurate
  • No contact
  • Straightforward
  • Easy integration

Voice-enabled devices need to address several technical challenges, including:

  • Voice detection
  • Keyword spotting
  • Voice pickup & noise reduction
  • Speaker separation
  • Active noise control
  • Speech recognition
  • Low power

So, voice-enabled devices represent the next revolution for user experience. The opportunity is to provide power optimized, local AI processing for things like speech recognition, wake-word detection and voice detection. Local processing will deliver better latency, lower cost and improved privacy since voice data is not sent to the cloud.

uLP Voice Detection and Keyword Spotting

Dolphin Design provided some very good detail on the benefits of their IP and associated platforms for voice detection. You can also see Tom Simon’s post on Dolphin Design and voice detection here. The figure below illustrates the high-performance and ultra-low power audio processing they can deliver for voice detection.

The Dolphin approach for voice detection provides the following benefits:

  • Stand-alone IP embedding a smart algorithm to detect voice activity
  • Automatic tuning of detection algorithms to the level of background noises
  • Short detection latency to avoid the need of buffering the audio stream
  • Ambient noise sensing for optimal adaptation of the key word spotting (KWS) algorithm to environmental conditions

A typical record lifetime of systems with a 25 mAh battery is ~5 hours without Dolphin technology and ~38 hours with Dolphin technology.
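
Those lifetimes imply the average current draws backed out below; a rough sketch that treats the full 25 mAh as usable capacity.

    BATTERY_MAH = 25.0

    def avg_current_ma(lifetime_hours: float) -> float:
        """Average current implied by battery capacity and lifetime."""
        return BATTERY_MAH / lifetime_hours

    print(f"without Dolphin: ~{avg_current_ma(5):.1f} mA average")    # ~5.0 mA
    print(f"with Dolphin:    ~{avg_current_ma(38):.2f} mA average")   # ~0.66 mA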

For keyword spotting, Dolphin Design can also deliver high-performance and ultra-low power audio processing using their MCU subsystem as shown in the figure below.

Using Dolphin’s CHAMELEON MCU subsystem yields the following benefits:

  • Up to 80x power reduction
  • Bringing KWS in µW range
  • No need for accelerator
  • Enables faster inference
    • for multiple speakers
    • for beamforming
    • still in mW range

 

Application use case: True Wireless Stereo (TWS) Earbuds

An example application for TWS earbuds was presented. Several Dolphin Design platforms and subsystems were used in this application. The benefits of each of these capabilities can be summarized as follows:

  • CHAMELEON MCU Subsystem
    • Compatible with main MCU
    • High bandwidth through low latency interconnect
    • Tiny ML accelerator with 32 MAC/cycle
    • <20 µA/MHz & 2µA deep sleep in TSMC 22uLL
  • BAT Audio Platform
    • Up to 768 kHz sample rate
    • Less than 7us analog to analog latency
    • Up to 8 analog and digital mic inputs
    • I2S/AHB data interface & I2C/APB control interface
  • SPIDER Power Management
    • Customizable & tailored power network
    • Standardized & predictable power management
    • 250 nA quiescent DCDC
    • 150 nA quiescent LDO
  • PANTHER DSP
    • Up to 64 MAC/cycle
    • Up to 16 cores scalability
    • Standard AXI interface
    • Enhanced SIMD DSP, NN instructions

Dolphin Design Platforms for Audio Processing

The following diagram summarizes Dolphin Design platforms and their capabilities in the field of audio and processing applications.

Dolphin summarized how they are delivering high-performance audio processing with TSMC’s 22ULL process as follows:

  • Audio/Voice markets will be dominant AI market in coming years
    • Smart Sensors approach will be the driving force
  • Dolphin Design has a long experience in Audio Codecs
  • New platforms will enable Voice User Interface
    • uLP speech recognition for enabling the voice-control world
    • Open platform as a design Backbone reusable for multiple projects, multiple processes, multiple processor vendors
    • Reduce key expertise bottlenecks
    • Faster TTM thanks to ready-to-use audio platform

You can learn more about the platforms and systems available from Dolphin Design here.


Highlights of the TSMC Technology Symposium – Part 2

by Tom Dillinger on 09-07-2020 at 8:00 am


Recently, TSMC held their 26th annual Technology Symposium, which was conducted virtually for the first time.  This article is the second of three that attempts to summarize the highlights of the presentations.  This article focuses on the TSMC advanced packaging technology roadmap, as described by Doug Yu, VP, R&D.

Key Takeaways

  • SoIC (3D) multi-die integration will benefit from continuous process improvement on die bond pitch, driven by the areal density scaling of N7, N5, and N3.
  • The “back-end, die-first” InFO (2.5D) technology is being enhanced to embed a Local Silicon Interconnect (LSI) bridge, denoted as InFO-L.
  • The “back-end, die-last” CoWoS (2.5D) technology is also expanded to include a LSI bridge, embedded in an organic substrate (replacing the traditional silicon interposer).  CoWoS-L will offer a cost-effective method to integrate multiple die with memory stacks.
  • InFO offerings are being enhanced to support larger assemblies, with RDL interconnects spanning >1X max reticle size.  Similarly, CoWoS interposer dimensions will support >>1X max reticle size.
  • The full complement of SoIC, InFO, and CoWoS offerings have been incorporated into the TSMC “3D Fabric” product family, in anticipation of future system-level assemblies integrating both the 3D and 2.5D packaging technologies.

Introduction

Doug’s presentation covered the pillars of TSMC’s advanced packaging options:

  • the “front-end” SoIC die-to-die attach technology
  • the “back-end, chip-first” InFO (Integrated FanOut) technology
  • the “back-end, chip-last” CoWoS (Chip-on-Wafer-on-Substrate) technology

As will be discussed shortly, Doug announced several unique enhancements to the back-end options.

TSMC has grouped both the front-end and back-end options into a single package development roadmap, denoted as “3D Fabric”.  The last section of this article will illustrate how both these FE and BE technologies can be combined, into a complex 3D package solution.

SoIC

Background

SoIC technology enables direct die-to-die attach, using thermo-compression bonding between pads – here’s an earlier article that describes this process:  link.

Both face-to-face and face-to-back orientations are supported.  The face-to-back topology utilizes through silicon vias (TSVs) to provide the bonding pads.  TSVs also enable the addition of microbumps for subsequent package substrate attach.

(There is a variant of this technology that enables a highly efficient assembly flow, in the specific case where both die share the same footprint – “WoW”, for Wafer-on-Wafer.)

There are opportunities for integrating multiple die at the same level to the underlying base die (as in the figure above), as well as the capability to develop a vertical stack of thinned die.  The latter configuration is commonly used to construct high-bandwidth memory (HBM) stacks, with several memory die on top of a memory controller as the base.  Doug referenced a recent TSMC technical presentation illustrating a 12-high HBM stack (total thickness ~60um), utilizing the SoIC bonding technology.  (Reference:  Tsai, C.H., et al., “Low Temperature SoIC Bonding and Stacking Technology for 12/16-Hi High Bandwidth Memory (HBM)”, VLSI 2020 Symposium, Paper TH1.1.)

Advanced SoIC Development

  • thermal R (Tr)

An area of focus for SoIC development is the thermal resistance of the bond connections – it is critical that the heat generated within the die stack have a low Tr path to the package.  Doug’s presentation highlighted the continuous process improvement (CPI) activity that has reduced the bond Tr by ~25%.  (Similar CPI attention has been placed on reducing the Tr for microbump connections as well, by ~18%.)

  • bond pitch

Another major focus area is to scale the SoIC bond and TSV pitches, in conjunction with the areal density scaling of successive process nodes.  (If the bond and TSV pitch didn’t scale, that would adversely impact the realized density gains from migrating to the next node.)  Doug indicated the minimum bond and TSV pitches will indeed transition from 9um (N7) to 6um (N5) and to 4.5um (N3).  Doug also shared experimental data illustrating sub-um bond pitch reliability, for future node scaling.
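
Since bond density scales as the inverse square of pitch, the quoted roadmap translates directly into die-to-die connection density; a quick check, assuming a simple square bond grid:

    # Die-to-die bond density scales as 1/pitch^2 (square-grid assumption).
    PITCH_UM = {"N7": 9.0, "N5": 6.0, "N3": 4.5}

    for node, pitch_um in PITCH_UM.items():
        bonds_per_mm2 = (1000.0 / pitch_um) ** 2
        print(f"{node}: {pitch_um} um pitch -> ~{bonds_per_mm2:,.0f} bonds/mm^2")
    # N7 ~12,300; N5 ~27,800 (2.25x); N3 ~49,400 (1.78x) -- tracking the
    # areal density gains of the corresponding node transitions.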

Clearly, front-end SoIC packaging technology development is receiving considerable R&D investment.

InFO

Background

The Integrated FanOut technology utilizes a “back-end, chip-first” package assembly process.  As described in the earlier article mentioned above, InFO selects known good die and encapsulates them in a “reconstituted” wafer of an epoxy molding compound.

This enables the addition and patterning of dielectric and interconnect layers on top of the molding compound wafer to utilize existing fab equipment.  These interconnects, along with the final pattern of metal to the package attach microbumps, are collectively described as the redistribution layers (RDL).

As will be described shortly, TSMC is introducing alternative InFO technologies – the traditional InFO assembly with redistribution layers is now denoted as InFO-R.

There are other existing InFO designations – e.g., InFO-AIP with “antenna-in-package”, and InFO-PoP with “package-on-package”.  InFO-PoP integrates a chip stacked on top of the InFO assembly, whose microbumps attach to through-InFO vias (TIVs) in the molding compound to the RDL layers – see below.

The focus of the InFO package development presented at the Symposium was on enhancements to InFO-R, and a new InFO topology.

InFO-R Development

  • increasing reticle size

To enable greater flexibility in multi-die integration, TSMC has begun offering InFO assembly – e.g., die placement, encapsulation, and (specifically) RDL patterning – that exceeds the maximum photolithography reticle size.  The CoWoS technology has offered interconnect patterning on the silicon interposer that exceeds the 1X reticle size limit for some time;  this technique has recently been extended to InFO.  (1X maximum reticle size:  ~33mm x 26mm.)

Support for an InFO 1.7X reticle-size assembly will be available in 4Q20, with 2.5X in 1Q21 (qualified on a final package of 110mm x 110mm).  It is evident that there is significant customer demand for a low-cost package technology for ever-increasing multi-die configurations.

  • RDL interconnect

Key parameters for InFO-R are:  the die pad pitch to the RDL layers (40um), the RDL pitch (2um L/2um S), and the number of RDL layers (3).

Recently, TSMC R&D published an article describing development of sub-micron L/S patterning – more die in a large InFO-R assembly will require greater interconnect routing density.  (Reference:  Pu, et al., IEEE ECTC, 2018, p. 45-51.)

InFO-L

As mentioned above, the RDL line/space pitch is a key characteristic of the multi-die InFO assembly.  Yet this dimension is limited by the processes available for the deposition, patterning, and curing of the organic dielectric and metallization used for the RDL layers.

To enable greater die-to-die routing capacity, TSMC is introducing a Local Silicon Interconnect (LSI) “bridge chiplet” embedded within the RDL assembly on top of the encapsulated die.  Compared to the baseline InFO-R technology, the embedded silicon bridge in InFO-L offers:

  • 25 um die pad pitch for LSI connectivity (versus 40um)
  • 0.4um/0.4um L/S (versus 2um/2um)
  • 4 metal layers (using TSMC’s “Mz” metal thickness process module)
  • InFO-L will be qualified in 1Q21, on a 1X reticle-size assembly

InFO-SoIS

The typical package substrate used with InFO-R provides connectivity from the InFO bumps to the package BGA balls, with limited interconnect layers within the substrate.   At the Symposium, TSMC showed a unique variant of InFO-R, where the substrate consists of a composite of organic layers, providing 14 metal interconnect planes.  This demonstration of a “System-on-Integrated Substrate” may evolve to production status for a large-area, multi-die InFO-R assembly requiring more connectivity to BGA balls.

CoWoS

Background

The back-end, chip-last assembly known as Chip-on-Wafer-on-Substrate (CoWoS) technology has traditionally used a silicon interposer as the intermediate-level interconnect substrate for multi-die integration.  This option has been the mainstay for system implementations with an array of processor die, typically with multiple HBM memory stacks.

  • reticle size

Over the years, CoWoS technology development has focused on supporting increasing silicon interposer dimensions.  TSMC will be expanding the interposer size to 3X max reticle (2021) and 4X max reticle (2023), to support more processors and HBM stacks in the overall package.

  • improved interposer electrical characteristics

CoWoS process R&D has enabled the following enhancements:

– up to 5 Cu metal layers

– lower sheet resistivity (improving by 3X in 1H21)

– embedded capacitors

The traditional CoWoS topology with silicon interposer is now designated as CoWoS-S, to differentiate from the new configurations that Doug presented at the Symposium.

CoWoS-L

A new chip-last offering was introduced – CoWoS-L.  Like the embedded LSI interconnect bridge added to the InFO offering, a similar configuration is being added to the CoWoS assembly.  The silicon interposer is replaced by an organic substrate with an embedded LSI chiplet, offering interposer-like interconnect signal density in a more cost-effective assembly.

CoWoS-L plans are to provide:   1.5X reticle size (1 die, 4 HBM2E stacks), currently in production;  3X reticle size (3 die, 8 HBM2E stacks), in 2Q21.

Full Front-End (3D) and Back-End (2.5D) Integration

The 3D Fabric product initiative envisions a combination of (SoIC + InFO) and (SoIC + CoWoS) assemblies.  A multi-die, multi-tiered SoIC could be integrated as part of a (chip-first) encapsulated InFO offering.  An example is illustrated below of an SoIC integrated as part of a (chip-last) CoWoS assembly.

The full 3D Fabric offering is illustrated below.

In the 3D Fabric collection, note that there is also a CoWoS-R variant shown – a chip-last assembly on an organic substrate with RDL layers and no embedded LSI bridge.  Given the large number of wires required in the typical CoWoS die plus HBM stack topology, the embedded LSI bridge of CoWoS-L is likely required.  Here’s a cross-section of CoWoS-R.

TSMC has made a major investment in advanced packaging development – SoIC, InFO, and CoWoS have become an integral part of system architecture definition.  Increasingly, architects will need enhanced “pathfinding” tools to assist with the myriad of performance, power, area/volume, signal integrity, power delivery, thermal dissipation, reliability, and cost tradeoffs.

For more info on the full suite of 3D Fabric offerings, please follow this link.

-chipguy

Highlights of the TSMC Technology Symposium – Part 1

Highlights of the TSMC Technology Symposium – Part 3


In-Chip Monitoring Helps Manage Data Center Power

by Tom Simon on 09-07-2020 at 6:00 am


Designers spend plenty of time analyzing the effects of process, voltage and temperature. But everyone knows it’s not enough to simply stop there. Operating environments are tough and have lots of limitations, especially when it comes to power consumption and thermal issues. Thermal protection and even over-voltage protections have been in chips for many years. However, there is more at stake than just preventing failures. It’s necessary to tune the operation of SoCs so they have long life and lower cost of operation, plus they need to stay within the limits of the cooling systems used in the facilities where they are located. In-chip monitoring can help manage power consumption and thermal issues.

This was a topic at the recent TSMC OIP event. Stephen Crosher, CEO of Moortec, a provider of IP for in-chip monitoring, presented on the topic of “Challenges of N5 HPC and Hyperscaling within Data Centers.” Small savings at the chip level in power consumption and heat generation translate into meaningful results when scaled up. Stephen points out that hyperscale data centers can contain on the order of millions of SoCs.

Data centers already consume 1-2% of all electricity produced globally. Chinese data centers alone over the next 5 years are expected to use as much electricity as all of Australia. Data center traffic and workloads are expected to rise by 80% over the next 3 years.

The only way to effectively manage power is to design in feedback systems to manage SoC operation so that they minimize the power. The first step in doing this is to ensure there is accurate and complete information about all three of process, voltage and temperature. With the right kind of in-chip monitoring capabilities many things can be done inside of an SoC to respond to each of these conditions.

Stephen provides examples of how tightening the voltage monitoring precision at the terminus of the supply nets from 5% to 1% can reduce supply guard banding and cut power by ~10%. Multiplied across millions of chips, fractions of a penny per hour per chip translate into savings of millions of dollars per year. Likewise, for thermal management, more accurate sensors can prevent premature device throttling. Moortec’s ‘out-of-the-box’ high-accuracy sensors can help avoid unnecessary throttling compared to alternative +/-5% sensors, especially considering that Moortec sensors can achieve even higher accuracies if calibration can be accommodated in production test.
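
A back-of-the-envelope check of that ~10% figure: dynamic power scales as V², so shrinking the supply guard band from 5% to 1% of nominal saves roughly 7.5% of dynamic power, and leakage, which rises even more steeply with voltage, plausibly closes the gap. A sketch of the arithmetic:

    # The supply is raised by the monitoring uncertainty so that worst-case
    # delivered voltage still meets spec. Dynamic power scales as V^2.
    V_NOM = 1.0

    def dynamic_power_saving(margin_old: float, margin_new: float) -> float:
        v_old = V_NOM * (1 + margin_old)
        v_new = V_NOM * (1 + margin_new)
        return 1 - (v_new / v_old) ** 2

    print(f"dynamic power saving: ~{dynamic_power_saving(0.05, 0.01):.1%}")
    # ~7.5% from dynamic power alone; leakage rises superlinearly with
    # voltage, pushing the total toward the ~10% quoted above.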

N5 is an appealing process for high performance chips. It offers around a 15% speed improvement along with an 80% greater logic density. It also reduces power consumption. However, at the same time the power density per square mm is going up. So dynamic voltage and frequency scaling will increasingly be important for managing energy consumption and thermal behavior. Stephen points out that for every watt saved on-chip, there is a commensurate reduction in facility cooling costs. Hyperscale data centers spend 40% of their operating costs on cooling, so there is even more incentive to lower server power use.

The future of in-chip monitoring looks very interesting, with telemetry facilitating reporting and analysis. Some of the benefits could include enhanced device screening, power optimization, increased performance and extended reliability. Many of these benefits go beyond large data centers and find their way into automotive, consumer and other applications. Moortec has been developing in-chip monitoring solutions since 2010 and has ample experience on a wide range of process nodes, including the most advanced. The presentation was eye-opening regarding the impact of chip-level optimizations on facilities, enterprises, and even worldwide economics and the environment.


Inside the HP Nanoprocessor: A High-speed Processor That Can’t Even Add

by Ken Shirriff on 09-06-2020 at 10:00 am


The Nanoprocessor is a mostly-forgotten processor developed by Hewlett-Packard in 1974 as a microcontroller for their products. Strangely, this processor couldn’t even add or subtract, which is probably why it was called a nanoprocessor and not a microprocessor. Despite this limitation, the Nanoprocessor powered numerous Hewlett-Packard devices ranging from interface boards and voltmeters to spectrum analyzers and data capture terminals. The Nanoprocessor’s key feature was its low cost and high speed: compared against the contemporary Motorola 6800, the Nanoprocessor cost $15 instead of $360 and was an order of magnitude faster for control tasks.

Recently, the six masks used to manufacture the Nanoprocessor were released by Larry Bower, the chip’s designer, revealing details about its design. The composite mask image below shows the internal circuitry of the integrated circuit. The blue layer shows the metal on top of the chip, while the green shows the silicon underneath. The black squares around the outside are the 40 pads for connection to the IC’s external pins. I used these masks to reverse-engineer the circuitry of the processor and understand its simple but clever RISC-like design.

Combined masks from the Nanoprocessor. “GLB”, to the left of the data bus, stands for the designers George Latham and Larry Bower. Files courtesy of Antoine Bercovici.

The Nanoprocessor was designed in 1974, the same year that the classic Intel 8080 and Motorola 6800 microprocessors were announced. However, the Nanoprocessor’s silicon fabrication technology was a few years behind, using metal-gate transistors rather than the silicon-gate transistors that were developed in the late 1960s. This may seem like an obscure difference, but silicon gate technology was much better in several ways. First, silicon-gate transistors were smaller, faster, and more reliable. Second, silicon-gate chips had a layer of polysilicon wiring in addition to the metal wiring; this made chip layouts about twice as dense. Third, metal-gate circuitry required an additional +12 V power supply. The Intel 4004 processor used silicon gates in 1971, so I’m surprised that HP was still using metal gates in 1974.

A bizarre characteristic of the Nanoprocessor is its variable substrate bias voltage. For performance reasons, many 1970s microprocessors applied a negative voltage to the silicon substrate, with -5V provided through a bias pin. The Nanoprocessor has a bias pin, but strangely the bias voltage varied from chip to chip, from -2 volts to -5 volts. During manufacturing, the required voltage was hand-written on the chip (below). Each Nanoprocessor had to be installed with a matching resistor to provide the right voltage. If a Nanoprocessor was replaced on a board, the resistor had to be replaced as well. The variable bias voltage seems like a flaw in the manufacturing process; I can’t imagine Intel making a processor like that.

The HP Nanoprocessor, part number 1820-1691. Note the hand-written voltage “-2.5 V”. The last digit (1) of the part number is also hand-written, indicating the speed of the chip. Photo courtesy of Marc Verdiell.

Like most processors of that era, the Nanoprocessor was an 8-bit processor. However, it didn’t use RAM, but ran code from an external 2-kilobyte ROM. It contained 16 8-bit registers, more than most processors and enough to make up for the lack of RAM in many applications. Based on transistor count, the Nanoprocessor is more complex than the Intel 8008 (1972) and slightly less complex than the 6800 (1974) or 6502 (1975). Its architecture spends its transistor count on different purposes from these processors, though. The Nanoprocessor lacks ALU functionality but, in exchange, it has a large register set, taking up much of the die area. The Nanoprocessor has 48 instructions, a considerably smaller instruction set than the 6800’s 72 instructions. However, the Nanoprocessor includes convenient bit set, clear, and test operations, which these other processors lacked. The Nanoprocessor supports indexed register access, but lacks the complex addressing modes of the other processors.

The block diagram below shows the internal structure of the Nanoprocessor. The main I/O feature is the 4-bit “I/O Instruction Device Select” which allows 15 devices to receive I/O operations. In other words, the select pins indicate which I/O device is being read or written over the data lines. External circuitry uses these signals to do whatever is necessary for the particular application, such as storing the data in a latch, sending it to another system, or reading values. More I/O is provided through seven “Direct Control I/O” pins (GPIO pins) that can be used for inputs or outputs. If not connected to external circuitry, these pins operate as convenient bit flags; the Nanoprocessor can set a value and then read it back. The Control Logic Unit performs increments, decrements, shifts, and bit operations on the accumulator, lacking the arithmetic and logical operations of a standard Arithmetic/Logic Unit (ALU).

Block diagram, from the Nanoprocessor User’s Guide.
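
A tiny model in Python (my own illustration, not HP code) of the operations the Control Logic Unit does support makes the point concrete: increment, decrement, shift, complement, and bit set/clear/test are all here, but there is no add or subtract.

    class NanoAccumulator:
        """Illustrative 8-bit accumulator with the Nanoprocessor's operations."""

        def __init__(self):
            self.acc = 0

        def increment(self):     self.acc = (self.acc + 1) & 0xFF
        def decrement(self):     self.acc = (self.acc - 1) & 0xFF
        def shift_left(self):    self.acc = (self.acc << 1) & 0xFF
        def shift_right(self):   self.acc >>= 1
        def complement(self):    self.acc = ~self.acc & 0xFF
        def set_bit(self, n):    self.acc |= 1 << n
        def clear_bit(self, n):  self.acc &= ~(1 << n)
        def test_bit(self, n):   return bool(self.acc & (1 << n))

    a = NanoAccumulator()
    a.set_bit(3)
    a.increment()
    print(hex(a.acc), a.test_bit(3))   # 0x9 True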

 

I reverse-engineered the Nanoprocessor’s circuitry from the masks and determined how the functional blocks map onto the die, below. The largest feature is the set of 16 registers in the center-left. To the right is the comparator and then the accumulator, along with its increment, decrement, shift, and complement circuitry. The instruction decoder circuitry takes up much of the space above and to the right of the comparator and accumulator. The bottom part of the chip is dominated by the 11-bit program counter, along with the one-entry interrupt stack and subroutine stack. The control circuitry implements the Nanoprocessor’s almost-trivial instruction timing: one fetch cycle followed by one execute cycle. In most microprocessors, the control circuitry takes up a large fraction of the chip, but the Nanoprocessor’s control circuitry is just a small block.

Functional components of the HP Nanoprocessor, based on my reverse-engineering. Underlying die photo by Pauli Rautakorpi, CC BY 3.0.

 

Understanding the masks

The chip was fabricated using six masks, each used for constructing one layer of the processor using photolithography. The photo below shows the masks; each one is a 47.2×39.8 cm Mylar sheet. These sheets are 100× enlargements of the masks used to produce the 4.72×3.98 mm silicon die (for comparison, about 33% smaller than the 6800’s die). Each 3-inch silicon wafer held about 200 integrated circuits, fabricated together on the wafer, and then tested, cut apart, and packaged.

The chip’s masks, courtesy of Antoine Bercovici

 

To explain the role of the masks, I’ll start with the structure of a metal-gate MOSFET, the transistor used in the Nanoprocessor. At the bottom, two regions of silicon (green) are doped to make them conductive, forming the source and drain of the transistor. A metal strip in between forms the gate, separated from the silicon by a thin layer of insulating oxide. (These layers—Metal, Oxide, Semiconductor—give the MOS transistor its name.) The transistor can be considered a switch controlled by the gate. The metal layer also provides the main wiring of the integrated circuit, although the silicon layer is also used for some wiring.

Structure of a metal-gate MOSFET.

 

Masks are a key part of the integrated circuit construction process, specifying the position of the components. The diagram below shows how a mask is used to dope regions of the silicon. First, the silicon wafer is oxidized to form an insulating oxide layer on top, and then light-sensitive photoresist is applied. Ultraviolet light polymerizes and hardens the photoresist, except where the mask blocks the light. Next, the soft, unexposed photoresist is dissolved. The wafer is exposed to hydrofluoric acid, which removes the oxide layer where it is not protected by photoresist. This yields holes in the oxide that match the mask pattern. The wafer is then exposed to a high-temperature gas which diffuses into the unprotected silicon regions, modifying the silicon’s conductivity. These processing steps create tiny doped silicon regions matching the mask’s pattern. As will be shown below, the other masks are used for different processing steps, but using the same photoresist-and-mask process.

How a photomask is used to dope regions of silicon.

 

I’ll zoom in on the Nanoprocessor’s die and show how one of its circuits is constructed from the six masks. (This two-transistor circuit is an inverter, flipping the binary value of its input.) The first mask dopes regions of silicon to make them conductive, using the photolithography steps described above. The doped regions (green) will become transistor source/drains or wiring between components.

The first mask creates conductive silicon regions.

Next, the die is covered with an insulating oxide layer. The second mask (magenta) is used to etch openings in the oxide, exposing the silicon underneath. These openings will be used to create transistor gates as well as connecting metal wiring to the silicon.

The second mask creates openings in the oxide layer.

The third mask (gray) exposes a region to ion implantation, which changes the doping of the silicon, and thus the transistor’s properties. This turns the upper transistor into a special depletion-mode transistor that pulls logic gate outputs high.

The third mask is used to increase the doping of the upper transistor.

 

Next, the silicon is covered with an additional thin layer of insulating oxide, forming the gate oxide for the transistors. The fourth mask (orange) removes this oxide from regions that will become contacts between the silicon and the metal layer. After this step, most of the die is covered with a thick insulating oxide layer. The oxide layer is very thin over the transistor gates (magenta), and there are contact holes in the oxide from the current mask (orange).

The fourth mask creates holes in the oxide.

 

The fifth mask (blue) is used to create the metal wiring on top; a uniform metal layer is applied and then the undesired parts are etched off. In locations where the fourth mask created holes in the oxide, the metal layer contacts the silicon and forms a connection. In locations where the second mask created openings (now covered by the thin gate oxide), the metal layer forms the transistor gate between two silicon regions. Finally, the entire wafer is covered with a protective glassy layer. The sixth mask (not shown) is used to form holes in this layer over the pads around the edges of each chip. Once the wafer is cut into individual silicon dies (dice?), bond wires are attached to the pads, connecting the die to the external pins.

The fifth mask creates the metal wiring.

 

The schematic below shows how the circuitry above forms a two-transistor inverter. The two transistor symbols correspond to the two transistors created by the masks. When the input is low, the upper transistor (connected to +5 volts) pulls the output high. When the input is high, it turns on the lower transistor, which connects the output to ground, pulling the output low. Thus, the circuit inverts its input.

Schematic of an NMOS inverter, corresponding to the masks above.
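As a toy model of this behavior, here’s a Python sketch (logic-level only; it ignores the analog fight between the pull-up and pull-down transistors):

def nmos_inverter(input_high):
    # The depletion-mode pull-up always tugs the output toward +5 V; a high
    # input turns on the pull-down transistor, which wins and grounds the output.
    pulldown_on = input_high
    return not pulldown_on      # output is high exactly when the input is low

assert nmos_inverter(False) is True    # low input: output pulled high
assert nmos_inverter(True) is False    # high input: output pulled low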

 

Although the diagrams above show just a single inverter, these masking steps create the entire processor with its 4639 transistors.11 The diagram below shows a larger part of the die with dozens of transistors forming more complex gates and circuitry. One cute thing I noticed on the masks is a tiny heart with HP inside, below the chip’s number.14

Chip art: HP inside a heart, below the part number 9-4332A

Controlling a clock with the Nanoprocessor

To understand how the Nanoprocessor was used in practice, I reverse-engineered the code from an HP 98035 clock module. This module was plugged into an HP desktop computer15 to provide a real-time clock, as well as millisecond-accurate timings, intervals, and periodic events. The design of the clock module was rather unusual. To preserve the time when the computer was powered down, the clock module was built around a digital watch chip with a backup battery.17 Inconveniently, the digital watch chip wasn’t designed for computer control: it generated 7-segment signals to drive an LED display, and it was set through three buttons. To read the time, the Nanoprocessor had to convert the 7-segment display outputs back into digits. And to set the time, the Nanoprocessor had to simulate the right sequence of button presses to advance through the digits.

Nanoprocessor (white chip) as part of an HP clock module. The 2-kilobyte ROM is to the left of the Nanoprocessor. The two 256×4 RAM chips are to the right. The Texas Instruments clock chip is the large black chip below the green NiCad battery. Photo courtesy of Marc Verdiell.

 

The host computer controlled the clock module by sending it ASCII strings such as “S 12:07:12:45:00” to set the clock to 12:45:00 on December 7 (or on July 12 if the module was running in European mode). The module’s various interval timers, periodic alarms, and counters were controlled with similar commands such as “Unit 2 Period 12345”. The module supported 24 different commands, and the Nanoprocessor had to parse them. (See the manual for details.)

Here’s some sample code reverse-engineered from the clock board ROM. This code is from the interrupt handler that increases the time and date every second. The code below determines how many days in the current month so it knows when to move to the next month. The columns are the byte value, the corresponding opcode, and my description of the instruction.

d0 STR-0  Store the next byte (7) in register 0
07
0c SLE    Skip two instruction bytes if accumulator <= register 0
03 DED    Decrement the accumulator in decimal mode
5f NOP    No operation
d0 STR-0  Store the next byte (0x31) in register 0
31
30 SBZ-0  Skip two instruction bytes if accumulator bit 0 is zero
81 JMP-1  Jump to 0x1c9 (end of this code block)
c9
a1 CBN-1  Clear accumulator bit 1
d0 STR-0  Store the next byte (0x30) in register 0
30
0f SAN    Skip two instruction bytes if accumulator not zero
d0 STR-0  Store the next byte (0x28) in register 0
28

 

This code takes a month number (01-12 BCD) in the accumulator and returns (in register 0) the number of days in the month (28, 30, or 31 BCD). Not bad for 16 bytes of code, even if it ignores leap years. How does it work? For months past 7 (July), it subtracts 1. Then, if the month is odd, it has 31 days, while an even month has 30 days. To handle February, the code clears bit 1 of the month. If the month is now 0 (i.e. February), it has 28 days.
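For readers who’d like to trace the trick, here’s a minimal Python sketch of the same algorithm, using plain integers rather than the BCD values the real code manipulates:

def days_in_month(month):
    # Sketch of the ROM routine above: month is 1-12, result is 28/30/31.
    # Leap years are ignored, just as in the original.
    if month > 7:       # SLE/DED: shift August-December down by one so they
        month -= 1      # follow the same odd/even rule as January-July
    if month & 1:       # SBZ-0/JMP: odd months have 31 days
        return 31
    month &= ~0b10      # CBN-1: clear bit 1, so February (2) becomes 0
    return 28 if month == 0 else 30   # SAN: only February reaches zero here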

This code demonstrates that even though a processor without addition sounds useless, the Nanoprocessor’s bit operations and increment/decrement allow more computation than you’d expect.16 It also shows that Nanoprocessor code is compact and efficient. Many things can be done in a single byte (such as bit test and skip) that would take multiple bytes on other processors.12 The Nanoprocessor’s large register file also avoids much of the tedious shuffling of data back and forth often required in other processors. Although some call the Nanoprocessor more of a state machine controller than a microprocessor, that understates the capabilities and role of the Nanoprocessor.

While the Nanoprocessor doesn’t include an ALU or have instructions for accessing RAM, these could be added as I/O devices. The clock module has 256 bytes of RAM to hold its multiple counter and timer values, accessed through four I/O ports. Other products added ALU chips to support arithmetic operations.18
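As a rough illustration of the RAM scheme described in note 18, here’s a hedged Python sketch of memory behind I/O ports. The port numbers are invented; on the real board the decoding is handled by a 3-to-8 decoder and data latches.

class PortMappedRam:
    # Sketch of the clock module's RAM-behind-I/O-ports arrangement.
    def __init__(self):
        self.ram = [0] * 256
        self.addr = 0
        self.latch = 0

    def write_port(self, port, value):   # processor output instruction
        if port == 0:
            self.addr = value            # read, step 1: latch the address
        elif port == 1:
            self.latch = value           # write, step 1: latch the data byte
        elif port == 2:
            self.ram[value] = self.latch # write, step 2: address strobe commits the byte

    def read_port(self, port):           # processor input instruction
        if port == 3:
            return self.ram[self.addr]   # read, step 2: fetch from the latched address
        return 0

ram = PortMappedRam()
ram.write_port(1, 0x42)            # write 0x42...
ram.write_port(2, 0x10)            # ...to address 0x10
ram.write_port(0, 0x10)            # read: send the address...
assert ram.read_port(3) == 0x42    # ...then fetch the value back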

Conclusions

The Nanoprocessor is an unusual processor. My first impression was that it wasn’t even a “real processor”, lacking basic arithmetic functionality. The chip was built with obsolete metal-gate technology, a few years behind other microprocessors. Most bizarrely, each chip required a different voltage, hand-written on the package, suggesting difficulty with manufacturing consistency. However, the Nanoprocessor provided high performance in its microcontroller role, much faster than other processors at the time. Hewlett-Packard used the Nanoprocessor in many products in the 1970s and 1980s, in roles that were more complex than you’d expect, such as parsing command strings and performing calculations.

While the Nanoprocessor has languished in obscurity, without even a mention on Wikipedia, the masks recently revealed by its designer shed light on this unusual corner of processor history. Thanks to Antoine Bercovici for scanning and remastering the masks, Larry Bower for the donation, and John Culver at The CPU Shack for sharing the donation. Thanks to Marc Verdiell for dumping the clock board ROM.

I plan to write about the internal circuitry of the Nanoprocessor, so follow me on Twitter at @kenshirriff for updates on Part II. I also have an RSS feed.

Notes and references

  1. More information on the HP Nanoprocessor and its history is in CPU Shack’s recent article The Forgotten Ones: HP Nanoprocessor, as well as at HP9825.com and The HP 9845 Project
  2. I’m not completely comfortable calling the Nanoprocessor a microcontroller since it uses an external program ROM, while a microcontroller usually has everything, including the ROM, on a single chip. (It is like the Intel 4004 in this way.) However, the Nanoprocessor resembles a microcontroller in most ways: it is designed for embedded control applications, with a Harvard architecture and an instruction set optimized for I/O, running a program from ROM with minimal storage. 
  3. On the topic of computers that can’t add, the desk-sized IBM 1620 computer (1959) didn’t have addition circuitry, but used table lookup for addition. It had the codename CADET; people joked that this stood for “Can’t Add, Doesn’t Even Try.” 
  4. I’ve determined that the Nanoprocessor was used in the following HP products (and probably others): HP 9845B, HP 3585A spectrum analyzer, HP 3325A Synthesizer / Function Generator, HP 9885 floppy disk drive, HP 3070B data capture terminal, HP 98034 HPIB interface for the HP 9825 calculator, HP 98035 real time clock for the HP 9825 desktop computer, HP 7970E tape drive interface, HP 4262A LCR meter, HP 3852 Spectrum Analyzer, and HP 3455A voltmeter. 
  5. The mask images can be downloaded here (warning: 122 MB PSD file). 
  6. The Nanoprocessor is like a RISC (Reduced Instruction Set Computer) processor in many ways, although it predated the RISC concept by several years. In particular, the Nanoprocessor is designed with a simple opcode structure, all instructions execute in one cycle (after the fetch cycle), the register set is large and orthogonal, and addressing is simple. These RISC characteristics yielded a high clock speed compared to more complex processors. 
  7. Interestingly, the Nanoprocessor’s competition during development was the Motorola 6800, rather than an Intel processor. The Nanoprocessor’s key feature was performance: it ran at 4 MHz, compared to 1 MHz for the 6800. (Both processors took 2 cycles to perform a basic instruction, while the 6800 took up to 7 cycles for more complex instructions.) The Nanoprocessor designers wrote a timing comparison, estimating that the Nanoprocessor could count six times faster than the 6800 and handle interrupts over sixteen times faster. The proposal assumed a 5 MHz Nanoprocessor while the actual chip fell a bit short, running at 4 MHz. The projected cost of the Nanoprocessor was $15 per chip, compared to $360 for the Motorola 6800. 
  8. I’m impressed with the density of the Nanoprocessor’s layout given its limitations: one layer of metal wiring and no polysilicon. I’ve looked at other metal-gate chips and their layouts are horribly inefficient, with a lot more wiring than transistors. However, the Nanoprocessor’s circuits are arranged efficiently, with very little wasted space. 
  9. The Nanoprocessor’s fabrication technology was ahead of the Intel 8080 and Motorola 6800 in one way: it used depletion-mode pull-up transistors, more advanced than the enhancement-mode transistors in the 8080 and 6800. Depletion-mode transistors resulted in faster, lower-power logic gates, but required an additional manufacturing step. For the Nanoprocessor, this step used mask #3 (the gray mask). In processors such as the MOS Technology 6502 and Zilog Z-80, depletion-mode transistors allowed the processor to run off a single voltage instead of three. Unfortunately, the Nanoprocessor still required three voltages due to its metal-gate transistors. 
  10. Early DRAM memory chips and microprocessor chips often required three supplies: +5V (Vcc), +12V (Vdd) and -5V (Vbb) bias voltage. In the late 1970s, improvements in chip technology allowed a single supply to be used instead. The Intel 8080 microprocessor (1974) used enhancement-mode transistors and required three voltages, but the improved 8085 (1976) used depletion-mode transistors and was powered by a single +5V supply. Starting in the late 1970s, many microprocessors used an on-chip charge pump to generate the negative bias voltage. I wrote about the 8086’s charge pump here
  11. By my count, the Nanoprocessor has 4639 transistors. The instruction decoder is constructed from pairs of small transistors for layout reasons; combining these pairs yields 3829 unique transistors. Of these, 1061 act as pull-ups, while 2668 are active. In comparison, the 6502 has 4237 transistors, of which 3218 are active. The 8008 has 3500 transistors and the Motorola 6800 has 4100 transistors. 
  12. Early microprocessors didn’t have bit set, reset, and test operations (although these could be accomplished with AND and OR). The Z-80 (1976) added bit operations, but they took two bytes and were much slower than the Nanoprocessor. 
  13. The Nanoprocessor sticks to its model of executing the instruction in one cycle even for two-byte instructions: The second byte is fetched during the execute cycle, so the overall timing is unchanged. 
  14. The Nanoprocessor has two different part numbers. The 1820-1691 was the 2.66 MHz version, while the 1820-1692 was the 4 MHz version. The last digit of the part number was hand-written on each chip after testing its performance. (The part number is unrelated to the chip’s number 9-4332A on the die.) 
  15. The HP 9825 was a 16-bit desktop computer, running a BASIC-like language. It was introduced in 1976, five years before the IBM PC, and was a remarkably advanced system for its time. The back of the HP 9825 had three I/O slots for adding modules such as the real time clock.
    An HP 9825 with tape drive, LED display, and printer. From Marc Verdiell’s collection.

  16. I came across one place in the code where it needs to add two BCD digits to form one byte. This was accomplished by a loop that decremented one number while incrementing the second. When the first number reached zero, the result was the sum. Thus, even without an ALU, addition is possible but slow. 
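    A minimal Python sketch of the idea (plain integers standing in for the BCD values):

    def add_by_counting(a, b):
        # Decrement one value while incrementing the other; when the first
        # hits zero, the second holds the sum. Slow, but it needs no ALU.
        while a != 0:
            a -= 1
            b += 1
        return b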
  17. The Texas Instruments watch chip was implemented with Integrated Injection Logic (I2L) to keep power consumption low. Nowadays, a low-power chip would use CMOS, but that wasn’t common at the time. Integrated Injection Logic was built from bipolar transistors, similar to TTL, but using different high-density, low-power circuitry. I discussed Integrated Injection Logic in detail in this blog post. The Texas Instruments chip may be the X-902 in a DIP package. 
  18. The clock board schematic shows how the two 256×4 RAM chips are connected to the Nanoprocessor. The Nanoprocessor’s I/O port select pins are connected to the “3-8 Decoder” U5, which produces a separate signal for each I/O port. Three of these signals go to the RAM chip’s control pins, while one signal controls the Data Latch chips U9 and U10 that hold write data.
    RAM chips connected to the Nanoprocessor. From the Clock service manual.

     

    All I/O ports use the Nanoprocessor’s data bus (top) for communication, so the data bus is connected to both the address and data pins of the RAM chips. For a read, the RAM address is written to the RAM chips via one I/O port and then the data is read from RAM via a second port. In both cases, the values go across the data bus, while the signal from the “3-8 Decoder” indicates what to do with the values. For a write, the first I/O operation stores the byte value in the latches, and then the second I/O operation sends the address to the RAM chips. While this may seem like a clunky, Rube-Goldberg approach, it works well in practice; a read or write can be done with two bytes of instructions.

    (Many processors, such as the 6502, used memory-mapped I/O; I/O devices were mapped into the memory address space and accessed through memory read/write operations. The Nanoprocessor is the opposite, putting RAM into the I/O port space and accessing it through I/O operations.)

    Adding an ALU uses a similar approach, as in the HP 3455A voltmeter (schematic), which contains two Nanoprocessors. The voltmeter uses two 74LS181 ALU chips to implement an 8-bit ALU, which it uses to scale values and compute percentage errors. Two output ports provide the arguments and another port specifies the operation. The 8-bit result is read from a port, while the processor reads the carry through a GPIO pin. (At this point, I’d wonder if it wouldn’t be better to use a processor that includes arithmetic.) 
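    In rough Python pseudocode, the sequence might look like this sketch (the port numbers and the write_port/read_port/read_gpio helpers are invented for illustration):

    def alu_via_ports(proc, a, b, op):
        # Two output ports carry the operands, a third selects the 74LS181
        # operation, the result comes back on an input port, and the carry
        # arrives on a general-purpose input pin.
        proc.write_port(0, a)        # first operand
        proc.write_port(1, b)        # second operand
        proc.write_port(2, op)       # operation select
        result = proc.read_port(3)   # 8-bit result
        carry = proc.read_gpio(7)    # carry flag
        return result, carry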


PCI Express in Depth – Transaction Layer

PCI Express in Depth – Transaction Layer
by Luigi Filho on 09-06-2020 at 7:00 am

PCI Express in Depth Transaction Layer

In the last article I wrote about the Data Link Layer; in this article I’ll write about the Transaction Layer.

This layer’s primary responsibility is to create PCI Express request and completion transactions. It has both transmit functions for outgoing transactions and receive functions for incoming transactions.

The Transaction Layer uses TLPs to communicate request and completion data with other PCI Express devices. TLPs may address several address spaces and have a variety of purposes. Each TLP has a header associated with it to identify the type of transaction.

I’ll explain two main things: the Transaction Layer Packet (TLP) and TLP handling.

Transaction Layer Packet

A generic TLP is shown in the figure below:

A TLP consists of a header, an optional data payload, and an optional TLP digest. The Transaction Layer generates outgoing TLPs based on the information it receives from its device core. The Transaction Layer then passes the TLP on to its Data Link Layer for further processing. The Transaction Layer also accepts incoming TLPs from its Data Link Layer.
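To make the structure concrete, here’s a small Python sketch of a TLP as a container (a sketch only; the header layout and size accounting are simplified):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tlp:
    # The three parts of a TLP: header, optional data payload, optional digest.
    header: List[int]                    # 3 or 4 DWords, depending on the transaction type
    payload: Optional[List[int]] = None  # DWords of data, if the packet carries any
    digest: Optional[int] = None         # 32-bit ECRC, if end-to-end protection is enabled

    def validate(self, max_payload_dwords: int) -> None:
        assert len(self.header) in (3, 4), "TLP header must be 3 or 4 DWords"
        if self.payload is not None:
            # The payload must not exceed the device's Max_Payload_Size setting.
            assert len(self.payload) <= max_payload_dwords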

TLP Headers

All TLPs consist of a header that contains the basic identifying information for the transaction. The TLP header may be either 3 or 4 DWords in length, depending on the type of transaction.

The format of the first DWord is shown in the figure below.

To keep the article from getting too long, I won’t cover each bit, but if you’d like, leave a comment with this request and I’ll cover it bit by bit.
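Still, as a taste, here’s a sketch that pulls a few well-known fields out of that first DWord (bit positions follow the PCIe 3.0 base specification’s DW0 layout; treat this as illustrative rather than a complete decoder):

def decode_dw0(dw0: int) -> dict:
    # Extract a few identifying fields from a TLP header's first DWord.
    return {
        "fmt":    (dw0 >> 29) & 0x7,   # header length (3 or 4 DW) and data presence
        "type":   (dw0 >> 24) & 0x1F,  # transaction type
        "tc":     (dw0 >> 20) & 0x7,   # traffic class
        "td":     (dw0 >> 15) & 0x1,   # 1 if a TLP digest (ECRC) is attached
        "ep":     (dw0 >> 14) & 0x1,   # 1 if the payload is poisoned
        "attr":   (dw0 >> 12) & 0x3,   # ordering and snoop attributes
        "length": dw0 & 0x3FF,         # data payload length in DWords
    }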

TLP Data Payload

Whether or not a TLP contains a data payload depends on the type of packet. If present, the data payload is DWord-aligned for both the first and last DWord of data.

The data payload for a TLP must not exceed the maximum allowable payload size, as defined in the device’s control register (and more specifically, the Max_Payload_Size field of that register).

TLP Digest

The Data Link Layer provides the basic data reliability mechanism within PCI Express via the use of a 32-bit LCRC. This LCRC code can detect errors in TLPs on a link-by-link basis and allows for a retransmit mechanism for error recovery.

To ensure end-to-end data integrity, the TLP may contain a digest that has an end-to-end CRC. This optional field protects the contents of the TLP through the entire system, and can be used in systems that require high data reliability. The Transaction Layer of the source component generates the 32-bit ECRC.
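As a rough illustration (zlib’s CRC-32 happens to share the ECRC’s generator polynomial, but the real calculation has its own bit-ordering rules and treats certain header bits as 1, details this sketch glosses over):

import zlib

def attach_ecrc(tlp_bytes: bytes) -> bytes:
    # Append a 32-bit end-to-end check value to the TLP's bytes.
    ecrc = zlib.crc32(tlp_bytes) & 0xFFFFFFFF
    return tlp_bytes + ecrc.to_bytes(4, "big")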

TLP Handling

A TLP that makes it through the Data Link Layer has been verified to have traversed the link properly, but that does not necessarily mean that the TLP is correct. A TLP may make it across the link intact, yet have been improperly formed by its originator. As such, the receiver side of the Transaction Layer performs some checks on the TLP to make sure it has followed the rules. If the incoming TLP does not check out properly, it is considered a malformed packet: it is discarded (without updating receiver flow control information) and generates an error condition. If the TLP is legitimate, the Transaction Layer updates its flow control tracking and continues to process the packet.

A flowchart is shown below.
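In code form, the receive-side decision might look like this sketch (is_well_formed, report_error, dispatch, and the flow-control object are hypothetical stand-ins for logic the spec defines in detail):

def handle_incoming_tlp(tlp, flow_control):
    # Receive-side checks, as described above.
    if not tlp.is_well_formed():          # hypothetical format check
        report_error("malformed TLP")     # hypothetical error hook
        return                            # discarded; flow control is NOT updated
    flow_control.update(tlp)              # legitimate TLP: account for its credits
    dispatch(tlp)                         # hand off to request/completion handling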

Request Handling

If the TLP is a request packet, the Transaction Layer first checks to make sure that the request type is supported. If it is not supported, it generates a non-fatal error and notifies the root complex.

I won’t go into the details, but a flowchart of this process is shown below.

Completion Handling

If a device receives a completion that does not correspond to any outstanding request, that completion is referred to as an unexpected completion. Receipt of an unexpected completion causes the completion to be discarded and results in a non-fatal error condition.
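A sketch of that check (completions are matched to requests by transaction ID; the tag lookup and helper names here are illustrative):

def handle_completion(completion, outstanding):
    # A completion whose tag matches no outstanding request is "unexpected".
    request = outstanding.pop(completion.tag, None)  # hypothetical tag lookup
    if request is None:
        report_error("unexpected completion")        # non-fatal; the packet is discarded
        return
    deliver(completion, request)                     # hypothetical hand-off to the requester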

The receipt of unsuccessful completion packets generates an error condition that depends on the completion status. I won’t cover the details of how successful completions are handled and how they affect the flow control logic, because this article is getting too long.

But as always, if you want me to cover them, leave a comment.

Remember that this series was suggested by a reader.

With this post I’ll end the PCIe in Depth series. If you missed an article, you can check the first one, the Physical Layer, and the Data Link Layer.