In-Memory Computing for Low-Power Neural Network Inference

by Tom Dillinger on 07-17-2020 at 10:00 am

“AI is the new electricity,” according to Andrew Ng, Professor at Stanford University.  The potential applications for machine learning classification are vast.  Yet, current ML inference techniques are limited by the high power dissipation associated with traditional architectures.  The figure below highlights the von Neumann bottleneck.  (A von Neumann architecture separates the unit that executes the program from the memory that stores its data.)

The power dissipation associated with moving neural network data – e.g., inputs, weights, and intermediate results for each layer – often far exceeds the power dissipation to perform the actual network node calculation, by 100X or more, as illustrated below.

A general diagram of a (fully-connected, “deep”) neural network is depicted below.  The fundamental operation at each node of each layer is the “multiply-accumulate” (MAC) of the node inputs, node weights, and bias.  The layer output is given by:   [y] = [W] * [x] + [b], where [x] is a one-dimensional vector of inputs from the previous layer, [W] is the 2D set of weights for the layer, and [b] is a one-dimensional vector of bias values.  The results are typically filtered through an activation function, which “normalizes” the input vector for the next layer.

For a single node, the equation above reduces to:

yi = SUM(W[i, 1:n] * x[1:n]) + bi
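As a minimal sketch of the equation above, the multiply-accumulate for one layer can be written in a few lines of Python. The function name `dense_layer` and the sample values are illustrative assumptions, not taken from the presentation.

```python
def dense_layer(W, x, b):
    """Compute y = W * x + b for one fully-connected layer.

    W is a list of weight rows, x the input vector, b the bias vector.
    """
    y = []
    for row, bias in zip(W, b):
        acc = bias                      # start the accumulator at the bias
        for w, xi in zip(row, x):
            acc += w * xi               # one multiply-accumulate (MAC)
        y.append(acc)
    return y


# Illustrative values only
W = [[1.0, 2.0], [0.5, -1.0]]
x = [3.0, 4.0]
b = [0.5, 0.0]
y = dense_layer(W, x, b)
```

Each inner-loop step is one MAC; a layer with m nodes and n inputs needs m * n of them, and every one touches a weight that has to be fetched from memory.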

For CPU, GPU, or neural network accelerator hardware, each datum is represented by a specific numeric type – typically, 32-bit floating point (FP32).  The FP32 MAC computation in the processor/accelerator is power-optimized.  The data transfer operations to/from memory are the key dissipation issue.

An active area of neural network research is to investigate architectures that reduce the distance between computation and memory.  One option utilizes a 2.5D packaging technology, with high-bandwidth memory (HBM) stacks integrated with the processing unit.  Another nascent area is to investigate in-memory computing (IMC), where some degree of computation is able to be completed directly in the memory array.

Additionally, data scientists are researching how to best reduce the data values to a representation more suitable to very low-power constraints – e.g., INT8 or INT4, rather than FP32.  The best-known neural network example is the MNIST application for (0 through 9) digit recognition of hand-written numerals (often called the “Hello, World” of neural network classification).  The figure below illustrates very high accuracy achievable on this application with relatively low-precision integer weights and values, as applied to the 28×28 grayscale pixel images of handwritten digits.

One option for data type reduction would be to train the network with INT4 values from the start.  Yet, the typical (gradient descent) back-propagation algorithm that adjusts weights to reduce classification errors during training is hampered by the coarse resolution of the INT4 value.  A promising research avenue would be to conduct training with an extended data type, then quantize the network weights (e.g., to INT4) for inference usage.  The new inference data type values from quantization could be signed or unsigned (with an implicit offset).
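The train-then-quantize flow can be sketched as follows. This assumes a simple symmetric scheme with one scale factor per weight vector, which is only one of several possible quantization methods; the function names and sample weights are hypothetical.

```python
def quantize_int4(weights):
    """Symmetric post-training quantization of floats to signed INT4 (-8..7)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def to_unsigned(q):
    """Re-express signed INT4 values as unsigned 0..15 with an implicit offset of 8."""
    return [v + 8 for v in q]


def dequantize(q, scale):
    """Recover approximate float weights from the signed INT4 values."""
    return [v * scale for v in q]


# Illustrative trained weights
weights = [3.5, -1.5, 0.6, 0.0]
q, scale = quantize_int4(weights)
q_unsigned = to_unsigned(q)
approx = dequantize(q, scale)
```

The difference between `weights` and `approx` is the quantization error that the inference accuracy has to tolerate.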

IMC and Advanced Memory Technology

At the recent VLSI 2020 Symposium, Yih Wang, Director in the Design and Technology Platform Group at TSMC, gave an overview of areas where in-memory computing is being explored to support deep neural network inferencing.[1]   Specifically, he highlighted an example of IMC-based SRAM fabrication in 7nm that TSMC recently announced.[2]  This article summarizes the highlights of his presentation.

SRAM-based IMC

The figure below illustrates how a binary multiply operation could be implemented in an SRAM.  The “product” of an input value and a weight bit value is realized by accessing a wordline transistor (input) and a bit-cell read transistor (weight).  Only in the case where both values are ‘1’ will the series device connection conduct current from the (pre-charged) bitline, for the duration of the wordline input pulse.

In other words, the ‘1’ times ‘1’ product results in a voltage change on the bitline, dependent upon the Ids current, the bitline capacitance, and the duration of the wordline ‘1’ pulse.

The equation for the output value yi above requires a summation across the full dimension of the input vector and a row of the weight matrix.   Whereas a conventional SRAM memory read cycle activates only a single decoded address wordline, consider what happens when every wordline corresponding to an input vector bit value of ‘1’ is raised.  The figure above also presents an equation for the total bitline voltage swing as dependent on the current from all (‘1’ * ‘1’) input and weight products.
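The equation in the figure is not reproduced here, but the idealized relationship described in the text can be sketched as below. It assumes every conducting cell draws the same constant current for the full pulse, and the component values are hypothetical.

```python
def active_products(inputs, weights):
    """Count the cells where both the input bit and the weight bit are 1."""
    return sum(x & w for x, w in zip(inputs, weights))


def bitline_swing(inputs, weights, i_cell, t_pulse, c_bitline):
    """Idealized bitline voltage drop: N * I * t / C.

    Assumes each ('1' * '1') cell contributes the same constant current
    i_cell (amps) for t_pulse (seconds) into c_bitline (farads).
    """
    return active_products(inputs, weights) * i_cell * t_pulse / c_bitline


# Hypothetical values: 10 uA per cell, 1 ns pulse, 50 fF bitline
inputs = [1, 0, 1, 1]
weights = [1, 1, 0, 1]
dv = bitline_swing(inputs, weights, 10e-6, 1e-9, 50e-15)
```

With these numbers two cells conduct, so the swing is twice the single-cell drop of 0.2 V.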

Another view of the implementation of the dot product with an SRAM array is shown below.  Note that there are two sets of wordline drivers – one set for the neural network layer input vector, and one set for normal SRAM operation (e.g., to write the weights into the array).

Also, the traditional CMOS six-transistor (6T) bit cell is designed for a single active wordline (with restoring sense amplification for data and data_bar).  For the dot product calculation where many input wordlines could be active, an 8T cell with separate Read bitline from Write bitlines is required – the voltage swing equation above applies to the current discharging this distinct Read bitline.

The figures above are simplified, as they illustrate the vector product using ‘1’ or ‘0’ values.  As mentioned earlier, the quantized data types for low power inference are likely greater than one bit, such as INT4.  The implementation used by TSMC is unique.  The 4-bit value of the input vector entry is represented as a series of 0 to 15 wordline pulses, as illustrated below.  The cumulative discharge current on the Read bitline represents the contribution from all input pulses on each wordline row.

The multiplication product output is also an INT4 value.  The four output signals use separate bitlines – RBL[3] through RBL[0] – as shown below.  When the product is being calculated, the pre-charged bitlines are discharged as described above.  The total capacitance on each bitline is the same – e.g., “9 units” – the parallel combination of the computation and compensation capacitances.

After the bitline discharge is complete, the compensation capacitances are disconnected.  Note the positional weights of the computation capacitances – i.e., RBL[3] has 8 times the capacitance of RBL[0].  The figure below shows the second phase of evaluation, when the four Read bitlines are connected together.  The “charge sharing” across the four line capacitances implies that the contribution of the RBL[3] line is 8 times greater than RBL[0], representing its binary power in a 4-bit multiplicand.

In short, a vector of 4-bit input values – each represented as 0-15 pulses on a single wordline – is multiplied against a vector of 4-bit weights, and the total discharge current is used to produce a single (capacitive charge-shared) voltage at the input to an Analog-to-Digital converter.  The ADC output is the (normalized) 4-bit vector product, which is input to the bias accumulator and activation function for the neural network node.
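A behavioral sketch of the two-phase evaluation may help. It is a reading of the description above, not the TSMC circuit: discharge is counted in arbitrary units, each weight bit k is assumed to sit on RBL[k], and the capacitance ratios 1:2:4:8 follow the text.

```python
CAPS = [1, 2, 4, 8]  # computation capacitance units for RBL[0]..RBL[3]


def rbl_discharge(inputs, weights, bit):
    """Phase 1: discharge units on one read bitline.

    Each input contributes (number of pulses) * (stored weight bit).
    """
    return sum(x * ((w >> bit) & 1) for x, w in zip(inputs, weights))


def charge_shared_sum(inputs, weights):
    """Phase 2: connect the four computation capacitances together.

    The shared voltage drop is the capacitance-weighted average of the
    per-bitline drops, so RBL[3] counts 8 times as much as RBL[0].
    """
    drops = [rbl_discharge(inputs, weights, k) for k in range(4)]
    return sum(c * d for c, d in zip(CAPS, drops)) / sum(CAPS)


# Illustrative 4-bit inputs (pulse counts) and 4-bit weights
inputs = [3, 15, 0, 7]
weights = [5, 2, 9, 12]
shared = charge_shared_sum(inputs, weights)
dot = sum(x * w for x, w in zip(inputs, weights))
```

In this model `shared` equals the true dot product divided by 15 (the sum of the capacitance units), which is the normalized analog value the ADC would then digitize.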

Yih highlighted the assumption that the bitline current contribution from each active (‘1’ * ‘1’) product is the same – i.e., all active series devices will contribute the same (saturated) current during the wordline pulse duration.  In actuality, if the bitline voltage drops significantly during evaluation, the Ids currents will be less, operating in the linear region.  As a result, the quantization of a trained deep NN model will need to take this non-linearity into account when assigning weight values.  The figure below indicates that a significant improvement in classification accuracy is achieved when this corrective step is taken during quantization.
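To illustrate why the current falls once the bitline voltage drops, here is a sketch using the textbook long-channel square-law transistor model. That model is an assumption made for illustration; a 7nm FinFET does not follow it closely, and the threshold voltage and gain factor below are hypothetical.

```python
def ids(vgs, vds, vt=0.3, k=1.0):
    """Drain current from the long-channel square-law model (arbitrary units)."""
    vov = vgs - vt                      # overdrive voltage
    if vov <= 0:
        return 0.0                      # device is off
    if vds >= vov:
        return 0.5 * k * vov ** 2       # saturation: current independent of vds
    return k * (vov * vds - 0.5 * vds ** 2)  # linear region: current falls with vds


i_sat = ids(0.8, 0.8)   # bitline still near its pre-charged level
i_lin = ids(0.8, 0.2)   # bitline has discharged substantially
```

In this model the same cell delivers less current late in the evaluation than early on, so equal (‘1’ * ‘1’) products no longer produce equal voltage steps.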

IMC with Non-volatile Memory (NVM)

In addition to using CMOS SRAM bit cells, Yih highlighted that an additional area of research is to use a Resistive-RAM (ReRAM) bit cell array to store weights, as illustrated below.  The combination of an input wordline transistor pulse with a high-R or low-R resistive cell defines the resulting bitline current.   (Ideally, the ratio of the high resistance state to the low resistance state is very large.)  Although similar to the SRAM operation described above, the ReRAM array would offer much higher bit density.  Also, further fabrication research into the potential for one ReRAM bit cell to have more than two non-volatile resistive states offers even greater neural network density.

Summary

Yih’s presentation provided insights into how the architectural design of memory arrays could readily support In-Memory Computing, such as the internal product of inputs and weights fundamental to each node of a deep neural network.  The IMC approach provides a dense and extremely low-power alternative to processor plus memory implementations, with the tradeoff of quantized data representation.   It will be fascinating to see how IMC array designs evolve to support the “AI is the new electricity” demand.

-chipguy

 

References

[1]  Yih Wang, “Design Considerations for Emerging Memory and In-Memory Computing”, VLSI 2020 Symposium, Short Course 3.8.

[2]  Dong, Q., et al., “A 351 TOPS/W and 372.4 GOPS Compute-in-Memory SRAM Macro in 7nm FinFET CMOS for Machine-Learning Applications”, ISSCC 2020, Paper 15.3.

Also, please refer to:

[3] Choukroun, Y., et al., “Low-bit Quantization of Neural Networks for Efficient Inference”, IEEE International Conference on Computer Vision, 2019, https://ieeexplore.ieee.org/document/9022167 .

[4] Agrawal, A., et al., “X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories”, IEEE Transactions on Circuits and Systems, Volume 65, Issue 12, December, 2018, https://ieeexplore.ieee.org/document/8401845 .

 

Images supplied by the VLSI Symposium on Technology & Circuits 2020.

 

 


Mentor Cuts Circuit Verification Time with Unique Recon Technology

by Mike Gianfagna on 07-17-2020 at 6:00 am

Most of us will remember the productivity boost that hierarchical analysis provided vs. analyzing a chip flat. This “divide and conquer” approach has worked well for all kinds of designs for many years. But, as technology advances tend to do, the bar is moving again. The new challenges are rooted in the iterative nature of high complexity design.

A typical design has many, many blocks that all mature on different schedules. Nonetheless, these blocks need to be verified as they mature and there’s the rub. Called “dirty” designs, early versions of a complex chip are missing lots of detail. This creates a problem since certain checks are appropriate for these early designs, but others will generate huge numbers of false errors thanks to the incomplete nature of the circuit. Debugging some early design issues, such as shorts, can be incredibly time-consuming as well since a short can impact a huge part of the network.

Mentor has come up with a way to deal with these problems. It’s called Recon technology and the company recently announced the addition of the technology to its Calibre nmLVS product. I had the opportunity to get a briefing on this new technology from Hend Wagieh, Sr. Product Manager of Circuit Verification, Calibre Design Solutions at Mentor.

Hend explained that the Recon technology was launched last year at DAC 2019 with Calibre nmDRC-Recon. DRC has similar challenges with incomplete designs and Calibre nmDRC provides a 6X – 12X performance boost for early checks. Hend began by reviewing the complexity challenges that new nodes, such as 5nm, present:

  • Circuit verification rules have become more complex
    • More devices and polygon counts
    • More dummy devices added
    • More device parameters and compounded calculations
  • Rules expanding in scope
    • Context sensitivity
    • Color-awareness
    • FinFETs
    • Retargeting
    • Multi-patterning

This complexity coupled with the incomplete nature of early designs spells big trouble for LVS unless there is some intelligence applied. This is where the Recon technology essentially provides a paradigm shift for LVS. The new approach can be summarized as follows:

Objective: Only execute what’s necessary to resolve early design main pain points

  • Categorization: Focus on specific types of violations
  • Prioritization: Address the most impactful errors first
  • Task Distribution: Allow teams to focus on a specific set of design issues
  • Partitioning: Split data for easier debugging and root cause analysis

Hend described something called “Minimum Selective Extraction”. The Recon approach basically sorts all error checks into early and late design versions and applies intelligence regarding the way early design checks are done to minimize run time and maximize identification and correction of real errors. The result is faster early run times and cleaner late checks.

Hend spent some time discussing short paths in early (dirty) designs. She explained that an average-size early design could have about 20K short paths, with a short in the power/ground grid extending throughout the entire chip. A customer has reported spending 80% of their verification cycle debugging shorts, with complex shorts taking weeks to fix.

Using Recon technology, this problem can be managed very effectively, with up to 30x faster iterations and 3x leaner hardware usage. The results are quite dramatic.

A customer perspective was provided in the press release:

“The Calibre nmLVS-Recon approach establishes an entirely new paradigm for circuit verification use models,” said Jongwook Kye, vice president of Design Enablement Team at Samsung Electronics. “By combining the Calibre nmLVS-Recon technology with Samsung’s existing certified sign-off Calibre nmLVS design kits, our mutual customers will experience faster iterations on early ‘dirty’ designs, driving accelerated LVS verification cycles. All of this will help mutual customers tape out sooner at Samsung.”

The Calibre nmLVS-Recon flow can be used with any foundry/integrated device manufacturer’s (IDM) Calibre sign-off design kit “as is,” and on any process technology node. The product will be released in phases as shown in the diagram below.

The Calibre nmLVS-Recon initial offering will be available to the market with the Calibre family release in July of 2020, with planned additional capabilities in later releases. For more information please visit: https://bit.ly/2ZA7qjn.


Talking Sense With Moortec … Ask Yourself a “Probing” Question?

by Ramsay Allen on 07-16-2020 at 10:00 am

Managing and controlling thermal conditions in-chip is nothing new and embedded temperature monitoring has been going on for many years. What is changing, however, is the granularity and accuracy of the sensing now available to SoC design teams. Thermal activity can be quite destructive and, if not sufficiently monitored, can cause over-heating and excessive power consumption, which in turn can impact device longevity and reliability.

In other walks of life we are very thermally aware. You wouldn’t go on a long journey in your car without occasionally glancing at the temperature gauge on your dashboard, or bake a cake without accurately controlling the temperature of your oven. The cost of failure in these examples is several orders of magnitude lower than for your SoC, so why gamble? Why would you not take even more care of your device? Yet many companies are doing just that, not investing in this kind of technology at the early design phase and leaving themselves open to chip-level thermal issues, which in turn have a measurable impact on system performance.

The car and cake examples are only single-point checks. In an SoC you need greater visibility: multiple probe points giving you precise thermal measurements beside or within CPU cores, high-speed interfaces or high-activity circuitry. This type of extended functionality is now a critical requirement for chips operating within any large complex device. Moortec’s recently announced addition to its existing embedded in-chip sensing fabric has this very functionality, and its adoption in AI, Data Center and 5G devices suggests a credible solution now exists.

For some time there has been a demand for tighter, deeper thermal control of semiconductor devices…now you have an efficient means of implementing it… so now, not only will the water in your car radiator not boil and your cakes not burn, but your SoC is also far less likely to overheat!

In case you missed any of Moortec’s previous “Talking Sense” blogs, you can catch up HERE.


Using AI to Locate a Fault. Innovation in Verification

by Bernard Murphy on 07-16-2020 at 10:00 am

After we detect a bug, can we use AI to locate the fault, or at least get close? Paul Cunningham (GM of Verification at Cadence), Jim Hogan and I continue our series on novel research ideas, through a paper in software verification we find equally relevant to hardware. Feel free to comment.

 

The Innovation
This month’s pick is Precise Learn-to-Rank Fault Localization Using Dynamic and Static Features of Target Programs. You can find the paper in the 2019 ACM Transactions on Software Engineering and Methodology. The authors are from KAIST, South Korea.

There’s an apparent paradox emerging in the papers we have reviewed. Our verification targets are huge designs – 2^N states, 2^(N^2) transitions. How can anything useful be deduced from such complex state-machines using quite small machine learning engines, here a 9-level genetic programming tree? We believe the answer is approximation. If you want to find the exact location of a fault, the mismatch is fatal. If you want to get close, the method works very well. Automating that zoom-in is a high-value contribution, even though we have to finish manually.

This paper is very detailed. Again we’ll summarize only takeaways. A key innovation is to use multiple techniques to score suspiciousness of program elements. Two are dynamic, spectrum based (number of passing and failing tests that execute an element) and mutation-based (the number of failing tests which pass when an element is mutated). Three are static, dependency and complexity metrics on files, functions and statements. The method uses these features to learn a ranking of most probable suspicious statements from program examples with known faults. In inference, the method takes a program failing at least one test and generates a ranked list of probable causes.
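To make the spectrum-based feature concrete, here is a minimal sketch that ranks statements by the Ochiai score, a widely used spectrum-based formula. The choice of Ochiai, the function names and the toy coverage data are assumptions for illustration; the paper combines several such features and learns the ranking rather than using a single fixed formula.

```python
import math


def ochiai(failed_cover, passed_cover, total_failed):
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep))."""
    denom = math.sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denom if denom else 0.0


def rank_statements(coverage, outcomes):
    """Rank statements from most to least suspicious.

    coverage: {statement: set of tests that execute it}
    outcomes: {test: True if the test passed, False if it failed}
    """
    total_failed = sum(1 for ok in outcomes.values() if not ok)
    scores = {}
    for stmt, tests in coverage.items():
        ef = sum(1 for t in tests if not outcomes[t])   # failing tests covering stmt
        ep = sum(1 for t in tests if outcomes[t])       # passing tests covering stmt
        scores[stmt] = ochiai(ef, ep, total_failed)
    return sorted(scores, key=lambda s: scores[s], reverse=True)


# Toy example: one failing test, three statements
outcomes = {"t1": True, "t2": True, "t3": False}
coverage = {"s1": {"t1", "t2", "t3"}, "s2": {"t3"}, "s3": {"t1"}}
ranking = rank_statements(coverage, outcomes)
```

Statement `s2` ranks first because only the failing test executes it, while `s3` ranks last because no failing test touches it.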

The learning method the authors use is a genetic programming tree. Think of arbitrary expressions of these features, represented as numbers. Each expression can be represented as a tree. They’re training expressions through random selection to find those that most closely fit the training examples. A genetic programming tree is mathematically not so different from a neural net, doing a conceptually similar job in a somewhat different way. It’s a nice change to see a paper highlighting how powerful genetic programming can be, as a contrast to the sea of neural net papers we more commonly find.

Paul’s view
This overall concept is growing on me, that a program could look at code and observed behaviors, and draw conclusions with the same insight as an experienced programmer. We code in quite restricted ways, but we don’t know how to capture that intuition in rules. However, we do know where to look as experienced programmers when we see suspicious characteristics, in behaviors in testing, in complexities and interdependencies in files and functions. This isn’t insight based on how each state toggles over time. We have a more intuitive, experience-based view, drawing on those higher-level features. I see this innovation capturing that intuition.

I think this is a very practical paper. The authors do a very systematic analysis of how each feature contributes, for example taking away one feature at a time to see how accuracy of detection is affected. They slice and dice their own analysis to understand where it might be soft and why it’s robust. For example, they only had to use 20% of the mutants that a mutation-only analysis needed, an important consideration because the mutation analysis is by far the biggest contributor to run-time. Reducing the number of mutants they consider reduces run-time by 4-5X. Yet their overall accuracy when combining this reduced mutation coverage with other features is way better than the full mutation-based approach alone.

The big revelation for me is that while they use multiple methods that individually are not new (spectrum-based and mutation-based fault location, static complexity metrics, genetic programming), they show that when you put these together, the whole is much greater than the sum of the parts. This is a very well thought-through paper. I could easily see how this idea could be turned into a commercial tool.

Jim’s view
I like Paul’s point about the pieces individually not being that exciting, but when you pull them together, you’re getting results that are much more interesting. Our regular engineering expectation is that each component in what we build has to make an important contribution. In that way, when you add them together you expect a sum of important contributions. Maybe in these intuitive approaches it doesn’t always work that way, a few small components can add up to a much bigger outcome.

I’m starting to feel there’s a real pony in here. I have a sense that this is the biggest of the ideas we’ve seen so far. I’m hearing from my team-mates that the authors have done well on analyzing their method from all possible angles. There are indications that there’s a significant new idea here, built on well-established principles, showing a whole new level of results. I’d say this is investable.

My view
I’m going to steal a point that Paul made in our off-line discussion. This feels similar to sensor fusion. One sensor can detect certain things, another can detect different things. Fused together you have a much better read than you could have got from each sensor signal separately. Perhaps we should explore this idea more widely in machine learning applications in verification.

You can read the previous Innovation in Verification blog HERE.


Novel DFT Approach for Automotive Vision SoCs

by Tom Simon on 07-16-2020 at 6:00 am

You may have seen a recent announcement from Mentor, a Siemens business, regarding the use of their Tessent DFT software by Ambarella for automotive applications. The announcement is a good example of how Mentor works with their customers to assure design success. On the surface the announcement comes across as a nice block and tackle success. However, digging deeper there is a more interesting story to tell.

Ambarella designs vision processors for AI edge applications, including automotive systems. This brings ISO 26262 into play to ensure that the reliability of the systems is commensurate with the risk associated with a potential failure. Ambarella used Mentor’s Tessent LogicBIST, MemoryBIST and MissionMode products to develop the DFT features in their CV22FS and CV2F automotive camera system-on-chips (SoCs).

Digging deeper into the story behind the announcement, I had a conversation with Mentor’s Lee Harrison about how Ambarella worked with Mentor to develop a unique test solution that helps Ambarella get the most flexibility as they design new SoCs. Ambarella wanted to build a modular approach into their blocks so that the test functionality of each block is self-contained.

For in-system test, typically each chip will have a top-level MissionMode controller that connects to the MBIST and LBIST in each block. This top-level test controller will have ROM for the patterns or rely on CPU control. Ambarella went with the approach of having each block use a MissionMode controller and having RAM at the top level for the test data that is downloaded at start up. The MissionMode controller RAM is loaded using a DMA feature in the MissionMode controller.

Lee explained that even though there is a slight start-up time penalty for loading the local RAM from the top-level ROM, Ambarella benefits from having each block signed off for DFT before the chip is assembled. This offers them huge benefits in terms of IP reuse and simplification of the top-level integration.

I have written recently about how Mentor works with customers to develop key new features of their DFT products. While this is a little different, it offers an example of customer cooperation that works to everyone’s benefit. The architectural advantages of Tessent are evident from the results obtained in this example.

Lee also mentioned that the work with Ambarella predated the development of Tessent Observation Scan. If this were added to their flow, it would save more time because of the reduction in the number of patterns. The two-fold benefit would be that the data transfer at start up would take less time and the actual test runs would be faster as well.

In the automotive market in-system test is essential to provide test functionality at start-up, during system operation and after the system has “powered off.” Mentor’s MissionMode controller enables each of these operations. There are numerous white papers and videos on the Mentor website that discuss their automotive test solutions. In particular, if you are interested in reading the Ambarella release, it is available there as well.

 

 


A tour of Cliosoft’s participation at DAC 2020 with Simon Rance

by Mike Gianfagna on 07-15-2020 at 10:00 am

As chip complexity grows, so does the need for a well-thought-out design data management strategy.  This is a hot area, and Cliosoft is in the middle of it.  When I was at eSilicon, we used Cliosoft technology to manage the design and layout of high-performance analog designs across widely separated design teams. The tool worked great and everyone was always working on the correct version. Over the years I’ve developed an appreciation for the importance of an industrial-grade strategy to manage design data and revisions. And no, spreadsheets and white boards don’t qualify as industrial grade.

I was curious what Cliosoft was up to at DAC this year, so I reached out to an old friend from Atrenta who is the head of marketing at Cliosoft, Simon Rance. It turns out Cliosoft is doing a lot at DAC and Simon took me on a tour of the planned events.

The first one we discussed is a poster session presented with Lawrence Berkeley National Laboratory, Method and Apparatus to Promote Cross-Institution Design Collaboration. This one is certain to take you out of your traditional concept of a design project. The challenges of “high-energy physics project development” will be discussed. Collaboration is quite widespread and Cliosoft provides a master data repository. This data backbone is used by Brookhaven National Laboratory, Fermi National Accelerator Laboratory and Lawrence Berkeley National Laboratory. OK, enough name-dropping. This poster will be presented on Wednesday, July 22 from 7:30 AM to 8:30 AM Pacific time.

Next up is a poster session about designing in the cloud with Amazon Web Services, Efficient & Cost Effective EDA Environment Built Easily in AWS Cloud. First, the challenges of on-premise data centers are discussed:

  • Peak-capacity resource planning
  • Continuous upgrades of hardware
  • Capital expense

A methodology to address these issues using Cliosoft technology is then discussed. Some eye-catching statistics are documented:

  • 90 percent disk space savings
  • 2 – 3.5X performance gain

Impressive. The methods to achieve these kinds of results are detailed in this presentation. I had some first-hand experience with designing in the cloud at eSilicon and I can tell you the efficiency and flexibility benefits are real. You should check it out. This poster will be presented on Tuesday, July 21 from 7:30AM – 8:30AM Pacific time.

Speaking of the cloud, the next poster session we discussed was one with Google, Efficient & Cost Effective EDA Environment Built Easily in Google Cloud.  The challenges cataloged here are:

  • Shared storage performance
  • Unstable networks

A methodology to replicate your EDA environment in the cloud is discussed. Key items to consider include:

  • A wise choice of compute infrastructure
  • The cloud compatibility of the software
  • Cloud connectivity to all design sites
  • Data privacy and retention compliance

The presentation reports a 75 percent improvement in file access on the cloud. This poster will be presented on Tuesday, July 21 from 7:30AM – 8:30AM Pacific time.

The final session I discussed with Simon is a presentation in the technical program at DAC. I can tell you these slots are not easy to get. Each submission goes through a rigorous peer review and only the best ones survive. The presentation is entitled Silicon-Based Quantum Computer Design and Verification.

This is a joint presentation with Cliosoft and Equal1.Labs. Quantum computing is pretty exotic stuff. Equal1.Labs claims to have the first 16 qubit compact quantum computer demonstrator, code named alice mk1. I would definitely catch this one. The presentation is Monday, July 20 from 1:30PM – 3:00 PM Pacific time (session 6.2).

You can register for DAC here.  Enjoy the show.

Also Read

How to Grow with Poise and Grace, a Tale of Scalability from ClioSoft

How to Modify, Release and Update IP in 30 Minutes or Less

Best Practices for IP Reuse


A Look at the Die of the 8086 Processor

by Ken Shirriff on 07-15-2020 at 6:00 am

The Intel 8086 microprocessor was introduced 42 years ago last month,1 so I made some high-res die photos of the chip to celebrate. The 8086 is one of the most influential chips ever created; it started the x86 architecture that still dominates desktop and server computing today. By looking at the chip’s silicon, we can see the internal features of this chip.

The photo below shows the die of the 8086. In this photo, the chip’s metal layer is visible, mostly obscuring the silicon underneath. Around the edges of the die, thin bond wires provide connections between pads on the chip and the external pins. (The power and ground pads each have two bond wires to support the higher current.) The chip was complex for its time, containing 29,000 transistors.

Die photo of the 8086, showing the metal layer. Around the edges, bond wires are connected to pads on the die.

Looking inside the chip
To examine the die, I started with the 8086 integrated circuit below. Most integrated circuits are packaged in epoxy, so dangerous acids are necessary to dissolve the package. To avoid that, I obtained the 8086 in a ceramic package instead. Opening a ceramic package is a simple matter of tapping it along the seam with a chisel, popping the ceramic top off.

The 8086 chip, in 40-pin ceramic DIP package.

With the top removed, the silicon die is visible in the center. The die is connected to the chip’s metal pins via tiny bond wires. This is a 40-pin DIP package, the standard packaging for microprocessors at the time. Note that the silicon die itself occupies a small fraction of the chip’s size.

The 8086 die is visible in the middle of the integrated circuit package.

Using a metallurgical microscope, I took dozens of photos of the die and stitched them into a high-resolution image using a program called Hugin (details). The photo at the beginning of the blog post shows the metal layer of the chip, but this layer hid the silicon underneath.

Under the microscope, the 8086 part number is visible as well as the copyright date. A bond wire is connected to a pad. Part of the microcode ROM is at the top.

For the die photo below, the metal and polysilicon layers were removed, showing the underlying silicon with its 29,000 transistors [2]. The labels show the main functional blocks, based on my reverse engineering. The left side of the chip contains the 16-bit datapath: the chip’s registers and arithmetic circuitry. The adder and upper registers form the Bus Interface Unit that communicates with external memory, while the lower registers and the ALU form the Execution Unit that processes data. The right side of the chip has control circuitry and instruction decoding, along with the microcode ROM that controls each instruction.

Die of the 8086 microprocessor showing main functional blocks.

One feature of the 8086 was instruction prefetching, which improved performance by fetching instructions from memory before they were needed. This was implemented by the Bus Interface Unit in the upper left, which accessed external memory. The upper registers include the 8086’s infamous segment registers, which provided access to a larger address space than the 64 kilobytes allowed by a 16-bit address. For each memory access, a segment register (shifted left by four bits) and a 16-bit memory offset were added to form the final 20-bit memory address. For performance, the 8086 had a separate adder for these memory address computations, rather than using the ALU. The upper registers also include six bytes of instruction prefetch buffer and the program counter.
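The segment-plus-offset addition can be sketched in a few lines of C++. This is only an illustration of the arithmetic, not of the chip's adder circuit; the function name `physical_address` is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Address formation on the 8086: the 16-bit segment value is shifted left
// by 4 bits (multiplied by 16) and added to the 16-bit offset. The chip has
// 20 address lines, so the sum is truncated to 20 bits.
constexpr std::uint32_t physical_address(std::uint16_t segment,
                                         std::uint16_t offset) {
    return ((static_cast<std::uint32_t>(segment) << 4) + offset) & 0xFFFFFu;
}
```

For example, segment 0x1234 with offset 0x0010 gives 0x12350, and many different segment/offset pairs map to the same physical address.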

The lower-left corner of the chip holds the Execution Unit, which performs data operations. The lower registers include the general-purpose registers and index registers such as the stack pointer. The 16-bit ALU performs arithmetic operations (addition and subtraction), Boolean logical operations, and shifts. The ALU does not implement multiplication or division; these operations are performed through a sequence of shifts and adds/subtracts, so they are relatively slow.
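As a rough illustration of why multiplication built from shifts and adds is slow, here is the textbook shift-and-add algorithm in C++. It is a generic sketch, not a transcription of the 8086's microcode; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned 16 x 16 -> 32-bit multiply using only shifts and adds.
// One loop iteration per multiplier bit, so 16 steps for a 16-bit operand.
std::uint32_t shift_add_multiply(std::uint16_t multiplicand,
                                 std::uint16_t multiplier) {
    std::uint32_t product = 0;
    std::uint32_t addend = multiplicand;  // multiplicand, shifted left each step
    for (int bit = 0; bit < 16; ++bit) {
        if (multiplier & 1u) {
            product += addend;            // add when the current multiplier bit is 1
        }
        addend <<= 1;
        multiplier >>= 1;
    }
    return product;
}
```

Each bit of the multiplier costs one pass through the loop, which is why a multiply takes many more steps than a single add or shift.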

Microcode
One of the hardest parts of computer design is creating the control logic that tells each part of the processor what to do to carry out each instruction. In 1951, Maurice Wilkes came up with the idea of microcode: instead of building the control logic from complex logic gate circuitry, the control logic could be replaced with special code called microcode. To execute an instruction, the computer internally executes several simpler micro-instructions, which are specified by the microcode. With microcode, building the processor’s control logic becomes a programming task instead of a logic design task.

Microcode was common in mainframe computers of the 1960s, but early microprocessors such as the 6502 and Z-80 didn’t use microcode because early chips didn’t have room to store microcode. However, later chips such as the 8086 and 68000 used microcode, taking advantage of increasing chip densities. This allowed the 8086 to implement complex instructions (such as multiplication and string copying) without making the circuitry more complex. The downside was that the microcode took a large fraction of the 8086’s die; the microcode is visible in the lower-right corner of the die photos [3].

A section of the microcode ROM.

Bits are stored by the presence or absence of transistors. The transistors are the small white rectangles above and/or below each dark rectangle. The dark rectangles are connections to the horizontal output buses in the metal layer.

The photo above shows part of the microcode ROM. Under a microscope, the contents of the microcode ROM are visible, and the bits can be read out, based on the presence or absence of transistors in each position. The ROM consists of 512 micro-instructions, each 21 bits wide. Each micro-instruction specifies movement of data between a source and destination. It also specifies a micro-operation which can be a jump, ALU operation, memory operation, microcode subroutine call, or microcode bookkeeping. The microcode is fairly efficient; a simple instruction such as increment or decrement consists of two micro-instructions, while a more complex string copy instruction is implemented in eight micro-instructions [3].

History of the 8086
The path to the 8086 was not as direct and planned as you might expect. Its earliest ancestor was the Datapoint 2200, a desktop computer/terminal from 1970. The Datapoint 2200 was before the creation of the microprocessor, so it used an 8-bit processor built from a board full of individual TTL integrated circuits. Datapoint asked Intel and Texas Instruments if it would be possible to replace that board of chips with a single chip. Copying the Datapoint 2200’s architecture, Texas Instruments created the TMX 1795 processor (1971) and Intel created the 8008 processor (1972). However, Datapoint rejected these processors, a fateful decision. Although Texas Instruments couldn’t find a customer for the TMX 1795 processor and abandoned it, Intel decided to sell the 8008 as a product, creating the microprocessor market. Intel followed the 8008 with the improved 8080 (1974) and 8085 (1976) processors. (I’ve written more about early microprocessors here.)

Datapoint 2200 computer. Photo courtesy of Austin Roche.

In 1975, Intel’s next big plan was the 8800 processor designed to be Intel’s chief architecture for the 1980s. This processor was called a “micromainframe” because of its planned high performance. It had an entirely new instruction set designed for high-level languages such as Ada, and supported object-oriented programming and garbage collection at the hardware level. Unfortunately, this chip was too ambitious for the time and fell drastically behind schedule. It eventually launched in 1981 (as the iAPX 432) with disappointing performance, and was a commercial failure.

Because the iAPX 432 was behind schedule, Intel decided in 1976 that they needed a simple, stop-gap processor to sell until the iAPX 432 was ready. Intel rapidly designed the 8086 as a 16-bit processor somewhat compatible with the 8-bit 8080 [4] and released it in 1978. The 8086 had its big break with the introduction of the IBM Personal Computer (PC) in 1981. By 1983, the IBM PC was the best-selling computer and became the standard for personal computers. The processor in the IBM PC was the 8088, a variant of the 8086 with an 8-bit bus. The success of the IBM PC made the 8086 architecture a standard that still persists, 42 years later.

Why did the IBM PC pick the Intel 8088 processor? [7] According to Dr. David Bradley, one of the original IBM PC engineers, a key factor was the team’s familiarity with Intel’s development systems and processors. (They had used the Intel 8085 in the earlier IBM Datamaster desktop computer.) Another engineer, Lewis Eggebrecht, said the Motorola 68000 was a worthy competitor [6] but its 16-bit data bus would significantly increase cost (as with the 8086). He also credited Intel’s better support chips and development tools [5].

In any case, the decision to use the 8088 processor cemented the success of the x86 family. The IBM PC AT (1984) upgraded to the compatible but more powerful 80286 processor. In 1985, the x86 line moved to 32 bits with the 80386, and then 64 bits in 2003 with AMD’s Opteron architecture. The x86 architecture is still being extended with features such as AVX-512 vector operations (2016). But even through all these changes, the x86 architecture retains compatibility with the original 8086.

Transistors
The 8086 chip was built with a type of transistor called NMOS. The transistor can be considered a switch, controlling the flow of current between two regions called the source and drain. These transistors are built by doping areas of the silicon substrate with impurities to create “diffusion” regions that have different electrical properties. The transistor is activated by the gate, made of a special type of silicon called polysilicon, layered above the substrate silicon. The transistors are wired together by a metal layer on top, building the complete integrated circuit. While modern processors may have over a dozen metal layers, the 8086 had a single metal layer.

Structure of a MOSFET in the integrated circuit.

The closeup photo of the silicon below shows some of the transistors from the arithmetic-logic unit (ALU). The doped, conductive silicon has a dark purple color. The white stripes are where a polysilicon wire crossed the silicon, forming the gate of a transistor. (I count 23 transistors forming 7 gates.) The transistors have complex shapes to make the layout as efficient as possible. In addition, the transistors have different sizes to provide higher power where needed. Note that neighboring transistors can share the source or drain, causing them to be connected together. The circles are connections (called vias) between the silicon layer and the metal wiring, while the small squares are connections between the silicon layer and the polysilicon.

Closeup of some transistors in the 8086. The metal and polysilicon layers have been removed in this photo. The doped silicon has a dark purple appearance due to thin-film interference.

Conclusions
The 8086 was intended as a temporary stop-gap processor until Intel released their flagship iAPX 432 chip, and was the descendant of a processor built from a board full of TTL chips. But from these humble beginnings, the 8086’s architecture (x86) unexpectedly ended up dominating desktop and server computing until the present.

Although the 8086 is a complex chip, it can be examined under a microscope down to individual transistors. I plan to analyze the 8086 in more detail in future blog posts [8], so follow me on Twitter at @kenshirriff for updates. I also have an RSS feed. Here’s a bonus high-resolution photo of the 8086 with the metal and polysilicon removed; click for a large version.

Die photo of the Intel 8086 processor. The metal and polysilicon have been removed to reveal the underlying silicon.


Notes and references

  1. The 8086 was released on June 8, 1978.
  2. To expose the chip’s silicon, I used Armour Etch glass etching cream to remove the silicon dioxide layer. Then I dissolved the metal using hydrochloric acid (pool acid) from the hardware store. I repeated these steps until the bare silicon remained, revealing the transistors.
  3. The designers of the 8086 used several techniques to keep the size of the microcode manageable. For instance, instead of implementing separate microcode routines for byte operations and word operations, they re-used the microcode and implemented control circuitry (with logic gates) to handle the different sizes. Similarly, they used the same microcode for increment and decrement instructions, with circuitry to add or subtract based on the opcode. The microcode is discussed in detail in New options from big chips and patent 4449184.
  4. The 8086 was designed to provide an upgrade path from the 8080, but the architectures had significant differences, so they were not binary compatible or even compatible at the assembly code level. Assembly code for the 8080 could be converted to 8086 assembly via a program called CONV-86, which would usually require manual cleanup afterward. Many of the early programs for the 8086 were conversions of 8080 programs.
  5. Eggebrecht, one of the original engineers on the IBM PC, discusses the reasons for selecting the 8088 in Interfacing to the IBM Personal Computer (1990), summarized here. He discussed why other chips were rejected: IBM microprocessors lacked good development tools, and 8-bit processors such as the 6502 or Z-80 had limited performance and would make IBM a follower of the competition. I get the impression that he would have preferred the Motorola 68000. He concludes, “The 8088 was a comfortable solution for IBM. Was it the best processor architecture available at the time? Probably not, but history seems to have been kind to the decision.”
  6. The Motorola 68000 processor was a 32-bit processor internally, with a 16-bit bus, and is generally considered a more advanced processor than the 8086/8088. It was used in systems such as Sun workstations (1982), Silicon Graphics IRIS (1984), the Amiga (1985), and many Apple systems. Apple used the 68000 in the original Apple Macintosh (1984), upgrading to the 68030 in the Macintosh IIx (1988), and the 68040 with the Macintosh Quadra (1991). However, in 1994, Apple switched to the RISC PowerPC chip, built by an alliance of Apple, IBM, and Motorola. In 2006, Apple moved to Intel x86 processors, almost 28 years after the introduction of the 8086. Now, Apple is rumored to be switching from Intel to its own ARM-based processors.
  7. For more information on the development of the IBM PC, see A Personal History of the IBM PC by Dr. Bradley.
  8. The main reason I haven’t done more analysis of the 8086 is that I etched the chip for too long while removing the metal and removed the polysilicon as well, so I couldn’t photograph and study the polysilicon layer. Thus, I can’t determine how the 8086 circuitry is wired together. I’ve ordered another 8086 chip to try again.

Ansys Multiphysics Platform Tackles Power Management ICs

by Mike Gianfagna on 07-14-2020 at 10:00 am

Ansys addresses complex Multiphysics simulation and analysis tasks, from device to chip to package and system. When I was at eSilicon we did a lot of work on 2.5D packaging and I can tell you tools from Ansys were a critical enabler to get the chip, package and system to all work correctly.

Ansys recently published an Application Brief on how they address analysis of power management ICs. The tool highlighted is Ansys Totem, a foundry-certified transistor-level power noise and reliability platform for power integrity analysis on analog mixed-signal IP and full custom designs. I had the opportunity to speak with Karthik Srinivasan, Sr. Corporate Application Engineer Manager, Analog & Mixed Signal and Marc Swinnen, Director of Product Marketing at Ansys.

I began by probing the genealogy of Totem. Did it come from an acquisition? Interestingly, Totem is a completely organic tool that builds on the Multiphysics platform at Ansys that powers other tools such as the popular Ansys RedHawk. Organic development like this is noteworthy – it speaks to the breadth and depth of the underlying infrastructure. As Totem is a transistor-level tool, it delivers Spice-like accuracy according to the Application Brief. I probed this a bit with Karthik. Was Totem actually running Spice, and if so, how do you get an answer for a large network in less than geologic time?

Totem changes the modeling paradigm for the network to deliver results much faster than traditional Spice. All non-linear elements are converted to a linear model. All transistors are modeled as current sources and capacitors. These models are then connected to the parasitic network of the power grid. An IR-drop and electromigration analysis is then performed. This cuts the computational complexity of the problem down quite a bit. Totem provides targeted accuracy for the analysis of interest, typically within 5-10 mV of Spice, even for advanced technology nodes.
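To make the idea of a linearized power-grid analysis concrete, here is a toy IR-drop calculation in C++. It is not Totem's algorithm; it only assumes the simplest possible case, a single rail fed from one end, with each device reduced to a constant current sink. The function and parameter names are invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node voltages along a power rail fed from one end.
// segment_ohms[i] is the resistance between node i-1 (or the supply, for
// i == 0) and node i; sink_amps[i] is the current drawn at node i.
// Both vectors are assumed to have the same length.
std::vector<double> rail_voltages(double supply_volts,
                                  const std::vector<double>& segment_ohms,
                                  const std::vector<double>& sink_amps) {
    const std::size_t n = segment_ohms.size();
    // The current through segment i is the sum of all sinks at or beyond node i.
    std::vector<double> segment_amps(n, 0.0);
    double downstream = 0.0;
    for (std::size_t i = n; i-- > 0;) {
        downstream += sink_amps.at(i);
        segment_amps[i] = downstream;
    }
    std::vector<double> volts(n, 0.0);
    double v = supply_volts;
    for (std::size_t i = 0; i < n; ++i) {
        v -= segment_ohms[i] * segment_amps[i];  // Ohm's law: drop = I * R
        volts[i] = v;
    }
    return volts;
}
```

With a 1.0 V supply, segments of 0.5 and 0.25 ohm, and sinks of 0.25 and 0.5 A, the first node sits at 0.625 V and the second at 0.5 V. A real grid is a large mesh rather than a chain, which calls for a linear solver, but the models stay linear in the same way.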

We discussed other applications of this approach. Power management ICs contain very wide power rails to handle the large currents involved in their operation. These structures are typically analyzed with a finite element solver, resulting in very long run times, typically multiple days. Using the Totem approach, a result with similar accuracy can typically be delivered 5-6X faster.

Using the Ansys Multiphysics platform, analysis can be performed from transistor and cell library level all the way to the system level. One platform, one source of models. IP vendors are also developing and delivering Totem macro models along with the IP to facilitate this kind of multi-level analysis. Marc pointed out that custom macro models are a key enabling technology to support this kind of transistor to system analysis. One first does the detailed analysis in Totem and then creates a macro model of the result to drive RedHawk.

The Ansys Application Brief goes into a lot more detail about the analysis capabilities of Totem. You can access the Application Brief here. To whet your appetite, here are some of the topics covered:

  • Advanced Analysis: Power FETs, RDSON & sensitivity, guard ring weakness checks, transient power
  • Early Analysis: device R maps, interconnect R maps, guard ring weakness maps
  • PDN Noise Sign-Off: power, DvD, substrate noise

With DAC approaching, you can visit the Ansys virtual booth. Registration for DAC can be found here. There’s more to see from Ansys at DAC.  The company has an incredible 25 papers accepted in the designer track (that’s not a misprint). Four of them focus on Totem. I also hear that Ansys is planning a special semiconductor-focused virtual event in the Fall. Watch your inbox and SemiWiki for more information on that as it becomes available.

Also Read

Qualcomm on Power Estimation, Optimizing for Gaming on Mobile GPUs

The Largest Engineering Simulation Virtual Event in the World!

Prevent and Eliminate IR Drop and Power Integrity Issues Using RedHawk Analysis Fusion


Hierarchical CDC analysis is possible, with the right tools

by Bernard Murphy on 07-14-2020 at 6:00 am

Design complexity demands hierarchical CDC

Back in my Atrenta days (before mid-2015), we were already running into a lot of very large SoC-level designs – a billion gates or more. At those sizes, full-chip verification of any kind becomes extremely challenging. Memory demand and run-times explode, and verification costs explode also since these runs require access to very expensive servers in-house or in the cloud. Verifying hierarchically seems like an obvious solution but presents new problems in abstracting blocks in the analysis. Immediate ideas for abstraction invariably hide global detail which is critical to accuracy and dependability for sign-off. Implementing hierarchical CDC (clock domain crossing) analysis provides a good example.

The need for hierarchical CDC

The factors that make for a CDC problem don’t neatly bound themselves inside design hierarchy blocks. Clocks run all over an SoC and many domain crossings fall between function blocks. You might perhaps analyze two or more such blocks together, but you still have to abstract the rest, adding unknown inaccuracies to your analysis. Even this solution may fail for more extended problems like re-convergence or glitch-prone logic. Add in multiple power domains and reset domains and the range of combinations you may need to test can become overwhelming. Unfortunately, clever user hacks can’t get around these issues.

The unavoidable answer is to develop much better abstractions which can capture that global detail, detail that is necessary for CDC analysis but not captured in conventional constraints or other design data. That direction started in Atrenta and continues to be evolved in Synopsys through a concept of sign-off abstract models (SAMs). A SAM is a reduced and annotated model, much smaller than the full model. But it still contains enough design and constraint detail to support an accurate CDC analysis at the next level up.

Hierarchical analysis

The analysis methodology, which can extend through multiple levels of hierarchy, will typically start at a block/IP level where an engineer will first fully validate CDC correctness, then generate a SAM model through an automatic step. These models strip out internal logic except for logic at boundaries where that logic has relevance to CDC. The SAM model will also include assumptions made in the block-level analysis. At the next level up, CDC analysis will check consistency between the assumptions at that level (e.g. sync/async relations between clocks) and those block-level assumptions.

When you have fixed any consistency problems at one level, you can run CDC analysis at the next level up. Fix any problems there, generate a SAM model for that level, and so on, up the hierarchy.

Hierarchy simplifies CDC review

There’s another obvious benefit to this approach. CDC noise becomes much more manageable. No need to wade through gigabytes of full-chip reports to find potential problems. You can now work through reasonably-sized reports at each level. Synopsys already has lots of clever techniques to reduce noise further within a level.

The secret sauce in this process is the detail in the SAM model, in generation, and in consistency checks between levels, which ensures that hierarchical analysis is entirely consistent with a full flat analysis while subtracting the detail that would have been reported inside whatever you have abstracted. You can still run a final signoff before handoff, to be absolutely certain. Hierarchical CDC helps you to be a lot more efficient about how you get there.

You can learn more about the VC SpyGlass hierarchical CDC analysis flow HERE.

Also Read:

What’s New in Verdi? Faster Debug

Design Technology Co-Optimization (DTCO) for sub-5nm Process Nodes

Webinar: Optimize SoC Glitch Power with Accurate Analysis from RTL to Signoff


SystemC Methodology for Virtual Prototype at DVCon USA

by Daniel Payne on 07-13-2020 at 10:00 am

DVCon was the first EDA conference in our industry impacted by the pandemic and travel restrictions in March of this year, and the organizers did a superb job of adjusting the schedule. I was able to review a DVCon tutorial called “Defining a SystemC Methodology for your Company”, given by Swaminathan Ramachandran of CircuitSutra. His company provides ESL design IP and services and their main office is in India.

Why SystemC

The SystemC language goes all the way back to a DAC 1997 paper, and the first draft version was released in 1999. SystemC is defined by Accellera and even has an IEEE standard 1666-2011. The Accellera SystemC/TLM (Transaction Level Modeling) 2.0 standard provides a solid base to start building, integrating and deploying models for use cases in various domains.

The ability to model a virtual platform of both SoC hardware and software concurrently using SystemC is the big driver. SystemC is a library built in C++, which has a rich and robust ecosystem consisting of libraries and development tools.

Virtual Prototypes

Virtual Prototypes are the fast software models of the hardware, typically at a higher level of abstraction, sacrificing cycle accuracy for simulation speed.

Virtual Platforms based on SystemC have been leading the charge for ‘left-shift’ in the industry. They have had a profound impact in the fields of pre-silicon software development, architecture analysis, verification and validation, Hardware-Software co-design & co-verification.

SystemC/TLM2.0 has become the de facto standard for development and exchange of IP and SoC models for use in virtual prototypes.

SystemC Methodology for Virtual Prototypes

SystemC, a C++ library, offers the nuts and bolts to model the hardware at various abstraction levels.

Developing each IP model from scratch with low level semantics and boilerplate code can be a drain on engineering time and resources, leading to lower productivity and higher chances of introducing bugs. There is a need for a boost-like utility library on top of SystemC, that provides a rich collection of tool independent, re-usable modeling components that can be used across many IPs and SoCs.

One of the strengths of SystemC, and also its biggest weakness, is its versatility. SystemC allows you to develop models at the RTL level, similar to Verilog / VHDL. It also allows you to develop models at higher abstraction levels, which can simulate as fast as real hardware. To effectively deploy SystemC in your projects, just learning the SystemC language is not sufficient; you need to understand the specific modeling techniques so that models are suitable for a specific use case. The modeling methodology, or boost-like library on top of SystemC, for the virtual prototyping use case should provide the re-usable modeling classes & components that encapsulate the modeling techniques required in virtual prototyping. Any model developed using this library will automatically be at higher abstraction levels, fully suitable for virtual prototypes.

Virtual prototyping tools from many EDA vendors come with such a library; however, models developed with these become tightly coupled with the tools. Most of the semiconductor companies working on virtual platform projects end up developing such a library in-house in a tool-independent fashion.

While defining such a methodology, one should try to identify and leverage recurring patterns in the model development. There will be some code sections or features that will be similar in all models. Instead of each modeling engineer implementing their own versions of these code sections, it will be better to maintain these in a common library to be used by all modeling engineers.

In addition, there may be a set of common, re-usable modeling components required while developing the models of the various IPs of the same application domain, e.g. audio / video. Every company has to carefully evaluate its needs and come up with the requirement specs of these common components.

Most of the time, there is a central methodology team that develops and maintains this library and keeps it up to date with the latest standards.

This presentation covered a select list of components and features that may be used to build such a high productivity suite. These may be useful for semiconductor and system companies looking to start virtual prototyping activities.

Over the years the team at CircuitSutra has built up their own SystemC library to accelerate virtual prototype projects. CircuitSutra Modeling Library (CSTML) has been successfully used in a wide variety of virtual platform projects for over a decade, and has become highly stable over that period of time.

Using CSTML as the base for your projects right from the beginning will ensure that your models are compliant with standards and can be integrated with any EDA tool. You may also use it as the base and further customize it to define your own modeling methodology.

Feature List

Some of these library elements are presented here:

  • Register Modeling
  • Smart TLM sockets
  • Configuration
  • Reporting/Logging
  • Model Generator
  • Smart Timer
  • Generic Router
  • Generic Memory
  • Python Integration

 

Register Modeling

Registers provide the entry point for embedded programmers to configure an IP, and as such are found in almost all IPs. Registers come in all shapes and sizes and are usually described using IP-XACT register specifications.

Memory-mapped registers are mapped to CPU address maps. Registers may be further composed of bit-fields, each of which may control one or more aspects of an IP and report their status. Register read and write requests are typically handled via a TLM 2.0 target socket. We can marry the TLM2.0 (smart) target socket to the register library to provide seamless and automatic communication between the two.

Registers and bit-fields have five access types. The bit-field read has three variants and write has ten variants. The number of permutations and combinations that this can offer is mind-boggling, but with a register library accompanied by code generation, this complexity can be tucked away under a lightweight and consistent API to access registers and bit-fields. Further, array-like access semantics provide syntactic sugar.

If we want to associate an action linked to a register access, we can enable it by registering a pre/post call-back with the appropriate register.

For example, if the CNTL_BIT0 bit-field is set for an IP, then take some action. This may be implemented by providing a debug-write post call-back. This approach also simplifies code reviews, as the functionality associated with a register access operation is localized, and this code can be kept separate from generated code.

static const int ADDR_CNTL = 0x104;

// Set up registers and associated bit-fields.
// Note: generated code.
void IP::register_setup() {
    // ...
}

// Debug-write post call-back (user-written): runs after a debug write to CNTL.
void IP::reg_cntl_cb(addr_t addr, value_t val) {
    if (m_reg[addr][CNTL_BIT0]) {
        bar();
    }
}

// Constructor: set up the registers, then attach the IP behavior.
IP::IP() {
    register_setup();
    m_reg.attach_cb(ADDR_CNTL, &IP::reg_cntl_cb,
                    REG_OP_DBG_WRITE, REG_CB_POST);
}
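The snippet above depends on the CSTML register classes, which are not shown in the tutorial. To see the call-back mechanism in isolation, here is a self-contained sketch in plain C++. The class `RegisterBank` and its methods are hypothetical stand-ins, not the CSTML API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Minimal register bank: values are keyed by address, and any number of
// post-write call-backs can be attached to an address.
class RegisterBank {
public:
    using Callback = std::function<void(std::uint32_t addr, std::uint32_t value)>;

    void attach_post_write(std::uint32_t addr, Callback cb) {
        post_write_[addr].push_back(std::move(cb));
    }

    void write(std::uint32_t addr, std::uint32_t value) {
        values_[addr] = value;                              // update storage first
        const auto it = post_write_.find(addr);
        if (it == post_write_.end()) return;
        for (const auto& cb : it->second) cb(addr, value);  // then notify
    }

    std::uint32_t read(std::uint32_t addr) const {
        const auto it = values_.find(addr);
        return it == values_.end() ? 0u : it->second;       // unwritten reads as 0
    }

    // Test a single bit; index is assumed to be in the range 0..31.
    bool bit(std::uint32_t addr, unsigned index) const {
        return ((read(addr) >> index) & 1u) != 0;
    }

private:
    std::map<std::uint32_t, std::uint32_t> values_;
    std::map<std::uint32_t, std::vector<Callback>> post_write_;
};
```

A model would attach a lambda to the control register address and check `bit(addr, 0)` inside it, which keeps the behavior tied to that register in one place.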

Smart TLM Sockets

The Accellera tlm_utils library provides some convenience sockets that simplify modeling TLM2.0 transactions; however, they do not support some commonly used features such as Direct Memory Interface (DMI) management in loosely timed (LT) modeling and tlm_mm (TLM Memory Manager) for approximately timed (AT) transactions.

The TLM smart initiator socket provides built-in support for tlm_mm and a DMI manager that is transparent to the end user. The tlm_mm may also be extended to support buffer, byte-enable and tlm_extensions memory management.

Similarly, the TLM smart target socket provides a memory-mapped registration feature that resources such as registers and internal memory can use. It also handles gaps in the memory map according to configurable policies, such as ignoring them or raising an exception.
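The registration and gap-policy idea can be sketched without any TLM dependencies. The names below (`TargetMap`, `GapPolicy`) are invented for illustration and are not taken from CSTML or the Accellera library.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class GapPolicy { Ignore, Throw };

// Maps address ranges inside one target to named resources
// (for example a register block or an internal memory).
class TargetMap {
public:
    explicit TargetMap(GapPolicy policy) : policy_(policy) {}

    void add(std::string name, std::uint64_t base, std::uint64_t size) {
        regions_.push_back(Region{std::move(name), base, size});
    }

    // Returns the resource name for addr, or an empty string when addr
    // falls in a gap and the policy is Ignore.
    std::string resolve(std::uint64_t addr) const {
        for (const Region& r : regions_) {
            if (addr >= r.base && addr - r.base < r.size) return r.name;
        }
        if (policy_ == GapPolicy::Throw) {
            throw std::out_of_range("access to unmapped address");
        }
        return std::string();
    }

private:
    struct Region { std::string name; std::uint64_t base; std::uint64_t size; };
    GapPolicy policy_;
    std::vector<Region> regions_;
};
```

Keeping the policy as a constructor argument means the same model can be strict in unit tests and tolerant inside a full platform.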

Configuration

In a virtual platform you can quickly change any memory size, cache size, set policies and control debug levels using configuration. There’s a library to handle configuration aspects, and this tool reads in different file formats and then configures all of the IPs to be used in an SoC.

A configuration database provides a file-format (XML, JSON, Lua etc.) agnostic way to store and retrieve configuration values, and this can be leveraged by SystemC/CCI for configuring the System.

It can support both Static (Config-file(s) based) and Dynamic (Tool based) Configuration updates. Using a Broker design pattern it can also help to limit visibility of certain parameters as desired by the IP/Integration engineer.
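A minimal sketch of such a database and a visibility-limiting broker might look like the following. It does not use the SystemC CCI API; `ConfigDb`, `Broker`, and the parameter names in the usage note are assumptions made for this example, and the file parsers are left out.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Format-agnostic parameter store: whatever parser reads the XML, JSON or
// Lua file is assumed to end up calling set() with string values.
class ConfigDb {
public:
    void set(const std::string& key, const std::string& value) {
        values_[key] = value;
    }

    // Base 0 lets std::stol accept decimal, 0x-prefixed hex and 0-prefixed octal.
    long get_int(const std::string& key, long fallback) const {
        const auto it = values_.find(key);
        return it == values_.end() ? fallback : std::stol(it->second, nullptr, 0);
    }

private:
    std::map<std::string, std::string> values_;
};

// Broker that hides selected parameters from its clients: a hidden key
// behaves as if it had never been set.
class Broker {
public:
    explicit Broker(const ConfigDb& db) : db_(db) {}

    void hide(const std::string& key) { hidden_.insert(key); }

    long get_int(const std::string& key, long fallback) const {
        return hidden_.count(key) != 0 ? fallback : db_.get_int(key, fallback);
    }

private:
    const ConfigDb& db_;
    std::set<std::string> hidden_;
};
```

For instance, after `db.set("soc.ram.size", "0x10000")`, a call to `get_int("soc.ram.size", 0)` returns 65536, while a key hidden through the broker falls back to the caller's default.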

Reporting/Logging

SystemC provides the hooks, albeit basic, to support reporting with log-source capture, multiple log-levels, associating actions with logging etc. What is missing is a convenience class that can simplify log management at IP and integration level, which is provided by the CST Log module.

At the IP level we need capabilities to log not just (char*) strings, but also integers, registers, internal states, etc.

At the Integration level we need capabilities to filter out messages based on the log-source(s) in addition to log-levels. For non-interactive runs, and for debugging we may want to capture logs in files.

Tool configuration is also simplified if it has access to a centralized logging module.
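The two requirements above, logging arbitrary values at the IP level and filtering by source and level at the integration level, can be sketched as follows. This is a hypothetical class written for illustration, not the CST Log module, and it collects lines in memory rather than writing to a file.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

enum class Level { Error = 0, Warning = 1, Info = 2, Debug = 3 };

class Logger {
public:
    void set_level(Level max) { max_level_ = max; }
    void mute_source(const std::string& source) { muted_.insert(source); }

    // Accepts any streamable value, so integers or register contents can be
    // logged without manual string formatting.
    template <typename T>
    void log(Level level, const std::string& source, const T& value) {
        if (level > max_level_ || muted_.count(source) != 0) return;
        std::ostringstream line;
        line << '[' << source << "] " << value;
        lines_.push_back(line.str());
    }

    const std::vector<std::string>& lines() const { return lines_; }

private:
    Level max_level_ = Level::Info;   // Debug messages are dropped by default
    std::set<std::string> muted_;
    std::vector<std::string> lines_;
};
```

Because filtering happens in one central object, a tool only needs access to that object to change verbosity for the whole platform.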

Smart Timer

It is well known that introducing clocks, especially in LT simulation, can drastically slow down the simulation. While developing the models for virtual platforms, the clock is generally abstracted away, and the timing functionality is implemented in a loosely timed fashion.

Every SoC has one or more timer IPs, and developing the LT models of these timers can be tedious and error-prone.

CSTML has a generic ‘Smart Timer’ that can be mapped to any of your timer IP needs in either Loosely Timed or Clocked style. This class is highly configurable and supports most of the commonly required timer features: up or down counting, pre-scaling, enable and pause control, and cyclic or one-shot operation.
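To show what abstracting the clock away means in practice, here is a small loosely timed down-counter in plain C++. It is a sketch under simplifying assumptions (down counting only, no pause, load and prescale greater than zero) and is not the CSTML Smart Timer; the class name `LtTimer` is made up.

```cpp
#include <cassert>
#include <cstdint>

// Loosely timed down-counter: no per-clock events are scheduled. The counter
// value is derived on demand from the time elapsed since start().
// Assumes prescale > 0, load > 0 and now_ns >= the start time.
class LtTimer {
public:
    LtTimer(std::uint64_t clock_period_ns, std::uint32_t prescale)
        : tick_ns_(clock_period_ns * prescale) {}

    void start(std::uint64_t now_ns, std::uint32_t load, bool one_shot) {
        start_ns_ = now_ns;
        load_ = load;
        one_shot_ = one_shot;
    }

    // Absolute time of the first expiry; a model would schedule one event here.
    std::uint64_t first_expiry_ns() const { return start_ns_ + tick_ns_ * load_; }

    std::uint32_t value(std::uint64_t now_ns) const {
        const std::uint64_t ticks = (now_ns - start_ns_) / tick_ns_;
        if (ticks < load_) return load_ - static_cast<std::uint32_t>(ticks);
        if (one_shot_) return 0;  // stays at zero after expiry
        return load_ - static_cast<std::uint32_t>(ticks % load_);  // reloaded
    }

private:
    std::uint64_t tick_ns_;
    std::uint64_t start_ns_ = 0;
    std::uint32_t load_ = 1;
    bool one_shot_ = true;
};
```

With a 10 ns clock and a prescaler of 4, a load value of 5 expires 200 ns after the start, and the simulator handles one event for that expiry instead of twenty clock edges.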

Model Generator

Given an IP specification, there is a fair amount of boilerplate code needed to implement registers, internal memory, interface-handling, and configurations. Manually transcribing the specification document to code can be time consuming and can introduce bugs in the process.

Machine-readable specifications such as IP-XACT, custom XMLs, and Excel sheets are becoming common. The Model Generator (Python-based) accepts file inputs (in different formats) to describe any IP block, and then it automatically creates the boilerplate code needed for:

  1. IP scaffolding including interfaces, registers, any internal memories, tlm-socket to register/memory binding, configuration params
    1. Doxygen comments provide contextual info drawn from the Inputs.
    2. User-code to be written is generated in separate sources, so that the IP code can be regenerated, if required, without loss of user customizations.
  2. Unit testbench (UT) with complementary interfaces and sanity test cases for testing the memory map, registers, and configurations.
  3. A Top module to instantiate and connect IP and UT.
  4. Configuration file(s) for IP/UT and Top.
  5. Build scripts (CMake based) for building and testing IP.
  6. README.md to provide basic information on the IP and how to build and test it.

You don’t have to start with a blank screen and hand-code all of the low-level details when you use the Model Generator approach. It even creates code that conforms to your own style guidelines for consistency.
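The generation step can be pictured with a toy example: a small Python function that turns a machine-readable register list into C++ source text with documentation comments. The JSON schema, the function `generate_register_header`, and the emitted struct layout are invented for illustration and do not reflect the actual Model Generator input or output.

```python
import json

# Hypothetical specification; a real flow might read IP-XACT or a spreadsheet.
SPEC = json.loads("""
{
  "name": "uart",
  "registers": [
    {"name": "CTRL",   "offset": 0, "reset": 0,      "doc": "Control register"},
    {"name": "STATUS", "offset": 4, "reset": 1,      "doc": "Status register"},
    {"name": "BAUD",   "offset": 8, "reset": 115200, "doc": "Baud rate setting"}
  ]
}
""")


def generate_register_header(spec):
    """Return C++ source text declaring one member per register in the spec."""
    struct_name = spec["name"].capitalize() + "Regs"
    lines = [
        f"// Generated from the '{spec['name']}' specification. Do not edit.",
        "#include <cstdint>",
        "",
        f"struct {struct_name} {{",
    ]
    for reg in sorted(spec["registers"], key=lambda r: r["offset"]):
        # Doxygen-style comment drawn from the specification text
        lines.append(f"    /// {reg['doc']} (offset 0x{reg['offset']:02X})")
        lines.append(f"    std::uint32_t {reg['name']} = 0x{reg['reset']:08X}u;")
    lines.append("};")
    return "\n".join(lines) + "\n"


header_text = generate_register_header(SPEC)
```

Keeping the generated text in its own file, separate from hand-written code, is what allows regeneration after a specification change without overwriting user customizations.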

Generic Router

Once we have a set of Master and Slave IPs, the next logical step is to connect them together based on the System memory-map. A router is a common IP block required in a system, and CircuitSutra has made their generic router configurable: it enforces your routing policy, is aware of DMI, and follows your security policies. All of the options are configurable with an external file.

The generic router provides a way to configure N initiators and M targets. The target memory map is configurable for each initiator. It also optionally provides a way to base-adjust the outgoing transaction address. Error handling of unmapped regions can also be configured.

Alternative routing policies like round-robin, fixed-routing and priority routing can also be implemented. The router can also be made DMI aware, handling not only the normal/debug transport APIs, but also the DMI forward transport interface with base-adjustment, and the invalidate DMI backward interface. The router handles both LT and AT style TLM requests. Logging the configured memory maps and time-stamped transactions is very helpful during debugging.
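The core of such a router is address decoding against a per-initiator memory map. The sketch below covers only that part (decode, optional base adjustment, and an error on unmapped addresses) in plain Python; it does not model TLM sockets, DMI, or arbitration, and the names `Router`, `AddressError`, and the example map are hypothetical.

```python
class AddressError(Exception):
    """Raised when a transaction address is not mapped to any target."""


class Router:
    def __init__(self):
        # initiator id -> list of (base, size, target, base_adjust)
        self._maps = {}

    def map(self, initiator, base, size, target, base_adjust=True):
        """Add one region to the memory map seen by the given initiator."""
        self._maps.setdefault(initiator, []).append(
            (base, size, target, base_adjust))

    def route(self, initiator, address):
        """Return (target, outgoing_address) for a transaction."""
        for base, size, target, base_adjust in self._maps.get(initiator, []):
            if base <= address < base + size:
                outgoing = address - base if base_adjust else address
                return target, outgoing
        raise AddressError(
            f"initiator {initiator}: unmapped address 0x{address:08X}")


router = Router()
router.map("cpu", 0x0000_0000, 0x1000, "rom")
router.map("cpu", 0x4000_0000, 0x100, "uart0")
# The DMA sees the UART too, but keeps the absolute address on the way out.
router.map("dma", 0x4000_0000, 0x100, "uart0", base_adjust=False)
```

In this sketch an unmapped access raises an exception; a configurable router could instead return an error response or route to a default target.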

Generic Memory

In many SoC devices, memory IP blocks account for over 50% of the area. It is good to have a generic memory model that can range in size from a few MB up to a multi-GB array. You configure each memory IP, define RW permissions, use logging and tracing for debug, and model single or multi-port instances.

Multiple configuration knobs are supported, like the size of memory, read-write permissions and latency, byte-initialization at reset, and retention. It may also provide a feature to save/restore memory state to files. LT-friendly memory implementations also provide support for DMI. Logging and tracing of memory transactions are provided to help in debugging. More complex implementations may provide multiple ports with configurable arbitration policies.
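A few of these knobs can be sketched in plain Python. The example assumes one common way to model a multi-GB array cheaply, allocating pages only when they are first written; whether the CircuitSutra model does this is not stated in the article. The class `GenericMemory` and its parameters are hypothetical.

```python
class GenericMemory:
    PAGE = 4096  # bytes per lazily allocated page

    def __init__(self, size, init_byte=0x00, readable=True, writable=True,
                 read_latency_ns=0, write_latency_ns=0):
        self.size = size
        self.init_byte = init_byte
        self.readable = readable
        self.writable = writable
        self.read_latency_ns = read_latency_ns
        self.write_latency_ns = write_latency_ns
        self._pages = {}  # page number -> bytearray, created on first write

    def _check_range(self, addr, length):
        if addr < 0 or addr + length > self.size:
            raise IndexError(f"access at 0x{addr:X} is outside the memory")

    def write(self, addr, data):
        """Store bytes and return the configured write latency."""
        if not self.writable:
            raise PermissionError("memory is read-only")
        self._check_range(addr, len(data))
        for i, byte in enumerate(data):
            page, offset = divmod(addr + i, self.PAGE)
            if page not in self._pages:
                self._pages[page] = bytearray([self.init_byte]) * self.PAGE
            self._pages[page][offset] = byte
        return self.write_latency_ns

    def read(self, addr, length):
        """Return (data, latency); untouched locations read as init_byte."""
        if not self.readable:
            raise PermissionError("memory is write-only")
        self._check_range(addr, length)
        out = bytearray()
        for a in range(addr, addr + length):
            page, offset = divmod(a, self.PAGE)
            stored = self._pages.get(page)
            out.append(self.init_byte if stored is None else stored[offset])
        return bytes(out), self.read_latency_ns

    def reset(self):
        """Byte-initialization at reset: drop all written pages."""
        self._pages.clear()


# A 4 GB memory that costs host RAM only for the pages actually written
ram = GenericMemory(size=4 * 1024**3, init_byte=0xFF, read_latency_ns=10)
ram.write(0xFFF, b"\x01\x02")  # this write straddles a page boundary
```

A retention option could be modeled by skipping `reset()` for selected instances, and save/restore by serializing the dictionary of written pages.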

Python Integration

Test engineers do not have to be C++/SystemC experts to test the IP functionality. If the test scenarios are enumerated, they may be coded in any (scripting) language. A Python front-end for SystemC is quite popular due to its ease of interface with C/C++ code, and the general familiarity of engineers with the Python language. Writing tests in Python makes them more readable with fewer lines of code, and consequently fewer bugs. CSTML provides a generic testbench infrastructure that allows creating consistent self-checking unit test cases.
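A self-checking test written in Python might look like the sketch below, which uses the standard `unittest` module. The class `FakeRegisterBlock` is a pure-Python stand-in for a model that would normally be exposed from SystemC; it is not a real binding and not part of the CSTML testbench infrastructure.

```python
import unittest


class FakeRegisterBlock:
    """Stand-in for a device model reachable from Python (hypothetical)."""

    RESET = {0x0: 0x0, 0x4: 0x1}  # offset -> reset value

    def __init__(self):
        self.regs = dict(self.RESET)

    def read(self, offset):
        return self.regs[offset]

    def write(self, offset, value):
        self.regs[offset] = value & 0xFFFFFFFF  # registers are 32 bits wide


class RegisterSanityTest(unittest.TestCase):
    def setUp(self):
        self.dut = FakeRegisterBlock()

    def test_reset_values(self):
        for offset, expected in FakeRegisterBlock.RESET.items():
            self.assertEqual(self.dut.read(offset), expected)

    def test_write_read_back(self):
        self.dut.write(0x0, 0x1_2345_6789)  # upper bits must be truncated
        self.assertEqual(self.dut.read(0x0), 0x2345_6789)


def run_tests():
    """Run the sanity tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RegisterSanityTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test checks its own result, so a regression run needs no manual inspection of waveforms or logs.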

Summary

A well-designed SystemC modeling methodology can be a big productivity boost, letting you create a Virtual Platform more quickly, with less engineering effort and shorter debug time, than starting from scratch. The engineers at CircuitSutra have been honing their ESL design skills over the past decade using SystemC and their libraries across a wide range of domains:

  • Automotive
  • Storage
  • Application processors
  • IoT

They are working with leading EDA, semiconductor and systems companies.

View the archived tutorial from DVCon, starting at time point 21:40.

Related Blogs