
Cadence acquires Azuro
by Paul McLellan on 07-12-2011 at 12:20 pm

Cadence this morning announced that it has acquired Azuro. Azuro has become a leader in building the clock trees for high performance SoCs. A good rule of thumb is that the clock consumes 30% of the power in an SoC so optimizing it is really important. Terms were not disclosed.

The clock trees involve clock gating, which can reduce clock tree power by 30% (and thus overall chip power by about 10%). They can improve performance of the clock tree by reducing skew, and thus overall clock frequency, by up to 10%. And all while reducing the area of the clock tree by as much as 30%.
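The arithmetic behind those rules of thumb is straightforward. A quick sanity check (the 30% figures come from the article; the rest is just multiplication):

```python
# Back-of-the-envelope check of the power figures quoted above.
# Illustrative numbers only, taken from the article's rules of thumb.

clock_fraction_of_chip_power = 0.30   # clock consumes ~30% of SoC power
clock_gating_savings = 0.30           # gating cuts clock-tree power ~30%

chip_power_savings = clock_fraction_of_chip_power * clock_gating_savings
print(f"Overall chip power saved: {chip_power_savings:.0%}")  # 9%, i.e. ~10%
```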

By reputation, Azuro’s technology is much better than the clock synthesis that comes for “free” in any of the major place and route systems. How easy it will remain to use, say, Synopsys for place and route alongside Azuro’s CCOpt (clock concurrent optimization) technology remains to be seen.


On-chip supercomputers, AMBA 4, Coore’s law

On-chip supercomputers, AMBA 4, Coore’s law
by Paul McLellan on 07-11-2011 at 12:45 pm

At DAC I talked with Mike Dimelow of ARM about the latest upcoming revision to the AMBA bus standards, AMBA 4. The standard gets an upgrade about every 5 years. The original ARM in 1992 ran at 10 MIPS with a 20 MHz clock. The first AMBA bus was a standard way to link the processor to memories (through the ARM system bus, ASB) and to peripherals (through the ARM peripheral bus, APB). Next year ARM-based chips will run at 2.5GHz and deliver 7000 MIPS.

Eric’s story of Thomson-CSF’s attempt to build a processor of this type of performance in 1987 points out that in those days that would have qualified as a supercomputer.

The latest AMBA standard proposal actually steals a lot of ideas from the supercomputer world. One of the biggest problems with multi-core computing, once you get a lot of cores, is that each core has its own cache, and when the same memory line is cached in more than one place the copies need to be kept coherent. The simplest way to do this, which works fine for a small number of cores, is to keep the line in only one cache and invalidate it in all the others. Each cache monitors the address lines for any writes and invalidates its own copy, a technique known as snooping. As the number of cores creeps up this becomes unwieldy and a major performance hit, as more and more memory accesses turn out to be to invalidated lines that therefore require an off-chip memory access (or perhaps a hit in another cache level, but much slower either way). The problem is further compounded by peripherals, such as graphics processors, that access memory too.

The more complex solution is to make sure that the caches are always coherent. When a cache line is written, if it is also in other caches then these are updated too, a procedure known as snarfing. The overall goal is to do everything possible to avoid needing to make an off-chip memory reference, which is extremely slow in comparison to a cache-hit and consumes a lot more power.

The new AMBA 4 specification supports this. It actually supports the whole continuum of possible architectures, from non-coherent caches (better make sure that no core is writing to memory another is reading from) through to the fully coherent snooped and snarfed caches described above.
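The invalidate-on-write (snooping) and update-on-write (snarfing) policies described above can be sketched in a few lines. This is a toy model with hypothetical class names, a single shared bus and no MESI-style state machine, not the actual AMBA 4 (ACE) protocol:

```python
# Toy model of two cache coherence policies: snoop-invalidate vs.
# snarf-update. Hypothetical names; real protocols track line states.

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                 # address -> cached value

    def read(self, addr, memory):
        if addr not in self.lines:      # miss: the costly memory access
            self.lines[addr] = memory[addr]
        return self.lines[addr]

class Bus:
    """All caches watch (snoop) writes broadcast on the bus."""
    def __init__(self, memory, policy="invalidate"):
        self.caches = []
        self.memory = memory
        self.policy = policy

    def write(self, writer, addr, value):
        writer.lines[addr] = value
        self.memory[addr] = value
        for cache in self.caches:
            if cache is writer or addr not in cache.lines:
                continue
            if self.policy == "invalidate":   # snooping: drop stale copy
                del cache.lines[addr]
            else:                             # snarfing: update the copy
                cache.lines[addr] = value

memory = {0x100: 1}
bus = Bus(memory, policy="invalidate")
c0, c1 = Cache("core0"), Cache("core1")
bus.caches += [c0, c1]

c0.read(0x100, memory); c1.read(0x100, memory)
bus.write(c0, 0x100, 42)
print(0x100 in c1.lines)        # False: core1's copy was invalidated
print(c1.read(0x100, memory))   # 42, but only after a fresh memory access
```

With `policy="update"` the second core's copy would be snarfed into coherence instead, avoiding the re-fetch at the cost of more bus traffic on every write.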

I’ll ignore the elephant in the room: how you program these beasts when you have a large number of cores. Remember what I call Coore’s law: Moore’s law means the number of cores on a chip is doubling every couple of years; it’s just not obvious yet since we’re still on the flat part of the curve.

The other big hardware issue is power. On a modern SoC with heterogeneous cores and specialized bits of hardware, power can often be reduced by having a special mini-core. For example, it is much more power-efficient to use a little Tensilica core for MP3 playback than to use the main ARM processor, even though it is “just software” and so the ARM can perfectly well do it. Only one of the cores is used at a time: if no MP3 is playing, the Tensilica core is powered down; if MP3 is playing, the ARM is (mostly) idle.

However, when you get to symmetric multiprocessing, there is no point in powering down one core in order to use another: they are the same core, so if you didn’t need the core you wouldn’t put it on the chip. If you have 64 cores on a chip, the only point of doing so is that at times you want to run all 64 cores at once. And the big question, to which I’ve not seen entirely convincing answers, is whether you can afford, power-wise, to light up all those cores simultaneously. Or is there a power limit on how many cores we can have (unless we lower their performance, which is almost the same thing as reducing the number of cores)?

The AMBA 4 specification can be downloaded here.



Design for test at RTL
by Paul McLellan on 07-10-2011 at 3:09 pm

Design for test (DFT) imposes various restrictions on the design so that the test automation tools (automatic test pattern generation approaches such as scan, as well as built-in self-test approaches) will subsequently be able to generate the test program. For example, different test approaches impose constraints on clock generation and distribution, on the use of asynchronous signals such as resets, on memory controls and so on. If all the rules are correctly addressed then the test program should be complete and efficient in terms of tester time.

The big challenge is that most of these restrictions, and most of the analysis tools around, work on the post-synthesis netlist. There are two big problems with this. The first is that if changes are required to the netlist, it is often difficult to work out how to change the RTL to get the “desired” change in the netlist; often the only fix is to simply update the netlist and accept that the RTL, the canonical design representation, is no longer accurate. The second is that the changes come very late in the design cycle and, as with all such unanticipated changes, can disrupt schedules and perhaps even performance.

Paradoxically, most of these changes would have been simple to make had the DFT rule-checking been performed on the RTL instead of the netlist. The changes can then be implemented at the correct level in the design, and earlier in the design cycle, when they are planned and so not disruptive.

The challenge is that many of the DFT rules relate to specific architectures for scan chains, but at the RTL level the scan chains have not yet been instantiated and will not be until much later in the design cycle. So to be useful, an RTL approach needs first to infer which chains will exist, without the expensive process of instantiating them and hooking them all up. A second problem is that the various test modes alter how the clocks are generated and distributed (most obviously to the various scan chains). And a third issue is that test tools such as fault simulators require completely predictable circuit behavior: bus contention or race conditions can create non-deterministic behavior, which must be avoided. None of these three problems can be addressed by a simple topological analysis of the RTL.


Instead, a look-ahead architecture is required that predicts how the suite of test tools will behave and can then check that all the rules will pass. This can be done using a very fast synthesis to produce a generic hierarchical netlist, but with enough fidelity to allow checks such as latch detection. The netlist can then be flattened for fast checking of topological rules like combinational loop detection. This approach allows DFT rule-checking to be done even before the block or design runs through synthesis, including accurately estimating the test coverage, and so avoids a scramble late in the design cycle to improve inadequate coverage.
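One of the topological checks mentioned above, combinational loop detection, reduces to cycle detection on a directed graph once the netlist is flattened. A minimal sketch, with a hypothetical netlist represented as an adjacency list (real DFT tools work on far richer data structures):

```python
# Combinational loop detection as cycle detection on a directed graph.
# Nodes are gates/nets; an edge u -> v means u drives v.

def has_combinational_loop(netlist):
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on DFS stack / done
    color = {node: WHITE for node in netlist}

    def dfs(node):
        color[node] = GRAY
        for succ in netlist.get(node, []):
            if color.get(succ, WHITE) == GRAY:   # back edge: a loop
                return True
            if color.get(succ, WHITE) == WHITE and dfs(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in netlist)

acyclic = {"a": ["b"], "b": ["c"], "c": []}
looped  = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(has_combinational_loop(acyclic))  # False
print(has_combinational_loop(looped))   # True
```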

The Atrenta SpyGlass-DFT white paper can be downloaded here.


Intel Twisting ARM?
by Daniel Nenni on 07-10-2011 at 11:00 am

Intel’s new Tri-Gate technology is causing quite a stir on the stock chat groups. Some have even said that if Intel uses its Tri-Gate technology only on Intel processors, ARM will be in deep, deep trouble. These guys are “Intel Longs” of course and they are battling “Intel Shorts” with cut-and-paste news clips.

“ARM is in trouble & this is why. Future smartphones will require more & more capability/features/functions. That’s just the way it is. ARM is great at performance/power/other specs based on today’s capabilities. But, when the architecture gets stretched, all bets are off. We’re starting to see that today with certain benchmarks. Intel’s architecture will be far superior in the long run because they own the end-to-end (design to manufacturing), it will be scalable, it will be affordable, etc. The analysts are too dumb to understand this yet. They will in less than a year’s time though.” backbay_bston

I don’t own any of these stocks so I’m financially neutral, but clearly I’m very suspicious of Intel’s Tri-Gate claims, as I blogged in: TSMC Versus Intel: The Race to Semiconductors in 3D! That blog got me an invitation to the Intel RNB (Robert Noyce Building) to meet with one of their manufacturing guys and talk about Tri-Gate. I spend a lot of time in Asia and have seen the horrors of 40nm statistical process variation (yield). More recently I have seen a near perfect implementation of 28nm HKMG, but I promise you I’m going into this meeting with an open mind and an Intel powered laptop.

In preparation for my technical deep dive on Tri-Gate technology at RNB I need to come up with good questions so I will look smart. I could really use your help with this; here is what I have so far:

On the manufacturing side:

1. What is the difference between Tri-Gate and bulk CMOS HKMG?
2. Additional processing steps?
3. How many more masks/layers?
4. Special manufacturing equipment?

On the design side:

1. Spice Models: There are no “standards” for multi-gate Spice models — the Compact Model Council has not really made adoption of MG models a priority… What did Intel use for device models and circuit simulation? An approach internal to Intel? (Most of the modeling research published in technical journals to date has been for a single fin.)
2. Layout Dependent Effects: For several generations of planar technologies, the influence of Layout Dependent Effects has continued to increase — what are the LDE in a Tri-Gate technology? For example, for six device fins in parallel, do the fins on the outer edges behave differently than the middle fins? Or is the volume of the fin so small that adjacent layout structures have little influence on the device current? (If LDE is less of an issue with Tri-Gates, that would be a major turning point in CAD tools and flows.) Restricted design rules?
3. Parasitic Extraction: Custom parasitic extraction with Tri-Gate is very challenging! There are unique device parasitics associated with Tri-Gates — the input gate resistance is more intricate due to the 3D topology over and between fins, and the parasitic gate-to-drain and gate-to-source capacitances are likewise more involved. What approach did Intel take toward parasitic extraction? (Were “standard” multiple-fin device combinations chosen to simplify the task of (custom) parasitic extraction?)
4. Why 6 and 2? Intel appears to have “standardized” on offering two design choices — six FinFETs in parallel and two in parallel — what were the considerations that went into this choice? (Also, see #3.)
5. AMS Design Impact of Tri-Gate: Analog mixed-signal designs are constrained by the limited availability of diodes and resistors in planar technology — what circuit methodology changes did the AMS design teams have to make? Did Intel ever consider offering mixed Tri-Gate and planar devices on the same die?
6. Multi-Vt Device Options and Circuit Optimization: Tri-Gate does not offer the custom circuit designer as much freedom in design optimization, due to the quantization of the device width in increments of additional fins… What changes did Intel make to their circuit optimization methods? What device Vt and gate length options are available to designers for optimization?
7. Thermal Characteristics: What additional thermal heat transfer issues are present, due to the power dissipation in the small volume of the fin?
8. Tri-Gate vs. Dual-Gate FinFETs: Was this comparison done? Why did Intel choose a “tri-gate” device, rather than a “dual-gate” device (with a thicker, non-contributing oxide on top of the fin)? (Tri-Gate devices are reported to have worse leakage current behavior at the top corners of the fin.)
9. Statistical Process Variation: How will it be addressed? What are the major contributors to statistical process variation with FinFET fabrication?
10. Fin Dimensions: The fin height, fin thickness, and spacing between fins are key manufacturing parameters toward achieving a high circuit density — what criteria did Intel use in optimizing the Tri-Gate device dimensions?

Let me know what else interests you about Intel’s new Tri-Gate technology. Clearly the design side questions are for the people who believe Intel is a foundry.

Tri-Gate technology certainly could be a game changer, especially for AMD. How is AMD going to compete on processor speed using 28nm Gate-First HKMG technology? Is this a factor in AMD’s inability to attract a top CEO candidate?

For those of you who have not met me before, here is a recent mug shot. Not only do I have a hot wife half my age, but look at the size of my head. You can only imagine how smart I am. Plus I drive a Porsche. Cool AND smart, absolutely.


Low Power Webinar Series
by Paul McLellan on 07-08-2011 at 4:57 pm

At DAC 2011 in San Diego, Apache gave many product presentations. Of course not everyone could make DAC or could make all the presentations in which they were interested. So from mid-July until mid-August these presentations will be given as webinars. Details, and links for registration, are here on the Apache website.

The webinars are listed below. All webinars are 11am to 12pm PDT.

• Ultra-low-power methodology, July 19th
• IP integration methodology, July 21st
• PowerArtist: RTL power analysis, reduction and debug, July 26th
• RedHawk: SoC power integrity and sign-off for 28nm design, July 28th
• Totem: analog/mixed-signal power noise and reliability, August 2nd
• PathFinder: full-chip ESD integrity and macro-level dynamic ESD, August 4th
• Chip-Package-System (CPS) convergence solution, August 9th
• Sentinel: PSI IC-package power and signal integrity solution, August 11th

Once Upon A Time… ASIC designers developed ICs for a Supercomputer in the ’80s
by Eric Esteve on 07-07-2011 at 10:41 am

Last weekend I had the pleasant surprise of meeting one of my oldest friends, Eric, who reminded me of the old times when we were working together as ASIC designers on… a supercomputer project.

We were in France, at a French company (Thomson-CSF) active in the military segment, which was able to spend what was at that time a fortune ($25M) to develop a supercomputer from scratch. And when I say from scratch, I mean that we had to invent almost everything, except the ASIC design methodology and the EDA tools, both provided by VLSI Technology Inc. To be honest, we were very lucky that a French solution (like Matra Harris Semiconductor or Thomson Composants Spéciaux) had not been chosen, which could have happened for obscure political reasons. We had in our hands what was considered the Rolls-Royce for ASIC designers in 1987: the whole design team was equipped with Sun workstations, and the design tool set from VLSI was really user friendly… except that it was missing a synthesis tool. But none of us knew Synopsys, that obscure start-up, so we were pretty happy to start. Just for your information, I will describe the type of work done by a two-engineer team over an 18-month period.

Just a word about the project itself. The supercomputer’s chief architect was a talented university professor; talented, but this was his first contact with the industrial world. He had defined the machine architecture, based on three main areas: the CPU boards (based on off-the-shelf CPU chips, the Weitek Abacus), the FIFO-based interconnect network and the memory area, as well as six different ASIC devices. It was a “superscalar” architecture. The task Eric and I were assigned was to design all the functions that would be reused across the different ASIC designs: the FIFOs, the test functions and the clock distribution inside the chips.

The first one was the easiest, as we only had to define the specification of a FIFO compiler; the compiler itself, a full custom design at transistor level, would be subcontracted to VLSI Technology. We just took pen and paper and defined the memory point, transistor by transistor, and the FIFO behavior… in writing. No simulations (SPICE was not part of our EDA package), just discussions with Michel Gigluielmetti, our interface at VLSI. VLSI was in charge of the compiler design and model generation, as we had to be able to start designing and integrating FIFOs long before seeing any working silicon. When I look back, it was pretty risky, wasn’t it?

The test strategy was based on the newly introduced JTAG IEEE 1149.1 « Standard Test Access Port and Boundary-Scan Architecture »; this part was not that difficult, as everything was defined in the standard.

Then we looked at the clock distribution. Remember, this was 1987: there was no “Clock Distribution Macro” that we could use. The clock was a magic signal, running at 20 MHz (such a high speed!), that designers used to run their simulations, a perfect signal with no skew… How do you manage the clock distribution in a chip containing 2,000 or 3,000 flip-flops? Starting inside the chip, we then thought about the inter-chip communications, and wrote down a couple of equations… and discovered that we had a real issue! How would the entire system work, with chips communicating from board to board, possibly located a meter from each other? What about the flight time through the interconnect? And so on… The most surprising thing is that we (two beginners) discovered this issue, while none of the seasoned engineers working at a higher management level had even thought about it!

So Eric and I sent a note, copying all of management, to raise the issue. Then started one of the most amazing, creative times, after the project leader decided to assign us to work full time on the clock distribution within the machine, inside and outside the ASICs. First, we defined the basic equations:

When you send data from a “slow” device, you have to comply with:

Temission_slow + Tinterconnect + Tsetup + Tskew < Tclock_cycle

But when the emitter is “fast”, the equation becomes:

Temission_fast + Tinterconnect > Thold + Tskew

It was at that step that we discovered that two identical ASIC devices could exhibit variations of 1 to 3x, when taking into account voltage, process and temperature induced variations! Our managers guessed it was 10 or 20%… So we defined the clock distribution in the machine, selecting external buffers as fast as possible (for the minimum transit time, the specification was… 0ns), trying to minimize the impact of the buffering. But in doing so, we realized that it could not work in every case, even if we increased the clock period (and decreased the frequency, which is not really what you want to do when you design a supercomputer…). With the help of VLSI Technology, we defined a kind of delay-locked loop (DLL), so that the ASIC could self-calibrate (a fast device would delay the time at which the data was emitted, to guarantee the hold time). We also defined different phases for the clock period, so we could artificially enlarge the clock cycle, to receive the data with no setup problem. In other words, we had to reinvent the wheel, even if I am sure that the designers working at Cray Research did it before us! When I see the size of a team working today on a single device (OMAP5 or equivalent), I think we were very lucky to discover ASIC design in such a way.
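The two constraints above can be checked numerically. A small sketch with hypothetical timing values in ns, chosen only to show how a slow emitter threatens setup while a very fast one violates hold:

```python
# Checking the setup and hold constraints from the text, with made-up
# timing values in ns. The 1-to-3x spread between "identical" devices
# is the effect the authors discovered the hard way.

def setup_ok(t_emission, t_interconnect, t_setup, t_skew, t_cycle):
    # slow emitter: data must arrive a setup time before the next edge
    return t_emission + t_interconnect + t_setup + t_skew < t_cycle

def hold_ok(t_emission, t_interconnect, t_hold, t_skew):
    # fast emitter: data must not change before the hold window closes
    return t_emission + t_interconnect > t_hold + t_skew

t_cycle = 50.0                    # 20 MHz clock period
t_setup, t_hold, t_skew = 3.0, 2.0, 5.0
t_interconnect = 6.0              # board-to-board flight time

slow, fast = 30.0, 10.0           # 3x spread in emission time
print(setup_ok(slow, t_interconnect, t_setup, t_skew, t_cycle))  # True
print(hold_ok(fast, t_interconnect, t_hold, t_skew))             # True

# A very fast device (0.5 ns) violates hold: this is the case the
# self-calibrating DLL was invented to fix, by delaying emission.
print(hold_ok(0.5, t_interconnect, t_hold, t_skew))              # False
```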

Because there is a moral in every story, I must say that the project was suddenly ended by Thomson-CSF top management, when it appeared that the machine would never work, at least not at 20 MHz, the official reason being Weitek’s difficulties in shipping the CPU. Then the engineering manager of the software team moved to another Thomson subsidiary, in charge of developing tools for the stock market. Eric moved to Australia for a year, to learn to surf. By the way, he is still living there! As for me, I stayed in ASIC design, doing chips for aircraft motors or analogue simulation for the TGV, and finally was an ASIC FAE for TI, where my largest customer was the Advanced Computer Research Institute (ACRI), designing… a supercomputer! But that is another story…
By Eric Esteve


TSMC Financial Status Plus OIP Update!
by Daniel Nenni on 07-05-2011 at 8:00 am

Interesting notes from my most recent Taiwan trip: Taiwan unemployment is at a record low. Scooters once again fill the streets of Hsinchu! TSMC will be passing out record bonuses to a record number of people. TSMC fab expansions are ahead of schedule. The new Fab 15 in Taichung went up amazingly fast, with equipment moving in later this year. When was the last time you saw a fab built ahead of schedule and under budget? Simply amazing! Taiwan is also ready to overtake Japan as the world’s largest semiconductor materials market. The Taiwan market grew from $6.9 billion in 2009 to an estimated $9.1 billion in 2010, roughly 32% growth. Go Taiwan!

The Motley Fool did a nice TSMC financial article with pretty pictures. I like pretty pictures. The bottom line is that not only is TSMC the largest semiconductor foundry, TSMC is also the most profitable. The important point here is margins. Margins translate into pricing flexibility as supply outpaces demand, which is coming, believe it! Semiconductor manufacturing capacity utilization today is running at 90%+ in most segments. With all the new fab space coming online from TSMC, Samsung, Intel, and GlobalFoundries in 2012 it may be a different story. Either way TSMC wins.

Unfortunately Motley Fool does not know semiconductors, as they listed NVIDIA and LDK Solar as industry peers/competitors! DOH! One of the most amusing things I do for money is consult with Wall Street types and explain exactly what the semiconductor market is and who the real players are. I also slip in some EDA and semi IP information whenever possible. Even with the recent acquisitions, Wall Street simply does not care about EDA, but I digress.

The one semi-relevant example Motley Fool uses is the number four foundry, SMIC. TSMC gross margins are 49.6% versus SMIC at 20.8%. UMC, the number two foundry, is at 27.5%. GlobalFoundries’ financials are private, but I will see what I can find out. Intel and Samsung will never tell foundry capacity or margin numbers, so I shouldn’t even be mentioning them in the same paragraph as the real foundries.

Coming this fall from TSMC is the new and improved Open Innovation Platform Ecosystem Forum. TSMC is preparing a massive design ecosystem event on Tuesday, October 18th at the San Jose Convention Center. A call for papers has already gone out; 18 papers will be presented to an open forum of industry executives from TSMC, ecosystem partners, and customers. This is a DO NOT MISS event! There will be focused breakout sessions on all manner of design issues AND a pavilion with around 80 TSMC Design Ecosystem partners showing their wares. Plus, I will be there (free food), such a deal. The food is always good at TSMC events!

The Open Innovation Platform® is the substantiation of TSMC’s Open Innovation model that brings together the thinking of customers and partners under the common goal of shortening design time, minimizing time-to-volume and speeding time-to-market, and ultimately time-to-money.

No doubt this event will be sold out. Follow SemiWiki.com for TSMC OIP updates coming soon.



Two More Transistor-Level Companies at DAC
by Daniel Payne on 07-02-2011 at 8:38 pm

In my rush on Wednesday at DAC I had almost overlooked the last two companies I talked with: Invarian and AnaGlobe. For these last two I had hand-written notes on paper, so I just got to the bottom of my inbox tonight to write up the final trip reports.

Invarian
Jens Andersen and Vladimir Schellbach gave me an overview of tools that perform temperature, package and analog layout analysis:

• Models actual component temperature
• Identifies electromigration
• Finds hotspots
• Solves the full 3D heat transfer equation
• Accounts for block layout impact
• Accounts for power dissipation

The Invarian tool, named InVar, works with a SPICE simulator like Berkeley Design Automation’s Analog FastSPICE. They address both analog and digital design flows. The only other competitor in this space would be Gradient DA.

Summary
Watch this startup: even with under a dozen people in Moscow and Silicon Valley, they have an interesting focus on temperature variation that the big four in EDA haven’t started serving yet. Their IR drop and EM analysis have plenty of competitors.

AnaGlobe
How would you load an IC layout that was 180GB in size? At AnaGlobe they use the Thunder chip assembly tool and get the design loaded in under two hours. Yan Lin gave me a quick overview of their tools.

GOLF is a new PCell design environment based on OA.

PLAG is another OA tool for flat panel layout.

Summary
AnaGlobe is certainly a technology leader for large IC database assembly. Their GOLF tool competes with Virtuoso, Ciranova and SpringSoft. PLAG looks to have little competition. Big name design companies use AnaGlobe tools: Nvidia, Marvell, SMIC, AMCC.


Apache Design Automation acquired by Ansys
by Daniel Payne on 06-30-2011 at 2:52 pm

We all knew that Apache had filed for an IPO earlier and were just waiting for the timing and price to be revealed. Rumors have been circulating about an acquisition, and today we know that the rumors were true: Ansys paid $310 million in cash for Apache.

Ansys stock has surged some 35% over the past twelve months.

Products
This acquisition looks totally complementary in terms of products. Ansys also purchased Ansoft back in 2008, so they now have a good mix of software tools across multiple disciplines:

• Low-Power IC Design (Apache)
• Electromagnetics
• Explicit Dynamics
• Fluid Dynamics
• Multiphysics
• Structural Mechanics

It will be interesting to see if Ansys creates a division just for Apache tools or merges it into the Electromagnetics division.


Cadence to launch PCIe gen-3 (8 GT/s) IP and VIP: fruit of Denali acquisition
by Eric Esteve on 06-28-2011 at 10:59 am

The recent announcement from Cadence officially launching the PCI Express 3.0 Controller IP, as well as the associated Verification IP (VIP) — made up of the Compliance Management System (CMS), which provides interactive, graphical analysis of coverage results, and PureSuite, which provides the associated PCIe test cases — clearly demonstrates that the acquisition of Denali is bearing more fruit, after the DDR4 Controller IP. Maybe some history will help. Back in 2006, Denali was known for their VIP products for interface functions like PCIe, USB or SATA, when they first launched a PCI Express (gen-1 at that time) Controller IP. It was quite surprising, especially for their former partners, who suddenly became their competitors! Nevertheless, they found a place in the market, positioning on the high-end (and expensive) side, supporting Root Port or Endpoint and soon Single Root I/O Virtualization (SR-IOV), a solution targeting the PC server market, while Synopsys and PLDA were positioned on the mainstream PCIe IP market. Then the PCIe 2.0 specification was issued, in 2007, and Denali was still in the race.

With the launch of this PCIe 3.0 solution, still in the emerging phase and probably reserved, for the moment, for high-end, advanced applications like storage, supercomputing, enterprise and networking, Cadence/Denali is following the same strategy: high-end, high-margin IP as opposed to mainstream solutions, well covered by the competition. The customer mentioned by Cadence in their press release, PMC-Sierra, with its 6Gb/s SAS Tachyon protocol controller ASSP integrating this IP, is clearly in this market segment.

Features

The PCIe core includes these features:

Single-Root I/O Virtualization
The PCIe core provides a Gen 3 16-lane architecture with full support for the latest Address Translation Service (ATS) and Single-Root I/O Virtualization (SR-IOV) specifications, including Internal Error Reporting, ID Based Ordering, TLP Processing Hints (TPH), Optimized Buffer Flush/Fill (OBFF), Atomic Operations, Re-Sizable BAR, Extended TAG Enable, Dynamic Power Allocation (DPA) and Latency Tolerance Reporting (LTR). SR-IOV is an optional capability that can be used with PCIe 1.1, 2.0, and 3.0 configurations.
Dual-mode operation
Each instance of the core can be configured as an Endpoint (EP) or Root Complex (RC).
Power management
The core supports PCIe link power states L0, L0s and L1 with only the main power. With auxiliary power, it can support the L2 and L3 states.
Interrupt support
The core supports all three options for implementing interrupts in a PCIe device: Legacy, MSI and MSI-X modes. In Legacy mode, it communicates the assertion and de-assertion of interrupt conditions on the link using Assert and De-assert messages. In MSI mode, the core signals interrupts by sending MSI messages upon the occurrence of interrupt conditions; in this mode, the core supports up to 32 interrupt vectors per function, with per-vector masking. Finally, in MSI-X mode, the controller supports up to 2048 distinct interrupt vectors per function with per-vector masking.
Credit Management
The core performs all the link-layer credit management functions defined in the PCIe specifications. All credit parameters are configurable.
Configurable Flow-Control Updates
The core allows flow control updates from its receive side to be scheduled in a flexible manner, enabling the user to make tradeoffs between credit update frequency and its bandwidth overhead. Configurable registers control the scheduling of flow-control update DLLPs.
Replay Buffer
The Controller IP incorporates fully configurable link-layer replay buffers for each link, designed for low latency and area. The core can maintain replay state for a configurable number of outstanding packets.
Host Interface
The datapath on the host interface is configurable to be 32, 64, 128 or 256 bits. It may be an AXI or Host Application Layer (HAL) interface.

If we take a more in-depth look at this PCIe gen-3 Controller, we see that Cadence has based the architecture on a 128-bit datapath. This means, for 8-lane PCIe running at 8 GT/s, that the core is clocked at 500 MHz (simply calculate 8 × 8000 / 128), which probably requires technology nodes below 65 nm, as the core is, I guess, in the 500K-gate range; this technology selection is consistent with the supercomputing and networking markets, and rather high-end storage. Another remark we can make is that Cadence, as far as we can see, does not provide the PHY Interface for PCI Express (PIPE) function. This version of the PIPE can run at 500 MHz for a 16-bit interface (or 250 MHz for a 32-bit one), and Cadence decided on the 16-bit (per lane) Controller interface, allowing the PIPE and the Controller to stay in the same clock domain.
Apparently Cadence lets the PHY IP supplier take care of the PIPE 3.0 design, which makes sense as it may be necessary to position the PIPE carefully with respect to the PHY, in terms of chip topology.
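The 500 MHz figure can be reproduced directly: 8 lanes at 8 GT/s deliver 64 Gb/s into a 128-bit datapath. A quick check (hypothetical helper, not part of Cadence’s IP; the 128b/130b coding overhead is ignored, as in the article’s simple calculation):

```python
# Core clock needed for a PCIe controller datapath:
# lanes * rate_per_lane / datapath_width (GT/s converted to Mb/s).

def core_clock_mhz(lanes, gt_per_s, datapath_bits):
    return lanes * gt_per_s * 1000 / datapath_bits

print(core_clock_mhz(8, 8, 128))    # 500.0 MHz for gen-3 x8 on 128 bits
print(core_clock_mhz(16, 8, 256))   # 500.0 MHz for x16 on a 256-bit datapath
```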

Offering the PCIe 3.0 VIP is a must, as Cadence is clearly and strongly positioned in verification. If you consider the “size” of the PCIe 3.0 specification, running to more than 1000 pages and offering many new features compared with PCIe 2.0, plus the latest engineering change notices (ECNs) such as ID-based Ordering, Re-sizable BARs, Atomic Operations, Transaction Processing Hints, Optimized Buffer Flush/Fill, Latency Tolerance Reporting and Dynamic Power Allocation, running a verification campaign on such an emerging product is a “must do”. Denali has always been very active in PCIe VIP, so Cadence has probably benefited from Denali’s long experience (seven years or so) with this protocol.

Eric Esteve