First low-power webinar: Ultra-low-power Methodology
by Paul McLellan on 07-13-2011 at 12:10 pm

The first of the low power webinars is coming up on July 19th at 11am Pacific time. The webinar will be conducted by Preeti Gupta, Sr. Technical Marketing Manager at Apache Design Solutions. Preeti has 10 years of experience in the exciting world of CMOS power. She has a Master's in Electrical Engineering from the Indian Institute of Technology, New Delhi, India.

Meeting the power budget and reducing operational and/or stand-by power requires a methodology that establishes power as a design target during the micro-architecture and RTL design process, not something that can be left until the end of the design cycle. Apache’s analysis-driven reduction techniques allow designers to explore different power saving modes. Once RTL optimization is completed and a synthesized netlist is available, designers can run layout-based power integrity to qualify the success of RTL stage optimizations, ensuring that the voltage drop in the chip is contained. Apache’s Ultra-Low-Power Methodology enables successful design and delivery of low-power chips by offering a comprehensive flow that spans the entire design process.

More details on the webinars here.

Register to attend here (and don’t forget to select semiwiki.com in the “How did you hear about it?” box).


And it’s Intel at 22nm but wait, Samsung slips ahead by 2nm…
by Paul McLellan on 07-12-2011 at 12:46 pm

Another announcement of interest, given all the discussion of Intel’s 22nm process around here, is that Samsung (along with ARM, Cadence and Synopsys) announced that they have taped out a 20nm ARM test-chip (using a Synopsys/Cadence flow).

An interesting wrinkle is that at 32nm and 28nm they used a gate-first process, but for 20nm they have switched to gate-last. Of course, taping out a chip is different from having manufactured one and getting it to yield well. There have been numerous problems with many of the novel process steps in technology nodes below 30nm.

The chip contains an ARM Cortex-M0 along with custom memories and, obviously, various test structures.

It is interesting to look at Intel vs Samsung's semiconductor revenues (thanks Nitin!). In 2010 Intel was at $40B and Samsung was at $28B. But Samsung grew at 60% versus "only" 25% for Intel. Another couple of years of that and Samsung will take Intel's crown as the #1 semiconductor manufacturer.
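A quick back-of-the-envelope projection (a rough sketch that simply assumes the 2010 revenues and growth rates above continue unchanged) shows where that "couple of years" comes from:

```python
# Rough projection: assumes the 2010 revenues and growth rates quoted above
# (Intel $40B growing 25%/yr, Samsung $28B growing 60%/yr) simply persist.
intel, samsung = 40.0, 28.0            # 2010 semiconductor revenue, $B
intel_growth, samsung_growth = 1.25, 1.60

for year in range(2011, 2014):
    intel *= intel_growth
    samsung *= samsung_growth
    print(f"{year}: Intel ${intel:.1f}B, Samsung ${samsung:.1f}B")

# 2011: Intel $50.0B, Samsung $44.8B
# 2012: Intel $62.5B, Samsung $71.7B   <- Samsung moves ahead in about two years
# 2013: Intel $78.1B, Samsung $114.7B
```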

As I’ve said before, Intel needs to get products in the fast-growing mobile markets, and I’m still not convinced that Atom’s advantages (Windows compatibility) really matter. Of course Intel’s process may be enough to make it competitive but that depends on whether Intel’s wafers are cheap enough.


Cadence acquires Azuro
by Paul McLellan on 07-12-2011 at 12:20 pm

Cadence this morning announced that it has acquired Azuro. Azuro has become a leader in building clock trees for high-performance SoCs. A good rule of thumb is that the clock consumes 30% of the power in an SoC, so optimizing it is really important. Terms were not disclosed.

The clock trees involve clock gating, which can reduce clock tree power by 30% (and thus overall chip power by 10%). They can also improve performance of the clock tree by reducing skew, and thus improve overall clock frequency by up to 10%. And all while reducing the area of the clock tree by as much as 30%.
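A one-line sanity check on those numbers, using the 30% rule of thumb from the previous paragraph:

```python
# Back-of-the-envelope check: if the clock tree is ~30% of total chip power
# (rule of thumb above) and clock gating cuts clock-tree power by ~30%,
# the chip-level saving is roughly 9-10%, matching the figure quoted above.
clock_fraction_of_chip = 0.30
clock_tree_power_saving = 0.30

chip_power_saving = clock_fraction_of_chip * clock_tree_power_saving
print(f"Overall chip power saving: {chip_power_saving:.0%}")   # 9%
```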

By reputation, Azuro’s technology is much better than the clock synthesis that comes for “free” in any of the major place and route systems. How easy it will continue to be to use, say, Synopsys for place and route while using Azuro’s ccopt (clock concurrent optimization technology) remains to be seen.


On-chip supercomputers, AMBA 4, Coore’s law
by Paul McLellan on 07-11-2011 at 12:45 pm

At DAC I talked with Mike Dimelow of ARM about the latest upcoming revision to the AMBA bus standards, AMBA 4. The standard gets an upgrade about every 5 years. The original ARM in 1992 ran at 10 MIPS with a 20MHz clock. The first AMBA bus was a standard way to link the processor to memories (through the ARM system bus, ASB) and to peripherals (through the ARM peripheral bus, APB). Next year ARM-based chips will run at 2.5GHz and deliver 7000 MIPS.

Eric’s story of Thomson-CSF’s attempt to build a processor of this type of performance in 1987 points out that in those days that would have qualified as a supercomputer.

The latest AMBA standard proposal actually steals a lot of ideas from the supercomputer world. One of the biggest problems with multi-core computing once you get a lot of cores is that each core has its own cache, and when the same memory line is cached in more than one place the copies need to be kept coherent. The simplest way to do this, which works fine for a small number of cores, is to keep the line in only one cache and invalidate it in all the others. Each cache monitors the address lines for any writes and invalidates its own copy, a technique known as snooping. As the number of cores creeps up this becomes unwieldy and is a major performance hit, as more and more memory accesses turn out to be to invalidated lines and therefore require an off-chip memory access (or perhaps another level of cache, but much slower either way). The problem is further compounded by peripherals, such as graphics processors, that access memory too.

The more complex solution is to make sure that the caches are always coherent. When a cache line is written, if it is also in other caches then these are updated too, a procedure known as snarfing. The overall goal is to do everything possible to avoid needing to make an off-chip memory reference, which is extremely slow in comparison to a cache-hit and consumes a lot more power.
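As a rough illustration of the difference between the two policies, here is a toy model of a line held in several private caches, contrasting invalidate-on-write (snooping) with update-on-write (snarfing). It is a minimal sketch for intuition only, not a description of how the AMBA 4 protocol is actually implemented:

```python
# Toy model of the two coherence policies described above: several private
# caches hold copies of the same line, and a write either invalidates the
# other copies (snoop/invalidate) or pushes the new value into them
# (snarf/update). Illustrative only; real protocols track far more state.

class Cache:
    def __init__(self, name):
        self.name = name
        self.line = None            # cached copy of the shared line, or None

def write_invalidate(caches, writer, value):
    """Snooping: the writer keeps the line, every other copy is invalidated."""
    writer.line = value
    for c in caches:
        if c is not writer:
            c.line = None           # next read by that core misses -> slow refill

def write_update(caches, writer, value):
    """Snarfing: the new value is broadcast into every cached copy."""
    for c in caches:
        c.line = value              # all copies stay valid and coherent

caches = [Cache(f"core{i}") for i in range(4)]
for c in caches:
    c.line = 42                     # everyone starts with a valid copy

write_invalidate(caches, caches[0], 99)
print([c.line for c in caches])     # [99, None, None, None]

for c in caches:
    c.line = 42
write_update(caches, caches[0], 99)
print([c.line for c in caches])     # [99, 99, 99, 99]
```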

The new AMBA 4 supports this. It actually supports the whole continuum of possible architectures, from non-coherent caches (better make sure that no core is writing to memory another is reading from) through to the fully coherent snooped and snarfed caches described above.

I'll ignore the elephant in the room of how you program these beasts when you have a large number of cores. Remember what I call Coore's law: Moore's law means the number of cores on a chip is doubling every couple of years, it's just not obvious yet since we're still on the flat part of the curve.

The other big hardware issue is power. On a modern SoC with heterogeneous cores and specialized bits of hardware, power can often be reduced by having a special mini-core. For example, it is much more power-efficient to use a little Tensilica core for MP3 playback than to use the main ARM processor, even though it is "just software" and so the ARM can perfectly well do it. Only one of the cores is used at a time: if no MP3 is playing the Tensilica core is powered down; if MP3 is playing then the ARM is (mostly) idle.

However, when you get to symmetrical multiprocessing, there is no point in powering down one core in order to use another: they are the same core, so if you didn't need the core then don't put it on the chip. If you have 64 cores on a chip then the only point of doing so is that at times you want to run all 64 cores at once. And the big question, to which I've not seen entirely convincing answers, is whether you can afford, power-wise, to light up all those cores simultaneously. Or is there a power limit on how many cores we can have (unless we lower their performance, which is almost the same thing as reducing the number of cores)?
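To make the question concrete, here is a crude power-budget sketch with purely hypothetical per-core and package numbers (they are not taken from any real chip):

```python
# Hypothetical numbers purely to illustrate the "can you light them all up"
# question above; they are not measurements of any real design.
cores = 64
watts_per_core = 0.5        # assumed active power per core at full speed
package_budget_w = 10.0     # assumed power the package/cooling can handle

total_w = cores * watts_per_core
print(f"All cores at full speed: {total_w:.0f} W vs a {package_budget_w:.0f} W budget")

if total_w > package_budget_w:
    # Either run only a subset of cores at once, or run all of them slower,
    # which (as noted above) is almost the same thing as having fewer cores.
    full_speed_cores = int(package_budget_w // watts_per_core)
    print(f"Only ~{full_speed_cores} cores fit in the budget at full speed")
```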

The AMBA 4 specification can be downloaded here.



Design for test at RTL
by Paul McLellan on 07-10-2011 at 3:09 pm

Design for test (DFT) imposes various restrictions on the design so that the test automation tools (automatic test pattern generation approaches such as scan, as well as built-in self-test approaches) will subsequently be able to generate the test program. For example, different test approaches impose constraints on clock generation and distribution, on the use of asynchronous signals such as resets, on memory controls and so on. If all the rules are correctly addressed then the test program should be complete and efficient in terms of tester time.

The big challenge is that most of these restrictions, and most of the analysis tools around, work on the post-synthesis netlist. There are two big problems with this. The first is that if changes are required to the netlist, it is often difficult to work out how to change the RTL to get the "desired" change in the netlist; often the only fix is to update the netlist directly and accept that the RTL, the canonical design representation, is no longer accurate. The second is that the changes come very late in the design cycle and, as with all such unanticipated changes, can disrupt schedules and perhaps even performance.

Paradoxically, most of these changes would have been simple to make had the DFT rule-checking been performed on the RTL instead of the netlist. The changes can then be implemented at the correct level in the design, and earlier in the design cycle, when they can be planned for and so are not disruptive.

The challenge is that many of the DFT rules relate to specific architectures for scan chains, but at the RTL level the scan chains have not yet been instantiated and will not be until much later in the design cycle. So to be useful, an RTL approach needs to first infer which chains will exist, without the expensive process of instantiating them and hooking them all up. A second problem is that the various test modes alter how the clocks are generated and distributed (most obviously to the various scan chains). And a third issue is that test tools such as fault simulators require completely predictable circuit behavior: bus contention or race conditions can create non-deterministic behavior, which must be avoided. None of these three problems can be addressed by a simple topological analysis of the RTL.


Instead a look-ahead architecture is required that predicts how the suite of test tools will behave and can then check that all the rules will pass. This can be done using a very fast synthesis to produce a generic hierarchical netlist, but with enough fidelity to allow checks such as latch detection. The netlist can then be flattened for fast checking of topological rules like combinational loop detection. This approach allows DFT rule-checking to be done even before the block or design runs through synthesis, including accurately estimating the test coverage and so avoiding a scramble late in the design cycle to improve inadequate coverage.
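To make one of those topological rules concrete: combinational loop detection on the flattened netlist is just cycle detection over a directed graph of combinational dependencies. The sketch below assumes that graph has already been produced by the fast generic synthesis step described above (that is the hard part); the check itself is then a straightforward depth-first search:

```python
# Minimal sketch of one flattened-netlist check: combinational loop detection
# as cycle detection over a directed graph, represented here as a dict mapping
# each net to the nets it combinationally drives. Building this graph from RTL
# is the job of the fast generic synthesis step described above.

def find_combinational_loop(drives):
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on DFS stack / finished
    color = {}
    path = []

    def dfs(net):
        color[net] = GREY
        path.append(net)
        for nxt in drives.get(net, ()):
            if color.get(nxt, WHITE) == GREY:     # back edge -> combinational loop
                return path[path.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                loop = dfs(nxt)
                if loop:
                    return loop
        path.pop()
        color[net] = BLACK
        return None

    for net in list(drives):
        if color.get(net, WHITE) == WHITE:
            loop = dfs(net)
            if loop:
                return loop
    return None

# a -> b -> c -> a is a combinational loop; d -> e is not
netlist = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["e"], "e": []}
print(find_combinational_loop(netlist))   # ['a', 'b', 'c', 'a']
```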

The Atrenta SpyGlass-DFT white paper can be downloaded here.


Intel Twisting ARM?
by Daniel Nenni on 07-10-2011 at 11:00 am

Intel's new Tri-Gate technology is causing quite a stir on the stock chat groups. Some have even said that if Intel uses its Tri-Gate technology only on Intel processors, ARM will be in deep, deep trouble. These guys are "Intel Longs" of course, and they are battling "Intel Shorts" with cut-and-paste news clips.

“ARM is in trouble & this is why. Future smartphones will require more & more capability/features/functions. That’s just the way it is. ARM is great at performance/power/other specs based on today’s capabilities. But, when the architecture gets stretched, all bets are off. We’re starting to see that today with certain benchmarks. Intel’s architecture will be far superior in the long run because they own the end-to-end (design to manufacturing), it will be scalable, it will be affordable, etc. The analysts are too dumb to understand this yet. They will in less than a year’s time though.” backbay_bston

I don't own any of these stocks so I'm financially neutral, but clearly I'm very suspicious of Intel's Tri-Gate claims, as I blogged in TSMC Versus Intel: The Race to Semiconductors in 3D! That blog got me an invitation to the Intel RNB (Robert Noyce Building) to meet with one of their manufacturing guys and talk about Tri-Gate. I spend a lot of time in Asia and have seen the horrors of 40nm statistical process variation (yield). More recently I have seen a near perfect implementation of 28nm HKMG, but I promise you I'm going into this meeting with an open mind and an Intel-powered laptop.

In preparation for my technical deep dive on Tri-Gate technology at RNB I need to come up with good questions so I will look smart. I could really use your help with this; here is what I have so far:

On the manufacturing side:

  • What is the difference between Tri-Gate and bulk CMOS HKMG?
  • Additional processing steps?
  • How many more masks/layers?
  • Special manufacturing equipment?

    On the design side:

  • Spice Models: There are no “standards” for multi-gate Spice models — the Compact Model Council has not really made adoption of MG models a priority… What did Intel use for device models and circuit simulation? An approach internal to Intel? (Most of the modeling research published in technical journals to date has been for a single fin.)
  • Layout Dependent Effects: For several generations of planar technologies, the influence of Layout Dependent Effects has continued to increase — what are the LDE in a Tri-Gate technology? For example, for six device fins in parallel, do the fins on the outer edges behave differently than the middle fins? Or, is the volume of the fin so small that adjacent layout structures have little influence on the device current? (If LDE is less of an issue with Tri-Gates, that would be a major turning point in CAD tools and flows.) Restricted design rules?
  • Custom parasitic extraction with Tri-Gate is very challenging! There are unique device parasitics associated with Tri-Gates — the input gate resistance is more intricate due to the 3D topology over and between fins, and the parasitic gate-to-drain and gate-to-source capacitances are likewise more involved. What approach did Intel take toward parasitic extraction? (Were “standard” multiple-fin device combinations chosen to simplify the task of (custom) parasitic extraction?)
  • Why 6 and 2? Intel appears to have “standardized” on offering two design choices — six FinFETs in parallel and two in parallel — what were the considerations that went into this choice? (also, see #3)
  • AMS design impact of Tri-Gate: Analog/mixed-signal designs depend on the diodes and resistors that are available in planar technology — what circuit methodology changes did the AMS design teams have to make? Did Intel ever consider offering mixed Tri-Gate and planar devices on the same die?
  • MultiVt Device Options and Circuit Optimization: Tri-Gate does not offer the custom circuit designer as much freedom in design optimization, due to the quantization of the device width in increments of additional fins… what changes did Intel make to their circuit optimization methods? What device Vt and gate length options are available to designers for optimization?
  • Thermal Characteristics: What additional thermal heat transfer issues are present, due to the power dissipation in the small volume of the fin?
  • Tri-Gate vs. Dual-Gate FinFETs: Was this comparison done? Why did Intel choose a “tri-gate” device, rather than a “dual-gate” device (with a thicker, non-contributing oxide on top of the fin)? (Tri-Gate devices are reported to have worse leakage current behavior at the top corners of the fin.)
  • Statistical Process Variation: How will it be addressed? What are the major contributors to statistical process variation with FinFET fabrication?
  • Fin Dimensions: The fin height, fin thickness, and spacing between fins are key manufacturing parameters toward achieving a high circuit density — what criteria did Intel use in optimizing the Tri-Gate device dimensions?

    Let me know what else interests you about Intel’s new Tri-Gate technology. Clearly the design side questions are for the people who believe Intel is a foundry.

    Tri-Gate technology certainly could be a game changer, especially for AMD. How is AMD going to compete on processor speed using 28nm Gate-First HKMG technology? Is this a factor in AMD’s inability to attract a top CEO candidate?



    For those of you who have not met me before, here is a recent mug shot.
    Not only do I have a hot wife half my age but look at the size of my head. You can only imagine how smart I am. Plus I drive a Porsche. Cool AND smart, absolutely.


    Low Power Webinar Series
    by Paul McLellan on 07-08-2011 at 4:57 pm

    At DAC 2011 in San Diego, Apache gave many product presentations. Of course not everyone could make DAC or could make all the presentations in which they were interested. So from mid-July until mid-August these presentations will be given as webinars. Details, and links for registration, are here on the Apache website.

    The webinars are listed below. All run from 11am to 12pm PDT.

    • Ultra-low-power methodology, July 19th
    • IP integration methodology, July 21st
    • PowerArtist: RTL power analysis, reduction and debug, July 26th
    • RedHawk: SoC power integrity and sign-off for 28nm design, July 28th
    • Totem: analog/mixed-signal power noise and reliability, August 2nd
    • PathFinder: full-chip ESD integrity and macro-level dynamic ESD, August 4th
    • Chip-Package-System (CPS) convergence solution, August 9th
    • Sentinel: PSI IC-package power and signal integrity solution, August 11th

    Once Upon A Time… ASIC designers developed IC for Supercomputer in the 80’s
    by Eric Esteve on 07-07-2011 at 10:41 am

    Last weekend I had the pleasant surprise of meeting one of my oldest friends, Eric, who reminded me of the old days when we were working together as ASIC designers on… a supercomputer project.

    This was in France, at a French company (Thomson CSF) active in the military segment and able to spend what was at the time a fortune ($25M) to develop a supercomputer from scratch. And when I say from scratch, I mean that we had to invent almost everything except the ASIC design methodology and the EDA tools, both provided by VLSI Technology Inc. To be honest, we were very lucky that a French solution (like Matra Harris Semiconductor or Thomson Composant Speciaux) had not been chosen, which could have happened for obscure political reasons. We had in our hands what was considered the Rolls-Royce of ASIC design in 1987: the whole design team was equipped with SUN workstations, and the design tool set from VLSI was really user friendly… except that it was missing a synthesis tool. But none of us knew Synopsys, then an obscure start-up, so we were pretty happy to start. Just for your information, I will describe the kind of work done by a two-engineer team over an 18-month period.

    Just a word about the project itself. The supercomputer's chief architect was a talented university professor, but this was his first contact with the industrial world. He had defined the machine architecture around three main areas: the CPU boards (based on off-the-shelf Weitek Abacus CPU chips), the FIFO-based interconnect network and the memory area, as well as six different ASIC devices. It was a "superscalar" architecture. The task Eric and I were assigned was to design all the functions that would be reused across the different ASICs: the FIFOs, the test functions and the clock distribution inside the chips.

    The first was the easiest, as we only had to define the specification of a FIFO compiler; the compiler itself, a full-custom design at transistor level, would be subcontracted to VLSI Technology. We just took pen and paper and defined the memory cell, transistor by transistor, and the FIFO behavior… in writing. No simulations (SPICE was not part of our EDA package), just discussions with Michel Gigluielmetti, our interface at VLSI. VLSI was in charge of the compiler design and model generation, since we had to be able to start designing and integrating FIFOs long before seeing any working silicon. Looking back, it was pretty risky, wasn't it?

    The test strategy was based on the newly introduced JTAG IEEE 1149.1 "Standard Test Access Port and Boundary-Scan Architecture". This part was not that difficult, as everything was defined in the standard.

    Then we looked at the clock distribution. Remember, this was 1987: there was no "clock distribution macro" we could use. The clock was a magic signal, running at 20 MHz (such a high speed!), that designers used to run their simulations, a perfect signal with no skew… How do you manage clock distribution in a chip containing 2,000 or 3,000 flip-flops? Starting inside the chip, we then thought about inter-chip communication, wrote down a couple of equations… and discovered that we had a real issue! How would the entire system work, with chips communicating from board to board, possibly located a meter apart? What about the flight time through the interconnect? And so on… The most surprising thing is that we (two beginners) discovered this issue, while none of the seasoned engineers working at a higher management level had even thought about it!

    So Eric and I sent a note, copying all of management, to raise the issue. Then started one of the most amazing, creative times, after the project leader decided to assign us full time to the clock distribution within the machine, inside and outside the ASICs. First, define the basic equations:

    When you send data from a "slow" device, you have to comply with:

    Temission_slow + Tinterconnect + Tsetup + Skew < Clock_cycle

    But, when the emitter is “fast”, the equation becomes:

    Temission_fast + Tinterconnect > Thold + Skew

    It was at this step that we discovered that two identical ASIC devices could exhibit a 1-to-3 variation once voltage, process and temperature induced variations were taken into account! Our managers had guessed it would be 10 or 20%… So we defined the clock distribution in the machine, selecting external buffers as fast as possible (for the minimum transit time, the specification was… 0ns), trying to minimize the impact of the buffering. But in doing so, we realized it could not work in every case, even if we increased the clock period (and decreased the frequency, which is not really what you want to do when you design a supercomputer…). With the help of VLSI Technology, we defined a kind of delay-locked loop (DLL) so that each ASIC could self-calibrate (a fast device would delay the time at which data was emitted, to guarantee the hold time). We also defined different phases within the clock period, so we could artificially enlarge the clock cycle and receive the data with no set-up problem. In other words, we had to reinvent the wheel, even though I am sure the designers at Cray Research had done it before us! When I see the size of the teams working today on a single device (an OMAP5 or equivalent), I think we were very lucky to discover ASIC design in such a way.
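    To make those two inequalities concrete, here is a small check with made-up numbers (they are not the actual Thomson-CSF figures) showing how a 1-to-3 fast/slow spread breaks the hold constraint even while set-up is comfortably met, which is exactly the case the self-calibrating DLL was introduced to fix:

```python
# Illustrative check of the two timing inequalities above, with made-up
# numbers in ns; they are chosen only to show the effect of a 1-to-3
# fast/slow spread and are not the actual Thomson-CSF machine figures.
clock_cycle = 50.0                    # 20 MHz
t_setup, t_hold, skew = 3.0, 2.0, 4.0

t_emit_slow = 15.0                    # slow process/voltage/temperature corner
t_emit_fast = t_emit_slow / 3.0       # the ~1-to-3 spread described above

t_ic_long = 6.0                       # board-to-board flight time (worst for set-up)
t_ic_short = 1.0                      # neighbouring chips (worst for hold)

setup_ok = t_emit_slow + t_ic_long + t_setup + skew < clock_cycle
hold_ok = t_emit_fast + t_ic_short > t_hold + skew

print(f"set-up met: {setup_ok}")      # 15 + 6 + 3 + 4 = 28 < 50  -> True
print(f"hold met:   {hold_ok}")       # 5 + 1 = 6 > 2 + 4 = 6     -> False
```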

    Because every story has a moral, I must say that the project was abruptly ended by Thomson CSF top management when it became clear that the machine would never work, at least not at 20 MHz, the official reason being Weitek's difficulties in shipping the CPU. The engineering manager of the software team then moved to another Thomson subsidiary, in charge of developing tools for the stock market. Eric moved to Australia for a year to learn to surf; by the way, he is still living there! As for me, I stayed in ASIC design, doing chips for aircraft engines and analogue simulation for the TGV, and finally became an ASIC FAE for TI, where my largest customer was the Advanced Computer Research Institute (ACRI), which was designing… a supercomputer! But that is another story…
    By Eric Esteve


    TSMC Financial Status Plus OIP Update!
    by Daniel Nenni on 07-05-2011 at 8:00 am

    Interesting notes from my most recent Taiwan trip: Taiwan unemployment is at a record low. Scooters once again fill the streets of Hsinchu! TSMC will be passing out record bonuses to a record number of people. TSMC fab expansions are ahead of schedule. The new Fab 15 in Taichung went up amazingly fast, with equipment moving in later this year. When was the last time you saw a fab built ahead of schedule and under budget? Simply amazing! Taiwan is also ready to overtake Japan as the world's largest semiconductor materials market: the Taiwan market grew from $6.9 billion in 2009 to an estimated $9.1 billion in 2010, more than 30% growth. Go Taiwan!

    The Motley Fool did a nice TSMC financial article with pretty pictures. I like pretty pictures. The bottom line is that not only is TSMC the largest semiconductor foundry, it is also the most profitable. The important point here is margins. Margins translate into pricing flexibility as supply outpaces demand, which is coming, believe it! Semiconductor manufacturing capacity utilization today is running at 90%+ in most segments. With all the new fab space coming online from TSMC, Samsung, Intel, and GlobalFoundries in 2012 it may be a different story. Either way TSMC wins.

    Unfortunately Motley Fool does not know semiconductors as they listed NVIDIA and LDK Solar as industry peers/competitors! DOH! One of the most amusing things I do for money is consult with Wall Street types and explain exactly what the semiconductor market is and who the real players are. I also slip in some EDA and Semi IP information whenever possible. Even with the recent acquisitions, Wall Street simply does not care about EDA, but I digress.

    The one semi-relevant example Motley Fool uses is the number four foundry, SMIC. TSMC's gross margin is 49.6% versus SMIC's 20.8%. UMC, the number two foundry, is at 27.5%. GlobalFoundries' financials are private but I will see what I can find out. Intel and Samsung will never tell foundry capacity or margin numbers so I shouldn't even be mentioning them in the same paragraph as the real foundries.

    Coming this fall from TSMC is the new and improved Open Innovation Platform Ecosystem Forum. TSMC is preparing a massive design ecosystem event on Tuesday, October 18th at the San Jose Convention Center. A call for papers has already gone out; 18 papers will be presented to an open forum of industry executives from TSMC, ecosystem partners, and customers. This is a DO NOT MISS event! There will be focused breakout sessions on all manner of design issues AND a pavilion with around 80 TSMC Design Ecosystem partners showing their wares. Plus, I will be there (free food), such a deal. The food is always good at TSMC events!

    The Open Innovation Platform® is the substantiation of TSMC’s Open Innovation model that brings together the thinking of customers and partners under the common goal of shortening design time, minimizing time-to-volume and speeding time-to-market, and ultimately time-to-money.

    No doubt this event will be sold out. Follow SemiWiki.com for TSMC OIP updates coming soon.



    Two More Transistor-Level Companies at DAC
    by Daniel Payne on 07-02-2011 at 8:38 pm

    In my rush on Wednesday at DAC I almost overlooked the last two companies I talked with: Invarian and AnaGlobe. For these last two I had hand-written notes on paper, so I just got to the bottom of my inbox tonight to write up the final trip reports.

    Invarian
    Jens Andersen and Vladimir Schellbach gave me an overview of tools that perform temperature, package and analog layout analysis:

    • Models actual component temperature
    • Identifies electromigration
    • Finds hotspots
    • Solves full 3D heat transfer equation
    • Accounts for block layout impact
    • Accounts for power dissipation

    The Invarian tool, named InVar, works with a SPICE simulator like Berkeley Design Automation's Analog FastSPICE. They analyze both analog and digital design flows. The only other competitor in this space would be Gradient DA.

    Summary
    Watch this startup: even with under a dozen people in Moscow and Silicon Valley, they have an interesting focus on temperature variation that the big four in EDA haven't started serving yet. Their IR drop and EM analysis have plenty of competitors.

    AnaGlobe
    How would you load an IC layout that was 180GB in size? At AnaGlobe they use the Thunder chip assembly tool and get the design loaded in under two hours. Yan Lin gave me a quick overview of their tools.

    GOLF is a new PCell design environment based on OA.

    PLAG is another OA tool for flat panel layout.

    Summary
    AnaGlobe is certainly a technology leader for large IC database assembly. Their GOLF tool competes with Virtuoso, Ciranova and SpringSoft. PLAG looks to have little competition. Big name design companies use AnaGlobe tools: Nvidia, Marvell, SMIC, AMCC.