
Physical Verification of 3D-IC Designs using TSVs
by Daniel Payne on 11-12-2011 at 10:36 am

3D-IC design has become a popular discussion topic in the past few years because of the integration benefits and potential cost savings, so I wanted to learn more about how the DRC and LVS flows were being adapted. My first stop was the Global Semiconductor Alliance website, where I found a presentation about how Mentor Graphics extended the DRC and LVS flows in Calibre to handle TSV (Through-Silicon Via) technology. This extension is called Calibre 3DSTACK.

With TSVs, each die becomes double-sided in terms of metal interconnect, so DRC and LVS now have to verify the TSVs plus the front-side and back-side metal layers.

The new 3DSTACK configuration file controls DRC and LVS across the stacked die.

A second source I read, at SOC IP, provided more details about the configuration file.

The rule file for the 3D stack lists each die with its order in the stack, position, rotation and orientation, along with the locations of the GDS layout files and the associated rule files and directories.
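To make that description concrete, here is a minimal, purely illustrative sketch of the kind of per-die information such a stack description carries. This is not actual Calibre 3DSTACK syntax; the field names and values are invented for illustration only.

```python
# Illustrative only: a hypothetical stack description with the kind of
# per-die information described above (not actual Calibre 3DSTACK syntax).
from dataclasses import dataclass

@dataclass
class DiePlacement:
    order: int         # position in the stack, bottom to top
    name: str          # die name
    x_um: float        # placement offset (assumed microns)
    y_um: float
    rotation_deg: int  # rotation of the die
    flipped: bool      # face-to-face vs. face-to-back orientation
    gds_file: str      # path to the die's GDS layout
    rule_file: str     # per-die DRC/LVS rule deck

stack = [
    DiePlacement(1, "logic_die",  0.0, 0.0,   0, False, "logic.gds",  "logic.rules"),
    DiePlacement(2, "memory_die", 0.0, 0.0, 180, True,  "memory.gds", "memory.rules"),
]

for die in stack:
    print(f"{die.order}: {die.name} rot={die.rotation_deg} gds={die.gds_file}")
```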

Parasitic extraction requires new information about the size and electrical properties of the microbumps, copper pillars and bonding materials.

One methodology is to first run DRC, LVS and extraction on each die separately, then add the interfaces. The interface between the stacked dies is captured in a separate GDS, and DRC/LVS checks are run against this GDS.

For connectivity checking between dies, text labels are inserted at the interface microbump locations.

With these new 3D extensions, Calibre can run DRC, LVS and extraction on the entire 3D stack. A GUI helps you visualize the 3D rules and the results from DRC, LVS and extraction.

TSMC Partner of the Year Award
Based on this extension of Calibre into the 3D realm, TSMC has just announced that Mentor was chosen as the TSMC Partner of the Year. IC designers continue to use the familiar Calibre rule decks with the added 3DSTACK technology.

Summary
Yes, 3D-IC design is a reality today where foundries and EDA companies are working together to provide tools and technology to extend 2D and 2.5D flows for DRC, LVS and extraction.



SPICE Circuit Simulation at Magma
by Daniel Payne on 11-11-2011 at 11:36 am

All four of the public EDA companies offer SPICE circuit simulation tools for use by IC designers at the transistor-level, and Magma has been offering two SPICE circuit simulators:

  • FineSIM SPICE (parallel SPICE)
  • FineSIM PRO (accelerated, parallel SPICE)

An early advantage offered by Magma was a SPICE simulator that could be run in parallel on multiple CPUs. The SPICE competitors have all since followed suit, rewriting their tools to catch up with FineSim on that feature.

I also blogged about FineSIM SPICE and FineSIM Pro in June at DAC.

When I talk to circuit designers about SPICE tools they tell me that they want:

  • Accuracy
  • Speed
  • Capacity
  • Compatibility
  • Integration
  • Value for the dollar
  • Support

The priority of these seven attributes really depends on what you are designing.

Feedback from anonymous SPICE circuit benchmarks suggests that FineSim SPICE compares favorably with Synopsys HSPICE:

  • Accuracy – about the same, qualified at TSMC for 65nm, 40nm and 28nm
  • Speed – FineSim SPICE can be 3X to 10X faster
  • Capacity – around 1.5M MOS devices, up to 30M RC elements
  • Compatibility – uses inputs: HSPICE, Spectre, Eldo, SPF, DSPF. Models: BSIM3, BSIM4. Outputs: TR0, fsdb, WDF.
  • Integration – co-simulates with Verilog, Verilog-A and VHDL
  • Value – depends on the deal you can make with your Account Manager
  • Support – excellent

Room for Improvement
Cadence, Synopsys and Mentor all have HDL simulators that support Verilog, VHDL, SystemVerilog and SystemC. These HDL simulators have been deeply integrated with their SPICE tools, letting you simulate the analog portions accurately with the SPICE engine in the context of the digital design. Magma has no Verilog or VHDL simulator and only does co-simulation, which is primitive compared to these deeper integrations built on single-kernel technology.

Memory designers use hierarchy, and FineSim Pro does offer a decent simulation capacity of 5M MOS devices, but it is not a hierarchical simulator, so you cannot simulate a hierarchical netlist with 100M or more transistors in it. Both Cadence and Synopsys offer hierarchical SPICE simulators. With FineSim Pro you have to adopt a netlist-cutting methodology to simulate just the critical portions of your hierarchical memory design.

Summary
You really have to benchmark a SPICE circuit simulator on your own designs, your models, your analysis, and your design methodology to determine if it is better than what you are currently using. This is a highly competitive area for EDA tools, and by all accounts Magma has world-class technology that works well for a wide range of transistor-level netlists: custom analog IP, large mixed-signal designs, memory design and characterization.

We've set up a Wiki page for all SPICE and Fast SPICE circuit simulators to give you a feel for which companies have tools.



Old standards never die
by Paul McLellan on 11-09-2011 at 4:14 pm

I just put up a blog about the EDA Interoperability Forum, much of which is focused on standards, which reminded me just how long-lived some standards turn out to be.

Back in the late 1970s Calma shipped workstations (actually re-badged Data General minicomputers) with a graphic display. That was how layout was done. It’s also why, before time-based licenses, EDA had a hardware business model, but that’s a story for another day. The disk wasn’t big enough to hold all the active designs, so the typical mode of operation was to keep your design on magnetic tape when you weren’t actually using the system. Plus you could use a different system next time rather than having to get back on the same system (this was pre-ethernet). The Calma system was called the graphic design system and the second generation was (surprise) labeled with a two. That tape backup format was thus called “graphic design system 2 stream format”. Or more concisely GDS2. Even today it is the most common format for moving physical layout design data between systems or to mask-makers, over 30 years later.

My favorite old standard is the cigarette lighter outlet that we all have in our cars. It was actually designed in the 1950s as a cigar lighter (well, everyone smoked cigars then I guess. Men, anyway). When eventually people wanted a power source, one way to get it was to design a plug that would take power from the cigar lighter outlet. That meant no wiring was required. That was about the only good thing about it. It is an awful design as an electrical plug and socket, with a spring-loaded pin trying to push the plug out of the socket and nothing really holding it solidly in place. Despite this, fifty years later every car has one (or several) of these and we use them to charge our cell phones (now there's a use that wasn't envisioned in the 1950s).

Even more surprising, since you could already buy a PC power-supply that ran off a cigar outlet, when they first put power outlets for laptops on planes, they (some airlines anyway) used the old cigar lighter outlet. Imagine talking to the guy in the 1950s who designed the outlet, and you told him that in 2011 you’d use that socket to power your computer on a plane. He’d have been astounded. The average person could not afford to go on a plane in those days and, as for computers, they were room-sized, far too big to put on a plane.

Talking of planes, why do we get on from the left-hand side? It's another old standard living on. Back 2000 years ago, before the invention of the stern-post rudder, ships were steered with a steering oar. For a right-hander (that's most of you: we left-handers are the 10%) that was most conveniently put on the right side of the ship, hence the steerboard or, as we say today, starboard side. The other side was the port side, the side of the ship put against the quay for loading and unloading, without the steering oar getting in the way. When planes first had passenger service, they were sea-planes, so naturally they kept the tradition. Eventually planes got wheels, and jet engines (and cigar outlets for our computers). 2000 years after the steering oar became obsolete, that standard lives on.



EDA Interoperability Forum
by Paul McLellan on 11-09-2011 at 3:06 pm

The 24th Interoperability Forum is coming up at the end of the month, on November 30th, at the Synopsys campus in Mountain View. It lasts from 9am until lunch (and yes, Virginia, there is such a thing as a free lunch). I think it looks like a very interesting way to spend a morning.

Here are the speakers and what they are speaking about:

  • Philippe Magarshack, VP of central R&D at ST, will talk about 10 years of standards. Somehow I guess SystemC and TLM may figure prominently.
  • John Goodenough, VP Design technology and automation at ARM
  • Jim Hogan, formerly of Cadence and Artisan and now a private investor, will talk about The sequel: a fistful of dollars (which I believe is on exit strategies)
  • Mark Templeton (another Artisan alumnus) and now president of Scientific Ventures will talk about Survival of the fittest and the DNA of interoperability
  • Mike Keating, a Synopsys fellow and author of the Low Power Methodology Manual, will talk about (surprise) low power: Treading water in a rising flood
  • Shay Gal-On, director of software engineering at EEMBC Technology Center, will talk on Multicore Technology: To Infinity and beyond in complexity. I firmly believe that writing software for high-count multicore processors is as big a challenge as anything that we have on the semiconductor side in the coming decades.
  • Shishpal Rawat, chair of Accellera, will talk on The evolution of standards organizations: 2025 and beyond. Gulp. What technology node are we meant to be on by then?

Following the presentations there will be a wrap up, a prize drawing (let me guess, an iPad2) then lunch and networking.

I’ll see you there.

To register, go here.


Synopsys Awarded TSMC’s Interface IP Partner of the Year
by Eric Esteve on 11-09-2011 at 9:19 am

Is it surprising to see that Synopsys has been selected Interface IP partner of the year by TSMC? Not really, as the company is the clear leader in this IP market segment (which includes USB, PCI Express, SATA, DDRn, HDMI, MIPI and other protocols like Ethernet, DisplayPort, HyperTransport, InfiniBand, Serial RapidIO…). But, looking five years back (to 2006), Synopsys was competing with Rambus (no longer active in this type of business), ARM (still present, but not very involved), and a bunch of "defunct" companies like ChipIdea (bought by MIPS in 2007, then sold to Synopsys in 2009) and Virage Logic (acquired by Synopsys in 2010). At that time, the Interface IP market was worth $205M (according to Gartner) and Synopsys had a decent 25% market share. Since then, the growth has been sustained (see the picture showing the market evolution for USB, PCIe, DDRn, SATA and HDMI) and in 2010 Synopsys enjoyed a market share of… be patient, I will disclose the figure later in this document!

What we can see in the above picture is the negative impact of the Q4 2008 through Q3 2009 recession on the growth rate of every segment except DDRn memory controllers. Even though the market recovered in 2010, we should only return to 20-30% growth rates in 2011. What will happen in 2012 depends, as always, on the health of the global economy. Assuming no catastrophic event, the 2010/2011 growth should continue, and the interface IP market should reach a $350M level in 2012, or be 58% larger than in 2009 (a 17% CAGR over these 3 years).
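A quick sanity check of that growth arithmetic, using only the figures quoted above (the roughly $220M 2009 base is implied by them, not stated explicitly):

$$\frac{\$350\text{M}}{1.58} \approx \$221\text{M (implied 2009 base)}, \qquad 1.58^{1/3} - 1 \approx 0.165 \approx 17\%\ \text{CAGR over 2009--2012}.$$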

The reasons for growth are well known (at least to those who read SemiWiki frequently!): the massive move from parallel I/Os to high-speed serial, and the ever-increasing need for more bandwidth, not only in networking but also in the PC, PC peripheral, wireless and consumer electronics segments – just because we (the end users) exchange more data through email and social media, and watch movies or listen to music on various new electronic systems. Also, these protocol standards are not becoming commoditized (which would badly hurt Interface IP prices), as the various organizations behind SATA, USB, PCIe and DDRn, to name the most important, keep releasing new protocol versions (PCIe gen-3, USB 3.0, SATA 6G, DDR4) that help keep IP selling prices high. For the mature protocols, the chip makers expect the IP vendors to port the PHY (the physical, technology-dependent part) to the latest technology node (40 or 28 nm), which again helps keep prices in the high range (half a million dollars or so). Thus the market growth will continue, at least for the next three to four years. IPnest has built a forecast dedicated to these Interface IP segments, out to 2015, and we expect to see sustained growth, with the market climbing to the $400M to $450M range (don't expect IPNEST to release a three-digit-precision forecast; that would simply be unscientific!)…

But what about Synopsys' position? Our latest market evaluation (one week old), integrated into the "Interface IP Survey 2005-2010 – Forecast 2011-2015", shows that in 2010 Synopsys not only kept the leader position but consolidated it, moving from a 25% market share in 2006 to a 40%+ share in 2010. Even more impressive, the company holds at least a 50% market share (sometimes more than 80%) in the segments where it plays, namely USB, PCI Express, SATA and DDRn, with the exception of HDMI, where Silicon Image is really too strong – on a protocol they invented, that makes sense!

All of the above explains why TSMC made the right choice; any other decision would not have been rational… except maybe deciding to develop (or at least market) the Interface IP functions themselves, like the FPGA vendors are doing…

By the way, if you plan to attend IP-SoC 2011 on December 7-8 in Grenoble, don't miss the presentation I will give on the Interface IP market; see the Conference agenda.

Eric Esteve from IPNEST – Table of Contents for "Interface IP Survey 2005-2010 – Forecast 2011-2015" available here


Using Processors in the SoC Dataplane
by Paul McLellan on 11-08-2011 at 9:17 am

Almost any chip of any complexity contains a control processor of some sort. These blocks are good at executing a wide range of algorithms, but there are often two problems with them: the performance is inadequate for some applications, or the amount of power required is much too high. Control processors give up performance and burn extra power in exchange for the huge flexibility that comes from programming a general-purpose architecture. But for some applications, such as audio, video, baseband DSP, security, protocol processing and more, it is not practical just to write the algorithm in software and run it on the control processor. These functions comprise the dataplane of the SoC.

In many designs that dataplane is now the largest part of the SoC and, as a result, typically takes the longest to design and verify. The work is usually parceled out among several design groups, each designing and verifying its part of the SoC. Meanwhile the software team is trying to start software development, but this goes slowly and requires significant rework because the hardware design is not yet available.

A typical dataplane application consists of the datapaths themselves, where the main data flows, along with a complex finite-state machine (FSM) that drives the various operations of the datapath. Think of a function like an MPEG video decoder. The video data itself flows through the datapaths, and the complex algorithms that decompress that data into a raw pixel stream are driven by a complex state machine off to one side (see diagram).
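As a purely illustrative sketch of that split (a toy example of my own, not taken from any real decoder or from Tensilica's material), here is a small controlling FSM driving a simple datapath operation:

```python
# Toy "datapath + controlling FSM" split of the kind described above.
# States and operations are invented for illustration, not from a real MPEG decoder.

def datapath_transform(block, mode):
    """Toy datapath: apply a simple per-sample operation selected by the FSM."""
    if mode == "scale":
        return [x * 2 for x in block]
    if mode == "offset":
        return [x + 1 for x in block]
    return block  # passthrough

def control_fsm(stream):
    """Toy FSM: walks header/payload states and selects the datapath mode."""
    state, mode, output = "HEADER", "passthrough", []
    for block in stream:
        if state == "HEADER":
            # The header block decides how subsequent payload blocks are processed.
            mode = "scale" if block and block[0] == 1 else "offset"
            state = "PAYLOAD"
        elif state == "PAYLOAD":
            output.append(datapath_transform(block, mode))
            if not block:          # an empty block marks the end of a frame
                state = "HEADER"
    return output

print(control_fsm([[1], [10, 20], [3, 4], []]))
```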


There are two challenges with this architecture. First, RTL just isn't a very good way to describe complex data operations, and as a result it takes too long to create and verify. Second, the complex algorithmic nature of the control function is not a good match for a hard-wired implementation, since any change to tweak the algorithm, even a minor one, typically requires a complete respin of the chip.

One obvious approach would be to use the control processor to drive the dataplane, but often this does not have enough performance (for high-performance applications like router packet analysis) or consumes too much power for simple ones (like MP3 decode). In addition, the control processor is typically busy doing other things, so the only practical implementation requires adding a second control processor.

A better approach is to use a dataplane processor (DPU). This can strike the right balance between performance, power and flexibility. It is not a general-purpose processor, but it doesn't come with a general-purpose processor's overheads. It is not as hard-wired, or as expensive to develop, as dedicated RTL. The FSM can be implemented extremely efficiently while retaining a measure of programmability.

The main advantages are:

  • Flexibility. Changing the firmware changes the block's function.
  • Software-based development, with lower cost tools and ease of making last minute changes.
  • Faster and more powerful system modeling, and more complete coverage
  • Time-to-market
  • Ability to make post-silicon changes to the firmware to tweak/improve algorithms
  • DPU processor cores are pre-designed and pre-verified. Just add the firmware.

So the reasons to use a programmable processor in the dataplane of an SoC are clear: flexibility, design productivity and, typically, higher performance at lower power. Plus the capability to make changes to the SoC after tapeout without requiring a re-spin.

The Tensilica white paper on DPUs is here.



RTL Power Models
by Paul McLellan on 11-08-2011 at 8:00 am

One of the challenges of doing a design in the 28nm world is that everything depends on everything else, yet some decisions need to be made early with imperfect information. The better the information we have, the better those early decisions will be. One area of particular importance is selecting a package, designing a power network and generally putting together a power policy. Everything is power sensitive these days, not just mobile applications but anything that is going to end up inside a cloud computing server farm, such as routers, disk drives and servers. Power gating and clock gating make things worse, since they can cause very abrupt transitions as inrush current powers up a sleeping block or a large block suddenly starts to be clocked again.

But these decisions about power can't wait until physical design is complete. That is too late. And the old way of handling things with a combination of guesswork and Excel spreadsheets is too cumbersome and inaccurate. The penalties are severe. Underdesigning the power delivery network results in failures; overdesigning it results in a larger die or a more expensive package. The ideal approach would be to do the sort of analysis that is possible after physical design, but at the RTL level.

That is basically what Apache is announcing today. An RTL power model (RPM) of the chip combines three technologies:

  • Fast critical frame selection
  • PACE (PowerArtist Calibrator and Estimator)
  • RTL (pre-synthesis) power analysis

Power analysis requires vectors (well, there are some vectorless approaches, but they are not very accurate). But all vectors are not created equal. There might be tens of millions of vectors in the full verification suite, but for critical power analysis there may be just a few very short sequences of a dozen vectors that cause the maximum stress on the power network or the maximum change in current. These critical frames are the ones needed to ensure that there are no power problems due to excessive current draw or an excessively fast change that, for example, drains all the decaps.
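As a toy illustration of the idea behind critical frame selection (my own sketch, not Apache's actual algorithm): given per-cycle toggle counts from a long simulation, pick the short window with the highest total switching and the one with the sharpest ramp in activity.

```python
# Toy critical-frame selection: scan a long activity trace for the short windows
# that stress the power network most (highest total switching, sharpest di/dt ramp).

def find_critical_frames(toggles_per_cycle, window=12):
    best_sum, best_sum_start = -1, 0
    best_ramp, best_ramp_start = -1, 0
    for start in range(len(toggles_per_cycle) - window + 1):
        frame = toggles_per_cycle[start:start + window]
        total = sum(frame)              # proxy for peak power demand
        ramp = max(frame) - min(frame)  # proxy for di/dt stress
        if total > best_sum:
            best_sum, best_sum_start = total, start
        if ramp > best_ramp:
            best_ramp, best_ramp_start = ramp, start
    return best_sum_start, best_ramp_start

# Example: a mostly quiet trace with one burst of activity.
trace = [100] * 50 + [100, 900, 1500, 1400, 1600, 300] + [120] * 50
print(find_critical_frames(trace))
```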

PACE, PowerArtist Calibrator and Estimator, is a tool for analyzing a chip for a similar application and process and generating the parameters that are required to make the RTL power analysis accurate for similar designs. Obviously power depends on things like the capacitance of interconnect, the cell library used, the type of clock tree and so on. PACE characterizes this physical information allowing RTL-power based design decisions to be made with confidence.

With these two things in place, the critical frames and the data from PACE, plus, of course, the RTL, it is now possible to generate an RTL Power Model for the design. The RPM can then be used in Apache's analysis tools. How accurate is it? Early customers have found that the RPM is within 15% of the values from actual layout with full parasitics. For example, here is a comparison of package resonance frequency for just the package (green), the package with a chip power model (CPM) from layout (red), and the package with a CPM from the RPM (blue).


And another example, showing the change in resonance frequency as the size of the decoupling capacitors is changed.
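As a rough first-order model (my simplification, not part of Apache's announcement), the package inductance and the on-die decoupling capacitance form an LC tank, so adding decap pulls the resonance frequency down:

$$f_{\text{res}} \approx \frac{1}{2\pi\sqrt{L_{\text{pkg}}\,C_{\text{decap}}}}$$

so, for example, quadrupling the decoupling capacitance roughly halves the resonance frequency.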

More and more designs are basically assemblies of IP blocks, mostly RTL. The RTL is often poorly understood since it is purchased externally or created in another group overseas or comes from a previous incarnation of the design. This makes early analysis like this even more important and avoids either underdesigning the power delivery network, and risking in-field failures, or overdesigning it and being uncompetitive due to cost.


Managing Test Power for ICs
by Beth Martin on 11-07-2011 at 12:17 pm

The goal of automatic test pattern generation (ATPG) is to achieve maximum coverage with the fewest test patterns. This conflicts with the goal of managing power, because during test the IC is often operated beyond its normal functional modes to get the highest quality test results. When switching activity exceeds a device's power capability during test, it can have detrimental effects on the IC, such as collapse of the power supply, switching noise, and excessive current that could lead to joule heating and connection failure. These effects lead to false failures, and can damage the IC in ways that decrease its lifetime.

When planning for power-aware test and creating production test patterns, several techniques can be used to manage power during test. Using the techniques I outline here, scan shift switching, the largest contributor to power usage during test, can typically be reduced from 50% (the nominal level for an even distribution of 1s and 0s) to 25% with minimal impact on test time.
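One well-known lever (a generic illustration on my part, not necessarily the specific mechanism the tools described here use) is how the don't-care bits in scan patterns are filled: random fill toggles roughly half the time, while repeating the last specified bit toggles far less. A quick sketch:

```python
import random

# Generic illustration: how don't-care ("X") fill in scan patterns changes
# shift switching activity. Random fill toggles ~50% of the time; repeating
# the last care bit ("repeat fill") toggles far less.

def shift_switching(pattern):
    """Fraction of adjacent bit pairs that differ: a simple proxy for how often
    each scan cell toggles as the pattern shifts through the chain."""
    transitions = sum(1 for a, b in zip(pattern, pattern[1:]) if a != b)
    return transitions / (len(pattern) - 1)

def fill_dont_cares(pattern, strategy):
    """Replace 'X' bits with 0/1: randomly, or by repeating the last bit."""
    out, last = [], 0
    for bit in pattern:
        if bit == "X":
            bit = random.randint(0, 1) if strategy == "random" else last
        out.append(bit)
        last = bit
    return out

random.seed(0)
# A test cube where only a few bits are specified and the rest are don't-cares.
cube = [1, "X", "X", "X", 0, "X", "X", "X", "X", 1, "X", "X", "X", "X", "X", 0] * 16
print("random fill :", round(shift_switching(fill_dont_cares(cube, "random")), 2))
print("repeat fill :", round(shift_switching(fill_dont_cares(cube, "repeat")), 2))
```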

Reduce the clock frequency to allow power to dissipate and reduce the heating and average power. However, this could exacerbate a problem with instantaneous power because the circuit will settle more between the lower frequency clock pulses.

Skew the clocks such that they rise at different points within the cycle, thus reducing instantaneous power. This technique is highly dependent on the clock design, does not help with average power, and in some cases will not address localized problems.

Use a modular test approach to manage switching activity during production test by sequencing test activity and controlling power on a block-by-block basis. This method requires configuring test pattern generation so that only active blocks are considered during the ATPG process and the remaining blocks are held in a steady state.

Manage switching activity during scan test with various ATPG strategies. In standard ATPG, pattern generation targets the maximum number of faults in the minimum number of patterns. This approach leads to high levels of switching activity, usually in the early part of the test set. But if you relax the rate of fault detection and set a switching threshold as part of the initial constraints for ATPG, the fault coverage is spread throughout the entire test set, leading to lower average switching activity.

Use clock-gating to limit capture power by holding state elements that are not being used to control or observe targeted faults. ATPG controls the clock-gating logic with either scan elements or primary pins. Hierarchical clock controls can provide a finer level of control granularity while using fewer control bits.

Achievable Results
Scan shift switching can typically be cut in half with minimal impact on test time. Capture switching activity can also be significantly reduced, although the amount is highly design dependent. By leveraging features already in place at the system level to control power dissipation in functional mode, power used during production test can be effectively managed. When design-level approaches to power control are combined with ATPG and BIST techniques, you can achieve high-quality test and manage power integrity.

For more information on low power testing, download my new white paper, “Using Tessent Low Power Test to Manage Switching Activity.”

By Giri Podichetty on behalf of Mentor Graphics.


3D Transistors @ TSMC 20nm!
by Daniel Nenni on 11-06-2011 at 12:51 pm

Ever since the TSMC OIP Forum where Dr. Shang-Yi Chiang openly asked customers, “When do you want 3D transistors (FinFETs)?”, I have heard quite a few debates on the topic inside the top fabless semiconductor companies. The bottom line, in my expert opinion, is that TSMC will add FinFETs to the N20 (20nm) process node in parallel with planar transistors, and here are the reasons why:



Eliminating excess: In the next few years, traditional planar CMOS field-effect transistors will be replaced by alternate architectures that boost the gate’s control of the channel. The UTB SOI replaces the bulk silicon channel with a thin layer of silicon mounted on insulator. The FinFET turns the transistor channel on its side and wraps the gate around three sides.

The 1999 IEDM paper “Sub 50-nm FinFET: PMOS” started the 3D transistor ball rolling; then in May of 2011 Intel announced a production version of a 3D transistor (Tri-Gate) technology at 22nm. Intel is the leader in semiconductor process technologies, so you can be sure that others will follow. Intel has a nice “History of the Transistor” backgrounder in case you are interested. Probably the most comprehensive article on the subject was just published by IEEE Spectrum: “Transistor Wars: Rival architectures face off in a bid to keep Moore’s Law alive”. This is a must-read for all of us semiconductor transistor laymen.


Down and up: A cross section of UTB SOI transistors and a micrograph of an array of FinFET transistors.

Why the push to 3D transistors at 20nm?

Reason #1 is scaling. From 40nm to 28nm we saw significant opportunities for a reduction in die size and power requirements plus an increase in performance. The TSMC 28nm gate-last HKMG node will go down in history as the most profitable node ever, believe it! Unfortunately, standard planar transistors are not scaling well from 28nm to 20nm, reducing the die size/power savings and performance boost customers have come to expect from a process shrink. From what I have heard, it is half of what was expected/hoped for. As a result, TSMC will definitely offer 3D transistors at the 20nm node, probably as a mid-life node booster.



Shrinking returns: As transistors got smaller, their power demands grew. By 2001, the power that leaked through a transistor when it was off was fast approaching the amount of power needed to turn the transistor on, a warning sign for the chip industry. As these Intel data show, the leakage problem eventually put a halt to classical transistor scaling, the progression known as Dennard scaling. Switching to alternate architectures will allow chipmakers to shrink transistors again, boosting transistor density and performance.
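For reference, the textbook relations behind that trend (my addition, not taken from the Intel data referenced above):

$$P_{\text{dynamic}} \approx \alpha\, C\, V_{DD}^{2}\, f, \qquad I_{\text{leak}} \propto e^{-V_{th}/(n\,kT/q)},$$

so classical Dennard scaling relied on lowering $V_{DD}$ (and with it $V_{th}$) at every node; once subthreshold leakage, which grows exponentially as $V_{th}$ drops, became comparable to dynamic power, voltage scaling stalled, which is the halt described above.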

Reason #2 is that TSMC can, and it will offer a significant competitive advantage against the second-source foundries (UMC, GFI, SMIC). Dr. Chenming Hu is considered an expert on the subject and is currently the TSMC Distinguished Professor of Microelectronics at the University of California, Berkeley. Prior to that he was the Chief Technology Officer of TSMC. Hu coined the term FinFET 10+ years ago when he and his team built the first FinFETs and described them in the 1999 IEDM paper. The name FinFET was chosen because the transistors (technically known as Field Effect Transistors) look like fins. Hu didn’t register patents on the design or manufacturing process, in order to make it as widely available as possible, and he was confident the industry would adopt it. Well, he was absolutely right!

The push for 3D transistors clearly shows that the days of planar transistor scaling will soon be behind us. It also shows the lengths we will go to in order to continue Moore’s Law. Or, as TSMC says, “More-than-Moore” technologies.


ARM Chips Away at Intel’s Server Business!
by Ed McKernan on 11-06-2011 at 7:40 am


When Intel entered the server market in the 1990s with their Pentium processor, and follow-on Xeons beginning in 1998, they focused on simple enterprise applications. At the same time they laid the groundwork for what will turn out to be a multi-decade war to wrest control from all mainframes and workstations. With the announcements this past week by Calxeda of the first 32-bit ARM server chip and by ARM of their new 64-bit server architecture, known as the “v8 Core”, we see a similar strategy unfolding. We should not be surprised at ARM’s aggressive push into servers, but we should also recognize that the battle between ARM and Intel will play out over decades, with many new and interesting twists and alliances.

For the past year, Kirk Skaugen, the General Manager of Intel’s Datacenter and Connected Systems Group, has shown the slide to the left to describe not only the growth of the server market but how it is divided into many sub-segments. The overriding message is that Intel is going after it all, but the more important point is that the market fragmentation will cause suppliers to customize their processors for particular markets. The interesting takeaway from the Calxeda (pronounced Cal-Zeda) introduction, which can be found here, is not the fact that it is reported to use 90% less power than an Intel Xeon (or, said another way, to deliver 10X the performance at the same power); it is that they are driving a completely new vector that takes packing density into account.

How many complete servers with networking and SATA storage interfaces can you cram into a 1U server (think pizza-box size)? The Calxeda chip integrates 4 ARM cores with a DDR3 memory controller, 6 Gigabit Ethernet connections, 4 10G XAUI interfaces, SATA ports to interface up to 5 drives, and 4 PCIe 2.0 controllers. Think of it as a complete Server on Chip (another use of the acronym SOC). By tuning this chip to under 5 watts, it allows dense packing without all the large cooling solutions required with Intel Xeons running at up to 110 watts. So what are the prospects for Calxeda? Although it is only a 32-bit chip, the Calxeda solution, assuming it undercuts an Intel Atom plus its I/O chipset, could play in the entry-level office and small-business server market, handling rudimentary functions. It will also get a play in the data center, dishing out web pages that don’t handle critical transactions. The datacenter companies have two large operating cost items: the power bill and the equipment cost. Google, Facebook, Amazon and the rest of the big server buyers can’t afford not to look at the Calxeda solution and play around with it. HP recognized this, and that is why they are establishing what they call the “HP Discovery Lab” in Houston for customers to test out their applications on the new servers built with Calxeda parts. Assume that Google and the rest have teams of engineers tuning their code for ARM.

One other aspect of the Calxeda business model is interesting. The tradition with many Xeon-based servers is to buy the biggest processor and then load it with VMware for the purpose of dividing many workloads across one processor, since individual workloads rarely utilize even a fraction of a single processor. Calxeda’s model is to say instead that the user should get rid of the VMware and run a single application per processor core. So now we have two economic models fighting it out. I am sure both are suitable for various market segments. But now VMware will align more closely with Intel. Now that we have seen what Calxeda has developed, what should we expect from nVidia, AMCC, and Intel? AMCC announced they have an ARM v8 running on FPGAs. This is fine, but it is too early to tell what direction they are headed in. It is likely that nVidia has something in the labs that they will announce when they have silicon to sample. nVidia has the advantage of selling into the High Performance Computing (HPC) market today and would like to leverage that further into the datacenter. For Intel, there clearly is a need to rethink their architectures in ways that are dramatically different. For the dense server market they will need to imitate Calxeda in the integration of I/O (SATA and Ethernet).

Secondly, they are probably thinking and designing along the path of more heterogeneous processor cores (for example, dropping the x86 floating point unit in some cases – as Calxeda has done in its server chip). And finally, they no doubt will demonstrate at 22nm that large L3 caches make a huge difference in performance benchmarks, even for simple applications. Why do I say this? Intel’s ultimate strength is in process technology and the fact that they are the world’s greatest designers of SRAM caches. They go hand-in-hand, as the SRAM cache is the pipe cleaner for each new process technology. The Calxeda parts have a nice L2 cache but no L3 cache. Intel is brute-forcing large L3 caches into Xeon on the justification that any workload that stays on the processor die and avoids going off chip to DRAM saves power. As Intel builds Xeons with larger caches, it simultaneously sells customers on the value proposition that the high-ASP Xeons (as high as $4200 per part) save much more than that on their power bills. Today the highest-end Xeon has 30MB of cache; in 12 months this could be 60MB at 22nm. Intel will argue that many applications running on the Calxeda chip must go off to DRAM, whereas with Intel they stay on the processor. This is where the PR performance/watt battle is likely to be fought with benchmarks over the next year. If Calxeda and the other ARM server processors gain traction, look for Intel to offer a “Celeron” server processor strategy that drags Calxeda into a scorched-earth price war in a very narrow niche.

Meanwhile, the Intel high-end juggernaut will continue to steam ahead with more cores and bigger L3 caches. No one today can say how this will play out. Competition is good and there is a sense that more alternatives will lead to better economics. The good news is that the battle is joined and we get to live in interesting times.
