
100 Million Miles Per Hour!
by Kevin Kostiner on 10-02-2016 at 12:00 pm

Back When We Loved Discovery
As anyone who reads and follows my blog posts will know, I’m a believer in innovation. It’s what drives my passion for the Internet of Things. That interest started when I was an “Apollo” kid during the 1960’s and 1970’s. Those decades offered a very different landscape for creativity, exploration and discovery!

There was a time back in the 1960’s and 70’s when the United States and much of the world had a passion for discovery. President Kennedy’s famous challenge to put a man (person today) on the moon and get him back safely by the decade’s end ignited the fires of innovation and discovery like never before! From the rudimentary knowledge about space that existed on the day Kennedy uttered those words to taking that “one small step for man, one giant leap for mankind” on July 20, 1969, scientists, engineers, mathematicians, visionaries, artists, physicists, hardware engineers, construction teams, welders, architects and so many more people came together to…create!

But We Lost Our Way
Unfortunately, after the Apollo program was scrapped, we lost a lot of that passion for challenging ourselves and seeking out the new and unknown. We also lost a uniting purpose that brought people together from around the world.

When Neil Armstrong made that first footprint in the lunar dust, it did not matter who you were, what you worshipped, how your hair looked or the color of your skin. Everyone, everywhere was as one praying, cheering and nail biting every second as Armstrong’s foot inched towards lunar touchdown! The greatest surge of human electricity in our history occurred on that day. We have not experienced such a moment since!

They Had Nothing Better To Do?
In 2006, the International Astronomical Union (IAU) downgraded Pluto from “planet” to “dwarf planet,” an action that exemplified the complete lack of passion for discovery that now existed. This odd group, made up of “a collection of professional astronomers, at the PhD level and beyond,” decided that the planet we all grew up with as the lonely monitor of our outer solar system had failed to clear its neighborhood of other objects, so it could no longer be called a planet.

So this group of egotistical “PhD’s and beyond” decided to inform us all that our childhoods were based on a lie! Somehow a timeout was not enough punishment for Pluto’s failure to clean up its neighborhood. Imagine what that group would have done to you if they had seen your bedroom!!!

For me, I never accepted these Buzz Lightyear wannabes’ pronouncement. Hence I proudly refer to Pluto as a planet. Hopefully all of you will agree with me. If not, feel free to “un-friend” me on LinkedIn.

Science Is Amazing!

Depending upon the Earth’s position, Pluto is an astounding 2.66 – 4.67 billion miles away from us. On average, Pluto is 3.67 billion miles from the sun. Considering a flight from New York to LA is about 3000 miles, Pluto’s really, really, really far away. Consider this another way: Light travels at 186,000 miles per second. So the limited amount of sunlight that bathes the distant planet is over 5 hours old when it arrives. In other words, it takes sunlight over 5 hours to travel those 3.67 billion miles!
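The arithmetic behind that figure is easy to check. A minimal Python sketch, using the post’s own numbers:

```python
# Light-travel-time check using the figures quoted above.
C_MILES_PER_SEC = 186_000        # speed of light in miles per second
AVG_DISTANCE_MILES = 3.67e9      # Pluto's average distance from the sun

seconds = AVG_DISTANCE_MILES / C_MILES_PER_SEC
hours = seconds / 3600
print(f"Sunlight reaches Pluto in about {hours:.1f} hours")
```

Which works out to roughly five and a half hours, consistent with the “over 5 hours” above.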

July of 2015 (46 years since Armstrong’s famous step!) was a truly amazing month. The New Horizons spacecraft, after speeding through our solar system for 9½ years at the astounding speed of 36,373 mph, finally pulled up alongside our distant cousin Pluto. Getting as close as 7,500 miles (less than a speck of dust in astronomical terms), New Horizons began taking what would become the most amazing photographs of any planet since those first photos of the Apollo 11 astronauts on the moon in 1969.

Incredible pictures started appearing from NASA and were shared around the world. The buzz and analysis that followed threw into question many of the assumptions that had been made about Pluto and the very origin of our solar system. That’s called discovery, and it was based on innovation! And that innovation began with Kennedy’s promise, setting us on a path that put the New Horizons craft at Pluto 46 years after lunar touchdown. After the New Horizons spacecraft beamed back those amazing close-up photos of the blue planet, the “esteemed” group at the IAU should have said “sorry, we were wrong…Pluto really is a planet.” But they did not!

Rediscovering “Discovery”
Although we became a bit sidetracked over the past 40 years, and groups like the IAU wasted their time on opinions rather than science, the lure of discovery is a powerful thing. The recent announcement by Yuri Milner is one great example that discovery is still alive and well…just a bit hidden at times.

36,373 miles per hour and 9½ years! How about leveraging some new, innovative thinking and changing those two numbers to 100 million miles per hour and 2 days? That’s Yuri Milner’s plan, and he’s backing it up with an initial investment of $100 million to help send miniature probes into deep space and bring back to all of us that world-bonding feeling of discovery and exploration!
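Those two numbers are worth a quick sanity check. A small Python sketch, using the trip New Horizons actually flew:

```python
# New Horizons: 9.5 years at 36,373 mph gives the approximate trip distance.
HOURS_PER_YEAR = 24 * 365.25
trip_miles = 36_373 * 9.5 * HOURS_PER_YEAR   # roughly 3 billion miles

# The same trip at Milner's target speed of 100 million mph:
hours_at_new_speed = trip_miles / 100e6
print(f"{trip_miles/1e9:.1f} billion miles, redone in {hours_at_new_speed:.0f} hours")
```

About 30 hours, comfortably inside Milner’s two days.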

The Future Looks Bright

Back on that day in July 1969 when Neil Armstrong made the giant step, people literally took Polaroid photographs of their TV screens to capture the moment. Those small, blurry, instant black-and-white photos (an original one is this blog post’s photo) became the symbol of how gripped the nation, and the world, was by the sense of discovery represented by that great achievement.

Now we have Yuri Milner putting forth a new commandment that we will find a way to develop and send probes into deep space at incredible speeds over the next few decades to discover what lies “beyond”. And he is doing it at the very time that our entire technology world is going through a massive evolution with the convergence of technologies into the Internet of Things. He could not have chosen a more perfect time!

So what will happen 20 or 30 years from now when the first photographs of a sun or planet from a star system outside our own are received for the first time? Will we all be gathered around an OLED holographic TV or wrist-projected iPad in a world-unifying moment of discovery and excitement, as occurred with Neil Armstrong’s famous step?

There were far fewer distractions in 1969 and even fewer ways to communicate. By 2040 or 2050 the proliferation of technologies will lead to an exponentially greater number of things to distract us than we have even today. So it won’t be easy for everyone to be pulled together by one singular event. BUT while distractions will be a challenge, the human drive to connect and be a part of something larger will always be there. Nothing brings us together more than human exploration and discovery.

So 20 or 30 years from now we will be glued to our devices. Nail biting, cheering, crying and ready to snap those “Polaroids” on that day when Milner’s bots “phone home” for the first time. Another giant step indeed!!


Scalable Infrastructure for Digital Businesses
by Sudeep Kanjilal on 10-02-2016 at 7:00 am

Building digital businesses is tough. The run-time changes rapidly (browser – apps – bots), and standards for the digital architecture/stack get refined constantly. The pace of innovation is accelerating due to the massive war chests of the top digital players like Google, Facebook, Apple and Amazon. For the Fortune 500 consumer-facing giants to catch up, they have to upgrade their technology stacks and incorporate new technologies while living with their existing ‘legacy’ technologies. This is a tricky balancing act, as it is not just mashing up seemingly incompatible technologies – it is really combining the different sets of capabilities that these technologies enable.

For example, how do you define the ‘security architecture’ for your PII (regulated) data when you have multiple data warehouses and multiple data centers – all running on different generations of technologies? A typical Fortune 500 firm will have at least four versions of its infrastructure stack, with the oldest one typically running a technology stack that is no longer supported. Start-ups, of course, do not face this issue – they get to start with a clean slate. However, karma eventually catches up: once start-ups become successful and big, and start acquiring other companies, their technology stack eventually resembles a kitchen sink!

Strengthen the foundations
To address the inevitable technology debt, which deepened over the past 8 years due to Great Recession-induced cost cutting, smart Fortune 500 firms are starting from the basics – strengthening the foundation of the overall technology stack. They are focusing on core infrastructure. A correct choice, as a house is only as strong as its foundation.

Digital businesses are inherently susceptible to a fly-wheel effect, which, done right, leads to rapid growth running into millions of MAUs. One can imagine the gusher of data that this model will generate, and the rapid (real-time) analytics and decision-making needed to finely calibrate user interaction and the overall user experience – without even bringing in higher-order capabilities like machine learning. And none of this will scale on a complex mix of old infrastructure elements spanning four generations.

So, smart firms are putting in place governance policies for infrastructure refresh that require any new infrastructure build or refresh to follow three steps:

  1. Define the overall architecture that is scalable
  2. Define the infrastructure in software instead of specialized hardware/appliances. Think stack – IaaS, PaaS and SaaS
  3. Go hybrid – explicitly define infrastructure that will ALWAYS have multiple generations of technologies, as that will always be the reality.
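To make the “infrastructure in software” step concrete, here is a purely hypothetical sketch (the names and structure are illustrative, not any firm’s actual tooling) of declaring a stack as data so that the “go hybrid” rule can be checked automatically:

```python
# Illustrative only: a stack declared as data, with a governance check.
STACK = {
    "iaas": {"placement": "public_cloud", "generation": 4},
    "paas": {"placement": "private_cloud", "generation": 3},
    "data_warehouse": {"placement": "bare_metal", "generation": 2},
}

def is_hybrid(stack):
    """A compliant architecture explicitly spans multiple placements
    and multiple technology generations."""
    placements = {layer["placement"] for layer in stack.values()}
    generations = {layer["generation"] for layer in stack.values()}
    return len(placements) > 1 and len(generations) > 1

print(is_hybrid(STACK))  # True for this sample stack
```

The point of expressing the policy in code is that every refresh can be validated the same way, rather than debated case by case.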

    Hybrid or Public
    The last point is critical, but cannot be taken in isolation. Too often I have witnessed decisions at these Fortune 500 firms being taken in purely binary terms – cloud vs. data centers, public cloud vs. private cloud vs. hybrid cloud, etc. It’s critical to look at the infrastructure not just in terms of technology choices within each layer of the stack (Windows vs. Linux for servers, Oracle vs. SQL for DB servers, etc.) but to think in terms of the overall service that the stack is enabling, and then define the architecture requirements.

    Example: for a large-scale consumer application at a typical bank and/or insurance firm, the Big Data would run on bare metal (dedicated servers), the customer backend on public cloud, and the web-scale front end (or perhaps a reverse proxy server) on public cloud – with security in all stack layers. To the end customer, all this is transparent – it just works, and scales infinitely.

    The ultimate aim: speed up innovation – idea to production in 30 minutes, pushing code to production 100 times a week. Only then do you get a digital business!


    Will TSMC be alone at 10nm and 7nm?!?!?
    by Daniel Nenni on 10-01-2016 at 7:00 am

    Now that the dust has settled let’s talk about the recent TSMC OIP Ecosystem Forum. This was the 6th annual OIP, which hosted more than 1,000 attendees from TSMC’s top customers and partners. Presenting this year were TSMC VP and CTO Dr. Jack Sun, TSMC VP of R&D Dr. Cliff Hou, and ARM EVP of Incubation Businesses Dr. Dipesh Patel. First let’s talk about some of the key manufacturing milestones that were mentioned.

    Also read: Top 5 Highlights from the 2016 TSMC Open Innovation Platform Forum

    TSMC announced that the new 16FFC process is currently in high volume manufacturing (HVM). As we know from the recent iPhone 7 teardown, the A10 SoC inside is TSMC 16FFC. Additionally, the new TSMC InFO packaging is in HVM, which is also used by Apple for the A10. The iPhone 7 teardown also showed that the majority of the chips inside the iPhone 7 are manufactured by TSMC, including the Intel modem (TSMC 28nm) and the QCOM modem (TSMC 20nm). This represents a significant upside for TSMC in Q3 and Q4 of this year, so get ready for some very upbeat investor calls.

    TSMC announced that 10nm is ahead of schedule and will enter HVM in Q4 2016 versus Q1 2017. This supports my belief that the new Apple iPad A10x (to be announced next month) uses TSMC 10nm and will be the fastest SoC on the market, absolutely. I also believe that the next iPhone SoC (iPhone8) will use TSMC 10nm exclusively.

    The chatter in the conference hall from people who would know was that due to unexpected yield challenges involving other 10nm processes, TSMC may be running unopposed at 10nm for the next 3-4 quarters. If so, this is huge for TSMC and the TSMC Ecosystem!

    TSMC announced that 7nm is ahead of schedule and will start risk production in Q1 2017 meaning HVM will be Q4 2017 (just in time for the iPad A11x SoC). TSMC 7nm will use the same fabs as 10nm so the ramp will be predictably fast. This leaves TSMC again unopposed at 7nm for 1-2 years so congratulations to the hardworking people of TSMC.

    And congratulations to TSMC partners and customers who will now lead the industry in semiconductor process development and will deliver industry leading chips for the rest of this decade. Just to name a few: Apple, ARM, Broadcom, MediaTek, Nvidia, Xilinx, etc…

    Bottom line: TSMC has the strongest roadmap I have ever seen and will continue to dominate the foundry business for years to come (déjà vu 28nm).

    The other interesting thing to note is that the 30 technical OIP presentations made by partners and customers are now available via TSMC Online:

    Held Sept. 22, 2016 at the Santa Clara Convention Center, the TSMC Open Innovation Platform Ecosystem Forum was attended by more than 1,000 TSMC customers and the Open Innovation Platform design ecosystem partners from EDA, IP and Design Services. The Forum brought TSMC’s design ecosystem member companies together to share with our customers real-case solutions to customers’ design challenges and success stories of best practice in TSMC’s design ecosystem. In an adjacent Partner Pavilion, 50 design partner companies staffed booths, showcased their products and services, and took questions throughout the day from leading designers. 30 technical papers were presented during the forum, showing real solutions and how the complete OIP ecosystem achieves faster time to market.

    Live EDA Technical Presentations:

    Live IP Technical Presentations:

    Live Design Service Technical Presentations:

    Print-Only Technical Presentations:


    CCIX shows up in ARM CMN-600 interconnect
    by Don Dingee on 09-30-2016 at 4:00 pm

    All the hubbub about FPGA-accelerated servers prompts a big question about cache coherency. Performance gains from external acceleration hardware can be wiped out if the system CPU cluster is frequently taking hits from cache misses after data is worked on by an accelerator.

    ARM’s latest third-generation CoreLink CMN-600 Coherent Mesh Network interconnect announcement this week had a bunch of goodness about higher performance ARMv8 support. The interconnect runs at up to 2.5 GHz and cuts latency in half, resulting in five times the throughput of the prior CoreLink generation. It supports from 1 to 128 Cortex-A cores, and uses AMBA 5 CHI interfacing.


    Also part of the latency/throughput equation is the new CoreLink DMC-620 Dynamic Memory Controller, with integrated TrustZone security and support for up to 8 channels of DDR4-3200, including 3D stacked DRAM. The combination definitely helps CPU-centric performance. However, from other news I’ve covered this week, I’ll reiterate: viewing cache from only a CPU-centric perspective is an outdated idea in a world of heterogeneous SoCs.

    ARM is backing a different approach to solve the system-centric coherency issue. A few months ago, seven companies – AMD, ARM, Huawei, IBM, Mellanox, Qualcomm, and Xilinx – announced a rather cryptic initiative. It’s called Cache Coherent Interconnect for Accelerators, or CCIX. Everything we’ve known about that initiative is on a single web page, until now.

    A big differentiator of Xilinx Zynq versus other attempts at FPGA SoCs so far is its cache coherency – not perfect, an early proprietary implementation, but it still beats the heck out of having an FPGA sitting astride a CPU on an external bus. (We can only assume Intel and Altera have learned the ‘Stellarton’ lesson, and we’ll see how in future products.)

    CCIX only says the solution is a “driver-less and interrupt-less” usage model, offering orders of magnitude improvement in application latency. Presumably, the improvement has to do with connecting coherent agents and sharing cache updates more efficiently. In the ARM CMN-600 scheme, that’s part of what the Agile System Cache is supposed to do. ARM is also putting a lot of energy into cache-coherent GPU IP, and has its NIC-450 IP to help I/O subsystems.

    But the sleeper in all this is CCIX support. What is likely developing here are three circles of influence: ARM and its ecosystem partners, IBM (see a new SemiWiki article on POWER9), and Xilinx in one camp; NVIDIA with NVLink also having a connection to IBM POWER8 and POWER9; and Intel and Altera in the other. CCIX is open, and presumably someone else could jump in on that side. Intel and Altera could proceed with a proprietary solution, or surprise me and a lot of other folks by joining in. There’s also one other processor architecture out there – RISC-V – that could weigh in on the CCIX side soon. (I’ll have some more thoughts on that in an upcoming piece on my conversation this week with SiFive.)

    A full CCIX spec is supposedly due before year-end, at which point I’d expect ARM to be a lot more specific about what they are doing in the CMN-600 IP. It’s interesting that only AMD, Huawei, and Qualcomm are onboard with CCIX so far, leaving one to wonder what the other ARM-based server types like Broadcom and Cavium are up to as far as cache coherency. As for other NoC vendors, Arteris has its proprietary NCore, and NetSpeed Systems has alluded to CCIX on slides but nothing official yet.

    For more on what little ARM did say about the CMN-600, the full press release:

    ARM System IP boosts SoC performance from edge to cloud

    As with any pre-approved specification, what constitutes CCIX support right now may be subject to change down the road. Given the intensity at which open servers and other applications are being explored, betting on an open specification for connecting FPGAs coherently to SoCs is smart money.


    Meet the POWER9 Chip Family
    by Alan Radding on 09-30-2016 at 12:00 pm

    When you looked at a chip in the past, you were primarily concerned with two things: the speed of the chip, usually expressed in GHz, and how much power it consumed. Today the IBM engineers preparing the newest POWER chip, the 14nm POWER9, are tweaking the chip for the different workloads it might run, such as cognitive or cloud, different deployment options, such as scale-up or scale-out, and a host of other attributes. EE Times described it in late August from the Hot Chips conference where it was publicly unveiled.

    IBM POWER9 chip

    IBM describes it as a chip family, but maybe it’s best described as the product of an entire chip community, the OpenPOWER Foundation. Innovations include CAPI 2.0, New CAPI, Nvidia’s NVLink 2.0, PCIe Gen4, and more. It spans a range of acceleration options from HSDC clusters to extreme virtualization capabilities for the cloud. POWER9 is not just about high speed transaction processing; IBM wants the chip to interpret and reason, ingest and analyze.


    POWER has gone far beyond the POWER chips that enabled Watson to (barely) beat the human Jeopardy champions. Going forward, IBM is counting on POWER9 and Watson to excel at cognitive computing, a combination of high speed analytics and self-learning. POWER9 systems should not only be lightning fast but get smarter with each new transaction.


    For z System shops, POWER9 offers a glimpse into the design thinking IBM might follow with the next mainframe, probably the z14 that will need comparable performance and flexibility. IBM already has set up the Open Mainframe Project, which hasn’t delivered much yet but is still young. It took the Open POWER group a couple of years to deliver meaningful innovations. Stay tuned.


    The POWER9 chip is incredibly dense (below). You can deploy it in either a scale-up or scale-out architecture, with a choice of versions for two-socket servers with 8 direct DDR4 ports or for multiple chips per server with buffered DIMMs.

    IBM POWER9 silicon layout


    IBM describes the POWER9 as a premier acceleration platform. That means it offers extreme processor/accelerator bandwidth and reduced latency; coherent memory and virtual addressing capability for all accelerators; and robust accelerated compute options through the OpenPOWER community.


    It includes State-of-the-Art I/O and Acceleration Attachment Signaling:

    • PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth
    • 25G Link x 48 lanes – 300 GB/s duplex bandwidth
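Those duplex figures follow directly from the raw lane rates; a quick back-of-envelope check (ignoring encoding overhead, which is why the PCIe number lands slightly above the effective rate):

```python
# Duplex bandwidth from raw signaling rates, both directions, 48 lanes each.
LANES = 48

pcie_gen4_lane_GBps = 16 / 8   # PCIe Gen 4: 16 GT/s per lane ~ 2 GB/s per direction
link_25g_lane_GBps = 25 / 8    # 25G Link: 25 Gb/s per lane ~ 3.125 GB/s per direction

pcie_duplex = pcie_gen4_lane_GBps * LANES * 2
g25_duplex = link_25g_lane_GBps * LANES * 2

print(pcie_duplex, g25_duplex)  # 192.0 300.0
```

Both match the figures IBM quotes above.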

    And robust accelerated compute options based on open standards, including:

    • On-Chip Acceleration—Gzip x1, 842 Compression x2, AES/SHA x2
    • CAPI 2.0—4x bandwidth of POWER8 using PCIe Gen 4
    • NVLink 2.0—next generation of GPU/CPU bandwidth and integration using 25G Link
    • New CAPI—high bandwidth, low latency and open interface using 25G Link

    In scale-out mode it employs direct attached memory through 8 direct DDR4 ports, which deliver:

    • Up to 120 GB/s of sustained bandwidth
    • Low latency access
    • Commodity packaging form factor
    • Adaptive 64B / 128B reads

    In scale-up mode it uses buffered memory through 8 buffered channels to provide:

    • Up to 230GB/s of sustained bandwidth
    • Extreme capacity – up to 8TB / socket
    • Superior RAS with chip kill and lane sparing
    • Compatible with POWER8 system memory
    • Agnostic interface for alternate memory innovations

    POWER9 was publicly introduced at the Hot Chips conference in August. Commentators writing in EE Times noted that POWER9 could become a breakout chip, seeding new OEM and accelerator partners and rejuvenating IBM’s efforts against Intel in high-end servers. To achieve that kind of performance IBM deploys large chunks of memory—including a 120 Mbyte embedded DRAM in shared L3 cache—while riding a 7 Tbit/second on-chip fabric. POWER9 should deliver as much as 2x the performance of the POWER8 or more when the new chip arrives next year, according to Brian Thompto, a lead architect for the chip, in published reports.


    As noted above, IBM will release four versions of POWER9. Two will use eight threads per core and 12 cores per chip, geared for IBM’s Power virtualization environment; two will use four threads per core and 24 cores per chip, targeting Linux. Each will come in two versions — one for two-socket servers with 8 DDR4 ports and another for multiple chips per server with buffered DIMMs.


    The diversity of choices, according to Hot Chips observers, could help attract OEMs. IBM has been trying to encourage others to build POWER systems through its OpenPOWER group that now sports more than 200 members. So far, it’s gaining most interest from China where one partner plans to make its own POWER chips. The use of standard DDR4 DIMMs on some parts will lower barriers for OEMs by enabling commodity packaging and lower costs.


    DancingDinosaur is Alan Radding, a veteran information technology analyst and writer. Please follow DancingDinosaur on Twitter, @mainframeblog. See more of his IT writing at technologywriter.com and here.


    Low power physical design in the age of FinFETs
    by Beth Martin on 09-30-2016 at 7:00 am

    Low power is now a goal for most digital circuit designs: it reduces costs for packaging, cooling, and electricity; increases battery life; and improves performance without overheating. I recently talked to the experts on physical design for ultra-low power at Mentor Graphics about the challenges to P&R tools and the techniques used during design, particularly to control dynamic power consumption in FinFETs. Here’s what David Chinnery, R&D lead for low power, and product marketing architect Arvind Narayanan had to say.

    Low-power design starts at the architectural level and continues through implementation. The challenge in implementation is to create, optimize, and verify the physical layout so that it meets the power budget along with traditional timing, performance, signal integrity (SI), and area goals. Physical design tools must find the best trade-offs when implementing a variety of low-power techniques. With the multiple layers of complexity in advanced technology node designs, power management requires a larger bag of tricks than ever.

    Chinnery pointed out that low power as a design goal is nothing new. The basic low-power techniques, such as clock gating and use of multiple voltage thresholds (multi-Vt), are well-established and supported by existing tools. What’s new is the added complexity of meeting that goal at advanced process nodes and the increasing size and complexity of designs. There are a huge number of corner, mode, and power state scenarios that have conflicting requirements for power, timing, SI, manufacturability, and area. Additionally, with slower migration to new process technology nodes, there has been much greater emphasis on achieving the best possible design results using EDA tools with state-of-the-art low power techniques.

    Narayanan added that FinFETs use significantly less total power, but the dynamic power component is much higher compared to the leakage power. The FinFET’s 3D gate around the transistor drain-source channel greatly reduces leakage due to better on/off control of the electric field in the channel. However, the 3D FinFET gate has higher capacitance compared to the MOSFET planar gate structure. Thus dynamic power needs to be considered during optimization and throughout the physical design flow.
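The dynamic component follows the familiar switching-power relation P = α·C·V²·f, which is why the higher gate capacitance of FinFETs pushes dynamic power to the fore. A toy illustration (the parameter values below are made up, purely to show the quadratic voltage dependence that makes voltage scaling and clock gating so effective):

```python
# Illustrative switching-power model; parameter values are hypothetical.
def dynamic_power(alpha, cap_farads, vdd_volts, freq_hz):
    """P = alpha * C * V^2 * f: activity factor, switched capacitance,
    supply voltage squared, clock frequency."""
    return alpha * cap_farads * vdd_volts**2 * freq_hz

p_nominal = dynamic_power(0.2, 1e-9, 0.8, 1e9)   # ~0.128 W
p_scaled = dynamic_power(0.2, 1e-9, 0.6, 1e9)    # ~0.072 W

print(p_nominal, p_scaled)  # lowering Vdd alone cuts ~44% of dynamic power
```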

    So what can be done during physical design to control power?

    Go in-depth with low power physical design solutions. Download the Mentor whitepaper, Low-Power Physical Design with the Mentor Place and Route System.

    First, the tools must have the capacity to handle very large designs, 100+ million instances, with reasonable runtimes. Second, the tools must support the full bag of low power tricks. By “tricks” they mean techniques to reduce power like using multi-Vt, gate sizing, clock gating, multi-corner/multi-mode (MCMM) power optimization, pin swapping, register clumping, remapping, and power-density driven placement. But, Chinnery noted that designers also need full support for UPF 3.0 directives, multi-voltage flows, support for dynamic voltage and frequency scaling (DVFS) to handle varying supply voltages and clock frequencies, and the capability to handle special cells such as level shifters, isolation cells, and multi-threshold CMOS (MTCMOS) power gates. Deployed correctly, these advanced techniques and some additional secret-sauce optimization tricks result in power analysis reports that make everyone’s day a bit happier.

    Chinnery and Narayanan agreed that one of the key strengths of Mentor’s physical design tools is the native MCMM architecture that lets designers analyze and optimize the design for all corner/mode/power state scenarios concurrently. Whether the design uses advanced process nodes and FinFETs or not, the MCMM capability is vital.

    Also important for any low-power design, whether on a legacy node or the latest processes, is clock tree synthesis. Getting the best clock tree, said Narayanan, also depends on the ability to synthesize the clocks for multiple corners and modes concurrently in the presence of design and manufacturing variability, and in multi-voltage flows. He pointed to the importance of features like:

    • Composing single-bit registers to multi-bit registers to reduce their load on the clock distribution network, and decomposing from multi-bit registers where necessary to meet data-path timing constraints
    • Lowering leaf wire capacitance by register clumping and clock gate cloning/de-cloning
    • Reducing functional skew and clock skew across multiple corners with MCMM CTS
    • Useful skew to improve data-path timing and reduce power consumption
    • Improving clock gating coverage with additional fine-grained gating
    • Minimizing clock net switching power with smart clock gate placement to reduce wire length for high activity clock nets


    Using low-power CTS with MCMM optimization significantly reduces the number of buffers, skew, total negative slack (TNS) and worst negative slack (WNS), in addition to reducing dynamic power and area. The table shows some real customer data comparing a single-corner CTS implementation with a 9-corner CTS implementation for a single mode in a 9-corner design.


    Another key part of the physical design flow is routing. For low-power designs, the router should follow the UPF power intent. This includes maintaining a single port of entry for boundary nets and respecting voltage island boundaries. The router handles secondary power connections for retention flip-flops, level shifters, and always-on buffers. Mentor’s router gets constant updates on MCMM timing and wire resistance and capacitance (RC), which it uses to find the optimal solution to meet power, timing, SI, manufacturability, and area constraints. It is also DFM-aware, so it accounts for the manufacturing issues that affect power (especially leakage power), such as variations in on-chip temperature and thickness.

    For many IC designs, low-power is as important as timing. While you get the largest impact on power at the architectural level, Chinnery mentioned that a focus on low power throughout place and route can achieve a further 20% to 40% power reduction on some designs. Additionally, the place and route flow should remove the unpredictability from the physical implementation process that can result in late-stage surprises in power consumption. A blown power budget can affect the cost, performance, and time-to-market of low-power ICs.

    For all the details on low-power physical implementation, download the Mentor whitepaper, Low-Power Physical Design with the Mentor Place and Route System.


    Cadence DSPs float for efficiency in complex apps
    by Don Dingee on 09-29-2016 at 4:00 pm

    Floating-point computation has been a staple of mainframe, minicomputer, supercomputer, workstation, and PC platforms for decades. Almost all modern microprocessor IP supports the IEEE 754 floating-point standard. Embedded design, for reasons of power and area and thereby cost, often eschews floating-point hardware.


    16nm HBM Implementation Presentation Highlights CoWoS During TSMC’s OIP
    by Tom Simon on 09-29-2016 at 12:00 pm

    Once a year, during TSMC’s Open Innovation Platform (OIP) Forum, you can expect to see cutting-edge technical achievements by TSMC and their partners. This year was no exception, with Open-Silicon presenting its accomplishments in implementing an HBM reference design in 16nm. It’s well understood that HBM offers huge benefits in terms of bandwidth and lower power consumption over alternatives such as DDR. With the advent of the JEDEC HBM Gen-2 specification, both density and data rates have gone up significantly. In 2-, 4- or 8-high stack configurations, HBM Gen-2 supports up to 8 GB per stack. In addition, data rates are going up to 1.6 Gb per second, or even up to 2 Gb per second per pin.
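Since the JEDEC HBM interface is 1,024 bits wide per stack, the per-pin rate quoted above translates directly into per-stack bandwidth:

```python
# Per-stack bandwidth at the quoted per-pin data rate (1,024-bit HBM interface).
INTERFACE_BITS = 1024
gbps_per_pin = 2.0

stack_bandwidth_GBps = INTERFACE_BITS * gbps_per_pin / 8
print(stack_bandwidth_GBps)  # 256.0 GB/s per stack
```

That 256 GB/s figure is exactly the data rate cited for the target design later in this article.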

    According to Open-Silicon, 16nm FinFET is the key to unlocking the full benefits of HBM: relative to 28nm, 16nm FinFET processes can potentially reduce power by 50% and boost performance by the same amount. However, implementing these HBM designs requires a complete ecosystem covering the die, interposer, assembly and packaging. Open-Silicon paired SK Hynix’s HBM die stack with a TSMC 16nm/2.5D/CoWoS ASIC implementation; CoWoS is TSMC’s 2.5D interposer technology. Indeed, TSMC has been making a big deal out of all of its advanced packaging options.

    TSMC has been innovating in packaging and is seeing the results in its business. It’s widely understood that TSMC scored a design win with the Apple A10 used in the iPhone 7, so packaging technology is clearly becoming a significant differentiator for foundries. We can expect even more creative and diversified offerings in the already exploding packaging market.

    But now back to Open-Silicon and its HBM implementation at 16nm. HBM is a good choice for products under threefold pressure on form factor, bandwidth and power; these applications include data centers, networking, radar, virtual reality, gaming and cloud computing. In the target design, Open-Silicon replaced 24 DDR3-1600 (x16) devices with a single HBM stack: power consumption dropped from 1.0 mW per gigabit to 0.33 mW, while the data rate climbed from 4 GB/s to 256 GB/s.
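Taking those per-gigabit figures at face value (and treating them as mW per Gb/s of delivered bandwidth, which is my reading rather than something stated explicitly), the memory-interface power gap at full HBM bandwidth looks roughly like this:

```python
# Rough comparison using the per-gigabit power figures quoted above,
# treated here as mW per Gb/s of delivered bandwidth (an assumption).
bandwidth_gbps = 256 * 8          # 256 GB/s expressed in Gb/s
ddr3_mw_per_gbps = 1.0
hbm_mw_per_gbps = 0.33

ddr3_watts = bandwidth_gbps * ddr3_mw_per_gbps / 1000
hbm_watts = bandwidth_gbps * hbm_mw_per_gbps / 1000
print(round(ddr3_watts, 2), round(hbm_watts, 2))  # ~2.05 W vs ~0.68 W
```

On those assumptions HBM delivers the same bandwidth at roughly a third of the interface power, which is where the cost-of-ownership argument later in the article comes from.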

    According to Open-Silicon’s Bupesh Dasila, Engineering Manager for Silicon Engineering, the major challenges in implementing a 2.5D SiP using HBM are: building a scalable PHY architecture, designing the 2.5D interposer, managing the custom die-to-die I/Os, and testing the completed system. There were 1,840 routes on the interposer, up to 5 mm in length, connecting the HBM to the SoC. To shield the signal lines from crosstalk, 0.5 um ground wires were placed 2.1 um to the side of each signal wire, leaving 2.1 um for each signal line. The signal wires were 0.85 um thick.
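To get a feel for what those long routes look like electrically, here is a back-of-the-envelope DC resistance estimate built from the quoted geometry. It assumes bulk-copper resistivity and a 2.1 um line width, both my assumptions; real interposer metal has higher effective resistivity, so treat this as a floor:

```python
# Illustrative DC resistance estimate for one worst-case interposer route,
# using the geometry quoted in the presentation and assuming bulk copper.
rho = 1.7e-8          # ohm*m, bulk-copper resistivity (assumption)
length = 5e-3         # m, longest route quoted (5 mm)
width = 2.1e-6        # m, signal line width per the shielding scheme
thickness = 0.85e-6   # m, signal line thickness

resistance = rho * length / (width * thickness)  # R = rho * L / (w * t)
print(round(resistance, 1))  # ~47.6 ohms for a 5 mm route
```

Tens of ohms of series resistance on a 2 Gb/s link is exactly the kind of number that forces the extensive interposer SPICE modeling described next.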

    After his presentation, Bupesh told me that they did extensive modeling to verify the electrical characteristics of the signals on the interposer. Below is an example of their interposer SPICE simulations. In addition to the PHY design, Bupesh and his team designed the I/O on the 16nm die that communicates with the HBM memory module.

    Open-Silicon’s roadmap takes this approach to 7nm, and they will also validate Gen-2 HBM on a 28nm design. The results of the 16nm chip were impressive, with data rates of 2 Gb/s per pin using their custom I/Os and PHY. They were also diligent about testability, adding probe pads and loopback paths to help isolate issues among system components if needed.

    Open-Silicon emphasized that it is ready today to deliver solutions offering the potentially game-changing benefits of HBM. Admittedly this is new technology that carries more up-front cost; nevertheless, the area savings at volume are significant, and the cooling and power improvements will change the cost-of-ownership equation for the finished products these devices go into. More information on Open-Silicon’s HBM expertise is available on their website, here.


    Power Exploration at RTL Design with Mentor PowerPro
    by Bernard Murphy on 09-29-2016 at 7:00 am

    There was a comment recently that design for low power is not an event, it’s a process; that comment is absolutely correct. Power is affected by everything in the electronic ecosystem, from application software all the way down to layout and process choices. Yet power as a metric is much more challenging to model and control than metrics like timing and area since it depends on factors across that range, particularly activity, which in turn is heavily dependent on use-cases.

    Still, a practical design methodology can’t iterate over so wide a span, so each stage aims to optimize – using realistic use-case data – within what can be reasonably controlled (or at least determined as constraints to feed forward) at that stage. One important observation helps: it has been amply demonstrated that power-optimizations made at higher levels in the system have a bigger impact than those made at lower levels. So when the architecture has been fixed, you are going to get the biggest bang for your buck through RTL optimizations. Which is not to say that you won’t polish all the way down to layout if you’re going after picowatt savings, but the Pareto principle suggests you should put most of your effort into the RTL. Unless of course you are unable to change the RTL, which can happen if you want to avoid re-qual.

    What are the options at RTL? Architecture and target process are already fixed. Your choices are to reduce leakage in areas that are not very performance sensitive by controlling the Vt mix, to power-down islands of logic during periods those functions are not needed, to reduce redundant/useless activity by gating clocks and related signals, and to scale down V²f power (again in islands) where feasible through dynamic voltage and frequency scaling (DVFS). Depending on where you are starting, this bag of tricks together can give you 30% or more reduction in power, or as little as a handful of percent reduction if you’ve already significantly optimized the design. Energy (integrated power over time, which is important for battery life) is mostly controlled by how long you can keep most of the logic in a low/zero power state and how much power is consumed in turning it back on.
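The V²f term is why DVFS is so effective: voltage enters quadratically, so lowering voltage and frequency together compounds the savings. A quick sketch with purely illustrative numbers (the capacitance, activity factor, and operating points below are made up, not from the webinar):

```python
# Sketch of why DVFS pays off: dynamic power scales as alpha * C * V^2 * f,
# so dropping voltage and frequency together compounds. Values illustrative.
def dynamic_power(alpha, cap_farads, volts, freq_hz):
    """Switching power of a logic block: alpha * C * V^2 * f."""
    return alpha * cap_farads * volts**2 * freq_hz

nominal = dynamic_power(0.2, 1e-9, 1.0, 1e9)    # 0.2 W at 1.0 V, 1 GHz
scaled = dynamic_power(0.2, 1e-9, 0.8, 0.5e9)   # throttled to 0.8 V, 500 MHz

print(round(scaled / nominal, 2))  # 0.32 -- roughly a 68% dynamic-power cut
```

Halving frequency alone would save 50%; the extra 20% voltage drop turns that into 68%, which is the multiplicative effect the paragraph above alludes to.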

    Of course, doing all of this comes with costs. Even clock gating adds at least one cycle of latency in turn-on time. Power domains can be significantly slower to reactivate because they have complex power-up sequences (turn power on, reset/start clocks, restore state from retention registers). And then there’s the issue of what happens when something is switching on or off and something else wants to talk to it; this requires work to prove either that such a case can never happen or that handshake logic is in place to handle it gracefully. All of this added circuitry consumes area, may create new timing problems and adds complexity to verification. So while you may find lots of ways you could reduce power, they’re not all equally desirable when balanced against the other consequences of making those changes. PowerPro provides a state-of-the-art way to start this analysis: it considers all the options, offers both automated and guided power saving, and supports interactive exploration of those options with feedback on power-reduction and cost metrics.


    Mentor makes the point that all of this optimization could be handled more effectively if regular RTL designers got more involved in optimization for low power. Today this objective is generally handed off to power experts who, while skilled in that domain, necessarily have limited understanding of the total design objectives, leaving you wondering what gets left on the table. However, in high-pressure design schedules it’s sometimes difficult to see how design teams can significantly rework their assignments. Perhaps instead PowerPro can enable a more comprehensive discussion between block, subsystem and top-level designers, the power experts and the verification engineers in debating which power-saving options are most worthy of consideration in the design. This can start with the power expert filtering a range of possible directions down to a limited set of the most promising scenarios.

    At that point, being able to interactively flip through scenarios (enabled by PowerPro’s near-real-time what-if analysis) would let the collective product team make the optimal choices, with each member bringing their own expertise to weigh a scenario from the bandwidth, latency, area, power, performance/criticality and verification-complexity perspectives.

    You can read more detail on PowerPro in the link at the end of this blog. A couple of interesting questions came up after the webinar: one touched on how accurate dynamic power estimation is without a SPEF for the design, the other concerned vectorless estimation. Mentor answered both well in my view. First, RTL power estimation is good for relative comparisons (is this option better than that one?), which is exactly what you need it for; absolute correlation with silicon is not the goal, nor is it likely possible before the design is fully implemented. Second, RTL block designers usually ask about vectorless estimation because they don’t have much in the way of vectors. Vectorless analysis can give you ballpark estimates, but I wouldn’t invest a lot of time in power-saving tweaks based on it – the error bars can easily swamp the potential savings.

    The Mentor Webinar can be found HERE.

    More articles by Bernard…


    It’s a heterogeneous world and cache rules it now
    by Don Dingee on 09-28-2016 at 4:00 pm

    Cache evolved when the world was all about homogeneous processing and slow, expensive shared memory. Now, compute is just part of the problem – devices need to handle display, connectivity, storage, and other tasks, all at the same time. Different, heterogeneous cores handle different workloads in the modern SoC, and the burden of cache coherency is shifting.