
Microsoft, FPGAs and the Evolution of the Datacenter

by Bernard Murphy on 10-03-2016 at 12:00 pm

When we think of datacenters, we think of serried ranks of high-performance servers. Recent announcements from Google (on the Tensor Processing Unit), Facebook and others have opened our eyes to the role that specialized hardware and/or GPUs can play in support of deep/machine learning and big data analytics. But most of us would probably still consider those applications, while important, somewhat niche in their role in the datacenter.

Several years ago, motivated by what they knew was already happening at Google and Amazon, Microsoft started to build their own machine learning system to enhance the capabilities of Bing. But rather than develop a custom device, or build on a GPU platform, they decided to build on FPGAs. As we know, FPGA-based solutions can be significantly cheaper to build and deploy when you know you are going to be the sole customer. And of course FPGAs have the advantage of re-programmability. The Microsoft team built an FPGA-based platform they called Catapult and demonstrated this would significantly accelerate machine-learning algorithms in Bing (over previous software-only approaches, I assume).

Fast forward to 2015. Even the most starry-eyed Microsoft supporter would admit that Bing has a long way to go to catch up with the leader in search and is unlikely to drive significant revenue for Microsoft in the near future. What the company really wants are more ways to propel their major online services – Azure (the MS Cloud) and Office 365. Catapult was appealing to both of these applications, but not necessarily for machine-learning.

A major problem for Azure has been managing the high volume of PCIe network traffic to and from virtual machines through virtual network (VN) adapters. When this gets up to GB/sec for a VM, the VN management load on the CPU becomes substantial. Obviously, off-loading this to a system that supports physical traffic and handles network virtualization can significantly improve throughput. Network cards would be one solution, but the Azure team didn’t find this approach adaptable enough to support what they needed in a flexible VN fabric on the server side. After all, if you want maximum flexibility in VM management in the cloud, you need corresponding flexibility in VN management. The Azure team felt this could best be handled through FPGAs, particularly for their programmability in load balancing and other rules.

All of this required a major rework of Catapult, but now the hardware is done and is being rolled out. And this is no longer a few specialized boxes to serve specialized needs. Azure needs a Catapult system per server (exact details are difficult to find – looks like one per server). And you can add to that the deep/machine learning requirements to support Bing and later encryption/compression and machine learning requirements to support Office 365.

This is a whole new ball-game for FPGA deployment. Since a large datacenter contains many hundreds of thousands of servers, Microsoft’s demand alone has apparently shifted FPGA worldwide volumes significantly. You should know by the way that Catapult is based on Altera FPGAs. Intel EVP Diane Bryant is on record as saying this is why Intel bought Altera last year. She also anticipates that for similar reasons, one third of all servers in datacenters will contain FPGAs (presumably optical connectivity sets the limit on volume, where FPGAs maybe can’t help – for now, but stay tuned since Intel was talking about both FPGAs and photonics at the OCP summit this year).

Of course you could argue that Microsoft and Intel have misread the market and the virtual networking functionality will be replaced by ASIC hardware solutions (especially optical). I’m not so sure, at least for the next few years. This is an area of critical differentiation for cloud services providers, so they’ll each want their own solutions. Of course the economics of ASIC may not be a big factor in those budgets, but adaptability could be a very big factor, especially as capabilities in cloud services are evolving quickly. Eventually differentiation always moves on to other factors, but it’s not clear that is going to happen here anytime soon.

You can read the Wired article on Catapult HERE and a slightly more detailed article on the Azure need for networking flexibility HERE.



Could Machine Learning be Available for Mass Market?

by Eric Esteve on 10-03-2016 at 7:00 am

Machine Learning is at the hype peak, according to Gartner’s August 2016 Hype Cycle for Emerging Technologies. The demand for vision processor IP is strong in the smartphone, automotive and consumer electronics segments. ASSP-based solutions can do the job, but how can an OEM create differentiation and control its destiny and pricing if it selects an ASSP? In the mobile segment, integrating a Mediatek or Qualcomm SoC supporting camera/vision will lead an OEM to build a ‘me too’ smartphone. OEMs developing ADAS or autonomous driving for automotive face a similar problem when integrating a MobilEye or NVIDIA ASSP, as they can’t add their own algorithms and differentiate.

It’s the right time to integrate a complete DSP-based vision processor IP solution, such as the new CEVA-XM6 DSP core with its hardware accelerators, neural network software framework, software libraries and algorithms. The time is right because the performance of deep learning technology, measured by the error rate on image recognition, is this year, for the first time, better than human performance! It’s also the right time because adopting the CEVA Convolutional Deep Neural Network (CDNN) solution implemented on the XM6 DSP core will enable embedded neural networks for mass market (low cost) vision applications and allow delivery of deep learning solutions on (low power) embedded devices. This low cost, low power solution has not emerged by chance. The CEVA-XM6 based vision platform has been built on the strong foundation of the XM4, which counts 25 design wins, and on the vast experience accumulated across multiple end markets and applications where neural networks are being deployed.

We explained Convolutional Deep Neural Network (CDNN) theory and gave some examples of proprietary networks in a previous blog on SemiWiki. Today we will focus on how to generate a CDNN, using the S/W development tools and the CEVA Network Generator, and describe the H/W implementation.

Before running imaging and vision algorithms on the CEVA-XM6 DSP, you can create your own CDNN using the neural network software framework, which is made up of real-time libraries, computer vision libraries (CEVA-CV, based on OpenCV), a vision processing API (OpenVX, the royalty-free open standard API from Khronos, integrated into CEVA-VX) and 3rd party S/W. At this point, any customer can create differentiation by inserting proprietary algorithms. Instead of using one CDNN to fit all applications, the CEVA Network Generator allows the creation of a unique CDNN that is customer or application specific.

CEVA-XM6 is the 5th generation of Imaging & Vision technology from CEVA, and the IP vendor is bringing major improvements over the previous generation, CEVA-XM4. If you look at the right part of the Hardware box, you can identify the hardware accelerators (HWA), namely CDNN, De-Warp and 3rd party HWA. Implementing well-known, repetitive tasks in fixed hardware is a very good way to optimize performance, since it frees the DSP to run other tasks, and to reduce power consumption, since dedicated H/W will always be more power efficient than any processor running the same task. Those who remember the digital signal processing used for the wireless phone baseband will recall that the Viterbi decoding algorithm initially ran on a (TI) DSP, but the task moved very quickly to an HWA. This is the same principle, applied to imaging and vision technology.

Scatter-gather capability: CEVA-XM6 can load or store vector elements from or into multiple memory locations in a single cycle, and is able to load values from 32 addresses per cycle. Scatter-gather not only boosts performance but also minimizes accesses to and from memory, which are known to severely impact power consumption.
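To make the idea concrete, here is a minimal Python sketch of what a gather and a scatter do. It is only a software model of the concept, not CEVA code: the function names, the toy memory and the cycle-count helper are all hypothetical, and only the figure of 32 addresses per cycle comes from the text.

```python
# Illustrative software model of scatter-gather; names are hypothetical.
VECTOR_WIDTH = 32  # addresses served per cycle, the figure quoted in the text


def gather(memory, addresses):
    """Collect values from arbitrary (non-contiguous) addresses into one vector."""
    if len(addresses) > VECTOR_WIDTH:
        raise ValueError("a single gather handles at most VECTOR_WIDTH addresses")
    return [memory[a] for a in addresses]


def scatter(memory, addresses, values):
    """Write one vector of values back to arbitrary addresses."""
    if len(addresses) != len(values):
        raise ValueError("need one value per address")
    for a, v in zip(addresses, values):
        memory[a] = v


def cycles_needed(num_elements, per_cycle):
    """Ceiling division: cycles to fetch num_elements at per_cycle per cycle."""
    return -(-num_elements // per_cycle)


memory = list(range(100, 200))               # toy memory: address i holds 100 + i
column = gather(memory, [3, 17, 42, 99])     # four scattered reads, one operation
scatter(memory, [0, 1], [7, 8])              # two scattered writes, one operation
```

In this model, fetching 64 scattered elements takes `cycles_needed(64, VECTOR_WIDTH)` = 2 gather operations rather than 64 individual loads, which is where both the speed and the memory-access savings would come from.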

While scatter-gather boosts performance whatever the application, the Sliding-Window data processing mechanism is dedicated entirely to imaging. The principle is to take advantage of pixel overlap in image processing by reusing the same data to produce multiple outputs. Implementing the Sliding-Window mechanism significantly increases the DSP core's processing capability, and it also reduces power consumption and saves external memory bandwidth. One of the challenges of machine learning on neural networks is to reduce bandwidth consumption and computing bottlenecks. That’s why implementing techniques like scatter-gather and sliding-window is crucial for bringing machine learning to mass market applications, as these require low cost, low power solutions.
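The data-reuse principle can be sketched in a few lines of Python. This is an illustration under simple assumptions (a 1-D row of pixels and a 3-tap filter), not a description of the XM6 hardware; the function names and the load counters are hypothetical and exist only to show how overlap between neighbouring windows cuts memory traffic.

```python
def filter_naive(row, taps):
    """Reload every pixel under the window for each output."""
    k = len(taps)
    loads = 0
    out = []
    for i in range(len(row) - k + 1):
        window = row[i:i + k]            # k fresh loads per output
        loads += k
        out.append(sum(p * t for p, t in zip(window, taps)))
    return out, loads


def filter_sliding(row, taps):
    """Keep the window locally and load only the one new pixel per step."""
    k = len(taps)
    if len(row) < k:
        return [], 0
    window = list(row[:k])               # initial fill: k loads
    loads = k
    out = [sum(p * t for p, t in zip(window, taps))]
    for i in range(k, len(row)):
        window.pop(0)                    # oldest pixel leaves the window
        window.append(row[i])            # a single new load
        loads += 1
        out.append(sum(p * t for p, t in zip(window, taps)))
    return out, loads


row = [1, 2, 3, 4, 5, 6, 7, 8]
taps = [1, 2, 1]
naive_out, naive_loads = filter_naive(row, taps)
sliding_out, sliding_loads = filter_sliding(row, taps)
```

Both versions produce the same six outputs, but the naive one performs 18 pixel loads and the sliding one 8. The gap widens with larger filters and 2-D windows, since more of each window overlaps its neighbour.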

As of today, CEVA has implemented 512 MACs (16×16) as hardware accelerators, as well as many of the convolutional layers (Normalization, Pooling, etc.) required by the CDNN, and plans to implement even more layers in the future. How efficient is the CEVA-XM6 architecture? Just consider that MAC utilization is greater than 95%, and you realize that CEVA-XM6 has been optimized to the maximum.

To answer the initial question, we can say yes, the machine learning technology has been made available to the mass market, targeting Autonomous Driving, Sense and Avoid Drones, Virtual and Augmented Reality, Smart Surveillance, Smartphones, Robotics and More.

By Eric Esteve from IPNEST


Intel Altera FPGA at the heart of an autonomous Audi A8

by Claudio Avi Chami on 10-02-2016 at 4:00 pm

Audi announced its piloted driving technology at CES 2015. The Audi Prologue includes the Advanced Driver Assistance System Platform (zFAS), co-developed with TTTech. The zFAS board is based on four devices: an Nvidia K1 processor, an Infineon Aurix processor, Mobileye’s EyeQ3 for vision processing, and an Altera Cyclone V FPGA, which provides sensor fusion (combining data from multiple sensors in the vehicle for highly reliable object detection) and the Deterministic Ethernet communications used to transport high-bandwidth data within the vehicle.

The zFAS board receives and processes data from:

  • Ultrasonic sensors around the car
  • Front and rear radars
  • Top view camera
  • Front laser scanner
  • Wide angle front camera

The board also drives actuators to control:

  • Steering wheel
  • Gear
  • Accelerator
  • Etc.

At ASDF 2015 (Altera SoC Developer Forum, now renamed ISDF since Altera was acquired by Intel), TTTech commented on various aspects of the Piloted Driving technology that would be available to the public during 2017.

Importance of Autonomous Driving

Autonomous driving will:

  • Improve safety – sensor information and processing could avoid up to 40% of today’s accidents
  • Liberate the driver from monotonous driving, e.g. commuting during rush hours
  • Provide new ways of mobility for people that cannot drive, deliveries, car pools, etc.

Key Technical aspects

The Piloted Driving system must provide:

  • Fail-safe operation – Even if the system is an assistant, it cannot be guaranteed that the human driver can take control immediately. Upon a single component failure, the system must still be able to take the car to a secure stop.
  • Integration of safety devices with high performance devices. Processing devices share a Gigabit Ethernet backbone with real time messaging capabilities (TSN – Time Sensitive Network).
  • SW integration of the applications and operating systems running on the diverse platforms.
  • Fast deployment through the use of readily available Altera IPs and custom IPs.

Additional information:
TTTech announcement
Altera press release
Audi details piloted driving technology

My blog: FPGA Site

Other entries from me:
Soc FPGA for IoT Edge Computing
FPGAs at Deep Machine Learning


100 Million Miles Per Hour!

by Kevin Kostiner on 10-02-2016 at 12:00 pm

Back When We Loved Discovery
As anyone who reads and follows my blog posts will know, I’m a believer in innovation. It’s what drives my passion for the Internet of Things. That interest started when I was an “Apollo” kid during the 1960’s and 1970’s. Those decades offered a very different landscape for creativity, exploration and discovery!

There was a time back in the 1960’s and 70’s when the United States and much of the world had a passion for discovery. President Kennedy’s famous challenge to put a man (person today) on the moon and get him back safely by the decade’s end (1960’s) ignited the fires for innovation and discovery like never before! From our rudimentary knowledge about space that existed on the day Kennedy uttered those words to taking that “one small step for man, one giant leap for mankind” on July 20, 1969, scientists, engineers, mathematicians, visionaries, artists, physicists, hardware engineers, construction teams, welders, architects and so many more people came together to…..create!!

But We Lost Our Way
Unfortunately, after the Apollo program was scrapped, we lost a lot of that passion for challenging ourselves and seeking out the new and unknown. We also lost a uniting purpose that brought people together from around the world.

When Neil Armstrong made that first footprint in the lunar dust, it did not matter who you were, what you worshipped, how your hair looked or the color of your skin. Everyone, everywhere was as one praying, cheering and nail biting every second as Armstrong’s foot inched towards lunar touchdown! The greatest surge of human electricity in our history occurred on that day. We have not experienced such a moment since!

They Had Nothing Better To Do?
In 2006, the actions of the International Astronomical Union (IAU) in downgrading Pluto from “planet” to “dwarf planet” exemplified the complete lack of passion for discovery that now existed. Made up of “a collection of professional astronomers, at the PhD level and beyond”, this odd group decided that the planet we all grew up with as the lonely monitor of our outer solar system had failed to clear its neighborhood of other objects and so could no longer be called a planet.

So this group of egotistical “PhD’s and beyond” decided to inform us all that our childhoods were based on a lie! Somehow a timeout was not enough punishment for Pluto’s failure to clean up its neighborhood. Imagine what that group would have done to you if they had seen your bedroom!!!

For my part, I never accepted these Buzz Lightyear wannabes’ pronouncement. Hence I proudly refer to Pluto as a planet. Hopefully all of you will agree with me. If not, feel free to “un-friend” me on LinkedIn.

Science Is Amazing!

Depending upon the Earth’s position, Pluto is an astounding 2.66 – 4.67 billion miles away from us. On average, Pluto is 3.67 billion miles from the sun. Considering a flight from New York to LA is about 3000 miles, Pluto’s really, really, really far away. Consider this another way: Light travels at 186,000 miles per second. So the limited amount of sunlight that bathes the distant planet is over 5 hours old when it arrives. In other words, it takes sunlight over 5 hours to travel those 3.67 billion miles!
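The light-travel figure is easy to check. The short Python calculation below uses only the numbers quoted above (186,000 miles per second and the 2.66, 3.67 and 4.67 billion mile distances); the constant and function names are just illustrative.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_000     # rounded figure used in the text
SECONDS_PER_HOUR = 3600


def light_travel_hours(distance_miles):
    """Hours light needs to cover distance_miles at the rounded speed above."""
    return distance_miles / SPEED_OF_LIGHT_MILES_PER_S / SECONDS_PER_HOUR


sun_to_pluto_hours = light_travel_hours(3.67e9)    # average Sun-Pluto distance
nearest_hours = light_travel_hours(2.66e9)         # Earth-Pluto at its closest
farthest_hours = light_travel_hours(4.67e9)        # Earth-Pluto at its farthest
```

The average Sun-to-Pluto figure works out to roughly five and a half hours, and a radio signal between Earth and Pluto needs roughly four to seven hours each way depending on where the two are in their orbits.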

July of 2015 (46 years since Armstrong’s famous step!) was a truly amazing month. The New Horizons spacecraft, after speeding through our solar system for 9 ½ years at the astounding speed of 36,373 mph finally pulled up alongside our distant cousin Pluto. Getting as close as 7500 miles (less than a speck of dust in lunar distance), New Horizons began taking what would become the most amazing photographs of any planet taken since those first photos of the Apollo 11 Astronauts on the moon in 1969.

Incredible pictures started appearing at NASA and were shared around the world. The buzz and analysis that followed threw into question many of the assumptions that had been made about Pluto and the very origin of our solar system. That’s called discovery and it was based on innovation! And that innovation began with Kennedy’s promise, setting us on a path that put the New Horizons craft at Pluto 46 years after lunar touchdown. After the New Horizons spacecraft beamed back those amazing close up photos of the blue planet, the “esteemed” group at the IAU should have said “sorry, we were wrong…Pluto really is a planet.” But they did not!

Rediscovering “Discovery”
Although we became a bit sidetracked over the past 40 years, and groups like the IAU wasted their time on opinions rather than science, the lure of discovery is a powerful thing. The recent announcement by Yuri Milner is one great example that discovery is still alive and well…just a bit hidden at times.

36,373 miles per hour and 9 ½ years! How about leveraging some new innovative thinking and changing those two numbers to 100 million miles per hour and 2 days! That’s Yuri Milner’s plan and he’s backing it up with an initial investment of $100 million to help send miniature probes into deep space, to help bring back to all of us that world-bonding feeling of discovery and exploration!
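A quick back-of-the-envelope check of those two pairs of numbers, sketched in Python. It assumes constant speed along the whole path (no acceleration phase, no curved trajectory), which is a simplification, and the function names are hypothetical; the speeds, the 9 ½ years and the 4.67 billion mile figure are the ones quoted in this post.

```python
HOURS_PER_YEAR = 24 * 365.25


def distance_covered_miles(speed_mph, years):
    """Distance travelled at a constant speed over the given number of years."""
    return speed_mph * HOURS_PER_YEAR * years


def travel_days(distance_miles, speed_mph):
    """Days needed to cover distance_miles at a constant speed."""
    return distance_miles / speed_mph / 24


# New Horizons: 36,373 mph for 9.5 years
new_horizons_miles = distance_covered_miles(36_373, 9.5)

# The same path, and the farthest Earth-Pluto distance, at 100 million mph
fast_days_same_path = travel_days(new_horizons_miles, 100e6)
fast_days_farthest = travel_days(4.67e9, 100e6)
```

Under these assumptions New Horizons covered about 3 billion miles, and a probe at 100 million miles per hour would make the trip in somewhere between about 1.3 and 2 days, in line with the round "2 days" figure.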

The Future Looks Bright

Back on that day in July 1969 when Neil Armstrong made the giant step, people literally took Polaroid photographs of their TV screens to capture the moment. Those small, blurry, instant black and white photos (an original one is this blog post’s photo) became the symbol of how gripped the nation, and the world, was by the sense of discovery represented by that great achievement.

Now we have Yuri Milner putting forth a new commandment that we will find a way to develop and send probes into deep space at incredible speeds over the next few decades to discover what lies “beyond”. And he is doing it at the very time that our entire technology world is going through a massive evolution with the convergence of technologies into the Internet of Things. He could not have chosen a more perfect time!

So what will happen 20 or 30 years from now when the first photographs of a sun or planet from a star system outside of our own are received for the first time? Will we all be gathered around an OLED holographic TV or wrist-projected iPad in a world-unifying moment of discovery and excitement, as occurred with Neil Armstrong’s famous step?

There were far fewer distractions in 1969 and even fewer ways to communicate. By 2040 or 2050 the proliferation of technologies will lead to an exponentially greater number of things to distract us than we have even today. So it won’t be easy for everyone to be pulled together by one singular event. BUT while distractions will be a challenge, the human drive to connect and be a part of something larger will always be there. Nothing brings us together more than human exploration and discovery.

So 20 or 30 years from now we will be glued to our devices. Nail biting, cheering, crying and ready to snap those “Polaroids” on that day when Milner’s bots “phone home” for the first time. Another giant step indeed!!


Scalable Infrastructure for Digital Businesses

by Sudeep Kanjilal on 10-02-2016 at 7:00 am

Building digital businesses is tough. The run-time changes rapidly (browser – apps – bots), and standards for the digital architecture/stack get refined constantly. The pace of innovation is accelerating due to the massive war chests of the top digital players like Google, Facebook, Apple and Amazon. For the Fortune 500 consumer-facing giants to catch up, they have to upgrade their technology stack and incorporate new technologies, while living with their existing ‘legacy’ technologies. This is a tricky balancing act, as it is not just mashing up seemingly incompatible technologies but really combining the different sets of capabilities that these technologies enable.

For example, how do you define your ‘security architecture’ for your PII (regulated) data when you have multiple data warehouses and multiple data centers, all running on different generations of technology? A typical Fortune 500 firm will have at least 4 versions of its infrastructure stack, with the oldest one typically running technology that is no longer supported. Start-ups, of course, do not face this issue, since they get to start with a clean slate. However, karma eventually catches up: once start-ups become successful and big, and start acquiring other companies, their technology stack eventually resembles a kitchen sink!

Strengthen the foundations
To address the inevitable technology debt, which deepened over the past 8 years due to cost cutting induced by the Great Recession, smart Fortune 500 firms are starting from the basics: strengthening the foundation of the overall technology stack. They are focusing on core infrastructure. It is the correct choice, as a house is only as strong as its foundation.

Digital businesses are inherently susceptible to a fly-wheel effect, which leads to rapid growth running into millions of MAUs if done right. One can imagine the gusher of data that this model will generate, and the rapid (real-time) analytics and decision-making needed to finely calibrate user interaction and overall user experience, without even bringing in higher-order capabilities like machine learning. And none of this will scale on a complex mix of old infrastructure elements spanning 4 generations.

So, smart firms are putting in place governance policies for infrastructure refresh that require any new infrastructure build or refresh to follow these 3 steps:

  1. Define the overall architecture that is scalable
  2. Define the infrastructure in software instead of specialized hardware/appliances. Think stack – IaaS, PaaS and SaaS
  3. Go Hybrid – explicitly define infrastructure that will ALWAYS have multiple generations of technologies, as that will always be the reality.

    Hybrid or Public
    The last point is critical, but cannot be taken in isolation. Too often I have witnessed decisions being taken at these Fortune 500 firms purely in binary terms: cloud vs data centers, public cloud vs private cloud vs hybrid cloud, etc. It’s critical to look at the infrastructure not just in terms of technology choices within each layer of the stack (Win vs Linux for servers, Oracle vs SQL for DB servers, etc.) but in terms of the overall service that this stack is enabling, and then define the architecture requirements.

    Example: for a large-scale consumer application of a typical bank and/or insurance firm, the Big Data would run on bare metal (dedicated servers), the customer backend on public cloud, and the web-scale front end (or perhaps a reverse proxy server) on public cloud, with security in all stack layers. To the end customer, all this is transparent: it just works, and scales infinitely.

    The ultimate aim: speed up innovation, with idea to production in 30 mins and code pushed to production 100 times a week. Only then do you get a digital business!


    Will TSMC be alone at 10nm and 7nm?!?!?

    by Daniel Nenni on 10-01-2016 at 7:00 am

    Now that the dust has settled let’s talk about the recent TSMC OIP Ecosystem Forum. This was the 6th annual OIP, which hosts more than 1,000 attendees from TSMC’s top customers and partners. Presenting this year were TSMC VP and CTO Dr. Jack Sun, TSMC VP of R&D Dr. Cliff Hou, and ARM EVP of Incubation Businesses Dr. Dipesh Patel. First let’s talk about some of the key manufacturing milestones that were mentioned.

    Also read: Top 5 Highlights from the 2016 TSMC Open Innovation Platform Forum

    TSMC announced that the new 16FFC process is currently in high volume manufacturing (HVM). As we know from the recent iPhone 7 teardown, the A10 SoC inside is TSMC 16FFC. Additionally, the new TSMC InFO packaging, which Apple also uses for the A10, is in HVM. The iPhone 7 teardown also showed that the majority of the chips inside the iPhone 7 are manufactured by TSMC, including the Intel modem (TSMC 28nm) and the QCOM modem (TSMC 20nm). This represents a significant upside for TSMC in Q3 and Q4 of this year, so get ready for some very upbeat investor calls.

    TSMC announced that 10nm is ahead of schedule and will enter HVM in Q4 2016 versus Q1 2017. This supports my belief that the new Apple iPad A10x (to be announced next month) uses TSMC 10nm and will be the fastest SoC on the market, absolutely. I also believe that the next iPhone SoC (iPhone 8) will use TSMC 10nm exclusively.

    The chatter in the conference hall from people who would know was that due to unexpected yield challenges involving other 10nm processes, TSMC may be running unopposed at 10nm for the next 3-4 quarters. If so, this is huge for TSMC and the TSMC Ecosystem!

    TSMC announced that 7nm is ahead of schedule and will start risk production in Q1 2017 meaning HVM will be Q4 2017 (just in time for the iPad A11x SoC). TSMC 7nm will use the same fabs as 10nm so the ramp will be predictably fast. This leaves TSMC again unopposed at 7nm for 1-2 years so congratulations to the hardworking people of TSMC.

    And congratulations to TSMC partners and customers who will now lead the industry in semiconductor process development and will deliver industry leading chips for the rest of this decade. Just to name a few: Apple, ARM, Broadcom, MediaTek, Nvidia, Xilinx, etc…

    Bottom line: TSMC has the strongest roadmap I have ever seen and will continue to dominate the foundry business for years to come (déjà vu 28nm).

    The other interesting thing to note is that the 30 technical OIP presentations made by partners and customers are now available via TSMC Online:

    Held Sept. 22nd, 2016 at the Santa Clara Convention Center, the fifth TSMC’s Open Innovation Platform Ecosystem Forum was attended by more than 1,000 TSMC customers and the Open Innovation Platform design ecosystem partners from EDA, IP and Design Services. The Forum brought TSMC’s design ecosystem member companies together to share with our customers real-case solutions to customers’ design challenges and success stories of best practice in TSMC’s design ecosystem. In an adjacent Partner Pavilion, 50 design partner companies staffed booths, showcased their products and services, and took questions throughout the day from leading designers. 30 technical papers were presented during the forum, showing real solutions and how the complete OIP ecosystem achieves faster time to market.

    Live EDA Technical Presentations:

    Live IP Technical Presentations:

    Live Design Service Technical Presentations

    Print-Only Technical Presentations


    CCIX shows up in ARM CMN-600 interconnect

    by Don Dingee on 09-30-2016 at 4:00 pm

    All the hubbub about FPGA-accelerated servers prompts a big question about cache coherency. Performance gains from external acceleration hardware can be wiped out if the system CPU cluster is frequently taking hits from cache misses after data is worked on by an accelerator.

    ARM’s latest third-generation CoreLink CMN-600 Coherent Mesh Network interconnect announcement this week had a bunch of goodness about higher performance ARMv8 support. The interconnect runs at up to 2.5 GHz and cuts latency in half, resulting in five times the throughput of the prior CoreLink generation. It supports from 1 to 128 Cortex-A cores, and uses AMBA 5 CHI interfacing.


    Also part of the latency/throughput equation is the new CoreLink DMC-620 Dynamic Memory Controller, with integrated TrustZone security and support for up to 8 channels of DDR4-3200, including 3D stacked DRAM. The combination definitely helps CPU-centric performance. However, from other news I’ve covered this week, I’ll reiterate: viewing cache from only a CPU-centric perspective is an outdated idea in a world of heterogeneous SoCs.

    ARM is backing a different approach to solve the system-centric coherency issue. A few months ago, seven companies – AMD, ARM, Huawei, IBM, Mellanox, Qualcomm, and Xilinx – announced a rather cryptic initiative. It’s called Cache Coherent Interconnect for Accelerators, or CCIX. Everything we’ve known about that initiative is on a single web page, until now.

    A big differentiator in Xilinx Zynq versus other attempts at FPGA SoCs so far is its cache coherency – not perfect, an early proprietary implementation, but still beats the heck out of having an FPGA sitting astride of a CPU on an external bus. (We can only assume Intel and Altera have learned the ‘Stellarton’ lesson, and we’ll see how in future products.)

    CCIX only says the solution is a “driver-less and interrupt-less” usage model, offering orders of magnitude improvement in application latency. Presumably, the improvement has to do with connecting coherent agents and sharing cache updates more efficiently. In the ARM CMN-600 scheme, that’s part of what the Agile System Cache is supposed to do. ARM is also putting a lot of energy into cache-coherent GPU IP, and has its NIC-450 IP to help I/O subsystems.

    But the sleeper in all this is CCIX support. What is likely developing here are three circles of influence: ARM and its ecosystem partners, IBM (see a new SemiWiki article on POWER9), and Xilinx in one camp; NVIDIA with NVLink also having a connection to IBM POWER8 and POWER9; and Intel and Altera in the other. CCIX is open, and presumably someone else could jump in on that side. Intel and Altera could proceed with a proprietary solution, or surprise me and a lot of other folks by joining in. There’s also one other processor architecture out there – RISC-V – that could weigh in on the CCIX side soon. (I’ll have some more thoughts on that in an upcoming piece on my conversation this week with SiFive.)

    A full CCIX spec is supposedly due before year-end, at which point I’d expect ARM to be a lot more specific about what they are doing in the CMN-600 IP. It’s interesting that only AMD, Huawei, and Qualcomm are onboard with CCIX so far, leaving one to wonder what the other ARM-based server types like Broadcom and Cavium are up to as far as cache coherency. As for other NoC vendors, Arteris has its proprietary NCore, and NetSpeed Systems has alluded to CCIX on slides but nothing official yet.

    For more on what little ARM did say about the CMN-600, the full press release:

    ARM System IP boosts SoC performance from edge to cloud

    As with any pre-approved specification, what constitutes CCIX support right now may be subject to change down the road. Given the intensity at which open servers and other applications are being explored, betting on an open specification for connecting FPGAs coherently to SoCs is smart money.


    Meet the POWER9 Chip Family

    by Alan Radding on 09-30-2016 at 12:00 pm

    When you looked at a chip in the past you primarily were concerned with two things: the speed of the chip, usually expressed in GHz, and how much power it consumed. Today the IBM engineers preparing the newest POWER chip, the 14nm POWER9, are tweaking the chip for the different workloads it might run, such as cognitive or cloud, for different deployment options, such as scale-up or scale-out, and for a host of other attributes. EE Times described it in late August from the Hot Chips conference, where it was publicly unveiled.

    IBM POWER9 chip

    IBM describes it as a chip family but maybe it’s best described as the product of an entire chip community, the Open POWER Foundation. Innovations include CAPI 2.0, New CAPI, Nvidia’s NVLink 2.0, PCIe Gen4, and more. It spans a range of acceleration options from HSDC clusters to extreme virtualization capabilities for the cloud. POWER9 is not just about high speed transaction processing; IBM wants the chip to interpret and reason, ingest and analyze.


    POWER has gone far beyond the POWER chips that enabled Watson to (barely) beat the human Jeopardy champions. Going forward, IBM is counting on POWER9 and Watson to excel at cognitive computing, a combination of high speed analytics and self-learning. POWER9 systems should not only be lightning fast but get smarter with each new transaction.


    For z System shops, POWER9 offers a glimpse into the design thinking IBM might follow with the next mainframe, probably the z14, which will need comparable performance and flexibility. IBM already has set up the Open Mainframe Project, which hasn’t delivered much yet but is still young. It took the OpenPOWER group a couple of years to deliver meaningful innovations. Stay tuned.


    The POWER9 chip is incredibly dense (below). You can deploy it as either a scale-up or scale-out architecture. You have a choice of one version for two-socket servers with 8 DDR4 ports or another for multiple chips per server with buffered DIMMs.

    IBM POWER9 silicon layout


    IBM describes the POWER9 as a premier acceleration platform. That means it offers extreme processor/accelerator bandwidth and reduced latency; coherent memory and virtual addressing capability for all accelerators; and robust accelerated compute options through the OpenPOWER community.


    It includes State-of-the-Art I/O and Acceleration Attachment Signaling:

    • PCIe Gen 4 x 48 lanes – 192 GB/s duplex bandwidth
    • 25G Link x 48 lanes – 300 GB/s duplex bandwidth

    And robust accelerated compute options based on open standards, including:

    • On-Chip Acceleration—Gzip x1, 842 Compression x2, AES/SHA x2
    • CAPI 2.0—4x bandwidth of POWER8 using PCIe Gen 4
    • NVLink 2.0—next generation of GPU/CPU bandwidth and integration using 25G Link
    • New CAPI—high bandwidth, low latency and open interface using 25G Link
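
    The duplex bandwidth figures quoted for the two 48-lane links can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not IBM's own calculation. It assumes 16 Gb/s per lane for PCIe Gen 4 (its 16 GT/s signaling rate) and 25 Gb/s per lane for the 25G Link, and it ignores encoding and protocol overhead. The function name is hypothetical.

```python
def duplex_bandwidth_gbytes(lanes: int, gbits_per_lane: float) -> float:
    """Aggregate raw bandwidth in GB/s, counting both directions.

    Assumption: every lane carries gbits_per_lane in each direction, and
    encoding/protocol overhead is ignored.
    """
    per_direction_gbits = lanes * gbits_per_lane
    per_direction_gbytes = per_direction_gbits / 8  # 8 bits per byte
    return 2 * per_direction_gbytes  # duplex: both directions at once


# Assumed per-lane rates: 16 Gb/s for PCIe Gen 4, 25 Gb/s for the 25G Link.
pcie_gen4 = duplex_bandwidth_gbytes(lanes=48, gbits_per_lane=16)
link_25g = duplex_bandwidth_gbytes(lanes=48, gbits_per_lane=25)

print(f"PCIe Gen 4 x48: {pcie_gen4:.0f} GB/s duplex")
print(f"25G Link x48:   {link_25g:.0f} GB/s duplex")
```

    Under those assumptions the results match the 192 GB/s and 300 GB/s listed for the two links.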

    In scale-out mode it employs direct attached memory through 8 direct DDR4 ports, which deliver:

    • Up to 120 GB/s of sustained bandwidth
    • Low latency access
    • Commodity packaging form factor
    • Adaptive 64B / 128B reads

    In scale-up mode it uses buffered memory through 8 buffered channels to provide:

    • Up to 230 GB/s of sustained bandwidth
    • Extreme capacity – up to 8TB / socket
    • Superior RAS with chip kill and lane sparing
    • Compatible with POWER8 system memory
    • Agnostic interface for alternate memory innovations

    POWER9 was publicly introduced at the Hot Chips conference in late August. Commentators writing in EE Times noted that POWER9 could become a breakout chip, seeding new OEM and accelerator partners and rejuvenating IBM’s efforts against Intel in high-end servers. To achieve that kind of performance IBM deploys large chunks of memory, including a 120 Mbyte embedded DRAM in shared L3 cache, while riding a 7 Tbit/second on-chip fabric. POWER9 should deliver as much as 2x the performance of the POWER8 or more when the new chip arrives next year, according to Brian Thompto, a lead architect for the chip, in published reports.


    As noted above, IBM will release four versions of POWER9. Two will use eight threads per core and 12 cores per chip geared for IBM’s Power virtualization environment; two will use four threads per core and 24 cores/chip targeting Linux. Both will come in two versions — one for two-socket servers with 8 DDR4 ports and another for multiple chips per server with buffered DIMMs.
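
    The four versions are the cross product of two core configurations and two memory attachments. The sketch below enumerates them to make that explicit. The class and field names are hypothetical labels for this illustration, not IBM part names, and the configuration values are taken only from the description above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Power9Variant:
    """One of the four described versions (illustrative labels only)."""
    threads_per_core: int
    cores_per_chip: int
    target: str
    memory: str

    @property
    def threads_per_chip(self) -> int:
        return self.threads_per_core * self.cores_per_chip


# Two core configurations, as described in the text.
core_configs = [
    (8, 12, "Power virtualization environment"),
    (4, 24, "Linux"),
]
# Two memory/socket options, as described in the text.
memory_configs = [
    "two-socket servers, 8 DDR4 ports",
    "multiple chips per server, buffered DIMMs",
]

variants = [
    Power9Variant(threads, cores, target, memory)
    for (threads, cores, target), memory in product(core_configs, memory_configs)
]

for v in variants:
    print(f"{v.cores_per_chip} cores x {v.threads_per_core} threads "
          f"= {v.threads_per_chip} threads/chip | {v.target} | {v.memory}")
```

    Both core configurations work out to the same number of hardware threads per chip, so the choice is about thread granularity per core, not total thread count.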


    The diversity of choices, according to Hot Chips observers, could help attract OEMs. IBM has been trying to encourage others to build POWER systems through its OpenPOWER group that now sports more than 200 members. So far, it’s gaining most interest from China where one partner plans to make its own POWER chips. The use of standard DDR4 DIMMs on some parts will lower barriers for OEMs by enabling commodity packaging and lower costs.


    DancingDinosaur is Alan Radding, a veteran information technology analyst and writer. Please follow DancingDinosaur on Twitter, @mainframeblog. See more of his IT writing at technologywriter.com and here.


    Low power physical design in the age of FinFETs

    by Beth Martin on 09-30-2016 at 7:00 am

    Low power is now a goal for most digital circuit designs. The aims are to reduce costs for packaging, cooling, and electricity; to increase battery life; and to improve performance without overheating. I talked to the experts on physical design for ultra-low power at Mentor Graphics recently about the challenges to P&R tools and the techniques used during design, particularly to control dynamic power consumption in FinFETs. Here’s what David Chinnery, R&D lead for low power, and Arvind Narayanan, product marketing architect, had to say.

    Low-power design starts at the architectural level and continues through implementation. The challenge in implementation is to create, optimize, and verify the physical layout so that it meets the power budget along with traditional timing, performance, signal integrity (SI), and area goals. Physical design tools must find the best trade-offs when implementing a variety of low-power techniques. With the multiple layers of complexity in advanced technology node designs, power management requires a larger bag of tricks than ever.

    Chinnery pointed out that low power as a design goal is nothing new. The basic low-power techniques, such as clock gating and use of multiple voltage thresholds (multi-Vt), are well-established and supported by existing tools. What’s new is the added complexity of meeting that goal at advanced process nodes and the increasing size and complexity of designs. There are a huge number of corner, mode, and power state scenarios that have conflicting requirements for power, timing, SI, manufacturability, and area. Additionally, with slower migration to new process technology nodes, there has been much greater emphasis on achieving the best possible design results using EDA tools with state-of-the-art low power techniques.

    Narayanan added that FinFETs use significantly less total power, but the dynamic power component is much higher compared to the leakage power. The FinFET’s 3D gate around the transistor drain-source channel greatly reduces leakage due to better on/off control of the electric field in the channel. However, the 3D FinFET gate has higher capacitance than the planar MOSFET gate structure. Thus dynamic power needs to be considered during optimization and throughout the physical design flow.
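
    The link between gate capacitance and dynamic power follows from the standard first-order switching power relation, P = α·C·V²·f. The sketch below uses that textbook formula. It is not taken from the Mentor whitepaper, and every number in it is a made-up illustrative value, not measured FinFET data.

```python
def dynamic_power_watts(activity: float, capacitance_farads: float,
                        vdd_volts: float, freq_hz: float) -> float:
    """First-order switching power: P = alpha * C * Vdd^2 * f."""
    return activity * capacitance_farads * vdd_volts ** 2 * freq_hz


# Hypothetical baseline: 10% switching activity, 1 nF switched capacitance,
# 0.8 V supply, 1 GHz clock.
base = dynamic_power_watts(0.1, 1.0e-9, 0.8, 1.0e9)

# Same design with 20% more switched capacitance (illustrative only).
higher_cap = dynamic_power_watts(0.1, 1.2e-9, 0.8, 1.0e9)

# Lowering the supply voltage offsets the increase, because power scales
# with the square of Vdd.
lower_vdd = dynamic_power_watts(0.1, 1.2e-9, 0.7, 1.0e9)

print(f"capacitance +20%: power x{higher_cap / base:.2f}")
print(f"then Vdd 0.8 V -> 0.7 V: power x{lower_vdd / base:.2f}")
```

    Dynamic power scales linearly with switched capacitance and activity, which is why techniques such as clock gating (lower activity) and shorter wires on high-activity nets (lower capacitance) target this term directly.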

    So what can be done during physical design to control power?


    First, the tools must have the capacity to handle very large designs, 100+ million instances, with reasonable runtimes. Second, the tools must support the full bag of low power tricks. By “tricks” they mean techniques to reduce power like using multi-Vt, gate sizing, clock gating, multi-corner/multi-mode (MCMM) power optimization, pin swapping, register clumping, remapping, and power-density driven placement. But Chinnery noted that designers also need full support for UPF 3.0 directives, multi-voltage flows, dynamic voltage and frequency scaling (DVFS) to handle varying supply voltages and clock frequencies, and the capability to handle special cells such as level shifters, isolation cells, and multi-threshold CMOS (MTCMOS) power gates. Deployed correctly, these advanced techniques and some additional secret-sauce optimization tricks result in power analysis reports that make everyone’s day a bit happier.

    Chinnery and Narayanan agreed that one of the key strengths of Mentor’s physical design tools is the native MCMM architecture that lets designers analyze and optimize the design for all corner/mode/power state scenarios concurrently. Whether the design uses advanced process nodes and FinFETs or not, the MCMM capability is vital.

    Also important for any low-power design, whether on a legacy node or the latest processes, is clock tree synthesis. Getting the best clock tree, said Narayanan, also depends on the ability to synthesize the clocks for multiple corners and modes concurrently in the presence of design and manufacturing variability, and in multi-voltage flows. He pointed to the importance of features like:

    • Composing single-bit registers to multi-bit registers to reduce their load on the clock distribution network, and decomposing from multi-bit registers where necessary to meet data-path timing constraints
    • Lowering leaf wire capacitance by register clumping and clock gate cloning/de-cloning
    • Reducing functional skew and clock skew across multiple corners with MCMM CTS
    • Useful skew to improve data-path timing and reduce power consumption
    • Improving clock gating coverage with additional fine-grained gating
    • Minimizing clock net switching power with smart clock gate placement to reduce wire length for high activity clock nets


    Using low-power CTS with MCMM optimization significantly reduces the number of buffers, skew, total negative slack (TNS) and worst negative slack (WNS), in addition to reducing dynamic power and area. The table shows some real customer data comparing a single-corner CTS implementation with a 9-corner CTS implementation for a single mode in a 9-corner design.
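
    To see why concurrent multi-corner analysis matters for WNS and TNS, consider the minimal sketch below. It is not Mentor's algorithm. It assumes WNS is the most negative endpoint slack (0 if nothing violates), TNS is the sum of all negative endpoint slacks, and the multi-corner result takes each endpoint's worst slack across corners. The corner names and slack values are invented for illustration.

```python
from typing import Dict, List, Tuple


def wns_tns(endpoint_slacks_ns: List[float]) -> Tuple[float, float]:
    """Return (WNS, TNS) for one list of endpoint slacks, in ns."""
    violations = [s for s in endpoint_slacks_ns if s < 0.0]
    wns = min(violations, default=0.0)  # most negative slack, or 0 if clean
    tns = sum(violations, 0.0)          # sum of all negative slacks
    return wns, tns


def mcmm_wns_tns(slacks_by_corner: Dict[str, List[float]]) -> Tuple[float, float]:
    """Assumed convention: take each endpoint's worst slack over all corners."""
    worst_per_endpoint = [min(vals) for vals in zip(*slacks_by_corner.values())]
    return wns_tns(worst_per_endpoint)


# Hypothetical slacks (ns) for three endpoints at three corners.
slacks_by_corner = {
    "slow":    [-0.25, 0.5, -0.5],
    "typical": [0.25, 0.75, 0.0],
    "fast":    [0.5, -0.125, 0.25],
}

print("typical corner only:", wns_tns(slacks_by_corner["typical"]))
print("all corners:        ", mcmm_wns_tns(slacks_by_corner))
```

    In this made-up data the typical corner alone looks clean, while the multi-corner view exposes violations at every endpoint. That is the kind of gap a single-corner CTS run cannot see or fix.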


    Another key part of the physical design flow is routing. For low-power designs, the router should follow the UPF power intent. This includes maintaining a single port of entry for boundary nets and respecting voltage island boundaries. The router handles secondary power connections for retention flip-flops, level shifters, and always-on buffers. Mentor’s router gets constant updates on MCMM timing and wire resistance and capacitance (RC), which it uses to find the optimal solution to meet power, timing, SI, manufacturability, and area constraints. It is also DFM-aware, so it accounts for the manufacturing issues that affect power (especially leakage power), such as variations in on-chip temperature and thickness.

    For many IC designs, low power is as important as timing. While you get the largest impact on power at the architectural level, Chinnery mentioned that a focus on low power throughout place and route can achieve a further 20% to 40% power reduction on some designs. Additionally, the place and route flow should remove the unpredictability from the physical implementation process that can result in late-stage surprises in power consumption. A blown power budget can affect the cost, performance, and time-to-market of low-power ICs.

    For all the details on low-power physical implementation, download the Mentor whitepaper, Low-Power Physical Design with the Mentor Place and Route System.


    Cadence DSPs float for efficiency in complex apps

    by Don Dingee on 09-29-2016 at 4:00 pm

    Floating-point computation has been a staple of mainframe, minicomputer, supercomputer, workstation, and PC platforms for decades. Almost all modern microprocessor IP supports the IEEE 754 floating-point standard. Embedded design, for reasons of power and area and thereby cost, often eschews floating-point hardware …