ARM has a storied history of announcing major architecture changes at conferences far in advance of product implementations to get their ecosystem moving. At Hot Chips 2016, their sights are set on revamping the ARMv8-A architecture for a new generation of server and high-performance computing parallelism with a preview of the Scalable Vector Extension (SVE).
ARM NEON, similarly previewed in 2004, was a response to Intel’s move to incorporate their MMX technology for SIMD into mobile chips, adding iwMMXt to the PXA270 processor. In desktop and server space, Intel drove several evolutions of Streaming SIMD Extensions (SSE) and then moved to their Advanced Vector Extensions (AVX), announced in 2008 and first shipped in Sandy Bridge in 2011. AVX currently reaches 512 bits in the AVX-512 variant in Knights Landing and Skylake server parts; AVX2 tops out at 256 bits.
To Intel’s credit, they have put extensive efforts into compiler technology that can deal with all the variants of SSE and AVX on various Intel processor families. Auto-vectorization, magically transforming sequential code for vector processing, is good in theory but often falls down unless the target hardware has exactly the right support. For example, code built with the -xAVX option and run on something predating Sandy Bridge stops with a fatal error. Intel came up with the -axAVX flag to generate both a baseline path (set to SSE2, SSE3, SSE4.1, SSE4.2, or another instruction set by a separate option) and an AVX-optimized path, with runtime selection based on processor support.
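The runtime selection idea can be sketched by hand in C. This is not how the Intel compiler implements -axAVX; it is only a minimal analogue, assuming GCC or Clang on x86 for the CPU probe (`__builtin_cpu_supports`). The names `sum_baseline`, `sum_wide`, `cpu_has_avx`, and `select_sum` are hypothetical, and `sum_wide` is a plain C stand-in for a real AVX kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: sum n floats. */
typedef float (*sum_fn)(const float *x, size_t n);

/* Baseline path: plain scalar loop that runs on any processor. */
float sum_baseline(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Stand-in for the AVX-optimized path: eight accumulators mimic the
   eight 32-bit lanes of a 256-bit register. A real build would use
   AVX instructions here; this version is portable C for illustration. */
float sum_wide(const float *x, size_t n) {
    float acc[8] = {0.0f};
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (size_t l = 0; l < 8; l++) acc[l] += x[i + l];
    float s = 0.0f;
    for (size_t l = 0; l < 8; l++) s += acc[l];
    for (; i < n; i++) s += x[i];   /* leftover elements */
    return s;
}

/* Probe the processor once. Assumption: GCC/Clang builtins on x86;
   everywhere else we report "no AVX" and fall back to the baseline. */
int cpu_has_avx(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return 0;
#endif
}

/* Pick the code path at run time based on processor support. */
sum_fn select_sum(void) {
    sum_fn chosen = cpu_has_avx() ? sum_wide : sum_baseline;
    assert(chosen != NULL);
    return chosen;
}
```

A caller would fetch the function pointer once with `select_sum()` and reuse it. The cost of this approach is the one the article points at: every new vector extension means another path to build, test, and dispatch.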
ARM NEON fell far behind in comparison, really having little reason to evolve for mobile needs. However, it is a new era, and ARM wants its product in a new generation of server-class platforms with different workloads. “Server” always needs an adjective for proper discussion; Intel’s lead in high-volume application servers is undisputed, but ARM wants “beach heads in key segments” per their slides. Telecom is one of those – see my prior posts on their OPNFV efforts – as well as IoT infrastructure and real-time analytics platforms.
HPC is the next area ARM wants to scout. ARM has quietly been watching the portability beast Intel created and considering how to get the performance benefits of vectorization without the software migraines. Vector length is part of the problem; picking a fixed length can lock in goodness for some applications but cause others not to run.
SVE is wide: 128 to 2048 bits. There is almost no overlap with NEON, which tops out at 128 bits. Instead, SVE has been created from the ground up for systems such as HPC. In an ARM tradition, where the ecosystem determines the best fit for the architecture, SVE supports both a vector length choice and a vector-length agnostic programming model that can adapt to the available vector length in the hardware. SVE also brings many other improvements aimed at smoothing out compiler vectorization.
It’s interesting how ARM has squeezed SVE into ARMv8-A. Some 75% of the A64 encoding space is already allocated, but SVE took just a quarter of the remaining 25% (about 6% of the total) with some creative use of predicated execution and attention to addressing modes.
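To make the vector-length agnostic model and predicated execution a bit more concrete, here is a small simulation in portable C. It is not SVE code and uses no ARM intrinsics; the function name `saxpy_vla`, the `vl_lanes` parameter (standing in for whatever vector length the hardware provides), and the byte-array predicate are all assumptions made for illustration. The point is only that the loop never hard-codes a width, and the final partial vector is handled by switching lanes off rather than by a separate scalar tail loop.

```c
#include <assert.h>
#include <stddef.h>

/* 2048 bits / 32-bit floats = 64 lanes at the widest SVE length. */
#define MAX_LANES 64

/* y[i] += a * x[i], written without assuming a vector length.
   vl_lanes is a hypothetical stand-in for the hardware vector
   length in 32-bit lanes (4 lanes = 128 bits, 64 lanes = 2048 bits). */
void saxpy_vla(float a, const float *x, float *y, size_t n,
               size_t vl_lanes) {
    assert(vl_lanes >= 4 && vl_lanes <= MAX_LANES);

    for (size_t i = 0; i < n; i += vl_lanes) {
        /* Build the predicate: a lane is active while i + lane < n. */
        unsigned char pred[MAX_LANES];
        for (size_t l = 0; l < vl_lanes; l++)
            pred[l] = (unsigned char)(i + l < n);

        /* Predicated operation: inactive lanes touch no memory,
           so the last partial vector needs no special tail loop. */
        for (size_t l = 0; l < vl_lanes; l++)
            if (pred[l]) y[i + l] += a * x[i + l];
    }
}
```

Calling `saxpy_vla` with `vl_lanes` set to 4 or to 64 gives the same result for the same input, which is the property that lets one binary run on hardware with different vector widths.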
Fujitsu has been collaborating with ARM on the Post-K supercomputer and compiler technology supporting SVE. As we see from the Intel efforts, compilers and libraries will be the make-or-break aspect for SVE, and the uptake in Linux distributions with SVE-enabled libraries will be an area to watch.
Nigel Stephens gave the talk at Hot Chips and wrote a blog post with more details on the innovations in SVE:
There is also a good Fujitsu overview of Post-K:
I think ARM recognizes very well they have a huge mountain to climb given Intel’s head start in server-class and HPC processing. They’ve clearly learned from the ARMv7 “we’re in servers now” debacle, and are taking steps in both ARMv8-A architecture and ecosystem development to start paving the path with niche wins.
SVE is a huge step forward, and ultimately will probably have a much bigger impact than NEON for ARM. It really ups the ante in terms of vector width and the potential compiler technology that could support a wide range of hardware. Most of the HPC work will probably be in C/C++ or Fortran. As an IoT wonk, I’d also be curious how other languages used in distributed systems, like Lua and Rust, might be able to take advantage of vectorization with SVE.
Was anyone at Hot Chips who has further insight on this announcement, or just general thoughts on how SVE might stack up from an HPC or compiler perspective?