I was curious: did anything "pioneered" in Itanium later get re-used or folded into x86's technical design in a way that meaningfully improved x86 (IP, techniques, etc.)?
Not to my knowledge. The Itanium CPU architecture was a RISC-style design with a modified VLIW parallel-execution strategy, EPIC (Explicitly Parallel Instruction Computing), in which the compiler delineated instruction bundles and expressed speculative execution explicitly. This is nothing like current x86 superscalar instruction execution.
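To make the contrast concrete, here is a minimal sketch in plain C (no real IA-64 intrinsics; the bundle and predicate details live only in the comments) of the kind of if-conversion an EPIC compiler performs, versus leaving a branch for superscalar hardware to predict at run time:

```c
/* Branchy version: on x86, the core's branch predictor and out-of-order
 * machinery speculate past the 'if' at run time and recover on a mispredict. */
int clamp_branchy(int a, int b, int limit)
{
    int sum = a + b;
    if (sum > limit)
        sum = limit;
    return sum;
}

/* If-converted version: the form an IA-64 compiler prefers, where the compare
 * sets a predicate register and both outcomes are issued in the same
 * explicitly parallel bundle, so there is no branch left to predict.
 * In C it is just a select expression. */
int clamp_predicated(int a, int b, int limit)
{
    int sum = a + b;
    return (sum > limit) ? limit : sum;
}
```

Either form computes the same result; the difference is whether the compiler or the run-time hardware is responsible for hiding the branch.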
I suppose one could say the Itanium and Xeon cache coherency schemes have some similarity, since both use snoopy caches with snoop filters, but AMD appears to have a similar strategy (though the terminology is different), so I'm not convinced anything of unique value was learned in the Itanium effort. Itanium did use switched point-to-point links for multi-CPU NUMA coherency, rather than the typical Intel memory bus of the day (the FSB, or Front-Side Bus), but the modern UPI links in Xeons don't appear to be derived from the Scalability Port design used by Itanium.
My impression is that the only thing Intel really learned from Itanium was to stay far away from VLIW for general-purpose computing and stick with superscalar instruction parallelism. VLIW can still be used to reduce the hardware complexity and design cost of specialized processors, as in Intel's Gaudi AI accelerators or the Kalray networking and security processors, which I suspect run software hand-coded at the assembler level to make the best use of VLIW. But for general-purpose processing, the VLIW compilers become an overwhelming software complexity problem.
I don't think programming languages are a factor. The fundamental issue seems to be this: a compiler does not have the real-time execution information that superscalar CPU circuits have for branch prediction, out-of-order execution, and instruction reordering. If your workload is a general-purpose (code cannot be predicted), multi-programming (mixed applications), multi-threaded (a large amount of execution concurrency) environment, then execution optimization by compilation, even in-line compilation, is difficult. It has been proven possible for small dedicated workloads (small meaning a small code footprint), but it is difficult to imagine for, say, cloud computing.
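As a concrete illustration of that missing information, here is a plain-C sketch (hypothetical code, not from any particular workload) of a loop where neither the load latencies nor the branch direction can be known at compile time:

```c
/* A compiler scheduling this loop statically cannot know how long each load
 * will take or which way the branch will go, because both depend on the data
 * and on cache state at run time. An out-of-order superscalar core observes
 * the actual miss or branch outcome and reorders work around it on the fly;
 * a VLIW/EPIC compiler has to commit to a fixed schedule in advance. */

struct node {
    struct node *next;
    long         value;
};

long sum_positive(const struct node *n)
{
    long total = 0;

    while (n) {
        /* Load latency unknown at compile time: n->value may hit in L1 or
         * miss all the way to DRAM, a difference of two orders of magnitude. */
        long v = n->value;

        /* Branch direction unknown at compile time: it depends entirely on
         * the data, so no static schedule or predication choice is always right. */
        if (v > 0)
            total += v;

        n = n->next;   /* pointer chase: the next address exists only at run time */
    }
    return total;
}
```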
We're also at the stage of chip fabrication where incredibly complicated circuitry with impressive parallelism can be implemented with reasonably low power. Modern superscalar CPUs have many instruction decoders and processing pipelines. VLIW was conceived in the early 1980s, when we talked about chip feature sizes in microns.
I can picture Josh Fisher, the conceiver of the VLIW strategy, in 1983 or so, sitting in a bar near Yale University with his pals talking about VLIW: "Imagine if CPU instruction-processing parallelism could advance at the speed of software innovation, rather than the speed of integrated circuit improvements." I get it. But when the workload you're trying to optimize isn't predictable, real-time information and multi-gigahertz clock frequencies beat advance prediction, no matter how good, every time.
They have completely different design philosophies. Itanium (IA-64) was designed around 'let the programmers do more work', while x86-64 stood for 'we'll reduce the programmer's work'. So there's little you can salvage from the sunk IA-64 effort. After launching IA-64, Intel found out what really made its chips valuable: IA-64 forced software developers to re-compile (or even change) their existing code bases to fully exploit the new processors, and software development cost exceeds the cost of purchasing (not developing) new processors. So software companies moved to backward-compatible x86-64. Anyway, VLIW still survives in GPUs, where workloads are more predictable.