Specifications and Architecture
We finally have our hands on the 18-core and 16-core Skylake-X parts.
It has been an interesting 2017 for Intel. Though still the dominant market share leader in consumer processors of all shapes and sizes, from DIY PCs to notebooks to servers, the company has come under pressure from AMD unlike any it has felt in nearly a decade. It started with the release of Ryzen 7 and a family of processors aimed at the mainstream and enthusiast markets. That was followed by the EPYC launch, which moved in on Intel's turf in the enterprise market. And most recently, Ryzen Threadripper took a swing (and hit) at the HEDT (high-end desktop) market that Intel had created and held as its own since the days of the Nehalem-based Core i7-920.
Between the time Threadripper was announced and when it shipped, Intel made an interesting move: it announced and launched its updated family of HEDT processors, dubbed Skylake-X. Available only up to a 10-core model at first, the Core i9-7900X was the fastest processor our labs had tested at the time, but it was quickly overtaken by the Threadripper 1950X with its 16 cores and 32 threads. Intel had already revealed that its HEDT lineup would extend to 18-core options, though availability and exact clock speeds remained under wraps until recently.
| | i9-7980XE | i9-7960X | i9-7940X | i9-7920X | i9-7900X | i7-7820X | i7-7800X | TR 1950X | TR 1920X | TR 1900X |
|---|---|---|---|---|---|---|---|---|---|---|
| Architecture | Skylake-X | Skylake-X | Skylake-X | Skylake-X | Skylake-X | Skylake-X | Skylake-X | Zen | Zen | Zen |
| Process Tech | 14nm+ | 14nm+ | 14nm+ | 14nm+ | 14nm+ | 14nm+ | 14nm+ | 14nm | 14nm | 14nm |
| Cores/Threads | 18/36 | 16/32 | 14/28 | 12/24 | 10/20 | 8/16 | 6/12 | 16/32 | 12/24 | 8/16 |
| Base Clock | 2.6 GHz | 2.8 GHz | 3.1 GHz | 2.9 GHz | 3.3 GHz | 3.6 GHz | 3.5 GHz | 3.4 GHz | 3.5 GHz | 3.8 GHz |
| Turbo Boost 2.0 | 4.2 GHz | 4.2 GHz | 4.3 GHz | 4.3 GHz | 4.3 GHz | 4.3 GHz | 4.0 GHz | 4.0 GHz | 4.0 GHz | 4.0 GHz |
| Turbo Boost Max 3.0 | 4.4 GHz | 4.4 GHz | 4.4 GHz | 4.4 GHz | 4.5 GHz | 4.5 GHz | N/A | N/A | N/A | N/A |
| Cache | 24.75MB | 22MB | 19.25MB | 16.5MB | 13.75MB | 11MB | 8.25MB | 40MB | 38MB | ? |
| Memory Support | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel | DDR4-2666 Quad Channel |
| PCIe Lanes | 44 | 44 | 44 | 44 | 44 | 28 | 28 | 64 | 64 | 64 |
| TDP | 165 watts | 165 watts | 165 watts | 140 watts | 140 watts | 140 watts | 140 watts | 180 watts | 180 watts | 180 watts? |
| Socket | 2066 | 2066 | 2066 | 2066 | 2066 | 2066 | 2066 | TR4 | TR4 | TR4 |
| Price | $1999 | $1699 | $1399 | $1199 | $999 | $599 | $389 | $999 | $799 | $549 |
Today we are looking at both the Intel Core i9-7980XE and the Core i9-7960X, 18-core and 16-core processors, respectively. Intel's goal with this release is clear: retake the crown as the highest-performing consumer processor on the market. It does that, but at $700-$1000 over the price of the Threadripper 1950X.
Architectural Refresher
There is very little new to discuss with the Core i9-7980XE and Core i9-7960X. They use the same architecture and design as the Core i9-7900X, though they are built on a larger die. That die has up to 18 cores available on it and is utilized for the 12-, 14-, and 16-core HEDT processors as well. It uses the same mesh interconnect, has the same AVX-512 support, shifts the cache weighting to the same ratio, includes the improved Turbo Boost Max 3.0 feature, and includes SpeedShift support.
If you want a refresher on what these technologies and features do, I have included one just below this paragraph. If you don't need that and instead want to jump straight into the new stuff, including benchmarks and overclocking, I get it. Just click right here.
AVX-512
Although the underlying architecture of the Skylake-X processors is the same as the mainstream consumer Skylake line, which we knew as the Core i7-6000 series, there are some important changes thanks to the Xeon heritage of these parts. First, Intel has tried to impress upon us the value of AVX-512, each and every time we discuss this platform, and its ability to drastically improve the performance of applications that are recompiled and engineered to take advantage of it. Due to timing constraints, and a lack of real-world software that can utilize it, we are going to hold off on a more detailed AVX-512 discussion for another day.
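For a sense of what "recompiled and engineered to take advantage of it" means in practice, here is a minimal sketch of an AVX-512 loop using compiler intrinsics. It assumes a compiler with AVX-512F support (build with -mavx512f on GCC or Clang) and, of course, a CPU like Skylake-X to run on; it is illustrative, not a benchmark.

```c
// Minimal sketch: summing two float arrays with AVX-512F intrinsics.
#include <immintrin.h>
#include <stddef.h>

void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    // Each 512-bit register holds 16 single-precision floats,
    // twice the width of AVX2's 256-bit registers.
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));
    }
    // Scalar tail for lengths that are not a multiple of 16.
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The catch, and the reason real-world wins are scarce today, is that software has to be rebuilt (or hand-tuned) along these lines before AVX-512 does anything at all.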
Caching Hierarchy and Performance
We do know that the cache hierarchy of the Skylake-X processors has changed:
Skylake-X processors will also rebalance the cache hierarchy compared to previous generations, shifting toward more exclusive per-core cache at the expense of shared LLC. While Broadwell-E had 256KB of private L2 cache per core and 2.5MB per core of shared L3, Skylake-X moves to 1MB of private L2 cache per core and 1.375MB of shared L3 per core.
This shift in cache division will increase the hit rate on the lowest latency memory requests, though we do expect inter-core latency to increase slightly as a result. Intel obviously has made this decision based on workload profiling, so I am curious to see how it impacts our testing in the coming weeks.
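For context on how numbers like these are gathered: cache latency is typically measured with a pointer chase, where each load depends on the previous one so the hardware cannot overlap or prefetch the accesses. Below is a minimal sketch of the technique (assuming Linux/glibc for the timer; the names and working-set sizes are ours, not a real tool).

```c
// Minimal sketch of a pointer-chasing latency test.
// Build with: gcc -O2 chase.c -o chase
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns(size_t bytes, size_t steps)
{
    size_t n = bytes / sizeof(void *);
    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));

    // Sattolo's algorithm: a random permutation that forms a single
    // cycle, so the chase touches every slot and defeats the stride
    // prefetchers that would hide the true latency.
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[i] = &buf[idx[i]];

    void **p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                 /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    fprintf(stderr, "sink: %p\n", (void *)p);  /* keep the loop live */
    free(idx);
    free(buf);
    return ns / (double)steps;
}

int main(void)
{
    // Working sets chosen to straddle Skylake-X's 1MB L2 and shared L3.
    size_t kb[] = { 256, 512, 1024, 2048, 8192, 32768 };
    for (int i = 0; i < 6; i++)
        printf("%6zu KB: %5.2f ns/load\n", kb[i],
               chase_ns(kb[i] << 10, 20 * 1000 * 1000));
    return 0;
}
```

As the working set grows past the 1MB L2 and then past the shared L3, the nanoseconds-per-load figure steps up at each boundary, which is exactly the staircase a cache latency chart visualizes.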
After more talks with Intel and our own testing, it’s clear that the changes made to the mesh architecture (below) and cache divisions have an impact on latencies and performance in some applications. Take a look at our cache latency results below:
Mesh Architecture Interconnect
I wrote about this new revelation that is part of both the Skylake-X HEDT consumer processors and the Xeon Scalable product this week, but it’s worth including the details here as well.
One of the most significant changes to the new processor design comes in the form of a new mesh interconnect architecture that handles the communications between the on-chip logical areas.
Since the days of Nehalem-EX, Intel has utilized a ring bus architecture for processor design. The ring bus operated in a bi-directional, sequential method that cycled through various stops. At each stop, the control logic would determine if data was to be collected from or deposited with that module. These ring bus stops are located at the CPU cores/caches, the memory controllers, the PCI Express interface, the LLC segments, and so on. The ring bus was fairly simple and easily expandable by simply adding more stops on the ring itself.
However, over several generations, the ring bus has become quite large and unwieldy. Compare the ring bus from Nehalem above to the one for last year's Xeon E5 v4 platform.
The spike in core counts and other modules caused a ballooning of the ring that eventually turned into multiple rings, complicating the design. As you increase the stops on a ring bus, you also increase the physical latency of the messaging and data transfer, which Intel compensated for by increasing the bandwidth and clock speed of the interface. The expense of that is power and efficiency.
For an on-die interconnect to remain relevant, it needs to scale in bandwidth, reduce latency, and remain energy efficient. With 28-core Xeon processors imminent, and new IO capabilities coming along with them, the time for the ring bus in this space is over.
Starting with the HEDT and Xeon products released this year, Intel will be using a new on-chip design called a mesh that Intel promises will offer higher bandwidth, lower latency, and improved power efficiency. As the name implies, the mesh architecture is one in which each node relays messages through the network between source and destination. Though I cannot share many of the details on performance characteristics just yet, Intel did share the following diagram.
As Intel indicates in its blog on the mesh announcements, this generic diagram “shows a representation of the mesh architecture where cores, on-chip cache banks, memory controllers, and I/O controllers are organized in rows and columns, with wires and switches connecting them at each intersection to allow for turns. By providing a more direct path than the prior ring architectures and many more pathways to eliminate bottlenecks, the mesh can operate at a lower frequency and voltage and can still deliver very high bandwidth and low latency. This results in improved performance and greater energy efficiency similar to a well-designed highway system that lets traffic flow at the optimal speed without congestion.”
The bi-directional mesh design allows a many-core design to offer lower node-to-node latency than the ring architecture could provide, and, by adjusting the width of the interface, Intel can control bandwidth (and, by relation, frequency). Intel tells us that this can offer lower average latency without increasing power. Though it wasn't specifically mentioned in the blog, the assumption is that, because nothing is free, the more granular mesh network comes at a slight die size cost.
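To make the latency argument concrete, a back-of-the-envelope comparison of average node-to-node hop counts helps. The sketch below contrasts a bidirectional ring with a 2D mesh using dimension-ordered routing; the node counts and grid shapes are our illustration, not Intel's actual floorplan or routing policy.

```c
// Back-of-the-envelope: average hop count between two random nodes
// on a bidirectional ring vs. an R x C mesh with X/Y routing.
#include <stdio.h>
#include <stdlib.h>

// Bidirectional ring: shortest path is min(d, N - d).
static double ring_avg(int n)
{
    long total = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++) {
            int d = abs(a - b);
            total += d < n - d ? d : n - d;
        }
    return (double)total / ((double)n * n);
}

// Mesh with dimension-ordered (X then Y) routing: hops = |dx| + |dy|.
static double mesh_avg(int rows, int cols)
{
    int n = rows * cols;
    long total = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            total += abs(a / cols - b / cols) + abs(a % cols - b % cols);
    return (double)total / ((double)n * n);
}

int main(void)
{
    // 18 stops: one long ring vs. a 6 x 3 grid.
    printf("18-stop ring : %.2f hops on average\n", ring_avg(18));
    printf("6x3 mesh     : %.2f hops on average\n", mesh_avg(6, 3));
    // Scaling toward a 28-core Xeon-class die.
    printf("28-stop ring : %.2f hops on average\n", ring_avg(28));
    printf("7x4 mesh     : %.2f hops on average\n", mesh_avg(7, 4));
    return 0;
}
```

The gap widens as stops are added, which is the core of Intel's argument: a ring's average distance grows linearly with node count, while a mesh's grows roughly with the square root of it.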
Using a mesh architecture offers a couple of capabilities and also requires a few changes to the cache design. By dividing up the IO interfaces (think multiple PCI Express banks, or memory channels), Intel can provide better average access times to each core by intelligently spacing the location of those modules. Intel will also be breaking up the LLC into different segments which will share a “stop” on the network with a processor core. Rather than the previous design of the ring bus where the entirety of the LLC was accessed through a single stop, the LLC will perform as a divided system. However, Intel assures us that performance variability is not a concern:
Negligible latency differences in accessing different cache banks allow software to treat the distributed cache banks as one large unified last level cache. As a result, application developers do not have to worry about variable latency in accessing different cache banks, nor do they need to optimize or recompile code to get a significant performance boost out of their applications.
There is a lot to dissect when it comes to this new mesh architecture for Xeon Scalable and Core i9 processors, including its overall effect on the LLC cache performance and how it might affect system memory or PCI Express performance. In theory, the integration of a mesh network-style interface could drastically improve the average latency in all cases and increase maximum memory bandwidth by giving more cores access to the memory bus sooner. But, it is also possible this increases maximum latency in some fringe cases.
Turbo Boost Max Technology 3.0
With the release of the Broadwell-E platform, Intel introduced Turbo Boost Max Technology 3.0, which allowed a single core on those CPUs to run at higher clock speeds than the others, effectively improving single-threaded performance. With Skylake-X, Intel has improved the technology to utilize the TWO best cores, rather than just one.
This allows the 8-core and higher core-count processors from this launch to run at higher frequencies when only one or two cores are being utilized. On the two products we have clock speeds for, that is a 200 MHz advantage over standard Turbo Boost. Intel hopes this improvement gives it another advantage in gaming and lightly threaded workloads over the AMD Ryzen and Threadripper processors.
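Notably, the "favored" cores are advertised to the operating system rather than hidden. As a hedged sketch: on a Linux box where the kernel and firmware expose ACPI CPPC data, the per-core highest_perf values that Turbo Boost Max support relies on can be read from sysfs, with the favored cores reporting higher numbers. Whether these files exist at all depends on kernel version and BIOS support, so treat this as illustrative.

```c
// Hedged sketch (Linux assumed): list per-core ACPI CPPC highest_perf
// values; favored cores show larger numbers where this is exposed.
#include <stdio.h>

int main(void)
{
    char path[128];
    // Scan an assumed maximum of 64 logical CPUs; adjust as needed.
    for (int cpu = 0; cpu < 64; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/acpi_cppc/highest_perf",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;  /* no more CPUs, or CPPC not exposed at all */
        int perf;
        if (fscanf(f, "%d", &perf) == 1)
            printf("cpu%-3d highest_perf = %d\n", cpu, perf);
        fclose(f);
    }
    return 0;
}
```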
SpeedShift on HEDT
For the first time, the HEDT platform gets SpeedShift technology. The feature has been present since the launch of Skylake on the consumer notebook line, was updated with Kaby Lake, and now finds its way to the high-performance platforms. The technology allows the CPU to reach higher clock rates, and to reach them faster, improving the responsiveness of the system in short, bursty workloads. It accomplishes this by taking much of the control of power states away from the operating system and leaving that decision-making to the CPU itself.
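Under the hood, SpeedShift is built on Hardware-Controlled P-states (HWP), and support is discoverable from CPUID. A minimal sketch using GCC/Clang's cpuid.h helper (Intel's SDM documents HWP as bit 7 of EAX in leaf 06H):

```c
// Minimal sketch: check for HWP, the CPU feature underlying SpeedShift.
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(6, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 06H not supported");
        return 1;
    }
    // Bit 7 of EAX in leaf 06H indicates HWP base support.
    printf("HWP (SpeedShift) supported: %s\n",
           (eax & (1u << 7)) ? "yes" : "no");
    return 0;
}
```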
Comparing the Core i9-7900X to the Core i7-6950X (which does not have SpeedShift) and the Core i7-7700K (Kaby Lake) shows the differences in implementation. The 7900X reaches its peak clock speed in 40ms, while last year's Broadwell-E processor takes over 250ms to reach its highest clock state. That's a significant difference and should give users better performance in application loads and other short workloads. Note the difference with the 7700K, though: the consumer Kaby Lake design targets instantaneous clock rates even more aggressively.
X299 Platform Still Going Strong
Though nothing has changed on our test bed for this review, it's worth noting that there has been continuous updating around the X299 chipset and the motherboards that use it. ASUS has released a handful of newer BIOSes for the X299-Deluxe board we used in the 7900X story and in today's review, which have improved stability, made overclocking easier, and even improved our storage performance (more on that at a later time).
A big thanks to our friends at Corsair for hooking us up with a collection of new RM1000x power supplies, Vengeance LPX 32GB 3200 MHz memory kits, and Neutron XTi 480 GB SSDs to upgrade all of our CPU testing platforms! They have helped greatly in streamlining our testing process.