We're beginning to see how the Zen architecture will affect AMD's entire product stack. This news refers to their Opteron line of CPUs, which are intended for servers and certain workstations. These parts tend to support large amounts of memory, have lots of cores, and connect to a lot of I/O options and add-in boards at the same time.
In this case, Zen-based Opterons will be available in two-, four-, sixteen-, and thirty-two-core options, with two threads per core (yielding four, eight, thirty-two, and sixty-four threads, respectively). TDPs will range between 35W and 180W. Intel's Xeon E7 v4 goes up to 165W for 24 cores (on Broadwell-EX), so AMD has a little more headroom to play with for those extra eight cores. That is obviously a lot, and it should be, again, good for cloud applications that can be parallelized.
As for the I/O side of things, the rumored chip will have 128 PCIe 3.0 lanes. It's unclear whether that is per socket, or total. Its wording sounds like it is per-CPU, although much earlier rumors have said that it has 64 PCIe lanes per socket with dual-socket boards available. It will also support sixteen 10-Gigabit Ethernet connections, which, again, is great for servers, especially with virtualization.
These are expected to launch in 2017. Fudzilla claims that “very late 2016” is possible, but also that they will launch after the high-end desktop parts, which are themselves expected to slip to 2017.
A two core option? really? What would be the use case for that? Even 4 cores in a server environment sounds low.
Yeah, that looked really odd. Two, four then sixteen? Iffy rumor.
128 PCIe lanes on that 32-core part does sound yummy. AMD might get back into the server room?
Having a 2 core design isn’t strange. It is unclear why they wouldn’t list an 8 core part though. I would expect a 2 core die for very low cost, low power systems, but it is unclear whether the 4 core part is a separate die; it could be produced by salvage from 8 core production. They could be expecting to sell most of the actual 8 core parts in the desktop market. For server and workstation, at 14 nm, 8 cores is actually relatively small. Above 8 cores, I expect there is only one actual die, probably a 16 core part made to be placed on an interposer. Interposer parts are going to be expensive, so they may not be available under 16 cores, although they could make lower core count devices with salvage from 16 core production.
Unfortunately, probably not an interposer based design, probably just an MCM.
I agree it is odd; it could mean low yields are forcing them to do so.
That’s not a surprise since Zen would come out on desktop first.
Be prepared for the worst with Zen… many would be disappointed.
They have 4c jaguar microservers now… I would assume 2c/4t zen would beat the snot out of that…
High-clock, lightly threaded applications like AutoCAD could benefit from the stability of an Opteron for serious workloads. But I would guess that the speeds they can hit on a 2-core part could be hit on a 4-core part as well.
Then again, there is industrial and embedded gear out there that could be a potential market we know nothing about.
There is probably some market for an extremely low power server device. They are still, unfortunately, differentiated based on supporting ECC. I tend to think ECC should be extended to the whole market, but even if you don’t need much processing power, you still have to buy a server grade device if you want ECC.
It is 2-core/4-threads. Zen has its own version of Hyperthreading.
It’s called SMT (simultaneous multi-threading) in the computer science dictionary, and Intel’s Hyperthreading is just a marketing term for Intel’s version of SMT.
“While multithreading CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968 as part of the ACS-360 project.[1] The first major commercial microprocessor developed with SMT was the Alpha 21464 (EV8). This microprocessor was developed by DEC in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Henry Levy of the University of Washington. The microprocessor was never released, since the Alpha line of microprocessors was discontinued shortly before HP acquired Compaq which had in turn acquired DEC. Dean Tullsen’s work was also used to develop the Hyper-threading (Hyper-threading technology or HTT) versions of the Intel Pentium 4 microprocessors, such as the “Northwood” and “Prescott”.”(1)
(1) “Simultaneous multithreading”, https://en.wikipedia.org/wiki/Simultaneous_multithreading
NAS
Is it actually 128 PCIe lanes AND 16 10GBit, or is it something that depends on how the BIOS configures the CPU…
Like, the I/O on the CPU is configurable to support either up to 128 PCIe lanes OR up to 16 10GBit links… or combinations such as 64x PCIe + 8x 10GBit.
Furthermore, it feels more likely that it’s the system aggregate for a dual-socket setup, meaning a single-socket system would be configured like 32x PCIe and 4x 10GBit. Or more likely, 48x PCIe + 2x 10GBit, which is more firmly in the realm of what’s believable.
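To put rough numbers on that configurable-I/O guess, here is a minimal sketch. The 8-lanes-per-10GbE-port trade ratio is only implied by the comment’s own figures (128 lanes vs. 16 ports, and the 64x PCIe + 8x 10GBit combo); it is speculation, not a confirmed spec.

```python
# Illustrative only: models the guess above that the CPU exposes one shared
# pool of configurable I/O lanes that the BIOS can assign to either PCIe or
# 10GbE. The 8-lanes-per-port ratio is inferred from the comment's own
# numbers (128 lanes <-> 16 ports, 64 lanes + 8 ports), not a confirmed spec.

TOTAL_LANES = 128        # rumored lane pool (per socket or per system, unclear)
LANES_PER_10GBE = 8      # assumed trade ratio

def remaining_pcie(ten_gbe_ports):
    """PCIe lanes left after dedicating lanes to the given number of 10GbE ports."""
    used = ten_gbe_ports * LANES_PER_10GBE
    if used > TOTAL_LANES:
        raise ValueError("not enough lanes for that many 10GbE ports")
    return TOTAL_LANES - used

for ports in (0, 2, 4, 8, 16):
    print(f"{ports:2d} x 10GbE -> {remaining_pcie(ports):3d} PCIe lanes left")
# 0 -> 128, 8 -> 64 (the 64x PCIe + 8x 10GbE combo above), 16 -> 0
```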
It could be a silicon interposer based device. Perhaps they make a die with 16 (or 32?) pci-e lanes and another die with 4 10 Gbit links (or maybe 2; I don’t know how much die space it will take at 14 nm). If you get an interposer with 8 of the pci-e die on it, then you have 128 pci-e lanes. If you get an interposer with 4 of the 10 Gbit die, then you have sixteen 10 Gbit links. Then you could perhaps make an interposer with two 10 Gbit die and 4 pci-e die for eight 10 Gbit and 64 pci-e. This is wild speculation. Although, it is unclear to me what is “in the realm of believable” when they can make an interposer over 800 square millimeters and fill it with die made on 14 nm.
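A quick tally of the hypothetical chiplet mixes described above. Both dies and their capacities (a 16-lane PCIe die and a 4-link 10 Gbit die) are the comment’s own speculation, not announced parts.

```python
# Sums up the speculative chiplet combinations from the comment above: a PCIe
# die carrying 16 lanes and a 10 Gbit Ethernet die carrying 4 links, placed on
# the interposer in different counts.

PCIE_LANES_PER_DIE = 16
ETH_LINKS_PER_DIE = 4

def io_totals(pcie_dies, eth_dies):
    """Return (total PCIe lanes, total 10GbE links) for a given chiplet mix."""
    return pcie_dies * PCIE_LANES_PER_DIE, eth_dies * ETH_LINKS_PER_DIE

for pcie_dies, eth_dies in [(8, 0), (0, 4), (4, 2)]:
    lanes, links = io_totals(pcie_dies, eth_dies)
    print(f"{pcie_dies} PCIe dies + {eth_dies} 10GbE dies -> {lanes} lanes, {links} links")
# (8, 0) -> 128 lanes, (0, 4) -> 16 links, (4, 2) -> 64 lanes + 8 links
```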
AMD is supposed to be working on an HPC/Workstation APU MCM variant with 16/32 Zen cores and a Vega/Greenland GPU die along with HBM. Some of the earlier leaks for APUs on an interposer-based MCM (Multi-Chip Module) show a 16 Zen-core die and a GPU/Vega die wired up via an interposer, with the CPU die and the GPU die sharing the HBM. There are also 32-core APU-on-interposer variants, with two separate 16 Zen-core dies placed on opposite sides of a large Vega/Greenland die, along with some HBM2 stacks to complete the APU on the interposer/MCM.
But this SKU appears to be only a 32-core Zen part with lots of server-type connectivity options and lots of PCIe lanes. At a later time there will be some true APU-on-an-interposer SKUs from AMD, with all the dies wired up via the interposer’s silicon substrate and a much wider parallel CPU (Zen cores die) to GPU (Vega/Greenland die) connection, similar in design to the wide parallel GPU-to-HBM2 connections on current and future GPU-only HBM/HBM2 systems.
So this 32-core Zen part is probably AMD’s move to get a Zen-based server product to market ASAP, in advance of those future APUs on a silicon interposer (MCM) module. Look for this 32-core, CPU-only SKU to be released very early in 2017, while the server/HPC/workstation APUs on an interposer/MCM module will probably come late 2017 into 2018.
AMD’s Navi platform may break the GPU up further into separate smaller GPU die modules (for better GPU die/wafer yields) to be added to the interposer module in various scalable numbers, along with the Zen cores die and the HBM dies, to create a wide range of interposer/MCM-based APU SKUs for the consumer through HPC/workstation markets.
People have been both underestimating and overestimating interposer based designs. The interposer is still of limited size, and while there is technology to go beyond reticle size, it probably isn’t worth it in most cases; just use multiple devices rather than trying to splice multiple interposers or other hacks. For GPU containing devices, you are probably not going to see more than a 600 square mm GPU die, since 4 stacks of HBM2 take almost 400 square mm. You could see two 300 square mm dies though. If you add a CPU die, then the other components must be smaller, so probably no 600 square mm GPU die on an APU. Maybe between 400 and 450 or so, unless they stick with HBM1. HBM1 can probably be clocked higher than what is in the Fury card, so it is possible. HBM1 could be very useful as a cache also; it has 8 independent channels per stack, so it can technically be connected to up to 8 different devices. They could run 4 channels to one CPU die and 4 channels to another. With how small HBM1 is, though, I could imagine using 2 stacks, one for each 16 core device. HBM is still DRAM and much slower than SRAM caches, but it would still be much faster than external memory.
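A back-of-the-envelope version of that floor-plan arithmetic. The ~1000 mm² interposer budget, the ~100 mm² per HBM2 stack (“almost 400” for 4 stacks), and the ~150 mm² CPU die are rough assumed figures, not measured die sizes.

```python
# Rough floor-plan math for the comment above; all area figures are assumed
# ballpark numbers, not measured die sizes.

INTERPOSER_BUDGET_MM2 = 1000   # assumed usable interposer area
HBM2_STACK_MM2 = 100           # rough per-stack footprint ("almost 400" for 4 stacks)

def gpu_area_left(hbm_stacks, cpu_die_mm2=0):
    """Area (mm^2) left for GPU silicon after HBM stacks and an optional CPU die."""
    return INTERPOSER_BUDGET_MM2 - hbm_stacks * HBM2_STACK_MM2 - cpu_die_mm2

print(gpu_area_left(4))                    # GPU-only package: ~600 mm^2 left for the GPU
print(gpu_area_left(4, cpu_die_mm2=150))   # with a hypothetical CPU die: ~450 mm^2 left
```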
I have usually heard MCM used to refer to placing multiple die on the same package rather than on a silicon interposer. The interposer is a very different design from just placing multiple die on the same package substrate. I believe AMD has made Opterons with two completely separate die on the same package in the past. For this generation, I would expect a 2 core low-end device for laptops and other low power applications. They may make a separate 4 core die also, and definitely a separate 8 core die. I don’t think there will be a full 32 cores on a single die though. It would be most efficient to design a single 16 core device for use on interposers. Two 16 core devices on an interposer gives you 32 cores max. They can make any size from 8 up to 32 cores by using salvaged parts. The other components can be designed in a modular manner. They can design a pci-e die and a 10 Gbit ethernet die with the same interface such that they can be interchangeably placed on the interposer. They can also design the interposer to allow either device to be routed out of the package. A wide range of designs with varying components can be made with a single interposer design. They could make a couple of different-sized interposer designs using the same die for different market segments.
They are calling it an MCM in some articles, but it’s naturally based on a silicon interposer for the HBM2 memory’s wide parallel traces, which can be etched into the interposer’s silicon substrate; that’s already four 1024-bit-wide connection paths to 4 HBM/HBM2 stacks. And those 600 square mm GPU dies were done at 28 nm, so at 14 nm the total die area is going to be much smaller for the same number of GPU CUs and total cores. Whatever the new size standard, those 600 square mm monolithic GPU dies are history for die/wafer yield issues alone.
So that MCM label may be in error; let’s go back to calling it simply an interposer-based APU/SOC or GPU. AMD’s Navi-based systems are most likely going to be made up of groups of CUs on separate smaller GPU dies to improve yields, with the individual GPU dies wired up via an interposer-based network topology, along with the HBM and separate CPU dies for any APU/Navi GPU-based SKUs. AMD appears to be moving towards breaking up the large GPU die, with its Navi-based GPUs made to scale by the number of smaller modular GPU dies wired up via the interposer. Those large GPU die/wafer yield issues may be avoided by going smaller and modular, with many GPU dies per interposer depending on the SKU’s performance needs.
So the Navi GPU systems will be made up of scalable/modular GPU dies that allow for many low- to high-powered SKUs simply by adding more of the modular GPU dies and HBM, and, for APUs on an interposer, CPU core dies or other processor dies as well.
There are even papers/graduate theses exploring de-integrating multi-core CPUs (into core complexes of 4 CPU cores per die) and moving the networking mesh topology onto the interposer, wiring up CPU systems on an interposer with 64 total CPU cores across 16 four-core dies and running simulations of various network topologies for data and coherency traffic metrics. So look for some interesting use of interposer technology even for CPU-only systems, with NOC (network on a chip)/interposer designs where the network topology has been moved onto the interposer for SMP-type CPU systems.
The era of giant GPU dies may be coming to an end, with giant interposers hosting many smaller GPU dies and HBM dies, and with the smaller modular GPU dies made to appear to the software/OS like one big monolithic GPU via interposer or other circuitry and firmware abstracting away the fact that the GPU is made up of separate smaller dies.
It seems to be an MCM like current socket G34 Opteron processors rather than a silicon interposer based design. It is massively more powerful than current socket G34 processors though. AMD has been making these MCMs since 2010 with two Opteron die placed on one PCB. While this is a massive amount of interconnect, it isn’t out of the realm of possibility. Intel makes a 22-core device with 40-something PCIe lanes. This is supposed to be a 16-core die with 64 PCIe lanes. Place two of them on one package, and you have a 32-core device with 128 PCIe lanes; no silicon interposers required.
I hope AMD can get more silicon interposer based designs out quickly, but I don’t think that this device is a silicon interposer. It will take time to design all of the die required for an HPC APU, or just a large silicon interposer based server CPU. It can share the CPU core design, but the rest of the die will need to be redesigned for use on the interposer. Any other accessory die also have to be redesigned.
Also, Nvidia has made an interposer based GPU with a 600 square mm die size, but it probably costs over $10,000. The point is, you can’t easily have an HPC APU with a GPU anywhere near that size, since you need room for other components, like the CPU die.
Unfortunately, probably not an interposer based design, probably just an MCM. Should have known that they would not have such an interposer based device available that soon. Probably will be a few years before we get such use of interposers.
Look, both MCMs and interposers can host multiple dies, so the fact that the OLD ceramic/plastic MCM modules may have been based on some other compound does not mean that using the term MCM is totally wrong. Silicon interposers can in fact host many different kinds of chips.
This first AMD 32-core Zen CPU variant is there to get AMD back into the x86 server market while the interposer-based APUs are still in the development stage. AMD still has some Opteron customers, so maybe AMD wants to get at that market sooner, win more business, and offer its customers an upgrade path to the APU/HBM-based HPC/server systems at a later time. As long as AMD has designed the socket for these SKUs to also be usable for any newer APU/HBM-based SKUs that may come online, AMD can offer an upgrade path to anyone who gets this first Zen server SKU. There are still a lot of server usage models that do not require any GPUs, though the HBM would be welcomed by anyone for its speed and power savings.
Silicon interposers are quite different from mounting multiple chips on a small PCB. Technically, you could refer to a motherboard as an “MCM” and not be “totally wrong”. This is a semantic issue, so there isn’t necessarily a right and wrong, but there can be a lot of confusion if we refer to silicon interposers as MCMs since MCM has come to imply more specifically mounting multiple chips on a small PCB or ceramic substrate.
AMD has been doing this since 2010 with their socket G34 processors. It will be a massive upgrade compared to their old line-up though. They are doubling the number of memory channels from 4 to 8, and it is DDR4 instead of DDR3. That amount of bandwidth alone may attract some sales. The PCIe bandwidth is also huge. I don’t think any of the Opterons had on-die PCIe before. They had to use HT links for connection to a chipset. HT is still quite fast, but they only have four 16-bit links, and some will be used for inter-processor communications in a multi-socket system. A 16-bit HT link isn’t going to be fast enough with 4 channels of DDR4. I am curious as to what they will use to make up for that. They could use full 32-bit HT links instead.
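To put rough numbers on that bandwidth gap, here is a minimal sketch. The DDR4-2400 speed grade and the 6.4 GT/s HT 3.1 link rate are assumed figures for illustration; the takeaway is just the size of the mismatch.

```python
# The bandwidth mismatch the comment points at, in rough numbers. The DDR4
# speed grade (2400 MT/s) and the HT 3.1 link rate (6.4 GT/s) are assumptions
# for illustration only.

def ddr4_bw_gbs(channels, mts=2400, bus_bytes=8):
    """Peak DRAM bandwidth in GB/s for the given number of 64-bit channels."""
    return channels * mts * bus_bytes / 1000

def ht_link_bw_gbs(width_bits=16, gts=6.4):
    """Peak HyperTransport bandwidth in GB/s, per direction, for one link."""
    return gts * width_bits / 8

print(ddr4_bw_gbs(4))        # ~76.8 GB/s from 4 channels of DDR4-2400
print(ht_link_bw_gbs(16))    # ~12.8 GB/s per direction over a 16-bit HT link
print(ht_link_bw_gbs(32))    # ~25.6 GB/s per direction over a 32-bit HT link
```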
Hopefully this will bring them some sales; it is a massive upgrade compared to their current products, but we don’t know how well it will compete with Intel. I was hoping that they could get an interposer based device out sooner. Even without GPUs, an interposer could offer huge advantages, especially if it includes some HBM. When the first Opteron came out, it allowed AMD to grab quite a large share of the server market since they had moved the memory controller onto the CPU while Intel was still using a shared bus with memory connected to a north bridge. These new AMD chips could give them enough of an advantage to allow them to grab some market share, but it isn’t the truly massive advantage that they would have with a silicon interposer based device. Intel doesn’t really have an answer to silicon interposers yet. I have seen some stuff about
Accidentally hit post before done (typing on small phone). Meant to say that Intel has a technology called EMIB (embedded multi-die interconnect bridge). This seems to be their answer to silicon interposers. They embed an interconnect die upside down in the packaging substrate and, I believe, use similar micro-solder balls. This allows for very wide connections also, and the interconnect chip can contain logic. Given this, AMD may have a limited time to get a silicon interposer based device out before they will face much more competition. Unfortunately, it will take a lot of time and resources to design die to be placed on the interposer. They can’t really reuse a non-interposer design without significant modification.
The PCPer post seems a bit misleading. Reading more around the web, it sounds like it probably will be an MCM (multiple die mounted on the same package substrate; not really new) rather than an interposer based design. It wouldn’t be out of the realm of possibility to have two 16 core die with 4 channel memory each. It also would be large, but possible to have 64 pci-e links per 16 core die. Intel does 48 links and 22 cores, so they may end up similar sized. The 10 Gbit ethernet thing sounds more like you could do that many ethernet ports with how many pci-e links it has, not that it has both.
I don’t know if these rumors are true, but such things are what I was expecting with interposer based devices. These rumors sound like interposer based devices. If you look at the labeled die photo for the Intel 6950x, the uncore portion of it is huge, especially considering how much L3 cache is on that die. All of the PCI-e links and other system interface circuitry takes a huge amount of die space. I don’t know if it also has 3 QPI links on that die. With silicon interposers though, they can split a lot of those components off onto separate die. They could even place a large L3 or L4 cache on a separate die. A 32 core device could actually be two 16-core or even four 8-core die placed on the interposer. The large number of PCI-e links could be supported on separate chips. The ethernet links would probably be on separate die also. I wonder if it will have a separate router chip to tie them all together. It should also be possible to put active circuitry in the interposer which could make interesting things possible.
This could make excellent use of manufacturing abilities since all of the die can be kept small, which will improve yields. Just making a chip with a giant L3 cache like Intel does will reduce yields significantly. This also makes the devices very modular, so they can support a large number of configurations. They may be able to decrease the number of PCI-e and increase the number of ethernet easily. They could also very easily change the number of cores or the amount of cache, since none of it will require a new die. I wonder if AMD will make separate SRAM cache die. If it is L4, then I could see them using an HBM stack instead of SRAM cache. A silicon interposer gives them the equivalent of a truly massive die. I am not sure what the max size of an interposer currently is, but they may have 800 to 1000 square millimeters to fill. It is going to be interesting times if AMD pulls some of this stuff off.
Yes, the first generation of silicon interposers are just passive designs, with only etched traces and no active circuitry on the interposer’s silicon substrate. Future active interposer designs could have complete connection-fabric and coherency circuitry added along with the traces, or even buffer memory added to the interposer, making a complete connection-fabric platform on the interposer, with the processor dies and memory dies added to an active interposer that handles the communication among all the hosted dies. AMD’s Navi platform may be made up of smaller GPU die modules all wired up via an active/passive interposer/MCM module, with active circuitry in the interposer able to take all the separate smaller GPU modules and make them appear to the software/OS as one big monolithic-die GPU.
So the interposer could become like a mainboard on an MCM/interposer module, with the CPU/GPU/HBM/other dies all hosted and connected via the interposer module’s wide parallel connection traces. A silicon interposer/MCM could have parallel ring networks many thousands of bits wide etched into its substrate to wire up the various processor/HBM2 memory components in a manner that no PCB-based system could ever affordably do.
I doubt that they would try to make it appear as a single monolithic GPU. That would probably require several types of die which are not usable independently. Also, AMD has been pushing for better multi-gpu support with things like the dual Fury card. With a modular, but independent GPU, they can place varying numbers of GPU dies for different products. A monolithic GPU made of multiple die would not have that much flexibility. They can make a smaller interposer with a single die and HBM and also a 2, 3, or 4 GPU interposer if the die are small enough. Space is limited. A 4 die device would not be able to have 4 HBM die for each GPU and the die would need to be small. It could be possible to have an interposer with active circuitry to handle switching such that multiple GPUs can talk to the same HBM stack though. That would make multi-gpu much more efficient than it is now with completely independent memory. I am wondering if they will switch the interposer production to a smaller process. It is on 65 nm now; there should be a lot of 32 or 28 nm production capacity available for cheaper prices now. That would allow much more headroom for active interposers. The “holy grail” would be the ability to stack the GPU in the same stack as the memory. Bandwidth within the stack can be very high, but there are a lot of issues with 3D stacking of high power devices.
This device is probably just an MCM, not an interposer based design. It will be a huge package if it really does have 4 channel memory attached to each die along with 64 pci-e lanes. The 10 Gbit talk is probably just to demonstrate what could be done with 128 pci-e lanes, not that it actually has both.