Arm Introduces Cortex-A78, Cortex-X1, Cortex-X Custom
Arm Tech Day 2020
Another year and another Arm Tech Day. These things seem to follow a pattern, but one that was slightly disrupted by the current pandemic. Instead of heading down to Austin for a few days of presentations, some sun, and some really good BBQ, we were relegated to sitting at home and interacting with all our colleagues at Arm over Zoom and Microsoft Teams.
While the lockdown may be somewhat disappointing, Arm is not letting the situation slow them down. Today they are announcing five product lines that will be hitting shelves sometime next year. Each year Arm has improved their performance and efficiency by double digit percentages across their lineup of CPUs, GPUs, and NPUs. They have achieved this with design optimizations as well as process node improvements that are available to their partners.
The consistent generation-to-generation improvements in Arm's designs have allowed the company and its partners to dominate the global mobile market. With performance approaching that of desktop designs at only a fraction of the power, Arm is enabling levels of compute performance and interactivity that were impossible half a decade ago.
We are only now seeing products hit the scene with the Cortex-A77 based on the latest process nodes from TSMC, Samsung, and others. These parts are competitive with custom parts from Apple, but of course they are licensed designs that do not require a massive design team to implement. This is not to say that there aren’t challenges in getting these parts to market, but much of the design heavy lifting has been done by Arm and their robust tools.
Cortex-A78
In a shock to nearly no one, Arm has announced its next CPU core as the A78. This is the successor to the A77 and is based on the same architecture that was introduced with the A76. It is a product of the Austin design group and is heavily optimized for performance, die size, and efficiency.
Arm claims that it sees a 20% increase in sustained performance as compared to the previous generation at the same power envelope. This is accomplished by a combination of design and process advancements. Arm is comparing an A77 based on 7nm vs. the A78 at 5nm. If we are looking at identical performance, then the A78 consumes about 50% less power than the A77.
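The two claims above describe different operating points, and a quick calculation shows what each implies for performance per watt. This is only a sketch of the arithmetic using the figures quoted in this article; the function name and the ratio inputs are illustrative assumptions, not anything published by Arm.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative performance per watt of a new design versus an old one.

    perf_ratio  -- new performance divided by old performance
    power_ratio -- new power divided by old power
    """
    return perf_ratio / power_ratio


# Claim 1: 20% more sustained performance at the same power envelope.
iso_power_gain = perf_per_watt_gain(perf_ratio=1.20, power_ratio=1.00)

# Claim 2: identical performance at about 50% less power.
iso_perf_gain = perf_per_watt_gain(perf_ratio=1.00, power_ratio=0.50)

print(f"Same power:       {iso_power_gain:.2f}x performance per watt")
print(f"Same performance: {iso_perf_gain:.2f}x performance per watt")
```

The second figure is much larger than the first because power rises faster than performance as a core is pushed harder. Holding performance at the older level lets the new core run at a far more efficient point.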
The design approach is described as three-pronged: add micro features that push performance in an area- and power-efficient manner, reduce structures that offer a low performance return on investment, and optimize existing structures to consume less power.
One of the big features of the new A78 is that the L1 cache can be halved with only a minimal hit to performance compared to the previous generation. This saves on both power consumption and die space. Partners still have the option of going with the full 64 KB size, but considering that the part performs as it does with only 32 KB, I would guess that most will go with the smaller size.
Arm has done a tremendous amount of work on the front end, another area where a lot of performance and efficiency can be gained. By improving it, Arm is able to reduce memory bandwidth demand, lower latency, and hide pipeline bubbles from the execution units as much as possible. Prefetch is more efficient, so the data required for execution is more often already waiting instead of being fetched after a cache miss.
The core has also seen improvements that increase throughput while maintaining good power consumption. These changes were implemented to achieve the aforementioned 20% increase in sustained performance without breaking the power budget. Improving out-of-order (OOO) capabilities again helps to hide latencies and bubbles by processing as much as possible in a timely manner.
The back end has also been worked over extensively to essentially double overall capabilities. There is an extra load/store AGU, store data bandwidth and L2 bandwidth have both been doubled, and data prefetch has been improved. This keeps performance at a high level, even when a partner takes the option of halving the L1 cache size.
Arm has taken a fine-toothed comb to the A78 and delivered nearly a generational performance improvement over the closely related A77. Some of this does come from the use of the advanced 5 nm process, but even iso-process the improvements from A77 to A78 are still impressive. The ability to target either 20% more performance at the same power consumption or about 50% less power at the same performance as the last generation should give partners and mobile producers a lot of flexibility when designing end user products.
Cortex-X Custom Solutions
The company is introducing a new product line for partners who want a higher performance part and do not mind paying the power and thermal price for it. This could be construed as a response to Apple’s fully custom units that lead the market in terms of overall performance.
This is an aggressive response not only to Apple, but perhaps also to the likes of AMD and Intel. Arm is willing to explore these higher performance/wattage parts now that they have significant backing from the likes of Qualcomm and Microsoft to start pushing their products into the mainstream mobile market.
The first products will be seemingly cut from the same cloth, but this does not mean that the future will look the same. AMD has successfully implemented a semi-custom division that provides partners like Sony and Microsoft with high performance gaming chips that fit the needs of the console market. It only seems natural that Arm would expand its ability to address other, higher performance markets with a custom solution.
This solution requires early engagement with partners to fully customize the design to their needs. If we consider that the A78 product design was started some 3 years ago, we expect much the same timeline for these more custom parts. Arm will probably leverage its existing IP base extensively, as AMD has done on the x86 side, but it has the potential to be more flexible in its offerings in terms of low level microarchitectural changes.
Cortex-X1
The Cortex-X1 is the first product from the Cortex-X Custom Program. It is a product that is aimed at ultimate Arm performance, but is not considered a mainstream product. Performance is aimed at the 3 GHz level with enhanced core features including increasing cache sizes to maximize peak performance.
The front end had to be reworked extensively to be able to feed the more powerful core. It now has a 5 wide instruction decode and a massive 8 wide path from the Mop cache (pre-decoded instructions). The Mop cache itself is double the size of the A77's, and the L0-BTB (the branch target buffer that lets taken branches proceed without a pipeline bubble) is 50% larger.
Not only did dispatch get buffed up, but the execution core also received a lot of attention. The OOO window size was increased by 40%, which helps expose more parallelism in code. Floating point/ASIMD is doubled from 2 x 128 bit units up to 4 x 128 bit. Machine learning performance on the CPU doubles from this improvement alone.
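To see why doubling the 128 bit units doubles peak vector throughput, it helps to count how many elements can be processed per cycle. The sketch below is a back-of-the-envelope calculation only. It assumes every unit can start one full-width operation each cycle, which is an idealization rather than a documented property of either core, and the function name is made up for illustration.

```python
def simd_lanes_per_cycle(units: int, unit_width_bits: int = 128,
                         element_bits: int = 32) -> int:
    """Peak number of elements processed per cycle across all vector units.

    Assumes (hypothetically) that each unit issues one full-width
    operation every cycle.
    """
    return units * (unit_width_bits // element_bits)


# 2 x 128 bit units versus 4 x 128 bit units, on 32 bit floats.
baseline_fp32 = simd_lanes_per_cycle(units=2)
x1_fp32 = simd_lanes_per_cycle(units=4)

# Narrower 8 bit integer elements, common in machine learning inference.
x1_int8 = simd_lanes_per_cycle(units=4, element_bits=8)

print(f"2 x 128 bit, fp32: {baseline_fp32} lanes per cycle")
print(f"4 x 128 bit, fp32: {x1_fp32} lanes per cycle")
print(f"4 x 128 bit, int8: {x1_int8} lanes per cycle")
```

Real workloads will land below these peaks, since the wider engine still has to be fed by the front end and the memory system.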
To help keep this engine running, Arm has also allowed the doubling of L2 caches, doubling of bandwidth to all caches, a 33% increase in in-flight loads/stores, and a 66% increase in L2-TLB capacity (2k entries). These improvements allow the cores to be consistently fed data and instructions so they can achieve peak throughput.
To repeat, these will not be mainstream chips. They could potentially find themselves in laptop replacements and some higher end tablets. I am sure there are other applications outside of the consumer market as well for these chips. This is a very interesting idea from Arm and we of course will be curious how it pans out. They have partners already with the chip, but they are not announcing any of them today.
Wrapping Up the CPUs
These are the two new major CPUs that will show up in 2021. The Cortex-A77 is broadly adopted now that we are deep into 2020, and we can expect similar availability for the A78 next year. Arm continues to push innovation and new products based on the latest process technologies to its licensees and partners.
The mobile space continues to grow and the applications being developed for these platforms require more performance and efficiency than ever before. AR/VR, navigation, and machine learning will continue to grow and develop. New features will suck up whatever performance Arm and its partners can provide the market.
Next year we will likely be hearing about an entirely new architecture from the design teams at Sophia Antipolis. For now we can appreciate the technology that will be arriving at our doorsteps this next year. Where once the PC market was moving at near lightspeed in terms of innovation and performance, the mobile space now is where all the action is heading.
"Where once the PC market was moving at near lightspeed in terms of innovation and performance, the mobile space now is where all the action is heading."
Well, of course. Mobile started from a low point in regard to performance and so had much room for growth. Desktop, on the other hand, had years of growth. Although when Intel finally releases their 7nm parts, a much larger performance jump is said to occur, more than in recent years anyway.