What advances have been made in the Zen architecture that will allow AMD and Intel to compete with each other?

It has been many years since AMD and Intel last truly faced off. AMD fans still miss the K7 era made possible by the EV6 bus, and the glory days when AMD64 won over Microsoft's new operating system, forcing Intel to abandon its push for IA64 and license x86-64 from AMD. But the good times did not last. Once Intel shook off the shadow of NetBurst and ushered in the Core era, AMD never got the better of Intel again. Joining the ARM camp later brought no benefit either, and meanwhile the GPU division slid into an increasingly obvious decline, pinned down by NVIDIA for a long stretch. In recent years AMD has thus earned the nickname of "slide factory": relying on PowerPoint promises to keep its fans' hopes alive, hopes those fans voice with a dose of self-deprecation every time another upgrade arrives.

This time, though, AMD seems to have given people a real reason to hope. At the Hot Chips 2016 forum this week, AMD disclosed many details of its new microprocessor architecture, Zen, and they read like signs of a comeback. So what progress has Zen made that could put AMD back on the same playing field as Intel?

A shot in the arm: the micro-op cache

For its next-generation architecture, AMD chose to abandon the existing design entirely, setting "high-performance x86 processor" as the goal from day one and rebuilding the Zen core from scratch. The previous generation, Bulldozer/Excavator, exposed too many flaws in real-world testing; AMD's core architects evidently decided it was better to tear it down and rebuild than to keep patching leaks. One change in the new core stands out immediately: AMD has added a micro-op cache to Zen.
The micro-op cache sits close to the micro-op queue, sparing the core the extra cycles of re-fetching and re-decoding instructions from the lower-level instruction cache; the effect on overall execution efficiency is substantial. Intel has long kept decoded micro-ops close to the pipeline, from the Pentium 4's trace cache to the dedicated micro-op cache introduced with Sandy Bridge, with results good enough to retain it in every generation since, so there is every reason to believe AMD's adoption will pay off as well. The open question about Zen's micro-op cache is its size. If one had to guess: micro-op caches are typically small, and Intel's holds 1,536 micro-ops in an 8-way arrangement, so AMD's parameters are likely in the same ballpark, because there is little room to do otherwise.

With this addition, AMD's claim that Zen delivers at least a 40% increase in instructions per clock (IPC) over the previous generation becomes much more credible. Of course, a micro-op cache alone would not carry 40%, so AMD also widened Zen across the board: dispatch grows from 4 to 6 micro-ops per cycle (up to 6 integer and 4 floating-point micro-ops can be dispatched simultaneously); the integer/floating-point schedulers grow from 48/60 to 84/96 entries; and the load/store and retire queues are roughly 50% longer. Ideally, these wider structures, combined with more accurate branch prediction, let the core reach peak throughput sooner and sustain it longer. With the micro-op cache, Zen has made up for a key shortcoming in its core.
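The benefit of a micro-op cache can be sketched in a few lines. This is a toy model, not AMD's or Intel's real design: the capacity matches the Sandy Bridge-class figure mentioned above, but the cycle costs and the `UopCache` interface are illustrative assumptions.

```python
# Toy model of a decoded micro-op cache. On a hit, the front end skips x86
# fetch/decode and feeds stored micro-ops straight to the micro-op queue.
# DECODE_COST and HIT_COST are hypothetical cycle counts for illustration.

DECODE_COST = 4   # cycles to fetch and decode an x86 instruction block
HIT_COST = 1      # cycles to read already-decoded micro-ops from the cache

class UopCache:
    def __init__(self, capacity=1536):           # Sandy Bridge-class size
        self.capacity = capacity
        self.store = {}                          # fetch address -> micro-ops

    def fetch(self, addr, decode):
        """Return (micro-ops, cycles spent) for the block at `addr`."""
        if addr in self.store:
            return self.store[addr], HIT_COST    # hit: decoders bypassed
        uops = decode(addr)                      # miss: pay full decode cost
        if len(self.store) < self.capacity:
            self.store[addr] = uops
        return uops, DECODE_COST

# A hot loop touches the same few addresses repeatedly, so almost all
# fetches hit and the expensive decoders sit idle.
cache = UopCache()
decode = lambda addr: [f"uop{addr:#x}.{i}" for i in range(2)]
cycles = sum(cache.fetch(addr, decode)[1]
             for _ in range(100) for addr in (0x10, 0x20, 0x30))
print(cycles)  # 309: three cold decodes at 4 cycles each, then 297 hits
```

With the assumed costs, the loop spends 309 cycles instead of the 1,200 an always-decode front end would, which is the whole argument for keeping decoded micro-ops near the queue.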
Without this step, challenging Intel would have been out of the question.

Rebuilding confidence: a reworked cache hierarchy

Compared with Bulldozer, it is no exaggeration to call Zen's cache hierarchy a complete transformation. AMD avoided giving figures such as cache latency and bandwidth, but the changes on the table should genuinely help. First, Zen converts each core's 32KB L1 data cache from Bulldozer's write-through design to write-back, so the core's cached data no longer has to be synchronized with memory on every bus cycle. Burst writes can stay at L1 speed instead of waiting for far slower memory to catch up. And since loads are statistically more frequent than stores, Zen's L1 makes the load/store unit asymmetric, with more load ports.

Zooming out, AMD dismantled the Bulldozer module entirely and built the CCX: a complex that hangs four CPU cores, each with private L1 and L2 caches, off a shared 8MB L3. This L3 is not the usual lower-level cache that fills itself according to the core's prefetch and demand requests. Its job is to house lines evicted from L1 and L2, whether pushed out before they could be used or invalidated by write-back traffic; it is more refugee camp than front line, and necessarily less efficient than the L1 and L2. But because each Zen core's 8-way L2 is a full 512KB, that inefficiency is offset to a degree. And since this victim-style L3 does not duplicate what the L2 caches already hold, it reduces redundancy across the hierarchy, indirectly raising effective cache capacity and utilization.
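The two cache changes described above, write-back instead of write-through, and an L3 filled only by evictions, can be sketched together in a toy model. All sizes (in cache lines), class names, and the single "L2" stand-in for the per-core caches are illustrative assumptions, not AMD's real parameters.

```python
from collections import OrderedDict

# Toy sketch of a Zen-like hierarchy (illustrative, not AMD's design):
#  - Write-back: a store merely dirties the line; memory is written only
#    when a dirty line finally leaves the hierarchy, not on every store
#    as with Bulldozer's write-through L1.
#  - Victim-style L3: filled only by lines evicted from L2, never by
#    demand fetches, so L2 and L3 hold no duplicate lines.

class ZenLikeHierarchy:
    def __init__(self, l2_lines=4, l3_lines=16):
        self.l2 = OrderedDict()            # addr -> dirty flag, LRU order
        self.l3 = OrderedDict()
        self.l2_lines, self.l3_lines = l2_lines, l3_lines
        self.memory_writes = 0             # write-back traffic to DRAM

    def access(self, addr, write=False):
        """Touch one line; return which level served it."""
        if addr in self.l2:
            self.l2.move_to_end(addr)
            self.l2[addr] |= write         # write-back: just set dirty
            return "L2"
        if addr in self.l3:                # hit in the victim L3:
            dirty = self.l3.pop(addr)      # exclusive, so it leaves the L3
            self._fill_l2(addr, dirty or write)
            return "L3"
        self._fill_l2(addr, write)         # miss: fill from memory,
        return "MEM"                       # bypassing the L3 entirely

    def _fill_l2(self, addr, dirty):
        if len(self.l2) >= self.l2_lines:  # evict LRU L2 line into the L3
            v_addr, v_dirty = self.l2.popitem(last=False)
            if len(self.l3) >= self.l3_lines:
                _, d = self.l3.popitem(last=False)
                self.memory_writes += d    # DRAM is written only now
            self.l3[v_addr] = v_dirty
        self.l2[addr] = dirty

h = ZenLikeHierarchy()
for a in range(6):
    h.access(a, write=True)    # six dirty lines; only four fit in L2
print(h.access(0), h.memory_writes)   # the evicted line is caught by the
                                      # victim L3; no DRAM writes yet
```

Note that line 0, pushed out of the small L2, is recovered from the L3 rather than memory, and that despite six dirty stores no DRAM write has happened, which is exactly the point of the write-back and victim-cache combination.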
The modular design AMD adopted in Zen also gives the new CPU far better scalability across the product line: one architecture can cover everything from the most power-efficient mobile chips to the hottest performance flagships, avoiding the gaps that plagued the previous generation. A single CCX, for instance, can become a low-power 4-core notebook CPU against Intel's mobile i3/i5, while two CCXs together form an 8-core desktop Zen aimed squarely at the i7. What AMD did not spell out is the interconnect between CCXs: it denied speculation about an improved HyperTransport bus but offered no specifics, leaving an open question. In short, set aside whether AMD's cache efficiency can live up to its claims of roughly double the L1/L2 bandwidth and five times the L3 bandwidth; on capacity alone, surpassing Intel's current Skylake is no problem.

Splitting one into two: real SMT at last

Intel's use of simultaneous multithreading (SMT) goes back to Hyper-Threading on the Pentium 4, revived with Nehalem in 2008. Splitting one core into two threads is genuinely hard: just teaching the two threads to get along, share cache and execution resources fairly, and not starve each other is enough to give the engineers lasting headaches. Perhaps that difficulty is exactly what kept SMT out of AMD's CPUs all these years. Next year, we should finally see 8-core/16-thread AMD CPUs.

Internally, scheduling between threads in the Zen core mostly follows a time-sharing strategy. Since different threads can have very different resource-occupancy profiles, that alone is not optimal, so AMD layers its own thread tagging and discrimination scheme on top. Thread priority in Zen is arbitrated in roughly three ways.
First, the CPU analyzes each thread's data flow to decide which has the higher algorithmic priority; for resource-heavy structures such as branch prediction and integer/floating-point register renaming, a thread's priority is adjusted accordingly. Second, when a thread performs latency-sensitive operations such as TLB and load-queue accesses (which usually surface at the top as responsiveness to user input), the CPU assigns priority based on latency-demand tags. Third, strictly ordered structures like the micro-op queue use static time-sharing, with the threads simply taking turns. Everything else is handled more crudely: first come, first served, with whichever thread needs a core resource grabbing it first.

Seen from the operating system and application level, AMD's SMT looks much like Intel's Hyper-Threading: each thread is presented as a core, with none of Bulldozer's resource-sharing restrictions. Whether AMD has merely caught up with Hyper-Threading or can surpass it remains to be seen, but it is certain that Zen's floating-point performance will improve greatly over previous AMD CPUs.

Process technology: FinFET again

By now everyone has heard the term FinFET until their ears are calloused; we covered it in earlier discussions of smartphone processors, so we will keep this short. Power consumption has always been a weak spot AMD must mind when designing its CPUs, and getting TDP under 100W takes more than aggressive clock gating. Zen intends to use the GlobalFoundries 14nm FinFET process that AMD has already field-tested on the Polaris GPUs.
Moreover, AMD does not intend to simply copy the GPU recipe; it also wants a density-optimized version of the process, because die area has to be kept under control — something AMD did not address at this Hot Chips. If the current design forced a 500-square-millimeter 14nm die, it would run against AMD's usual pricing strategy, and the final product would inevitably be expensive. And given that Zen's development has been accompanied by shifting targets all along, it is hard to say how much of what we see now the finished CPU will retain.

40% on paper, 2% in reality?

For all of the above, end users are hard to fool with slides. However impressive the numbers and the architecture, they want exactly two things: a product they can actually buy at a reasonable price, and a machine that genuinely does not feel slow. At Hot Chips 2016, AMD ran a Blender demo for attendees: a 3GHz 8-core Zen against a 3GHz 8-core Broadwell-E, with Zen finishing about 2% faster under the same multithreaded custom workload. AMD disclosed no further configuration details. Carrying its record of "rising on PowerPoint," it can only choose its words cautiously, cautiously, and cautiously again; dispelling public doubt is no easy task. If AMD manages to ship Zen in volume in the first quarter of 2017 (the schedule has already slipped once, from an original target of this October), consumers will likely first find the new CPU in branded PCs. This may be AMD's chance to return to the high-end x86 battlefield and compete with Intel again — but it must make sure it holds real cards against its old rival, and does not delay again.
