Ice Lake is the codename of Intel's next-generation platform architecture, and Intel finally unveiled it with a demonstration at Computex Taipei. At Intel's recent second-quarter earnings call, the CEO announced that Ice Lake processors had officially begun shipping to OEMs. Dell moved quickly as well: the XPS 13 7390, which uses the new Ice Lake processor and had been delayed for more than a month, has gone on sale for pre-order and will ship soon. This means that Intel's first mass-produced generation of 10nm products (not counting the lone 10nm Core i3 of Cannon Lake) will finally reach the market. With that in mind, we have compiled and organized the available analysis of the Ice Lake architecture to explore the improvements behind it.
It has been nearly five years since Intel last updated its desktop processor architecture. Skylake has to be called a very successful architecture. It may also be Intel's longest-serving processor architecture since P6, and it has allowed Intel to keep dominating the mainstream and server markets.
First, we should clarify that Ice Lake is the codename of the whole processor architecture, and a current Intel processor architecture comprises the CPU cores, the GPU and the other I/O units of the uncore. This article therefore analyzes the whole architecture, not just the CPU core microarchitecture.
Ice Lake processor architecture
Sunny Cove core microarchitecture: IPC increased by 18% on average
Front-end buffers: larger
An x86 processor core can be roughly divided into two parts: the front end and the back end. The front end mainly handles instruction fetch and decode, while the back end contains the execution units that actually carry out the instructions. Between the two sits a buffer that holds the decoded and fused micro-ops. Intel introduced micro-op fusion into its cores long ago to improve efficiency: fused micro-ops enter the buffer and are then dispatched to the back end for execution. Intel currently believes that the bottlenecks of today's programs lie mostly in memory access and front-end instruction dispatch. Sunny Cove's improvements reflect this view, so the buffers have been enlarged considerably this time.
Intel's reorder buffer (which holds instructions that were executed out of order so that they can be retired in the original program order) can now hold 352 micro-ops, an increase of 128, or 57%, whereas the step from Haswell to Skylake added only 32. Memory access has improved just as much: the load queue has grown by 56 entries and the store queue by 16, significantly more than the changes from Haswell to Skylake.
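The arithmetic behind these figures can be checked with a short sketch. The reorder buffer numbers (352 entries, up by 128) and the queue increases (56 and 16) come from the text. The absolute Sunny Cove queue sizes of 128 loads and 72 stores are an assumption based on commonly cited figures, and the helper name `growth` is made up for illustration.

```python
def growth(new_size, increase):
    """Return (old_size, percent_growth) for a buffer that grew by `increase` entries."""
    old_size = new_size - increase
    return old_size, round(100 * increase / old_size)

# Reorder buffer: 352 entries after growing by 128 (figures from the text).
rob_old, rob_pct = growth(352, 128)
print(f"reorder buffer: {rob_old} -> 352 entries (+{rob_pct}%)")

# Assumed Sunny Cove queue sizes (not stated in the text): 128 loads, 72 stores.
for name, new_size, increase in [("load queue", 128, 56), ("store queue", 72, 16)]:
    old, pct = growth(new_size, increase)
    print(f"{name}: {old} -> {new_size} entries (+{pct}%)")
```

Under these assumptions the previous reorder buffer works out to 224 entries, which matches the 57% growth quoted above.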
Now for the caches. The new core finally enlarges the L1 data cache, which had gone unchanged for ages, from 32KB to 48KB. Although that is only 16KB more, bear in mind that the design of a 32KB L1 instruction cache plus a 32KB L1 data cache had been in use since the Core microarchitecture, the first generation of the Core series. The bandwidth of the L1 data cache has been increased as well. The L2 cache attached to each core is doubled outright to 512KB, the largest change to the per-core caches since the Nehalem architecture built an L2 cache into each core and added a separate shared L3 cache.
Comparison of the Skylake and Sunny Cove core architectures: Skylake on the left, Sunny Cove on the right
The changes to the rest of the front end are smaller. They mainly improve the prefetcher and branch predictor and enlarge the micro-op cache so that it can sustain the delivery of 5 (6) instructions per cycle.
Back end: wider
Skylake at the top, Ice Lake at the bottom; note the ports
The back end has also changed considerably. Sunny Cove has two more execution ports than Skylake, for a total of 10, and the ports are allocated more finely. There are ports dedicated to load and store addresses, and two ports are dedicated to data access.
As for execution units, Sunny Cove adds units that support AVX-512 instructions, something Skylake Server already had. It also brings in IDIV, the hardware integer divider first added in Cannon Lake, along with a new MulHi unit for multiplication instructions.
With the AVX-512 units, a Sunny Cove core can process one 512-bit instruction or two 256-bit instructions at a time.
For the core interconnect, desktop Ice Lake will keep the ring bus design, while the server parts will continue with the mesh design of Skylake Server.
Instruction set and AI acceleration
With the new units comes an expanded instruction set. Many new instructions have been added for encryption and decryption, AI acceleration, general-purpose computing, special-purpose computing and more, most notably in the AVX-512 family.
For artificial intelligence, which has become popular in recent years, Intel has taken two steps. First, it has added its own "Gaussian & Neural Accelerator" (GNA) to the uncore, similar to the AI acceleration hardware common in mobile phone SoCs. Second, it has introduced the AVX512 VNNI instruction set, which uses the AVX-512 units to accelerate AI-related computation; Intel calls this acceleration "DL (Deep Learning) Boost". This is a clever approach: the dedicated unit guarantees a baseline of acceleration, while the new instructions let software make fuller use of the new CPU features.
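To make the VNNI idea concrete, here is a minimal Python emulation of one such operation. The article does not name individual instructions; the sketch assumes the semantics of VPDPBUSD from AVX512 VNNI, which multiplies unsigned bytes by signed bytes, sums each group of four products and adds the sum to a 32-bit accumulator without saturation. The function names are hypothetical and only illustrate the data flow, not real performance.

```python
def to_int32(x):
    """Wrap an integer to signed 32-bit, as a hardware accumulator would."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def vpdpbusd_lane(acc, a_u8, b_s8):
    """One 32-bit lane: acc += sum of four (unsigned byte * signed byte) products."""
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    return to_int32(acc + sum(a * b for a, b in zip(a_u8, b_s8)))

def vpdpbusd_512(acc, a_u8, b_s8):
    """A full 512-bit operation: 16 lanes, i.e. 64 byte pairs per instruction."""
    assert len(acc) == 16 and len(a_u8) == len(b_s8) == 64
    return [
        vpdpbusd_lane(acc[i], a_u8[4 * i:4 * i + 4], b_s8[4 * i:4 * i + 4])
        for i in range(16)
    ]

# 10 + (1*5 + 2*(-6) + 3*7 + 4*(-8)) = 10 - 18
print(vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8]))  # -8
```

A single 512-bit operation therefore performs 64 multiplications and the matching additions, which is why 8-bit quantized inference maps well onto the AVX-512 units.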
The encryption and decryption instructions have changed as well, for example with higher AES throughput and a new series of instructions for the SHA algorithms. In short, given suitable compiler optimization, Ice Lake's encryption and decryption performance is much better than Skylake's.
Summary
To summarize briefly, the improvements of the Sunny Cove microarchitecture are:
Improved prefetcher and branch predictor performance
L1 data cache 50% larger
L1 cache store bandwidth increased by 100%
L2 cache 100% larger
Micro-op cache 50% larger
25% more micro-ops can enter the out-of-order reorder buffer per cycle
Out-of-order reorder buffer 57% larger
25% more back-end execution ports
Support for new instruction sets such as AVX-512
Thanks to these improvements, Sunny Cove achieves an average IPC gain of 18% over Skylake and 47% over Broadwell or Haswell. In tests optimized for AVX-512, it can be 2 to 2.5 times faster than the previous generation of low-voltage mobile processors. At a time when Moore's law is advancing slowly, these are very high numbers.
As an aside, many of these improvements were already made in Cannon Lake, such as AVX-512, the related instruction set changes and the increased cache bandwidth. Some changes were handed down from the Skylake Server architecture; the AI acceleration instructions, for example, had already appeared on server processors. But because Intel effectively abandoned Cannon Lake, the Sunny Cove core that inherited its improvements can show an average IPC gain of 18% over Skylake. Had everything gone to plan with no 10nm delay, Ice Lake would have been the generation after Cannon Lake, and its generational gain would not have looked as large.
11th-generation graphics architecture
Ice Lake's integrated graphics reaches 1 TFLOPS of compute performance for the first time and adds many new features, a substantial improvement. Intel describes this generation of integrated graphics as its "most powerful ever". How did Intel do it?
Scaling up with the 10nm process
Intel's 10nm process has greatly increased transistor density. In the 14nm era the integrated graphics had at most 24 EUs; on Ice Lake this rises 2.67-fold to as many as 64 EUs. The clock is not low either: it can reach 1100MHz, only 50MHz lower than before. Overall FP32 throughput of the integrated graphics thus reaches 1.15 TFLOPS. Accordingly, Intel officially claims about 1.8 times the average frame rate of the 9th-generation integrated graphics found on 8th-generation Core processors.
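The scaling can be reproduced with a back-of-the-envelope estimate. The sketch below assumes each EU delivers 16 FP32 operations per clock (two SIMD4 ALUs, each doing a fused multiply-add per lane); that per-EU figure is an assumption, not something stated in this article, and the function name is made up for illustration.

```python
def peak_fp32_tflops(eu_count, clock_mhz, flops_per_eu_per_clock=16):
    """Theoretical peak FP32 throughput in TFLOPS.

    flops_per_eu_per_clock = 2 ALUs x 4 lanes x 2 (FMA) is an assumed figure.
    """
    return eu_count * flops_per_eu_per_clock * clock_mhz * 1e6 / 1e12

print(f"EU scaling 24 -> 64: {64 / 24:.2f}x")
print(f"24 EUs @ 1150 MHz: {peak_fp32_tflops(24, 1150):.2f} TFLOPS")
print(f"64 EUs @ 1100 MHz: {peak_fp32_tflops(64, 1100):.2f} TFLOPS")
```

Under that assumption, 64 EUs at 1100MHz come out at about 1.13 TFLOPS, in the same range as the 1.15 TFLOPS quoted above; the small gap may come from rounding or a slightly different clock.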
You may be wondering where the 10th generation went. It is in the short-lived Cannon Lake, and on the only Cannon Lake chip the integrated graphics was disabled.
At present, on low-voltage mobile Ice Lake processors, Intel offers integrated graphics in G1, G4 and G7 configurations with 32, 48 and 64 EUs respectively. The low-end G1 is still named "UHD", while G4 and G7 carry the "Iris Plus" brand.
Besides stacking more EUs thanks to the process advance, optimizing the internal architecture also matters.
Internal architecture optimization
First, the scale is expanded by adding more subslices to a single slice, which increases the amount of computation per cycle.
Second, the cache system has been reworked to expand the L3 cache. Intel states that the GPU's L3 cache is 3MB, with 0.5MB of shared local memory. In addition, the processor's memory controller has been upgraded to provide higher memory bandwidth.
New interface versions and enhanced hardware encoding
One of the most frustrating things for the editor last month was buying a 1440p display with a 144Hz refresh rate. Connected to a laptop over HDMI, it could only be driven at 60Hz at 1440p. The reason is that the old 9th-generation integrated graphics supports HDMI only up to version 1.4, which tops out at 4K@30Hz, or 120Hz at 1080p and below, and the editor's laptop has no USB-C or DP output.
Ice Lake finally solves this pain point with support for HDMI 2.0b and DP 1.4 HBR3. Needless to say, both allow much higher resolutions and refresh rates, and HDR is supported as well.
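A rough bandwidth estimate shows why the old interface fell short. The sketch compares the raw pixel data rate of a few modes against the maximum video data rates of HDMI 1.4 (8.16 Gbit/s) and HDMI 2.0 (14.4 Gbit/s). Those two limits come from the HDMI specifications rather than from this article, and the calculation assumes 24 bits per pixel and ignores blanking intervals, so it is a lower bound on what a real link needs.

```python
# Maximum video data rates after 8b/10b encoding overhead, in Gbit/s.
HDMI_DATA_RATE_GBPS = {"HDMI 1.4": 8.16, "HDMI 2.0": 14.4}

def pixel_rate_gbps(width, height, refresh_hz, bits_per_pixel=24):
    """Raw pixel data rate in Gbit/s, ignoring blanking (a lower bound)."""
    return width * height * refresh_hz * bits_per_pixel / 1e9

modes = {
    "1440p @ 60Hz": (2560, 1440, 60),
    "1440p @ 144Hz": (2560, 1440, 144),
    "1080p @ 120Hz": (1920, 1080, 120),
    "4K @ 30Hz": (3840, 2160, 30),
}

for label, (w, h, hz) in modes.items():
    need = pixel_rate_gbps(w, h, hz)
    fits = [name for name, cap in HDMI_DATA_RATE_GBPS.items() if need <= cap]
    print(f"{label}: {need:.2f} Gbit/s, fits {', '.join(fits) or 'neither'}")
```

Under these assumptions, 1440p at 144Hz needs about 12.74 Gbit/s, beyond HDMI 1.4 but within HDMI 2.0, while the other three modes stay under 6 Gbit/s and fit both.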
In addition, the video encoding hardware, the dedicated circuitry behind Intel's Quick Sync feature, has also been greatly improved in the new integrated graphics. It now supports two simultaneous HEVC 10-bit encodes: up to two 4K 60fps video streams in YUV 4:4:4, or one 8K 30fps video stream in YUV 4:2:2.
Variable rate shading (VRS)
VRS stands for variable rate shading. It is a new technology that lets the GPU adjust shading precision according to the importance of each region of the image. We covered its effect in an earlier news item, "Let's compare the performance improvement brought by VRS variable rate shading technology: 3DMark will add an image comparison to its feature tests".
VRS saves GPU resources on the unimportant parts of the image so that they can be spent rendering the more important parts, raising the overall frame rate. NVIDIA has already added support in its Turing GPUs. Intel has not fallen behind: it provides this feature in the 11th-generation integrated graphics and has announced that it will work with Epic to add it to Unreal Engine. Civilization VI already supports the technology and, according to Intel's data, gains 30% in frame rate.
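The saving can be illustrated with a simple count of pixel-shader invocations. The sketch assumes a 1920x1080 frame in which half of the pixels are shaded at a coarse 2x2 rate (one invocation per four pixels); both the fraction and the block size are illustrative assumptions, not measurements.

```python
def shader_invocations(width, height, coarse_fraction, block=(2, 2)):
    """Pixel-shader invocations when `coarse_fraction` of the frame is shaded once per block."""
    pixels = width * height
    coarse_pixels = pixels * coarse_fraction
    fine_pixels = pixels - coarse_pixels
    return fine_pixels + coarse_pixels / (block[0] * block[1])

full = shader_invocations(1920, 1080, 0.0)   # every pixel shaded individually
mixed = shader_invocations(1920, 1080, 0.5)  # half of the frame shaded at 2x2
print(f"invocations saved: {100 * (1 - mixed / full):.1f}%")  # 37.5%
```

The actual frame rate gain is smaller than the invocation saving whenever a frame is not limited purely by pixel shading, which is consistent with gains in the range the article quotes.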
Summary
The GPU's improvement comes mainly from its increased scale; the architectural changes are small and mostly concern the cache system. Even so, the progress of the 11th-generation integrated graphics is obvious.
Perhaps in the future integrated graphics will no longer be of marginal use and will be able to run games at 30 frames per second at 1080p on low settings.
Uncore section
The uncore is the part of the processor other than the cores and the GPU. In the architecture diagram at the top, it is the System Agent. It has seen no major change since Intel moved the memory controller and PCI-E controller into the CPU with Nehalem. This time, however, Intel has added something new and upgraded many old components.
Thunderbolt 3
One of the main reasons people have stayed away from Thunderbolt (hereafter TB) devices is the somewhat high cost of using the interface. Adoption did rise considerably after TB3 moved to the USB Type-C connector, but other obstacles remained. One is that TB requires an additional controller chip on the motherboard, and that chip is not cheap. With Ice Lake, Intel has finally integrated the TB controller into the processor. It no longer takes up the processor's PCI-E lanes or crowds the already busy DMI 3.0 link to the PCH; instead it has its own stop on the ring bus.
Moreover, Intel generously provides as many as four TB3 ports at once, each with a full PCI-E 3.0 x4 connection. In other words, an Ice Lake processor effectively has 32 PCI-E 3.0 lanes, half of them provided in the form of TB3. These ports also support USB mode, of course; when running as USB 2.0 they are routed back to the PCH.
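The lane count and the bandwidth behind each port follow from simple arithmetic. The sketch uses the four ports and x4 width from the text; the PCI-E 3.0 signalling rate of 8 GT/s per lane and the 128b/130b encoding are taken from the PCI-E 3.0 specification rather than from this article.

```python
PCIE3_GT_PER_LANE = 8.0      # PCI-E 3.0 signalling rate per lane, in GT/s
PCIE3_ENCODING = 128 / 130   # 128b/130b line encoding efficiency

def pcie3_gbytes_per_s(lanes):
    """Usable one-direction bandwidth of a PCI-E 3.0 link, in GB/s."""
    return lanes * PCIE3_GT_PER_LANE * PCIE3_ENCODING / 8

tb3_ports, lanes_per_port = 4, 4          # figures from the text
tb3_lanes = tb3_ports * lanes_per_port    # half of the 32 lanes mentioned above
print(f"TB3 lanes: {tb3_lanes}")
print(f"per port (x4): {pcie3_gbytes_per_s(lanes_per_port):.2f} GB/s")
print(f"all four ports: {pcie3_gbytes_per_s(tb3_lanes):.2f} GB/s")
```

The four x4 links add up to 16 lanes, matching the "half of 32" in the text, with a little under 4 GB/s of PCI-E bandwidth behind each port.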
Of course, not every manufacturer will provide all four TB3 ports; the exact configuration is up to the OEM. After all, supporting chips such as the separate USB PD IC add cost, and TB ports also need additional retimer chips. Intel has, however, halved the number of retimers required: two TB3 ports need only one retimer.
However, integrating the TB controller into the CPU also makes the I/O portion of the System Agent more complex. The diagram above shows the details. One Type-C CIO router (labeled "CIO router" in the figure) has two PCI-E 3.0 x4 links to the CPU, and the CPU