In-depth review: Intel's first-generation 10nm Cannon Lake processor

For anyone who follows cutting-edge semiconductor technology, Intel's latest manufacturing process has been a long-running saga. The 10nm process was first announced in 2014 and was originally scheduled for mass production in 2016, but it has been postponed repeatedly ever since. The first mobile processor built on 10nm did ship at the end of 2017, but only in "special-supply" notebooks delivered to some universities in China.
Intel's 10nm processor belongs to the 8th-generation Core series, and its model number is i3 8121U. The processor is listed on ARK (Intel's online product database): the core is code-named Cannon Lake, it is in part an improved version of the 14nm Skylake core, and it was officially released in the second quarter of 2018. The i3 8121U has a TDP of 15W, a dual-core, four-thread design, a base frequency of 2.2GHz and a turbo frequency of 3.2GHz; its clocks are even lower than those of the 14nm Kaby Lake part with the same 15W TDP. The most novel point is that, although this is a mobile processor, it supports the AVX-512 instruction set previously found only on server and high-end desktop platforms, so it can handle vector operations the way enterprise-class hardware does.
The fact that the i3 8121U's frequency went backwards rather than forwards raised doubts about the real-world performance of Intel's 10nm process. A research article from SemiAccurate noted that at the time (around the i3 8121U launch) Intel's 10nm process still had many problems: yield was only about 10%, far below the expected 60%, and SAQP, COAG, cobalt and process tuning were all far behind schedule and expectations. In the following months, rumors circulated that the 10nm process was severely stalled, that Intel might even abandon 10nm development, or that Intel would lower its targets to get the process working; Intel denied these rumors one by one. Fortunately, at this year's CES, Intel demonstrated Ice Lake processors built on the 10nm process with the new Sunny Cove architecture, which reassured those following the new process.
Although Ice Lake has not yet launched, AnandTech managed to obtain, through various channels, one of the "special-supply" laptops that use the Cannon Lake processor, and tested it in detail.
Where do the difficulties of the 10nm process lie?
In September 2017, at its Technology and Manufacturing event in Japan, Intel showed a complete 300mm wafer of 10nm Cannon Lake chips. TechInsights measured the die area at about 70.5mm², which makes the i3 8121U Intel's smallest dual-core processor to date. Compared with the Skylake processors of the time (6th-generation Core), however, the i3 8121U uses a design with separate CPU and GPU, so its level of integration is lower.
A common industry yardstick for a semiconductor process is the number of transistors integrated per square millimeter. Not every part of a CPU consists of compute transistors: there are SRAM cells, and there is also "dead" silicon that serves as a thermal buffer between regions. Transistors can also be counted in different ways, since a 2-input NAND logic cell is much smaller than a complex scan flip-flop logic cell. Intel reports transistors per unit area separately for the two cell types: the 2-input NAND cell has a density of 90.78 MTr/mm² (million transistors per square millimeter) and the scan flip-flop cell 115.74 MTr/mm². Weighting the two 60/40 gives a 10nm transistor density of 100.8 MTr/mm², 2.7 times the 37.5 MTr/mm² of the 14nm process.
Intel also disclosed at the International Electron Devices Meeting that, depending on the required functionality, the 10nm logic technology offers several types of cell library, including a short library (high density), a tall library (high performance) and an ultra-tall library (ultra-high performance). The shorter the library, the lower the power consumption and the higher the transistor density, but the lower the peak performance.
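The weighted density figure quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch using only the numbers given in the text; the function and variable names are illustrative, not Intel's.

```python
# Figures quoted in the article, in MTr/mm^2 (million transistors per mm^2).
NAND2_DENSITY = 90.78      # 2-input NAND cell
SCAN_FF_DENSITY = 115.74   # scan flip-flop cell
DENSITY_14NM = 37.5        # Intel 14nm, same metric


def weighted_density(nand2, scan_ff, nand2_weight=0.6, ff_weight=0.4):
    """Combine the two cell densities using the 60/40 weighting."""
    return nand2_weight * nand2 + ff_weight * scan_ff


density_10nm = weighted_density(NAND2_DENSITY, SCAN_FF_DENSITY)
scaling = density_10nm / DENSITY_14NM

print(f"10nm weighted density: {density_10nm:.1f} MTr/mm^2")
print(f"Scaling over 14nm: {scaling:.1f}x")
```

The 60/40 split gives more weight to the small NAND cell, so the headline number sits closer to the NAND density than to the flip-flop density.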
Intel's 10nm process therefore actually comes in several different densities, and only the highest-density short library reaches 100.8 MTr/mm². Real chips are usually built from a mix of libraries: the shorter libraries are used for I/O and other uncore areas that are cost-sensitive but not performance-sensitive, while the taller libraries, with lower density and higher drive current, are generally used in the performance-sensitive core region.
To better understand Intel's 10nm process, we first need to go over the fin, the gate, the cell structure, and a number of terms and definitions associated with transistors and FinFETs. In a FinFET, the source-to-drain fin (gray) passes through the gate (green) and is embedded in oxide. The key metrics here are fin height, fin width and gate length. The goal of every semiconductor process is to make each of these as small as possible while pushing per-cell performance as high as possible.
Starting with its 22nm process, Intel used tri-gate transistors containing multiple fins to increase the total drive current and achieve better performance. This introduces a new metric, "fin pitch", the distance between fins. When several gates cross the fins, the distance between gates is called the "gate pitch". The more contact the gate has with the fin, and the smaller the fin pitch, the lower the leakage and the better the performance; this allows the drive current to increase while keeping parasitic capacitance and gate capacitance under control.
With the 14nm process that followed, fin height, fin width and gate length were all revised again and the gate covers more of each fin, giving better performance. With the 10nm process Intel refined the fin structure once more: the fin pitch shrank from 42nm to 34nm and the fin width from 8nm to 7nm, to limit parasitic capacitance. The changes may not look like much, but at this scale every nanometer matters.
Intel further improved the contacts by adding a conformal titanium layer in the contact region between the source and drain diffusion regions (the gray tip below the gate); the contact resistance between fins and trenches also has to be minimized. In the 10nm process, Intel switched the contacts from tungsten to cobalt, which reduces contact line resistance by 60%. All of these improvements make the technology extremely challenging.
Fins and gates combine to form the basic circuit unit, the cell. In a scanning electron microscope image of the 22nm process, the cells shown have six fins and two fins (other configurations exist as well), with different gate lengths, and within each cell there are active fins that carry current and inactive fins that act as spacing. In the 10nm process, the high-density cell library has eight fins in total, five of which are active; these cells suit I/O and similar cost-sensitive circuitry that does not need very high performance. The high-performance and ultra-high-performance libraries have 10 and 12 fins respectively, each adding one more N fin and one more P fin than the library below it, which provides additional drive current and trades some peak efficiency for higher performance.
Between cells there are usually dummy gates that act as spacers. In Intel's 14nm process each cell has a dummy gate at both ends, so two dummy gates sit between two adjacent cells. In the 10nm process, two adjacent cells can share a single dummy gate, which improves density; Intel says this saves up to 20% of die area.
Inside a transistor, the gate normally extends slightly beyond the cell on two sides so that it can be contacted past the source and drain contacts, which inevitably takes up extra planar area. In the 10nm process, at least in the version used for the current Cannon Lake processor, Intel uses a design called "contact over active gate" (COAG), in which the gate contact is placed vertically above the cell.
This design adds several steps to the manufacturing process (an extra etch, deposition and polish), but provides roughly 10% area scaling. As mentioned earlier, the SemiAccurate research article described COAG as a high-risk approach: Intel has built it and it does work, but it is not as reliable as expected. On the Cannon Lake core, COAG appears to work only under low-performance, low-power or high-performance, high-power conditions. Hopefully Intel will explain its improvements to COAG in more detail when the next-generation 10nm Ice Lake processors officially go on sale.
Returning to transistor density, another way to measure it is CPP × MMP: the contacted poly pitch (the gate pitch) multiplied by the minimum metal pitch. With all of these improvements combined, Intel's CPP × MMP is just 54nm × 44nm, only marginally behind the 7nm processes of TSMC and Samsung, which is why Intel has always stressed that the other two companies' node names are just names.
Uncovering the architecture
Although the Cannon Lake Core i3 8121U is still under NDA, after a year of tireless research by many colleagues in the technology community, its basic architecture has finally been uncovered. Overall, the Cannon Lake core looks much like a hybrid of the client Skylake core and the server Skylake-SP core. It uses the standard client 4+1 decoder arrangement, eight execution ports and an L1 + L2 + L3 cache structure, but it also brings over an AVX-512 unit from the server core, and its L1 data cache can sustain 2 × 512-bit loads and 1 × 512-bit store per cycle. Looking further, the Cannon Lake core also shows a small part of the second-generation 10nm design, the Sunny Cove architecture: some instructions that are absent from the Skylake and Skylake-SP cores are present on both Cannon Lake and Sunny Cove.
In addition, although it is not clear what changed in the Cannon Lake front end, the reorder buffer is the same size as on the Skylake core, 224 micro-ops, and most of the architectural improvements in Sunny Cove (doubled memory bandwidth, more execution ports and improved ports) do not appear on the Cannon Lake core.
The new instructions supported by Cannon Lake include IFMA (Integer Fused Multiply-Add), VBMI (Vector Byte Manipulation Instructions) and hardware-based SHA (Secure Hash Algorithm). IFMA is a 52-bit integer fused multiply-add (FMA) that behaves the same way as the AVX-512 floating-point FMA, with a latency of four clock cycles and a throughput of two per clock for xmm/ymm (for zmm, the latency is four and the throughput is one per clock). The instruction is typically used to assist cryptography, but it also helps with arbitrary-precision arithmetic. VBMI provides four instructions, VPERMB, VPERMI2B, VPERMT2B and VPMULTISHIFTQB, which are useful for byte shuffling. SHA is purely a hardware accelerator meant to speed up the algorithm, but tests show that both the Cannon Lake core and AMD's Zen are still slower at it than Goldmont (an Atom processor core), so hardware SHA is not particularly useful, at least on the i3 8121U.
In addition to adding new instructions, Intel usually also improves existing instructions on a new core, raising throughput or reducing latency (or both). The Cannon Lake core also supports the Vector-AES feature, which lets AES instructions use more of the AVX-512 unit at once and thereby multiplies throughput. The biggest change on the Cannon Lake core is hardware support for 64-bit integer division, which no longer has to be split into several operations: a 64-bit IDIV completes in 18 clock cycles.
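To make the 52-bit IFMA operation described above more concrete, here is a minimal Python sketch that emulates a single 64-bit lane. The article does not name the individual instructions; the sketch assumes the two AVX-512 IFMA instructions documented by Intel, VPMADD52LUQ (adds the low 52 bits of the product) and VPMADD52HUQ (adds the high 52 bits). The Python function names are illustrative, and real hardware applies the operation to every lane of a vector register at once.

```python
MASK52 = (1 << 52) - 1   # keeps the low 52 bits of an operand
MASK64 = (1 << 64) - 1   # accumulator lanes are 64 bits wide


def madd52lo(acc, a, b):
    """One lane of VPMADD52LUQ: acc + low 52 bits of (a52 * b52)."""
    product = (a & MASK52) * (b & MASK52)   # up to 104 bits wide
    return (acc + (product & MASK52)) & MASK64


def madd52hi(acc, a, b):
    """One lane of VPMADD52HUQ: acc + high 52 bits of (a52 * b52)."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64


# Usage: the two halves together reconstruct the full 104-bit product,
# which is the building block for multi-limb (arbitrary-precision) multiplies.
a = (1 << 52) - 1
b = 12345678901234
lo = madd52lo(0, a, b)
hi = madd52hi(0, a, b)
print(lo + (hi << 52) == a * b)
```

Splitting the product into 52-bit halves leaves 12 spare bits in each 64-bit accumulator, so several partial products can be summed before carries need to be propagated, which is why the instruction suits big-number arithmetic.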
By comparison, a 64-bit IDIV takes 45 clock cycles on Zen and 97 on the Skylake core. For block string stores, the whole REP STOS* series can now use the 512-bit write port, for a throughput of 61 bytes per clock cycle, compared with 43 on Skylake-SP, 31 on Skylake and 14 on Zen. For word-granularity integer vectors, the latency of the AVX512BW VPERMW instruction drops from six clock cycles to four, and its per-clock throughput doubles. Also for vectors, the VMOVSS and VMOVSD instructions, which move or merge single- and double-precision scalars, now behave like the other MOV instructions. Other beneficial adjustments include ZMM division and square root that are one clock faster, and some GATHER functions whose throughput rises from one every four clocks to one every three.
The regressions appear in older instructions: x87 DIV and SQRT, REP CMPS, LFENCE and MFENCE are each one clock slower, and other instructions are slower still, presumably to encourage people to abandon the old instructions.
The relative shortcomings of the Cannon Lake core include: the VPCONFLICT* instructions are still slow, with a latency of three clock cycles and a throughput of one per clock, while the DWORD ZMM form has a latency of 26 clocks and a throughput of one every 20 clocks; the cache line write-back instruction CLWB, which the Skylake-SP core supports, is not supported; and SGX (Software Guard Extensions) is not supported.
Processor specification comparison
In testing, the i3 8121U is compared against the i3 8130U, a dual-core, four-thread Kaby Lake mobile processor built on the 14nm process. It has the same 15W TDP and the same base frequency as the i3 8121U, but a slightly higher turbo frequency. A 15W TDP mobile processor like this can easily hit its thermal limit and throttle.
In testing, the i3 8121U throttles very frequently: in AVX2 applications it simply runs at its 2.2GHz base frequency, and in AVX-512 applications it even drops to 1.8GHz, below base. By contrast, the i3 8130U on the mature 14nm process sustains 2.8GHz in AVX2 applications. In the POV-Ray test, for example, the i3 8130U finishes faster and delivers 26% more performance than the i3 8121U.
Still, although the i3 8121U runs AVX-512 applications at a low frequency, the advanced instruction set delivers outstanding performance. In the 3DPM test, the i3 8121U scores 3846 points with AVX-512 enabled at 1.8GHz, about six times the score of the i3 8130U running AVX2 at 2.8GHz.
Memory performance and power tests
For the cache and memory latency tests, Turbo is disabled on both the i3 8121U and the i3 8130U, forcing them to run at the same 2.2GHz so that the two architectures can be compared directly. The cache and memory subsystem of the Cannon Lake core is the same as that of the Skylake core, with no further improvements, so in theory the results should be essentially identical. In this test the cache access latencies of the two processors are indeed almost the same, but the memory access latency of the Cannon Lake i3 8121U is up to 50% higher than that of the Kaby Lake i3 8130U, a startling jump (and certainly not a good thing). The i3 8121U system does use DDR4-2400 memory at 17-17-17 timings, slightly behind the 16-16-16 timings of the i3 8130U system, but such a small timing difference is hardly enough to have such a large effect. The only plausible explanation is that the Cannon Lake memory controller has very high overhead, perhaps as a side effect of the Spectre and Meltdown mitigations.
Power consumption is more complicated and confusing. As we know, Intel defines two key power limits in the processor hardware, PL1 and PL2: the former governs steady-state power, and the latter governs short-term Turbo power.
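The interaction between the two limits can be sketched with a deliberately simplified model. The Python below assumes a fixed-length turbo window during which the chip may draw PL2, after which it falls back to PL1; real processors track a running average of power rather than a timer, and every wattage, window length and run time in the example is hypothetical rather than measured.

```python
def allowed_power_w(t_s, pl1_w, pl2_w, turbo_window_s):
    """Power cap at time t_s under a simplified, timer-based PL1/PL2 model."""
    return pl2_w if t_s < turbo_window_s else pl1_w


def energy_mwh(duration_s, pl1_w, pl2_w, turbo_window_s):
    """Energy used by a load that always runs at the cap, in 1-second steps."""
    joules = sum(
        allowed_power_w(t, pl1_w, pl2_w, turbo_window_s)
        for t in range(duration_s)
    )
    return joules / 3.6   # 1 mWh = 3.6 J


# Hypothetical chip A: higher limits, finishes the job sooner.
energy_a = energy_mwh(duration_s=180, pl1_w=15.0, pl2_w=25.0, turbo_window_s=10)
# Hypothetical chip B: lower limits, but needs longer for the same job.
energy_b = energy_mwh(duration_s=250, pl1_w=12.5, pl2_w=19.0, turbo_window_s=5)

print(f"A: {energy_a:.0f} mWh, B: {energy_b:.0f} mWh")
```

The point of the sketch is that energy is power multiplied by time: a chip with lower power limits can still consume more energy on a fixed task if it takes sufficiently longer to finish.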
In most cases a processor's steady-state power equals its TDP. The i3 8130U is like that: its steady-state power is 15W. The i3 8121U has the same 15W TDP, yet its steady-state power is only 12.6W. The same is true of the peak power governed by PL2: the i3 8130U peaks at 24.2W, while the i3 8121U reaches only 18.7W, and its Turbo duration is also much shorter than the i3 8130U's. What is worse, although the i3 8121U has a lower power limit, its lower frequency and weaker performance mean it actually consumes more energy to complete a task. In the POV-Ray test, the Kaby Lake i3 8130U consumes 768mWh in total, whereas the Cannon Lake i3 8121U consumes 867mWh, a full 12.9% more.
Same-frequency test at 2.2GHz: SPEC2006
Besides power, the other core question about Cannon Lake is whether it is an efficient architecture. To compare IPC directly, we lock both processors to the same 2.2GHz frequency and run SPEC2006. SPEC2006 is an important benchmark that differs from other test software in that its data sets are larger and more complex. As a more representative benchmark, it can reveal more architectural detail. In the results, the two cores perform almost identically. The Kaby Lake i3 8130U appears to hold a larger lead over the Cannon Lake i3 8121U in the SIMD-related 462.libquantum and 470.lbm tests, which may be related to the difference in memory latency between the two.
Same-frequency test at 2.2GHz: overall system performance
The system tests focus on the real-world user experience, including application load time, image processing, simple physics, emulation, neural simulation, optimized compute and 3D model development.
GIMP application load time
System responsiveness is one of the metrics most closely tied to user experience, and a good test is to see how long an application takes to load. In this test the Cannon Lake i3 8121U does particularly well.
FCAT image processing
The FCAT software takes recorded video and processes the color data in each frame into frame-time data, so that the system can plot a visual frame rate. The test is single-threaded. At base frequency, the Cannon Lake i3 8121U and the Kaby Lake i3 8130U finish within half a second of each other, with the i3 8121U slightly ahead.
3DPM particle movement
3DPM is a custom benchmark designed to simulate six different algorithms for the movement of particles in 3D space. A key part of the algorithms is a relatively fast random number generator, which ends up creating dependency chains in the code. In this test we run each of the six algorithms on a set of particles for 20 seconds with a 10-second pause in between, and report the overall particle movement rate in millions of movements per second. With AVX not enabled, the Cannon Lake i3 8121U loses to the Kaby Lake i3 8130U. With AVX enabled on each, however, the i3 8121U posts an extraordinary 4519 points, even beating the 4185 points of the 18-core Core i9 7980XE.
Dolphin 5.0 emulator
Dolphin 5.0 is a GameCube and Wii console emulator that lets you play the exclusive titles of these older consoles on a PC. Emulating these two consoles, which use Power Architecture processors, usually requires a reasonably capable host processor. In this test, the two processors perform essentially the same at the same frequency.
DigiCortex sea slug brain simulation
The DigiCortex benchmark was originally designed to simulate and visualize neuron and synapse activity. The software ships with a variety of standard models; this test uses the small benchmark, which simulates 32,000 neurons and 1.8 billion synapses, roughly the size of a sea slug's brain.
The simulation is divided into "non-firing" and "firing" modes: the former is more affected by memory, and the latter depends more on raw processor performance. This test uses the latter, and the two processors perform essentially the same at the same frequency.
y-Cruncher scientific computing
y-Cruncher is a tool for computing various mathematical constants. It offers binaries with different optimizations, single-threaded and multi-threaded, including AVX-512 optimized binaries. The test computes 250 million digits of pi in single-threaded and multi-threaded modes. Unsurprisingly, the Cannon Lake i3 8121U wins; so far it has won every test in which the software can use AVX-512.
Agisoft PhotoScan: 2D images to 3D model
PhotoScan converts a set of 2D images into a 3D model, an important tool for model development and archiving, and it relies on a number of single-threaded and multi-threaded algorithms. The test uses PhotoScan v1.3.3 with a large data set of 84 18-megapixel images, run through a fairly fast variant of the algorithms, and compares the total time for the conversion. In this test, the two processors perform essentially the same at the same frequency.
Same-frequency test at 2.2GHz: rendering performance
Rendering performance is typically a key processor metric in professional environments, spanning 3D rendering to rasterization and covering meshes, textures, collisions, aliasing, physics and more. Most renderers support CPU rendering, while only a few support GPUs, FPGAs or ASICs. For large studios, the CPU is still the hardware of choice.
Corona 1.3 rendering
Corona is an advanced, performance-focused renderer for software such as 3DS Max and Cinema 4D. Its benchmark GUI shows the scene as it is built and reports the render time to the user.
This test uses a command-line version with direct output. Rather than reporting the render time, it reports the average number of rays per second over six runs, because a performance-per-unit-time figure is generally easier to understand. Corona supports instruction sets only up to AVX2, so it cannot take full advantage of the Cannon Lake core. In this test the i3 8121U trails the i3 8130U by about 10% at the same frequency.
Blender 3D creation software
Blender is an advanced open-source rendering tool with a large number of configurable options, used by many world-renowned animation studios. Its development team recently released a benchmark package; this test runs the suite's "bmw27" sub-scene from the command line and measures the time to complete the render. Blender also supports only up to AVX2. In this test the two processors perform about the same at the same frequency, with the Cannon Lake i3 8121U slightly ahead.
LuxMark
This benchmark, built on the LuxRender engine, offers several scenes and APIs. This test runs the simple "Ball" scene on both the C++ and OpenCL code paths. The scene starts as a rough render and slowly improves in quality over two minutes, and the final result is the number of rays rendered per second.
POV-Ray ray tracing
The Persistence of Vision ray tracing engine is another well-known benchmarking tool. It was largely dormant until AMD released its Ryzen processors, after which both Intel and AMD began submitting code to the main branch of the open-source project. This test runs the built-in all-core benchmark, called from the command line.
Same-frequency test at 2.2GHz: office performance
The office test suite focuses more on industry-standard workloads such as office workflows and conferencing, but we also bundle compiler performance into this section. For users who have to evaluate hardware as a whole, these are usually the benchmarks that carry the most weight.
3DMark physics
Each game test in 3DMark includes a physics sub-test. In order of increasing complexity, the scenes are Ice Storm, Cloud Gate, Sky Diver, Fire Strike and Time Spy. In all of them, the two processors perform essentially the same at the same frequency.
GeekBench 4
GeekBench 4 is a commonly used cross-platform test tool. It runs a series of algorithms aimed at peak throughput, including encryption, compression, fast Fourier transform, memory operations, n-body physics, matrix operations, histogram processing and HTML parsing, and it is widely used for testing mobile devices. Given its versatility and popularity, it is included here in both single-threaded and multi-threaded form.
Same-frequency test at 2.2GHz: encoding performance
With the rise of streaming and short-form video content, more and more home users and gamers need to convert video files, so the encoding and transcoding performance of the processor is increasingly important. These encoding tests revolve around those scenarios.
Handbrake video transcoding
Handbrake is a popular open-source video conversion tool; the latest version can use AVX-512 and OpenCL to accelerate certain types of transcoding and algorithms. This test uses CPU transcoding.
7-Zip compression and decompression
Among compression and decompression applications, the open-source 7-Zip is one of the most popular tools. The version used is the latest, v18.05, which has a built-in benchmark. The benchmark is run from the command line and reports compression, decompression and overall scores.
WinRAR compression and decompression
Most people have WinRAR on their systems; it was one of the first compression and decompression tools some 20 years ago. It has no built-in benchmark, so this test compresses a folder containing more than thirty 60-second video clips and 2,000 small files at the normal compression rate.
WinRAR is variably threaded and also sensitive to caching, so the test is run 10 times and the average of the last five runs is taken, so that the result reflects raw CPU compute performance.
AES encryption
Many mobile devices encrypt their file systems by default to protect content, and Windows PCs usually use BitLocker or third-party software. This test uses the built-in benchmark of the discontinued TrueCrypt, which tests several encryption algorithms directly in memory; it supports the AES instruction set but not AVX-512. The reported figure is the combined AES encryption and decryption throughput in gigabytes per second.
Intel really has made a great many improvements in its 10nm process, and if every step had gone perfectly, 10nm should have become mainstream last year. The problem is that a semiconductor design has hundreds of different characteristics, and changing any one of them is likely to degrade several or even dozens of others, which is exactly what Intel ran into with 10nm. At CES 2018, Intel glossed over the issues with its 10nm process. Judging by the short-lived Cannon Lake core, whose only real selling point is its AVX-512 performance, the first generation of 10nm was clearly far from ready for prime time. Intel is handling this generation of processors quietly, and it will certainly not go on general sale.
Original address: https://www.eeboard.com/news/beaglebone-capes/