An in-depth look at the differences between the CPU and heterogeneous computing chips (GPU / FPGA / ASIC)

"First, heterogeneous computing: why

If the CPU works so well, why should we consider heterogeneous computing chips at all?

With the rapid growth in Internet users, data volumes have expanded sharply, and data centers' demand for compute is rising just as fast. Applications such as online deep learning prediction, video transcoding, image compression and decompression, and HTTPS encryption now demand far more compute than traditional CPUs can deliver.

Historically, steady advances in semiconductor technology kept raising the throughput and system performance of computer architectures. Processor performance doubled roughly every 18 months (the well-known "Moore's Law"), which let processors keep pace with the needs of application software. In recent years, however, semiconductor technology has been approaching its physical limits and circuits have grown ever more complex. Each design costs millions of dollars, and building production capacity for a new product takes billions. On March 24, 2016, Intel announced that it was officially retiring its "Tick-Tock" processor development model and that its development cycle would lengthen from two years to three. With that, Moore's Law effectively stopped holding for Intel.

On the one hand, processor performance can no longer grow at the pace of Moore's Law. On the other, data is growing faster than Moore's Law would let compute performance grow. Processors alone cannot meet the performance requirements of high performance computing (HPC) applications, which opens a gap between demand and performance (see Figure 1).

One solution is hardware acceleration: a heterogeneous design that pairs the processor with a dedicated coprocessor to raise processing performance.
Figure 1: Compute demand versus compute capability

Second, heterogeneous computing: standards

When providing a solution for a business, we usually have four platform options for deployment: CPU, GPU, FPGA, and ASIC. Is there a standard for judging the strengths and weaknesses of each platform? Since I am acting as the judge here, the standards below are the ones I will apply.

Today's ideal coprocessor should be hardware-based and have three basic capabilities. First, it is designed to provide dedicated hardware acceleration for key processing functions. Second, its design is flexible in performance: it uses pipelined and parallel structures and can keep up with algorithm updates and changing performance requirements. Finally, it provides a high-bandwidth, low-latency interface to the main processor and system memory.

Beyond these hardware requirements, the ideal coprocessor should also meet the "4P" requirements of the HPC market: Performance, Productivity, Power, and Price.

The minimum performance requirement in the HPC market is to accelerate the whole algorithm, not just one step, so that the entire application runs faster.

The productivity requirement comes from end users. The coprocessor must be easy to install in an existing computer system, and there must be a simple way to configure the system so that it accelerates existing applications.

The power requirement comes from the power limits on installing and operating a computing system. Most users have only limited space for their computers. The less power a computing system draws, the less cooling it needs to avoid overheating. A low-power coprocessor therefore not only lowers the operating cost of the system but also improves its space utilization.

Price is increasingly important in the HPC market.
More than ten years ago, the performance needs of some applications exceeded what processors of the time could deliver, which pushed people toward dedicated architectures such as massively parallel processing (MPP) and symmetric multiprocessing (SMP). Such systems, however, required custom processor units and dedicated data paths, and were very expensive to develop and program.

Today's HPC market has abandoned that expensive approach in favor of more cost-effective cluster computing. Clusters use commodity standard architectures, such as Intel and AMD processors; industry-standard interconnects, such as 10 Gigabit Ethernet and InfiniBand; and standard programming languages, such as C, running on a low-cost Linux operating system. Today's coprocessor design must integrate into a commodity cluster computing environment at a cost roughly equal to that of the other nodes in the cluster.

With the basic evaluation criteria in hand, we take deep learning, currently the most popular workload, as an example and compare the chips in terms of chip architecture, compute performance, power consumption, and development difficulty.

Third, heterogeneous computing: which

1. Chip characteristics

Commonly used processor chips fall into two classes with one key distinction: the CPU and GPU need software support, whereas in the FPGA and ASIC hardware and software are one and the same, so the software is the hardware. This is the single most important distinction among processor chips.

Figure 2: Processor chip comparison

Figure 2 can be read in two directions. In the ASIC -> CPU direction, chips become progressively easier to use. Programming a CPU or GPU relies on a compilation system, whose role is to translate high-level software languages into instructions the machine can recognize (also called machine language).
High-level languages bring great convenience and ease of use, so developing software for the same function takes far less time on a CPU or GPU than on an FPGA or ASIC.

In the CPU -> ASIC direction, the transistors on the chip are used more and more efficiently. The algorithms an FPGA or ASIC implements are built directly from transistor gate circuits. The algorithm sits directly on the physical structure with no intermediate layer, so transistor efficiency is highest.

In essence, software operates on instructions, and the CPU and GPU act as high-speed instruction executors. The existence of instructions splits a program into a software part and a hardware part, and it also accounts for many of the sharply different characteristics, and respective strengths, of the various processor chips.

The function of an FPGA or ASIC is fixed, and the algorithm it implements is realized directly in gate circuits. Programming an FPGA or ASIC is therefore the process of implementing an algorithm in gate circuits: once the program is finished, the organization of the gate circuits has been determined. In this sense, for the FPGA and ASIC the software is the hardware, and the software determines how the hardware is organized.

This unity of hardware and software gives rise to an extremely important property in FPGA and ASIC design: resource utilization. Utilization is the share of the chip's gate-circuit resources that an algorithm occupies when it is implemented in gates. If the algorithm is large, the chip may not have enough gate resources, or the resources may suffice but the actual routing may be impossible.

The CPU and GPU, which are built around an instruction system, have no such utilization constraint.
A CPU or GPU executes a program by repeatedly reading instructions from memory and passing them to the execution units. Because the memory available is effectively unlimited relative to the size of an algorithm, even a very large algorithm will not fail to load for lack of memory space. The computer system can also attach external storage such as a hard disk and extend the space available for instructions by swapping algorithms that are not currently executing out to disk.

2. Chip compute performance

Deep learning is built on deep neural networks (DNNs), which developed from artificial neural networks. We use deep learning as the entry point for analyzing each chip's performance. Figure 3 shows the basic structure of a neural network. The bulk of the computation in each layer of the model is the multiplication of two matrices: the previous layer's output and the layer's corresponding weights.

Figure 3: Basic structure of a neural network

Comparing the compute capability of the CPU, GPU, FPGA, and ASIC side by side really means comparing three things:

1. The multiply-add compute capability of the hardware chip.
2. Why the chip has that compute capability.
3. Whether the chip's multiply-add capability can be fully exploited.

With these three questions in mind, we compare the compute capability of the hardware chips.

CPU compute capability analysis

The CPU analysis here uses Intel's Haswell architecture. Each Haswell core has 2 FMA (Fused Multiply-Add) units, and each FMA unit can perform one multiplication and one addition on 256 bits of data per clock cycle. The corresponding 32-bit single-precision floating-point capability is (256 bit / 32 bit) * 2 (FMA units) * 2 (multiply and add) = 32 SP FLOPs/cycle; that is, each core can perform 32 single-precision floating-point operations per clock cycle.

CPU peak floating-point performance = number of CPU cores * CPU frequency * floating-point operations per cycle.
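The per-cycle figure and the peak formula above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a benchmark: the function names are made up for illustration, and the six-core, 2.4 GHz input is the configuration the article quotes for the E5-2620 v3. It computes a theoretical ceiling only and says nothing about sustained performance.

```python
def flops_per_cycle(vector_bits: int, element_bits: int, fma_units: int) -> int:
    """Floating-point operations one core can issue per clock cycle."""
    lanes = vector_bits // element_bits   # elements handled by one FMA unit
    ops_per_fma = 2                       # one multiply plus one add
    return lanes * fma_units * ops_per_fma


def peak_flops(cores: int, frequency_hz: float, per_cycle: int) -> float:
    """Theoretical peak = cores * frequency * FLOPs per cycle."""
    return cores * frequency_hz * per_cycle


# Haswell: 256-bit vectors, 32-bit single precision, 2 FMA units per core.
haswell_sp = flops_per_cycle(vector_bits=256, element_bits=32, fma_units=2)
print(haswell_sp)  # 32

# Six cores at 2.4 GHz (the article's E5-2620 v3 figures).
peak = peak_flops(cores=6, frequency_hz=2.4e9, per_cycle=haswell_sp)
print(f"{peak / 1e9:.1f} GFLOPS")  # 460.8 GFLOPS
```

Keeping the per-cycle term separate from the core count and frequency makes it easy to see which factor a given architectural change affects.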
For Intel's E5-2620 v3, peak compute capability = 6 (CPU cores) * 2.4 GHz (CPU frequency) * 32 SP FLOPs/cycle = 460.8 GFLOPS, that is, a peak of about 460 billion single-precision floating-point operations per second.

Can the CPU's structure deliver that full floating-point capability? A CPU executes an instruction in three steps: instruction fetch -> instruction decode -> instruction execute. The compute units do useful work only during the execute step; while an instruction is being fetched and decoded they are idle, as shown in Figure 4.

Figure 4: CPU instruction execution process

To improve execution efficiency, the CPU prefetches the instructions that follow the current one and processes them as a pipeline while the current instruction executes, as shown in Figure 5. Prefetching and pipelining presuppose that the instructions are independent of one another: how an instruction should execute must be knowable without waiting for the result of the previous instruction.

Figure 5: CPU instruction pipelining

As a general-purpose processor, the CPU handles both computation and control. About 70% of its transistors go to building the cache and the control units, which handle complex logic and improve instruction execution efficiency, as shown in Figure 6. The CPU is therefore highly versatile and can handle very complex processing, but its raw compute performance is only moderate.

Figure 6: CPU structure

The CPU analysis points to three direct ways of raising compute performance: add more CPU cores, raise the CPU frequency, and modify the CPU architecture to add more FMA (Fused Multiply-Add) compute units.
Of these three directions, adding CPU cores raises compute capability the most directly, but it also raises chip power consumption and price, because only about 30% of the transistors in each physical core are compute units. Raising the CPU frequency offers limited headroom, and a frequency that is too high causes excessive power draw and overheating. That is why Intel and other chip makers currently take the multi-core route: they cap the clock frequency of each individual core and integrate multiple processor cores to raise processing performance. The third direction is modifying the CPU architecture to add more FMA compute units. Intel used to revise its CPU architecture on the two-year "Tick-Tock" cycle; since 2016 the cycle has slowed to three years, so update iterations take longer.

GPU compute capability analysis

The GPU is mainly good at parallel computation of the kind found in image processing, so-called "coarse-grained parallelism". Graphics processing is characterized by dense computation with few dependencies among the data involved. The GPU provides a large number of compute units (up to thousands) and a large amount of high-speed memory, so it can process data in parallel on a large scale.

Figure 7 shows the structure of a GPU. The GPU's design starting point is that it suits workloads with high compute intensity and a high degree of parallelism. The GPU therefore devotes more of its transistors to compute units, rather than to data cache and flow control as the CPU does. This design works because, in parallel computation, every data element runs the same program. What is needed is high compute capability rather than elaborate flow control, so a large cache is unnecessary.
Figure 7: GPU structure

In a GPU, one logic control unit serves many compute units. To keep the compute units fully parallel, the control logic must not be too complex, because overly complex control logic keeps the compute units from running in parallel. A long chain of if ... else if ... else if ... branches, for example, prevents the compute units from working in parallel. The GPU's logic control unit therefore does not need to handle complex control quickly.

The GPU analysis here uses the NVIDIA Tesla K40. The K40 contains 2880 stream processors, which are the GPU's compute units. Each stream processor contains one 32-bit single-precision floating-point multiply-add unit, so it can perform 2 single-precision floating-point operations per clock cycle.

GPU peak floating-point performance = number of stream processors * GPU frequency * floating-point operations per cycle.

For the K40, peak floating-point performance = 2880 (stream processors) * 745 MHz * 2 (multiply and add) = 4.29 TFLOPS, that is, a peak of about 4.29 trillion single-precision floating-point operations per second.

Can the GPU's structure deliver that full floating-point capability? Like the CPU, the GPU executes instructions as instruction fetch -> instruction decode -> instruction execute, and the compute units do useful work only during the execute step. Because the GPU's logic control unit is simpler than the CPU's, pipelining instructions and executing them efficiently requires that the algorithm itself have low control complexity and operate on mutually independent data. Serial processing within the algorithm therefore cuts sharply into the GPU's achievable floating-point performance.

FPGA compute capability analysis

As a high-performance, low-power programmable chip, the FPGA can be given a design tailored to the customer's requirements.
As a result, when processing massive amounts of data, the FPGA has clear advantages over the CPU and GPU: it computes more efficiently and it sits closer to the I/O.

The FPGA does not use instructions or software; it is a device in which hardware and software are one. Programming an FPGA requires a hardware description language, and the logic described in that language is compiled directly into a configuration of transistor circuits. The FPGA thus implements the user's algorithm directly in transistor circuits, with no instruction system doing translation in between.

FPGA stands for Field Programmable Gate Array. The name already reveals what the device does: it is an array of logic gate circuits that can be programmed, and reprogrammed repeatedly. Figure 8 shows a schematic of the internals of a programmable FPGA.

Figure 8: FPGA internal structure

The FPGA analysis here uses Xilinx's V7-690T. The V7-690T contains 3600 DSP units. Each DSP can perform 2 single-precision floating-point operations (multiply and add) per clock cycle.

FPGA peak floating-point performance = number of DSPs * FPGA frequency * floating-point operations per cycle.

With the V7-690T running at 250 MHz, peak floating-point performance = 3600 (DSPs) * 250 MHz * 2 (multiply and add) = 1.8 TFLOPS, that is, a peak of about 1.8 trillion single-precision floating-point operations per second.

Can the FPGA's structure deliver that full floating-point capability? Because the FPGA's logic is customized to the algorithm, there is none of the instruction fetch and decode that the CPU and GPU perform. The data stream flows directly through the fixed, custom-built operations, and the compute units can work on every clock cycle. The FPGA can therefore bring its full floating-point capability to bear, and its computational efficiency is higher than that of the CPU and GPU.

ASIC compute capability analysis

An ASIC is a special-purpose chip that differs from traditional general-purpose chips.
It is a chip customized for a particular requirement. Both the compute capability and the computational efficiency of an ASIC can be tailored to the algorithm, so compared with general-purpose chips the ASIC offers several advantages: small size, low power consumption, high compute performance, high computational efficiency, and a unit cost that falls as shipment volume rises. The drawback is just as clear: the algorithm is fixed, and once the algorithm changes the chip may become unusable. Artificial intelligence is currently in a period of explosive growth. New algorithms keep emerging and the field is far from settling on a stable set of algorithms. How an ASIC is to adapt to all these algorithms is the biggest open problem. If it adopts an architecture that, like today's CPU and GPU, accommodates every algorithm, the ASIC becomes just another general-purpose chip like the CPU and GPU and loses its advantage in performance and power consumption.

Now let's look at the difference between the FPGA and the ASIC. The basic principle of the FPGA is to integrate a large number of basic digital gate circuits and memories on one chip; the user defines the connections among these gates and memories by burning in an FPGA configuration file. This burn-in is not one-time: the user can configure the FPGA as a microcontroller (MCU) and later edit the configuration file to turn the same FPGA into an audio codec. The ASIC is an application-specific integrated circuit. Once the design is manufactured, the circuit is fixed and cannot be changed.

Comparing the FPGA with the ASIC is like comparing Lego bricks with a molded model. Suppose you notice that Master Yoda from Star Wars is very popular and you want to sell a Yoda toy. What do you do? There are two ways: build it from Lego bricks, or have a factory make a mold. With Lego, once you have designed the toy's shape you only need to buy a set of Lego bricks.
If you go to a factory for a mold, there is a lot to do beyond designing the toy's shape, such as checking whether the material gives off an odor or whether the toy melts at high temperature. Building the toy from Lego therefore takes far less up-front work than having a factory make a mold, and the toy reaches the market much sooner.

The same holds for the FPGA and the ASIC. With an FPGA, you can implement a hardware accelerator using just the tools the FPGA vendor provides. Designing an ASIC requires a great deal of verification and physical design (ESD, packaging, and so on), which takes more time. For special applications (such as military and industrial uses with strict reliability requirements), an ASIC needs still more time for dedicated design work to meet the requirements, whereas with an FPGA you can simply buy a military-grade, high-reliability part with no effect at all on development time.

However, although the design time is short, a toy built from Lego bricks is rougher (that is, performs worse) than a factory-made one, since the factory mold is made to measure. And if shipment volume is large, mass-producing the toy in a factory costs far less than building it from Lego bricks. The same is true of the FPGA and the ASIC: an accelerator implemented as an ASIC in the best available process runs 5 to 10 times faster than one implemented on an FPGA in the same process, and once the ASIC is in mass production its cost is far below that of the FPGA solution.

In short, the FPGA reaches the market quickly. The ASIC reaches the market slowly: it needs a long development time and its one-time costs (such as photomask fabrication) are far higher than the FPGA's, but its performance is higher than the FPGA's and its average cost in volume is lower.
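The cost trade-off described above (a high one-time cost and low unit cost for the ASIC, the reverse for the FPGA) implies a break-even shipment volume. The following is a minimal sketch of that arithmetic. Every dollar figure in it is a hypothetical placeholder chosen only to make the example concrete; none comes from the article or from any vendor, and the function names are illustrative.

```python
import math


def total_cost(one_time_cost: int, unit_cost: int, volume: int) -> int:
    """One-time (non-recurring) cost plus unit cost times shipped volume."""
    return one_time_cost + unit_cost * volume


def break_even_volume(fpga_one_time: int, fpga_unit: int,
                      asic_one_time: int, asic_unit: int) -> int:
    """Smallest volume at which the ASIC's total cost is no higher than the FPGA's."""
    if asic_unit >= fpga_unit:
        raise ValueError("the ASIC only breaks even if its unit cost is lower")
    extra_one_time = asic_one_time - fpga_one_time
    saving_per_unit = fpga_unit - asic_unit
    return max(0, math.ceil(extra_one_time / saving_per_unit))


# Hypothetical inputs (assumptions, in dollars).
FPGA_ONE_TIME, FPGA_UNIT = 50_000, 500
ASIC_ONE_TIME, ASIC_UNIT = 2_000_000, 20

volume = break_even_volume(FPGA_ONE_TIME, FPGA_UNIT, ASIC_ONE_TIME, ASIC_UNIT)
print(volume)  # 4063
print(total_cost(FPGA_ONE_TIME, FPGA_UNIT, volume))  # 2081500
print(total_cost(ASIC_ONE_TIME, ASIC_UNIT, volume))  # 2081260
```

Below the break-even volume the FPGA is the cheaper choice, and above it the ASIC is, which matches the article's point that the ASIC needs a large enough market to pay back its one-time cost.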
In terms of target markets, the FPGA's higher cost suits it to applications that are not very price-sensitive, such as enterprise, military, and industrial electronics (fields where reconfigurability is genuinely needed). The ASIC's low cost suits it to consumer electronics, and whether reconfigurability is a real need in consumer electronics is itself questionable. The market bears this out. The companies using FPGAs to accelerate deep learning are mostly enterprise users: Baidu, Microsoft, IBM, and others have teams dedicated to FPGA-based server acceleration, and the target market of the FPGA startup Teradeep is also the server. ASICs mainly target consumer electronics, as with Movidius. Because mobile devices belong to consumer electronics, the solutions they adopt in the future should be ASIC-based.

3. Platform performance and power consumption

Because the manufacturing process affects a chip's power consumption and performance, the comparison here uses chips made in the same or similar processes. No commercial ASIC chip is available, and Google's TPU is used only in-house with no details published. The ASIC is therefore represented by the accelerator from the academic paper "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning".

From the comparison above, the ranking by energy efficiency is ASIC > FPGA > GPU > CPU. The root cause is that, for compute-intensive algorithms, the more efficiently a chip moves data and performs operations, the higher its energy efficiency. Both the ASIC and the FPGA sit closer to the underlying I/O, so both compute and move data efficiently. The FPGA, however, carries redundant transistors and routing and runs at a lower clock frequency, so its energy efficiency falls short of the ASIC's.
Both the GPU and the CPU are general-purpose processors and must go through instruction fetch, instruction decode, and instruction execute. This approach hides the handling of the underlying I/O and decouples software from hardware, but it keeps data movement and computation from reaching higher efficiency, so their energy efficiency is lower than that of the ASIC and FPGA. The energy-efficiency gap between the GPU and the CPU arises mainly because most of the CPU's transistors go to cache and control logic. For compute-intensive algorithms with low control complexity, those transistors are redundant and contribute nothing, so the CPU's energy efficiency is lower than the GPU's.

Fourth, summary and outlook

Over their long development, processor chips have each acquired distinctive uses and market characteristics. The CPU and GPU field has a vast body of open source and application software. Any new technology is first implemented as an algorithm on the CPU, so CPU programming resources are rich and easy to obtain, and development costs are low.

FPGA designs are written in low-level hardware description languages such as Verilog and VHDL, which requires developers to understand the FPGA's chip characteristics in some depth, but the FPGA's high parallelism can often raise business performance. The FPGA is also dynamically reconfigurable. After deployment in the data center, it can be configured with different logic according to the business workload to provide different hardware acceleration functions. For example, suppose the FPGA boards currently deployed on a server carry image compression logic and serve the QQ service, and the real-time ad estimation service needs more FPGA compute. A simple FPGA reconfiguration turns the board into "new" hardware that serves the real-time ad estimation workload.
The ASIC achieves the best performance: high area utilization, high speed, and low power consumption. But ASIC development is a very large undertaking. It needs a sufficiently large market to recoup the cost, and the cycle from development to market is very long. It is not suited to fields where algorithms iterate rapidly, such as deep learning CNNs.

After all this, when you hit a business bottleneck, can you choose a suitable chip based on the characteristics of the business and of the chips? With the characteristics of each chip analyzed, here comes the key point.

Today's FPGAs have great performance potential. They support pipelines of variable depth, provide abundant parallel compute resources, and can complete a very complex function in a single clock cycle. Their programmability ensures that the device can meet an application's special needs without the cost or delay of designing a custom coprocessor. Because FPGAs are reprogrammable, a single chip can provide highly flexible custom coprocessing functions for multiple applications. With FPGAs, the possibilities for the business are endless. The same semiconductor technology that has pushed processor performance to its limit has also taken the FPGA from a simple glue-logic controller to a high-performance programmable architecture. The FPGA is fully able to meet the "4P" requirements of the HPC market.

The FPGA's built-in memory is another large performance advantage. On-chip memory means, for example, that the coprocessor logic's memory access bandwidth is not limited by the number of I/O pins on the device. Memory and compute logic are also tightly coupled, so an external high-speed memory buffer is no longer needed. This also avoids power-hungry buffer accesses and coherency problems.
Using internal memory also means the coprocessor needs no additional I/O pins to increase its accessible memory capacity, which simplifies the design.

Many people remain skeptical of the FPGA because of its development difficulty and long development cycle. The good news is that HLS and OpenCL are steadily maturing, and many applications can already obtain large performance gains by using these two high-level approaches.

Industry success cases

To better meet their compute performance requirements, many of the world's large IT companies have invested in and deployed FPGA-based acceleration hardware.

Intel: Intel decided to acquire FPGA maker Altera for $16.7 billion. Intel expects that by 2020 more than 30% of server CPU chips will ship with an FPGA coprocessor.

IBM: IBM and Xilinx jointly announced a multi-year strategic collaboration to use Xilinx FPGAs to accelerate workload processing on IBM POWER systems, with the aim of building higher-performance, more energy-efficient data center applications.

Microsoft: As early as 2014, Microsoft applied Altera FPGAs to its Bing search service, doubling the volume of searches Bing can process and cutting search time by 29%. In 2015, Microsoft extended its use of FPGAs to deep learning. In 2016, Microsoft's paper "A Cloud-Scale Acceleration Architecture", published at MICRO, a top computer architecture conference, showcased its work on data center architecture. Every server now entering a Microsoft data center carries an FPGA board.

[Figure: basic architecture of the Microsoft FPGA deployment]

The application scenarios covered in the paper include:

1. Network acceleration (for example, encrypting and decrypting network packets).
2. Local application acceleration (Bing acceleration, and acceleration of latency-sensitive DNN services).
3. Direct communication between FPGAs, forming a pool of FPGA compute resources that is offered under the Hardware-as-a-Service concept and decouples the FPGAs from the servers.
Facebook: In 2016, Facebook also announced that it would work with Intel to use the Xeon-FPGA platform in its data centers.

Baidu: In China, Baidu has launched an FPGA version of Baidu Brain for its online services. It has been applied to speech recognition, ad click-through rate estimation models, DNA sequence detection, and drones. Reportedly, after adopting the FPGA version of Baidu Brain, online speech services and ad click-through rate estimation models saw their compute performance improve by 3 to 4 times.

Original link: https://www.eeboard.com/news/cpu-26/"