From GPUs and TPUs to FPGAs and beyond: a survey of the neural network hardware platform battle

     

    "In the era of deep explosion now, the relevant hardware platforms are also put in a hundred flowers, both of the technology giants such as Ying Weida and Google, as well as entrepreneurial companies such as Skyline Robots and Graphcore - they all propose their own solutions. Recently, Matt Hurd, a multi-company technical consultant, published a comprehensive review of various neural network hardware platforms, and Xiaobian's compilation introduced this article. This is a very good hidden node for a traditional 90-year gender identification neural network made a few weeks ago. A simple gender identifier network in the 90s style hidden node image My master's program is a neural network of Cascade Correlation, Multi-Rate Optimising Order Statistic Equaliser (Moose: Multi-speed Optimization Sequence Statistics Equalizer), which can be used in the Japanese Bund (Treas Bond Products). Moose has been designed for a high-speed LeO satellite signal (McCaw's Teledesic), and then turned the target to Bund when migrating from Liffe to DTB. As a professional trader of an investment bank, I can buy a good tool. I have the fastest computer in the world: an IBM Microchannel Dual Pentium Pro 200MHz processor with a few MB RAM SCSI. In 1994, it was like a black magic in my C ++ Stream / DAG processor in 1994. The finite difference method allows me to do many O (1) incremental linear regression such as 1000 times acceleration. At that time, this seems very good. Now, your phone can laugh at my big direction. At that time, there were many research in the neural network. It doesn't say that it has a productive forces, just because it is useful. Reading Lindsay Fortado and Robin Wigglesworth's FT article "Machine Learning Set To Shake Up Equity Hedge Funds" in Eric Schmidt About machine learning and transactions, it is really happy: Eric Schmidt is the executive chairman of the Alphabet of Google Pharma, and he said last week that he believes that he believes in 50 years, all transactions will have a computer to interpret data and market signals. "I look forward to the startup of the transaction to make machine learning, and see if the pattern I described can be better than the traditional linear regression algorithm that is more than data analysts." He added, "I" I am in this industry " Many people think that this is destined to be a new form of trading. " Old friends Eric, I was already late in the early 1990s, you really have a little later known. Ok, now the situation is different. I like to think about it, and like to refer to this new revival of the neural network as a sense of perception. This is not intelligent, just good at mode. It is still unable to deal with language ambiguity. It also has some time to understand the basic value and concepts, thereby form a deep financial understanding. Deep learning is both exaggerated and is underestimated. This is not intelligent, but it will help us to achieve intelligence. Some people exaggerated the artificial intelligence breakthrough to us will bring us alternatives. We are still trapped in common sense and ambiguity in simple text for reasoning. We have a long way to go. Relatively simple planning algorithm and heuristic approach, as well as magical deep learning visual, sound, text, radar, etc. will bring a profound impact, just like everyone and their dogs are now under understanding. So I call it "Perceived Age". It is like the supercomputer in our pocket suddenly has eyes and quickly adapts the flash of the real world. 
Deep learning will have a huge impact and will change life on this planet, but we underestimate its dangers. No, we will not yet be having deep Turing-test conversations that challenge our most profound ideas; that will inevitably arrive, but it is not on the visible horizon. Voice, text and Watson-style intelligent agents can deliver a very advanced ELIZA, but not much more than that. Automated transport, food production, construction and care assistance will greatly change people's lifestyles and the value of real estate. General musings aside, the purpose of this article is to collect some thoughts on the chips driving the current neural network revolution. Many of them are not the most exciting reading, but it has been a useful exercise for me.

Neural network hardware

Today's neural methods are not that different from those of twenty years ago; "deep" is more a brand than a difference. Activation functions have been simplified to suit hardware better. The main wins are that we have far more data and a much better understanding of how to initialise weights, handle many layers, parallelise, and improve robustness with techniques such as dropout. The 1980 Neocognitron architecture was not dramatically different from today's deep learning CNNs, but it was Yann LeCun who gave it the ability to learn. There were many neural hardware platforms in the 1990s, such as CNAPS (1990), which had 64 processing units and 256 KB of memory and could reach 1.6 GCPS at 8/16-bit precision (CPS: connections per second), or 12.8 GCPS at 1-bit precision. In "Overview of Neural Hardware" [Heemskerk, 1995, draft] you can read about Synapse-1, CNAPS, SNAP, the CNS Connectionist Supercomputer, Hitachi WSI, MY-NEUPOWER, LNeuro 1.0, UTAK1, the GNU (General Neural Unit) implementation, UCL, Mantra I, the Biologically-Inspired Emulator, the INPG architecture, BACHUS and ZISC036. Paper: https://pdfs.semanticscholar.org/5841/73AA4886F87DA4501571957C2B14A8FB9069.PDF

That is already a lot, and it leaves out the software and accelerator-board/co-processor combinations, such as ANZA plus, SAIC SIGMA-1, NT6000, the Balboa 860 coprocessor, the Ni1000 Recognition Accelerator (Intel), IBM NEP, NBC, Neuro Turbo I, Neuro Turbo II, WISARD, Mark II & IV, Sandy/8, GCN (Sony), TOPSI, BSP400 (400 microprocessors), the Dream Machine, RAP, COKOS, REMAP, the General Purpose Parallel Neurocomputer, TI NETSIM and GeNet. There were analogue and mixed-signal efforts too, including Intel's electrically trainable analogue neural network (the 80170NX). You get the idea: there was a great deal going on.

That brings us to 1994 and an optimistic Moore's-law projection that TeraCPS was within reach: "In the next decade microelectronics will most likely continue to dominate the neural network field. If progress is as fast as it has been, the performance of neurocomputers will grow by about two orders of magnitude, so neurocomputers will approach TeraCPS (10^12 CPS) performance. Networks of a million nodes (with about 1,000 inputs per node) could then run at roughly the speed of the brain (100-1000 Hz). That would provide a good opportunity to experiment with reasonably large networks."
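The arithmetic behind that TeraCPS figure follows directly from the numbers quoted above; a quick back-of-the-envelope check (my own working, not the original paper's):

```python
# Back-of-the-envelope check of the 1994 TeraCPS projection (my numbers, from the quote above).
nodes = 1_000_000          # network size quoted above
inputs_per_node = 1_000    # connections per node
update_rate_hz = 1_000     # top of the 100-1000 Hz "brain speed" range

connections_per_second = nodes * inputs_per_node * update_rate_hz
print(f"{connections_per_second:.1e} CPS")  # 1.0e+12 CPS at 1000 Hz, 1.0e+11 at 100 Hz
```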
" Since Minsky and PapertT's incorrect and simple summary of hidden layers, the dream of Rosenblatt has hit Rosenblatt, and ultimately leads his unfortunate death. The study of neural network has encountered the first winter, and the research funds were cruelly revoked. In 1995, another neural network was in winter, although I didn't know when I was at that time. As a frog in a warm pot, I didn't pay attention to being heated. The main reason for the second winter is the lack of exciting progress, making people generally boring. In 2012, he lost to the winter survival skills of Geoffrey Hinton. The University of Toronto is based on AlexNET developed Supervision's improving in ImageNet, and the second neural network is also ended. After the Google's Lenet Inception model broke its record in 2014. Therefore, according to I estimate that the perceived era begins in 2012. Remember it in your calendar, five years have passed. Google has made excellent parallel CPUs on thousands of ordinary machines. Professor Wu Weida and his friends let dozens of GPUs can complete thousands of CPUs, so that scale is possible. Therefore, we have liberated from the neurological prospects that need good funding. Ok, more or less, now the most advanced network sometimes requires thousands of GPUs or dedicated chips. More data and more processing power is the key. Let us enter the focus of this article, list some key platforms in the war of big data in the times: GPU of Yingda This home is very difficult to defeat. Years from large video processing markets drive huge scale economy. The new Yida V100 has a new Tensor Core architecture with a speed up to 15 TFLOPS (single precision / SP) or 120 TFLOPS (floating point precision, where the fp16 multiplication and the accumulation or addition of FP32, very suitable for machine learning). Ying Weida is loaded with 8 computing cards in their DGX-1, with a speed of 960 Tensor TFLOPS. AMD GPU In the field of machine learning, AMD has always been the chase of British Weida. The upcoming AMD Radeon Instinct Mi25 has a wish to reach 12.3 TFLOPS (SP) or 24.6 TFLOPS (FP16). If you count the Tensor Core of Yingda, AMD is completely unable to compete. The bandwidth of the Yingda equipment is also twice the AMD 484GB / S. Google's TPU Google's original TPU has a big lead than GPUs and helped DeepMind's Alphago won the Go Shishi's Go Wars. According to the description, the original 700 MHz TPU has 95 TFLOPS 8-bit computing power or 23 TFLOPS 16-bit computing power, and the power consumption is only 40W. This can be much faster than the GPU at the time, but now lags behind the V100 of Ying Weida; but on the calculation capabilities of unit power consumption, the TPU did not fall behind. The new TPU2 is said to be a TPU device with 4 chips, which can reach 180 TFLOPS. The performance of each chip is doubled, and the 16-bit computing power of 45 TFLOPS is reached. You can see that the gap between Ying Weida V100 is becoming smaller. You can't buy TPU or TPU2. Google is providing these TPU services through their clouds, including 64 devices of TPU POD speeds up to 11.5 petaflops. The huge heat sink on TPU2 illustrates some reasons, but the market is changing - from a separate device to the combination of equipment and provide these combinations in the form of clouds. Wave computing Wave's Father Australian Dr. 
Wave Computing

The asynchronous dataflow processor in Wave's Compute Appliance is the brainchild of Wave's Australian CTO, Dr Chris Nicol, who led its development. A few years ago, Metamako's founder Charles Thomas briefly introduced Chris and me; they had both spent time at NICTA, and both are very sharp. I am not sure whether Wave's machine was originally designed with machine learning in mind, but the claimed 2.9 PetaOPS/s for TensorFlow running on their 3RU appliance is impressive. Wave calls its processor a DPU, and an appliance holds 16 DPUs. Wave's processing elements form what it calls a Coarse Grained Reconfigurable Array (CGRA). It is still unclear what bit width the 2.9 PetaOPS/s figure refers to. According to their white paper, the ALUs can perform 1-bit, 8-bit, 16-bit and 32-bit computation: "The arithmetic units are partitioned. They can perform parallel 8-bit operations (perfect for DNN inference) as well as 16-bit and 32-bit operations (or any combination of the above). Some 64-bit operations are also possible, and precision can be extended arbitrarily in software."

There is some additional information on the 16 DPUs in the appliance: "The Wave Computing DPU is an SoC containing 16,384 PEs, configured as a CGRA of 32x32 clusters. It includes four second-generation Hybrid Memory Cube (HMC) interfaces, a PCIe Gen3 x16 interface and an embedded 32-bit RISC microcontroller for SoC resource management. The Wave DPU is designed to execute autonomously without a host CPU."

On TensorFlow: "The Wave DNN Library team creates pre-compiled, relocatable kernels. These can be assembled into agents and instantiated on the machine to build large tensor dataflow graphs and DNN kernels."

"... a session manager that interfaces to machine learning workflows such as TensorFlow, CNTK, Caffe and MXNet as a worker process for both training and inference. These workflows supply the worker process with dataflow graphs of tensors. At runtime, Wave's session manager analyses a dataflow graph, places it onto the DPU chips and connects them to construct the dataflow graph in hardware. The software agents are assigned global memory regions for input buffering and local storage. The static nature of the CGRA kernels and the distributed memory architecture allow a performance model to accurately estimate agent latency. The session manager uses this performance model to insert FIFO buffers between agents, which helps overlap communication and computation within the DPU. The variable-size agents support software pipelining of flows across the whole graph, further increasing concurrency and performance. The session manager monitors the performance of the dataflow graph at runtime (watching for stalls, buffer underflow and/or overflow) and dynamically tunes the FIFO buffer sizes to maximise throughput. A distributed runtime management system running in processors attached to the DPUs installs and removes parts of dataflow graphs at runtime to balance computation and memory use. This kind of runtime reconfiguration of a dataflow graph in a dataflow computer is a first."

Yes, I think that is very cool too. The striking thing about this platform is that it is architecturally coarser-grained than an FPGA, so it is less flexible, but it may well perform better. Very interesting.
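To make the FIFO-buffered dataflow idea in that description more concrete, here is a tiny conceptual sketch of kernels connected by bounded queues, so that downstream stages overlap with upstream ones. It is entirely my own toy example and has nothing to do with Wave's actual session manager or hardware.

```python
from queue import Queue
from threading import Thread

# Toy dataflow pipeline with bounded FIFO buffers between kernels
# (conceptual illustration only; unrelated to Wave's software).

def producer(out_fifo, n):
    for i in range(n):
        out_fifo.put(i)          # blocks when the FIFO is full (back-pressure)
    out_fifo.put(None)           # end-of-stream marker

def square_kernel(in_fifo, out_fifo):
    while (item := in_fifo.get()) is not None:
        out_fifo.put(item * item)
    out_fifo.put(None)

def consumer(in_fifo, results):
    while (item := in_fifo.get()) is not None:
        results.append(item)

fifo_a, fifo_b = Queue(maxsize=4), Queue(maxsize=4)   # bounded FIFOs between stages
results = []
stages = [
    Thread(target=producer, args=(fifo_a, 10)),
    Thread(target=square_kernel, args=(fifo_a, fifo_b)),
    Thread(target=consumer, args=(fifo_b, results)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
print(results)  # [0, 1, 4, 9, ..., 81]
```

The bounded queue sizes play the role of the FIFO depths a scheduler would tune: too small and stages stall waiting on each other, too large and buffer memory is wasted.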
KnuPath

I talked to KnuPath on Twitter in June 2016; their product page disappeared after that. I am not sure how they plan to spend their roughly US$100 million of funding on their MIMD architecture. At the time they described to me 256 tiny DSP cores (tDSPs) per ASIC, plus an ARM controller for sparse matrix processing, in a 35 W envelope. Performance is unknown, but they compared their chip to an NVIDIA chip of the time and claimed 2.5 times the performance. We know NVIDIA has since improved by more than tenfold with Tensor Cores, so KnuEdge will have to work hard to keep up. A MIMD or DSP approach will have to perform very well to win a place in this space. Time will tell.

Intel's Nervana

Besides developing its Nervana Engine ASIC, Nervana Systems also pursued a GPU/software approach; Intel later acquired the company. Performance comparisons are unclear. Intel also plans to integrate the technology into the Phi platform via the Knights Crest project. NextPlatform suggests the 2017 target on a 28 nm node is 55 TOPS/s at some unspecified bit width. Intel has scheduled a NervanaCon for December, so we may see their first results then.

Horizon Robotics

This Chinese startup is developing a Brain Processing Unit (BPU). Its founder Dr. Kai Yu has a proper pedigree: he previously headed Baidu's Institute of Deep Learning. Earlier this year a YouTube video showed a BPU simulation running on an Arria 10 FPGA: https://youtu.be/gi9u9lufado. Public information on this platform is still scarce.

Eyeriss

Eyeriss is an MIT project that produced a 65 nm ASIC with decent raw performance. On AlexNet the chip runs at roughly half the speed of an NVIDIA TK1. Its advantage is achieving that level of performance in a reconfigurable accelerator drawing only 278 mW, by means of a "row stationary" dataflow. Nice.

Graphcore

Last year Graphcore raised $30 million in Series A funding to develop its Intelligence Processing Unit (IPU). Its website is still light on detail, offering only a few headline facts, such as more than 14,000 independent processor threads and more than 100 times the memory bandwidth. According to NextPlatform, there are more than 1,000 real cores per chip with a custom interconnect, and its PCIe board reportedly carries a 16-processor configuration. It appears to be a dataflow design. PR aside, this team has a strong background and the investors are not stupid, so we shall wait and see.

Tenstorrent

Tenstorrent is a small startup in Toronto, Canada. Like most of these companies it claims an order-of-magnitude improvement in deep learning efficiency, but nothing is public yet; it was, however, selected for the Cognitive 300 list.

Cerebras

Cerebras is worth mentioning because it is backed by Benchmark and its founder was the CEO of SeaMicro. It appears to have raised $25 million and remains in stealth mode.

ThinCI

ThinCI is developing a vision processor in Sacramento, with staff in India as well. It claims to be about to launch its first silicon, the ThinCI-TC500, and to have started benchmarking and winning customers. But beyond "everything processed in parallel" we know very little.

Koniku

Koniku's website shows a countdown, with 20 days still to go. I can hardly wait.
It is not clear how much they have actually built, but if you watch the Forbes video about them (https://goo.gl/va1pjx) you may well be inclined to believe them; just do not count on how far it will go. It is definitely not ordinary silicon: it uses biological cells. It sounds like a research project, but they insist otherwise: "We are a business. We are not a research project." Next week Agabi is due to speak at the Pioneers Festival in Vienna: "There are demands today that silicon cannot meet, which we can deliver with our systems." Koniku's offering is a so-called neuron shell, within which the startup says it can control how neurons communicate with each other; together with a patent-pending electrode, it can read and write information to the neurons. All of this fits in a device about the size of an iPad, and they hope to shrink it to the size of a five-cent coin by 2018.

Adapteva

Adapteva is my favourite little-tech company, as you can see in my earlier piece "Adapteva tapes out Epiphany-V: a 1024-core 64-bit RISC processor": https://goo.gl/6zh7jp. Andreas Olofsson taped out his 1024-core chip late last year and we are all waiting to see how it performs. Epiphany-V has new instructions useful for deep learning, and we will have to see whether this memory-controller-less design with 64 MB of on-chip memory scales appropriately. Andreas's design and manufacturing efficiency may make this a chip we can actually afford, so let us hope it performs well.

Knowm

Knowm works on Anti-Hebbian and Hebbian (AHaH) plasticity and memristors. There is a paper covering the topic, "AHaH Computing: From Metastable Switches to Attractors to Machine Learning": https://doi.org/10.1371/journal.pone.0085175. It is a bit beyond me. From a quick look I could not see how it differs from clustering, but it has a scientific flavour to it; I will need to study it properly to judge. The idea of a neuromemristive processor is certainly intriguing.

Mythic

Mythic promises a battery-powered neural chip with low power consumption. There is not much real detail to be seen yet. The chip is about the size of a button, but then aren't most chips? "Mythic's platform delivers desktop GPU performance in a button-sized chip." Perhaps this is another chip aimed at drones and phones; it may well end up in your phone, or it may be squeezed out.

Qualcomm

The mobile phone is obviously fertile ground for machine learning hardware. We want to identify dog breeds, flowers, leaves and cancerous moles, translate signs and understand spoken words. The supercomputer in our pocket will take all the help it can get on its way into the age of perception. Qualcomm has been beating the machine learning drum for a while, launching the Zeroth SDK and the Snapdragon Neural Processing Engine (NPE). The NPE apparently works reasonably well on the Hexagon DSP that Qualcomm uses, and the Hexagon DSP is nothing if not a very wide parallel platform. Yann LeCun has confirmed that Qualcomm and Facebook are working together on a better approach; see Wired's coverage (translated in the article "Industry | Beyond Google's TPU there is also Qualcomm: the AI chip competition has already expanded"): "Qualcomm has recently started building a dedicated chip for executing neural networks. The news comes from LeCun, who knows Qualcomm well because Facebook is helping it develop machine-learning technology; Qualcomm vice president Jeff Gehlhaar confirmed the project. 'We still have a long way to go in prototype design and development,' he said."
Perhaps we will soon see something beyond the Kryo CPU, Adreno GPU, Hexagon DSP and Hexagon Vector Extensions. Competing with Qualcomm in machine learning will be hard for startups in this space.

PEZY-SC and PEZY-SC2

These are the 1024-core and 2048-core processors developed by PEZY. The 1024-core PEZY-SC powered the top three systems on the 2015 Green500 supercomputer list. The PEZY-SC2 is the follow-on chip; it was presented in June, and details are still scarce but attractive: "PEZY-SC2 HPC Brick: 32 PEZY-SC2 module cards with 64 GB DDR4 DIMMs in a single chassis, 2.1 PetaFLOPS (DP), up to 6.4 Tb/s." I wonder what 2,048 MIMD MIPS Warrior 64-bit cores can do. On the June 2017 Green500 list an NVIDIA P100 system took the top spot, with a PEZY-SC2 system in seventh place. So the chip is alive, even if details remain thin. Motoaki Saito is certainly worth watching.

Kalray

Despite many promises, Kalray's chips have not gone beyond 256 cores; I wrote about them in an article in 2015: https://goo.gl/pxqn7z. Kalray promotes its product as suited to embedded self-driving applications, but I do not think its current architecture is an ideal fit for CNNs. Kalray offers a Kalray Neural Network (KaNN) software package and claims better efficiency than GPUs, with up to 1 TFLOP/s on chip. Kalray's neural fortunes may improve with the coming product refresh, and the company has just closed a new $26 million funding round this month. The new Coolidge processor, expected in 2018, will have 80 or 160 cores along with 80 or 160 co-processors optimised for vision and deep learning. This is a big change from their 1000-plus-core approach, and I think it is a wise one.

IBM TrueNorth

TrueNorth is IBM's neuromorphic CMOS ASIC, developed under DARPA's SyNAPSE project. It is a many-core processor network on a single chip, with 4,096 cores, each core simulating 256 programmable silicon "neurons", for a total of just over a million neurons. Each neuron in turn has 256 programmable "synapses" through which signals pass, so the total number of programmable synapses is just over 268 million (2^28). In terms of basic building blocks, it contains 5.4 billion transistors. Because memory, computation and communication are handled locally within each of the 4,096 neurosynaptic cores, TrueNorth sidesteps the von Neumann bottleneck and is very energy efficient, consuming 70 mW, with a power density about one ten-thousandth that of a conventional microprocessor (from Wikipedia).

IBM has been criticised for TrueNorth's focus on spiking neural networks, which do not map naturally onto deep learning, but IBM has now developed a new algorithm for running CNNs on TrueNorth. Spiking neurons do not fire on every cycle; the neurons of a spiking neural network must gradually accumulate their potential before they can fire. Deep learning experts have generally held that spiking neural networks are inefficient for deep learning, at least for convolutional neural networks. Facebook AI Research director and deep learning pioneer Yann LeCun has criticised IBM's TrueNorth chip on exactly these grounds: it primarily supports spiking neural networks...
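That "accumulate potential, then fire" behaviour is the textbook leaky integrate-and-fire model; a minimal sketch follows (a generic model for illustration, not IBM's actual TrueNorth neuron).

```python
import numpy as np

# Minimal leaky integrate-and-fire neuron: the textbook model behind the
# "accumulate potential, then fire" behaviour described above.
# (Generic illustration; not IBM's TrueNorth neuron model.)
def lif_spikes(inputs, leak=0.9, threshold=1.0):
    potential = 0.0
    spikes = []
    for x in inputs:
        potential = leak * potential + x      # integrate the input, with leak
        if potential >= threshold:            # fire only when the threshold is crossed
            spikes.append(1)
            potential = 0.0                   # reset after the spike
        else:
            spikes.append(0)
    return spikes

rng = np.random.default_rng(0)
print(lif_spikes(rng.uniform(0.0, 0.4, size=20)))  # sparse spike train, not one spike per step
```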
Neuromorphic chips have not been terribly exciting for deep learning precisely because of that focus on spiking networks. To make TrueNorth suitable for deep learning, IBM had to develop a new algorithm so that convolutional neural networks could run well on its neuromorphic hardware. The combined approach achieves what is described as "near state-of-the-art" classification accuracy across 8 datasets spanning vision and speech challenges, with accuracies ranging from roughly 65% to 97% in the best cases. Using only a single TrueNorth chip, it beat the current best accuracy on just one of those 8 datasets; with up to 8 chips, the IBM researchers could substantially improve the hardware's accuracy on the deep learning challenges, letting TrueNorth match or exceed the state of the art on three of them. The TrueNorth setup also processed 1,200 to 2,600 video frames per second, meaning a single TrueNorth chip could detect patterns in real time in data from as many as 100 simultaneous sources... (from IEEE Spectrum). TrueNorth's power efficiency is outstanding, so it is worth watching.

BrainChip

BrainChip's Spiking Neuron Adaptive Processor (SNAP) cannot do deep learning. It remains a curiosity, with no practical path yet to an engineered CNN solution. If you want to explore that road, IBM's stochastic phase-change neurons look more interesting.

Apple's Neural Engine

Will it or won't it appear? A Bloomberg report says it would be a secondary processor, but there are no details. This is an important area for Apple, not least in its competition with Qualcomm.

Others

1. Cambricon, a well-funded Chinese chip effort. It is an instruction set architecture for neural networks, with parallel, custom vector/matrix instructions and on-chip scratchpad memory. The claim is 91 times an x86 CPU and 3 times a K40M at only 1% of the peak power, i.e. 1.695 W. See these two papers: "Cambricon-X: An Accelerator for Sparse Neural Networks": http://cslt.riit.tsinghua.edu.cn/mediawiki/images/f1/cambricon-x.pdf and "Cambricon: An Instruction Set Architecture for Neural Networks": http://dl.acm.org/citation.cfm?id=3001179
2. Groq Inc., founded by ex-Google staff: perhaps another TPU?
3. AIMotive: https://aimotive.com/
4. Deep Vision is building low-power chips for deep learning; perhaps these two papers by its founders give a clue: "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing" [2013]: http://csl.stanford.edu/~christos/publications/2013.convolution.isca.pdf and [2015]: http://csl.stanford.edu/~christos/publications/2015.convolution_engine.cacm.pdf
5. DeepScale
6. Reduced Energy Microsystems is developing low-power asynchronous chips for CNN inference. According to TechCrunch, REM was Y Combinator's first venture into the ASIC space.
7. LeapMind is also busy.

FPGA

Microsoft has already bet on FPGAs. Wired's long read on Microsoft's Project Catapult is very good (covered in Chinese as "In depth | Wired's long piece revealing Microsoft Project Catapult: betting on FPGAs in the artificial intelligence era"): "Bing accounts for about 20% of the world's desktop search market and 6% of the mobile market, and within Bing these chips help it adapt to a new breed of artificial intelligence: deep neural networks." I am also interested in this approach. The FPGAs from Xilinx and Intel (which acquired Altera) are very powerful engines.
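Much of the FPGA pitch for deep learning, including the vendor INT8 claims discussed next, comes down to low-precision integer arithmetic: quantise weights and activations to 8-bit integers with a scale factor, accumulate in wider integers, then rescale. Below is a minimal sketch of symmetric INT8 quantisation; it is a generic illustration, not any vendor's actual toolflow.

```python
import numpy as np

# Minimal sketch of symmetric INT8 quantisation for inference
# (generic illustration only; not Xilinx's or Intel's toolflow).
def quantize(x):
    scale = np.abs(x).max() / 127.0          # map the observed range onto [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 64)).astype(np.float32)
activations = rng.standard_normal((1, 64)).astype(np.float32)

qw, sw = quantize(weights)
qa, sa = quantize(activations)

# Integer multiply-accumulate (int32 accumulator), then rescale back to float.
acc = qa.astype(np.int32) @ qw.astype(np.int32)
approx = acc.astype(np.float32) * (sa * sw)

exact = activations @ weights
print("max abs error:", np.abs(approx - exact).max())  # small relative to the values
```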
Xilinx naturally claims that its FPGAs are best for INT8, and its white paper includes slides to back this up. Both vendors support machine learning on their FPGAs: Xilinx has an acceleration zone (https://goo.gl/kheg5w), and Intel offers FPGA OpenCL (https://goo.gl/s62fma) and solutions (https://goo.gl/zkyyxb). Although FPGA performance per watt is excellent, the vendors' larger chips have long carried frightening price tags: Xilinx's VU9P is more than $50,000 at Avnet. Finding the balance between price and capability is the main problem with FPGAs.

A great advantage of the FPGA approach is that it permits some excellent architectural decisions. For example, if you want to compress the DRAM on your board and decompress it on the fly to improve your effective memory bandwidth for floating-point streams, you can, with some effort, find a solution; see "Bandwidth Compression of Floating-Point Numerical Data Streams for FPGA-Based High-Performance Computing": http://dl.acm.org/citation.cfm?id=3053688. That kind of dynamic architectural agility is hard, almost impossible, to achieve with any other approach. Too much architectural choice can be a problem, but it is the kind of problem I rather like. This is a good paper: "Reducing the Performance Gap Between Soft Scalar CPUs and Custom Hardware with TILT": http://dl.acm.org/citation.cfm?id=3079757. It studies the performance gap between custom hardware and an FPGA-hosted horizontally microcoded compute engine, reminiscent of the DISC (Discrete Instruction Set Computer) of many moons ago.

Who wins?

In a competition like this, only a fool would predict the winner. Qualcomm can coast into the winners' circle on the back of its dominant position in the phone market. Apple will succeed regardless. NVIDIA's V100, with its Tensor Cores, basically wins. I am not sure Google's TPU will survive the endless, long-run competition of Silicon Valley, despite its excellent performance. I really like the FPGA approach, but I cannot help thinking the vendors should release lower-priced DNN-oriented parts, or they will be passed over by the mainstream. Intel and AMD will do their own co-processors. With all the major players in the game, most platforms will support standard toolkits such as TensorFlow, so we will not need to worry too much about the details and can simply watch the benchmarks.

Among the smaller players, I like and support the Adapteva approach, though I suspect their memory architecture may not suit DNNs; I hope I am wrong. Wave Computing may be my favourite approach after the FPGAs; their whole asynchronous dataflow method is very elegant. REM seems to be doing something similar, but I think they may be too late. Can Wave Computing hold its ground against all the competition? Perhaps, as long as their asynchronous CGRA carries a fundamental advantage. I am not sure their success depends only on DNNs, though, since their technology has much broader applicability. Neuromorphic spiking processors can probably be ignored for now, although they may yet pay off given their power-consumption advantages. Quantum computing will shake all of this up a little. IBM's TrueNorth may be the exception among the spiking designs, since it can not only run spiking networks but also run DNNs effectively.
Original link: https://www.eeboard.com/news/gpu-8/

     

     

     

     
