"In the past five years, Ying Weida developed its data center business into a giant worth billion dollar, but never encountered a similar competitor. This is an amazing fact, in my memory, this In today's technology world is unparalleled.
I am going to write a prediction of the AI chip in the next year, as well as the articles of Ying Weida to deal with the challenge, but I am so far apart, the article is much longer than I expected. Because there are many contents to introduce, I decided to divide the article into 3 parts.
Part 1: Introduction, and analysis of the big challengers to Nvidia: Intel, AMD, Google, Xilinx, Apple, and Qualcomm.
Part 2: The startups and Chinese companies, and the roles they may play.
Part 3: Nvidia's strategy for fending off would-be competitors.
1. Introduction
In the past five years, Nvidia grew its data center business into a multi-billion-dollar juggernaut without ever encountering a credible competitor. That is an amazing fact and, in my memory, unparalleled in today's technology world. This rapid growth has been driven mainly by demand for fast GPUs for artificial intelligence (AI) and high-performance computing (HPC).
Jensen Huang, Nvidia's CEO, likes to talk about a "Cambrian explosion" in deep learning, referring to the fast pace of innovation in neural network algorithms. We will return to what that means for Nvidia in Part 3, but I have chosen to borrow the concept as the title of this series, because we are also at the onset of an explosion of AI chips from companies large and small around the world.
Three years ago, a chip startup could barely raise venture capital. Now there are dozens of well-funded challengers building chips for artificial intelligence.
Figure 1: Nvidia's explosive data center growth has tracked the Cambrian explosion of new neural networks.
Last year, Nvidia and IBM reached a computing summit of sorts when they announced that Nvidia GPUs provide roughly 95% of the performance of the Summit supercomputer at Oak Ridge National Laboratory (ORNL), the fastest computer in the world. Although this is an incredible achievement, many have begun to wonder whether the fairy tale can continue for Nvidia.
Figure 2: The Summit supercomputer at the US Department of Energy's Oak Ridge National Laboratory is the fastest computer in the world.
According to the latest quarterly report, Nvidia's data center revenue grew 58% year over year to $792 million, nearly 25% of the company's total revenue. Over the past four quarters, it totaled $2.86 billion. If the company can sustain this growth, data center revenue will reach about $4.5 billion by 2019. That sounds like heaven, or at least paradise on earth, right?
There is no doubt that Nvidia has built excellent products on top of a powerful, scalable architecture. Nvidia also enjoys a strong, self-sustaining ecosystem of software, universities, startups, and partners, making it the host of the new world it created.
Although some believe this ecosystem forms an insurmountable moat, dark clouds are appearing on the horizon. Potential threats come from Intel, Google, AMD, and dozens of startups in the US and China, all drawn by the heat of artificial intelligence.
So far, in my view, the competition has been mostly talk. Competitors have issued dozens of announcements, but I am fairly convinced that, apart from Google, no company has yet taken meaningful revenue from Nvidia. Let's look at the current competitive landscape and what 2019 may bring.
The large challengers
Although more than 40 startups have entered this space (according to The New York Times), let's be realistic: only a few companies can truly succeed in this market, meaning, say, more than $1 billion in value. In the training of deep neural networks, given the strength of Nvidia's installed base and its ubiquitous ecosystem, Nvidia will be hard to defeat.
However, there is a sizable inference market that will eventually exceed the training market in total revenue. Unlike training, inference is not a single market: it spans a huge range of data types and the deep learning algorithms optimized for them, in the cloud and at the edge, each with specific performance, power, and latency requirements.
Moreover, no giant yet dominates inference, even in the automotive market where Nvidia claims leadership. For these reasons, inference is where most new entrants are focusing first. Let's look at the big companies in the fight.
Google
Google was the first to demonstrate that a purpose-built chip (an ASIC, or application-specific integrated circuit) could take on the more programmable, more general-purpose GPU for deep learning. Coincidentally, Google may also be one of Nvidia's largest customers. As I have noted before, Google has now released four "Tensor Processing Units" (TPUs), which accelerate deep learning training and inference in the cloud and, most recently, at the edge. Google's TPU delivers quite credible performance for training and serving deep neural networks, with each chip providing up to 45 trillion operations per second (TOPS).
By comparison, Nvidia's Volta peaks at 125 TOPS. Google's first two TPUs were built for internal use and bragging rights, but Google now offers them as a service to its cloud customers on Google Compute Cloud.
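For a rough sense of scale, the headline numbers above can be compared directly. A minimal sketch, using only the figures quoted in the text (peak TOPS is a crude metric that ignores memory bandwidth, precision, and real utilization, so the ratio is an upper bound, not a real-world speedup):

```python
# Crude peak-throughput comparison using the figures quoted above.
# Peak TOPS ignores memory bandwidth and achievable utilization,
# so treat the ratio as an upper bound, not a measured speedup.
tpu_tops = 45      # Google TPU, per chip
volta_tops = 125   # Nvidia Volta V100, TensorCore peak

ratio = volta_tops / tpu_tops
print(f"Volta's peak is ~{ratio:.1f}x that of a single TPU chip")
```

On paper, then, Volta's peak is roughly 2.8x a single TPU chip, which is why Google's numbers are credible rather than dominant.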
Although TPUs have powered Google's AI initiatives, their reach beyond Google's internal use cases (admittedly, a sizable market in itself) is intentionally limited.
A TPU can only train and run models built with Google's TensorFlow framework; you cannot use it to train or run models built with Apache MXNet or PyTorch, the fast-growing AI frameworks backed by Amazon and Facebook respectively. Nor can TPUs be used for the non-AI HPC applications where GPUs dominate.
Furthermore, you cannot buy a TPU for on-premises computing in a corporate or government data center. Google doesn't mind any of this, because it sees TPUs and TensorFlow as strategic to its overall AI leadership: hardware optimized for its software, and software optimized for its hardware, make a powerful and durable platform.
The more immediate impact of TPUs may have been to legitimize the ASIC as an alternative to GPUs, at least in the eyes of investors. The CEO of one deep learning chip startup shared this experience with me: after Google announced its TPU, venture capital began to flow freely. He subsequently raised hundreds of millions of dollars.
Google has always been good at stealing some thunder from the predictable announcements Nvidia makes at its GPU Technology Conference (usually in March). I would not be surprised to see Google do it again this year, perhaps with performance data for a 7nm TPU part.
Not to be outdone, Amazon Web Services announced last fall that it, too, is building a custom ASIC for inference processing. The chip is still in development, however, and the company has shared no details of its design or availability.
Intel
Figure 3: Former Nervana CEO Naveen Rao now leads Intel's AI product development and has brought clarity to the company's strategy.
Intel's situation is a bit complicated, because it is a big company doing many things at once. Although Intel intends to compete in training in 2019 with its Nervana chip, it realizes that inference will become the bigger market, and there it is very well positioned. In addition to its Xeon CPUs (whose inference performance improved significantly in a recent update), Intel has acquired Mobileye and Movidius for automotive and embedded inference processing. I have seen demonstrations of both devices, and they are genuinely impressive. Intel has also invested in a run-anywhere software layer called OpenVINO, which lets developers train a model anywhere and then optimize and run it on any Intel processor. Quite amazing.
At CES in Las Vegas, Intel revealed that it is working closely with Facebook on the Nervana Neural Network Processor for inference (NNP-I). This came as a surprise, since many had predicted Facebook was developing its own inference accelerator.
Meanwhile, Naveen Rao, Intel vice president and general manager of artificial intelligence, shared on Twitter that the NNP-I will be an SoC (system on chip), manufactured in Intel's 10nm fab, and will include Ice Lake x86 cores. Rao said this will be a common theme for Intel, which may hint at future x86/GPU chips for desktops and laptops, similar to AMD's APUs.
On the training side, Intel's initial plan after acquiring Nervana was to release a product called "Lake Crest" in 2017. That slipped to 2018, and eventually the company decided to start over. This was probably not because Nervana couldn't finish the part; more likely, Intel realized the device would not beat Nvidia by enough to matter once Nvidia added TensorCores to Volta and subsequent GPUs. I suspect we will see the same script play out again once Nvidia announces whatever amazing new product it builds at 7nm, but that is getting ahead of the story.
Qualcomm and Apple
For completeness, I will include these two companies, since both deliver impressive AI in mobile phones (and, in Qualcomm's case, in autonomous vehicles as well). Apple, of course, focuses on the A-series CPUs in the iPhone and on iOS support for on-device AI. As leaders in voice and image processing platforms, both companies hold large IP portfolios on which they could build broader AI leadership (although Huawei is also pushing hard into AI, as we will see).
AMD
AMD has spent the past three years getting its AI software house in order. When I worked there in 2015, you couldn't even run its GPUs on a Linux server without booting Windows first. The company has made great progress since then: its ROCm software and compilers ease migration from CUDA, and MIOpen (not to be confused with OpenML) accelerates the math libraries on its chips. Still, AMD's current GPUs remain at least a generation behind Nvidia's V100 for AI, and the V100 is nearly two years old. What AMD delivers at 7nm remains to be seen.
Xilinx
Xilinx, the undisputed leader in programmable logic (FPGAs), had a very good 2018. Besides announcing its 7nm next-generation architecture, it won significant designs at Microsoft, Baidu, Amazon, Alibaba, and Daimler Mercedes-Benz. In AI inference, FPGAs hold a significant advantage over ASICs: they can be dynamically reconfigured for the specific job at hand. That matters enormously when the underlying technology is changing as fast as AI is. For example, Microsoft has shown how its FPGAs (now from both Xilinx and Intel) can use 1-bit, 3-bit, or almost any precision for specific layers of a deep neural network. That may sound nerdy, but it can dramatically speed up processing and cut latency while using less power. In addition, Xilinx's upcoming 7nm chip, called Versal, pairs AI and DSP engines that accelerate specific applications with an adaptable logic array. Versal will begin shipping sometime this year, and I think it could change the rules of the game for inference.
2. The startups
This is the second of three articles on the state of the AI chip market and what will happen in 2019. This year will be a feast of new chips and benchmarks. Leading the way are the big companies I covered in the first post (Intel, Google, AMD, Xilinx, Apple, Qualcomm), followed by dozens of Silicon Valley startups and Chinese unicorns valued at over US $1 billion. In this part, I will cover the most notable, or at least the loudest, startups in the West and in China, where the government is working hard to build a homegrown AI chip industry. We will start with Wave Computing, which appears set to be the first to bring a training chip to market.
Wave Computing
Wave Computing had an eventful 2018: it launched its first Dataflow Processing Unit, acquired MIPS, created MIPS Open, and delivered its first early systems to a few lucky customers. The Wave architecture has some very interesting features that I will not cover in depth here; for now, we await customers' experience with it at scale on real workloads.
Wave is not an accelerator that plugs into a server; it is a standalone processor for graph computation. This approach has pros and cons. On the plus side, Wave does not suffer the memory bottlenecks that afflict accelerators such as GPUs. On the minus side, installing a Wave appliance is a forklift upgrade: it completely replaces traditional x86 servers, making Wave a competitor to every server maker.
I don't expect Wave to beat Nvidia on a single node, but its architecture is well designed, and the company has said customer results should be available soon.
Figure 1: Wave's systems are built from the four-node "DPU" boards shown above.
Graphcore
Graphcore is a well-funded startup ($310 million raised, with a current valuation of $1.7 billion) with a world-class team. It is building a novel graph processor architecture whose memory sits on the same die as its logic, which should deliver higher performance on real applications. The team has been teasing its forthcoming product for quite a while: it was "almost ready" last April, and the company's latest word late last year was that it would soon begin production. Its investor list is eye-catching, including Sequoia Capital, BMW, Microsoft, Bosch, and Dell Technologies.
I have had a look at the Graphcore architecture, and it appears quite striking, scaling from edge devices up to data center training and inference with its "Colossus" dual-chip package. At the recent NeurIPS event, Graphcore demonstrated its Rackscale IPU Pod, which delivers over 16 petaflops on a rack of 32 servers. Although the company often claims its performance will be 100 times that of the best GPUs, my math works out differently.
Graphcore says a server with four "Colossus" GC2 cards (8 chips) delivers 500 TFLOPS (trillions of operations per second) of mixed-precision performance. A single Nvidia V100 delivers 125 TFLOPS, so in theory four V100s should match it. As usual, the devil is in the details: V100 peak performance is only available when code is refactored into the 4x4 matrix multiplications the TensorCores execute, a limitation the Graphcore architecture cleverly avoids. Needless to say, the V100 is also expensive and consumes up to 300 watts. Moreover, Graphcore's on-chip interconnect and "in-processor memory" approach may deliver excellent application performance beyond what TFLOPS benchmarks suggest; in some networks, such as generative adversarial networks (GANs), memory is the bottleneck.
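Working through those numbers makes the "100x" skepticism concrete. A back-of-the-envelope sketch using only the figures quoted above (paper peak FLOPS on both sides, ignoring the TensorCore refactoring caveat):

```python
# Back-of-the-envelope check of the Graphcore vs. V100 peak numbers above.
graphcore_server_tflops = 500   # 4-card GC2 server, mixed precision (claimed)
chips_per_server = 8            # 4 cards x 2 "Colossus" chips each
v100_tflops = 125               # single Nvidia V100, TensorCore peak

per_chip_tflops = graphcore_server_tflops / chips_per_server   # per GC2 chip
v100_equivalents = graphcore_server_tflops / v100_tflops       # V100s to match

print(f"{per_chip_tflops} TFLOPS per GC2 chip")
print(f"{v100_equivalents:.0f} V100s equal one Graphcore server, on paper")
```

On paper, then, four V100s match one Graphcore server, which is a far cry from 100x; the real differentiator, if any, will be how often each architecture sustains its peak.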
Once again, we will have to wait for real users running real applications to judge this architecture. Even so, Graphcore's investor list, expert roster, and sky-high valuation tell me this could be a good thing.
Figure 2: Graphcore shows off this very cool visualization of processing on the ImageNet dataset. Visualizations like this help developers see which parts of their training run consume processing cycles.
Habana Labs
Last September, at the first AI Hardware Summit, this Israeli startup surprised many by announcing that its first chip, for inference, was ready, running convolutional neural networks for image processing at record performance. On the ResNet-50 image classification benchmark, the processor classifies 15,000 images per second, about 50% more than Nvidia's T4, while consuming only 100 watts. In December 2018, Habana Labs closed its latest funding round, led by Intel Capital with participation from WRV Capital, Bessemer Venture Partners, and Battery Ventures, raising its total funding from $45 million to $120 million. The new money will partly fund its second chip, "Gaudi," aimed at the training market and reportedly able to scale to more than 1,000 processors. In this highly competitive field, Habana Labs shows a lot of promise.
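For context, Habana's claim implies a T4 baseline we can back out. A quick sketch (the T4 figure below is derived from the "about 50% more" claim, not a measured number; the throughput and power figures are Habana's own):

```python
# Implied numbers behind the ResNet-50 comparison quoted above.
goya_img_s = 15_000     # images/sec on ResNet-50, as claimed by Habana
goya_power_w = 100      # watts, as claimed by Habana

# "about 50% more than the T4" implies this T4 baseline (derived, not measured):
implied_t4_img_s = goya_img_s / 1.5     # ~10,000 images/sec

perf_per_watt = goya_img_s / goya_power_w
print(f"implied T4 baseline: ~{implied_t4_img_s:.0f} images/sec")
print(f"Goya efficiency: {perf_per_watt:.0f} images/sec per watt")
```

The 150 images/sec/watt figure is the number to watch: inference buyers tend to compare accelerators on performance per watt rather than raw throughput.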
Other startups
I know of more than 40 companies worldwide building chips for AI training and inference. I find that most are doing straightforward fused multiply-accumulate (FMA) arrays with mixed-precision math (8-bit integer, 16-bit floating point), which doesn't surprise me. This approach is relatively easy to build and harvests the low-hanging fruit, but it offers little lasting advantage against Nvidia, Intel, and the few startups, such as Wave and Graphcore, that have developed genuinely novel architectures. Here are a few companies that have caught my attention:
Groq: founded by former Google employees who worked on the TPU, with world-beating ambitions.
Tenstorrent: founded in Canada by former AMD staff, still in stealth. I can only say that its CEO's vision and architecture left a deep impression on me.
ThinCI: an Indian-rooted company focusing on edge devices and autonomous vehicles, with established partnerships with Samsung and Denso.
Cerebras: led by veterans of SeaMicro (the former AMD subsidiary), including Andrew Feldman, and still in deep "stealth" mode.
Mythic: a startup with a unique approach to edge inference, performing analog computation on non-volatile memory; its chip should launch in 2019.
Chinese companies
China has been trying to find a way to wean itself off US semiconductors, and AI accelerators may offer the exit it has been seeking. The central government has set a goal of building a trillion-yuan AI industry by 2030, and since 2012 investors have poured more than $4 billion into Chinese AI startups. Members of the US Congress have called this an AI arms race, and the US technology industry could fall behind, since Chinese companies and research institutions are less constrained by the privacy and ethical concerns that temper Western progress.
Cambricon and SenseTime may be the most interesting Chinese AI companies, but others deserve attention as well. Also keep a close eye on the big players like Baidu, Huawei, Tencent, and Alibaba, all of which are investing heavily in AI software and hardware.
Cambricon is a $2.5 billion Chinese unicorn that has released its third generation of AI chips. The company claims they deliver roughly a 30% performance advantage over Nvidia's V100 at lower power. Cambricon also licenses its IP, and it supplied the AI hardware in Huawei's Kirin 970 mobile chipset.
SenseTime, perhaps the highest-valued AI startup, is best known for powering smart surveillance cameras across China, a fleet now exceeding 175 million cameras, including those made by other companies. Founded in Hong Kong, SenseTime raised a $600 million round led by Alibaba, and according to numerous press reports it is currently valued at about $4.5 billion. It has strategic partnerships with Alibaba, Qualcomm, Honda, and even Nvidia. The company operates a supercomputer running roughly 8,000 (presumably Nvidia) GPUs, and plans to build five more supercomputers to process the facial recognition data captured by its millions of cameras.
3. Nvidia
Now that I have scared everyone who holds Nvidia stock and depressed everyone who spent serious money on Nvidia GPUs, let's look realistically at how Nvidia can maintain its leadership in such a competitive market. We need to examine the training and inference markets separately.
A history lesson from Nervana
First, let's review Intel's experience with Nervana. Before the acquisition, Nervana claimed its chip would outperform GPUs by 10x. Then a funny thing happened on the way to victory: Nvidia's TensorCores surprised everyone, improving performance not by 2x but by 5x. Nvidia then doubled down with NVSwitch, building the amazing 16-GPU DGX-2 server (at $400,000, quite expensive) that beats most, perhaps all, would-be competitors. Meanwhile, Nvidia's cuDNN libraries and drivers nearly doubled performance again, and the company built a GPU cloud that makes using its GPUs as simple as click-and-download from a catalog of optimized software containers for some 30 deep learning and scientific workloads. So, as I covered in the previous part, Intel's promised performance advantage evaporated, and the new Nervana chip promised for late 2019 had to go back to the drawing board. Basically, Nvidia proved that its more than 10,000 engineers, with solid track records and deep technical reserves, can out-execute 50 smart engineers in a virtual garage. Nobody should be surprised, right?
Give 10,000 engineers a big sandbox
Now fast-forward three years to 2019. Once again, competitors claim 10x or even 100x performance advantages for chips still in development. Nvidia still has those 10,000 engineers, and it maintains deep technical collaborations with the world's top researchers and end users. All of that is now feeding into Nvidia's next-generation 7nm chip, which in my view will essentially convert the company's product from "a GPU chip for AI" into "an AI chip with a GPU."
Figure 1: The NVSwitch-connected DGX-2 supercomputer delivers up to two petaflops of AI performance.
How much additional logic area will the move to 7nm give the company's next-generation product? Although the following analysis is simplistic, it helps frame an answer to this key question.
Let's start with the first ASIC to show excellent performance: the Google TPU. Analyses I have seen put each Google TPU chip at roughly 2-2.5 billion transistors. The Volta V100, built in a 12nm process, has about 21 billion transistors and is the largest chip TSMC could manufacture. As Nvidia migrates from 12nm to 7nm, the same die could hold about 1.96x (1.4 x 1.4) as many transistors. So, in theory, if Nvidia added no graphics logic (unlikely, of course), it would have roughly 20 billion additional transistors to play with, approximately ten times the logic of an entire Google TPU.
Suppose my logic estimate is off by a factor of two. Nvidia's engineers would still have five times the TPU's logic available for new AI features. Now, all this assumes Nvidia pursues performance without trying to reduce cost or power. But in the training market, that is exactly what users want: shorter training times. As for what Nvidia might actually deliver, there are many possibilities, including on-die processor memory and still more versions of TensorCores.
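The die-budget arithmetic above can be made explicit. A minimal sketch, using the estimates quoted in the text (the TPU transistor count is taken at the midpoint of the 2-2.5B range, and the 1.4x linear shrink per full node is a rule of thumb, not a foundry specification):

```python
# Die-budget arithmetic from the estimates quoted above.
v100_transistors_b = 21.0   # Volta V100: ~21B transistors at 12 nm
tpu_transistors_b = 2.25    # Google TPU: ~2-2.5B per chip (midpoint)

linear_shrink = 1.4                     # rough linear scaling, 12 nm -> 7 nm
density_gain = linear_shrink ** 2       # ~1.96x transistors in the same area

budget_7nm_b = v100_transistors_b * density_gain    # ~41B transistors total
extra_b = budget_7nm_b - v100_transistors_b         # ~20B left over
tpus_of_logic = extra_b / tpu_transistors_b         # ~9-10 whole TPUs' worth

print(f"7 nm budget: ~{budget_7nm_b:.0f}B transistors")
print(f"extra logic: ~{extra_b:.0f}B, about {tpus_of_logic:.0f}x a whole TPU")
```

Even if the estimates here are off by a factor of two in Nvidia's disfavor, the shrink still leaves several TPUs' worth of spare logic, which is the crux of the argument.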
My point is that Nvidia undoubtedly has the expertise, and the available die area, to innovate beyond what it has already done with TensorCores. I have talked with many of the interesting AI chip startups, and to their credit, their CEOs told me they neither underestimate Nvidia nor assume it is trapped in a GPU mindset. Nvidia's DLA and Xavier, an ASIC and an SoC respectively, prove that it can build accelerators of many kinds, not just GPUs. This is partly why many of these startup CEOs decided not to take Nvidia head-on in training, but to focus first on inference.
So I don't think Nvidia will be at a disadvantage in training. Its weakness may be chip cost, but for training, customers will pay. And for inference, Nvidia's Xavier is an impressive chip.
The Cambrian explosion favors programmability
Let's return to the Cambrian explosion. Nvidia rightly points out that we are in the early stages of algorithm research and experimentation. An ASIC optimized for one workload, say convolutional neural networks for image processing, may perform poorly (and almost certainly will) on another, such as GANs, RNNs, or neural networks yet to be invented. This is where GPU programmability, combined with Nvidia's researcher ecosystem, shines: a GPU can adapt fairly quickly to a new style of neural network. If Nvidia can also solve its looming memory problem, it will remain hard to displace. By using NVLink to build a mesh of eight GPUs sharing 256 GB of high-bandwidth memory (HBM), Nvidia has significantly eased the memory capacity problem, albeit at high cost. We will have to wait for its next-generation GPU to learn how it addresses latency and bandwidth, where on-chip memory approaches claim roughly a 10x advantage over HBM.
The war for inference
As I wrote in the first part of this series, no giant yet dominates inference. The edge and data center inference markets are diverse and set to grow rapidly, but I doubt they will be especially attractive from a profit standpoint. After all, in what will become a commodity market, margins may be quite thin, with many companies competing on price. Some inference is easy; some is very hard. The hard end of the market will sustain high margins, because only complex SoCs combining CPUs, GPUs, DSPs, and ASIC accelerators can deliver the performance autonomous driving demands. Intel's Naveen Rao recently noted on Twitter that the Nervana inference processor will in fact be a 10nm SoC with Ice Lake CPU cores. Nvidia already leads here with its Xavier SoC for autonomous driving, and Xilinx will take a similar approach later this year. Any startup going down this road needs two things: a) an excellent performance-per-watt story, and b) an innovation roadmap that keeps it ahead of commoditization.
Conclusions
In closing, let me reiterate a few points:
The future of AI will be built on specialized chips, and the market for them will become huge.
The world's largest chip companies intend to win the coming AI chip war. Intel is playing catch-up, but don't underestimate its capabilities.
Many startups are well funded, and some will succeed. If you want to invest in a venture-backed chip company, make sure its team does not underestimate Nvidia's strength.
Within the next five years, China will largely wean itself off US AI technology.
Nvidia has more than 10,000 engineers, and its next generation of high-end GPUs for AI may amaze us all.
The inference market will grow rapidly and will fragment into many application-specific niches. FPGAs may play an important role here, especially Xilinx's next generation.
Clearly there is a great deal more to say on this topic, and I have only scratched the surface! Thank you for taking the time to read this series. I hope you found it enlightening.
Original article: https://www.eeboard.com/news/ai-353/