Arm's new GPU architecture in depth: a major leap

"Today, Arm launched its new Cortex-A77 CPU micro-architecture, but more importantly, introduced a new Valhall GPU architecture and the new Mali-G77 GPU.Bifrost architecture have been available for three years, with the continuous development of the industry and workloads , Arm's GPU must advance with the times. Valhall and the new Mali-G77 after three generations of Mali GPU, in terms of performance, density and efficiency are some significant improvements. Despite last year's G76 architecture for computing execution engine made some significant changes, but the G77 further, and out of the Arm unusual calculate the relative core design. Recalling Bifrost-- third time is the charm Over the past few years on the GPU IP product Arm of not very friendly, it's not a big secret. When the first Bifrost GPU - Mali-G71 was launched in 2016 and put into production later in the year on Kirin 960 and Exynos 8895, we had expected to have a good performance and efficiencies. Bifrost first GPU architecture Arm scalar, unlike the previous generation (Midgard: T-600,700 & 800 series) Design of a vector instruction. This change is fundamental, as we saw AMD and NVIDIA desktop GPU such providers have introduced new GCN architecture and Tesla in the past decade. Unfortunately, the first two generations of Bifrost, Mali-G71 and G72 is not followed by a good GPU. Arm two major licensees, Hass and Samsung, with the launch of two generations of GPU architecture is very disappointing. Kirin 960 and 970 particularly bad in this regard, I think it Huawei and glory of product planning and marketing had a considerable impact. In the last iteration of the Bifrost architecture, Mali-G76 for Arm terms is a more important leap forward, IP in large part to address some of the key issues of the previous generation, so that the belt is Exynos 9820 and Kirin 980 chipset to the relatively good results. 
Unfortunately, while Arm was catching up and fixing Bifrost's problems, its competitors did not stand still and kept pushing the limits. Qualcomm's Adreno GPU architecture has led the mobile space for years. Although this year's Adreno 640 did not bring impressive improvements, it is still significantly ahead of Arm in performance, efficiency and density. Even more worrying, the GPU in Apple's A12 was a major leap in performance and efficiency and is much better than even Qualcomm's best GPU, let alone Arm's Mali GPUs.

Introducing Valhall: a major improvement

Today Arm is introducing a new GPU architecture: Valhall (Old Norse Valhöll, i.e. Valhalla). The new architecture brings a new compute ISA and core design that try to address the major shortcomings of the Bifrost architecture, and the result looks very similar to the approaches we see other GPU vendors employ.

The first implementation of Valhall is the Mali-G77, which will include all of the architectural and microarchitectural improvements discussed today.

Arm promises 30% better energy efficiency and 30% better area density (at ISO performance and process), as well as 60% better performance for machine learning inference workloads on the GPU. More interesting still, SoCs arriving at the end of 2019 and in 2020 are expected to perform 40% better than 2019 devices. Next-generation SoCs are expected to see only a minor process node improvement, so most of the gains cited here come from the Mali-G77's new architecture and microarchitecture.

Valhall overview: a new ISA for modern workloads

At the heart of the new Valhall architecture is Arm's new execution core, which is very different from its predecessors. It is still a scalar design, but unlike Bifrost's unusually narrow 8-wide design, the new execution core is closer to what we see from desktop GPU vendors such as AMD and NVIDIA.
In the initial Bifrost iterations, the Mali-G71 and G72, the execution block consisted of 4-wide scalar SIMD units with a warp/wavefront size of 4. Last year's G76 increased this to two 4-wide units and a warp size of 8. Arm's reason for using such narrow units and small warps in the original Bifrost design was to reduce idle cycles on the ALUs: workloads at the time could not supply enough threads to fill a wider warp design. In theory, this let Arm achieve better ALU utilization at the expense of more control logic.

However, as a new generation of workloads emerged, this approach stopped making sense. Today's games are rapidly moving towards more complex computation, and in the last couple of years in particular we have seen games that originated on PC being ported to mobile devices. As shader computation grows more complex, it becomes easier to issue more threads and to fill a wider warp design.

The new Valhall architecture therefore supports a 16-wide warp execution model. That is still not as wide as Nvidia's 32-wide or AMD's 64-wide designs, but considering that Mali GPUs were using a 4-wide warp execution model until last year, it is a huge leap forward.

Arm also changed the ISA itself and simplified many instructions. We have no further details, but Arm says the new ISA is easier to compile for and is designed to better match modern APIs such as Vulkan.

On previous Bifrost and even Midgard GPUs, a single GPU core used multiple execution engines. Each of the three engines had its own dedicated datapath control logic, its own scheduler and instruction cache, and its own register file and messaging blocks, which naturally costs a considerable number of transistors. At the high end in particular this made little sense, because the number of execution engines had not changed since the T860/T880 series four years ago.
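The trade-off between warp width and ALU utilization described above can be illustrated with a small Python sketch. It assumes a deliberately simplified model that does not come from Arm: a batch of threads is packed into warps of a fixed width, and any lanes left empty in the last warp sit idle. The function names and the batch sizes are hypothetical; only the warp widths (4, 8 and 16) are taken from the text.

```python
import math

# Warp widths quoted in the article for each GPU generation.
WARP_WIDTHS = {"Mali-G71/G72": 4, "Mali-G76": 8, "Mali-G77": 16}


def lane_utilization(threads: int, warp_width: int) -> float:
    """Fraction of ALU lanes doing useful work for one batch of threads.

    Simplified model (an assumption, not Arm's): threads are packed into
    full warps, and only the last warp may be partially filled.
    """
    if threads <= 0 or warp_width <= 0:
        raise ValueError("threads and warp_width must be positive")
    warps = math.ceil(threads / warp_width)
    return threads / (warps * warp_width)


def utilization_by_gpu(threads: int) -> dict:
    """Lane utilization for the same batch on each warp width."""
    return {gpu: lane_utilization(threads, width)
            for gpu, width in WARP_WIDTHS.items()}


if __name__ == "__main__":
    for batch in (20, 1024):  # hypothetical small and large batches
        print(batch, utilization_by_gpu(batch))
```

In this model a batch of only 20 threads keeps every lane of the 4-wide design busy but fills just 62.5% of the lanes of the 16-wide design, while a batch of 1024 threads fully occupies all three widths. Real utilization also depends on divergence and scheduling, which this sketch ignores.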
The G77 consolidates the "small" execution engines of previous generations into a single larger block that shares more of its control logic. There is still some duplication in the new engine: the actual ALU pipelines are organized into two "clusters", each with its own 16-wide FMA unit and accompanying execution units.

Part of the new ISA is a new encoding that is more regular than what we saw in Bifrost. An important change in the Valhall architecture is that the new ISA does away with fixed-issue scheduling, clauses and tuples. In Bifrost, Arm delegated instruction scheduling to the compiler, which grouped instructions into so-called clauses. This worked in practice, but it took a lot of compiler work to get the best results and to hide the latency between instructions and data accesses.

In Valhall, scheduling is done entirely by hardware, which is essentially closer to how an out-of-order CPU works. The new design also means the ISA is more decoupled from the microarchitecture that implements it, which is a more forward-looking design choice.

The new ISA also focuses on optimizing texturing instructions, which ties in with the new architecture's greatly increased texturing capability.

Other changes are incremental updates to existing techniques and data structures, including optimizations to the geometry flow and to AFBC (Arm Frame Buffer Compression).

Looking more closely at the execution engine, it is divided into four blocks: the front end, comprising the warp scheduler and the I-cache; two identical datapath clusters (processing units); and the message block, which connects to the load/store unit and the fixed-function blocks.

The front end supports up to 64 warps, or 1024 threads. Each processing unit has three ALUs: the FMA and CVT (convert) units are 16-wide, while the SFU (special function unit) is 4-wide.

The front end creates and retires warps and maintains the necessary state.
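The figures quoted so far can be cross-checked with a little arithmetic. The Python sketch below is only an illustration: the `CoreConfig` class and its field names are hypothetical, and the per-core layouts are the ones described in this article (three engines with two 4-wide units each for the G76, one engine with two 16-wide clusters for the G77).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    """Per-shader-core execution resources (hypothetical helper type)."""
    name: str
    engines: int           # execution engines per shader core
    units_per_engine: int  # SIMD units or clusters per engine
    lanes_per_unit: int    # FMA lanes per unit or cluster

    @property
    def fma_lanes(self) -> int:
        """Total FMA lanes per shader core."""
        return self.engines * self.units_per_engine * self.lanes_per_unit


def max_resident_threads(warps: int, warp_width: int) -> int:
    """Threads the front end can track, given its warp capacity."""
    return warps * warp_width


# Layouts as described in the article.
G76 = CoreConfig("Mali-G76", engines=3, units_per_engine=2, lanes_per_unit=4)
G77 = CoreConfig("Mali-G77", engines=1, units_per_engine=2, lanes_per_unit=16)

if __name__ == "__main__":
    print(G76.name, G76.fma_lanes)
    print(G77.name, G77.fma_lanes)
    print("lane ratio:", G77.fma_lanes / G76.fma_lanes)
    print("threads:", max_resident_threads(warps=64, warp_width=16))
```

Under these assumptions the G76 comes out at 24 FMA lanes per core and the G77 at 32, a ratio of 4/3, which is consistent with the roughly 33% ALU increase quoted elsewhere in this article. Likewise, 64 warps of 16 threads give the 1024 threads stated for the front end.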
The front end contains a dynamic scheduler that decides which instruction from each warp gets executed. Warps waiting on a dependency are swapped out for warps that are ready to execute.

The I-cache, which is shared between the processing units, is 16KB (2048 instructions) and 4-way set associative, with enough bandwidth to issue four instructions per cycle.

Inside each processing unit (cluster) there are four fetch units that issue to the arithmetic units. Each fetch unit has a tightly coupled register file, as well as a forwarding buffer that reduces register file access latency.

The FMA ALU supports 16 FP32 FMAs per cycle, twice that for FP16, and twice that again for INT8 dot products. The convert unit handles basic integer operations and, naturally, type conversions, and it also serves as the branch port.

The SFU is 4-wide rather than 16-wide because it handles less frequently used instructions that do not need much throughput.

Overall, a high-level comparison of the Mali-G77 and Mali-G76 execution engines comes down to one engine versus three. The single engine has more instances of the main datapath and fewer instances of control logic and I-cache, which improves the area efficiency of the whole block.

One significant change concerns latency: the ALU datapath is now 4 cycles deep instead of the previous 8, which may improve performance when operations cannot be chained. Likewise, the new core issues instructions in a superscalar-like fashion rather than using the previous barrel pipeline design, which forced compromises on latency. This change is also one of the reasons the new compiler is much simpler, because it no longer has to match instruction issue to the pipeline design as before.

Mali-G77 microarchitecture

The execution engine covered so far is responsible for arithmetic processing. This section covers the microarchitecture of the wider core design.
Here, Arm generally stays very close to the overall design of the previous GPU generation, but there are important changes in several blocks.

The shader core still comprises the execution engine, the load/store unit with its cache, the attribute unit, the varying unit, the pixel back-end and the texture mapping unit, along with various other 3D fixed-function blocks.

The biggest change here is the texture unit, which doubles throughput compared with the one found in the Mali-G76.

At a high functional level, the new TMU looks very similar to its predecessor, but the new design makes some very important changes for throughput. The design is split into two "paths", a hit path and a miss path, which handle hits and misses in the texture cache respectively. The hit path is naturally the shorter one and is optimized for latency.

On the hit path, the texture cache itself has been improved: it is now 32KB and can deliver 16 texels per cycle. The filtering unit has also been improved and now supports one bilinear texture quad per cycle, or half a quad per cycle for trilinear filtering, both twice the throughput of the G76.

Interestingly, Arm says the new TMU is roughly the same size as the previous one while delivering double the throughput, which is a very good engineering feat.

This large increase in texturing capability changes the core's ALU:Tex ratio. ALU capability grew by 33%, but with TMU throughput doubled the ratio now falls back to a lower value, favoring texture throughput after past generations focused on improving compute performance. Arm considers this a necessary change because current workloads are starting to put more load on the GPU in this respect.

Note that although texture filtering throughput has increased, the actual pixel back-end throughput has not.
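The shift in the ALU:Tex ratio can be put into numbers with a short Python sketch. This is a back-of-the-envelope illustration rather than anything published by Arm. The FMA lane counts (24 for the G76, 32 for the G77) are derived from the execution engine layouts described in this article, the texel rates assume that a quad is a 2x2 block of four texels, and the function name is hypothetical.

```python
from fractions import Fraction

QUAD_TEXELS = 4  # assumption: one quad is a 2x2 block of texels


def alu_tex_ratio(fma_lanes: int, bilinear_texels_per_cycle: int) -> Fraction:
    """FMA lanes available per bilinear texel filtered each cycle."""
    return Fraction(fma_lanes, bilinear_texels_per_cycle)


# G77: one bilinear quad per cycle; G76: half of that, per the article.
g77_texels = QUAD_TEXELS
g76_texels = QUAD_TEXELS // 2

g76_ratio = alu_tex_ratio(24, g76_texels)  # three 8-wide engines
g77_ratio = alu_tex_ratio(32, g77_texels)  # two 16-wide clusters
relative_change = g77_ratio / g76_ratio

if __name__ == "__main__":
    print("G76 ALU:Tex =", g76_ratio)
    print("G77 ALU:Tex =", g77_ratio)
    print("G77 relative to G76 =", relative_change)
```

Under these assumptions the ratio drops from 12:1 to 8:1, i.e. to two thirds of its previous value, which matches a 33% ALU increase combined with doubled texture throughput.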
The shader core still draws only two pixels per clock, so the texel:pixel ratio is now 2:1, whereas it used to be 1:1.

Another redesigned block in the shader core is the new load/store cache (LSC). Functionally it is the same as before, but it has been redesigned with the additional throughput in mind. Within the same area, the number of pipeline stages has been halved, which further reduces the latency of core operations. Bandwidth has been widened to a full cacheline, which should be double that of its predecessor. The cache itself is 16KB and 4-way set associative, which is said to be useful for machine learning workloads.

Putting the pieces together

Zooming out to the GPU level and putting all the shader cores together, we again see Arm's modular approach to organizing the design. The architecture scales from 1 to 32 shader cores, although the G77 microarchitecture currently supports only up to 16. In addition, the smallest RTL configuration Arm is currently preparing has 7 cores; customers who need smaller configurations are better served by different IP (e.g. the G52, or future IP in the same range that has not launched yet).

The L2 cache still consists of four slices, each ranging from 256KB to 1MB. Most vendors currently use a 2MB configuration, and I believe no licensee has implemented a 4MB configuration yet. L2-to-LSC bandwidth has doubled from 32B/cycle to 64B/cycle (a full cacheline), while external bandwidth depends on whether the vendor implements a 128-bit or 256-bit AXI interface for each L2 slice.

Performance targets: 30% PPC and efficiency

On paper, the new Valhall architecture and the new Mali-G77 look like a big change, but what matters more is what Arm claims for performance, efficiency and area.

Arm's performance claims are interesting because they are stated as performance per mm².
Depending on how vendors implement the GPU, they can vary the number of cores and the frequency, which makes it hard to give one clear number for the improvement between two independent GPU configurations.

For the G77, Arm claims the new IP delivers 1.2 to 1.4 times the performance per mm² of the G76. In absolute terms, the G77 shader core is said to be about the same size as a G76 core. This means the gain can be turned directly into a smaller GPU, or into extra GPU cores in the same area for higher performance. Arm claims the G77 does particularly well in texture-heavy games, so it will be interesting to see how actual devices perform across different workloads.

Another way to raise performance is to raise the GPU clock. The fundamental limit here is the 4 to 5W TDP of smartphones. At ISO process and performance, the new G77 is said to complete the same workload using 17% to 29% less energy and power. In other words, performance per watt improves by 1.2 to 1.39 times. Arm indicates that typical frequencies will not change much between the G76 and G77, and Arm's internal target remains 850MHz.

I will not attempt any performance or efficiency predictions because there are too many variables in play. One of the bigger changes I expect next year is SoC support for LPDDR5, which may significantly change the power dynamics of smartphones.

Arm did say it expects next year's G77 SoCs to be 1.4 times faster. Taking Samsung's Exynos 9820 as the G76 reference, this means future G77 SoCs would come close to the GPU performance of Apple's A12 with higher power efficiency (assuming sustained power levels).
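The link between the energy figures and the performance-per-watt figures quoted above can be checked with a short Python sketch. It reads the 17% to 29% figure as an energy saving at equal performance and assumes the simple relationship that finishing the same work with a fraction (1 - s) of the energy corresponds to 1 / (1 - s) times the performance per watt. Both function names are hypothetical.

```python
def perf_per_watt_gain(energy_saving: float) -> float:
    """Perf/W multiplier implied by an energy saving at equal performance.

    energy_saving is a fraction, e.g. 0.17 for "17% less energy".
    """
    if not 0.0 <= energy_saving < 1.0:
        raise ValueError("energy_saving must be in [0, 1)")
    return 1.0 / (1.0 - energy_saving)


def energy_saving_from_gain(gain: float) -> float:
    """Inverse mapping: energy saving implied by a perf/W multiplier."""
    if gain <= 0.0:
        raise ValueError("gain must be positive")
    return 1.0 - 1.0 / gain


if __name__ == "__main__":
    for saving in (0.17, 0.29):
        print(f"{saving:.0%} less energy -> "
              f"{perf_per_watt_gain(saving):.2f}x perf/W")
    for gain in (1.2, 1.39):
        print(f"{gain}x perf/W -> "
              f"{energy_saving_from_gain(gain):.1%} less energy")
```

Under this model, savings of 17% and 29% map to roughly 1.20x and 1.41x performance per watt, close to the quoted 1.2x to 1.39x range. The small difference at the upper end presumably comes from rounding in the published figures.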
A G77 SoC at that level would put Qualcomm in a difficult position, because it would clearly surpass the current-generation Adreno 640, but we expect Qualcomm to launch a new generation of GPUs as well.

Machine learning performance is where the G77 gains the most. This is not only because the number of FMA units per core has increased by 33%, but also because of the large improvements to the design and bandwidth of the LSC, which together raise the G77's inference performance by a factor of 1.6.

Finally, Arm compared the last few generations of Mali GPUs. At the same process and the same performance, the new G77 continues the trend of a 30% year-on-year improvement in energy efficiency and uses 50% less energy than the Mali-G72.

Conclusion

At its TechDay event, Arm was clearly very excited about the new Valhall architecture and the new Mali-G77. The excitement is justified, because Arm appears to have delivered a major generational leap for its Mali GPU IP.

The new G77 uses a new architecture that fundamentally reworks Arm's compute core into something more modern and more in line with desktop GPU architectures.

This shift has been a long time coming: the G76 is a relatively good GPU, but its predecessors, the G72 and G71, were not.

I look forward to seeing the Mali-G77 in the next generation of Samsung Exynos and HiSilicon Kirin SoCs this year. In theory, if all goes well, the G77 should narrow the performance and efficiency gap to Apple and Qualcomm. In particular, the G77 should be able to surpass Qualcomm's Adreno GPUs, at least the current generation. I am very optimistic about this. Now Arm and its licensees just need to execute properly so that users can enjoy the final result.