NVIDIA Turing architecture analysis: rasterization rendering is right

"From the NVIDIA's Turing Architecture has passed a month, the release of the GeForce RTX 20 series and the launch of real-time light tracking technology, let NVIDIA will use" GeForce GTX "to rename" GeForce RTX "and completely changed Game graphics card. Real-time ray tracking, RT Core, Tensor core, AI function (ie DLSS), light tracking API, all of which are collected together, indicate new directions for game development and Geforce graphics card. The product launched in the past is very different, NVIDIA has divided the contents of its latest graphics card into two parts: architecture and performance. Recently, NVIDIA finally unveiled the veil of the new Turing architecture detail. Although some interesting aspects have not yet been explained, there are some links to study with objective data, but also let us have the opportunity to understand the GEFORCE RTX crown. Technique: Light Tracking. While using Turing real-time light tracking (DXR) API, NVIDIA's Optix Engine or Unpublished Vulkan Light Tracking Extensions with DirectX, and DXR for the game has not been released to end users, but in view of NVIDIA traditionally Powerful ecosystems with developers and middleware (such as Gameworks), they want to use high-end games to stimulate consumers support for mixed rendering (rasterization + ray tracking). As mentioned earlier, NVIDIA is working hard by mixing rendering to promote the transformation of the deserted bone. Behind NVIDIA takes the reasons for this step, except that "the real-time ray tracking is the holy cup of computer graphics", there are many other potential motives that transcend graphic pureism. Light tracking the first lesson: What & why Since NVIDIA is used for RT Core, RT Core is one of the two technology cornerstones of the Turing architecture, so it is best to discuss what is light tracking before we know the Turing architecture, and why NVIDIA will put in so much. Chip resources. 
Briefly, ray tracing is a rendering method that simulates how light behaves in the real world (reflection, refraction and so on). The biggest obstacle is its almost absurd performance cost: a naive approach that tries to compute every ray emitted by every light source in the scene ends up tracing an endless number of rays.

Over the years, algorithm engineers have developed many optimizations for ray tracing. The most important one reverses the simple concept of "light": instead of tracing rays from the light source, rays are traced backwards from the screen, from the viewer's point of view, so only the rays that actually reach the screen are computed, which greatly reduces the amount of work.

However, even with this and many other optimizations, the performance demand remains high, and anything beyond the most basic, crude ray tracing is still out of reach for real-time rendering. These optimizations merely allow ray tracing to finish in a relatively "reasonable" time on a computer. Of course, "reasonable" here is measured in hours or days, depending on the complexity of the scene and the rendering quality you expect. In fact, until now ray tracing has mainly been used in "offline" scenarios such as 3D animated films.

The rights and wrongs of rasterization rendering

The cost of ray tracing means that it still cannot be used for real-time image rendering, so the computer industry has used a rendering method called rasterization from the very beginning. Although the Chinese term for rasterization contains the character for "light", rasterization rendering has no concept of light rays at all. Rasterization refers to the process of converting 3D geometry into 2D pixels, and all of the image processing is done purely on pixels. When the game starts rendering a frame, it first generates the vertices of all objects in the game scene and then sends the coordinates of all these vertices to the geometry units inside the GPU.
The geometry units construct a visible space based on the screen position and place these vertices into that space. The vertices are then connected by lines to build the outlines of the objects, and the surfaces are covered with a base texture carrying lighting information as a skin. At this point the game image begins to take shape.

Next comes the core of the whole rasterization rendering process: rasterization itself. The rasterizer in the GPU flattens the entire visible space from three dimensions onto a two-dimensional plane according to the rules of linear perspective. The stream processors then determine which pixels are bright and how bright, which pixels are dark and how dark, and which pixels are highlights. While the stream processors are busy computing this pixel information, the texture units in the GPU begin cutting the preset "whole" texture material into the shapes needed on screen. Finally, the stream processors and texture units submit the computed pixel information and the cropped textures to the ROPs at the back end of the GPU, which blend the two into the final image and output it. Post-processing effects such as depth of field, motion blur and anti-aliasing are also completed by the ROPs.

Seeing this, we should understand that every frame we see is a 3D illusion painting drawn by the GPU. How real such a painting looks depends on the skill of the painter; how real a rasterized image looks depends on how advanced and complete the rendering algorithms are.

Hybrid rendering and ray tracing

The simple and fast nature of rasterization limits how well it can simulate the real world, which leads to its familiar defects: unnatural lighting, reflections and shadows. If rasterization is this inaccurate, how can games further improve their image quality?
Of course, one can keep going down this road. Nothing is impossible for rasterization, but the computing performance required balloons rapidly. Like covering one lie with ten more, in some cases generating a realistic image with rasterization is even more complicated than the natural process of ray tracing.

In other words, rather than spending so much performance on the visual deception of rasterization, why not put that effort into introducing another technique that can render the virtual world accurately?

In 2018, the entire computer industry is thinking about this question. For NVIDIA, the road ahead is no longer pure rasterization but hybrid rendering: combining rasterization with ray tracing. The idea is to use ray tracing where it makes sense, for lighting, shadows and everything else that involves the interaction of light, and to use traditional rasterization for everything else. This is also the core idea of the Turing architecture.

This means developers can balance the high performance of rasterization against the quality of ray tracing as needed, without having to jump immediately from rasterization to ray tracing and lose all of the performance advantages. So far, the use cases shown by NVIDIA and its partners are the easy-to-implement ones, such as accurate real-time reflections and better global illumination, but hybrid rendering can clearly be extended to any light-related operation.

However, NVIDIA, Microsoft and other companies have to build an ecosystem from scratch: they must not only promote the advantages of ray tracing but also teach developers how to implement it effectively. For now, though, we can discuss ray tracing itself and see how NVIDIA turns real-time ray tracing into reality.

Bounding volume hierarchy

It is fair to say that NVIDIA has placed a big bet on Turing. Traditional GPU architectures can process rasterized rendering at high speed but are not good at ray tracing.
Therefore, NVIDIA must add dedicated hardware units for ray tracing, and these extra transistors and the power they consume bring no direct benefit to traditional rasterization rendering.

Much of this dedicated hardware is used to solve the most basic problem in ray tracing: determining where a ray intersects an object. The most common solution is to store the triangles in a data structure well suited to ray tracing, called a BVH (bounding volume hierarchy).

Conceptually, a BVH is relatively simple. Instead of testing every polygon to determine whether it intersects the ray, a portion of the scene is tested to see whether it intersects the ray. If that portion is hit, it is subdivided into smaller portions that are tested again, and so on until individual polygons are reached and the ray intersection is resolved.

To computer scientists this sounds like an application of binary search, and indeed it is. Each test rules out a large number of candidates (polygons, in the case of ray tracing) as possible answers, so the correct polygon can be reached in a short time. A BVH is in turn naturally stored as a tree data structure, with each subdivision (bounding box) stored as a child node of its parent bounding box.

The problem with a BVH is that although it fundamentally reduces the number of intersection tests needed, that saving applies to a single ray. When each pixel requires multiple rays, and each ray requires many tests, the computation is still far from cheap. This is why hardware acceleration with dedicated ray tracing units is so important.

Turing inherits from Volta

Now let's look at the Turing architecture itself. The new Turing SM looks very different from the previous-generation Pascal SM, but anyone familiar with the Volta architecture will notice that the Turing SM is very similar to the Volta SM.
Like Volta, the Turing SM is divided into four sub-cores (or processing blocks), each with a single warp scheduler and dispatch unit, in contrast to Pascal's two-partition layout, in which each sub-core's warp scheduler had two dispatch ports.

Broadly speaking, this change means Volta and Turing lose the ability to issue a second, non-dependent instruction from a thread in a single clock cycle. Turing presumably executes instructions over two cycles, as Volta does, while the scheduler can issue an independent instruction every cycle, so Turing can ultimately maintain two-way instruction-level parallelism (ILP) while still having twice as many schedulers as Pascal.

As we saw with Volta, these changes are closely tied to the new scheduling/execution model, and Turing also has the independent thread scheduling model. Unlike Pascal, Volta and Turing keep scheduling resources for each thread: a program counter and a stack per thread to track thread state, plus a convergence optimizer that intelligently groups threads of the same warp into SIMT units.

As for CUDA cores and ALUs (arithmetic logic units), each Turing sub-core has 16 INT32 units, 16 FP32 units and two Tensor Cores, the same as a Volta sub-core. With a split INT/FP datapath like Volta's, Turing can also execute FP and INT instructions concurrently, which becomes more relevant once the RT Core is involved. Where Turing differs from Volta is that it lacks Volta's full complement of FP64 units, so its FP64 throughput is only 1/32 of its FP32 throughput.

These details may be on the technical side, but Volta's design seems to maximize Tensor Core performance while minimizing disruption to parallelism or to coordination with other compute workloads. The same is likely true for Turing's second-generation Tensor Cores and RT Cores, where four independently scheduled sub-cores and fine-grained thread handling are very useful for getting maximum performance out of mixed gaming workloads.
In terms of memory, each Turing sub-core has an L0 instruction cache similar to Volta's and a 64 KB register file of the same size. In Volta this was important for reducing Tensor Core latency, and in Turing it may benefit the RT Cores in the same way. The Turing SM has 4 load/store units, down from 8 in Volta, but still keeps 4 texture units.

The new L1 data cache and shared memory (SMEM) go further: they have been reworked and unified into a single partitionable memory block, another Volta innovation. For Turing this appears to be a combined 96 KB of L1/SMEM. For traditional graphics workloads it is split into 64 KB of dedicated graphics shader RAM and 32 KB of texture cache and register file spill area. Compute workloads, meanwhile, can partition the L1/SMEM as up to 64 KB of L1 with the remaining 32 KB as SMEM, or vice versa (Volta's SMEM can be configured up to 96 KB).

RT Core: hybrid rendering and real-time ray tracing

On Turing, ray tracing does not completely replace traditional rasterization rendering. It exists as part of "hybrid rendering", and "real time" is only achievable by casting a small number of rays per pixel and supplementing them with denoising. For performance reasons, developers at this stage will use ray tracing deliberately and selectively for effects that rasterization can only fake, such as global illumination, ambient occlusion, shadows, reflections and refraction. Ray tracing can also be restricted to specific objects in the scene, and rasterization with a z-buffer can take the place of primary ray casting, so that only secondary rays are ray traced.

Given the importance of ray tracing in computer graphics, NVIDIA Research has long been studying various BVH implementations and exploring architectures for ray tracing acceleration. However, NVIDIA has not disclosed many details about the RT Core or its BVH implementation.
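Since the hardware details are undisclosed, the following is only a generic software sketch of the kind of work involved in BVH traversal, not NVIDIA's implementation. It builds a simple tree by splitting triangles in half along the widest axis, tests bounding boxes with a slab test, and tests triangles with the Moller-Trumbore method. All names (`Node`, `build`, `hit_box`, `hit_triangle`, `trace`) and the toy scene are assumptions made for illustration.

```python
from dataclasses import dataclass, field

def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

@dataclass
class Node:
    lo: tuple                                        # minimum corner of the bounding box
    hi: tuple                                        # maximum corner of the bounding box
    children: list = field(default_factory=list)     # child boxes (inner nodes)
    triangles: list = field(default_factory=list)    # triangles (leaf nodes only)

def build(tris, leaf_size=2):
    """Recursively split the triangles in half along the widest axis."""
    pts = [p for tri in tris for p in tri]
    lo = tuple(min(p[i] for p in pts) for i in range(3))
    hi = tuple(max(p[i] for p in pts) for i in range(3))
    if len(tris) <= leaf_size:
        return Node(lo, hi, triangles=list(tris))
    axis = max(range(3), key=lambda i: hi[i] - lo[i])
    ordered = sorted(tris, key=lambda tri: sum(p[axis] for p in tri))
    mid = len(ordered) // 2
    return Node(lo, hi, children=[build(ordered[:mid], leaf_size),
                                  build(ordered[mid:], leaf_size)])

def hit_box(origin, direction, lo, hi):
    """Slab test: does the ray (t >= 0) pass through the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for i in range(3):
        if abs(direction[i]) < 1e-12:                # ray parallel to this slab
            if origin[i] < lo[i] or origin[i] > hi[i]:
                return False
            continue
        t0 = (lo[i] - origin[i]) / direction[i]
        t1 = (hi[i] - origin[i]) / direction[i]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def hit_triangle(origin, direction, tri):
    """Moller-Trumbore test: return the hit distance t, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < 1e-12:                             # ray parallel to the triangle
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > 1e-9 else None

def trace(root, origin, direction):
    """Return (closest hit distance or None, box tests, triangle tests)."""
    closest, box_tests, tri_tests = None, 0, 0
    stack = [root]
    while stack:
        node = stack.pop()
        box_tests += 1
        if not hit_box(origin, direction, node.lo, node.hi):
            continue                                 # skip this whole subtree
        for tri in node.triangles:
            tri_tests += 1
            t = hit_triangle(origin, direction, tri)
            if t is not None and (closest is None or t < closest):
                closest = t
        stack.extend(node.children)
    return closest, box_tests, tri_tests

# Toy scene (hypothetical): 64 triangles in a row on the plane z = 5.
scene = [((float(i), 0.0, 5.0), (i + 1.0, 0.0, 5.0), (float(i), 1.0, 5.0)) for i in range(64)]
root = build(scene)
distance, box_tests, tri_tests = trace(root, (10.25, 0.25, 0.0), (0.0, 0.0, 1.0))
print(distance, box_tests, tri_tests)
```

Even in this toy scene the ray is resolved with far fewer tests than the 64 triangles a brute-force loop would check, yet every ray still walks the tree and runs a series of box tests. That per-ray work is what a fixed-function unit is meant to take off the shaders.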
The RT Core differs from the Tensor Core: a Tensor Core is more like an FMA array that sits alongside the FP and INT cores, whereas the RT Core is more like a typical offload IP block. Much as with the texture units in a sub-core, instructions destined for the RT Core are routed out of the sub-core. After receiving a ray probe from the SM, the RT Core proceeds to traverse the BVH and perform ray intersection tests.

This kind of "traversal and intersection" fixed-function ray tracing accelerator is a well-known concept that has seen many implementations over the years, because traversal and intersection testing are the two most computationally intensive tasks. By contrast, traversing the BVH in shaders would take thousands of instruction slots per ray cast, all spent testing bounding box intersections in the BVH.

The RT Core also handles some memory operations to maximize memory throughput across multiple rays. As with many other workloads, memory bandwidth is a common bottleneck in ray tracing and is also a focus of NVIDIA's research. Given that ray tracing produces very irregular and random memory accesses, there may be some memory and ray buffers inside the SIP block.

Tensor Cores: bringing deep learning inference to game rendering

Although Tensor Cores are the signature feature of Volta, Turing is equipped with a second generation of them. The main change in the second-generation Tensor Core is the addition of INT8 and INT4 precision modes for inference, enabled by new hardware datapaths, with dot products accumulated in INT32. INT8 mode runs at twice the FP16 rate, or 2048 integer operations per clock; INT4 mode runs at four times the FP16 rate, or 4096 integer operations per clock. The second-generation Tensor Core still has an FP16 mode and can also support a pure FP16 mode without an FP32 accumulator.
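The quoted rates can be cross-checked with a few lines of arithmetic. The sketch below derives the implied FP16 rate from the statement that INT8 runs at twice the FP16 rate, and compares how many clocks a workload would take in each mode. The workload size and the helper `clocks_needed` are hypothetical and chosen purely for illustration.

```python
# Rates quoted in the text (operations per clock).
INT8_OPS_PER_CLOCK = 2048
INT4_OPS_PER_CLOCK = 4096

# Implied by "INT8 is twice the FP16 rate"; not quoted directly in the text.
FP16_OPS_PER_CLOCK = INT8_OPS_PER_CLOCK // 2

def clocks_needed(total_ops, ops_per_clock):
    """Clocks needed to finish total_ops at a given rate (ceiling division)."""
    return -(-total_ops // ops_per_clock)

workload_ops = 1_000_000  # hypothetical inference workload

for name, rate in [("FP16", FP16_OPS_PER_CLOCK),
                   ("INT8", INT8_OPS_PER_CLOCK),
                   ("INT4", INT4_OPS_PER_CLOCK)]:
    speedup = rate / FP16_OPS_PER_CLOCK
    print(f"{name}: {rate} ops/clock, {speedup:.0f}x FP16, "
          f"{clocks_needed(workload_ops, rate)} clocks")
```

The two quoted figures are consistent with each other: an implied FP16 rate of 1024 operations per clock gives exactly the stated 2x and 4x ratios.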
Although CUDA 10 has not yet been released, its enhanced WMMA operations should clarify any other differences, such as additional supported matrix sizes for the operands.

With GeForce RTX and Turing, the new branding is not just RTX but also the NVIDIA RTX platform, which brings together all of Turing's features, including:

NVIDIA RTX platform: the general platform that contains all Turing features, including advanced shaders
NVIDIA RTX ray tracing technology: the name of the ray tracing technology under the RTX platform
GameWorks Raytracing: the ray tracing denoising module in the GameWorks SDK
GeForce RTX: the brand associated with games that use NVIDIA RTX real-time ray tracing
GeForce RTX: the graphics card brand

NGX technically belongs to the RTX platform, and its most representative feature is DLSS (Deep Learning Super Sampling). DLSS uses a DNN (deep neural network) designed for games, trained on ultra-high-quality 64x supersampled images or real frames, and then uses the Tensor Cores to infer high-quality anti-aliased results. In standard mode, DLSS infers a high-quality anti-aliased result from a lower number of input samples, achieving an effect similar to TAA at the target resolution.

With deep learning, NVIDIA is pushing purely compute and professional features into the consumer space. On Turing, Tensor Cores can accelerate DLSS and other features, as well as some AI-based denoising that cleans up and corrects real-time ray traced images.

Summary

The Turing architecture and the release of GeForce RTX mark the point at which computer graphics in the consumer market begins to move away from visual deception. Until now, the industry has unfortunately always had to rely on it.
The Turing architecture has added the dedicated ray tracing unit, the RT Core, assisted by the Tensor Cores for AI denoising, but some cold, objective thinking is still in order. According to Lei Feng Wang (Leiphone), the entry threshold for basic usability at 1080p resolution is 100 million rays per frame. At 60 fps, the GPU therefore needs to compute at least 6 billion rays per second. Looking back at the three graphics cards just released, their ray tracing performance is 10 billion, 8 billion and 6 billion rays per second (for the RTX 2080 Ti, RTX 2080 and RTX 2070, respectively), and NVIDIA seems to have indicated that the lower-end GeForce RTX/GTX 2060 and the cards below it will not support ray tracing. Whether or not this is a coincidence, the ray tracing performance of the GeForce RTX 2070 sits exactly on the basic usability threshold described above. Seen this way, it is reasonable that lower-end graphics cards do not support ray tracing.

In addition, current ray tracing algorithms may lean too far toward simplification, so errors are still possible when reproducing the relationship between light and shadow. For example, when NVIDIA demonstrated the RTX effects in Battlefield V, the car showed an incorrect reflection of the fire: the area marked by the red box is at the rear of the car and, given the viewing angle, should not reflect the fire.

Original address: https://www.eeboard.com/news/nvidia-turing-2/ "
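The ray budget argument in the summary comes down to simple arithmetic, which the following sketch works through. The 100 million rays per frame threshold, the 60 fps target and the three throughput figures are the ones quoted in the text. The per-pixel figure is derived here on the assumption that 1080p means 1920 x 1080 pixels, and the variable names are illustrative only.

```python
WIDTH, HEIGHT = 1920, 1080            # assumed pixel count for "1080p"
RAYS_PER_FRAME = 100_000_000          # basic-usability threshold quoted in the text
TARGET_FPS = 60

# Rays per second needed to sustain the threshold at the target frame rate.
required_rays_per_second = RAYS_PER_FRAME * TARGET_FPS

# Average number of rays this budget allows for each pixel of a frame.
rays_per_pixel = RAYS_PER_FRAME / (WIDTH * HEIGHT)

# Quoted ray tracing throughput of the three launch cards (rays per second).
card_rates = [10_000_000_000, 8_000_000_000, 6_000_000_000]

# Frame rate each card could sustain at exactly the threshold budget.
sustainable_fps = {rate: rate // RAYS_PER_FRAME for rate in card_rates}

print(f"required: {required_rays_per_second:,} rays/s")
print(f"budget:   {rays_per_pixel:.1f} rays per pixel per frame")
for rate, fps in sustainable_fps.items():
    print(f"{rate:,} rays/s -> {fps} fps at {RAYS_PER_FRAME:,} rays/frame")
```

Under these assumptions the 6 billion rays per second card lands exactly on 60 fps, which is the coincidence the text points out, and the threshold works out to roughly 48 rays per pixel per frame.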