Vectorization of an H.264 Decoder on the Godson 3B

Keywords: Godson 3B, decoder, vectorization

Today's society has entered the information age, and traditional information carriers and modes of communication can no longer meet people's demand for information. Experiments show that, compared with voice and abstract data, humans receive more of their information through pictures and video. Video is intuitive, specific and efficient, which makes video communication one of the key technologies of the information age. Because video data is voluminous and the resources for storing it are usually limited, video must be compressed and encoded to reduce storage consumption. In general, however, the more complex the compression algorithm, the higher the compression ratio and the lower the decoding speed during playback. Therefore, while improving the compression ratio, the decoder must also be optimized to improve its performance on the target platform. This paper ports the FFmpeg decoder to the Godson 3B and vectorizes it, improving the decoder's performance on that platform.

1 Video codecs and the Godson 3B

1.1 Video encoding and decoding

Many mature compression coding and decoding methods exist today. Among them, H.261, MPEG-1, MPEG-2 and H.263 adopt first-generation compression coding methods, such as predictive coding, transform coding, entropy coding and motion compensation; MPEG-4 and H.264 adopt second-generation methods, such as segmented coding and model-based or object-based coding. The main purpose of video compression coding is to reduce the resources needed to store video, while the goal of decoding technology is to raise the decoding speed and so improve the smoothness of playback. Common software decoders for H.264 include CoreAVC, FFmpeg and JM.
JM is the reference codec provided on the official H.264 website. It integrates a wide range of codec algorithms and its code is clearly structured, which makes it well suited to research on video codec technology. CoreAVC is mainly used in commercial applications, and its decoding rate is more than 50% faster than FFmpeg's. FFmpeg is an open-source decoder with relatively good performance; many open-source projects, such as the MPlayer player, use it directly or indirectly. Weighing performance against openness, this paper selects FFmpeg as the target for porting and vectorization.

1.2 Godson 3B architecture

The Godson 3B processor is compatible with the MIPS64 instruction set and implements vector extension instructions for multimedia applications, which helps considerably with the performance of video encoding and decoding. It provides 256-bit vector registers and vector extension instructions, including 256-bit vector memory accesses, so a single vector instruction can operate on 32 bytes of data. With this structure and instruction set, the Godson 3B is well suited to applying the same operation to large amounts of data of the same type, as in matrix multiplication, FFT, and video encoding and decoding. However, because FFmpeg does not support the Godson 3B platform, it must first be ported. Earlier work has ported FFmpeg to other platforms, and has ported and optimized it for Godson platforms, with good results.

2 Porting FFmpeg to the Godson 3B

2.1 Porting FFmpeg

The FFmpeg decoder supports a number of target platforms, and the files specific to each platform are kept in a directory named after it. For example, FFmpeg includes support for the ARM, SPARC and x86 platforms.
Supporting the Godson 3B in the FFmpeg decoder involves five main steps:

(1) Modify the configuration file and add configuration options for the Godson architecture;
(2) Create a Godson-specific folder and store all Godson architecture files in it;
(3) Add the new source files in the Godson folder to the makefile;
(4) Add a new initialization function, dsputil_init_godson, similar to dsputil_init;
(5) Add the declaration of the new function to the header file.

Porting FFmpeg to the Godson 3B is relatively simple, so this paper focuses on vectorization for the Godson 3B.

2.2 Performance of the ported FFmpeg

This section tests the ported FFmpeg decoder and compares its performance with and without the Godson 3B vector extension instructions. The tests use a GCC compiler that supports the Godson 3B extension instruction set, with the -ftree-vectorize and -march=godson3b options enabled. The test case is the video "walk_vag_640x480_qp26.264", and the results are shown in Table 1.

Table 1 shows that using the vector extension instructions improves the performance of the FFmpeg decoder on the Godson 3B: the decoding time for the test video is reduced by about 466 s. Nevertheless, because GCC's automatic vectorization ability is limited, the improvement is modest. Vectorizing the ported decoder by hand for the Godson 3B instruction set is therefore an important step toward further performance gains.

3 Vectorization of FFmpeg

3.1 OProfile test of FFmpeg

OProfile was used to profile FFmpeg while it decoded the video "002.mkv"; the results are shown in Table 2.
Table 2 lists the call path of each function and its share of the running time. Vectorization of the FFmpeg decoder targets the functions that the OProfile results show to have long execution times and large shares of the total. Together, the functions in Table 2 account for almost 60% of the decoder's execution time, so vectorizing them can substantially improve the overall decoding speed of FFmpeg.

3.2 FFmpeg vectorization for the Godson 3B

3.2.1 Vectorization method

To vectorize the FFmpeg decoder on the Godson 3B, we mainly use the Godson 3B vector extension instructions to rewrite the functions that take a large share of the execution time in the OProfile results of Section 3.1. In addition, several optimization strategies are combined with vectorization to improve the performance of the vectorized functions. The main ones are:

(1) Loop unrolling. Loop unrolling is a loop transformation that replicates the instructions of the loop body several times, increasing the amount of code in the body and reducing the number of iterations. Note that loop unrolling does not by itself directly improve performance; its main purpose is to expose the parallelism among instructions or data. Vector extension instructions exploit the data parallelism in the unrolled loop, while instruction scheduling and software pipelining exploit the parallelism between its instructions.

(2) Instruction scheduling.
After unrolling, the loop body contains more instructions, so instructions with no operand dependences and no functional-unit conflicts can be scheduled together. This makes full use of the Godson 3B pipeline and speeds up the code on the Godson 3B.

Besides the vector extension instructions and the two methods above, other optimizations can be applied according to the characteristics of a particular function, such as replacing multiplications with logic and shift operations. The vectorization of each function is described in Section 3.2.2.

3.2.2 Vectorization of specific functions

Section 3.2.1 summarized the optimization methods used during vectorization. This section turns to the functions with the largest shares in the OProfile results. The functions in Table 2 can be grouped by name, since functions with similar names can generally be optimized in similar ways.

(1) Simple vectorization. For functions 1 and 2, this paper replaces multiplications with shifts and uses saturating vector operations to handle the bounded arithmetic inside the loop. For the memory accesses in function 2, however, the data is misaligned, so additional vector instructions are needed to pack the data and write it back. Function 3 is a combination of functions 1 and 2, so optimizing those two indirectly improves the performance of function 3. For functions 4, 5 and 6, only loop unrolling and instruction scheduling are applied to the inner loop, which already gives good results. Functions 11 and 12 can likewise be vectorized in a straightforward way and are not described in detail here.

(2) Indirect vectorization.
Functions 7 and 8 are difficult to vectorize directly, so masks and matrix transposition are used to vectorize them indirectly. The C implementation of the function h264_v_loop_filter_luma contains many conditional statements; this paper eliminates them by constructing masks. The loop in Figure 1(a) serves as an example. Figure 1(b) shows the vector instructions that replace the loop, and Figure 1(c) shows the result of each operation: a saturating subtraction is applied to the P and Q vectors (arrays), with negative results set to 0, giving the vector Vsub. Vsub is then compared with the zero vector to set the mask: where the comparison is true the mask value is 0xff, and where it is false the mask value is 0. Finally, the result of the loop is obtained by adding the mask to the P vector. Constructing masks in this way not only removes the time overhead of the conditionals but also vectorizes the loop indirectly, improving the function's performance. The same method can be applied to functions 9 and 10.

Function 8 cannot be vectorized directly because its operations work on contiguous data. After the data is repacked with a matrix transpose, the corresponding vector operations can be applied. In Figure 2(a), the original calculation operates among the elements within the P vector, so it cannot be vectorized. Vector instructions are used to transpose P into Q, where Q0 holds the elements labeled 1 in P, Q1 holds the elements labeled 2, and so on. The transposed Q vectors can then be processed by the vector instructions in Figure 2(b), and the result is the same as that of the original operation. The same transpose method is used to optimize functions 13 to 15.
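The mask and transpose ideas above can be sketched in plain C, with each loop iteration standing in for one byte lane of a vector instruction. This is only an illustrative sketch: Figures 1 and 2 are not reproduced here, so the loop bodies (a conditional decrement and a 1-2-2-1 weighted row sum), the function names and the 4-lane width are all assumptions made for the example, and no Godson 3B instructions are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 }; /* illustrative width; a Godson 3B vector holds 32 bytes */

/* Branchy reference loop (hypothetical stand-in for Figure 1(a)). */
static void dec_if_greater_branchy(uint8_t *p, const uint8_t *q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] > q[i])
            p[i] = (uint8_t)(p[i] - 1);
}

/* Mask version: each statement corresponds to one lane-wise vector operation. */
static void dec_if_greater_masked(uint8_t *p, const uint8_t *q, size_t n)
{
    assert(p != NULL && q != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Saturating subtract: a negative difference becomes 0. */
        uint8_t vsub = (uint8_t)((p[i] - q[i]) * (p[i] > q[i]));
        /* Compare with zero: 0xFF where vsub is non-zero, 0x00 elsewhere. */
        uint8_t mask = (uint8_t)-(vsub != 0);
        /* Adding 0xFF decrements modulo 256; adding 0x00 changes nothing. */
        p[i] = (uint8_t)(p[i] + mask);
    }
}

/* Row-internal reference (hypothetical stand-in for Figure 2(a)):
 * every output mixes neighbouring elements of one row.
 * p is a LANES x LANES block stored row by row. */
static void row_op_scalar(const uint8_t *p, uint16_t *out)
{
    for (int r = 0; r < LANES; r++) {
        const uint8_t *row = p + r * LANES;
        out[r] = (uint16_t)(row[0] + 2 * row[1] + 2 * row[2] + row[3]);
    }
}

/* Transposed version: q[k] gathers element k of every row, so the same
 * arithmetic runs lane by lane across rows instead of inside one row. */
static void row_op_transposed(const uint8_t *p, uint16_t *out)
{
    uint8_t q[LANES][LANES];
    for (int r = 0; r < LANES; r++)
        for (int k = 0; k < LANES; k++)
            q[k][r] = p[r * LANES + k];
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = (uint16_t)(q[0][lane] + 2 * q[1][lane]
                               + 2 * q[2][lane] + q[3][lane]);
}
```

Each pair of functions produces identical output for the same input. On a vector unit, the statements inside the lane loops would each become a single instruction acting on all lanes at once, which is where the gain over the branchy and row-internal versions would come from.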
The test results in Section 4.1 show the effect of the optimization on each function.

4 Experimental results

4.1 Speedup of FFmpeg functions

Each vectorized function was tested and compared with its original version. The speedups are shown in Figure 3, where the function numbers on the horizontal axis correspond one-to-one to the functions in Table 2. The speedups span a wide range: function 6 reaches about 23.9, while the last function reaches only about 1.2. This depends not only on the number of vector instructions used and the proportion of code modified, but also on the type of the operands. In function 6, the operands in the loop are bytes, so vector instructions alone could in theory give a speedup of 32. However, this paper vectorizes only the inner loop of the function, and the vectorized inner loop processes just 16 bytes at a time, so the 256-bit vector registers are not fully used. The theoretical speedup is therefore 16, but combined with other optimizations such as loop unrolling and instruction scheduling, the actual speedup reaches about 23.9. Similarly, among functions 4, 5 and 6, which are of the same type, each function's speedup is about twice that of the one before it. This is because the vectorized inner loop of function 4 processes 4 bytes at a time, while that of function 5 processes 8 bytes at a time.
The theoretical speedups should therefore form a geometric progression with ratio 2, and the measured results agree with this analysis. Functions 7 and 8, discussed in Section 3.2.2, cannot be vectorized in a simple way; masks and matrix transposition were used to let them use the Godson 3B vector extension instructions. Although their gains are smaller, they still reach speedups of 3.2 and 5.5 respectively.

4.2 Vectorization on different platforms

This paper also tests the FFmpeg decoder on different platforms. The two test videos are "002.mkv" (video A) and "walk_vag_640x480_qp26.264" (video B). Video A is a clip cut from a 720p video; video B was generated by encoding walk_vag.yuv (480p) with x264 at a QP value of 26. AMD and Intel processor platforms were chosen for comparison. Table 3 shows that for video A the performance improvement on the Godson 3B is much higher than on the other two platforms, and for video B it is close to theirs. These results show that vectorizing the FFmpeg decoder on the Godson 3B improves performance substantially, and that for some videos the improvement is even larger than on higher-performance commercial processors. Compared with the GCC vectorization results in Table 1, manual vectorization of the FFmpeg decoder also performs much better.

5 Summary and prospects

In this paper, the FFmpeg decoder is ported to the Godson 3B and, taking advantage of the Godson 3B's vector extension instructions, manually vectorized.
The experimental results show that the FFmpeg decoder performs better after manual vectorization than after GCC vectorization.