Improvement and optimization of the H.264 algorithm for parallel-processing hardware acceleration

From the perspective of hardware implementation, this article optimizes the prediction part of the H.264 algorithm, which accounts for most of the computation time. Intra prediction, the Hadamard transform, and motion estimation are modified to suit hardware by simplifying complex operations, making modules more efficient, and reducing data dependencies between modules. Simulations on various test sequences show that the modifications are effective.

H.264 was originally drafted by ITU-T and will become a joint standard of ITU-T and MPEG. Because of its high compression efficiency and network-friendly interface, it is expected to become the next-generation video coding standard. However, along with the higher coding efficiency, the complexity of the algorithm has increased fourfold, which greatly limits its implementation. The algorithm therefore has to be improved and optimized for hardware implementation.

The initial test model (JM) of H.264 is designed to achieve high coding performance. It contains many algorithms that require a large amount of computation but add little coding efficiency, and many of them are data dependent, which limits acceleration hardware built on parallel processing.

The complexity of this new video coding standard has been studied before. However, those studies derived the complexity of the H.264 algorithm through software analysis. Their results are accurate for software applications but do not apply to hardware designs that use parallel processing. Testing shows that the key to an H.264 hardware implementation is the prediction part, because this module takes almost 90% of the total computation time. The improvements therefore focus on the prediction part.

1 H.264 algorithm

Fig. 1 is a block diagram of the algorithm. Each frame is predicted with either intra prediction or inter prediction.
If intra prediction is used, the inter prediction part is not evaluated. Inter prediction uses motion estimation with multiple reference frames and variable block sizes. The coding mode selection stage picks the best prediction mode among all available modes. After prediction, the residual block is obtained as the difference between the original input frame and the predicted frame. A 4×4 integer DCT is applied to the luminance residual blocks, and a 2×2 integer transform is applied to the DC coefficients of the chrominance residual blocks. The transformed coefficients are quantized, the quantized coefficients are entropy coded, and the result is the output bitstream. The coding mode is passed through a mode table and is also fed to the entropy coder. The reconstruction loop consists of inverse quantization, the inverse DCT, and deblocking filtering. Finally, the reconstructed frame is written to the frame buffer, ready for use in later motion estimation.

Because spatial prediction and temporal prediction consume almost all of the computation, the improvements to JM 4.0 concentrate on these two parts. Both parts are implemented in hardware, so they are optimized for hardware.

The hardware encoder is macroblock based, that is, it processes consecutive macroblocks one at a time. The whole encoder can be viewed as a macroblock pipeline, so the reconstruction of one macroblock may still be in progress when encoding of the next macroblock begins, which disrupts the pipeline. Many commercial macroblock-based encoders use this hardware implementation model, so this problem is critical.

2 Intra prediction

The encoder block diagram in Figure 1 is similar to those of H.261, H.263, and MPEG-4. H.264 provides two kinds of intra prediction: 4×4 and 16×16.
Intra prediction requires reconstructed pixel values. In a typical macroblock-based system, the reconstructed pixel values are available only after the entire encoding loop has completed. This data dependency makes a hardware implementation very difficult.

2.1 4×4 intra prediction

Figure 2 shows the data dependencies within a 4×4 block. Pixels a to p are predicted from pixels A to N and Q; the reconstructed pixels are denoted by uppercase letters. Since a macroblock consists of sixteen 4×4 blocks, the reconstructed pixels that a block needs are not available until the encoding of the neighboring blocks has completed. JM encodes these blocks with a two-pass algorithm. To predict one 4×4 block, JM has to perform the transform, quantization, inverse quantization, and inverse transform. This is too complicated for hardware and cannot be implemented with existing hardware. The algorithm is therefore modified as follows: every reconstructed pixel value used in prediction is replaced with the original pixel value from the input frame. With this change, 4×4 intra prediction and the transform can be run successfully in the macroblock pipeline.

2.2 16×16 intra prediction

Figure 3 shows the data dependencies of 16×16 intra prediction. The prediction of the current macroblock is based on the 17 pixels to the left of the current macroblock and the 16 pixels above it. Because the reconstruction of the left macroblock may not be finished when the current macroblock is predicted, original pixels are used in place of the pixels on the left side of the current macroblock.

2.3 Coding mode selection

With the modification above, simply replacing reconstructed pixels with original pixels causes errors in mode selection.
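To make the substitution from sections 2.1 and 2.2 concrete, the following is a minimal sketch of 4×4 intra prediction in which the frame supplying the neighboring pixels is a parameter. Passing the original frame corresponds to the modification described above; passing a reconstructed frame corresponds to the JM behavior. The function names, the restriction to three of the 4×4 modes (vertical, horizontal, DC), and the use of plain SAD as the mode cost are assumptions made for illustration, not details taken from the article.

```python
# Sketch only: names and the three-mode subset are illustrative assumptions.
MODES = ("vertical", "horizontal", "dc")


def predict_4x4(neigh, x, y, mode):
    """Predict the 4x4 block at (x, y) from its top and left neighbours.

    `neigh` is the frame (list of rows) that supplies the neighbouring pixels.
    Assumes x >= 1 and y >= 1 so that both neighbours exist.
    """
    top = [neigh[y - 1][x + i] for i in range(4)]
    left = [neigh[y + j][x - 1] for j in range(4)]
    if mode == "vertical":
        return [top[:] for _ in range(4)]
    if mode == "horizontal":
        return [[left[j]] * 4 for j in range(4)]
    if mode == "dc":
        dc = (sum(top) + sum(left) + 4) >> 3  # rounded mean of 8 neighbours
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("unknown mode: " + mode)


def sad_4x4(original, x, y, pred):
    """Sum of absolute differences between the original block and a prediction."""
    return sum(abs(original[y + j][x + i] - pred[j][i])
               for j in range(4) for i in range(4))


def best_mode(original, neigh, x, y):
    """Return (mode, costs) for the cheapest of the sketched modes."""
    costs = {m: sad_4x4(original, x, y, predict_4x4(neigh, x, y, m))
             for m in MODES}
    return min(costs, key=costs.get), costs


# Vertical stripes: the vertical mode should predict the block exactly.
frame = [[10 * x for x in range(8)] for _ in range(8)]
# Modified algorithm: the neighbours come from the original frame itself.
mode, costs = best_mode(frame, frame, 1, 1)
```

Because `neigh` is separate from `original`, the same functions can be called with a reconstructed frame to compare the costs that each variant would feed into mode selection.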
Figure 4 shows the rate-distortion curves for the modified intra prediction; the simulated sequence is "claire" at 10 fps. As Figure 4 shows, the PSNR drop caused by wrong mode selection is obvious. The original pixels come from the same frame, whereas the reconstructed pixels have had redundancy removed by inter or intra coding, so the original pixels are more highly correlated with one another than with the reconstructed pixels. Therefore, the error produced by the modified intra prediction algorithm is much larger than that of the original algorithm.

To reduce the mode selection error, the error cost function also has to be modified. The approach taken here is to add an error term that reflects the difference between the original pixels and the reconstructed pixels. Since the quantization parameter (QP) affects the mismatch between original and reconstructed pixels, the error term depends on the value of QP. In H.264, as QP increases linearly, the effect of quantization on coding grows exponentially. To follow this trend, the basic form of the error term is chosen as A / B^(51 - QP), where A and B are coefficients to be determined.

Choosing A and B is the key to eliminating the error. In H.264, each QP step increases the quantization step by 12%, so in theory B should be set to 1.12. However, the error cost function is evaluated in the Hadamard transform domain, where each coefficient has a different weight. Moreover, the probability distribution of the transformed coefficients is uncertain. The theoretical value of B therefore cannot be used, and an empirical value should be chosen instead.
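The error term A / B^(51 - QP) can be written out as a small function to show how it scales with QP. This is a sketch under stated assumptions: the default values A = 80 and B = 1.07 are the 4×4 values this article reports from simulation, the QP range 0 to 51 is inferred from the formula, and adding the term directly to a mode's distortion (`biased_cost`) is an assumed way of combining the two, since the article does not spell out that step.

```python
def error_term(qp, a=80.0, b=1.07):
    """Error term A / B**(51 - QP); grows exponentially as QP rises."""
    if not 0 <= qp <= 51:
        raise ValueError("QP is assumed to lie in 0..51")
    return a / b ** (51 - qp)


def biased_cost(distortion, qp, a=80.0, b=1.07):
    """Assumed combination: the error term is simply added to the mode cost."""
    return distortion + error_term(qp, a, b)


# The quantization step doubles every 6 QP steps, so one step multiplies it
# by 2**(1/6), which is the roughly 12% increment behind the theoretical B.
qstep_ratio = 2 ** (1 / 6)

for qp in (20, 30, 40, 51):
    print(qp, round(error_term(qp), 2))
```

With the empirical B = 1.07 the term grows more slowly with QP than the theoretical 1.12 would give, which matches the article's point that the theoretical value does not carry over to the Hadamard domain.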
The experimental simulations give the following result: for 4×4 intra prediction, A is set to 80 and B is set to 1.07. This set of parameters performed best across the different test sequences. As Fig. 4 shows, the improved intra prediction essentially eliminates the mode selection error, and its PSNR is close to that of the original intra prediction algorithm.

3 Motion estimation

H.264 uses variable block sizes, 1/4-pixel motion estimation, and multiple reference frames. During motion estimation, the starting point of the full search is determined by the motion predictor. For the integer-pixel search, distortion is measured with SAD. For better results, a compensation term can be added to the SAD. Although many hardware architectures support full-search motion estimation, the original search range and motion predictor of H.264 are not practical from a hardware implementation perspective. The corresponding improvements are described below.

3.1 Search range

Hardware implementations of motion estimation generally keep the search area data in on-chip memory to save memory bandwidth. Fig. 5 shows a typical scheme for reusing search area data, where the search range is -16 to +15. The 3×3 block on the left of Fig. 5 is the area in which motion estimation for the current macroblock is performed, and the 3×3 block on the right is the area for the next macroblock. The data in the overlapping region can be reused by both macroblocks, so the only newly loaded data is the rightmost 1×3 region. To use this scheme with H.264, the center of the search area should be fixed at (0, 0). This change reduces video quality when the true motion vector falls outside the search range.

3.2 Motion predictor

In H.264, the motion predictor is used to determine the number of bits needed for the motion vector data and to calculate the compensation factor for the motion vector coding cost.
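The interaction between the fixed search center of section 3.1 and the predictor-based compensation term can be sketched as a full search whose cost is the SAD plus a penalty proportional to the bits needed for the motion vector difference. Several details here are assumptions for illustration: the multiplier `lam`, the use of signed Exp-Golomb code lengths as the bit estimate, counting bits in integer-pixel units, and all function names. The block size and range are parameters so the sketch can be run on small data; the article's setting would be a 16×16 block with a range of -16 to +15.

```python
def se_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code for integer v."""
    code_num = 2 * v - 1 if v > 0 else -2 * v
    return 2 * (code_num + 1).bit_length() - 1


def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block at (bx, by) in cur and its shift in ref."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, bx, by, pred_mv, lam, n=16, lo=-16, hi=15):
    """Full search centred on (0, 0); returns (cost, (dx, dy)).

    cost = SAD + lam * bits(mv - pred_mv). The search window does not move
    with the predictor; the predictor only enters the penalty term.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(lo, hi + 1):
        for dx in range(lo, hi + 1):
            # Skip candidates that fall outside the reference frame.
            if bx + dx < 0 or by + dy < 0 or bx + dx + n > w or by + dy + n > h:
                continue
            bits = (se_golomb_bits(dx - pred_mv[0])
                    + se_golomb_bits(dy - pred_mv[1]))
            cost = sad(cur, ref, bx, by, dx, dy, n) + lam * bits
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Small synthetic example: cur is ref displaced by (dx, dy) = (2, 1).
ref = [[x + 16 * y for x in range(12)] for y in range(12)]
cur = [[ref[y + 1][x + 2] if y + 1 < 12 and x + 2 < 12 else 0
        for x in range(12)] for y in range(12)]
result = full_search(cur, ref, 4, 4, pred_mv=(0, 0), lam=0, n=4, lo=-2, hi=2)
```

With `lam=0` the search reduces to plain SAD matching; a larger `lam` pulls the chosen vector toward the predictor.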
The compensation factor is used throughout motion estimation to perform rate-distortion optimization. Figure 6 shows the dependencies of the motion predictor, where P1 to P4 are macroblocks that precede the current macroblock. The motion predictor of the current macroblock is calculated from the motion vectors of macroblocks P1 to P4. However, because the hardware uses the macroblock pipeline described above, the motion vector of P1 may not yet be available. Solving this problem requires changing the motion predictor calculation: only the motion vectors of macroblocks P2 to P4 are used. The change affects only the calculation of the compensation factor in motion estimation, so the improved algorithm still conforms to the H.264 standard.

3.3 1/4-pixel motion estimation

In H.264, half-pixel samples for motion estimation are generated by a 2-D 6-tap interpolation filter. Two-dimensional filtering requires line buffers to implement the transpose operation, and line buffer hardware is quite complex. However, motion compensation takes place in another part of the encoding loop, after the motion vector of the macroblock has been determined. To reduce hardware cost, a simpler method can therefore be used to generate the 1/4-pixel data for motion estimation. Although motion estimation does not have to use the same 1/4-pixel data as motion compensation, the mismatch between the two affects coding performance, so the interpolation cannot be oversimplified. Using bilinear interpolation in place of the 2-D 6-tap filter handles this problem well.

3.4 Hadamard transform

The Hadamard transform is a simple transform used to estimate the number of bits that the actual transform will produce. In H.264 motion estimation, the Hadamard transform is used in place of SAD; in a low-cost hardware design it can be omitted.
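The Hadamard-based cost in section 3.4 is commonly called SATD (sum of absolute transformed differences). The sketch below computes it for one 4×4 block and places it next to plain SAD, the cheaper measure a low-cost design would fall back on. The function names and the final division by 2 are assumptions; the division is a common normalization and is not taken from this article.

```python
def hadamard_4(v):
    """4-point Hadamard transform of a length-4 sequence (butterfly form)."""
    a, b, c, d = v
    s0, s1, d0, d1 = a + d, b + c, a - d, b - c
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]


def hadamard_4x4(block):
    """2-D Hadamard transform: rows first, then columns."""
    rows = [hadamard_4(r) for r in block]
    cols = [hadamard_4([rows[j][i] for j in range(4)]) for i in range(4)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [[cols[i][j] for i in range(4)] for j in range(4)]


def residual(cur, pred):
    """Element-wise difference between a 4x4 block and its prediction."""
    return [[cur[j][i] - pred[j][i] for i in range(4)] for j in range(4)]


def sad_4x4(cur, pred):
    """Sum of absolute differences in the pixel domain."""
    return sum(abs(v) for row in residual(cur, pred) for v in row)


def satd_4x4(cur, pred):
    """Sum of absolute Hadamard coefficients of the residual, halved."""
    coeffs = hadamard_4x4(residual(cur, pred))
    return sum(abs(v) for row in coeffs for v in row) // 2  # assumed scaling


# A flat residual of 1 concentrates all its energy in the DC coefficient.
cur_blk = [[5] * 4 for _ in range(4)]
pred_blk = [[4] * 4 for _ in range(4)]
print(sad_4x4(cur_blk, pred_blk), satd_4x4(cur_blk, pred_blk))
```

The transform uses only additions and subtractions, which is why it is attractive as a bit-count estimate in hardware, yet it still costs extra adders per candidate compared with SAD.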
4 Simulation results

Software simulation is performed on the "Foreman", "Grandma", "Salesman", and "carphone" sequences at a frame rate of 10 frames per second. For hardware reasons, the rate-distortion optimization mode is not used. Because JM 4.0 has no rate control, the rate-distortion curves are generated by varying QP. The curves are shown in Figures 7 and 8.

The simulation results show that the PSNR drop caused by the improved intra prediction algorithm is very small. For integer-pixel motion estimation on slow-motion sequences, the PSNR hardly decreases. The modification to the 1/4-pixel motion estimation (QME) algorithm causes a PSNR drop of about 0.4 to 0.6 dB, which is acceptable in a low-cost system. At 64 kbps, the PSNR drop does not exceed 0.58 dB for any sequence.

In a macroblock-based system, the improved algorithm described above makes parallel processing possible. The software simulation results indicate that, after the modifications to intra prediction and integer-pixel motion estimation, the decline in PSNR is almost negligible. For low-cost systems, the modifications to QME and the Hadamard transform can also be considered.

Editor in charge: GT