Improvement of H.264 video coding algorithm for hardware implementation

Abstract: This paper analyzes the H.264 algorithm from the perspective of hardware implementation. It focuses on optimizing the prediction part, which takes the most operation time, and presents improvements to the intra prediction, Hadamard transform and motion estimation algorithms. The hardware is optimized by simplifying modules that are computationally complex but inefficient and by reducing the data dependencies between modules. Simulations on various test sequences show that the improvements are effective.

H.264 [1] was originally drafted by ITU-T and will become a joint standard of ITU-T and MPEG. It is expected to become the next-generation video coding standard because it provides high coding and compression efficiency and a network-friendly interface. However, along with the high coding efficiency, the complexity of the algorithm has increased by a factor of four, which limits its implementation to a large extent. The algorithm must therefore be improved and optimized for hardware implementation.

The original H.264 test model (JM) [2] is designed to achieve high coding performance. In this test model, many algorithms require a large amount of computation while contributing little to coding efficiency, and many operations are data dependent, which limits the use of parallel processing to accelerate a hardware implementation.

Previous articles have analyzed the complexity of this new video coding standard [3-5]. However, these studies derive the complexity of the H.264 algorithm through software analysis. The results are accurate for software applications, but they no longer apply to hardware designs that rely on parallel processing. Experimental comparison shows that the key point in a hardware implementation of H.264 is the prediction part, because this module accounts for almost 90% of the total computation time. The improvements therefore focus on the prediction part.

1 H.264 algorithm

Fig. 1 is a block diagram of the algorithm for intra prediction and inter prediction of one frame. If intra prediction is adopted, the inter prediction part is not evaluated. Inter prediction uses multi-frame prediction and variable block size motion estimation. The coding mode selection section selects the best prediction mode among all prediction modes. After prediction, the prediction frame is subtracted from the original input frame to obtain the residual data block. A 4 × 4 integer DCT is applied to the luminance residual block, and a 2 × 2 integer transform is applied to the DC coefficients of the chroma residual block. After the transformed coefficients are scanned and quantized, the quantized coefficients are entropy coded to form the output bitstream. The coding mode is also passed to the entropy encoder through the mode table. The reconstruction loop includes inverse quantization, inverse DCT and deblocking filtering. Finally, the reconstructed frame is written into the frame buffer for later motion estimation.

Because almost all of the computing power is spent on spatial prediction and temporal prediction, the algorithm improvements on JM 4.0 are mainly in these two parts. Both parts are implemented in hardware, so the optimization targets the hardware.

The hardware system used to realize the encoder is macroblock based; that is, the encoder operates on successive macroblocks. The whole coding system can be regarded as a macroblock pipeline, so the reconstruction of the previous macroblock may not yet be complete when the next macroblock is being coded, which stalls the pipeline. Many commercial macroblock-based coders use this hardware implementation mode, so it is very important to deal with this problem.

2 Intra prediction

The coding block diagram in Fig. 1 is similar to that of H.261, H.263 and MPEG-4. H.264 contains two kinds of intra prediction, 4 × 4 and 16 × 16. Intra prediction requires pixel values from the reconstructed image. In a typical macroblock-based system, the reconstructed pixel values can be obtained only after the whole coding process is completed. This data dependency makes hardware implementation very difficult.

2.1 4 × 4 intra prediction

Figure 2 depicts the data dependency in 4 × 4 block intra prediction. The pixel values a to p are predicted from the pixel values A to N and Q. The reconstructed pixel values are represented in uppercase letters. Because a macroblock consists of sixteen 4 × 4 blocks, the reconstructed pixel values cannot be obtained before the current block is encoded. In JM, a two-pass algorithm is used to encode these blocks. Predicting a 4 × 4 block in JM requires the full chain of transform, quantization, inverse quantization and inverse transform. This is too complex for hardware and cannot be achieved at the current hardware level. The algorithm is therefore modified as follows: in all predictions, the pixel values of the reconstructed frame are replaced by the original values of the input frame. With this change, 4 × 4 intra prediction and transform can be implemented on the macroblock pipeline.

2.2 16 × 16 intra prediction

Figure 3 shows the data dependency of 16 × 16 intra prediction. The prediction of the current macroblock is based on the 17 pixels above the current macroblock position and the 16 pixels to its left in the reconstructed frame. Because the reconstruction of the left macroblock may not be complete when the current macroblock is predicted, the original pixels are used in place of the pixels to the left of the current macroblock.
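The substitution described for 16 × 16 intra prediction can be illustrated with a minimal sketch. It covers only the DC mode, takes the row above from the reconstructed frame and the column to the left from the original frame, and uses the standard H.264 DC rounding. The function names and the frame layout (lists of rows) are assumptions made for illustration, not part of JM.

```python
MB = 16  # macroblock size in pixels


def reference_pixels(orig, recon, mb_x, mb_y):
    """Collect the DC-mode neighbours of macroblock (mb_x, mb_y).

    The row above comes from the reconstructed frame; the column to the
    left comes from the ORIGINAL frame, as in the modified algorithm.
    Assumes mb_x >= 1 and mb_y >= 1 so that both neighbours exist.
    """
    x0, y0 = mb_x * MB, mb_y * MB
    top = [recon[y0 - 1][x0 + i] for i in range(MB)]
    left = [orig[y0 + i][x0 - 1] for i in range(MB)]
    return top, left


def dc_predict_16x16(top, left):
    """16 x 16 DC prediction: rounded mean of the 32 neighbour pixels."""
    dc = (sum(top) + sum(left) + 16) >> 5
    return [[dc] * MB for _ in range(MB)]


# Hypothetical 32 x 32 frames: flat original and slightly darker reconstruction.
orig = [[100] * 32 for _ in range(32)]
recon = [[90] * 32 for _ in range(32)]
top, left = reference_pixels(orig, recon, 1, 1)
pred = dc_predict_16x16(top, left)
```

Because the left column no longer depends on the reconstruction of the previous macroblock, the prediction can start as soon as the input frame is available.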
2.3 Coding mode selection

With the modified algorithm described above, simply replacing the reconstructed pixels with the original pixels causes errors in coding mode selection. Figure 4 shows the rate-distortion curves of the modified intra coding. The simulated sequence is "Claire" at 10 fps. As can be seen from Fig. 4, the PSNR drop caused by the mode selection error is obvious. The original pixels belong to the same frame, whereas the reconstructed pixels have had redundancy removed through inter-frame or intra-frame coding, so the original pixels are more strongly correlated than the reconstructed pixels. The mode selection error of the modified intra prediction algorithm is therefore much larger than that of the original algorithm.

To reduce the mode selection error, the error cost function needs to be modified. The current approach is to add an error term that reflects the difference between the original pixel and the reconstructed pixel. Because the quantization parameter (QP) affects the mismatch between the original and reconstructed pixels, the error term depends on the QP value. In H.264, as QP increases linearly, the influence of quantization on coding increases exponentially. To follow this growth trend, the basic form of the error term is $a / b^{(51 - QP)}$, where a and b are undetermined coefficients. Determining a and b is the key to eliminating the error. In H.264, each QP step increases the quantization step size by about 12%, so in theory the matching parameter b should be set to 1.12. However, the error cost function is calculated in the Hadamard transform domain, where each coefficient has a different weight, and the probability distribution of the transformed coefficients is uncertain.
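The error term can be sketched as a small function, reading the garbled formula as a divided by b raised to the power (51 - QP). The function name is hypothetical, and a and b are left as parameters because the text treats them as coefficients to be tuned. The constant below only restates the theoretical value: the quantization step size in H.264 doubles every 6 QP steps, which gives the 12% growth per step.

```python
# Step size doubles every 6 QP steps, so each step scales it by 2**(1/6).
THEORETICAL_B = 2 ** (1 / 6)  # about 1.12


def mode_error_term(qp, a, b):
    """Error term a / b**(51 - qp) added to the intra mode cost.

    The term equals a at QP = 51 and shrinks exponentially as QP drops,
    mirroring the growing mismatch between original and reconstructed
    pixels at coarse quantization.
    """
    if not 0 <= qp <= 51:
        raise ValueError("QP must lie in 0..51")
    return a / b ** (51 - qp)


# Illustrative call with a = 1 and the theoretical b.
low_qp_term = mode_error_term(20, 1.0, THEORETICAL_B)
high_qp_term = mode_error_term(40, 1.0, THEORETICAL_B)
```

How the term enters the cost of each candidate mode is not spelled out in the text, so the sketch stops at the term itself.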
Therefore, parameter b cannot be set from the theoretical value and must be determined empirically. The experimental simulation results show that, for 4 × 4 intra prediction, a should be set to 80 and b to 1.07. This set of parameter values gives the best results across the different test sequences. As Fig. 4 shows, the modified intra prediction essentially eliminates the mode selection error, and its PSNR performance is close to that of the original intra prediction algorithm.

3 Motion estimation

H.264 uses variable block size, quarter-pixel accuracy and multiple reference frames in motion estimation. During motion estimation, the starting point of the full search is determined by the motion predictor. For integer-pixel search, distortion is measured by SAD. For better results, a compensation term can be added to the SAD. Although full-search motion estimation is supported by various hardware architectures, the original choice of search range and motion predictor in H.264 is not practical from the perspective of hardware implementation. The corresponding improvements are described below.

3.1 Search range

When motion estimation is implemented in hardware, on-chip memory is usually used to compensate for insufficient off-chip memory bandwidth. A typical search area data reuse method is shown in Fig. 5, in which the search range is -16 to +15. The 3 × 3 blocks on the left in Fig. 5 represent the motion estimation area of the current macroblock, and the 3 × 3 blocks on the right represent that of the next macroblock. The data in their overlapping region can be reused for both macroblocks, and the only newly loaded data is the 1 × 3 area on the far right. To fit this data reuse scheme in H.264, the starting point of the search area should be fixed at (0, 0) rather than at the motion predictor. This change degrades video quality only when the true motion vector falls outside the search range.
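The saving from this data reuse scheme can be estimated with a short calculation. This is a sketch that assumes one byte per luminance pixel and rounds the search window to whole macroblocks (3 × 3 macroblocks for a range of -16 to +15); the function names are illustrative.

```python
MB = 16  # macroblock size in pixels


def window_pixels(range_mbs=1):
    """Pixels in the full search window: (2 * range_mbs + 1) macroblocks per side."""
    side = (2 * range_mbs + 1) * MB
    return side * side


def new_pixels_per_macroblock(range_mbs=1):
    """Pixels loaded per macroblock with reuse: one new column of macroblocks."""
    side = (2 * range_mbs + 1) * MB
    return MB * side


full_load = window_pixels()                 # whole 3 x 3 window reloaded
reuse_load = new_pixels_per_macroblock()    # only the rightmost 1 x 3 strip
saving_factor = full_load // reuse_load
```

With these assumptions the off-chip traffic per macroblock drops from 2304 to 768 pixels. The reuse only works if consecutive search windows are aligned, which is why the search is centered at (0, 0) rather than at a predictor that changes from macroblock to macroblock.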
3.2 Motion predictors

In H.264, the motion predictor is used to determine the number of bits needed for the motion vector data and to calculate the compensation factor for the cost of coding the motion vector data. The compensation factor is used throughout motion estimation for rate-distortion optimization. Fig. 6 shows the dependency of the motion predictor, where P1 to P4 are macroblocks that precede the current macroblock. The motion predictor of the current macroblock is calculated from the motion vectors of macroblocks P1 to P4. In hardware, however, when the macroblock-based process described above is pipelined, the motion vector of P1 may not yet be available. To solve this problem, the dependency must be removed from the calculation of the motion predictor. Specifically, only the motion vectors of macroblocks P2 to P4 are used in the calculation. The change affects only the calculation of the motion estimation compensation factor, so the modified algorithm still complies with the H.264 standard.

3.3 Motion estimation with quarter-pixel accuracy

In H.264, half-pixel motion estimation is realized by two-dimensional 6-tap interpolation filtering. Two-dimensional filtering requires a line buffer to perform the transpose operation, and the hardware implementation of a line buffer is very complex. However, by the time motion compensation is performed in another part of the coding loop, the motion vector of the macroblock has already been determined. To reduce hardware cost, a simpler method can therefore be used to generate the quarter-pixel data for motion estimation. The quarter-pixel data used for motion estimation and for motion compensation need not be identical, but the mismatch between them still affects coding performance, so the interpolation cannot be simplified arbitrarily. Replacing the two-dimensional 6-tap interpolation filter with bilinear interpolation handles this trade-off well.
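The difference between the two interpolation methods can be sketched as follows. The bilinear function samples a frame at a quarter-pixel position from its four integer neighbours, and the 6-tap function is a one-dimensional version of the H.264 half-sample filter with taps (1, -5, 20, 20, -5, 1), shown only for contrast. The function names, the edge clamping and the rounding of the bilinear sample are assumptions of this sketch, not details taken from the paper.

```python
def bilinear_qpel(frame, x4, y4):
    """Sample frame (list of rows) at position (x4 / 4, y4 / 4).

    Uses only the four surrounding integer pixels, so no line buffer
    or transpose step is needed.
    """
    x, fx = divmod(x4, 4)
    y, fy = divmod(y4, 4)
    x1 = min(x + 1, len(frame[0]) - 1)  # clamp at the right edge
    y1 = min(y + 1, len(frame) - 1)     # clamp at the bottom edge
    a, b = frame[y][x], frame[y][x1]
    c, d = frame[y1][x], frame[y1][x1]
    acc = ((4 - fx) * (4 - fy) * a + fx * (4 - fy) * b
           + (4 - fx) * fy * c + fx * fy * d)
    return (acc + 8) >> 4


def six_tap_half(row, i):
    """Half sample between row[i] and row[i + 1] with the 6-tap filter."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * row[i - 2 + k] for k, t in enumerate(taps))
    return min(255, max(0, (acc + 16) >> 5))


frame = [[0, 40], [80, 120]]
half_x = bilinear_qpel(frame, 2, 0)    # halfway between 0 and 40
center = bilinear_qpel(frame, 2, 2)    # centre of the four pixels
edge = six_tap_half([0, 0, 0, 100, 100, 100], 2)
```

The bilinear version needs two pixels per dimension where the 6-tap filter needs six, which is the source of the hardware saving and also of the mismatch with motion compensation noted above.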
3.4 Hadamard transform

The Hadamard transform is a simple transform used to estimate the number of bits generated after transformation. In H.264 motion estimation, the Hadamard transform is used in place of SAD. If low-cost hardware is required, this part can be omitted.

4 Simulation results

The software simulation is carried out on the "Foreman", "Grandma", "Salesman" and "Carphone" sequences at a frame rate of 10 frames per second. With hardware in mind, the rate-distortion optimization mode is not used. Because JM 4.0 has no rate control, the rate-distortion curves are generated by varying QP. The rate-distortion curves are shown in Fig. 7 and Fig. 8.

The simulation results show that the PSNR drop caused by the modified intra prediction algorithm is very small. For integer-pixel motion estimation on slow-motion sequences, PSNR hardly decreases. The modified quarter-pixel motion estimation (QME) algorithm reduces PSNR by about 0.4 to 0.6 dB, which is acceptable in low-cost systems. At 64 kbps, the PSNR of each sequence decreases by no more than 0.58 dB.

In a system based on macroblock processing, parallel processing can be realized with the modified algorithms described above. The software simulation results show that the PSNR loss is almost negligible after modifying the intra prediction and integer-pixel motion estimation algorithms. For low-cost systems, the modifications to QME and the Hadamard transform can also be considered.