Key Technologies of the New-Generation Video Coding Standard H.264/AVC

Keywords: AVC, encoding, video

In December 2001, ITU-T and ISO/IEC established the JVT (Joint Video Team), which is committed to formulating the next-generation video coding standard with H.26L as the platform. In May 2003, the new H.264/AVC standard was officially launched. The official name of the standard is H.264/MPEG-4 Part 10 AVC. The main goal of the H.264/AVC standard is to design simple and effective coding technology with efficient compression performance and network-friendly transmission, so as to meet the growing needs of "conversational" applications (mobile video telephony, conferencing), "non-conversational" applications (video storage, broadcasting and streaming media), digital cinema, video surveillance and other video applications.

1. Basic coding structure of H.264/AVC

Like earlier video coding standards, the H.264/AVC standard does not explicitly define a complete codec. Instead, it defines the syntax of the encoded bitstream and the method of decoding that bitstream. H.264/AVC adopts a hybrid coding framework of motion estimation/compensation plus block DCT transform, similar to previous standards. It follows a "back to basics" approach to developing a high-performance video coding standard: it adopts existing basic algorithms and structures and carefully optimizes the calculation process and methods to achieve better coding performance.

Compared with the existing H.261 and H.263 standards, H.264 keeps the system structure of the encoder unchanged. It mainly includes four steps:

(1) Each frame is divided into small blocks (macroblocks and blocks), each containing many pixels. The frame is then coded block by block.
(2) Through transform, quantization and entropy coding (or variable length coding) of the image blocks, the spatial redundancy in the image is removed.
(3) Because adjacent frames are highly similar (i.e.
temporal redundancy), it is only necessary to encode and transmit the changes between adjacent frames, which is realized by motion search and motion compensation. For each coding block, a motion vector is found by searching around the corresponding position in the previously coded frame (or several previous frames). This vector is transmitted together with the inter-frame difference for the encoding and decoding of the block.
(4) Residual coding: the difference between the original block and the corresponding prediction block is transformed, quantized and entropy coded to remove the remaining spatial redundancy of the current frame.

However, compared with the previous coding algorithm H.263, H.264 adds some new features to improve coding efficiency:

(1) For intra-coded images, instead of directly transforming, quantizing and encoding the original image, a variety of prediction methods are used to predict the image, and the difference is then processed as described above to obtain better coding efficiency.
(2) For motion search and motion compensation, H.264 uses seven block sizes from 16 × 16 down to 4 × 4 (16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8 and 4 × 4) to improve the matching degree. The search uses 1/4-pixel accuracy to improve precision. In addition, depending on the coding delay requirements, H.264 can also search several previously coded frames to achieve the best effect.
(3) For transform coding, H.264 replaces the DCT with a 4 × 4 integer transform (ICT). The effect of the integer transform is close to that of the DCT, but it requires less computation and introduces no error from limited calculation accuracy in the inverse transform.
(4) For entropy coding, H.264 uses universal variable length coding (UVLC) and context-adaptive variable length coding (CAVLC).

2. Layered processing of the coding structure

The coding structure of H.264 is conceptually divided into two layers. The video coding layer (VCL) is responsible for efficient video compression. The network abstraction layer (NAL) is responsible for network adaptation, that is, for adapting to different networks, for example by packaging and transmitting data in an appropriate way. The hierarchical structure of the H.264 encoder is shown in Figure 1.

A packet-based interface is defined between the VCL and the NAL. Packaging and the corresponding signaling are part of the NAL. In this way, the tasks of efficient coding and network adaptability are handled by the VCL and the NAL respectively. The VCL includes block-based motion-compensated hybrid coding and some new features. The NAL is responsible for encapsulating the data according to the characteristics of the underlying network, including framing, signaling of logical channels, use of synchronization information, and so on. The NAL obtains data from the VCL, including header information, slice structure information and the actual payload (if data partitioning is used, the payload may consist of several parts). The task of the NAL is to map them correctly onto the transport protocol. Below the NAL are various specific protocols, such as H.323 and H.324. The introduction of the NAL greatly improves the ability of H.264 to adapt to complex channels.

The NAL in the JVT standard defines the interface between the video codec itself and the outside world. Its basic unit is the NALU (Network Abstraction Layer Unit), which provides good support for many current packet-based network transmission methods. A NALU consists of a one-byte header and a variable-length bit string containing syntax elements of a specific type. A NALU can carry slice coding information, random access points, parameter set information or supplementary enhancement information.
The structure of the NALU header is as follows:

The NALU type (T) is a 5-bit field that indicates which of 32 different types the NALU is. Types 1 to 12 have been defined by H.264, and types 24 to 31 can be used by specifications other than H.264. The RTP payload specification will use some of these values to signal packet aggregation and packet fragmentation. Other values are reserved for future use.

nal_reference_idc (R) marks the importance of the NALU in the reconstruction process. A value of 0 means that the NALU is not used for reference, so a decoder or gateway may discard it without causing error propagation. The higher the value, the more important the data in the NALU. This allows network nodes to protect important data effectively according to this value.

forbidden_zero_bit (F) is set to 0 during encoding. When a network node detects a bit error in the NALU, it can set this bit to 1. Depending on the network environment, decoders may handle NALUs containing bit errors differently, and some may simply discard them. The forbidden_zero_bit facilitates this operation.

Some packet-based networks can directly use NALUs as the payload of H.223 AL3 SDUs or RTP packets. However, some stream-oriented systems, such as the ITU-T video conferencing recommendation H.320 and the MPEG-2 transport stream in digital TV, require a bit-stream or byte-stream format. Therefore, the JVT standard defines a transformation from NALUs to a byte-stream format in which each NALU is prefixed with a start code, which is consistent with traditional video coding standards. The start code can be 24 or 32 bits long, depending on the importance of the NALU payload. The start code appears only at byte-aligned positions, so the decoder can scan for it and extract NALUs with simple byte-oriented memory copy operations.
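The header layout and the start-code scan described above can be illustrated with a minimal Python sketch. It assumes the one-byte header is laid out as F (1 bit), R (2 bits), T (5 bits) from the most significant bit down, and that NALUs are delimited by the byte sequence 0x000001, optionally preceded by a zero byte. The names `NalHeader`, `parse_nal_header` and `split_byte_stream` are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero_bit: int  # F: 1 bit, 0 in an error-free NALU
    nal_reference_idc: int   # R: 2 bits, 0 means "not used for reference"
    nal_unit_type: int       # T: 5 bits, one of 32 NALU types


def parse_nal_header(first_byte: int) -> NalHeader:
    """Split the one-byte NALU header into its F, R and T fields."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("the NALU header is a single byte")
    return NalHeader(
        forbidden_zero_bit=(first_byte >> 7) & 0x01,
        nal_reference_idc=(first_byte >> 5) & 0x03,
        nal_unit_type=first_byte & 0x1F,
    )


def split_byte_stream(stream: bytes) -> List[bytes]:
    """Return the NALUs found between byte-aligned 0x000001 start codes."""
    start_code = b"\x00\x00\x01"
    nalus: List[bytes] = []
    pos = stream.find(start_code)
    while pos != -1:
        begin = pos + len(start_code)
        nxt = stream.find(start_code, begin)
        end = len(stream) if nxt == -1 else nxt
        # Assumption: a NALU never ends in a zero byte, so trailing zeros
        # belong to the next (4-byte) start code and can be stripped.
        nalu = stream[begin:end].rstrip(b"\x00")
        if nalu:
            nalus.append(nalu)
        pos = nxt
    return nalus


if __name__ == "__main__":
    demo = b"\x00\x00\x00\x01\x67\x42" + b"\x00\x00\x01\x65\x88"
    for unit in split_byte_stream(demo):
        print(parse_nal_header(unit[0]), len(unit), "bytes")
```

A gateway following the text could, for example, drop any unit whose `nal_reference_idc` is 0 under congestion, or discard units whose `forbidden_zero_bit` has been set to 1 by an upstream node.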
To prevent start code emulation in the byte-stream format, many video coding standards carefully design their entropy coding. Because the JVT standard contains two different entropy coding modes, such a design is impractical. JVT instead relies on a byte stuffing mechanism: it avoids emulation by inserting a non-zero byte at every position in the NALU where a start code could otherwise be emulated. To simplify gateway design, byte stuffing is still performed in some environments where it seems unnecessary, especially in packet transmission networks. Because the VCL-NAL interface is only conceptual, start code emulation prevention is customarily performed as part of VCL entropy coding.

H.264 also strengthens the robustness of the video stream for transmission over IP networks with frequent bit errors and packet loss. To reduce the effect of transmission errors, temporal synchronization in the H.264 video stream can be achieved with intra picture refresh, while spatial synchronization is supported by slice-structured coding. To facilitate resynchronization after bit errors, resynchronization points are also provided within each frame of video data. In addition, intra macroblock refresh and the multiple reference frame mode let the encoder consider not only coding efficiency but also the characteristics of the transmission channel when choosing the macroblock mode.

H.264 also defines a data partitioning mode. The picture is first divided into slices. The macroblock data in each slice is divided into three parts: macroblock header information, motion vectors and DCT coefficients, and the three parts are separated by identifiers. In this way, the decoder can easily detect which type of data is damaged and reduce the loss of image quality caused by bit errors. Data partitioning also lends itself to unequal error protection during channel coding, that is, stronger protection for more important data.
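The byte stuffing mechanism described earlier in this section can be sketched as follows. This is a minimal illustration, assuming the rule used in the final H.264 byte-stream format: whenever two consecutive zero bytes would be followed by a byte of value 0x03 or less, a stuffing byte 0x03 is inserted first. The function names are hypothetical, and the sketch ignores the special handling of zero bytes at the very end of a payload.

```python
def add_emulation_prevention(payload: bytes) -> bytes:
    """Insert 0x03 so that 0x000000..0x000003 never appears in the output."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes written so far
    for b in payload:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # non-zero stuffing byte breaks the pattern
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def remove_emulation_prevention(data: bytes) -> bytes:
    """Undo add_emulation_prevention: drop each 0x03 that follows 0x0000."""
    out = bytearray()
    zeros = 0
    for b in data:
        if zeros >= 2 and b == 0x03:
            zeros = 0  # skip the stuffing byte
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


if __name__ == "__main__":
    raw = b"\x25\x00\x00\x01\x00\x00\x00\x00\x7f"
    stuffed = add_emulation_prevention(raw)
    print(stuffed.hex(" "))
    assert remove_emulation_prevention(stuffed) == raw
```

Because the stuffing operates on whole bytes and needs no knowledge of the entropy coder, it works the same way for both entropy coding modes, which is the design motivation given in the text.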
Fast rate control can be achieved by changing the quantization accuracy at the macroblock layer.

3. Performance analysis of H.264

The coding performance of H.264 is tested experimentally, and its coding efficiency is assessed by comparison with H.263.

3.1 Comparison of coding performance between H.264 and H.263

In this test, the Grenadier Guards sequence is used, and the fidelity, PSNR and macroblock coding bits of H.264 and H.263 are compared. The results are as follows:

(1) Fidelity test
The comparison of residuals clearly shows that the residual between the H.264 reconstructed frame and the reference frame is relatively smooth and essentially free of spots, whereas the residual of H.263 is obvious, especially near the person, where there is a large amount of motion. H.263 uses half-pixel motion vector estimation, while H.264 improves this to 1/4 pixel. On the basis of 1/4 pixel, motion vectors with 1/8-pixel accuracy are obtained by interpolation, which greatly improves the quality of image coding, as shown in Fig. 2.

(2) PSNR test (as shown in Figure 2)
Compared with the H.263 video coding standard, H.264 improves predictive coding in several ways: adaptive selection of field and frame coding; variable block-size motion compensation; high-precision motion compensation; multiple reference frame motion compensation; weighted prediction; integer transform; adaptive entropy coding; and in-loop deblocking filtering. Together these greatly improve the PSNR of H.264. As can be seen from Fig. 2, the PSNR of H.264 is higher than that of H.263 for both the luminance signal and the color difference signals.

(3) Macroblock coded bits
The following is a more intuitive comparison between H.264 and H.263. As shown in Figure 3, the color bar changes from blue to red, indicating a gradual increase in the number of bits. The comparison results are shown in Figure 4 and Figure 5.
For the third frame of the Grenadier Guards sequence, coded with 4 × 8 blocks, the number of bits used by each macroblock can be clearly seen. The comparison shows that H.264 uses 50% fewer bits for macroblock coding than H.263. The effect is especially obvious near moving objects, where H.263 shows many red blocks while H.264 shows mostly blue blocks. Even in the essentially static background, the two differ considerably: many of the H.264 macroblocks are dark blue, using about 10 bits, while the H.263 macroblocks tend towards green, using about 20 bits. This also shows that the coding efficiency of H.264 is much higher than that of H.263.

3.2 H.264 coding performance

3.2.1 Multiple reference frame prediction mode

For many types of video sequences, the multiple reference frame prediction mode can effectively improve coding performance. It allows one of several reference frames to be selected at the macroblock level by adding a temporal component to the motion vector. Because a reference frame buffer must be maintained, the memory demand of the codec increases. In addition, the extra reference frames expand the search area, which significantly increases the computational complexity of motion estimation in the encoder.

In this experiment, the Foreman video sequence is encoded with UVLC entropy coding, 1/4-pixel motion vector accuracy and a search range of 16 pixels. Fig. 6 shows the effect of using different numbers of reference frames m on the peak signal-to-noise ratio of the luminance component. Experiments show that using multiple reference frames saves 10% of the bit rate on average. The gain from multiple reference frames also depends on the content of the specific sequence. For sequences with a high bit rate, the PSNR of the image improves considerably.

3.2.2 Bidirectional prediction mode
Video coding standards before H.264 generally adopt a multi-hypothesis prediction mode, while H.264 uses a bidirectional prediction mode, which is a linear combination of forward and backward prediction frames. Both forward and backward prediction can use multiple reference frames. The bidirectional prediction signals can be estimated either independently or jointly, and joint estimation can greatly improve coding efficiency.

In this experiment, the Foreman video sequence is encoded with UVLC entropy coding, 1/4-pixel motion vector accuracy and a search range of 16 pixels. Fig. 7 shows the effect of independent estimation and joint estimation on the peak signal-to-noise ratio of the luminance component, plotted against the bit rate when reconstructing B frames. With five forward prediction frames and three backward prediction frames, the figure shows that joint estimation outperforms independent estimation. The linear bidirectional prediction model not only uses its components to suppress noise, but also provides the function of eliminating peaks. If an object in the current frame appears in the subsequent frame but not in the previous frame, adding forward reference frames cannot improve coding efficiency, but adding backward reference frames can improve it greatly.

3.2.3 Entropy coding

H.264 has two different entropy coding modes: universal variable length coding (UVLC) and context-based adaptive binary arithmetic coding (CABAC). UVLC uses a single variable length code to encode all syntax elements, while CABAC uses context modeling and an adaptive algorithm based on conditional probability and symbol statistics. The UVLC algorithm is simple and achieves good compression efficiency at low computational cost.
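To illustrate why a single universal code is cheap to implement, here is a minimal sketch. It assumes that the UVLC has the structure of the unsigned Exp-Golomb code used for syntax elements in the final H.264 standard: a run of zero bits followed by the binary form of the value plus one. The function names are hypothetical, and bit strings are used instead of packed bytes purely for readability.

```python
def ue_encode(value: int) -> str:
    """Encode a non-negative integer as an unsigned Exp-Golomb bit string."""
    if value < 0:
        raise ValueError("unsigned Exp-Golomb codes need value >= 0")
    info = bin(value + 1)[2:]             # binary of value + 1, no '0b' prefix
    return "0" * (len(info) - 1) + info   # prefix of len(info) - 1 zero bits


def ue_decode(bits: str) -> int:
    """Decode the first unsigned Exp-Golomb codeword found in a bit string."""
    leading = 0
    while leading < len(bits) and bits[leading] == "0":
        leading += 1
    codeword = bits[leading:2 * leading + 1]
    if len(codeword) != leading + 1:
        raise ValueError("truncated codeword")
    return int(codeword, 2) - 1


if __name__ == "__main__":
    for v in range(6):
        print(v, ue_encode(v))
```

Small values get short codewords (0 maps to a single bit), and the decoder only needs to count leading zeros, which is why this kind of code needs no tables and no adaptation.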
CABAC has high computational complexity, but it can save a great deal of bit rate. In this experiment, the Foreman video sequence uses 1/4-pixel motion vector accuracy and a search range of 16 pixels. Fig. 8 shows the effect of using UVLC and CABAC on the peak signal-to-noise ratio of the luminance component. The test shows that C