How does the H.264 codec work? What is the structure of an H.264 bitstream?

"This article is Piasy original, published Https://blog.pianess.com, please read the original text support original Https://blog.pianess.com/2017/09/22/i-need-know-about-h264/ I have joined a small company that has a very deep accumulated in the stream of media this year. It is responsible for the development of the video group chat, YOLO is a live APP. I often say this is from technology downstream (SDK users) Run the technology upstream (SDK provider). However, things are of course not so simple. After long-term thinking and discussion, I finally confirmed: real-time multimedia fields, a wider point of view, real-time visual, perceived show, there is a big demand for a long time in the future, too There is a big challenge, so this will be the big direction of my long-term technology accumulation. After the big direction is clear, it needs to be kerydrily. I have always emphasized the importance of basic knowledge. I have taken time to study the foundation of H.264 ("New Generation Video Compression Coding Standard: H.264 / AVC (2nd Edition)"), strive to figure out two questions: What is the process of the H.264 codec? What is the structure of the H.264 stream? I used to share the book after reading the book, I didn't dare to release blogs. This article I strive to describe the above two core problems according to my own understanding, and it will not be unfolded. Interested friends can read the original book, of course, the most authentic information is too h.264 spec. The images used herein are basically taken from "New Generation Video Compression Coding Standard: H.264 / AVC (2nd Edition)". Video codec foundation Why does the video need to be encoded? Because the original video data is too large! 
For a video with a resolution of 640x480 and a frame rate of 30 fps, transmitting or storing the raw RGB data directly requires a bitrate of 210.94 Mbps (in this field, bitrate is usually expressed in bits rather than bytes). At 1280x720 and 30 fps the bitrate reaches 632.81 Mbps.

640 * 480 * 3 * 8 * 30 = 210.94 Mbps (width * height * bytes per pixel * bits per byte * frames per second, with 1 Mbps taken as 1024 * 1024 bit/s)
1280 * 720 * 3 * 8 * 30 = 632.81 Mbps

Bitrates this high cannot be used directly. Even with a more compact YUV format, the bitrate is still unacceptable for both network transmission and disk storage, so the video must be compressed.

Why can video be compressed?

Because video data contains redundancy. The first kind is data redundancy: the pixels of an image are strongly correlated. On a white wall in a picture, for example, the pixel values in every region are very close; and in everyday footage, consecutive frames mostly show the same objects moved to different positions. The second kind is visual redundancy: because of characteristics of the human eye, such as the luminance discrimination threshold, the visual threshold, and its different sensitivity to luminance and chrominance, a moderate amount of error can be introduced without being noticed.

What are the main techniques of video coding?

The goal of video coding is to compress the data as much as possible while preserving video quality. The main techniques therefore aim to eliminate redundancy and raise the compression ratio. Because video is carried over packet-switched networks and used in real-time multimedia applications, video coding also has to consider network adaptation, error resilience, and similar issues.
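The raw bitrate figures quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `raw_bitrate_mbps` is made up for illustration, 1 Mbps is taken as 1024 * 1024 bit/s to match the article's numbers, and the 1.5 bytes per pixel used for YUV assumes 8-bit YUV 4:2:0, which the article does not specify.

```python
def raw_bitrate_mbps(width, height, fps, bytes_per_pixel=3):
    """Bitrate of uncompressed video in Mbps (1 Mbps = 1024 * 1024 bit/s here)."""
    bits_per_second = width * height * bytes_per_pixel * 8 * fps
    return bits_per_second / (1024 * 1024)


for width, height in [(640, 480), (1280, 720)]:
    rgb = raw_bitrate_mbps(width, height, 30)
    # Assumption: 8-bit YUV 4:2:0 stores 1.5 bytes per pixel on average.
    yuv420 = raw_bitrate_mbps(width, height, 30, bytes_per_pixel=1.5)
    print(f"{width}x{height} @ 30 fps: RGB {rgb:.4f} Mbps, YUV 4:2:0 {yuv420:.4f} Mbps")
```

Under that assumption, switching to YUV 4:2:0 only halves the bitrate, which is why compression is still required.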
Note: the following paragraphs involve many technical terms. Readers without the background may not be familiar with them and can look them up on Wikipedia. Keywords: predictive coding, intra prediction, inter prediction, motion compensation, motion estimation, motion vector, transform coding, discrete cosine transform, quantization parameter, entropy coding, Huffman coding, arithmetic coding.

Predictive coding and motion compensation: predictive coding is meant to eliminate data redundancy. After compression, what is transmitted is not the actual sample value of each pixel but the difference between the predicted value and the actual value. Predictive coding is divided into intra prediction and inter prediction, which remove redundancy within a frame and between frames, respectively. For efficiency and quality, predictive coding operates on blocks of pixels rather than on individual pixels. Intra prediction predicts a pixel block from neighboring pixel blocks in the same picture. Inter prediction first searches a neighboring frame for a block similar to the current one, obtains the spatial offset between the two, and then predicts from it. The process of finding that offset (that is, finding the similar block) is called motion estimation, the offset itself is called a motion vector, and describing the difference between frames in this way is called motion compensation.

Note: "prediction" here really means "reference": find a reference and compute the difference from it.

Transform coding and quantization: most images share a common property. Flat regions and slowly changing regions occupy most of an image, while detailed regions and abrupt changes account for only a small part. In other words, DC and low-frequency components dominate the image, and high-frequency components are few.
Therefore, converting the image from the spatial domain to the frequency domain makes it easier to compress. This conversion is called transform coding, and the most commonly used transform is the discrete cosine transform (DCT). After transform coding, the transform coefficients are mapped to smaller values; this step is called quantization.

Entropy coding: compression that exploits the statistical properties of the source is called entropy coding, also known as statistical coding. Frequently occurring symbols are given short codes and rare symbols long codes, which reduces the total number of bits. The entropy coding methods commonly used in video coding are variable-length coding (VLC, for example Huffman coding) and binary arithmetic coding (BAC).

The framework of predictive coding, transform coding, and entropy coding was in fact established by the end of the 1970s and is still in use, even in the H.266 standard that had not yet been released at the time of writing. For these four decades it has basically been new wine in an old bottle, although the details have been optimized continuously.

H.264 bitstream structure

We first look at the H.264 bitstream structure and the reasons behind its design. Once the bitstream structure is understood, the encoding and decoding process has something concrete to build on. In fact, the H.264 specification itself first specifies the bitstream structure and then the structure of the decoder (it does not specify the structure or implementation of the encoder), for the same reason.

Layering of syntax elements

In the bitstream output by the encoder, the basic unit of data is the syntax element (which can be understood as the structure of the bitstream). Syntax describes how syntax elements are organized, and semantics explains what each syntax element means. All video coding standards specify how the codec works by defining syntax and semantics.
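Returning to the transform coding and quantization steps described above, the following Python sketch shows why a smooth block compresses well: after a 2-D DCT and a coarse quantization, almost all coefficients become zero. It is an illustration only. It uses the textbook floating-point DCT-II on a 4x4 block, whereas H.264 itself uses an integer approximation of the transform; the block contents, the step size of 10, and all function names are assumptions made for this example.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


def quantize(coeffs, step):
    """Map each coefficient to a small integer level; this is the lossy step."""
    return [[round(c / step) for c in row] for row in coeffs]


# A smooth block: a gentle gradient, like a patch of wall or sky.
block = [[100 + 2 * x + y for x in range(4)] for y in range(4)]
levels = quantize(dct_2d(block), step=10)
for row in levels:
    print(row)
```

The top-left value is the DC level; the zeros elsewhere are what entropy coding later represents very cheaply. A larger step gives more zeros and lower quality.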
In H.264, syntax elements are organized into five levels: sequence, picture, slice, macroblock (MB), and sub-macroblock (see the figure in the original post).

Layering helps save bits: information shared by a lower layer can be stored once in the layer above instead of being carried by every lower-level structure. In the H.264 hierarchy, however, the layers do not depend strongly on one another, which improves robustness. Packet-switched networks are error-prone, and with strong dependencies, once a header is lost all the data after it becomes unusable.

Compared with previous standards, H.264 removed the sequence layer and the picture layer (they still exist conceptually). Most of the syntax elements that used to belong to the sequence and picture headers were moved into the sequence parameter set (SPS) and the picture parameter set (PPS), and the remaining syntax elements were placed in the slice layer. A parameter set is an independent data unit that does not depend on any syntax element outside the parameter sets, so it can be transmitted separately and given extra protection.

With the sequence and picture layers removed, the hierarchy looks as shown in the figure in the original post. From that figure we can see that a picture consists of multiple slices, that slice data references a PPS, and that the PPS in turn references an SPS, while PPS and SPS can be transmitted separately and given extra protection.

How is the data organized in the three layers of slice, macroblock, and sub-macroblock? (See the figure in the original post.)

skip_run: when a picture is coded with prediction, H.264 allows "skipped" macroblocks in the picture. A skipped macroblock carries no data of its own; the decoder recovers it from the data of already reconstructed macroblocks.
mb_type: the macroblock type, such as a macroblock of an I frame or of a P frame (note: for frame types, you can search online with the keywords I frame, P frame, B frame, SP frame, SI frame).
mb_pred and sub_mb_pred: the prediction information of the predictive coding process, such as how the macroblock is partitioned and which macroblock it references.
residual: the prediction residual, that is, the difference between the prediction block and the data of this block.

The macroblock is the basic unit of decoding: the decoder decodes according to the prediction information and the residual data.

Functional layering

Besides the layering of syntax elements, H.264 is also divided functionally into two layers: the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL).

VCL data is the output of the encoding process and is organized in the five-level structure described above. Before transmission or storage, VCL data is first encapsulated in NAL units. Each NAL unit consists of a raw byte sequence payload (RBSP, that is, the VCL data) and a header describing the RBSP.

When transmitted over a packet-switched network, NAL units are independent and need no delimiter between them. When stored on disk, however, NAL units are stored back to back, so a start code must be introduced to separate them. The start code is the three-byte sequence 0x000001; if padding is needed, several zero bytes are added before the start code.

To keep coded data from colliding with the start code, the following "emulation prevention" (in effect, escaping) rules are defined (00 marks the end of a NAL unit, 01 marks the start of a NAL unit, 03 is used for escaping, and 02 is reserved):

0x000000 becomes 0x00000300
0x000001 becomes 0x00000301
0x000002 becomes 0x00000302
0x000003 becomes 0x00000303

When the encoder detects one of these sequences in the data, it inserts 0x03 before the last byte; when the decoder detects 0x000003, it discards the trailing 0x03.
With the escape rule in place, the decoder can treat the data between a 0x000001 start code and the following 0x000000 (or the next 0x000001 start code) as one NAL unit.

The structure of the NAL unit and the definition of the NAL unit types are shown in figures in the original post.

From nal_unit_type we can see that the basic unit in which coded data is transmitted is the slice, and a slice in turn contains macroblocks and sub-macroblocks. In fact, intra prediction is also confined within a slice, and different slices do not reference each other, mainly to limit how far an error can spread when one occurs.

To summarize so far: the basic unit of an H.264 bitstream is the NAL unit, and the most important data carried in NAL units are the parameter sets and the slice data. The basic unit of decoding is the macroblock: the decoder reconstructs the original data from the prediction information and the residual data. Decoded macroblocks are assembled into slices, slices into pictures, and picture after picture forms the video!

One more thing worth mentioning: how exactly are these layers stitched together? If we designed this ourselves, we might add a display order to each picture so that pictures can be played in order, number the slices so that they can be assembled into a picture, and number the macroblocks so that they can be assembled into a slice.
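To make the NAL unit layout summarized above concrete, here is a minimal Python sketch that splits a start-code-delimited byte stream into NAL units and decodes the one-byte NAL header (1 bit forbidden_zero_bit, 2 bits nal_ref_idc, 5 bits nal_unit_type). The helper names and the sample bytes are made up for illustration, only a handful of unit types are named, and the code assumes the input has already been escaped so that 0x000001 appears only as a start code.

```python
import re

# A few nal_unit_type values; the full table is in the H.264 spec.
NAL_TYPE_NAMES = {1: "non-IDR slice", 5: "IDR slice", 6: "SEI", 7: "SPS", 8: "PPS"}


def split_nal_units(stream: bytes):
    """Return the NAL units found between 0x000001 start codes."""
    starts = [m.end() for m in re.finditer(b"\x00\x00\x01", stream)]
    units = []
    for i, begin in enumerate(starts):
        end = len(stream) if i + 1 == len(starts) else starts[i + 1] - 3
        # Zero bytes before the next start code are padding, not payload.
        units.append(stream[begin:end].rstrip(b"\x00"))
    return units


def parse_nal_header(first_byte: int):
    """Decode the three fields packed into the first byte of a NAL unit."""
    return {
        "forbidden_zero_bit": first_byte >> 7,
        "nal_ref_idc": (first_byte >> 5) & 0x3,
        "nal_unit_type": first_byte & 0x1F,
    }


# Hypothetical stream: SPS, PPS, IDR slice, each with dummy payload bytes.
stream = (b"\x00\x00\x00\x01\x67\xaa\xbb"
          b"\x00\x00\x00\x01\x68\xcc"
          b"\x00\x00\x01\x65\xdd\xee")
for unit in split_nal_units(stream):
    header = parse_nal_header(unit[0])
    name = NAL_TYPE_NAMES.get(header["nal_unit_type"], "other")
    print(header, name, len(unit), "bytes")
```

A real parser would go on to remove the emulation prevention bytes from each unit and then decode the SPS, PPS, or slice header fields.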
In fact, H.264's scheme differs from these simple ideas in a few ways:

Besides its display order (picture order count, POC), each picture also has a decoding order (frame_num), because inter prediction can be bidirectional, so the decoding order may differ from the display order.
Macroblocks are not numbered, because all the macroblocks of a slice are in the same NAL unit and are arranged in order, so no extra number is needed.
Slices are not numbered either, but each slice indicates the position of its first macroblock within the whole picture (first_mb_in_slice), so we know where the slice belongs in the picture, which has the same effect as numbering.
As for overall information about the whole video and each picture, such as width and height, the relevant fields are described in the SPS and PPS.

Specific syntax and semantics

Originally I wanted to briefly list the syntax elements of each layer (and I actually did), but that could not cover the details, and listing so many syntax elements looked like mere piling up, so I deleted it. I strongly recommend that interested readers read the book or the H.264 spec; those who work on video codecs should know it by heart.

I find the way H.264 describes its syntax very clever: it defines the data format in the form of decoder pseudocode, which kills two birds with one stone.

The syntax of H.264 has been carefully designed, and the syntax elements that make it up are both interdependent and independent of each other. Dependence reduces redundant information and improves coding efficiency; independence makes communication more robust and limits the spread of errors when they occur.

H.264 encoding process

The H.264 specification does not specify the structure or implementation of the encoder. As long as the encoder produces a bitstream that conforms to the specification, the encoding process is very flexible.
However, its basic structure is still the framework mentioned in the first part: predictive coding, transform coding, and entropy coding. The basic structure of the encoder is shown in a figure in the original post.

The most complex part, and the one with the most room for variation, is predictive coding. The most important step in predictive coding, and also the most compute-intensive, is the search performed during motion estimation.

In addition, whatever the structure of the encoder, encoding control (rate control) is the core problem of an encoder implementation. During encoding, the size of the encoded data cannot be controlled directly; it can only be influenced by adjusting the quantization parameter (QP) of the quantization step. Because the relationship between QP and the encoded data size is not deterministic, the encoder's rate control cannot be very precise and basically works by trial: either change the quality of the subsequent macroblocks midway, or re-encode and change the quality of all macroblocks.

H.264 decoding process

Decoding is the reverse of encoding: entropy decoding, inverse transform, and prediction-based reconstruction.

The H.264 specification specifies the structure of the decoder, so the decoding of one macroblock can be summarized as follows: entropy decoding, inverse quantization, and inverse transform are applied in turn to obtain the residual data; the prediction information of the macroblock is then used to find the already decoded reference block; finally, the decoded reference block and the residual data of this block are combined to obtain the actual data of this block. After macroblocks are decoded, they are assembled into slices, and slices are assembled into pictures.

The basic structure of the decoder is shown in a figure in the original post.

H.264 scalable coding

Scalable Video Coding (SVC) essentially decomposes the video information by importance and encodes each part according to its own statistical characteristics. Generally, the video is encoded into one base layer and a set of enhancement layers.
The base layer contains the essential information and can be decoded independently. The enhancement layers depend on the base layer and refine its information; the more enhancement layers are decoded, the higher the quality of the recovered video.

SVC usually comes in three kinds:

Spatial scalability: the video can be decoded at multiple resolutions;
Temporal scalability: at the same resolution, the video can be decoded at multiple frame rates;
Quality scalability: at the same frame rate and resolution, the video can be decoded at multiple bitrates.

The implementation details of SVC are not covered here; interested readers can consult the relevant material.

Summary

In this article I have tried to answer two questions about the H.264 codec:

How does the H.264 codec work?
What is the structure of an H.264 bitstream?

Given the limited space, this article cannot explain every concept involved. Readers without the background will need to consult a lot of specialist material, while readers with the background do not necessarily need a summary like this one. The article is therefore mostly useful as a way of organizing my own thinking; please bear with me.

Finally, in the current wave of AI, video coding will surely be combined with AI. I think AI can play a large role in at least the following parts of the video coding process:

Motion estimation: choosing the search strategy should be a step where AI can help;
Adaptive block partitioning: AI can preprocess the image and analyze the distribution of its detail;
Encoding control: choosing the encoding strategy according to scene and content is another place where AI can add a lot of value.