Optimization of H.264 Deblocking Filter Performance on the Blackfin BF533 Processor

Introduction

In existing block-based video coding and decoding systems, blocking artifacts appear when the bit rate is low, and the new video coding standard H.264 is no exception. There are two main causes. First, the residual is transformed block by block with an integer transform; if the transform coefficients are then quantized with a large quantization step, the block edges of the decoded, reconstructed image become discontinuous. Second, the error introduced by interpolation in motion compensation produces blocking artifacts in the image reconstructed after the inverse transform. Left unprocessed, the artifacts accumulate across reconstructed frames and seriously degrade image quality and compression efficiency. To solve this problem, H.264 uses a relatively complex adaptive deblocking filter that removes the blocking artifacts effectively. How to optimize the deblocking filter algorithm for real-time video decoding, reducing computational complexity while improving reconstructed image quality, has therefore become a key problem in H.264 decoding.

1. Deblocking filtering of H.264

1.1 Filtering principle

A large quantization step causes a relatively large quantization error, which can turn the continuous gray levels between pixels at the border of adjacent blocks into a step change. Subjectively this appears as a "false edge", the blocking artifact. Removing the artifact means restoring these step changes to small or nearly continuous gray-level changes while keeping the total energy of the image unchanged. At the same time, damage to the real edges of the image must be kept to a minimum.

1.2 Adaptive filtering process

In H.264, deblocking filtering is carried out macroblock by macroblock, in units of 16 × 16 pixels, and works on the 4 × 4 sub-blocks within each macroblock.
The edges between sub-blocks are filtered vertical edges first, then horizontal edges, until all edges in the reconstructed image (except the image borders) have been filtered. The edges are shown in Figure 1. A 16 × 16 luminance macroblock has 4 vertical edges and 4 horizontal edges, each divided into 16 pixel edges. The corresponding 8 × 8 chroma block has 2 vertical edges and 2 horizontal edges, each divided into 8 pixel edges. The pixel edge is the basic unit of filtering.

1.2.1 Adaptivity of the filter at two levels

The deblocking filter in H.264 achieves good results because it adapts at the following two levels.

1) Adaptivity at the 4 × 4 sub-block level

Filtering is based on the pixel edges of each sub-block. A parameter BS (boundary strength) is defined for each pixel edge, and the filtering strength and the pixels involved are adjusted adaptively according to it. A chroma pixel edge takes the same strength as the corresponding luminance pixel edge. Let P and Q be two adjacent 4 × 4 sub-blocks; the strength of the pixel edge between them is obtained by the steps in Figure 2. The larger the BS value, the stronger the filtering applied on both sides of the corresponding edge. The value is set according to the cause of the blocking artifact. For example, blocking is pronounced in intra-predicted sub-blocks, so their edges are assigned a larger BS value and filtered strongly.

2) Adaptivity at the pixel level

Good filtering results depend on correctly distinguishing the false edges caused by quantization error and motion compensation from the real boundaries in the image. In general, the pixel gradient across a real boundary is larger than that across a false boundary.
Therefore, the filter sets a threshold α on the gray-value difference between the pixels on either side of the edge, and a threshold β on the gray-value difference between adjacent pixels on the same side, in order to tell real boundaries from false ones. The values of α and β depend mainly on the quantization step. When the quantization step is large, the quantization error is also large, blocking is pronounced and false boundaries appear easily, so the thresholds are raised and the filtering conditions relaxed. Conversely, the thresholds are lowered when the quantization step is small. This is the adaptivity. The sample positions are shown in Figure 3. If all conditions are met, filtering starts.

In addition to these two levels of adaptivity, the filtering strength can be adjusted at the slice level through the offsets loopfilteralphac0offset and loopfilterbetaoffset. For example, when the transmission bit rate is low, blocking is pronounced, and the receiver wants an image of relatively good subjective quality, the encoder can set loopfilteralphac0offset and loopfilterbetaoffset in the slice header to positive values. This raises α and β, strengthens the filtering, and improves subjective quality by removing more of the blocking. For high-resolution images, on the other hand, negative offsets can be transmitted to weaken the filtering and preserve as much image detail as possible.

1.2.2 Filtering the adjacent pixels according to the BS value of each pixel edge

If the current pixel edge meets the filtering conditions, the filter corresponding to its BS value is selected, and the result is clipped appropriately to prevent the image from being blurred.
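
The filtering conditions referred to here can be sketched as follows. This is a minimal illustration in C, not the JM8.6 code. The function names clip3, threshold_index and edge_needs_filtering are hypothetical, α and β are passed in as arguments rather than looked up from the standard's tables, and the clamping of the table index to 0..51 reflects our reading of the H.264 specification.

```c
#include <stdlib.h>

/* Clamp v to the range [lo, hi]. */
static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Index used to look up alpha (or beta): the average QP of the two
 * blocks plus the slice-level offset, clamped to 0..51. A positive
 * offset selects a larger threshold, hence stronger filtering.
 * (Hypothetical helper; the threshold tables themselves are omitted.) */
static int threshold_index(int qp_p, int qp_q, int offset)
{
    int qp_avg = (qp_p + qp_q + 1) >> 1;
    return clip3(0, 51, qp_avg + offset);
}

/* Decide whether one pixel edge is filtered.
 * The samples across the edge are laid out as  p1 p0 | q0 q1. */
static int edge_needs_filtering(int bs, int alpha, int beta,
                                int p1, int p0, int q0, int q1)
{
    return bs != 0
        && abs(p0 - q0) < alpha   /* step across the edge is small */
        && abs(p1 - p0) < beta    /* P side is smooth */
        && abs(q1 - q0) < beta;   /* Q side is smooth */
}
```

A large step across the edge (at or above α) is treated as a real boundary and left alone, which is how the filter avoids smoothing genuine image edges.
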
When BS is 1, 2 or 3, a 4-tap linear filter is applied to the inputs p1, p0, q0 and q1 to produce new values of p0 and q0. If a false boundary also exists inside the block, the values of p1 and q1 are adjusted further. When BS is 4, the edge is a macroblock edge coded in intra mode, and strong filtering is applied to improve image quality. For the luminance component, if the condition |p0 - q0| < ((α >> 2) + 2) and |p2 - p0| < β holds, a 5-tap filter is applied to p0 and p2 and a strong 4-tap filter to p1. If the condition does not hold, only the weaker 3-tap filter is applied to p0, and p1 and p2 are left unchanged. For the chrominance components, p0 is filtered with the 3-tap filter if the above conditions are met; if they are not met, no pixel values are modified. The samples q0, q1 and q2 are filtered in the same way as p0, p1 and p2.

2. Characteristics and structure of the BF533

Our H.264 deblocking filter is implemented on the Blackfin ADSP-BF533 processor from Analog Devices (ADI). The Blackfin family of DSPs has the following main characteristics:

a) Highly parallel computing units. The core of the Blackfin architecture is the DAU (data arithmetic unit), which contains two 16-bit MACs (multiplier-accumulators), two 40-bit ALUs (arithmetic logic units), one 40-bit barrel shifter and four 8-bit video ALUs. Each MAC can perform a 16-bit by 16-bit multiplication on four independent data operands in a single clock cycle. A 40-bit ALU can add two 40-bit numbers or four 16-bit numbers. This architecture handles 8-bit, 16-bit and 32-bit data operations flexibly.

b) Dynamic power management. By changing its voltage and operating frequency, the processor can consume less power than other DSPs. The Blackfin architecture allows voltage and frequency to be adjusted independently, which minimizes the energy consumed by each task and gives a good balance between performance and power consumption.
It is well suited to real-time video codecs, especially real-time motion video processing with strict power requirements.

c) High-performance address generators. Two DAGs (data address generators) produce the addresses for the compound load and store operations needed by advanced DSP filtering. They support bit-reversed addressing, circular buffering and other addressing modes, which makes programming more flexible.

d) Hierarchical memory. The memory hierarchy shortens the core's memory access time, giving maximum data throughput, lower latency and less processor idle time.

e) Dedicated video instructions. The processor provides instructions for operations common in video compression standards, such as the DCT (discrete cosine transform) and Huffman coding. These video instructions also remove the complicated communication otherwise needed between a host processor and a separate video codec.

These features help to shorten the time to market of end applications and reduce overall system cost. The ADSP-BF533 we use can run continuously at 600 MHz and has a unified 4 GB address space. Its L1 instruction memory is 80 KB of SRAM, of which 16 KB can be configured as a 4-way set-associative cache. It has two L1 data memories of 32 KB SRAM each, half of which can be configured as cache. It also integrates a rich set of peripherals and interfaces.

3. Optimized implementation of the H.264 deblocking filter on the BF533

The optimization of the deblocking filter on the Blackfin BF533 is divided into three levels: system-level optimization, algorithm-level optimization and assembly-level optimization.

3.1 System-level optimization

Enable the compiler's optimization option on the DSP platform and set it to optimize for maximum speed.
Also enable automatic inlining and interprocedural optimization. These settings let the program make full use of the hardware performance of the Blackfin BF533.

3.2 Algorithm-level optimization

The deblocking filter of the JM8.6 reference model was modified as needed and ported to our existing H.264 baseline profile decoder on the Blackfin BF533, and its running time was measured on test sequences. The sequences paris.cif, mobile.cif, foreman.cif and claire.cif, at a bit rate of about 400 kbit/s, were selected. Deblocking filtering consumes clock cycles equivalent to about 1600 MHz to 1800 MHz. Even after system-level optimization, the computational load is still very large and the efficiency very low, which is a considerable burden for a Blackfin BF533 with a continuous operating frequency of 600 MHz. An analysis of the deblocking filter program in JM8.6 shows the main reasons for its low efficiency:

a) The logical relationships between functions in the algorithm are complex, with many conditional tests, jumps and function calls.

b) The most time-consuming part, the loops inside the functions, contains a large amount of repeated calculation, which sharply increases the computational load.

c) Much of the data used by the algorithm, such as motion vectors and image luminance and chroma data, is stored in slow off-chip SDRAM, and the frequent accesses during filtering sharply increase the data transfer time.

The algorithm is improved as follows to address these causes.

3.2.1 Simplifying the complex functions and loops in the original program

Code size and execution speed trade off against each other. Conditional tests can make the code very compact, but execution becomes slower because the machine has more tests to evaluate. Conversely, removing the tests and expanding the program often reduces the cycle count, but the code grows.
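
The trade-off can be illustrated with a small sketch. It is not taken from JM8.6: the two function names, the use of a simple "step below α" count as the loop body, and the memory layout (a vertical edge runs down the rows, a horizontal edge along a row) are assumptions made for illustration. Both functions return the same result; the first is compact but tests the direction for every pixel edge, the second tests it once and unrolls each loop by four at the cost of longer code.

```c
#include <stdlib.h>

/* Compact form: one loop, direction tested for each of the 16 pixel
 * edges. dir == 0 is a vertical edge, dir != 0 a horizontal edge.
 * img points at the first q0 sample of the edge. */
static int count_edges_compact(const unsigned char *img, int stride,
                               int dir, int alpha)
{
    int n = 0;
    for (int i = 0; i < 16; i++) {
        const unsigned char *q0 = dir ? img + i : img + i * stride;
        const unsigned char *p0 = dir ? q0 - stride : q0 - 1;
        if (abs(*p0 - *q0) < alpha)
            n++;
    }
    return n;
}

/* Expanded form: direction tested once, each case has its own loop,
 * unrolled by four. Longer code, fewer tests and jumps per pixel edge. */
static int count_edges_expanded(const unsigned char *img, int stride,
                                int dir, int alpha)
{
    int n = 0;
    if (dir) {
        const unsigned char *q = img;
        const unsigned char *p = img - stride;
        for (int i = 0; i < 16; i += 4) {
            n += abs(p[i]     - q[i])     < alpha;
            n += abs(p[i + 1] - q[i + 1]) < alpha;
            n += abs(p[i + 2] - q[i + 2]) < alpha;
            n += abs(p[i + 3] - q[i + 3]) < alpha;
        }
    } else {
        const unsigned char *q = img;
        for (int i = 0; i < 16; i += 4, q += 4 * stride) {
            n += abs(q[-1]             - q[0])          < alpha;
            n += abs(q[stride - 1]     - q[stride])     < alpha;
            n += abs(q[2 * stride - 1] - q[2 * stride]) < alpha;
            n += abs(q[3 * stride - 1] - q[3 * stride]) < alpha;
        }
    }
    return n;
}
```

Whether the expanded form is actually faster depends on the compiler and the target, so it should be confirmed by measuring cycle counts on the processor.
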
The deblocking filter code in JM8.6 is compact. We therefore simplify the relationships between functions and accept longer code in exchange for faster execution. For the most time-consuming loop bodies, the loops are rewritten where appropriate and unrolled several times, which effectively reduces the computational load. Reducing the number of function calls and rewriting if-else statements are also effective optimizations.

3.2.2 Removing redundant code and repeated calculations from the reference code

a) The reference code is the deblocking filter module of JM8.6, which can filter H.264 streams of every profile and level. Our decoder targets the baseline profile and only filters I frames and P frames, so the parts of the reference code that handle B frames, SP/SI frames, field mode and frame/field adaptive mode can be removed.

b) When obtaining the filter strength BS and when filtering luminance and chroma, the program must obtain the availability of the macroblocks adjacent to the macroblock containing the current sub-block (that is, whether the neighboring macroblock can be used), which is done by calling the getneighbor function. Because filtering proceeds edge by edge within the macroblock, vertical first and then horizontal, this information is the same for every pixel edge on a given edge. It can therefore be obtained once per edge instead of being re-evaluated inside the loop. Moreover, the filtering algorithm only needs the availability of the macroblocks above and to the left of the current macroblock, so the redundant operations that fetch information for the upper-left and upper-right macroblocks can be removed.
In addition, when the function that obtains the filtering strength in the horizontal direction calls getneighbor, the parameters take restricted values: luma is fixed at 1, xn takes the values -1, 3, 7 and 11, and yn ranges from 0 to 15. Under these conditions many of the if-else statements in getneighbor are tests that can never succeed, and these redundant tests consume many clock cycles. The probability of each remaining branch is also analyzed, and the most probable branch is tested first, which further speeds up the function. After simplification, the getneighbor function contains only a few statements, which greatly reduces the amount of computation.

c) In the JM8.6 reference code, the BS values of the 16 × 4 = 64 pixel edges of a luminance macroblock are obtained one by one. An analysis of the conditions that determine BS shows that the four pixel edges lying on the vertical or horizontal edge between the same two sub-blocks always have equal BS values. For each edge it is therefore enough to obtain the BS values of the 1st, 5th, 9th and 13th pixel edges and assign them to the other corresponding pixel edges. Because BS is obtained inside a loop and requires many tests and operations, this improvement greatly reduces the computational load.

d) Many statements inside loops in the reference code do not depend on the loop variables. These statements can be moved outside the loop to avoid redundant calculation.

3.2.3 Using BPP block processing to reduce frequent off-chip data access

Frequent access to off-chip data slows the program down, so BPP block processing is used for optimization. Three buffers are allocated in on-chip L1 memory to hold the luminance component, the chroma U component and the chroma V component to be filtered.
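
A minimal sketch of this staging idea for the luminance buffer is given below. It is an illustration under stated assumptions, not the authors' implementation: the names luma_buf, region_offset, stage_luma and unstage_luma are hypothetical, plain memcpy stands in for whatever transfer mechanism the real decoder uses, and the choice of 4 extra columns to the left and 4 extra rows above (when those neighbors exist) is our reading of the region sizes quoted in the text, taken as rows × columns.

```c
#include <string.h>

#define MB_SIZE 16   /* luma macroblock width and height */
#define PAD      4   /* neighbor rows/columns staged with the macroblock (assumed) */

/* Working buffer for one luma macroblock plus its neighbors. On the
 * BF533 it would be placed in L1 data memory with a toolchain-specific
 * placement directive, which is omitted here. */
static unsigned char luma_buf[(MB_SIZE + PAD) * (MB_SIZE + PAD)];

/* Offset in the frame of the top-left sample of the region staged for
 * macroblock (mbx, mby); also reports the region width and height. */
static size_t region_offset(int stride, int mbx, int mby, int *w, int *h)
{
    int left = mbx > 0 ? PAD : 0;   /* first column has no left neighbor */
    int top  = mby > 0 ? PAD : 0;   /* first row has no upper neighbor */
    *w = MB_SIZE + left;
    *h = MB_SIZE + top;
    return (size_t)(mby * MB_SIZE - top) * (size_t)stride
         + (size_t)(mbx * MB_SIZE - left);
}

/* Copy the region from the (off-chip) frame into luma_buf, row by row. */
static void stage_luma(const unsigned char *frame, int stride,
                       int mbx, int mby, int *w, int *h)
{
    const unsigned char *src = frame + region_offset(stride, mbx, mby, w, h);
    for (int y = 0; y < *h; y++)
        memcpy(luma_buf + (size_t)y * (size_t)*w,
               src + (size_t)y * (size_t)stride, (size_t)*w);
}

/* Copy the filtered region from luma_buf back into the frame. */
static void unstage_luma(unsigned char *frame, int stride, int mbx, int mby)
{
    int w, h;
    unsigned char *dst = frame + region_offset(stride, mbx, mby, &w, &h);
    for (int y = 0; y < h; y++)
        memcpy(dst + (size_t)y * (size_t)stride,
               luma_buf + (size_t)y * (size_t)w, (size_t)w);
}
```

The filter would run on luma_buf between the two calls, so that every sample it touches is read from and written to on-chip memory. The chroma U and V buffers could be handled the same way with an 8 × 8 block size.
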
According to the range of pixels that filtering each macroblock may involve, the 396 macroblocks of a CIF frame are divided into four classes. Class A is the first macroblock, whose upper and left edges are both image edges; the luminance data read before filtering is 16 × 16, and the chroma data is two 8 × 8 blocks. Class B is the remaining macroblocks of the first macroblock row, excluding the first macroblock; their upper edge is an image edge, the luminance data read before filtering is 16 × 20, and the chroma data is two 8 × 12 blocks. Class C is the remaining macroblocks, excluding the first macroblock, in the first macroblock column