Design of a Deblocking Filter Using a 5-Stage Pipeline

Introduction

Image encoding and decoding technology is the key to multimedia technology, and H.264/AVC is one of the most advanced video compression standards. Its main features are a small-size integer cosine transform, 1/4-pixel motion estimation accuracy, multiple-reference-frame prediction, context-based variable-length coding and an in-loop deblocking filter. Because the deblocking filter accounts for about 1/3 of the operations of the whole decoder, its design has become the bottleneck of the whole decoder design. This article studies a novel design of the in-loop deblocking filter. The design adopts a 5-stage pipelined deblocking module and uses a hybrid filtering order together with an out-of-order storage update mechanism to keep the pipeline running smoothly, so that a 16 × 16 macroblock requires only 198 clock cycles.

1 Deblocking filter in H.264/AVC

In block-based video coding, each block is encoded and decoded independently of the others. Prediction, compensation, transform and quantization therefore produce discontinuities at the boundaries between blocks. To remove them, the H.264/AVC standard applies an in-loop deblocking filter to the boundary distortion of each 16 × 16 macroblock after reconstruction. There are two approaches to deblocking: post-processing filtering and in-loop filtering. H.264/AVC adopts in-loop filtering (see Figure 1), that is, the filtered frame is used as the reference frame for later prediction. Compared with the earlier filters of H.263 or MPEG, the filter adopted by H.264 works on smaller 4 × 4 blocks. Each block boundary is filtered conditionally, according to slice-level and macroblock-level characteristics and the gradient of the pixels across the boundary.
Each pixel of the reconstructed frame has to be read back from external memory, either to be filtered or to serve as an adjacent pixel when deciding whether the current pixel needs filtering. These operations consume a large amount of memory bandwidth and modify pixel values. The deblocking filter module designed here uses pipelining to improve system throughput. An ideal pipeline runs efficiently only if adjacent filtering operations have no data dependence. References [3,4] adopt a non-pipelined architecture and therefore cannot improve system throughput. In a pipelined architecture, if the filtering order and the memory access order are not optimized, the resulting data and structural hazards greatly reduce pipeline efficiency. Some designs use dual-port on-chip SRAM to reduce off-chip memory bandwidth and increase throughput, but dual-port memory occupies a large area and increases power consumption. In a non-pipelined filter the operations (condition judgment, table look-up, pixel calculation and so on) are executed sequentially, that is, each clock cycle handles only one type of operation, so the maximum system frequency it can achieve is much lower than that of a pipelined filter.

The boundary filtering order strongly affects the performance of the deblocking filter. The H.264/AVC standard describes the filtering order within each macroblock, and this order can be improved as long as the data dependence of the filtering is preserved. Existing filtering orders fall into two types, sequential and hybrid. However, these orders and their storage update mechanisms target non-pipelined structures. Applied directly to the pipelined design described here, they may cause serious hazards and reduce pipeline performance.

2 Storage management and filtering algorithm of the deblocking filter

The H.264/AVC standard uses the 4 × 4 block as the basic unit of filtering. There are five filtering strengths, BS = 0, 1, 2, 3 and 4, and three filtering modes: strong filtering, standard filtering and bypass. Strong filtering affects a total of 6 pixels on the two sides of the boundary, standard filtering affects a total of 4, and bypass modifies none. The standard stipulates that vertical boundaries are filtered first and horizontal boundaries second. The next macroblock can be filtered only after all vertical and horizontal boundaries of the current one have been filtered. Within a macroblock, the luma part is filtered first and the chroma part second; within the chroma part, Cb is filtered first and Cr second. The filtering order for a whole 16 × 16 macroblock is shown in Figure 2.

(1) Boundary filtering strength and memory for pixel filtering

According to the H.264/AVC standard, the pixels on both sides of a boundary are filtered conditionally. The condition depends on the boundary strength BS and on the gradient of the pixels across the boundary. BS takes the value 0, 1, 2, 3 or 4 and is assigned to each boundary before filtering. BS = 4 selects strong filtering and BS = 0 means that no filtering is required (bypass mode); BS = 1, 2 and 3 select medium-strength filtering. The filtering strength of a chroma boundary is the same as that of the corresponding luma boundary. Filtering across a horizontal or vertical boundary requires the 8 pixels P0 ~ P3 and Q0 ~ Q3 on its two sides; 6 pixels (P0 ~ P2 and Q0 ~ Q2) or 4 pixels (P0, P1 and Q0, Q1) are updated.

For a 16 × 16 macroblock, the left adjacent pixels, the upper adjacent pixels and the pixels of the macroblock itself must be supplied to the filter. On macroblock boundaries, such as the leftmost and topmost boundaries, P0 ~ P3 and Q0 ~ Q3 come from different macroblocks (pixels of the adjacent macroblock and pixels of the current macroblock respectively). For boundaries inside the 16 × 16 macroblock, P0 ~ P3 and Q0 ~ Q3 both come from the macroblock itself. At least four storage units are therefore required: a left adjacent pixel storage unit, an upper adjacent pixel storage unit, a storage unit for the pixels of the current macroblock, and a conversion buffer unit. Each storage unit is 32 bits wide. When filtering switches from vertical to horizontal boundaries, additional conversion buffers buf0 ~ buf3 cache the intermediate filtered data to simplify memory access. With the conversion buffers, a row or column of pixels (P0 ~ P3 and Q0 ~ Q3) is obtained in one clock cycle; without them it takes four clock cycles.

(2) Filtering algorithm

The basic idea of loop filtering is to decide whether a boundary is a real edge of the image or a blocking artifact created by coding. Real edges are not filtered; artificial boundaries are filtered according to the gradient of the pixels across the boundary and the coding mode, with filter coefficients selected according to the filtering strength. A boundary with BS = 0 is not filtered. For a boundary with BS other than 0, the thresholds α and β, which depend on the quantization parameter, are evaluated and the adjacent pixels are filtered conditionally. Filtering takes place when BS is not 0 and the following three conditions, as defined in the standard, all hold:

|P0 − Q0| < α, |P1 − P0| < β, |Q1 − Q0| < β (1)
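The filtering decision can be sketched in a few lines. This is a minimal illustration rather than the design's hardware logic: it applies the three threshold tests of the H.264/AVC standard (|P0 − Q0| < α, |P1 − P0| < β, |Q1 − Q0| < β) to one line of pixels. The function name `needs_filtering` and the threshold values `ALPHA` and `BETA` are assumptions chosen for the example; in a real decoder the thresholds come from the look-up tables.

```python
def needs_filtering(bs, p, q, alpha, beta):
    """Return True if one line of pixels across a boundary should be filtered.

    p holds P0..P3 and q holds Q0..Q3; index 0 is the pixel nearest the boundary.
    alpha and beta are the thresholds, assumed to be looked up already.
    """
    if bs == 0:
        return False  # bypass mode: the boundary is left untouched
    return (abs(p[0] - q[0]) < alpha      # the step across the boundary is small
            and abs(p[1] - p[0]) < beta   # the P side is smooth
            and abs(q[1] - q[0]) < beta)  # the Q side is smooth


# Hypothetical threshold values, chosen only for illustration.
ALPHA, BETA = 40, 10

# A small step between two flat regions looks like a blocking artifact.
blocking_artifact = needs_filtering(2, [100] * 4, [108] * 4, ALPHA, BETA)

# A large step is treated as a real image edge and is preserved.
real_edge = needs_filtering(2, [100] * 4, [200] * 4, ALPHA, BETA)
```

With these sample values the first line is filtered and the second is not, which is the intended behaviour: small discontinuities are smoothed, genuine edges are kept.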
Computing α and β directly is difficult and consumes a lot of hardware resources, so they are obtained by look-up table (LUT). The pixel calculation falls into the following two cases.

(1) BS = 4

If the following two conditions hold, a strong 4-tap or 5-tap filter is applied to the adjacent pixels and modifies P0, P1 and P2:

|P2 − P0| < β, |P0 − Q0| < (α >> 2) + 2 (2)

If either condition of equation (2) fails, P1 and P2 are not filtered and only P0 is filtered weakly. For chroma boundaries, if equation (2) is true, only P0 and Q0 are filtered.

(2) BS = 1 ~ 3

The luma pixels P0 and Q0 are calculated as follows:

P0' = P0 + D_0, Q0' = Q0 − D_0

where D_0 is defined by the clipping operation

D_0 = Clip(−C0, C0, (((Q0 − P0) << 2) + (P1 − Q1) + 4) >> 3)

Here Clip(a, b, x) limits x to the range [a, b], C0 is derived from C1, and C1 is obtained from a two-dimensional LUT. Pixel P1 is modified, in the same way as P0 and Q0, only when equation (3) is true:

|P2 − P0| < β (3)

Pixels P2 and Q2 are not filtered when BS is not 4. When the chroma component is filtered, only P0 and Q0 are filtered, in the same way as for luma.

3 Pipelined filtering architecture

3.1 Pipeline analysis

Pipelining suits continuous batches of tasks. When an n-stage pipeline is full, the system processes n tasks in parallel in each cycle, which raises the processing speed of the whole batch and increases system throughput. If adjacent filtering operations have no data hazards and all stages are well balanced, the filtering process can be pipelined and sped up by a factor of n. If hazards are present, this speed-up cannot be reached. The main task is therefore to balance the pipeline stages, to distribute the total work as evenly as possible among them, and to avoid or eliminate hazards, so as to obtain a balanced and smooth pipeline architecture.
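The speed-up argument can be checked with a simple cycle-count model. The sketch below assumes that one task enters the pipeline per cycle and that each stage takes exactly one cycle; the function names and the `stall_cycles` parameter are hypothetical and only serve the illustration.

```python
def pipelined_cycles(n_tasks, n_stages, stall_cycles=0):
    """Cycles needed to push n_tasks through an n_stages pipeline.

    The first task needs n_stages cycles; every further task completes one
    cycle later. stall_cycles models extra cycles lost to hazards.
    """
    if n_tasks == 0:
        return 0
    return n_stages + (n_tasks - 1) + stall_cycles


def sequential_cycles(n_tasks, n_stages):
    """Cycles needed when each task finishes all stages before the next starts."""
    return n_tasks * n_stages


def speedup(n_tasks, n_stages, stall_cycles=0):
    """Ratio of non-pipelined to pipelined cycle counts."""
    return sequential_cycles(n_tasks, n_stages) / pipelined_cycles(
        n_tasks, n_stages, stall_cycles)
```

For 100 tasks on 5 stages the model gives 104 cycles instead of 500, a speed-up of about 4.8 that approaches 5 as the batch grows. Adding 100 stall cycles to the same batch drops the speed-up below 2.5, which is why removing hazards matters as much as balancing the stages.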
According to the implementation of the deblocking algorithm, most of the critical paths lie in the following operations.

(1) Look-up table operations that obtain the parameters α, β and C1. Before α and β can be looked up, a calculation based on the quantization parameter and the slice-level offset parameters is required. When BS = 1, 2 or 3, a further LUT operation obtains C1; this LUT is three times larger than the one for α and β.

(2) When BS = 4, a 4-tap or 5-tap filter is required, and the original P and Q pixel values must be shifted and added to obtain the final result.

3.2 Pipeline architecture

Based on the above analysis, a 5-stage pipeline is proposed to improve throughput, as shown in Figure 3. Because the whole task is distributed over different stages, the average filtering time is reduced. The tasks of the five pipeline stages are, in order: fetching pixels and filtering strength; threshold judgment; pre-filtering; secondary filtering; write-back.

Operation type conversion and reconfigurable path design: first, the operation types are converted, and all multiplication and division hardware is replaced by addition and shift hardware. When BS = 4, filtering is performed by filters with 3, 4 and 5 taps. Although the filters have different numbers of taps, hardware multiplexing and reconfiguration of the input data path are still possible. Because the expressions in the design use two-input additions, intermediate sums can be shared. In addition, the adder inputs are reconfigured for the different tap coefficients so that resources are shared. Similarly, when BS = 1, 2 and 3, reconfiguring the input path allows the adders and subtractors to be shared. See Table 1 for a comparison of resource usage before and after sharing.
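The following sketch illustrates the idea of replacing multiplications by shifts and sharing intermediate sums. It uses the strong (BS = 4) luma filter of the H.264/AVC standard for the P side. The grouping into the partial sums `b` and `c` is an assumption made for this example and is not necessarily the sharing scheme of the design summarized in Table 1.

```python
import itertools


def strong_filter_direct(p, q):
    """Strong luma filter for the P side, written with multiplications."""
    p0, p1, p2, p3 = p
    q0, q1 = q[0], q[1]
    new_p0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) // 8  # 5-tap
    new_p1 = (p2 + p1 + p0 + q0 + 2) // 4                   # 4-tap
    new_p2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) // 8      # 5-tap
    return new_p0, new_p1, new_p2


def strong_filter_shared(p, q):
    """Same results using only two-input additions, shifts and shared sums."""
    p0, p1, p2, p3 = p
    q0, q1 = q[0], q[1]
    b = p1 + (p0 + q0)                        # p1 + p0 + q0
    c = (p2 + b) + 2                          # p2 + p1 + p0 + q0 + 2
    new_p1 = c >> 2                           # reuses c directly
    new_p0 = ((c + b) + (q1 + 2)) >> 3        # reuses both b and c
    new_p2 = ((((p3 + p2) << 1) + c) + 2) >> 3  # doubling done by a shift
    return new_p0, new_p1, new_p2


def all_match(step=51):
    """Compare both forms over a coarse grid of 8-bit pixel values."""
    values = range(0, 256, step)
    for p0, p1, p2, p3, q0, q1 in itertools.product(values, repeat=6):
        p, q = (p0, p1, p2, p3), (q0, q1)
        if strong_filter_direct(p, q) != strong_filter_shared(p, q):
            return False
    return True
```

Both forms agree because the pixel values are non-negative, so integer division by 4 or 8 equals a right shift by 2 or 3. For P = (100, 100, 100, 100) and Q0 = Q1 = 108, either function returns (103, 102, 101), a smooth ramp toward the Q side.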
5 Pipeline hazards and hybrid filtering order

5.1 Causes of pipeline hazards

(1) Data hazard: arises when the destination result of one operation is needed as a source operand of another;
(2) Structural hazard: caused by limited memory bandwidth, heavy and frequent pixel accesses, and inefficient memory management;
(3) Control hazard: caused by branch statements or delayed waiting. The filtering of adjacent boundaries is relatively independent: once a boundary enters the pipeline, it cannot stop until the new pixel values are written back to memory in stage 5.

5.2 A novel hybrid filtering order

Traditional designs use the basic sequential filtering order of the H.264/AVC standard and take into account neither the data reuse and data dependence of adjacent boundaries nor the read and write access latency of the memory. A novel filtering order is therefore proposed here. It still follows the principle of left before right and top before bottom, but it takes the data dependence and reusability of adjacent boundaries into account, resolves the data and structural hazards, and avoids pipeline stalls. Filtering covers the luma part and the chroma part, 48 boundaries in total. The boundaries are filtered in ascending order of the numbers shown in Figure 4.

5.3 Novel storage update strategy

The external memory is 32 bits wide. To match the boundary filtering order proposed here and to avoid the structural hazards that the memory bandwidth limit would cause, and with them pipeline stalls, a novel memory update mechanism is proposed: different 4 × 4 blocks are allocated different time slots for pixel write-back.
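The count of 48 boundaries can be reproduced by enumerating the 4-pixel-long boundary segments of one macroblock. The sketch below lists them in the sequential order of the standard (vertical before horizontal, luma before Cb before Cr), not in the hybrid order of Figure 4, which is not reproduced here. It assumes 4:2:0 chroma, 4 lines of pixels per segment and one line entering the 5-stage pipeline per cycle; none of these assumptions is stated explicitly in the text.

```python
def macroblock_edges():
    """List the 4-pixel-long boundary segments of one 16 x 16 macroblock.

    Each entry is (plane, direction, edge_index, segment_index).
    Assumes 4:2:0 sampling, so each chroma plane is 8 x 8.
    """
    edges = []
    for plane, size, n_edges in (("Y", 16, 4), ("Cb", 8, 2), ("Cr", 8, 2)):
        n_segments = size // 4  # 4-pixel segments along each boundary
        for direction in ("vertical", "horizontal"):
            for edge in range(n_edges):
                for segment in range(n_segments):
                    edges.append((plane, direction, edge, segment))
    return edges


LINES_PER_SEGMENT = 4  # assumed: one filter operation per line of pixels
PIPELINE_STAGES = 5

edges = macroblock_edges()
line_operations = len(edges) * LINES_PER_SEGMENT
# Ideal case: one line issued per cycle, no stalls.
ideal_cycles = line_operations + PIPELINE_STAGES - 1
```

Under these assumptions there are 32 luma and 16 chroma segments, 192 line operations, and an ideal count of 196 cycles. That is close to the 198 cycles reported for the design, which suggests the hybrid order leaves very few stall cycles, although the text does not break down where the remaining cycles go.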
The deblocking module is the last step of the whole decoding process. The other reconstruction steps, such as the intra prediction and inter prediction modules, are pipelined with the 4 × 4 block as the basic unit. Because of the data dependence between different boundaries, however, the deblocking module filters with the whole 16 × 16 macroblock as the basic unit. In addition, a macroblock can be filtered only after pixel reconstruction of the whole 16 × 16 macroblock is complete. Two SRAMs are therefore used: one provides pixels for pixel reconstruction and the other provides pixels for pixel filtering. When a macroblock has been processed, the two SRAMs exchange roles, which avoids the time and power overhead of transferring data between them. The top-level deblocking module DF_Top is simulated with a simulation tool, and the simulation results are shown in Figure 5.

6 Conclusion

The design is written in a hardware description language and verified on an FPGA platform. It adopts pipelining, a hybrid filtering order and a novel memory update mechanism. The upper limit of the real-time filtering frequency is about 200 MHz, and a 16 × 16 macroblock requires 198 clock cycles. Synthesis, timing analysis and power analysis were carried out with an HJTC CMOS process and the Synopsys DC tool. The timing meets the closure requirements, and the power consumed in filtering a single macroblock is about 2 μW, a large reduction in power consumption.
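The two reported figures, 198 cycles per macroblock and a clock of about 200 MHz, translate into a frame rate as follows. This is a rough upper bound for the deblocking stage alone: it ignores memory stalls between macroblocks and every other decoder stage. The frame sizes used below are examples and do not come from the text.

```python
CLOCK_HZ = 200_000_000  # reported upper limit of the filtering frequency
CYCLES_PER_MB = 198     # reported cycles per 16 x 16 macroblock


def macroblocks_per_frame(width, height):
    """Number of 16 x 16 macroblocks, rounding each dimension up."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def frames_per_second(width, height, clock_hz=CLOCK_HZ,
                      cycles_per_mb=CYCLES_PER_MB):
    """Frame rate the deblocking stage could sustain on its own."""
    cycles_per_frame = cycles_per_mb * macroblocks_per_frame(width, height)
    return clock_hz / cycles_per_frame


fps_1080p = frames_per_second(1920, 1080)  # 8160 macroblocks per frame
fps_cif = frames_per_second(352, 288)      # 396 macroblocks per frame
```

A 1920 × 1080 frame contains 8160 macroblocks, so the filter alone could sustain roughly 123 frames per second at 200 MHz, which leaves a wide margin over common real-time frame rates.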