Implementing the H.26L 4x4 Integer Transform on the TM1300

Key words: 4x4, TM1300, integer transform

H.26L is the next-generation video coding standard. Its coding performance exceeds that of all existing standards, including H.263+ and MPEG-4 (SP). This article analyzes the new coding features introduced by H.26L, focusing on the 4x4 integer transform, and proposes a fast transform algorithm implemented on the TM1300.

H.26L is the next-generation video coding standard. It was initially developed by the VCEG group of ITU-T. In November 2001, MPEG and VCEG jointly formed the JVT group to take part in drafting H.26L. Because of MPEG's participation, H.26L will be included as Part 10 of MPEG-4. Since the H.26L standard is still under development, this paper uses TML8, provided by JVT, for its tests.

The basic framework of H.26L source coding is similar to that of the currently popular video coding standards: a hybrid scheme that combines transform coding with predictive coding. Its excellent performance comes mainly from newly introduced coding features: the 4x4 integer transform, UVLC entropy coding, motion vectors with 1/4- to 1/8-pixel accuracy, motion estimation with multiple block sizes, and others. These techniques improve compression performance and error resilience from different angles. The 4x4 integer transform in particular is unique among video compression standards.

Although the H.26L standard is still under development, in preliminary tests its coding performance already exceeds that of existing standards, including H.263+ and MPEG-4 (Simple Profile). These tests show that H.26L can save 20% to 50% of the bit rate relative to H.263+ and up to 50% relative to MPEG-4 (SP). As the next-generation video coding standard, H.26L shows great promise.
1 The H.26L 4x4 integer transform

1.1 Introduction to the transform

In H.26L, the 4x4 integer transform can be seen as an integer version of the DCT. Its main job is to remove spatial correlation in the image, and it has the same properties as a 4x4 DCT. Consider the one-dimensional integer transform first. Let a, b, c, d be the four points to be transformed and A, B, C, D the corresponding four transform coefficients. The forward transform is:

A = 13a + 13b + 13c + 13d
B = 17a + 7b - 7c - 17d
C = 13a - 13b - 13c + 13d
D = 7a - 17b + 17c - 7d

The inverse transform is:

a' = 13A + 17B + 13C + 7D
b' = 13A + 7B - 13C - 17D
c' = 13A - 7B - 13C + 17D
d' = 13A - 17B + 13C - 7D

The relationship between a' and a is a' = 676a. In other words, a normalization step is required after the inverse transform so that the forward and inverse transforms are consistent in scale.

Likewise, the two-dimensional 4x4 integer transform is separable. Separating it reduces the computational complexity from O(N^4) to O(N^3).

1.2 Comparison with the 8x8 DCT

Compared with the traditional DCT, the 4x4 transform used by H.26L brings the following benefits to video coding:

(1) It helps reduce blocking and ringing artifacts and so improves image quality. Because the transform coefficients are quantized, high-frequency coefficients are lost, which produces blocking and ringing in the reconstructed image. The smaller 4x4 transform in H.26L suppresses both effectively.

(2) The integer transform reduces accumulated error. Traditionally, accumulated error has two sources: mismatch between the forward and inverse transforms, and quantization. The second is unavoidable if compression is to be achieved. However, because H.26L uses an exact integer transform, the forward and inverse transforms introduce no mismatch error, which effectively reduces the accumulated error.

(3) It is fast. The transform used by H.26L consists of simple integer equations, so the computation works on integers rather than floating-point numbers. This reduces the cost of a single transform and also suits fixed-point DSP implementations.

2 Implementation on the TM1300

The TM1300 is a 32-bit high-performance multimedia processor. Its core uses a VLIW (very long instruction word) architecture and can execute 5 operations simultaneously in each clock cycle. It supports highly parallel custom operations, which can greatly speed up the special-purpose computations found in digital signal processing and multimedia applications. A custom operation is invoked much like a C function call, which is convenient for programming.

Based on the characteristics of the 4x4 integer transform and of the TM1300 custom operations, this paper adjusts the transform as follows: the row transform is performed first, then the column transform. Because the results of the row transform fit within 16 bits, the data are repacked before the column transform is applied. This arrangement rests on the following two points.

First, the video input data are unsigned bytes and the TM1300 is a 32-bit processor, so accessing the data one word at a time improves access efficiency. Each point to be transformed is the difference between a value in the current data block and the corresponding value in the reference-frame data block. The current 4x4 data block (pointer P1) and the reference-frame 4x4 data block (pointer P2) are organized as follows.
P1: CA1, CB1, CC1, CD1    P2: RA1, RB1, RC1, RD1
    CA2, CB2, CC2, CD2        RA2, RB2, RC2, RD2
    CA3, CB3, CC3, CD3        RA3, RB3, RC3, RD3
    CA4, CB4, CC4, CD4        RA4, RB4, RC4, RD4

Second, the 8-bit multiply/accumulate custom operation can be used. A single operation performs four 8-bit multiply/accumulates, and 5 operations can execute in one machine cycle (CLK). Compared with ordinary (non-custom) multiply/accumulate, this reduces the number of operations and increases the parallelism of the program. Figure 1 is a schematic diagram of the IFIR8UI custom operation.

3 Experimental results

The fast 4x4 integer transform algorithm on the TM1300 exploits parallel operations and greatly reduces the amount of computation. Experiments show that one 4x4 integer transform takes 80 machine cycles when done with ordinary multiply and add operations, whereas the improved algorithm needs only 28 machine cycles. One 8x8 DCT on the TM1300 takes 180 machine cycles, which is also clearly more than the time for four 4x4 integer transforms (4 x 28 = 112 machine cycles). The transform coding stage of H.264 is therefore less computationally complex than that of other coding methods.
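
The transform equations of section 1.1 and the row-then-column ordering of section 2 can be sketched in a few lines of Python. This is only an illustrative model, not the TM1300 code measured above. All function names (forward_1d, inverse_1d, mac4, row_transform_packed and so on) are hypothetical. In particular, mac4 is a plain-Python stand-in for a 4-way 8-bit multiply/accumulate and does not claim to reproduce the actual IFIR8UI signature. Transforming the current and reference rows separately and then subtracting is an assumption about how such an operation might be applied to unsigned pixel bytes; it relies only on the linearity of the transform.

```python
# Forward transform matrix from section 1.1: one row per output coefficient.
T = [
    [13, 13, 13, 13],    # A
    [17, 7, -7, -17],    # B
    [13, -13, -13, 13],  # C
    [7, -17, 17, -7],    # D
]

def forward_1d(x):
    """Forward 1-D transform of four points (a, b, c, d) -> (A, B, C, D)."""
    return [sum(T[k][n] * x[n] for n in range(4)) for k in range(4)]

def inverse_1d(coeffs):
    """Inverse 1-D transform (transpose of T); the result is 676 * input."""
    return [sum(T[k][n] * coeffs[k] for k in range(4)) for n in range(4)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rows_then_columns(block, f):
    """Separable 2-D transform: apply f to every row, then to every column."""
    rows = [f(r) for r in block]
    cols = [f(c) for c in transpose(rows)]
    return transpose(cols)

def forward_2d(block):
    return rows_then_columns(block, forward_1d)

def inverse_2d(coeffs):
    # Scaled by 676 per dimension, i.e. 676 * 676 overall.
    return rows_then_columns(coeffs, inverse_1d)

def residual(cur, ref):
    """Difference between the current block and the reference block."""
    return [[c - r for c, r in zip(cr, rr)] for cr, rr in zip(cur, ref)]

def mac4(pixels, coeffs):
    """Hypothetical model of a 4-way 8-bit multiply/accumulate:
    four unsigned bytes times four signed bytes, summed."""
    assert all(0 <= p <= 255 for p in pixels)
    assert all(-128 <= c <= 127 for c in coeffs)
    return sum(p * c for p, c in zip(pixels, coeffs))

def row_transform_packed(cur_row, ref_row):
    """Row transform of the residual computed directly on unsigned pixel
    bytes: T*cur - T*ref equals T*(cur - ref) by linearity (assumed usage)."""
    return [mac4(cur_row, T[k]) - mac4(ref_row, T[k]) for k in range(4)]
```

For the row [1, 2, 3, 4], forward_1d returns [130, -58, 0, -4], and inverse_1d of that result returns [676, 1352, 2028, 2704], which is 676 times the input, as stated in section 1.1. For residuals in the range -255 to 255, the largest magnitude a row transform can produce is 52 x 255 = 13260, which is consistent with the statement that row-transform results fit in 16 bits.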