Implementation and optimization of an H.264 video encoder on a DSP

Abstract: An H.264 video encoder is implemented on the DM642 EVM platform and optimized in four respects: memory allocation, cache usage, C code and assembly-level code. The experimental results show that the optimized encoder maintains high image quality and compression efficiency and has good real-time performance.

1 Introduction

H.264/AVC is a new-generation video coding standard proposed jointly by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. Under the same conditions, H.264/AVC can reduce the bit rate by about 50% compared with MPEG-1, MPEG-2, H.263, MPEG-4 and other standards. However, this coding efficiency comes at the cost of heavy computation and high complexity. In this paper, the DM642 EVM, which offers high operating speed and strong data-processing capability, is used as the DSP hardware platform. The H.264 video coding algorithm is implemented on it, and the program is then optimized comprehensively. The experimental results show that the optimized H.264 video encoder maintains high image quality and compression efficiency and has good real-time performance.

2 H.264 video coding technology and the DM642 EVM development platform

2.1 H.264 video coding technology

The H.264 compression algorithm uses a block-based hybrid coding method similar to H.263 and MPEG-4, with two coding modes: intra and inter. To improve coding efficiency, compression ratio and image quality, H.264 adopts many new coding techniques, mainly:

(1) The H.264 compression system consists of a video coding layer (VCL) and a network abstraction layer (NAL).
(2) H.264 uses intra prediction to minimize the spatial redundancy of the image.
(3) Inter prediction in H.264 adopts new methods such as multiple reference frames (1 to 5 references), high-precision interpolation (1/4 and 1/8 pixel accuracy) and search blocks of various shapes, which greatly improve the efficiency of motion estimation and compensation.
(4) Sub-pixel motion estimation with 1/4 and 1/8 pixel accuracy: 1/4 pixel prediction is used for the QCIF video format and 1/8 pixel prediction for the CIF video format.
(5) A 4x4 integer DCT is applied to the residual image, so the inverse transform introduces no mismatch error.
(6) New loop filtering and entropy coding techniques.

2.2 DM642 EVM development platform

The DM642 EVM is a development platform for multimedia applications from TI. Its on-board resources include a DM642 processor, 4M × 64-bit synchronous dynamic memory (SDRAM), 4M × 8-bit flash memory, one video encoding channel and two video decoding channels. Its structure is shown in Figure 1. The DM642 is based on the C64x core, runs at up to 600 MHz and uses a very long instruction word (VLIW) architecture. It can process eight 32-bit instructions in parallel in each instruction cycle, for a peak processing capacity of 4800 MIPS (8 × 600 MHz). The on-chip memory has a two-level cache structure. L1 consists of a 16 KB data cache (L1D) and a 16 KB program cache (L1P).
The 256 KB L2 memory can be configured as SRAM or cache, which greatly improves program performance. The on-chip 64-bit external memory interface (EMIF) connects seamlessly to SDRAM, flash and other memory devices, which makes it much easier to move large amounts of data. The DM642 includes three dedicated video ports (VP0 to VP2) for receiving and processing video data, which improves the performance of the whole system. The EMAC port of the DM642 and the ATA port extended from the EMIF also provide storage channels for the large volume of data generated by processing. The high-performance DM642 EVM is therefore an ideal hardware platform for implementing the H.264 video algorithm.

3 Implementation and optimization of the H.264 video encoder

3.1 Implementation of the encoder

There are many ways to implement an H.264 video encoder, but most implementations are obtained by porting and optimizing existing code. Several issues must be addressed before H.264 code can run under CCS, the DSP software development environment: changes to configuration files and library files, adjustment of data types, handling of assembly code, adjustment of the memory endian mode, and so on.

H.264 uses a hybrid coding method that combines transform and prediction. Its principle is shown in Fig. 2. The input frame or field Fn is processed by the encoder in units of macroblocks: the image is divided into sub-image blocks, and each block serves as a coding unit. In intra mode, the prediction value P is formed from previously encoded and reconstructed samples of the current slice. In inter mode, P is obtained by motion compensation (MC) from a reference image, denoted F'n-1. To improve prediction accuracy and thus the compression ratio, the actual reference image can be selected from past or future frames that have already been encoded, decoded, reconstructed and filtered.
Subtracting the prediction value P from the current block produces a residual block Dn. Block transform and quantization then produce a set of quantized transform coefficients X. These are entropy coded and, together with the side information required for decoding (such as prediction mode, quantization parameters and motion vectors), form the compressed bit stream, which is passed through the NAL for transmission and storage.

3.2 Memory allocation and cache optimization

Compared with a PC, a DSP has very limited program and data storage. Video coding has to process a large amount of data, so the placement of data and program code must be arranged carefully to optimize memory use. Experiments show that, by making good use of the two-level cache together with the slower external memory, the system can reach 80% to 90% of the efficiency it would have if everything ran from fast on-chip memory. In this paper, data that occupy a large space and code that is rarely executed are placed in off-chip memory. The L2 cache is enabled by calling the CACHE_setL2Mode function of the C6000 Chip Support Library (CSL), which sets L2 to 192 KB of SRAM and 64 KB of cache.

Based on the structure of the H.264 algorithm itself, memory is optimized as follows. The C code is analyzed with the CCS profiler. Frequently called program segments (such as the DCT and IDCT) are placed in on-chip program memory, frequently used data segments (such as coding tables) are placed in on-chip data memory, and rarely used program and data segments are placed in off-chip memory, which avoids unnecessary repeated movement of code or data. Because a frame of image data is large, the reference frame and current frame data are kept in off-chip memory while the H.264 encoder runs.
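The placement strategy described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the encoder's actual code: the section names `.ext_frames` and `.int_data`, the buffer names and the helper `load_macroblock` are all hypothetical, and the section names would have to be mapped to external SDRAM and on-chip SRAM in the project's linker command file. The `DATA_SECTION` pragma is specific to the TI compiler, so it is guarded and the file still builds with a host compiler. `memcpy` simply stands in for whatever transfer mechanism a real port uses to stage data on chip.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAME_W 352              /* CIF luma width */
#define FRAME_H 288              /* CIF luma height */
#define MB_SIZE 16               /* a macroblock is 16x16 luma samples */

/* Hypothetical section names: they must be mapped to external SDRAM and
 * to on-chip L2 SRAM in the linker command file. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(cur_frame, ".ext_frames")
#pragma DATA_SECTION(mb_work, ".int_data")
#endif
uint8_t cur_frame[FRAME_W * FRAME_H];   /* large buffer, kept off chip */
uint8_t mb_work[MB_SIZE * MB_SIZE];     /* small hot buffer, kept on chip */

/* Copy macroblock (mb_x, mb_y) of a frame into a compact 16x16 buffer,
 * so later stages only touch the small working buffer. */
void load_macroblock(const uint8_t *frame, int stride,
                     int mb_x, int mb_y, uint8_t *dst)
{
    const uint8_t *src;
    int row;

    assert(frame != NULL && dst != NULL);
    src = frame + (size_t)mb_y * MB_SIZE * stride + (size_t)mb_x * MB_SIZE;
    for (row = 0; row < MB_SIZE; row++) {
        memcpy(dst + row * MB_SIZE, src, MB_SIZE);
        src += stride;
    }
}
```

In this sketch, `load_macroblock(cur_frame, FRAME_W, mb_x, mb_y, mb_work)` would be called once per macroblock, after which prediction and transform code works only on `mb_work`.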
When the frame data are needed, they are moved from external memory into on-chip memory to improve the running efficiency of the program.

3.3 Code optimization

Code optimization starts by finding the bottlenecks of the program, that is, the code that takes the most CPU time, and then optimizing them. The CCS profiler reports the running time of each important segment and function in the program, which makes it possible to identify the computation-heavy segments. Optimizing these segments has the greatest effect on the performance of the algorithm.

(1) Use the -pm and -o3 compiler options together to optimize the code at the project level. CCS provides powerful compiler options, including four optimization levels from -o0 to -o3. The -o3 option enables software pipelining and other optimizations, and the -pm option treats all source files of the project as a single module for program-level optimization. Combining -pm and -o3 enables a series of optimizations, and the resulting code can also become considerably smaller.
(2) Qualify pointers with the const and restrict keywords. const tells the compiler that the data a pointer refers to is not modified through that pointer. restrict tells the compiler that the memory accessed through the qualified pointer does not overlap the memory accessed through any other pointer. This information rules out two pointers accessing the same address and removes memory dependences, so multiple loads and operations can be executed in parallel, which maximizes the efficiency of the code.
(3) Use wide memory accesses (packed data processing) for short data. When the CPU performs a series of operations on short data (such as 16-bit values), the data can be accessed as 32-bit int values, so that two short values are loaded at once and a single C6000 instruction then operates on both. This reduces the number of memory accesses and can save about half the time compared with handling each 16-bit short separately.
(4) Loop unrolling. Unroll loops in the C code so that many iterations become fewer, reduce loop nesting and increase the number of instructions that can run in parallel, which improves software pipelining and code performance.
(5) Reduce C function calls and, where possible, replace C functions with the intrinsics provided by the system. The C6000 compiler provides many intrinsics, which are inline functions that map directly to C6000 assembly instructions. They offer a quick way to optimize C code, remove many unnecessary operations and increase execution speed.
(6) Use software pipelining. Software pipelining is a technique for scheduling and optimizing the instructions in a loop, and it can produce very compact loop code. When the -o2 or -o3 optimization level is selected, the compiler software-pipelines the loops in the program. Software pipelining can greatly improve the efficiency of loop code and exploit instruction-level parallelism to a large degree.

3.4 Assembly-level optimization

The remaining inefficient parts are located with the CCS profile clock tool and optimized further in linear assembly. Linear assembly is a programming language unique to the C6000 series of DSPs that sits between high-level and low-level languages.
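Before dropping to linear assembly, techniques (2) to (5) of section 3.3 can be illustrated in C. The sketch below is an illustrative example under stated assumptions, not code from the encoder. The function names are hypothetical, the array length is assumed to be a multiple of 4, and `add2_lanes` is a portable stand-in for the kind of packed 16-bit addition that a C6000 intrinsic such as `_add2` maps to a single instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one 16-bit element per iteration. Without restrict, the
 * compiler must assume dst may overlap a or b, which limits scheduling. */
void add16_plain(int16_t *dst, const int16_t *a, const int16_t *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = (int16_t)(a[i] + b[i]);
}

/* Add two pairs of 16-bit lanes packed into 32-bit words. A carry out of
 * the low lane is masked off and never reaches the high lane. */
static uint32_t add2_lanes(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;
    return hi | lo;
}

/* Same result as add16_plain, using restrict-qualified pointers, 32-bit
 * packed accesses and a loop unrolled to four elements per iteration.
 * Assumption: n is a multiple of 4. */
void add16_packed(int16_t *restrict dst, const int16_t *restrict a,
                  const int16_t *restrict b, int n)
{
    int i;

    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        uint32_t a0, a1, b0, b1, r0, r1;

        /* memcpy expresses a 32-bit load without alignment or
         * aliasing problems; compilers turn it into a plain load. */
        memcpy(&a0, a + i, 4);
        memcpy(&a1, a + i + 2, 4);
        memcpy(&b0, b + i, 4);
        memcpy(&b1, b + i + 2, 4);

        r0 = add2_lanes(a0, b0);   /* elements i and i+1 */
        r1 = add2_lanes(a1, b1);   /* elements i+2 and i+3 */

        memcpy(dst + i, &r0, 4);
        memcpy(dst + i + 2, &r1, 4);
    }
}
```

Both functions produce the same output for non-overlapping arrays. The packed version touches memory half as often per element and gives the compiler four independent element additions per iteration to schedule.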
Unlike standard assembly language, linear assembly does not require the programmer to consider instruction latency, instruction parallelism, register usage or the allocation of functional units. The assembly optimizer determines these automatically from the code. Assembly-level optimization can also be carried out by modifying the assembly file generated by the compiler. In practice, assembly optimization means applying targeted methods to each of these aspects in order to obtain the highest possible program efficiency. Common assembly directives are as follows:

(1) Define a linear assembly code section that can be optimized by the assembly optimizer and called as a function from C/C++: label .cproc [var1, [var2, …]] ... .endproc
(2) Define a linear assembly code section that can be optimized by the assembly optimizer: label .proc [reg1, [reg2, …]] ... .endproc [reg1, [reg2, …]]

The following aspects need to be considered in linear assembly optimization:

① Allocate and use the functional units evenly to improve the parallelism of the code.
② Minimize the number of clock cycles in the software-pipelined loop kernel.

4 Experimental results

After the optimizations described above, the performance of the H.264 coding algorithm on the DM642 EVM hardware platform improved greatly. In the experiment, three standard H.264 test sequences, foreman, container and news, representing high, medium and low motion respectively, were used to test the optimized algorithm with the IPP coding structure. Table 3 shows the test results for the standard test sequences before and after optimization. The optimization greatly increases the coding speed while preserving image quality, so the encoder comes much closer to meeting real-time coding requirements.
This paper focuses on the implementation and optimization of the H.264 video coding algorithm on the DM642 EVM hardware platform. After optimization, the algorithm runs well and has good real-time performance. On this basis, the code structure can be optimized further to suit the instruction set of the DSP better. In addition, the architecture of the TMS320DM642 chip and its rich external interfaces can be used more fully to implement the codec algorithm more efficiently.