Optimized H.264 decoding algorithm on the NVDK C6416 processing platform

Multimedia communication terminals have broad application prospects in fields such as video conferencing, videophones, PDAs and digital TV, so efficient and practical terminal equipment has long been one of the main research directions in communications. Building such a terminal requires two things: a fast, stable processor to serve as the multimedia signal processing platform, and protocol standards and software algorithms suited to multimedia communication, in particular compression algorithms for audio and video signals. Combining the two yields efficient multimedia communication equipment. The rapid development of digital signal processors (DSPs) has made efficient audio and video signal processing feasible, while the latest low-bit-rate video compression standard, H.264, provides a video standard and algorithmic guidance suited to communication. Implementing the H.264 algorithm on a DSP is therefore of real value to multimedia communication research. This paper describes a DSP implementation of the H.264 decoder algorithm. The design uses ATEME's network video development platform (NVDK C6416) as the DSP processing platform to implement an optimized H.264 decoding algorithm. For QCIF video sequences, the decoding speed reaches 50 to 60 frames/s.

1 Introduction to the network video development platform NVDK

NVDK is an evaluation and development kit for the TI C6400 series DSPs, launched by ATEME, a TI third party. It is a high-speed DSP development platform suited to image and video signal processing [1].
The kit makes development easier for manufacturers of advanced video applications, such as video infrastructure and networked video equipment, and speeds up digital video application projects.

1.1 NVDK C6416 architecture

The NVDK C6416 consists of a TMS320C6416 DSP core, a 10/100 Mbps Ethernet daughter card, an audio/video interface box, a PCI bus, memory units, expansion interfaces and an independent power supply. Its functional block diagram is shown in Figure 1.

1.2 Main features of NVDK C6416

As a network and video development kit, NVDK places many audio, video and network interfaces directly on the board, giving developers who use TI C6000 series DSPs as the processing unit a convenient front-end platform. It provides a complete DSP development platform for project demonstration, algorithm implementation, prototyping, data simulation, FPGA development and software optimization. Its main features are as follows:

·C6416 DSP core: a 600 MHz clock and an 8-instruction parallel architecture give up to 4800 MIPS of processing capacity.
·Video features: at the input, NVDK can capture PAL or NTSC analog video supplied as composite video (CVBS) or S-Video, and digitizes it into the YUV422 digital video format. At the output, NVDK supports composite video (CVBS) and S-Video and also provides an SVGA output mode that can drive a display directly. Video capture offers three image sizes (full, CIF and QCIF), and video output offers two (full and CIF).
·Audio features: two stereo audio outputs, CD-quality stereo input and output interfaces, and one mono microphone input.
·Main interface: a PCI interface allows connection to a PC. The board can run in PCI mode or stand-alone.
·Network interface: an Ethernet interface makes it convenient to transmit video bitstreams over a network.
·External extended memory: 256 MB of 64-bit-wide SDRAM (SDRAM A), 8 MB of 32-bit-wide SDRAM (SDRAM B) and 4 MB of flash ROM provide ample memory space and a flexible memory allocation scheme.

2 H.264 video compression standard

H.264 is the latest international video coding standard, jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). It improves on and extends the H.261 and H.263 video compression standards. Its goals are to further reduce the bit rate, improve compression efficiency, and provide a network-friendly interface so that the video bitstream is better suited to network transmission [2]. Because the standard achieves lower bit rates, it is well suited to multimedia communication. The main new features of H.264 are:

·Network abstraction layer (NAL). In traditional video coding, the encoded stream has the same format whatever the application (storage, transmission and so on) and contains only a video coding layer. H.264 adds NAL headers appropriate to the application, which adapts the stream to different network environments and reduces transmission errors.
·Intra prediction coding. Intra prediction exploits the spatial redundancy within an I frame, which greatly reduces the number of bits needed to code I frames.
·Adaptive block size coding. H.264 allows 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4 sub-block prediction and coding modes. Using smaller blocks and choosing the size adaptively reduces the amount of prediction residual data and therefore the bit rate.
·High-precision sub-pixel motion estimation. H.264 explicitly specifies sub-pixel motion estimation and defines 1/4-pixel and 1/8-pixel accuracy as options. Sub-pixel motion estimation improves prediction accuracy and reduces the bit rate of the residual.
·Multi-frame motion compensation. Traditional video compression uses one (P frame) or two (B frame) decoded frames as references for predicting the current frame. In H.264, a maximum of 5 reference frames are allowed. Searching more reference frames during motion estimation and compensation finds prediction blocks with smaller residuals and so reduces the bit rate.
·Integer transform coding. H.264 uses an integer transform instead of the DCT, and the integer transform uses fixed-point rather than floating-point arithmetic. This reduces encoding and decoding time and also makes the algorithm easier to implement on a multimedia processing platform. In this respect H.264 is better suited than earlier standards to serve as the codec standard for multimedia terminals.
·Two alternative entropy coding methods, CAVLC and CABAC. CAVLC is context-based adaptive variable length coding; CABAC is context-based adaptive binary arithmetic coding. Earlier video compression standards used Huffman coding and variable length coding for entropy coding. Although Huffman coding is a good entropy coding method, its coding efficiency is not the highest and its error resilience is poor. H.264 offers two alternatives: CAVLC has high error resilience but only moderate coding efficiency, while CABAC is highly efficient but computationally very complex. Each has its own advantages and disadvantages, so the method is chosen according to the application.
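The integer transform listed among the features above can be made concrete with a short sketch. The C code below applies the widely documented 4×4 forward core transform matrix of H.264 to a residual block using integer arithmetic only. The function name forward_core_transform_4x4 and the plain matrix-multiply form are illustrative assumptions, not the decoder code used in this paper, and the scaling that the standard folds into quantization is left out.

```c
/* Forward 4x4 core transform matrix of H.264.
 * The scaling step that accompanies quantization is omitted in this sketch. */
static const int CF[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 }
};

/* Hypothetical helper: computes Y = CF * X * CF^T with integers only. */
static void forward_core_transform_4x4(int x[4][4], int y[4][4])
{
    int tmp[4][4];
    int i, j, k;

    /* tmp = CF * X */
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            int acc = 0;
            for (k = 0; k < 4; k++)
                acc += CF[i][k] * x[k][j];
            tmp[i][j] = acc;
        }
    }

    /* Y = tmp * CF^T */
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            int acc = 0;
            for (k = 0; k < 4; k++)
                acc += tmp[i][k] * CF[j][k];
            y[i][j] = acc;
        }
    }
}
```

For a flat block in which every sample equals 1, the only non-zero output is the DC coefficient y[0][0] = 16. The transform concentrates the energy of a smooth block into a single value without any floating-point operation.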
3 DSP implementation and optimization of the H.264 decoder algorithm

3.1 Implementing and optimizing the H.264 algorithm on a PC

The H.264 core code officially provided by ITU-T needs improvement both in its code structure and in the core algorithms themselves before it can meet real-time requirements. The work in this step includes removing redundant code, regularizing the program structure, adjusting and redefining global and local variables, and adjusting data structures.

3.2 Porting the PC H.264 code to the DSP

The C6000 development tool Code Composer Studio (CCS) has its own ANSI C compiler and optimizer, with its own syntax rules and definitions. To implement the H.264 algorithm on the DSP, the C code written for the PC must therefore be modified so that it fully complies with the DSP's C rules. These changes include:

·removing all file operations;
·removing visual interface operations;
·arranging the reservation and allocation of memory space sensibly;
·standardizing data types. Because the C6416 is a fixed-point DSP, it supports only four data types: short (16 bits), int (32 bits), long (40 bits) and double (64 bits). The data must therefore be re-typed, and floating-point operations must be approximated or implemented in fixed point;
·defining near and far constants and variables according to the memory allocation;
·extracting frequently used data from data structures and defining it as near data in the DSP's internal memory, which reduces reads through the EMIF port and improves speed.

3.3 DSP algorithm optimization of H.264 [3]

Porting the PC H.264 code makes it possible to run the H.264 encoding and decoding algorithms on the DSP. The result is very inefficient, however, because all of the code is written in C and does not take full advantage of the DSP's capabilities.
The code must therefore be further optimized around the characteristics of the DSP before the H.264 video decoder can process video in real time. DSP code optimization is divided into three levels: project-level optimization, C-program-level optimization and assembly-level optimization.

(1) Project-level optimization. This mainly consists of selecting the compiler optimization options provided by CCS to suit the requirements of the H.264 system, and repeatedly choosing, combining and adjusting the options (-mw, -pm, -o3, -mt, etc.) to improve the performance of loops and nested loops and to enable software pipelining, which increases the parallelism of the software.

(2) C-program-level optimization. This mainly consists of trimming the functionality of the code, optimizing data structures, optimizing loops and parallelizing the code according to the specific characteristics of the DSP used. The main work includes the following:

·Removing the program modules that compute SNR, frame rate and other auxiliary information.
·Adjusting where functions and data are mapped: frequently used data is stored in on-chip memory, and frequently called routines are mapped to adjacent or nearby memory regions as far as possible.
·Parallelizing C functions: for functions that parallelize poorly, especially nested loops, the nested loop must be broken down into a single loop.
·Reducing reads and writes of memory data, especially accesses to off-chip memory, to save time.
·Redefining and adjusting data structures.

The adjustment of data structures is used below to show how DSP characteristics can be exploited for software optimization. A data structure here means the type of the data and its layout in memory. Different data structures affect program performance differently.
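As a small illustration of the statement that the type and layout of data affect performance, the sketch below compares two ways of storing a 4×4 block of 8-bit samples. The names sum_block_int and sum_block_packed and the block size are hypothetical, and the code is portable C rather than C6416-specific code. Whether a compiler turns the word-sized copy into a single load instruction depends on the target and on data alignment.

```c
#include <stdint.h>
#include <string.h>

#define BLK 4  /* hypothetical 4x4 block size */

/* Layout A: one int per 8-bit sample, addressed with two indices. */
static int sum_block_int(int blk[BLK][BLK])
{
    int sum = 0;
    int i, j;

    for (i = 0; i < BLK; i++)
        for (j = 0; j < BLK; j++)
            sum += blk[i][j];            /* one read per sample */
    return sum;
}

/* Layout B: samples stored as bytes, the whole block contiguous in memory. */
static int sum_block_packed(const uint8_t *blk)
{
    int sum = 0;
    int r;

    for (r = 0; r < BLK; r++) {
        uint32_t w;
        /* One 32-bit read brings in the four samples of a row. */
        memcpy(&w, blk + r * BLK, sizeof w);
        sum += (int)(w & 0xFF) + (int)((w >> 8) & 0xFF)
             + (int)((w >> 16) & 0xFF) + (int)(w >> 24);
    }
    return sum;
}
```

Both functions return the same sum for the same samples. With a 32-bit int, the packed layout occupies a quarter of the memory and needs a quarter as many reads, at the cost of a few shift and mask operations.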
Adjusting the data structures is therefore an essential step toward parallel execution of the program on the DSP. In the kernel code of the H.264 decoder, the array mpr[i][j] stores the prediction coefficients of a macroblock. Its element type is int, and i and j are the coordinates of a coefficient. The prediction coefficients are in fact only 8 bits wide, so a byte type is sufficient. This saves memory and also allows a single LDW instruction to read four values at a time, in place of LDB instructions that read one value each, which saves load time. Because the coefficients in H.264 are read block by block, the mpr layout in the kernel clearly cannot make full use of the DSP, so the storage layout also needs to be adjusted. Placing each block of mpr in a contiguous region of memory favors data transfer, as shown in Figure 2. Once a block has been selected, a coefficient can then be located by changing a single one-dimensional index, whereas the original structure needs two coordinates for each coefficient. Adjusting the data in this way significantly improves the running speed of the program.

(3) Assembly-level optimization. This has two parts: optimizing with linear assembly and optimizing directly in assembly language. Because of the limitations of the compiler, not every function can be optimized well, so the most time-consuming C functions need to be identified by profiling and rewritten in assembly. These functions include the interpolation functions, the intra prediction functions and the inverse integer transform. The performance gain from assembly is illustrated with a fragment of the interpolation function.
Horizontal 1/2 interpolation source code:

    for (j = 0; j < BLOCK_SIZE; j++) {
        for (i = 0; i < BLOCK_SIZE; i++) {
            /* 6-tap filter over the pixels at x_pos+i-2 .. x_pos+i+3 */
            for (result = 0, x = -2; x < 4; x++)
                result += mref[ref_frame][y_pos+j][x_pos+i+x] * COEF[x+2];
            /* round, divide by 32 and clip to 0..255 */
            block[i][j] = max(0, min(255, (result+16)/32));
        }
    }

This code uses a six-tap filter to interpolate the pixel values at half-pixel positions; 16 values (one block) are interpolated in total. The source code uses a triple loop whose innermost loop is the interpolation filter. If the compiler translates the source code directly into assembly, the inner loop has to read some memory data repeatedly. Hand-written code can improve the algorithm and greatly reduce the running time of the function. As shown in Fig. 3, interpolating the first half-pixel position requires reading pixels 1 to 6 from memory, and interpolating the second requires pixels 2 to 7, so pixels 2 to 6 are read twice. Moreover, interpolating one point takes 6 multiplications and 5 additions. Written in assembly language, manual