1 Introduction
In recent years, with the development of network and multimedia technology, video communication has grown sharply in importance, and video compression coding is the key technology behind it. Video coding schemes based on the TMS320DM642 DSP have been proposed to implement the H.264 algorithm. Compared with H.264, however, MPEG4 has lower complexity and is easier to implement in both software and hardware, and it is the mainstream of current video coding applications. This paper proposes an implementation of an MPEG4 video encoder on the TMS320DM642 DSP, which can be used in remote video surveillance, video conferencing, and many other areas.
MPEG4 is an international general-purpose video compression coding standard developed by the Moving Picture Experts Group (MPEG). It was designed to adapt to different transmission bandwidths and uses efficient compression algorithms and tools to obtain the best possible image quality. MPEG4 uses DCT, quantization, and entropy coding, and by analyzing shape, motion, and texture information it removes the temporal and spatial correlation in the image data. Its advantages, such as efficient compression and broad applicability, make video information convenient to store and transmit efficiently.
MPEG4 defines different profiles and levels of encoders and bitstreams for different applications. The Simple Profile provides encoding functions for rectangular video objects, and the implementation in this paper follows the Simple Profile of the MPEG4 video coding algorithm.
2 MPEG4 encoder hardware platform
The hardware platform of the MPEG4 encoder is built around the TMS320DM642 DSP as its core, together with suitable peripherals such as external SDRAM and Flash memory.
2.1 TMS320DM642 features
The TMS320DM642 is a high-performance fixed-point digital signal processor from TI for multimedia applications, based on the C64x core. It runs at a clock frequency of 600 MHz, and its peak processing power reaches 4800 MIPS. The DM642 shares the common fixed-point instruction set of the C6000 series DSPs and adds multimedia extension instructions, which make it easy to execute image processing algorithms quickly. These features make the DM642 well suited to video and image processing and an ideal hardware platform for the MPEG4 video encoder.
2.2 Hardware system structure
As the core of the entire system, the DM642 processes the video data at high speed and runs the MPEG4 encoding algorithm. The programmable video format conversion circuit preprocesses the input raw video data and converts it into a digital signal in a video format that the encoder accepts. E2PROM and Flash store the application and the initialization parameters, and SDRAM serves as external memory that holds data during the encoding process; all three are connected to the DM642 through the EMIF bus. Through the JTAG interface, CCS can easily perform hardware emulation and debugging. A real-time clock provides time base information for the digital video.
3 MPEG4 encoder software implementation and optimization
3.1 MPEG4 software implementation
MPEG4 is an open framework standard that does not prescribe specific algorithms or programs, so users can develop code as needed. We use the open-source XVID 1.1.0 code to implement the MPEG4 encoder. The XVID code implements the MPEG4 Simple Profile, which requires no shape coding; only I-VOPs and P-VOPs are encoded. However, XVID was designed and developed for the PC. To port it to the DSP, the code must be analyzed and adapted to the instruction structure and characteristics of the DSP.
The MPEG4 encoder implemented with the XVID code treats each frame of the original video data as one video object. It first decides whether the frame is an I frame or a P frame. An I frame requires encoding the image data of the entire frame, whereas a P frame undergoes motion estimation and compensation, so that only the image residual and the motion vectors between the current frame and the reference frame are encoded. Each frame is divided into 16 × 16 macroblocks, each macroblock is divided into 8 × 8 sub-blocks, and DCT, quantization, and VLC encoding are performed on macroblocks and sub-blocks. Because the image quality requirements are not high, we removed certain XVID features, such as GMC (global motion compensation) and RVLC, to reduce the amount of computation and the complexity.
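As a rough illustration of the partitioning described above, the following sketch counts the 16 × 16 macroblocks in a frame and the 8 × 8 luminance sub-blocks in each macroblock. The function names are hypothetical and are not taken from XVID. Rounding up frame dimensions that are not multiples of 16 is an assumption of this sketch.

```c
#include <stddef.h>

#define MB_SIZE  16  /* macroblock edge length in pixels */
#define BLK_SIZE 8   /* sub-block edge length in pixels */

/* Number of macroblocks along one dimension, rounding up (assumption). */
size_t mb_count_1d(size_t pixels)
{
    return (pixels + MB_SIZE - 1) / MB_SIZE;
}

/* Total macroblocks in a frame (hypothetical helper, not an XVID function). */
size_t mb_count(size_t width, size_t height)
{
    return mb_count_1d(width) * mb_count_1d(height);
}

/* Each 16x16 luminance macroblock holds (16/8)^2 = 4 sub-blocks of 8x8. */
size_t luma_blocks_per_mb(void)
{
    return (MB_SIZE / BLK_SIZE) * (MB_SIZE / BLK_SIZE);
}
```

For a QCIF frame (176 × 144), `mb_count(176, 144)` gives 11 × 9 = 99 macroblocks, each with 4 luminance sub-blocks.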
3.2 Code Optimization
In order to improve the code execution efficiency, the code must be optimized for the characteristics of the DSP, and the optimization is mainly divided into 3 levels:
3.2.1 Project Level Optimization
TI provides a powerful integrated development environment, CCS, that includes a variety of efficient compilation tools. During compilation, compiler options (such as -o3 and -pm) let the compiler automatically improve the code structure, reduce dependencies between instructions, increase instruction parallelism, improve loop performance, and optimize the code size.
3.2.2 C language program level optimization
The Profile tool in CCS is used to evaluate the C code and find the parts with the heaviest computation, such as DCT, quantization, and motion estimation. Optimizing these parts has a significant impact on encoder performance. We use the following C-level optimization methods:
(1) Rewrite the C code with the keywords and intrinsics specific to the C6000 DSP. For example, the keyword restrict removes assumed dependencies between data and so improves the ability to execute code in parallel, and intrinsics (such as _add2() and _nassert()) are special functions that map directly to inline C6000 instructions, so they optimize the C code quickly and improve its execution efficiency on the DSP.
(2) Access short data through integers. A 32-bit integer access reads two 16-bit short values into the high and low 16-bit fields of a 32-bit register, which reduces the number of memory accesses and doubles the efficiency of reading data. Combined with intrinsics that operate on both halves of a register at once, such as _add2() and _mpy2(), this can greatly increase code execution efficiency.
(3) Transform loops to turn multi-level loops into loops with fewer levels or even a single level, reduce loop nesting, eliminate redundant loops, and increase the degree of parallel instruction execution.
(4) The DSP has no dedicated hardware divider; division is implemented by repeated subtraction and is computationally expensive. Division operations should therefore be minimized, or realized by other means where possible, which reduces computation time.
(5) Use the TI image library. TI provides a powerful image library (IMGLIB) that contains many common image processing functions, such as the DCT of 8 × 8 sub-blocks (IMG_fdct_8x8) and SAD calculation (IMG_sad_8x8). These functions are already optimized, run efficiently, and can be called directly from the program.
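The packed-data idea in items (1) and (2) can be sketched in portable C. The function below emulates what a packed 16-bit add such as _add2() computes: two independent 16-bit additions inside one 32-bit word, with no carry from the low half into the high half. The names `packed_add2` and `add_rows` are hypothetical; on the DM642 the intrinsic itself would be used, so this emulation only illustrates the arithmetic and the use of restrict, not the speed-up.

```c
#include <stdint.h>
#include <stddef.h>

/* Portable emulation of a packed 16-bit add: the low and high halves of
 * the 32-bit words are added independently, each wrapping modulo 2^16. */
uint32_t packed_add2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;          /* low halves only */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* carry out is dropped */
    return hi | lo;
}

/* Add two rows of 16-bit samples, two samples per 32-bit word.
 * `restrict` promises the compiler that the buffers do not overlap.
 * n_words is the number of 32-bit words, i.e. half the sample count. */
void add_rows(uint32_t *restrict dst, const uint32_t *restrict a,
              const uint32_t *restrict b, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        dst[i] = packed_add2(a[i], b[i]);
}
```

Each loop iteration handles two samples with one load per operand, which is the halving of memory accesses that item (2) describes.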
3.2.3 Assembly program level optimization
Linear assembly is a programming language unique to the C6000 series DSPs. It is similar to assembly, but the programmer does not need to specify details such as the functional units, registers, and parallelism to be used; the assembly optimizer determines them automatically from the code. We rewrote modules such as quantization, DCT, and SAD in linear assembly to further optimize their loops and improve instruction parallelism. Table 2 gives the number of clock cycles consumed by several rewritten function modules when encoding 3 frames of the Foreman QCIF test sequence.
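For reference, the quantity that an 8 × 8 SAD module computes can be written in plain C as follows. This is a readable reference version, not the linear assembly described above, and the function name and the `stride` parameter are assumptions made for this sketch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between an 8x8 block of the current frame
 * and an 8x8 block of the reference frame. `stride` is the distance in
 * bytes between consecutive rows of each image (assumed equal for both). */
unsigned sad_8x8(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            sum += (unsigned)abs((int)cur[x] - (int)ref[x]);
        cur += stride;  /* advance both pointers to the next row */
        ref += stride;
    }
    return sum;
}
```

Motion estimation evaluates this sum for many candidate positions in the reference frame, which is why the inner loop is a natural target for hand optimization.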
3.3 Configuration of storage space
The on-chip memory of the DSP is limited, so the large amount of video data to be processed (including the current frame and the reference frames) must be placed off-chip, and CPU access to off-chip memory is much slower than access to on-chip memory. With the EDMA function of the DM642, while the CPU encodes the previous data, the next data is moved from off-chip memory to on-chip memory through an EDMA channel. The two work in parallel, which reduces the time the CPU spends waiting.
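The overlap of transfer and processing described above is commonly organized as double (ping-pong) buffering, sketched below under stated assumptions. Here `memcpy` stands in for an EDMA transfer and is synchronous, so the sketch shows only the buffer management, not real parallelism. The chunk size, the function names, and the byte-sum "processing" are all hypothetical placeholders.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 256  /* bytes per transfer; an arbitrary size for this sketch */

/* Stand-in for an EDMA transfer from external to on-chip memory. On the
 * DM642 this would be started asynchronously; memcpy is synchronous. */
static void fetch_chunk(uint8_t *onchip, const uint8_t *external, size_t n)
{
    memcpy(onchip, external, n);
}

/* Placeholder for the encoding work done on one chunk: a byte sum. */
static uint32_t process_chunk(const uint8_t *buf, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += buf[i];
    return acc;
}

/* Walk through `total` bytes of external data using two on-chip buffers:
 * while one buffer is processed, the next chunk goes into the other. */
uint32_t process_double_buffered(const uint8_t *external, size_t total)
{
    static uint8_t pingpong[2][CHUNK];
    uint32_t acc = 0;
    int cur = 0;

    if (total == 0)
        return 0;

    size_t cur_len = total < CHUNK ? total : CHUNK;
    fetch_chunk(pingpong[cur], external, cur_len);  /* prime first buffer */
    size_t offset = cur_len;

    while (cur_len > 0) {
        size_t remaining = total - offset;
        size_t next_len = remaining < CHUNK ? remaining : CHUNK;

        /* Start fetching the next chunk into the idle buffer. */
        if (next_len > 0)
            fetch_chunk(pingpong[cur ^ 1], external + offset, next_len);

        /* Process the current chunk (concurrent with the fetch on hardware). */
        acc += process_chunk(pingpong[cur], cur_len);

        offset += next_len;
        cur ^= 1;            /* swap the roles of the two buffers */
        cur_len = next_len;
    }
    return acc;
}
```

On real hardware the code would wait for the transfer to complete before swapping buffers; that synchronization step is omitted here because the stand-in copy has already finished when it returns.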
3.4 Experimental results
The encoder was tested by encoding standard test sequences in QCIF format (176 × 144): 300 frames of the News sequence, 150 frames of the Suzie sequence, and 400 frames of the Foreman sequence. The hardware emulation experiments were run in TI's integrated development environment CCS 2.0 with the bit rate set to 100 b/s.
Analysis of the coding results shows that the encoder reaches an encoding rate of 25 fps, which meets the requirements of real-time encoding. If the transmission rate is lowered, the encoding rate can be improved further. The results also show that the compression ratio differs between test sequences, which is due to differences in motion and background change. For example, the Suzie sequence has a uniform background and high motion, and its compression ratio is relatively high, whereas the News sequence changes constantly, and its compression ratio is relatively low. Comparing the images before and after encoding shows no visible difference; the image quality is not significantly reduced.
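To put the compression ratio in context, the raw data rate of a QCIF sequence can be estimated as below. The sketch assumes 8-bit 4:2:0 sampling (1.5 bytes per pixel), which is not stated in the text, and uses the 25 fps figure reported above. The function names are hypothetical, and the coded size must come from measurement; none is assumed here.

```c
#include <stdint.h>

/* Bytes in one raw 8-bit 4:2:0 frame: a full-resolution luma plane plus
 * two chroma planes subsampled by 2 in each direction (1.5 bytes/pixel). */
uint32_t raw_frame_bytes(uint32_t width, uint32_t height)
{
    return width * height * 3u / 2u;
}

/* Raw bit rate in bits per second at the given frame rate. */
uint32_t raw_bit_rate(uint32_t width, uint32_t height, uint32_t fps)
{
    return raw_frame_bytes(width, height) * 8u * fps;
}

/* Compression ratio = raw size / coded size (coded_bytes must be non-zero). */
double compression_ratio(uint32_t raw_bytes, uint32_t coded_bytes)
{
    return (double)raw_bytes / (double)coded_bytes;
}
```

Under these assumptions a QCIF frame occupies 38,016 bytes, so the uncompressed stream at 25 fps is 7,603,200 bit/s, or about 7.6 Mbit/s.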
4 Conclusion
This paper explores the implementation and optimization of an MPEG4 encoder on the DM642 and implements the Simple Profile of MPEG4 encoding. The experimental results show that the proposed scheme is easy to implement and practical, that the code optimization methods are effective, and that the performance tests give satisfactory results. On this basis, further research can address the more advanced MPEG4 profiles and improved code optimization methods to meet higher application requirements.