Implementation of H.264 video encoder based on ADSP-BF561

H.264/AVC, the latest international video coding standard jointly formulated by ITU-T VCEG and ISO/IEC MPEG, is one of the hot technologies in the field of image communication. The video coding layer (VCL) of H.264 adopts many new technologies that greatly improve its coding performance. However, this comes at the cost of a multiplied computational complexity, which poses great challenges for H.264 in real-time video coding and transmission applications. Therefore, to meet the real-time requirements of image compression, the existing H.264 codec needs to be optimized. This paper discusses the hardware platform and task flow of the H.264 system and, according to the characteristics of the DSP hardware platform, introduces specific methods for optimizing the algorithm at the code level to further improve the speed of the coding algorithm and realize real-time H.264 coding. The Blackfin ADSP-BF561 is a high-performance digital signal processor from ADI with a main frequency of 600 MHz. This paper therefore selects it as the hardware platform to explore an effective way to implement an H.264 encoder on a DSP platform with limited resources.

1 Hardware platform

1.1 ADSP-BF561 processor

The Blackfin ADSP-BF561 is a high-performance fixed-point DSP video processing chip in the Blackfin series. Its main frequency is up to 600 MHz, and its core includes two 16-bit multipliers (MACs), two 40-bit accumulator ALUs, four 8-bit video ALUs and a 40-bit shifter. The two data address generators (DAGs) in the chip can supply the addresses for fetching two operands from memory at the same time, and the core can perform 1200 million multiply-accumulate operations per second.
The chip has dedicated video signal processing instructions and 100 KB of on-chip L1 memory (16 KB instruction cache, 16 KB instruction SRAM, 64 KB data cache/SRAM and 4 KB scratchpad data SRAM), 128 KB of on-chip L2 SRAM, and a dynamic power management function. In addition, the Blackfin processor provides rich peripheral interfaces, including an EBIU (four 128 MB SDRAM interfaces and four 1 MB asynchronous memory interfaces), 3 timers/counters, 1 UART, 1 SPI interface, 2 synchronous serial interfaces and 1 parallel peripheral interface (supporting the ITU-656 data format). The architecture of the Blackfin processor fully reflects its support for media application algorithms, especially video applications.

1.2 Video encoder platform based on ADSP-BF561

The hardware structure of the ADSP-BF561 video encoder is shown in Figure 1. The hardware platform is ADI's ADSP-BF561 EZ-KIT Lite evaluation board. The board includes one ADSP-BF561 processor, 32 MB of SDRAM and 4 MB of flash. The AD1836 audio codec on the board provides 4 input / 6 output audio interfaces, while the ADV7183A video decoder and the ADV7179 video encoder provide 3 input / 3 output video interfaces. In addition, the evaluation board includes a UART interface, a USB debugging interface and a JTAG debugging interface.

In Figure 1, the analog video signal from the camera is converted into a digital signal by the ADV7183A video chip. This signal enters the ADSP-BF561 through PPI1 (parallel peripheral interface) for compression, and the compressed code stream is converted by the ADV7179 and output from the PPI2 port of the ADSP-BF561. The system can load programs from flash and supports serial port and network transmission. The original image, reference frames and other data used during encoding can be stored in SDRAM.

2 Main characteristics of the H.264 video compression coding algorithm

Video codec standards fall into two main series: the MPEG series and the H.26x series. The MPEG series standards are formulated by ISO/IEC (International Organization for Standardization / International Electrotechnical Commission), and the H.26x series standards are formulated by ITU-T (International Telecommunication Union). The ITU-T standards include H.261, H.262, H.263 and H.264, which are mainly used for real-time video communication such as video conferencing.

The H.264 video compression algorithm adopts a block-based hybrid coding method similar to H.263 and MPEG-4. It uses two coding modes: intra coding and inter coding. Compared with previous coding standards, H.264 adopts the following new coding technologies to improve coding efficiency, compression ratio and image quality:

(1) H.264 divides the video coding system by function into the video coding layer (VCL, Video Coding Layer) and the network abstraction layer (NAL, Network Abstraction Layer). The VCL performs the efficient compression of video sequences, and the NAL standardizes the format of the video data, mainly providing header information for transmission and storage over various media.

(2) Advanced intra prediction, which uses 4×4 prediction for macroblocks with more spatial detail and 16×16 prediction for flat areas. The former has 9 prediction modes and the latter has 4.

(3) More types of block partition are used in inter prediction. The standard defines seven partitions with different sizes and shapes: the macroblock partitions (16×16, 16×8, 8×16) and the sub-macroblock partitions (8×8, 8×4, 4×8, 4×4). The use of smaller blocks and adaptive coding reduces the amount of prediction residual data and thereby further reduces the bit rate.

(4) High-precision motion prediction with 1/4-pixel accuracy.
(5) Multi-reference-frame prediction. During inter coding, up to 5 different reference frames can be selected.

(6) Integer transform (DCT/IDCT). A 4×4 integer transform is applied to the residual image, using fixed-point operations in place of the floating-point operations of the earlier DCT. This reduces the coding time and is also better suited to porting to a hardware platform.

(7) H.264/AVC supports two entropy coding methods: CAVLC (context-based adaptive variable length coding) and CABAC (context-based adaptive binary arithmetic coding). CAVLC has higher error resilience, but its coding efficiency is lower than that of CABAC; CABAC has high coding efficiency, but needs more computation and storage.

(8) New loop filtering and entropy coding technologies are adopted.

These new technologies make H.264 a big step forward in moving image compression. It has better compression performance than MPEG-4 and H.263 and can be applied to high-performance video compression fields such as the Internet, digital video, DVD and TV broadcasting.

3 Implementation of the H.264 video coding algorithm

There are three steps to implementing H.264 on a DSP: C algorithm optimization on the PC, porting the program from the PC to the DSP, and code optimization on the DSP platform.

3.1 C algorithm optimization on PC

According to the system requirements, this design selects the Baseline Profile of ITU's JM 8.5 reference software as the standard algorithm software. The JM reference software is designed for the PC, where it achieves good coding results. When porting video codec software to a DSP, the DSP system resources must be considered. The main factor is system space (both program space and data space). It is therefore necessary to evaluate the original C code, which requires an understanding of the code being ported. Fig. 2 shows the algorithm structure of H.264.
After understanding the algorithm structure, it is also necessary to identify the computation-intensive and time-consuming parts of the coding algorithm. The profile analysis tool of VC6 shows that the intra and inter coding parts account for more than 60% of the overall running time, and that ME (motion estimation) takes up most of it. Therefore, the focus of porting and optimization should be on motion estimation, and the code structure should be adjusted as follows.

(1) Significantly cut unnecessary files and functions

Because the Baseline Profile and a single reference frame are selected, many files and functions can be deleted. This includes redundant code for unsupported features such as B frames, SI slices, SP slices, data partitioning, layered coding, weighted prediction mode and CABAC coding mode, as well as the files rtp.c, sei.c, leaky_bucket.c and intrarefresh.c, their related header files, and the corresponding global variables and functions defined in the global.h header file. In addition, field-related global and local variables such as top_pic and bottom_pic can be deleted, along with code for layered coding, multi-slice partitioning and FMO, prediction related to field coding / frame-field adaptive coding / macroblock adaptive coding, reference frame reordering, input and output, decoder buffer operations, etc. Redundant code for the random intra macroblock refresh mode and the weighted prediction mode can also be deleted. Because the encoder outputs a NAL code stream instead of the RTP format, rtp.c can be deleted. sei.c contains auxiliary coding information (not included in the code stream); if it is not used, it can also be deleted. leaky_bucket.c is used to calculate the parameters of the leaky bucket buffer.
(2) Rewrite the configuration function

JM configures its system parameters by reading the encoder.cfg file. The parameter configuration can instead be done by centralized assignment in an initialization function, which reduces the amount of code, the use of limited memory and the time spent reading the file, and so improves the overall coding speed of the encoder. For example, the variable input->img_height, defined as int, can be assigned directly: input->img_height = 288 (CIF format).

(3) Remove redundant print information

For convenience in debugging and algorithm improvement, JM retains a large amount of print information. To improve coding speed and reduce storage consumption, this information can be deleted, for example the extensive trace information and the coded data statistics files. Files such as log.dat and stat.dat are only needed for debugging on the PC and do not have to be ported to the DSP platform, so the code related to them can be removed completely. However, the basic information required for debugging (such as bit rate, signal-to-noise ratio and coding sequence) should be retained for reference.

Through these adjustments, the structure and size of the code become more streamlined, in preparation for porting to the DSP.

3.2 Program porting from PC to DSP

The simplified program on the PC side is then ported to VisualDSP++, the development environment of the ADSP-BF561. To get it running initially on the DSP, the main problems to consider are syntax rules and memory allocation.

(1) Remove all functions that are not supported by the compilation environment

This mainly means removing some time-related functions, changing file operations into reads from a data buffer, and deleting SNR statistics and other code that is unnecessary on the DSP platform.
Also note that function declarations and data structure types must conform to the C language format of the DSP compiler.

(2) Add hardware-related code

This code includes system initialization, the output module code, interrupt service routines and the bit-rate control program.

(3) Configure the LDF file

Newly ported code often has very large data and program sections that do not fit in SRAM, which causes linking to fail. At the beginning, it is best to put all programs and data in SDRAM so that linking succeeds. The stack and heap are handled similarly and are placed in SDRAM first. At this stage, what is needed is a program that runs correctly; speed comes second.

(4) Solve the malloc problem

malloc is a problem that must be solved in DSP development. If memory is requested dynamically, the result is often wrong even when the program runs. Therefore, it is best to use static allocation, for example in the form of arrays.

After porting, H.264 coding on the ADSP-BF561 processor is realized. If the speed cannot yet meet the requirements of real-time coding, the code can be optimized further.

4 Code optimization on DSP platform

In the VisualDSP++ development environment, the main methods of code optimization are C language level optimization and assembly level optimization.

4.1 C language level optimization

Analysis with the VC6 profile tool showed that porting and optimization should focus on the motion estimation part. After comparing various algorithms, the author chooses the diamond search (DS) method. The DS algorithm uses two search templates: the large template LDSP (Large Diamond Search Pattern) with 9 search points and the small template SDSP (Small Diamond Search Pattern) with 5 search points. The diamond search diagram is shown in Figure 3. The search first uses the large template.
When the point with the minimum SAD error appears at the center of the large template, LDSP is replaced with SDSP for the matching operation. If the minimum is not at the center, the large template is re-centered on the minimum point and the search is repeated. In the SDSP stage, if the point with the smallest SAD among the five points is the center point, that point is the best matching point and the search ends. Otherwise, the search continues with SDSP, using this point as the new search center. Verified by JM
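
The two-stage diamond search described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the encoder's actual code: the names sad16, pattern_step, diamond_search and MotionVector are hypothetical, and the sketch assumes 16×16 blocks, integer-pixel accuracy and a frame stride equal to the frame width. Following the text, the small template is repeated until its center holds the minimum SAD.

```c
#include <stdlib.h>

#define BLK 16  /* assumed block size: one 16x16 luma macroblock */

/* Hypothetical result type: displacement and its matching error. */
typedef struct { int dx, dy, sad; } MotionVector;

/* Sum of absolute differences between the current block at (cx, cy)
 * and the reference block at (rx, ry). */
static int sad16(const unsigned char *cur, const unsigned char *ref,
                 int stride, int cx, int cy, int rx, int ry)
{
    int sad = 0;
    for (int y = 0; y < BLK; y++) {
        const unsigned char *c = cur + (cy + y) * stride + cx;
        const unsigned char *r = ref + (ry + y) * stride + rx;
        for (int x = 0; x < BLK; x++)
            sad += abs(c[x] - r[x]);
    }
    return sad;
}

/* Offsets around the center. The center itself (the 9th / 5th point)
 * is not listed because its SAD is already known. */
static const int LDSP[8][2] = {
    {0, -2}, {1, -1}, {2, 0}, {1, 1}, {0, 2}, {-1, 1}, {-2, 0}, {-1, -1}
};
static const int SDSP[4][2] = { {0, -1}, {1, 0}, {0, 1}, {-1, 0} };

/* Evaluate one template around the current vector. Moves to the best
 * point and returns 1 if some point beats the center, else returns 0. */
static int pattern_step(const unsigned char *cur, const unsigned char *ref,
                        int width, int height, int bx, int by, int range,
                        const int (*pat)[2], int npts, MotionVector *mv)
{
    int best_dx = mv->dx, best_dy = mv->dy, moved = 0;
    for (int i = 0; i < npts; i++) {
        int dx = mv->dx + pat[i][0];
        int dy = mv->dy + pat[i][1];
        int rx = bx + dx, ry = by + dy;
        /* Skip candidates outside the search range or the frame. */
        if (abs(dx) > range || abs(dy) > range) continue;
        if (rx < 0 || ry < 0 || rx > width - BLK || ry > height - BLK) continue;
        int s = sad16(cur, ref, width, bx, by, rx, ry);
        if (s < mv->sad) {
            mv->sad = s;
            best_dx = dx;
            best_dy = dy;
            moved = 1;
        }
    }
    mv->dx = best_dx;
    mv->dy = best_dy;
    return moved;
}

/* Diamond search for the block whose top-left corner is (bx, by). */
static MotionVector diamond_search(const unsigned char *cur,
                                   const unsigned char *ref,
                                   int width, int height,
                                   int bx, int by, int range)
{
    MotionVector mv = { 0, 0, 0 };
    mv.sad = sad16(cur, ref, width, bx, by, bx, by);
    /* Stage 1: repeat the large template until the minimum is at the center. */
    while (pattern_step(cur, ref, width, height, bx, by, range, LDSP, 8, &mv))
        ;
    /* Stage 2: refine with the small template until the center is the minimum. */
    while (pattern_step(cur, ref, width, height, bx, by, range, SDSP, 4, &mv))
        ;
    return mv;
}
```

For a CIF frame, a call such as diamond_search(cur, ref, 352, 288, 16, 16, 16) would search within ±16 pixels around the macroblock at (16, 16). Because a move is accepted only when the SAD strictly decreases, both loops are guaranteed to terminate.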