Introduction
Digital video encoding is now the core enabling technology in video surveillance, video conferencing, and network streaming media. Video surveillance in particular has become one of the most common kinds of security equipment, and digital DVRs that store video on computer hard drives are steadily replacing analog DVRs. The key technology in a digital DVR is video compression, and it involves two major choices. The first is the compression standard. Current international standards include MPEG-2, MPEG-4, and H.264. With its high compression efficiency and excellent image quality at low bit rates, H.264 is the preferred compression method in current video surveillance systems.
However, everything has two sides: H.264's high coding efficiency and image quality come at the cost of algorithmic complexity. An H.264 encoder is 4 to 5 times as complex as an MPEG-2 encoder. The second choice is which chip to implement the encoder on. TI's TMS320DM642 is a high-speed DSP designed for media processing, and its image processing capability makes real-time encoding in a surveillance system possible. To reduce cost, the DM642's resources must be used to the full so that one DM642 can handle more video channels; that is the purpose of the optimization described here. This article first introduces the hardware platform of the video surveillance system. It then arranges the framework of the encoding software around the structural features of the DM642 and proposes DSP-based optimization methods for motion estimation, the part of the encoder that consumes the most system resources. Finally, using the integer DCT as an example, it discusses techniques for writing assembly code.
Introduction to the hardware platform
The framework of the video surveillance hardware system is shown in Figure 1. To meet the needs of digital media processing, the DM642 adds three configurable video ports (VP0, VP1, and VP2) that interface seamlessly with commonly used video codec devices, so no additional programmable logic devices or FIFOs are required in the system design.
To save cost and improve utilization of the DSP, a single board handles multiple audio and video channels, so the data throughput between the compression card and the host is large. To ensure that data can be stored in real time, the system is built as a PCI card. Its data transfer rate to the host is up to 528 MB/s (66 MHz, 64 bit), which fully meets the needs of a high-capacity, high-speed real-time transmission system.
Figure 1 Hardware system framework
Each video port can receive two 8/10-bit video signals. Each analog video signal is converted by the SAA7144 A/D converter into digital video data in the 8-bit BT.656 format, so one DM642 chip can handle up to 6 channels of video input. In BT.656 capture mode, each video port acquires 8-bit or 10-bit 4:2:2 luma and chroma signals multiplexed into a single data stream. The video data is transmitted in the order Cb, Y, Cr, Y, Cb, Y, Cr, and so on, where each Cb, Y, Cr group holds the luma and chroma samples of the same position and the Y that follows is the luma sample of the next position. The data stream is split into separate Y, Cb, and Cr FIFOs and then moved to SDRAM by the EDMA, ready for compression. The encoded video stream is sent through the PCI port and stored on the computer's hard drive, which completes the data flow of the video surveillance system.
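The sample order described above can be illustrated with a small host-side sketch. On the DM642 the video port hardware performs this split into its FIFOs; the function name `demux_cbycry` and the plain-array interface are hypothetical, and the BT.656 timing reference codes embedded in a real stream are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split an interleaved 4:2:2 stream (Cb, Y, Cr, Y, ...) into three planes.
 * n_pixels must be even: each 4-byte group carries two luma samples and
 * the single Cb/Cr pair that goes with the first of them. */
void demux_cbycry(const uint8_t *stream, size_t n_pixels,
                  uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    assert(n_pixels % 2 == 0);
    for (size_t i = 0; i < n_pixels / 2; ++i) {
        const uint8_t *g = stream + 4 * i;  /* start of one Cb,Y,Cr,Y group */
        cb[i]        = g[0];
        y[2 * i]     = g[1];
        cr[i]        = g[2];
        y[2 * i + 1] = g[3];                /* luma of the next position */
    }
}
```

For n_pixels luma samples the chroma planes each receive n_pixels / 2 samples, which is what 4:2:2 sampling means along a line.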
Overall framework of the encoder
JM is one of the available H.264 reference software packages. Its code covers every feature of H.264 and has to handle every case, such as frame coding and field coding, and its memory allocation does not take the constraints of a real system into account. It is suitable for understanding the H.264 standard but not for porting directly to a DSP platform. To use the DM642's limited on-chip resources efficiently, the code has to be reorganized and streamlined, including its data structures, the placement of data, and the way data is stored.
The first consideration is the configuration of L2. The second-level memory L2 (256 KB) is a unified program/data space that can be mapped entirely as SRAM, used entirely as a second-level cache, or split between the two in some ratio. When an access misses the second-level cache, the data request is passed on to the EDMA and the CPU stalls for at least 13 cycles, so we always try to keep programs and data in on-chip memory. However, even with all of L2 configured as SRAM there is only 256 KB. Taking the CIF format as an example, one frame to be encoded occupies 148.5 KB, and once the reference image for motion estimation is added the total is larger than 256 KB. The author therefore configures L2 as 224 KB of SRAM and 32 KB of L2 cache. The SRAM holds tables, global variables, stack data, and frequently called core routines such as motion search, DCT, and quantization, while the full current image and the reference images have to be placed off chip.
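The 148.5 KB figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes CIF means 352 x 288 pixels with 8-bit 4:2:0 sampling (neither is stated explicitly in the text); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { L2_TOTAL_BYTES = 256 * 1024, L2_SRAM_BYTES = 224 * 1024 };

/* Bytes in one 8-bit 4:2:0 frame: a full-size luma plane plus
 * two chroma planes, each a quarter of the luma size. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + 2 * (width / 2) * (height / 2);
}

/* Returns 1 if the current frame plus n_ref reference frames
 * fit in `budget` bytes, otherwise 0. */
int frames_fit(size_t width, size_t height, size_t n_ref, size_t budget)
{
    return (1 + n_ref) * yuv420_frame_bytes(width, height) <= budget;
}
```

Under these assumptions a CIF frame is 152064 bytes, which is exactly 148.5 KB. One frame alone fits in 256 KB, but a frame plus a single reference frame needs 297 KB and does not, which is consistent with keeping whole images off chip.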
Because the image data resides in off-chip memory, data has to be moved between on-chip and off-chip memory. This can be done by the DM642's powerful EDMA engine, which does not consume CPU cycles, so the CPU is freed from the heavy work of moving data and can concentrate on computation. In the encoder, to keep the CPU from waiting idly for the EDMA to finish, a ping-pong buffer structure can be used: while the EDMA transfers data into one buffer, the CPU processes the data in the other. When both have finished, the ping and pong buffers are swapped.
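The control flow of the ping-pong scheme can be sketched on a host machine as follows. This is not the author's code: `start_transfer` is a hypothetical stand-in that uses `memcpy`, whereas on the DSP the EDMA would copy in the background while the CPU works, and `BATCH_BYTES` and the byte-sum "processing" are placeholders chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BATCH_BYTES 64  /* hypothetical batch size, small for illustration */

/* Stand-in for starting an EDMA transfer. Here the copy is synchronous;
 * on the DSP the call would return at once and the copy would overlap
 * with the CPU work. */
static void start_transfer(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Stand-in for the CPU work on one batch: a simple byte sum. */
static uint32_t process_batch(const uint8_t *buf, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += buf[i];
    return sum;
}

/* Walk n_batches batches of "external" data through two "on-chip" buffers. */
uint32_t pingpong_sum(const uint8_t *external, size_t n_batches)
{
    static uint8_t buf[2][BATCH_BYTES];
    uint32_t total = 0;
    int cur = 0;

    if (n_batches == 0)
        return 0;
    start_transfer(buf[cur], external, BATCH_BYTES);   /* prime the first buffer */
    for (size_t k = 0; k < n_batches; ++k) {
        if (k + 1 < n_batches)                         /* fill the idle buffer */
            start_transfer(buf[1 - cur],
                           external + (k + 1) * BATCH_BYTES, BATCH_BYTES);
        total += process_batch(buf[cur], BATCH_BYTES); /* work on the ready buffer */
        /* a real implementation would wait for transfer completion here */
        cur = 1 - cur;                                 /* swap ping and pong */
    }
    return total;
}
```

The point of the structure is that the transfer for batch k + 1 is started before batch k is processed, so on real hardware the two overlap and the CPU only waits if the transfer takes longer than the processing.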
The data moved by the EDMA comprises the macroblocks to be encoded, the corresponding reference macroblocks from the preceding and following frames, and the reconstructed macroblocks after encoding (optional), each including luma and chroma blocks. The EDMA performs best when it moves a large amount of data at a time. If the ping-pong buffers are swapped after every macroblock, too many CPU cycles are spent on repeatedly configuring the EDMA channel parameters. On the other hand, the limited on-chip memory means that not too many macroblocks can be moved at once; typically 7 to 9 macroblocks are moved per transfer. Since the EDMA synchronization event is issued by the CPU, QDMA naturally comes to mind, but QDMA is suited to single, independent, fast transfers and has no advantage for this kind of periodic, repetitive transfer.
To improve EDMA efficiency, EDMA chaining can be used. Up to 12 EDMA channels are chained together so that a single trigger from the CPU moves the luma and chroma blocks to be encoded, the luma and chroma blocks of the reference frame, ... all in one pass, as shown in Figure 2. When configuring the EDMA channels, we noticed that only the source and destination addresses change frequently, while the other parameters stay constant. The EDMA controller is RAM based: each channel is configured through an entry in a parameter table, and every channel's entry can be found in the 2 KB parameter RAM at 0x01A00000 to 0x01A007FF. To update a channel's source and destination addresses, the new addresses are therefore written directly into the parameter table, without calling the corresponding CSL library functions.
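The idea of rewriting only the two address words of an already-configured channel can be sketched as below. The six-word entry layout, the field names, and `retarget_channel` are assumptions made for illustration and should be checked against the device's EDMA reference guide. The table here is an ordinary array so that the sketch runs on a host; on the DSP the pointer would refer to the parameter RAM instead.

```c
#include <assert.h>
#include <stdint.h>

/* One EDMA parameter entry modelled as six 32-bit words.
 * Field names and order are an assumption for illustration. */
typedef struct {
    uint32_t opt;  /* transfer options */
    uint32_t src;  /* source address */
    uint32_t cnt;  /* frame and element counts */
    uint32_t dst;  /* destination address */
    uint32_t idx;  /* frame and element indexes */
    uint32_t rld;  /* element count reload and link address */
} param_entry;

/* Retarget an already-configured channel: only the two address words
 * are written, every other field keeps its previous value. */
void retarget_channel(volatile param_entry *table, unsigned channel,
                      uint32_t new_src, uint32_t new_dst)
{
    table[channel].src = new_src;
    table[channel].dst = new_dst;
}
```

Compared with reconfiguring a whole channel for every batch, this costs two stores per channel, which matters when the same chain is re-armed for every group of macroblocks.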
Figure 2 EDMA chain description
Figure 3 Hexagon search algorithm