"With the arrival of the mobile Internet age, the popularity of HD multimedia video, and the emergence of large 3D mobile games, single-core embedded hardware platforms can no longer meet the demands of complex real-world computation. Heterogeneous multi-core processors, with their strong video codec capabilities, have become the trend in embedded processor architecture. At present, HD video decoding generally relies on the DSP in a heterogeneous multi-core processor as a cooperating core, with the multimedia data transferred between cores through an on-chip communication mechanism. Compared with software decoding, DSP decoding improves speed and performance; for example, the DSP built into the DaVinci platform can decode 720p video in real time. However, the DSP must configure the mailbox and DMA at run time, which takes up considerable on-chip communication bandwidth and lowers inter-core communication efficiency, and DSP decoding performance is still lower than that of a hardware decoder. To further improve full HD H.264 codec performance, this paper uses the TI OMAP4430 heterogeneous multi-core SoC as the processing platform. What distinguishes it is its built-in dual-core Cortex-A9 main processor, dual-core Cortex-M3 coprocessor, and IVA-HD multimedia hardware decoding acceleration engine. The IVA-HD engine contains seven acceleration engines designed for various video codecs, and each acceleration engine has its own independent data memory, which reduces contention between modules when reading and writing data. In addition, the Virtio buffer queue and the RPMSG message framework are used to implement data communication between the main processing core (A9) and the cooperating core (M3); this communication is based on asynchronous notification and offers high data communication efficiency.
The dual-core Cortex-A9 processor inside the OMAP4430 runs the embedded Linux operating system and is responsible for system task scheduling, audio decoding, and user interface interaction, while the internal Cortex-M3 acts as an auxiliary processing core that manages the IVA-HD acceleration engine to complete the decoding task. The correctness of this design is verified with an example.
1 Main technologies
1.1 Virtio buffer queue
Virtio is an abstraction layer over devices in a paravirtualized hypervisor; here it provides a layer for heterogeneous multi-core data communication. It uses two buffer queues based on asynchronous notification (one for sending data to the coprocessor core, one for receiving data from it) together with ring tables to exchange data with the remote heterogeneous processor. Each buffer queue contains 512 buffers, each buffer is limited to 512 bytes, and the communication data is stored in a buffer pool. To reduce the amount of shared memory, a ring table is used: its entries describe the buffers (including each buffer's size), the table is stored at a specific memory address, and the main processing core and the cooperating core access it as shared memory protected by a mutex, as shown in Figure 1:
Figure 1 Schematic diagram of heterogeneous multi-core access to the Virtio buffer pool
Using a shared ring table for data communication between heterogeneous processor cores has several advantages:
1) Using ring table entries to refer to the data buffers reduces the size of the shared memory area and improves system memory utilization, while allowing the amount of transmitted data to grow.
2) Notifying the destination core by interrupt reduces the time the processor spends busy-waiting and improves processor utilization.
3) Multiple buffers can be in flight at the same time, which improves the throughput of system communication.
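The three points above can be illustrated with a small model. The following is only a sketch under stated assumptions, not the Virtio implementation: the class name BufferQueue, the notify callback (standing in for the interrupt), and the use of a Python lock for the mutex are all hypothetical. The sizes 512 and 512 are the ones given in the text.

```python
import threading
from collections import deque

BUF_COUNT = 512   # buffers per queue, as stated in the text
BUF_SIZE = 512    # maximum bytes per buffer, as stated in the text


class BufferQueue:
    """Toy one-directional queue: ring entries refer to buffers in a pool."""

    def __init__(self, notify):
        self.pool = [bytearray(BUF_SIZE) for _ in range(BUF_COUNT)]
        self.free = deque(range(BUF_COUNT))  # indices of unused buffers
        self.ring = deque()                  # (buffer index, length) entries
        self.lock = threading.Lock()         # stands in for the shared mutex
        self.notify = notify                 # stands in for the interrupt

    def send(self, data: bytes) -> None:
        if len(data) > BUF_SIZE:
            raise ValueError("message exceeds buffer size")
        with self.lock:
            if not self.free:
                raise BufferError("no free buffer")
            idx = self.free.popleft()
            self.pool[idx][:len(data)] = data
            self.ring.append((idx, len(data)))  # only the small entry is queued
        self.notify()                           # asynchronous notification

    def receive(self) -> bytes:
        with self.lock:
            idx, length = self.ring.popleft()
            data = bytes(self.pool[idx][:length])
            self.free.append(idx)               # return the buffer to the pool
        return data


# Usage: two buffers are queued before the receiver reads either of them.
kicks = []
q = BufferQueue(notify=lambda: kicks.append(1))
q.send(b"frame-0")
q.send(b"frame-1")
first = q.receive()
```

In this model the receiver never polls; it would run receive() only after being notified, which is the behaviour point 2) describes.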
1.2 RPMSG message framework
RPMSG (Remote Processor Messaging) is a message framework for data communication between processor cores, built on Virtio. It provides coprocessor core power and reset management, message communication, and other functions.
1.2.1 Coprocessor core reset management
This part is mainly responsible for loading the executable program onto the coprocessor core and running it, and for setting up the MMU that translates virtual addresses to physical addresses. When the coprocessor core hits an error or an internal code exception, it must output clear error information and provide a recovery mechanism so that the coprocessor core can be used again.
1.2.2 Message Communication
The RPMSG message framework implements message communication between the main processing core and the coprocessor core on top of the Virtio buffer queue. RPMSG registers a message bus with the system, together with a corresponding bus device for each M3 core. Multiple client drivers also register on the message bus, and each is assigned a local address port SRC and a remote address port DST. When a client driver needs to send a message, the message is packed into a Virtio buffer and added to the buffer queue, which completes the send. When the message bus receives a message from the coprocessor, it dispatches the message to the client driver according to the message's DST port. The scheme is shown in Figure 2:
Figure 2 Schematic of RPMSG Message Bus Work
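To make the packing step concrete, the sketch below shows one way a message carrying SRC and DST ports could be packed into a 512-byte Virtio buffer. The header layout (three 32-bit fields for source, destination, and a reserved word, followed by 16-bit length and flags, little-endian) is an assumption modeled on the Linux rpmsg header, and the function names and port numbers are hypothetical.

```python
import struct

# Assumed header layout: src, dst, reserved (32-bit each); len, flags (16-bit each).
HDR = struct.Struct("<IIIHH")
BUF_SIZE = 512  # buffer size limit given in the text


def pack_msg(src: int, dst: int, payload: bytes) -> bytes:
    """Prefix the payload with a header; reject messages that cannot fit."""
    if HDR.size + len(payload) > BUF_SIZE:
        raise ValueError("payload does not fit in one buffer")
    return HDR.pack(src, dst, 0, len(payload), 0) + payload


def unpack_msg(buf: bytes):
    """Return (src, dst, payload) from a packed buffer."""
    src, dst, _reserved, length, _flags = HDR.unpack_from(buf)
    return src, dst, buf[HDR.size:HDR.size + length]


# Usage with hypothetical port numbers.
buf = pack_msg(src=1024, dst=60, payload=b"set-params")
decoded = unpack_msg(buf)
```

Under this assumed layout the header takes 16 bytes, so a single buffer carries at most 496 bytes of payload.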
1.3 IVA-HD Acceleration Engine
H.264 / MPEG-4 Part 10 is a high-compression digital video codec standard proposed jointly by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group (MPEG). It is widely used for network streaming media, HDTV, and similar applications. Compared with earlier standards such as MPEG-4 and H.263, H.264 offers a low bit rate, high image quality, a high compression ratio, and high reliability, and it is suitable for channels with severe interference and high packet loss rates.
The H.264 decoding process is shown in Figure 3. The decoder receives the input frame data from the network abstraction layer (NAL) and performs entropy decoding and reordering to obtain the quantized coefficient matrix X. Inverse quantization and the inverse transform are then applied to X to compute the residual Dn. The prediction block Pn is obtained by motion compensation (inter prediction) or by intra prediction. Pn and Dn are added to give uFn, which passes through the loop filter to produce the output image Fn.
Figure 3 H264 decoder workflow
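The reconstruction step of this workflow (adding the residual Dn to the prediction Pn to obtain uFn) can be shown with a small worked example. This is a sketch only: the function name and sample values are made up, and clipping the sum to the 8-bit sample range is stated here as an assumption rather than taken from the text.

```python
def reconstruct(pred, resid, bit_depth=8):
    """uFn = clip(Pn + Dn), applied sample by sample to a 2-D block."""
    hi = (1 << bit_depth) - 1
    return [
        [min(max(p + d, 0), hi) for p, d in zip(prow, drow)]
        for prow, drow in zip(pred, resid)
    ]


# A 2x2 block with made-up values; two samples overflow and are clipped.
Pn = [[120, 130], [250, 4]]
Dn = [[5, -10], [20, -9]]
uFn = reconstruct(Pn, Dn)  # [[125, 120], [255, 0]]
```

The loop filter that turns uFn into Fn is not modeled here.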
The IVA-HD engine is a third-generation hardware acceleration engine designed for multimedia codec acceleration on embedded platforms. It supports H.264, MPEG-4, MPEG-2, H.263, and other common video codec standards. To free the CPU so that it can handle data preparation and logic control more efficiently, IVA-HD integrates seven hardware acceleration engines. Their correspondence with the H.264 decoding function modules of Figure 3 is indicated by dashed boxes in the figure. The modules named Core1-5 perform, respectively: entropy decoding, inverse quantization and inverse transform, loop filtering, intra prediction, and motion compensation.
2 System design
The full HD H.264 decoding task is completed jointly by the main processor (Cortex-A9) and the coprocessor (Cortex-M3). The Cortex-A9 is mainly responsible for reading data from multimedia files or network data streams, demultiplexing the multimedia packets into video and audio streams, building RPMSG control messages, packing them into Virtio buffers, and sending them to the Cortex-M3 coprocessor core to set the control parameters of the IVA-HD acceleration engine. It then sends the multimedia packets to the coprocessor for H.264 decoding. Once the coprocessor has finished decoding, the Cortex-A9 receives the image and draws it to the screen through the DRM API and the KMS module.
There are two Cortex-M3 processing cores on the platform, Sys M3 and App M3, both running the TI BIOS real-time operating system. Sys M3 is mainly responsible for creating the Virtio buffer queues used to communicate with the Cortex-A9, recording program execution and CPU load, receiving the buffers sent by the A9 and parsing their parameters, and placing each buffer in the corresponding App M3 message list according to its DST parameter. The App M3 coprocessor performs the actual decoding work and operates the IVA-HD acceleration engine through Codec Engine, a framework for embedded platforms. App M3 takes the requests that configure the IVA-HD acceleration engine from the message list. During actual decoding, it invokes the IVA-HD acceleration engine through Codec Engine to complete the decoding task and sends the decoded result back to the Cortex-A9 processor through the buffer queue. The block diagram of the whole decoding system is shown in Figure 4:
3 System implementation
3.1 Cortex-A9 software implementation
The Cortex-A9 runs the Linux operating system. Its software comprises the kernel module omapdce.ko, the Virtio buffer queue and RPMSG bus drivers, and calls to the FFmpeg multimedia library and the DRM display interface.
3.1.1 Virtio buffer queue implementation
The Virtio buffer queue exchanges data with the coprocessor through a shared ring table and signals the addition of ring table entries by interrupt. The implementation includes the following steps:
1) irq_require() registers the interrupt handler, and register_bus_type("virtio") registers the Virtio bus with the system.
2) register_virtio_driver(&virtio_driver) registers a driver client on the Virtio bus; this driver creates the device that is registered on the RPMSG bus.
3) After discovering the coprocessor, the system registers it on the Virtio bus through register_virtio_device(); the device contains the function pointers used to create the Virtio buffer queues.
4) virtio_bus->match(&virtio_device, &virtio_driver) checks whether virtio_driver is suitable for virtio_device. If the match succeeds, virtio_driver->probe(virtio_device) creates send_virqueue and recv_virqueue and registers them with the RPMSG rpmsg_device. In this way the Virtio buffer queue is associated with the RPMSG bus.
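The register, match, and probe sequence in steps 2) to 4) follows the usual bus/driver/device pattern, which can be modeled in a few lines. The sketch below is a simplified user-space model, not kernel code: the class names, the device_id field used for matching, and the ID value 7 are hypothetical; only the queue names send_virqueue and recv_virqueue come from the text.

```python
class Bus:
    """Toy bus: binds a device to the first registered driver that matches."""

    def __init__(self, name):
        self.name = name
        self.drivers = []
        self.devices = []

    def match(self, device, driver):
        return device.device_id == driver.device_id

    def register_driver(self, driver):
        self.drivers.append(driver)
        for device in self.devices:
            self._try_bind(device, driver)

    def register_device(self, device):
        self.devices.append(device)
        for driver in self.drivers:
            self._try_bind(device, driver)

    def _try_bind(self, device, driver):
        if device.driver is None and self.match(device, driver):
            device.driver = driver
            driver.probe(device)   # probe runs only after a successful match


class VirtioDevice:
    def __init__(self, device_id):
        self.device_id = device_id
        self.driver = None
        self.queues = {}


class RpmsgVirtioDriver:
    device_id = 7  # hypothetical ID used only for matching in this model

    def probe(self, device):
        # Create the two queues named in the text.
        device.queues = {"send_virqueue": [], "recv_virqueue": []}


bus = Bus("virtio")
bus.register_driver(RpmsgVirtioDriver())
coprocessor = VirtioDevice(device_id=7)
bus.register_device(coprocessor)   # matches, so probe() creates the queues
unrelated = VirtioDevice(device_id=1)
bus.register_device(unrelated)     # no matching driver, stays unbound
```

Registration order does not matter in this model: whichever of the driver or the device arrives second triggers the match.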
3.1.2 RPMSG Message Framework Implementation
Many rpmsg_driver and rpmsg_device instances are attached to the RPMSG bus, and each rpmsg_driver has a local port SRC and a destination port DST. To send a message, rpmsg_send((void *)data, SRC, DST) adds the message to the Virtio buffer queue. When a message msg arrives at the RPMSG bus, the bus hands msg to the rpmsg_driver whose DST attribute equals msg->dst and calls rpmsg_driver->callback() to process it.
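The dispatch rule described above can be sketched as a lookup from DST port to callback. This is an illustrative model only: the class RpmsgBus, its method names, the dictionary message format, and the port numbers are hypothetical and do not reproduce the kernel API.

```python
class RpmsgBus:
    """Toy message bus: routes each incoming message by its DST port."""

    def __init__(self):
        self.drivers = {}    # DST port -> callback
        self.tx_queue = []   # stands in for the Virtio send queue

    def register_driver(self, dst, callback):
        self.drivers[dst] = callback

    def send(self, data, src, dst):
        # Sending only queues the message; delivery happens on the other core.
        self.tx_queue.append({"src": src, "dst": dst, "data": data})

    def deliver(self, msg):
        """Called when a message arrives; returns False if no driver matches."""
        callback = self.drivers.get(msg["dst"])
        if callback is None:
            return False
        callback(msg["data"], msg["src"])
        return True


# Usage with hypothetical ports: one driver listens on port 60.
received = []
bus = RpmsgBus()
bus.register_driver(60, lambda data, src: received.append((src, data)))
bus.send(b"decode", src=1024, dst=60)
handled = bus.deliver({"src": 60, "dst": 60, "data": b"frame-ready"})
dropped = bus.deliver({"src": 60, "dst": 99, "data": b"stray"})
```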
3.1.3 Implementation of the omapdce.ko driver module
The omapdce.ko module acts as an RPMSG driver and implements the kernel side of the engine-related application API, including ioctl_engine_open(), ioctl_viddec_create(), ioctl_viddec_control(), and ioctl_viddec_process(). These are the driver-level implementations behind the application APIs engine_open(), viddec_create(), viddec_control(), and viddec_process(). They call rpmsg_send() and rpmsg_recv() on the RPMSG bus to communicate with the coprocessor and complete the task.
3.1.4 Implementation of the decoding application viddectest
The work of the H.264 decoding application viddectest is divided into the following parts:
1) Initialization of the Linux display interface DRM.
2) Calls to the FFmpeg media library: the multimedia file is opened via AVOpenStreamFile(), avfindstream() separates the audio stream from the video stream, and the video stream packets are then read in sequence and sent to the decoder.
3) Acceleration engine initialization and decode data communication over the message bus: engine_open() opens the H.264 decoding engine, viddec3_create() creates a decoder instance, viddec3_control() sets the parameters required for decoding, and viddec3_process() uses the RPMSG message bus to send the data stream to be decoded and receive the decoded image buffers, as shown in Figure 5:
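The call order in step 3) can be sketched with a stand-in for the coprocessor. Everything in this example is hypothetical: FakeCoprocessor, the request names, the reply format, and the 1920x1080 parameters are assumptions chosen to show the open, create, control, process sequence, not the real engine API.

```python
class FakeCoprocessor:
    """Stands in for the M3 side: records requests and returns canned replies."""

    def __init__(self):
        self.log = []

    def request(self, op, **args):
        self.log.append(op)
        if op == "process":
            # Pretend every packet decodes to one tiny placeholder frame.
            return {"frame": b"\x00" * 4, "consumed": len(args["packet"])}
        return {"handle": len(self.log)}


def decode_stream(m3, packets, width, height):
    """Open the engine, create a decoder, configure it, then decode each packet."""
    engine = m3.request("engine_open", name="h264")["handle"]
    codec = m3.request("create", engine=engine)["handle"]
    m3.request("control", codec=codec, width=width, height=height)
    frames = []
    for packet in packets:
        reply = m3.request("process", codec=codec, packet=packet)
        frames.append(reply["frame"])
    return frames


m3 = FakeCoprocessor()
frames = decode_stream(m3, [b"pkt-0", b"pkt-1"], width=1920, height=1080)
```

Setup happens once per stream, while the process request repeats for every video packet, so the per-packet cost is a single round trip over the message bus.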
3.2 Cortex-M3 software implementation
The dual-core Cortex-M3 runs the TI BIOS real-time operating system, responsible for the host-processed VIR "