Design of a High-Definition H.264 Video Decoding System Combining a Heterogeneous Multi-Core Processor with IVA-HD Hardware Decoding

With the arrival of the mobile Internet era, the spread of high-definition multimedia video, and the emergence of large 3D mobile games, single-core embedded hardware platforms can no longer meet practical computing demands. Heterogeneous multi-core processors have strong advantages in video codec workloads and have become the development trend in embedded processor architecture. At present, high-definition video codecs generally use the DSP in a heterogeneous multi-core processor for cooperative processing and transfer multimedia data between cores through an on-chip communication mechanism. Compared with software decoding, DSP decoding improves speed and performance. For example, the built-in DSP on the DaVinci platform can decode 720p video in real time. However, the mailbox and DMA must be configured while the DSP is running, which occupies considerable on-chip communication bandwidth and lowers inter-core communication efficiency. DSP codec efficiency is also still low compared with a hardware codec.

To further improve Full HD H.264 codec performance, this paper uses the TI OMAP4430 heterogeneous multi-core SoC as the processing platform. Its distinguishing features are a built-in dual-core Cortex-A9 processor, a dual-core Cortex-M3 coprocessor, and the IVA-HD multimedia hardware codec acceleration engine. The IVA-HD engine contains seven acceleration engines designed for various video codecs. Each has an independent data memory, which minimizes contention between modules when reading and writing data. In addition, virtio buffer queues and the rpmsg message framework are used for data communication between the main processing core (A9) and the coprocessing core (M3). This communication is based on asynchronous notification and is efficient for large data transfers.
Inside the OMAP4430, the dual-core Cortex-A9 processor runs the Linux embedded operating system and is responsible for task scheduling, audio decoding, and user interface interaction, while the Cortex-M3 acts as an auxiliary processing core that manages the IVA-HD acceleration engine to complete the decoding task. Finally, an example is used to verify the correctness of the design.

1 Main technologies

1.1 Virtio buffer queue

Virtio is an abstraction layer over devices in a paravirtualized hypervisor, and here it provides the lowest implementation layer for heterogeneous multi-core data communication. It uses two buffer queues based on asynchronous notification (one for sending data to the coprocessing core and one for receiving data from it) and a ring table of buffer descriptors for data communication with the remote heterogeneous processor. Each queue contains up to 512 buffers, each buffer is limited to 512 bytes, and the buffer pool stores the communication data. To minimize shared memory, the table is organized as a ring. Each table entry holds the physical address and the size of a buffer. The table is stored at a fixed memory address, and the main processing core and coprocessing core access the shared memory under a mutual exclusion mechanism, as shown in Figure 1.

Figure 1 Schematic diagram of heterogeneous cores accessing the virtio buffer pool

Using a shared ring table for inter-core data communication has the following main benefits:

1) Representing data buffers by table entries reduces the size of the shared memory area, improves memory utilization, and allows variable-length data transmission.
2) Interrupts notify the destination processor of changes to the shared ring table, which reduces busy waiting and improves processor utilization.
3) Multiple buffers can be transmitted at the same time, which improves communication throughput.

1.2 Rpmsg message framework

Rpmsg (remote processor messaging) is a message framework for data communication between processor cores, built on virtio. It provides power-on and reset management of the coprocessing cores as well as message communication.

1.2.1 Coprocessing core reset management

This part loads the program image into the coprocessing core's memory and sets up the MMU that maps virtual addresses to physical addresses. When the coprocessing core encounters a segmentation fault or an internal code exception, it must output clear error information and provide a recovery mechanism so that the coprocessing core can be used again.

1.2.2 Message communication

The rpmsg message framework carries messages between the main processing core and the coprocessing core over virtio buffer queues. Rpmsg registers a message bus with the system and creates a bus device for each M3 coprocessing core. Multiple client drivers are also registered on the message bus, and each is allocated a local address port SRC and a remote address port DST. When a client driver needs to send a message, it packs the message into a virtio buffer and adds the buffer to the queue. When the message bus receives a message from the coprocessor, it dispatches the message to the matching client driver according to the message's DST port. The scheme is shown in Figure 2.

Figure 2 Working diagram of the rpmsg message bus

1.3 IVA-HD acceleration engine

H.264/MPEG-4 Part 10 is a high-compression digital video codec standard jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). It is widely used in network streaming media, HDTV, and similar applications. Compared with earlier standards such as MPEG-4 and H.263, H.264 offers low bit rate, high image quality, high compression, and high reliability, and is suited to transmission over channels with severe interference and high packet loss.

The H.264 decoding process is shown in Figure 3. The decoder receives input data from the network abstraction layer (NAL) and performs entropy decoding and reordering to obtain the quantized coefficient matrix X. Inverse quantization and inverse transformation then yield the residual Dn, and motion compensation (inter prediction) or intra prediction yields the prediction block Pn. Adding Pn and Dn gives the unfiltered reconstruction uFn, which passes through the loop filter to produce the output frame Fn.

Figure 3 Workflow of the H.264 decoder

The IVA-HD engine is a third-generation hardware acceleration engine designed for multimedia codec acceleration on embedded platforms. It supports H.264, MPEG-4, MPEG-2, H.263, and other common video codec standards. To free the CPU for data preparation and control logic, IVA-HD integrates seven hardware acceleration engines. Their correspondence to the functional modules of H.264 decoding is marked by dotted boxes in Figure 3. The engines named core1 to core5 perform, respectively, entropy decoding, inverse quantization and inverse transformation, loop filtering, intra prediction, and motion compensation.

2 System design

The Full HD H.264 decoding task is completed jointly by the main processor (Cortex-A9) and the auxiliary processor (Cortex-M3).
The Cortex-A9 reads data from multimedia files or network data streams, filters the multimedia packets, and separates the video and audio streams. It builds rpmsg control messages, which are encapsulated in virtio buffers and sent to the Cortex-M3 coprocessing core to set the control parameters of the IVA-HD acceleration engine. It also sends the multimedia packets to the coprocessor for H.264 decoding, receives the images once the coprocessor completes decoding, and draws them on the screen through the DRM API and the KMS module.

The platform has two Cortex-M3 processing cores, Sys M3 and App M3, both running the TI BIOS real-time operating system. Sys M3 creates the virtio buffer queues that communicate with the Cortex-A9, records program execution and CPU load, receives the buffers sent by the A9, and parses their parameters. According to the DST parameter in each buffer, it places the buffer on the corresponding message linked list of App M3. App M3 performs the actual decoding and operates the IVA-HD acceleration engine through the Codec Engine framework for embedded platforms. It takes message requests from the message linked list and sets the status and initialization parameters of the IVA-HD engine accordingly. During decoding, App M3 calls the IVA-HD engine through Codec Engine to complete the decoding task and sends the result back to the Cortex-A9 through the buffer queue.
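
The division of work between Sys M3 and App M3 described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the class names, the CONTROL and DECODE tags, the "key=value" control payload, and the policy of dropping buffers addressed to an unknown port are all hypothetical and do not come from the TI firmware, and the decode step is a placeholder for the Codec Engine call into IVA-HD.

```python
from collections import deque
from dataclasses import dataclass

CONTROL = "control"   # hypothetical tag: set IVA-HD status/initialization parameters
DECODE = "decode"     # hypothetical tag: decode one multimedia packet


@dataclass
class Message:
    dst: int          # DST parameter carried in the buffer
    kind: str         # CONTROL or DECODE (assumed message kinds)
    payload: bytes


class AppM3:
    """Stand-in for App M3: one message list per destination port."""

    def __init__(self, ports):
        self.lists = {port: deque() for port in ports}
        self.engine_params = {}   # models the IVA-HD control parameters
        self.results = []         # results to send back to the Cortex-A9

    def step(self, port):
        """Take one request from the list for `port` and handle it."""
        msg = self.lists[port].popleft()
        if msg.kind == CONTROL:
            # Assumed "key=value" encoding of a control parameter.
            key, _, value = msg.payload.decode().partition("=")
            self.engine_params[key] = value
        else:
            # Placeholder for the Codec Engine call that drives IVA-HD.
            self.results.append(b"decoded:" + msg.payload)


class SysM3:
    """Stand-in for Sys M3: routes received buffers by their DST value."""

    def __init__(self, app):
        self.app = app
        self.dropped = 0

    def route(self, msg):
        target = self.app.lists.get(msg.dst)
        if target is None:
            self.dropped += 1     # assumed policy for unknown ports
            return False
        target.append(msg)
        return True


def demo():
    app = AppM3(ports=[60])
    sys_m3 = SysM3(app)
    sys_m3.route(Message(60, CONTROL, b"width=1920"))
    sys_m3.route(Message(60, DECODE, b"packet0"))
    sys_m3.route(Message(61, DECODE, b"stray"))   # no list for port 61
    app.step(60)
    app.step(60)
    return app.engine_params, app.results, sys_m3.dropped
```

Calling demo() routes one control request and one decode request to port 60 and counts one buffer for an unregistered port as dropped. Keeping routing (Sys M3) separate from handling (App M3) mirrors the split in the text, where only App M3 touches the acceleration engine.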
The overall decoding framework of the system is shown in Figure 4.

3 System implementation

3.1 Cortex-A9 software implementation

The Cortex-A9 runs the Linux operating system. Its software comprises the kernel module omapdce.ko, the virtio buffer queue and rpmsg bus drivers, the ffmpeg multimedia library, and calls to the DRM display interface.

3.1.1 Implementation of the virtio buffer queue

The virtio buffer queue communicates with the coprocessor through a shared ring table and notifies the other side by interrupt when entries are added. The implementation involves the following steps:

1) irq_require() registers the interrupt handler, and register_bus_type("virtio") registers the virtio bus with the system.
2) register_virtio_driver(&virtio_driver) registers a driver client with the virtio bus; this driver creates the device that is registered with the rpmsg bus.
3) After discovering the coprocessor, the system calls register_virtio_device(&virtio_device) to register a device with the virtio bus. The device contains the function pointers used to create the virtio buffer queues.
4) virtio_bus->match(&virtio_device, &virtio_driver) checks whether virtio_driver and virtio_device match. If they do, virtio_driver->probe(virtio_device) is called to create send_virqueue and recv_virqueue and to register an rpmsg_device with the rpmsg bus. The virtio buffer queues are thereby associated with the rpmsg bus.

3.1.2 Implementation of the rpmsg message framework

Many rpmsg_driver and rpmsg_device instances are attached to the rpmsg bus, and each rpmsg_driver has a local port SRC and a destination port DST. Every time a message is sent, rpmsg_send((void *)data, SRC, DST) is called to add the message to the virtio buffer queue. When a message msg arrives on the rpmsg bus, the bus hands it to the rpmsg_driver whose DST attribute equals msg->dst and calls rpmsg_driver->callback() to process it.
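
The send and dispatch paths of Section 3.1.2 can be modeled in user-space Python as follows. This is a minimal sketch, not the kernel implementation: the 8-byte header holding only SRC and DST is a hypothetical layout chosen for the example, the class and helper names are invented, and drivers are keyed by the port on which they expect to receive messages. The limits of 512 buffers and 512 bytes per buffer are taken from Section 1.1.

```python
import struct
from collections import deque

MAX_BUFFERS = 512    # buffers per queue, as stated in Section 1.1
BUFFER_SIZE = 512    # bytes per buffer, as stated in Section 1.1
HEADER = struct.Struct("<II")   # hypothetical header: src, dst (little-endian)


class VirtQueue:
    """Bounded FIFO standing in for one virtio buffer queue."""

    def __init__(self):
        self._buffers = deque()

    def add(self, buf):
        if len(buf) > BUFFER_SIZE:
            raise ValueError("message does not fit in one buffer")
        if len(self._buffers) >= MAX_BUFFERS:
            raise BufferError("queue is full")
        self._buffers.append(buf)

    def get(self):
        """Return the oldest buffer, or None if the queue is empty."""
        return self._buffers.popleft() if self._buffers else None


def pack(src, dst, data):
    return HEADER.pack(src, dst) + data


def unpack(buf):
    src, dst = HEADER.unpack_from(buf)
    return src, dst, buf[HEADER.size:]


class RpmsgBus:
    """Model of the bus: send into a queue, dispatch received messages by DST."""

    def __init__(self):
        self.send_vq = VirtQueue()
        self.drivers = {}            # destination port -> callback

    def register_driver(self, port, callback):
        self.drivers[port] = callback

    def rpmsg_send(self, data, src, dst):
        # Package the message into one buffer and add it to the send queue.
        self.send_vq.add(pack(src, dst, data))

    def on_receive(self, buf):
        # Hand the message to the driver registered for its DST port.
        src, dst, data = unpack(buf)
        callback = self.drivers.get(dst)
        if callback is None:
            return False
        callback(data, src)
        return True
```

A client would call register_driver() once with its port and a callback, then rpmsg_send() for each outgoing message. In this model the header counts against the 512-byte limit, so the usable payload is 504 bytes; the real payload limit depends on the actual rpmsg header, which is not modeled here.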
3.1.3 Implementation of the omapdce.ko driver module

The omapdce.ko module acts as an rpmsg driver and provides the kernel implementation of the engine-related application APIs, mainly ioctl_engine_open(), ioctl_viddec_create(), ioctl_viddec_control(), and ioctl_viddec_process(). These are the driver-side implementations of the application APIs engine_open(), viddec_create(), viddec_control(), and viddec_process(). They call rpmsg_send() and rpmsg_recv() on the rpmsg bus to communicate with the coprocessor and complete the task.

3.1.4 Implementation of the decoding application viddectest

The H.264 decoding application viddectest works in the following stages:

1) Initialize the Linux display interface DRM: open the /dev/dri/card0 device file with drmOpen(), obtain the device resources with drmModeGetResources(), create the frame buffer with drmModeAddFB2(), and set the output resolution and mode with drmModeSetCrtc().
2) Call the ffmpeg media library: open the multimedia file with avopenstreamfile(), separate the audio and video streams with avfindstream(), then read video stream packets with avgetpacket() and send them to the decoder.
3) Initialize the acceleration engine and exchange decoding data over the message bus: engine_open() opens the H.264 decoding engine, viddec3_create() creates a decoder instance, viddec3_control() sets the parameters required for decoding, and viddec3_process() sends the stream data to be decoded and receives the decoded image buffers over the rpmsg message bus.

The flowchart is shown in Figure 5.

3.2 Cortex-M3 software implementation

The dual-core Cortex-M3 runs the TI BIOS real-time operating system. It communicates with the main processing core through the virtio buffer queues and calls the IVA-HD acceleration engine through Codec Engine to perform H.264 decoding.
The operation flow is shown in Figure 6 and mainly includes the following steps:

1) virqueue_create(&send_queue) and virqueue_create(&recv_queue) create the virtio send and receive buffer queues that communicate with the Cortex-A9 main processing core.
2) message_get_queue(&recv_queue) obtains request data from the virtio buffer queue, and message_send_queue() forwards it to the message queue of App M3.
3) App M3 takes the message from the message linked list, then sets the working state of the IVA-HD acceleration engine and initializes it. If the message is a decoding message, App M3 calls the IVA-HD engine through Codec Engine to complete decoding.
4) The decoded image buffer is encapsulated in a virtio buffer, and message_send_queue() sends it back to the A9 main processing core through the virtio buffer queue. The A9 then calls DRM to display the output.

4 Test

This paper is based on the OMAP4430 development platform