H.264 achieves excellent coding efficiency at the cost of high computational complexity, although many of its modules are well suited to hardware implementation. Practical systems commonly use a CPU + FPGA structure: the time-consuming modules are implemented in the FPGA, while the CPU handles only the low-complexity algorithms and the configuration of the encoding parameters.
The hardware implementation of the integer transform and quantization in [1] does not consider data widths during processing, which wastes FPGA resources. This paper takes full account of the dynamic range of the data during processing and implements the H.264 integer transform and quantization modules with fewer FPGA resources.
With practical applications in mind, the soft core is optimized separately for calculation speed and for hardware resources. The speed-optimized soft core performs significantly better than the design in [1], and the resource-optimized soft core can still meet the requirements of high-resolution real-time coding.
1 Integer Transformation
After intra-frame and inter-frame prediction, the image data must be integer-transformed and quantized, so that the energy of the image data is concentrated in a small number of coefficients, further reducing the bit rate.
1.1 Integer Transformation Principle
The image transform in H.264 is expressed as $Y = A X A^{T}$, where X is the input data, A is the transform matrix, and Y is the transform result. H.264 operates on 4 × 4 image blocks, so A is the corresponding 4 × 4 DCT transform matrix.
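The transform can be factored as $Y = (C X C^{T}) \otimes E$, where C is an integer matrix and E is a scaling matrix applied element by element. The multiplication by E is merged into the quantization operation, so only integer addition, subtraction, and shift operations remain in $C X C^{T}$. This greatly reduces the complexity of the hardware implementation, and the results need only 16-bit data.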
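For reference, the matrices A, C, and E are usually written as follows in descriptions of the H.264 4 × 4 transform. This is the commonly cited form, offered as an assumption about the expressions the text refers to rather than a reproduction of them; the constants a, b, and c are the usual DCT scaling constants.

```latex
% Standard 4x4 DCT matrix
A = \begin{bmatrix}
a & a & a & a \\
b & c & -c & -b \\
a & -a & -a & a \\
c & -b & b & -c
\end{bmatrix},
\qquad
a = \tfrac{1}{2},\quad
b = \sqrt{\tfrac{1}{2}}\cos\tfrac{\pi}{8},\quad
c = \sqrt{\tfrac{1}{2}}\cos\tfrac{3\pi}{8}

% Integer core transform and element-wise scaling matrix.
% Here b is approximated by \sqrt{2/5}, so that c/b = 1/2.
Y = \left( C X C^{T} \right) \otimes E,
\qquad
C = \begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{bmatrix},
\qquad
E = \begin{bmatrix}
a^{2} & \tfrac{ab}{2} & a^{2} & \tfrac{ab}{2} \\
\tfrac{ab}{2} & \tfrac{b^{2}}{4} & \tfrac{ab}{2} & \tfrac{b^{2}}{4} \\
a^{2} & \tfrac{ab}{2} & a^{2} & \tfrac{ab}{2} \\
\tfrac{ab}{2} & \tfrac{b^{2}}{4} & \tfrac{ab}{2} & \tfrac{b^{2}}{4}
\end{bmatrix}
```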
1.2 Integer Transformation Hardware Structure
The 4 × 4 integer transform in this design uses a fast butterfly algorithm, as shown in Figure 1. First, a one-dimensional integer transform is applied to each row of the 4 × 4 block; a one-dimensional integer transform is then applied to each column of the row-transform result, which yields the final 4 × 4 integer transform result. The module in Figure 1 requires 32 adders and 32 subtractors, so a 4 × 4 block can be transformed within one clock cycle.
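The row-then-column butterfly can be sketched in software as below. This is a minimal model, not the hardware description of Figure 1: the function names are hypothetical, and the butterfly is assumed to realize the standard H.264 core matrix C with rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1), (1, -2, 2, -1). Each one-dimensional butterfly uses 4 additions and 4 subtractions, so the 8 butterflies of a block account for the 32 adders and 32 subtractors mentioned above.

```python
# Assumed H.264 4x4 core transform matrix (not reproduced in the text).
C = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def butterfly_1d(x0, x1, x2, x3):
    """One-dimensional transform: 4 additions, 4 subtractions, 2 shifts."""
    s0 = x0 + x3                      # adder
    s1 = x1 + x2                      # adder
    d0 = x0 - x3                      # subtractor
    d1 = x1 - x2                      # subtractor
    return (
        s0 + s1,                      # adder
        (d0 << 1) + d1,               # shift + adder
        s0 - s1,                      # subtractor
        d0 - (d1 << 1),               # shift + subtractor
    )


def forward_transform_4x4(block):
    """Apply the butterfly to every row, then to every column."""
    rows = [butterfly_1d(*row) for row in block]
    cols = [butterfly_1d(*col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def matmul(a, b):
    """Plain 4x4 integer matrix product, used only as a reference."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def reference_transform_4x4(block):
    """Direct evaluation of C * X * C^T for comparison."""
    c_t = [list(col) for col in zip(*C)]
    return matmul(matmul(C, block), c_t)


sample = [
    [5, 11, 8, 10],
    [9, 8, 4, 12],
    [1, 10, 11, 4],
    [19, 6, 15, 7],
]
print(forward_transform_4x4(sample) == reference_transform_4x4(sample))
```

Comparing the butterfly against the direct matrix product is a cheap way to check a hardware model before simulation.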
The usual practice is to use 16-bit adders and subtractors throughout the integer transform, which simplifies the design. In practice, however, the input pixels lie in [0, 255], so a 16-bit data width in the first transform stage takes up more hardware resources than necessary. Analysis shows that after the one-dimensional row transform the output range is [-765, 1020], which needs only an 11-bit binary number. After the one-dimensional column transform the output range is [-3 060, 4 080], which needs a 13-bit binary number. The data widths of the integer transform module can therefore be set as follows: the adders and subtractors of the one-dimensional row transform use an 11-bit data width, and those of the one-dimensional column transform use a 13-bit data width.
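The width choice for the two stages can be checked with a few lines of arithmetic. The sketch below is illustrative only: `signed_bits` and `row_stage_range` are hypothetical helpers, the core matrix C is assumed to be the standard H.264 one, and the column-stage range is taken as stated in the text rather than derived.

```python
# Assumed H.264 4x4 core transform matrix (not reproduced in the text).
C = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]
PIXEL_MIN, PIXEL_MAX = 0, 255


def signed_bits(lo, hi):
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def row_stage_range():
    """Extreme outputs of the 1-D row transform for inputs in [0, 255]."""
    lo = min(sum(c * (PIXEL_MAX if c < 0 else PIXEL_MIN) for c in row)
             for row in C)
    hi = max(sum(c * (PIXEL_MAX if c > 0 else PIXEL_MIN) for c in row)
             for row in C)
    return lo, hi


ROW_RANGE = row_stage_range()
COLUMN_RANGE = (-3060, 4080)   # range of the column stage as stated in the text

print("row stage:", ROW_RANGE, signed_bits(*ROW_RANGE), "bits")
print("column stage:", COLUMN_RANGE, signed_bits(*COLUMN_RANGE), "bits")
```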
Because the data flows in one direction with no feedback, pipelining can be used to improve system performance. When pipelining, attention must be paid to matching the processing rates of the stages before and after each pipeline register. In this module a pipeline register is inserted between the row transform and the column transform; both stages are driven by the same clock, so the operating frequency of the whole module is set by the stage with the lowest operating frequency. This design uses a 1-stage pipeline to raise the operating frequency. A deeper pipeline would consume a large amount of FPGA resources, whereas simulation shows that a 1-stage pipeline adds only very few FPGA resources to this module.
2 Quantization
To further reduce the image transmission rate, the image must be compressed further; this is done with transform coding and quantization.
2.1 Principle of Quantization
A scalar quantizer is used in H.264. The principle of the scalar quantizer is $FQ = \mathrm{round}(Y / Q_{step})$,
where Y is the input sample, Qstep is the quantization step size, and FQ is the quantized value of Y. The H.264 standard supports 52 quantization step sizes. The simplified quantization operation is $|Z_{ij}| = (|W_{ij}| \cdot MF + f) \gg qbits$ (4) and $\mathrm{sign}(Z_{ij}) = \mathrm{sign}(W_{ij})$ (5), with $qbits = 15 + \lfloor QP / 6 \rfloor$,
where W_ij is the input sample and MF is the value defined in the standard. QP is the quantization parameter and f is the offset: f takes 2^qbits / 3 for intra-predicted image blocks and 2^qbits / 6 for inter-predicted image blocks.
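The simplified quantization can be sketched for a single coefficient as follows. This is a software model under stated assumptions, not the circuit itself: the function names are hypothetical, the operation is assumed to be |Z| = (|W| · MF + f) >> qbits with qbits = 15 + floor(QP / 6) and the sign of W restored afterwards, and the MF values are those commonly tabulated for H.264, which should be checked against the standard.

```python
# Commonly tabulated MF values, indexed by QP % 6 (assumption, see lead-in).
# Columns: positions (0,0),(0,2),(2,0),(2,2) | (1,1),(1,3),(3,1),(3,3) | others
MF_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def multiplication_factor(qp, i, j):
    """Look up MF from QP and the coefficient position (i, j) in the block."""
    even_even, odd_odd, mixed = MF_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return mixed


def quantize(w, qp, i, j, intra=True):
    """Quantize one transform coefficient w at position (i, j)."""
    qbits = 15 + qp // 6
    f = (1 << qbits) // (3 if intra else 6)   # rounding offset
    level = (abs(w) * multiplication_factor(qp, i, j) + f) >> qbits
    return -level if w < 0 else level


print(quantize(4080, qp=0, i=0, j=0, intra=True))
print(quantize(-100, qp=6, i=1, j=1, intra=False))
```

Working on the magnitude and restoring the sign afterwards keeps the rounding symmetric around zero.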
2.2 Quantizer Implementation
The quantizer hardware structure is shown in the accompanying figure. It is a combinational circuit that implements formula (4) and formula (5). W is the data to be quantized, and MF is obtained from a lookup table according to QP and the position of the current point. f and qbits are the outputs of a lookup table indexed by QP.
To minimize hardware resource consumption, the dynamic range at each step of the calculation must be analyzed, so that the data are represented with the smallest data width and the combinational logic is implemented with the fewest hardware resources. W is the integer transform result; according to the analysis in the integer transform section, its range is [-3 060, 4 080]. The maximum MF is 13 107, so the multiplier output range is [-40 107 420, 53 476 560], which requires at least a 27-bit binary number. The multiplier of the quantizer uses the 18 × 18 hardware multiplier in the Virtex-II, so the data width is adjusted to 27 bits before the adder input. The 27-bit data output by the adder is then reduced to a 16-bit data width by the shift module.
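The datapath widths quoted above can be reproduced with a short calculation. The sketch below uses only the ranges given in the text; `signed_bits` is a hypothetical helper, and the check on the shifted output assumes the smallest shift, qbits = 15, with the intra offset 2^qbits / 3.

```python
W_RANGE = (-3060, 4080)   # integer transform output range, from the text
MF_MAX = 13107            # largest MF value, from the text


def signed_bits(lo, hi):
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


# Multiplier output range and the width it needs.
product_range = (W_RANGE[0] * MF_MAX, W_RANGE[1] * MF_MAX)
product_bits = signed_bits(*product_range)

# Both operands fit an 18 x 18 signed hardware multiplier.
w_bits = signed_bits(*W_RANGE)
mf_bits = signed_bits(0, MF_MAX)

# Largest level after the adder and the shift, assuming qbits = 15.
qbits_min = 15
max_level = (W_RANGE[1] * MF_MAX + (1 << qbits_min) // 3) >> qbits_min

print("product range:", product_range, "->", product_bits, "bits")
print("operand widths:", w_bits, "and", mf_bits, "bits")
print("largest level:", max_level, "->", signed_bits(-max_level, max_level), "bits")
```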
Weighing operating frequency against resource usage, the design uses a 2-stage pipeline in the quantizer to improve performance.
2.3 Quantizer Optimization
The basic quantizer can quantize only one point at a time, which cannot meet the requirements of high-quality image coding. The quantizer is therefore optimized for high-quality image encoding: N quantizers are connected in parallel, so that N points are processed in one clock cycle. N is chosen according to actual needs. To match the speed of the transform module, 16 quantizers are used in parallel.
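The effect of N on throughput can be illustrated with a small scheduling model. This is a sketch only: `schedule_lanes` is a hypothetical helper that groups the 16 coefficients of a block into the sets that N parallel quantizers would process on successive clock cycles, ignoring pipeline latency.

```python
def schedule_lanes(coeffs, n_lanes):
    """Group coefficients into the batches handled on each clock cycle."""
    return [coeffs[k:k + n_lanes] for k in range(0, len(coeffs), n_lanes)]


block = list(range(16))   # indices of the 16 coefficients of a 4x4 block
for n in (1, 4, 16):
    cycles = len(schedule_lanes(block, n))
    print(n, "quantizer(s):", cycles, "clock cycle(s) per 4x4 block")
```

With N = 16 a whole block is quantized per clock cycle, which matches a transform module that delivers one 4 × 4 block per clock cycle.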
3 Soft Core Design and Test
3.1 PLB Bus Soft Core Structure
The Processor Local Bus (PLB) is a high-performance on-chip bus developed by IBM. It is used mainly in PowerPC 405 processor systems and supports 32-bit, 64-bit, and 128-bit data widths. This design uses a 64-bit bus width with a maximum transfer rate of 800 MB/s.
As shown in Figure 3, the PLB soft core is divided into two parts: the bus interface and the H.264 integer transform and quantization module. The PLB bus interface is the part tied to the hardware architecture; it is the bridge through which the integer transform and quantization module interacts with other devices on the PLB bus. The integer transform and quantization module is independent of the architecture and can also be ported to the ARM architecture.
The hardware structure of the integer transform and quantization module is shown in Figure 4. The register group contains nine 32-bit registers: registers 0, 1, 2, and 3 hold the input 4 × 4 data to be transformed; registers 5, 6, 7, and 8 are the output data registers; register 4 is the control register, with the fields GO, DONE, RESET, Intra, DCT, ZSCAN, QP, and DATACOUNT, which correspond to start, done, reset, intra/inter-frame, DCT/Hadamard transform, Z-scan output, quantization level, and data output counter. The second module performs the 4 × 4 integer transform on the data in the register group, using a 1-stage pipeline. The third module quantizes the integer transform results; the MF, f, and qbits values used by the quantization module are stored in lookup tables in the FPGA. The data output module is controlled by the ZSCAN bit: when ZSCAN = 1, the data in the output data registers are arranged in Z-scan order; when ZSCAN = 0, they are arranged in matrix order.
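The two output orderings selected by ZSCAN can be modelled as below. This sketch assumes that "Z scan" refers to the standard 4 × 4 zigzag scan used for frame-coded blocks in H.264; the text does not spell out the order, and `arrange_output` is a hypothetical helper, not part of the soft core interface.

```python
# Standard 4x4 zigzag order as raster indices (assumed meaning of "Z scan").
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def arrange_output(levels, zscan):
    """Return the 16 quantized levels in zigzag order or in matrix order."""
    flat = [value for row in levels for value in row]   # matrix (raster) order
    if zscan:
        return [flat[index] for index in ZIGZAG_4X4]
    return flat


levels = [
    [12, 3, 0, 0],
    [-4, 1, 0, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 0],
]
print(arrange_output(levels, zscan=1))
print(arrange_output(levels, zscan=0))
```

The zigzag order places the low-frequency coefficients first, so the trailing zeros of a typical block end up grouped at the end of the output.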
3.2 Soft Core Test
The soft core is first simulated with IBM's CoreConnect tools and ModelSim 6.0a and tested on a virtual platform. The purpose of the simulation is to ensure that the processor can access the soft core correctly through the PLB bus interface. The structure of the simulation test platform is shown in Figure 5. The virtual processor writes data from the virtual memory to the soft core under test, then reads the data back from the soft core and checks whether it is correct.
After the simulation passes, the soft core is integrated into the system to verify that it meets the requirements in an actual system. The structure of the verification platform is shown in Figure 6. Soft core verification combines a PC with a target board. The target board is the Xilinx XUP Virtex-II Pro development board, whose FPGA contains two PowerPC cores.
The verification process is shown in Figure 7. First, the target board program is downloaded through the USB port and starts running; it receives the test data from the host computer and saves the downloaded data in the 256 MB DDR SDRAM on the target board. After the data is downloaded, the program sequentially writes the data to be encoded to the soft core, starts the transform, and writes the calculated data back to DDR SDRAM. When all the data have been encoded, the PowerPC sends the processing results to the host computer. The host computer compares the results with those of the C code executed on the host, and thereby confirms whether the soft core works correctly.
4 Synthesis Results Analysis
The integer transform and quantization modules in this design are synthesized for the Xilinx XC2VP30 FPGA. The XC2VP30 contains 13 696 slices and 136 18 × 18 multipliers. The synthesis tool is Xilinx ISE 9.1.
The synthesis results of the integer transform module are shown in Table 1.
The quantizer is optimized separately for speed and for hardware resources, and the synthesis results of both optimizations are given.
Table 3 compares the integer transform and quantization module of this design with that of [1]. The comparison shows that the coding performance of the speed-optimized soft core is much higher than that of the design in [1], and the performance of the resource-optimized soft core is also slightly higher than that of the design in [1].
The synthesis results of the soft core are shown in Table 4. At present, some parts of the soft core occupy a large amount of resources; these will be optimized further in follow-up work.
This paper combines the H.264 integer transform and quantization with a microprocessor system, implements two PLB bus-based H.264 integer transform and quantization soft cores for different applications, and verifies them on the Xilinx XUP Virtex-II Pro development board. The experimental results show that both soft cores operate correctly in the system and meet the real-time requirements of different resolutions.