Efficiency verification of the OTFDEC hardware module on the STM32H735G-DK board

"Preface Starting from the stm32h73x series, we introduced a new peripheral module, otfdec. Its full name is on the fly decryption. Its introduction can help you solve the pain points of code protection. Introduction to otfdec As we all know, the code is stored in on-chip flash. As long as the JTAG debugging port is protected and the on-chip key code is isolated, it is quite safe to prevent logical attack and direct detection. However, the capacity of on-chip flash is limited after all. In some applications, we need to put the code into off-chip flash storage or even execute it directly from off-chip flash. Off chip flash is much more fragile in anti attack than on-chip flash. Off chip flash generally does not have any hardware protection. As long as you know its material number and its reading and writing timing, it is not difficult to read out the contents. Therefore, a natural idea is to encrypt the code and then put it on the off chip flash. In this way, even if others read the ciphertext code inside, they can't get the effective information of the code as long as they don't have a key. For example, the typical topology in the film: the encryption code is placed in the external Octo SPI flash. For this natural approach, when executing off chip encryption code, previous MCU needs to call OSPI driver to read the ciphertext code, such as putting it into SRAM. Then, the software or hardware of MCU is used to decrypt and recover the code plaintext to another area of SRAM. Finally, MCU executes plaintext code from this SRAM. Now we introduce the otfdec hardware module, which is located between the bus matrix and the Octo SPI interface. After it is configured, the kernel executes the ciphertext code on the off chip flash (here, the mapping address of Octo SPI flash is 0x9000 0000). There is no need to reverse the SRAM in the middle, but directly send the decrypted code to the bus matrix for internal core execution under the action of otfdec. 
In other words, with OTFDEC in place, executing encrypted code on external flash looks the same to the CPU as executing plaintext code from on-chip flash. To minimize the latency added by decryption, OTFDEC works in AES-128-CTR mode. No AES chaining mode is used, so that the ciphertext at the target address can be decrypted as quickly as possible. The encrypted code stored in the external Octo-SPI flash must therefore be produced with the same AES-128-CTR operation. Note that Octo-SPI must be configured in memory-mapped mode for this to work. At present, the STM32H73x, STM32L56x and STM32U585 series of the STM32 family integrate the OTFDEC module.

This article does not explain how to use OTFDEC. Instead it answers a question customers often asked when we introduced OTFDEC some time ago: compared with executing plaintext code directly from external flash, how much delay does OTFDEC decryption add when executing encrypted code from external flash?

Experimental design

We design an experiment to measure how efficiently the core executes ciphertext code on external flash with OTFDEC involved. I took a self-test program from mbedTLS, crypto_selftest, and checked how the time the core needs to run the complete self-test when it is encrypted and placed in external flash differs from the time needed to run it as plaintext from external flash. To put the result in context, a further scenario is added: how much faster does the core run the self-test when it is stored as plaintext in on-chip flash? The crypto self-test program is compiled at the highest optimization level and is about 63 KB in size.

The first scenario is the most common one: the test program is programmed directly into on-chip flash and runs from there.
Let's first look at the self-test program. It mainly runs the self-tests listed in the selftests function array; the user selects which sub-items are included through the mbedtls_conf.h header file. I have selected 6 sub-items. Before the self-test starts, the program decides whether to enable the cache by checking whether the user button is pressed. The STM32H735 integrates an Arm Cortex-M7 core with a 32 KB instruction cache and a 32 KB data cache.

To measure the time cost of the self-test, we enable a core counter, reset it at the start of each sub-item, and store its current value in a global timestamp array when the sub-item ends. After all six sub-items have finished, the time cost is computed from the values in the timestamp array and the current core clock frequency.

Since scenario 1 is the most common usage, with the program running from on-chip flash, its linker file is the default one from the STM32Cube package. Taking IAR as an example, I show where the code is stored in this scenario, including the reset and interrupt vector table.

In the second scenario the self-test program runs from external flash. The STM32 cannot boot from external flash; as usual, it starts from the first address of on-chip flash, so we place a bootloader there. Its job is simple: it initializes the OSPI interface and configures it in memory-mapped mode, then sets the stack pointer (SP) and the program counter (PC) and jumps to the external OSPI flash at 0x9000 0000, where the crypto self-test program resides. The self-test project for scenario 2, crypto_selftest_ext_plain, needs only two modifications compared with the previous project.
First, the linker file places the reset and interrupt vector table at 0x9000 0000. Second, the core's VTOR register is set accordingly, so that on any interrupt or exception the handler address is fetched from the vector table at 0x9000 0000.

In the third scenario, the bootloader project additionally has to configure OTFDEC, and the content programmed at 0x9000 0000 must be the encrypted version of the project.bin generated by the scenario 2 project. On the slide, the left side shows OTFDEC decryption being configured in the bootloader, and the right side shows the code being encrypted in advance with PC tools. Since AES is a symmetric algorithm, the parameters configured in OTFDEC must be consistent with those used by the PC encryption tool.

Let's first set the OTFDEC decryption parameters: the key and the initialization vector (IV). The key is chosen by the user; in the code it is set in the key array. Given how the array is written, and given that the Arm Cortex-M core is little-endian, the 16 key bytes are stored in memory in the order shown on the left of the figure. Note that I deliberately made every byte of the 16-byte key different; the reason will become clear shortly. For the IV, the HAL driver provides a structure for the user to fill in: the nonce, the external flash address range handled by OTFDEC, and the version number of the code stored in that range. The nonce is also chosen by the user, and again I deliberately made its 8 bytes all different.

Next we configure the PC-side encryption tool; here we use OpenSSL. Once the OTFDEC decryption key is set, the key passed to OpenSSL must be that key with its 16 bytes reversed end to end, byte by byte. The bit order inside each byte does not change: each byte keeps its value and only its position changes.
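The key reordering described above can be sketched in a few lines of Python on the PC side. This is only an illustration under stated assumptions: the key is assumed to be written in the firmware as four 32-bit words, the word values are hypothetical (chosen so that every byte differs), and the helper names `key_bytes_in_memory` and `openssl_key_hex` are made up for this example.

```python
import struct


def key_bytes_in_memory(words):
    """Bytes of a 4-word key as a little-endian core stores them (lowest address first)."""
    if len(words) != 4:
        raise ValueError("an AES-128 key is four 32-bit words")
    return struct.pack("<4I", *words)


def openssl_key_hex(key_bytes):
    """Hex string for `openssl enc -K`: the 16 key bytes reversed end to end."""
    if len(key_bytes) != 16:
        raise ValueError("an AES-128 key is 16 bytes")
    # Only byte positions change; the value of each byte is untouched.
    return key_bytes[::-1].hex()


# Hypothetical key words, not the key used in the article.
KEY_WORDS = [0x03020100, 0x07060504, 0x0B0A0908, 0x0F0E0D0C]

in_memory = key_bytes_in_memory(KEY_WORDS)
print("memory order:", in_memory.hex())
print("openssl -K  :", openssl_key_hex(in_memory))
```

Because only byte positions move, applying the reversal twice gives back the original key, which makes the mapping easy to check by eye when every byte has a distinct value.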
This byte reordering is why I deliberately gave each of the 16 bytes of the OTFDEC key a different value: it makes it easy to see where each byte moves. The reordering is needed because of how the OTFDEC circuit is designed. We do not need to dig into the reason; it is enough to know what to do given this design. Look at the OpenSSL command on the slide. The -K option is followed by the key as a string of bytes: the first byte is 0x9a, followed by 0xbc and 0xde, in the same order as the bytes in the table on the slide.

Now the IV. In the code, the OTFDEC IV is determined once every member of the HAL driver's OTFDEC_RegionConfig structure has been assigned. How should this IV be ordered for OpenSSL? As the figure shows, the first 32-bit word comes from nonce[1], with its 4 bytes reversed end to end. The second 32-bit word comes from nonce[0], with the same byte reordering. The upper 2 bytes of the third word come from the version, reordered in the same way. The fourth 32-bit word is formed from the shifted start address combined with the region ID. In the OpenSSL command on the slide, the -iv option is followed by the initialization vector, again as a string of bytes: the first byte is 0x13, followed by 0x57 and 0x9b, in the same order as the bytes in the table on the slide.

With the key and IV for the OpenSSL command settled, one more important thing has to be adjusted: the data that OTFDEC will decrypt. It is not enough to simply encrypt the plaintext code project.bin with OpenSSL using the parameters above.
Again because different AES tools order bytes differently, the data has to be adjusted manually. Here we use PC-side command-line tools: srec_cat first pads the input byte stream, and then xxd is used to adjust the byte order. The rule is the same as for the key: within every 16-byte block, the bytes are reversed end to end. The bit order inside each byte stays the same, so each byte keeps its value and only moves to a new position. The reordered byte stream is then passed to OpenSSL for encryption. The ciphertext must go through the same byte reordering once more to give the encrypted code that can be programmed into off-chip flash (0x9000 0000) and decrypted on the fly by OTFDEC. Open a CMD window, change to the utilities/exttools directory of the reference example package for this document, and enter the commands from the previous slide in order to obtain the final output of the preprocessing stage, project_pad_pre_enc_post.bin.

We can use STM32CubeProgrammer to verify that, once OTFDEC is configured, plaintext code is visible starting at 0x9000 0000. The verification steps are given on the slide.

Next, we run the board standalone with scenario 3. The on-board LCD shows the time cost printed after the self-test completes. Depending on whether the user button is held during reset, the results with and without cache are displayed. The total time cost line shows that with the cache disabled the execution time is 8 seconds, while with the cache enabled it is only 0.2 seconds. For scenarios 1 and 2, we download the corresponding bootloader and self-test projects to the board, run them, and record their time costs in the same way.
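The time costs reported on the LCD are derived from the counter values stored per sub-item, as described for the self-test program. A minimal sketch of that conversion in Python follows, assuming the counter increments once per core clock cycle. The 500 MHz clock and the cycle counts are hypothetical placeholders, not measurements from this experiment, and `cycles_to_seconds` is a name made up for this example.

```python
def cycles_to_seconds(cycles, core_hz):
    """Convert a cycle count to seconds for a counter clocked at core_hz."""
    if core_hz <= 0:
        raise ValueError("core frequency must be positive")
    return cycles / core_hz


# Hypothetical core clock; use the frequency the firmware actually runs at.
CORE_HZ = 500_000_000

# Hypothetical cycle counts, one per self-test sub-item. The counter is
# reset before each sub-item, so each entry is that sub-item's own cost.
timestamps = [100_000_000, 50_000_000, 40_000_000,
              30_000_000, 20_000_000, 10_000_000]

per_item_seconds = [cycles_to_seconds(c, CORE_HZ) for c in timestamps]
total_seconds = cycles_to_seconds(sum(timestamps), CORE_HZ)

for index, seconds in enumerate(per_item_seconds, start=1):
    print(f"sub-item {index}: {seconds:.3f} s")
print(f"total: {total_seconds:.3f} s")
```

Summing the cycle counts before dividing keeps the total exact in integer arithmetic until the final conversion.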
In the results figure, the red numbers are the results with the cache disabled and the green numbers are the results with the cache enabled.

Conclusion

When code runs from external flash, running plaintext and running ciphertext through OTFDEC are almost equally efficient. The main way to speed up code running from external flash is to enable the core's caches.

Source: WeChat official account "STM32 microcontroller" (WeChat ID: STM32_STM8_MCU). Responsible editor: GT. Original title: Information security topic | OTFDEC efficiency verification based on the STM32H735G-DK board.