"One year later, finally have the opportunity to save a single chip. This time, the focus of the automotive chip.
I remember that two years ago I could hardly find more than a handful of companies in China doing automotive chips. Two years later, more than a dozen have suddenly sprung up like mushrooms after rain, covering driver assistance, the head unit (in-vehicle infotainment), instrument cluster, T-Box, gateway, body control, battery management, hardware encryption, lidar, millimeter-wave radar, image sensors, image signal processors and more, each crossing the sea in its own way like the Eight Immortals.
Globally, automotive chips sell roughly $40 billion a year, of which digital chips account for about $10 billion: infotainment (head-unit) chips about $2.5 billion at an average price of $25; MCUs about $6 billion for roughly 3 billion units, at an average price of $2; driver assistance about $1.7 billion. Roughly 100 million cars are sold worldwide each year, so each car carries on average about $100 of digital chips, and within that the driver-assistance chips are in a rapid growth phase. Among the major automotive chip suppliers, NXP and Renesas lean more toward the digital side, Infineon and Texas Instruments more toward analog. Automotive chips are one of the few remaining markets with decent margins; the technical threshold is not insurmountable, and there is no absolutely closed ecosystem. But the volumes are nowhere near consumer electronics: shipping a few million units a year is already quite good. With new players in the ascendant, traditional players pursuing differentiation, and the 5G, autonomous-driving and artificial-intelligence waves to ride, automotive chips have become the new investment hotspot after virtual reality, mining rigs, NB-IoT and AI.
The figure shows a typical automotive electronics architecture. The system is divided into several domains: body, powertrain, chassis, infotainment, driver assistance, gateway and T-Box. Each domain has its own domain controller, and the domains are interconnected over automotive Ethernet and the CAN bus. Let us take the two most complex chips in this framework, the head unit and driver assistance, as examples and work through the design ideas and methods.
The architecture of a new-generation head-unit chip is shown below. It consists mainly of processor, graphics processor, display, image processing, security management (Security), functional safety (Safety), on-chip bus and debug subsystems. Its main differences from an ordinary application processor are usually virtualization, functional safety, real-time behaviour, and automotive-grade electrical standards.
Let me start with virtualization. Virtualization is a concept that came from servers; why would a car need it? Two reasons. First, there is a trend for the head-unit chip to absorb the instrument cluster and cut cost. The cluster used to be driven by a microcontroller with a fairly simple graphical interface; now clusters are getting ever flashier and even need a graphics processor to join in. That naturally pushes the head unit and the cluster onto a single chip. They run different operating systems, and virtualization software provides the isolation between them. Of course, some manufacturers feel virtualization is not enough and only physical separation gives peace of mind; more on that later. The other trend is that the head unit itself needs to drive multiple screens, each belonging to a different virtual machine and operating system, which simplifies software design and improves software reliability.
What exactly does virtualization demand of the hardware? There is no single answer. You can rely on the processor's built-in stage-2 memory management unit (stage-2 MMU) and virtualize devices in software; you can place a hardware firewall (MPU) in front of the memory controller to check and filter the addresses going to memory, without any address remapping; or you can use a system memory management unit (SMMU) to achieve full hardware virtualization, which is what we will focus on.
As the yellow blocks in the figure show, an SMMU-600 is inserted between every bus master and the interconnect. Why does every master need one? Simple: leave one out and there is a security hole, and you are no better off than with software-only virtualization. Then why not use an MPU instead? An MPU implementation usually keeps its filter entries in on-chip memory. At a 4 KB granule, 4 GB of memory needs one million entries; even at only eight bits per entry that is 1 MB of on-chip memory, which is not a small cost (a rough estimate is sketched below). The other reason is that with an MPU the physical address layout is not transparent to the software in the virtual machines, whereas an SMMU is transparent to the upper software layers and comes closer to what virtualization actually needs.
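To make the MPU cost argument concrete, here is a back-of-the-envelope sketch in C. The 4 KB granule and 4 GB range come from the paragraph above; the eight-bit entry width is an assumption for illustration, not a particular product's parameter.

```c
#include <stdio.h>

/* Rough cost estimate for an MPU-style filter table as described above.
 * Granule size and entry width are illustrative assumptions. */
int main(void)
{
    unsigned long long dram_bytes = 4ULL << 30;   /* 4 GB of protected DRAM */
    unsigned long long granule    = 4ULL << 10;   /* 4 KB filtering granule */
    unsigned long long entry_bits = 8;            /* assumed bits per entry */

    unsigned long long entries    = dram_bytes / granule;        /* 1M entries */
    unsigned long long sram_bytes = entries * entry_bits / 8;    /* ~1 MB SRAM */

    printf("entries: %llu, on-chip SRAM needed: %llu KB\n",
           entries, sram_bytes >> 10);
    return 0;
}
```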
When the processor issues a virtual-to-physical translation request, the MMU first looks up the final and intermediate page-table entries in its TLB and table-walk caches. On a miss, it has to go out to the caches or to system memory. In the worst case every level of the intermediate tables misses; counting it the way the sketch below does, that is 4 x 4 + 4 = 20 memory reads. For the SMMU it can be even worse: as the figure above shows, the SMMU brings in multi-level descriptor tables of its own on top of the multi-level page tables, so in the most extreme case it takes 36 reads to fetch the final page-table entry. If every access paid that latency, it would clearly be unacceptable.
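A quick way to see where the 20 comes from. The formula simply mirrors the counting in the paragraph above (each stage-1 level's descriptor is an intermediate physical address that needs a full stage-2 walk, plus the stage-1 descriptor reads themselves); it is my own restatement, not an Arm-published model.

```c
#include <stdio.h>

/* Worst-case memory reads for one nested (two-stage) translation, counted
 * as in the text: each stage-1 level costs one stage-2 walk plus the
 * stage-1 descriptor read itself. Four levels per stage is the usual
 * 4 KB-granule case. */
static unsigned nested_walk_reads(unsigned s1_levels, unsigned s2_levels)
{
    return s1_levels * s2_levels + s1_levels;   /* 4*4 + 4 = 20 */
}

int main(void)
{
    printf("CPU MMU, all walk caches cold: %u reads\n", nested_walk_reads(4, 4));
    /* An SMMU must additionally fetch its own configuration structures
     * before starting this walk, which is how the text arrives at an
     * extreme case of about 36 reads. */
    return 0;
}
```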
Arm's traditional remedy is to add sufficiently large multi-level TLB and table-walk caches. The effect is as follows:
These are measurements with two-stage address mapping enabled, the caches configured at their largest sizes, two masters connected to the interface, and fairly random access addresses. You can see that the master's 50,000 accesses, after passing through the SMMU, produced nearly five million misses. That means the average translation latency is essentially the memory access latency, 150 ns or more. On the other hand, once the processor has its virtual machine enabled, its random-access efficiency still reaches more than 80% of the non-virtualized case. Why? The answer is simple: the MMU inside the processor keeps pushing the intermediate page tables, indexed by physical address, into the L2 and L3 caches, and uses those caches to cut the average latency. The SMMU is not so lucky: Arm's earlier mobile reference designs have no system cache. In that situation, even a latency-tolerant master such as a graphics processor loses performance when virtualization is turned on, possibly as much as 9%, which is not a small number.
How do we solve this? In its next-generation mobile and server reference designs Arm introduces a mesh-style coherent interconnect in place of the earlier crossbar coherent bus. The main benefits of the mesh are higher frequency and bandwidth, and, while providing multi-core coherency, it can also offer a system cache to the various masters. Masters that do not need caching can keep issuing non-cached transfers as before, so they do not occupy the cache and trigger constant evictions; meanwhile the SMMU can place page-table entries and intermediate table entries in that system cache, shortening the walk latency.
Arm's SMMU-600 also makes a small improvement: the TLBs can be placed right next to each master, so that on a hit the translation completes in one clock cycle, while the table-walk unit sits in its own local block, with the TLBs and the table-walk cache connected over an internal interconnect. Several masters can share one table-walk cache, which saves area and eases floorplanning without losing efficiency. The structure is as follows:
If you read Arm's SMMUv3.x specification, you will find it supports two-way broadcast of page-table maintenance operations. That means that, on top of cache-data coherency, any master that follows SMMUv3.x can share the very same page tables as the processor. When designing a driver-assistance chip, a large accelerator can, if needed, be hung off the same page tables, avoiding software page-table update operations and further improving heterogeneous-computing efficiency. The SMMU-600, however, only supports one-way broadcast: it can receive maintenance operations, but its own caching and page-table operations cannot be broadcast to the processor; only the reverse direction works.
For today's automotive chips, if there is no system cache, how can device-side virtualization latency be reduced? There are ways. The automotive virtual-machine use case is special: eight virtual machines are currently enough to satisfy all the multi-screen and multi-system requirements, and once allocated they are not repeatedly destroyed and recreated at run time. We can exploit this by making the SMMU's stage-2 page-table blocks large, 1 GB for example, and assigning them to virtual machines once and for all. That way a device's stage-2 address mapping needs only a handful of TLB entries, which can stay resident permanently and cut the latency dramatically (a small illustration follows). Note that once a stage-2 physical block has been allocated to a device it should not be handed back and re-assigned to other devices; otherwise, after repeated recycling, the physical address space becomes fragmented and large contiguous physical regions can no longer be found.
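A small illustration of the fixed 1 GB block scheme described above; the per-VM memory sizes are made-up numbers.

```c
#include <stdio.h>

/* Each VM's DRAM is carved from fixed 1 GB stage-2 blocks, so a device
 * working on behalf of that VM needs only that many stage-2 TLB entries,
 * and they can stay resident forever. Sizes below are illustrative. */
int main(void)
{
    const unsigned long long block = 1ULL << 30;                  /* 1 GB block   */
    unsigned long long vm_mem_gb[8] = { 2, 1, 1, 4, 2, 1, 1, 4 }; /* example VMs  */

    for (int vm = 0; vm < 8; vm++) {
        unsigned long long entries = (vm_mem_gb[vm] << 30) / block;
        printf("VM%d: %llu GB -> %llu resident stage-2 TLB entries\n",
               vm, vm_mem_gb[vm], entries);
    }
    return 0;
}
```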
The SMMU receives an address sent by a master device; how does it tell which virtual machine the access belongs to? By the VMID/StreamID sent along with it from the master. If the master itself does not support virtualization, then it has to be time-division multiplexed, with software writing the VMID/StreamID. That software must run in the hypervisor or the secure monitor, otherwise there is a security hole. Concretely, on a virtual-machine switch the hypervisor modifies the VMID/StreamID registers that feed the SMMU; if an access then arrives with an ID that does not match the preset value, the SMMU reports an exception to the hypervisor. A minimal sketch follows.
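A minimal sketch of that time-division scheme. DEV_STREAMID_SEL and its address are hypothetical and SoC-specific; they are not part of the Arm SMMU programming interface. The point is only to show where in the hypervisor the write belongs.

```c
#include <stdint.h>

/* Hypothetical SoC register that drives the StreamID sideband of a
 * non-virtualization-aware master. Not an Arm SMMU register. Must be
 * written from EL2 (hypervisor) or the scheme is not secure. */
#define DEV_STREAMID_SEL  ((volatile uint32_t *)0x40001000u)  /* assumed address */

static void assign_device_to_vm(uint32_t streamid_of_vm)
{
    /* Called on the hypervisor's VM-switch path, before the new VM runs.
     * The SMMU then applies that VM's stage-2 tables; an access whose
     * StreamID does not match a configured stream faults to the hypervisor. */
    *DEV_STREAMID_SEL = streamid_of_vm;
}
```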
If the master is to support hardware virtualization itself, it needs multiple register sets of its own and must actively send out different VMID/StreamIDs. For software compatibility, the register sets can be separated on 4 KB boundaries, so that stage-2 address mapping lets the same virtual address land on a different register set per virtual machine, transparently to the driver. The internal resources also have to be kept apart so that data from different VMs cannot affect each other. If caches are used, they must be VMID-aware: the same address with a different VMID has to be treated as a miss.
If the master does not support virtualization and is itself particularly complex, it also needs a custom driver. Take the Arm graphics processor as an example: to date it has no official hardware virtualization support, so if you want to support virtualization in software there are several options:
Suppose we use the Xen hypervisor, running at EL2 on the Arm processor, with the virtual machines at EL0/1. The GPU's normal driver is split into a user-space part and a kernel-space part. To virtualize the GPU by time-division multiplexing, the driver cannot run inside Xen itself, because the current driver only supports Linux. So the original driver can only run inside one virtual machine, and there is no driver in the hypervisor to perform access control. Meanwhile, the GPU's addresses are remapped by the stage-2 tables on the processor side, and register accesses and the data path are left in pass-through mode, which avoids traps and improves efficiency; the price is that letting the virtual machine access the registers directly means no access control is achieved. To schedule multiple virtual machines, we can implement a scheduler inside the hypervisor and have the kernel-mode part of the driver expose an interface, so the hypervisor can drive the scheduling. The diagram is as follows:
The advantages of this implementation are obvious: few changes, simple, and it can be adapted to both Xen and KVM. The drawback is that the initiative does not rest with the hypervisor: if one virtual machine's rendering load is too heavy and it never hands control back to the scheduler, the only recourse is a forced reset. Another obvious drawback is that tasks from two virtual machines cannot run on the GPU at the same time. That calls for a further implementation, shown below:
In this implementation the virtual machines run only the user-space part of the driver, and all kernel calls are thrown down into the hypervisor. This requires the hypervisor itself to be Linux, and only KVM satisfies that. Arm's Mali graphics processors support, in hardware, assigning a particular task to a particular shader core, so a given virtual machine's tasks can be made to run on designated cores. Thus, if a real-time operating system is running, the instrument cluster for instance, a core can be reserved for it that other virtual machines cannot preempt, achieving a certain level of QoS. At this point the GPU really is running tasks from two virtual machines simultaneously rather than time-division multiplexing them. As for the frame-buffer output, different tasks can be placed at different physical addresses, but they cannot be told apart by StreamID, so there is no isolation there.
An Arm graphics processor with hardware virtualization support is probably still a year or so away. Beyond grouping the registers and making the caches VMID-aware, the shared units also need some grouping support.
On virtual-machine efficiency, two more things are worth noting:
Arm's existing interrupt controller, the GIC-600, is limited by the GICv3.x architecture: there is no way to bypass the hypervisor and deliver a virtual interrupt straight to the Guest OS. When an interrupt arrives, the hypervisor still has to program the privileged register interface to generate a virtual interrupt for the Guest OS. Direct injection of interrupts into the Guest OS is an improvement that has to wait for GICv4.
CPUs from Armv8.1 onward support a mechanism called VHE, which speeds up switching for type-2 hypervisors. The principle: with a type-2 hypervisor such as KVM, the hypervisor is just part of the Linux kernel, and Linux needs its two full sets of three- or four-level stage-1 page tables, for kernel and user space. Before Armv8.1, EL2 had no corresponding page-table registers, so without VHE the part of the hypervisor that needs higher privilege has to reside at EL2 while Host Linux runs at EL1; many operations then have to trap from EL1 up into EL2 and drop back to EL1 for the Linux kernel to finish, jumping through several layers. With VHE, the Host Linux kernel runs directly at EL2 and can program the EL1-style page-table registers for its full set of tables without software modification; the hardware redirects those accesses to EL2 so the privilege model is preserved. A small feature check is sketched below.
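As a sanity check, VHE presence can be read from the architectural ID register ID_AA64MMFR1_EL1 (VH field, bits [11:8]). The sketch assumes it runs at EL1 or above, for example from a kernel module; user space may only see a masked view of this register.

```c
#include <stdint.h>
#include <stdio.h>

/* AArch64 only. A non-zero VH field means the core implements VHE, i.e.
 * Host Linux/KVM can run the kernel at EL2 with its usual page tables and
 * avoid the EL1/EL2 bouncing described above. */
int main(void)
{
    uint64_t mmfr1;
    __asm__ volatile("mrs %0, ID_AA64MMFR1_EL1" : "=r"(mmfr1));
    unsigned vh = (mmfr1 >> 8) & 0xF;        /* VH field, bits [11:8] */
    printf("VHE %simplemented\n", vh ? "" : "not ");
    return 0;
}
```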
For a type-1 hypervisor such as Xen, this change makes no difference. Here it is worth mentioning the QNX hypervisor, which is type 1. QNX is the only operating system so far to reach ASIL-D (hypervisor included). If you want an ASIL-D system, you must port your existing software from Linux to QNX. Fortunately QNX complies with the POSIX standard, so porting, the graphics driver in particular, saves some of the effort. Not all QNX modules are ASIL-D: a driver ported over actually carries no safety level at all. QNX relies on its ASIL-D core software modules and hypervisor to guarantee more than 99% failure coverage; if a sub-module has a problem, only that module is restarted.
As mentioned earlier, some manufacturers believe virtualization is not enough and that some scenarios call for physical isolation. Under virtualization the hardware resources are still shared; the sharing is merely transparent to software, so it cannot completely prevent hardware contention or guarantee priority. Note that what is meant here is hardware separation, not partitioning; partitioning is what the MPU does. Within the head-unit system framework in the figure, the parts we pull out for physical isolation are drawn in red, as shown below:
Here the A55 processor and G31 graphics processor are independent of the A76/A55 processors and the G76 GPU that serve the infotainment domain, with their own power supply, clock and voltage. As an optimization, the red portion can be connected to the rest over the coherent processor bus: when it is not being used as the instrument cluster it joins in as part of the SMP, and when isolation is needed it is switched through multiplexers onto the NoC or the memory controller. This saves area while still achieving isolation.
Likewise, the graphics processor also needs physical isolation. The implementation is not complicated, simpler in fact than supporting hardware virtualization directly, as shown below:
The largest part of a graphics processor is the shader (rendering) cores, and those are left untouched. Each of the remaining hardware modules gets one copy per group, and the groups are interconnected by the internal ASN bus. When the GPU is split into several GPUs, each redundant set of modules controls its own resources separately, and each GPU then needs to run its own driver. When all resources are fused into one, the redundant parts are shut down automatically and scheduling is centralized in a single module. In that case some shared resources may become performance bottlenecks, but a car usually needs only two physically isolated groups, one for the instrument cluster and one for infotainment; the cluster needs few resources, and when fused, the infotainment units can be shared to avoid the bottleneck. The remaining masters in the system can be isolated with similar design ideas.
With a graphics processor that supports both virtualization and hardware isolation, our head-unit chip architecture changes as follows:
Now physical isolation and GPU hardware virtualization can be enabled at the same time, running multiple drivers and meeting the requirements described above.
That wraps up virtualization and isolation; let us move on to automotive qualification.
When people talk about automotive grade they usually mean two things: functional safety and electrical reliability standards. The former is defined by ISO 26262, the latter by AEC-Q100.
The principle of functional-safety design on a chip is to identify as many on-chip failure scenarios as possible and deal with them. Failures divide into systematic and random: the former are controlled by following the prescribed process and design flow, the latter rely on the various failure-detection mechanisms adopted in the chip design. We are mainly concerned with the latter.
In simple terms, the chip's failure rate starts from the failure probability of a single transistor at a given process node, from which the failure probability of the chip's logic and memories is derived. The larger the area, the more transistors, and the higher the corresponding failure rate. ISO 26262 grades the safety levels; ASIL-B and ASIL-D are the ones used most often. ASIL-B requires the chip to cover 90% of single-point-failure scenarios, ASIL-D 99%. This is in fact a very demanding requirement: the failure probability of one transistor is tiny, but a complex chip is built from a vast number of them, and if nothing is done, an error at any point can cause a functional failure, so the overall failure rate is very high. A toy illustration of the arithmetic follows.
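A toy calculation of why coverage matters. The 1000 FIT raw rate is an arbitrary assumption for illustration, not a real process figure.

```c
#include <stdio.h>

/* The raw random failure rate of a block scales with its area; diagnostic
 * coverage then determines how much of it remains uncovered. The 1000 FIT
 * starting point is an arbitrary assumption. */
int main(void)
{
    double raw_fit = 1000.0;   /* assumed raw failure rate, failures per 1e9 hours */
    double cov_b   = 0.90;     /* ASIL-B single-point fault coverage               */
    double cov_d   = 0.99;     /* ASIL-D single-point fault coverage               */

    printf("residual at ASIL-B coverage: %.1f FIT\n", raw_fit * (1.0 - cov_b));
    printf("residual at ASIL-D coverage: %.1f FIT\n", raw_fit * (1.0 - cov_d));
    return 0;
}
```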
Annex D of ISO 26262 Part 5 describes in detail the means of detecting hardware failures. It divides a hardware system into several blocks: inputs such as sensors, connectors and relays; interfaces; a processing section including the processing units and the various memories and flash; and, at the system level, buses, power and clock. The framework is as follows:
For each element, the ISO 26262 document defines methods for detecting its failures and states how much credit each method can claim. For a transmission line, for instance, there are check codes, timeout counters, transmitted test vectors and the like; for a processing unit, hardware and software self-test, redundancy with comparison, additional monitoring hardware modules and so on. These methods cannot simply be copied onto a chip's functional-safety design, so how is it actually done on a chip? We take a bottom-up approach, starting the analysis at the transistor, moving up to the IP-block level and then to the chip and system level, then discuss a few typical scenarios, and finish with a top-down analysis.
Among the random errors in a chip there is a class of permanent faults, such as on-chip logic or memory stuck at 0 or 1, or outright shorts and opens. For this class, at chip production test we can use boundary scan and MBIST to find the broken transistors. The problem is then converted into one of raising DFT coverage, and for that the industry has a mature methodology.
Factory test alone is not enough, because transistors can wear out and fail during use. So we need a self-test at every boot to catch problems early and reduce the chance of an error while the system is running. For this we use LBIST and MBIST. The principle is the same as production test, again using the scan chains, except that the chip needs on-chip LBIST/MBIST controllers to run the test vectors and patterns. Naturally this introduces extra cost, and the higher the coverage, the higher the corresponding cost.
LBIST/MBIST is still not enough: we want an error detected within a few clock cycles of the transistor failing, not discovered at the next boot. For logic, the most direct way to do this is redundancy, that is, duplicating the logic and comparing the outputs with hardware comparators. This is the familiar lock-step design. In theory, for a finite state machine, as long as the inputs are identical in every clock cycle, the outputs must be identical too. In practice that means no true randomness is allowed anywhere; even the cache replacement algorithm must be pseudo-random, and then the condition holds. The cost is that the redundant logic doubles the area, and the comparators introduce some additional area overhead and timing impact. A toy behavioural model follows.
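A behavioural toy model of lock-step in C: two copies of the same deterministic state machine, identical inputs, a comparator on the outputs, and one injected stuck-at bit. Real designs compare thousands of internal signals and skew the two copies by a few cycles, as discussed below.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t state; } core_t;

/* Any deterministic finite state machine will do for the illustration. */
static uint32_t step(core_t *c, uint32_t in)
{
    c->state = c->state * 1664525u + in;
    return c->state;
}

int main(void)
{
    core_t main_core = {0}, shadow_core = {0};
    for (uint32_t cycle = 0; cycle < 100; cycle++) {
        uint32_t in    = cycle;                     /* identical inputs          */
        uint32_t out_a = step(&main_core, in);
        uint32_t out_b = step(&shadow_core, in);
        if (cycle == 50)
            out_a |= 1u << 7;                       /* inject a stuck-at-1 fault */
        if (out_a != out_b) {                       /* the hardware comparator   */
            printf("lock-step mismatch detected at cycle %u\n", cycle);
            return 1;
        }
    }
    return 0;
}
```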
So is functional safety that simple to achieve? Not quite; several problems remain to be solved:
The first question: what exactly should the comparators compare? Take the processor as an example. If we only add comparators at its bus interface, many blocks inside the chip, a write buffer for instance, may not propagate the effect of a fault to the external interface within a short, bounded time. The processor could then sit in a failed state without being detected, and we could not claim that the redundancy covers that failure. So the comparators have to be attached to internal sub-module interfaces as well, and we must analyse which effects cannot be observed within a reasonably short time. This has to be considered at the design stage; concretely, see the following diagram:
For any register, we can find the combinational logic and the registers that influence it. If anything on that path goes wrong, then within one clock cycle we will see that register's output disagree with the one in the redundant copy. Call that register node 1; take node 1's input registers as the new starting point and find node 2, and so on. In this way we keep tracing out paths with the property that a fault at any point along them is determined within a relatively short time and reflected at the final output. Call such a path grouping module X. With the help of certain EDA tools we can identify a number of X modules in the chip. An example is shown below:
In the example, the IP block is divided into an access unit (the A gates), a flag unit (the B gates), a counting unit (the C gates) and a register group (the D gates). Working back from each output, we count the combinational gates between it and the registers that feed it, with each register counted as 10 gates. As shown above, the output of the access unit's address register is influenced by 24 gates of combinational logic plus 2 registers, so a total of 44 kinds of single-point failure can corrupt it. By the same reasoning, output No. 1 of the register group is influenced by 28 gates and No. 2 by 49, for a grand total of 121. A simple calculation then gives the access unit a failure share of 44/121 = 36.4% and the counting registers 77/121 = 63.6%. Some gates, G1 in the figure for example, are counted more than once, and that is reflected in the overall probability. The arithmetic is repeated below.
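The same arithmetic, spelled out; the counts are the ones quoted above.

```c
#include <stdio.h>

/* Fan-in counting example: each output's failure contribution is the number
 * of gates (registers counted as 10 gates each) whose single-point failure
 * can corrupt it; a module's share is its count over the total. */
int main(void)
{
    int access_unit = 44;   /* address register of the access unit: 24 gates + 2 regs */
    int count_reg_1 = 28;   /* register-group output No. 1 */
    int count_reg_2 = 49;   /* register-group output No. 2 */
    int total       = access_unit + count_reg_1 + count_reg_2;   /* 121 */

    printf("access unit share   : %.1f%%\n", 100.0 * access_unit / total);
    printf("count register share: %.1f%%\n",
           100.0 * (count_reg_1 + count_reg_2) / total);
    return 0;
}
```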
With this idea in place, let us look at how it is done for a processor. With the help of EDA tools, the core is divided into blocks: memory management unit, write buffer, instruction fetch unit, data processing unit, program trace buffer, data/instruction caches, bus interface unit, clock and reset control unit, ECC/parity control unit, interrupt interface, and the monitoring control unit. The on-chip memories themselves are not included here; even where the caches are discussed, what is meant is their control logic.
Each block can in turn be divided into many sub-modules. Take the data processing unit: it splits into general-purpose registers, load/store unit, floating-point unit, floating-point registers, decode unit, debug unit, control signals, system registers, branch execution unit and so on, and every sub-unit can be broken down yet again. The purpose of the decomposition is to determine, when a transistor fails, which registers are affected and fail, and whether that failure can be detected by an external comparator; that in turn requires internal signals to be brought out. So which signals are pulled out and which are not? Is the coverage enough? The node and module information from the tools is only a reference; the designers still have to check them one by one and make the final decision. Usually a great many signals end up being pulled out: on a Cortex-R5, with about 200,000 logic gates, the number of signals fed to comparators eventually exceeded 2,000, roughly one signal for every 100 gates.
At chip certification time, if an IP has not itself been certified or widely adopted beforehand, the certification body may want to go through it with the chip company and examine whether the pins brought out can really provide adequate failure-detection coverage. The IP vendor often does not hand that design information to the chip company, so the certifier may have to obtain it from the IP vendor directly, which stretches out the certification time. Conversely, if the IP block is already in wide use, this time can be shortened.
With the redundancy coverage question solved, there is a second problem. Under a voltage transient or electromagnetic interference, the two redundant copies may produce exactly the same error at the same time. This one is easier to handle: keep the two copies logically identical but delay the inputs of one copy by a few clock cycles, and shift its outputs back by the same number of cycles so that the comparator still sees matching results.
The third problem: we have duplicated the logic and the comparator has found an error; can it be corrected? Unfortunately not, unless the logic is replicated not twice but three times, with the three copies compared simultaneously. That means paying yet another 100% of the original logic area, and for a large processor design essentially nobody does it. For a small piece of logic such as a watchdog circuit it is acceptable.
Fourth, the comparator logic itself can also go wrong. ISO 26262 has a definition for this class of error, the so-called latent fault. If analysis finds that the comparator's own failure coverage is insufficient, the comparator itself can be designed with redundancy, comparing the comparators, to raise its coverage. For ASIL-D, latent-fault coverage must reach 90%; for ASIL-B, 60%.
So much for the analysis of logic errors. The other class is memory errors, where memory means the on-chip memories, embedded flash included. Memory errors are comparatively easy to catch: ECC can usually achieve 99% coverage and also correct multi-bit errors. Some memories, such as an instruction cache, support only parity, which detects but does not correct. A toy example of the difference follows.
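To show the difference between parity (detect only) and ECC (detect and correct), here is a minimal Hamming(7,4) single-error-correcting code over one data nibble. Real on-chip ECC uses wider SEC-DED codes, but the principle is the same.

```c
#include <stdio.h>

/* Hamming(7,4): codeword bits c[1..7], parity at positions 1, 2, 4. */
static void encode(const int d[4], int c[8])
{
    c[3] = d[0]; c[5] = d[1]; c[6] = d[2]; c[7] = d[3];
    c[1] = c[3] ^ c[5] ^ c[7];                 /* covers positions 1,3,5,7 */
    c[2] = c[3] ^ c[6] ^ c[7];                 /* covers positions 2,3,6,7 */
    c[4] = c[5] ^ c[6] ^ c[7];                 /* covers positions 4,5,6,7 */
}

static int correct(int c[8])                   /* returns error position, 0 = none */
{
    int s = (c[1] ^ c[3] ^ c[5] ^ c[7])
          | (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1
          | (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2;
    if (s) c[s] ^= 1;                          /* flip the faulty bit */
    return s;
}

int main(void)
{
    int d[4] = {1, 0, 1, 1}, c[8] = {0};
    encode(d, c);
    c[6] ^= 1;                                 /* inject a single-bit flip */
    int pos = correct(c);
    printf("corrected bit %d, data restored to %d%d%d%d\n",
           pos, c[7], c[6], c[5], c[3]);
    return 0;
}
```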
For the redundant logic and the memory ECC, error injection has to be added inside the chip in order to verify that the detection mechanisms themselves meet the design requirements. Note that the injection is not there to validate the chip's own single-point and multiple-point fault failure rates; it exists solely to verify the error-detection mechanisms.
To sum up, redundant logic and memory ECC are the necessary means for us to reach ASIL-B/D. Without a redundant design, running a program twice on one core and comparing the results, as sketched below, is also a route to a higher safety level, but it only suits simple computations with loose real-time requirements, and it fails outright under a permanent fault. Similarly, having two non-redundant processors perform the same computation at the same time is another method, but if the computation is complex this adds system latency and the bandwidth cost is not small either.
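A sketch of the run-twice-and-compare approach in software; `compute` here is just a stand-in for the safety-relevant job.

```c
#include <stdio.h>

/* The same computation is executed twice on one core and the results are
 * compared before use. This catches many transient faults, but not a
 * permanent fault that corrupts both runs identically. */
static int compute(const int *in, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++) acc += in[i] * in[i];
    return acc;
}

int main(void)
{
    int input[4] = {3, 1, 4, 1};
    int r1 = compute(input, 4);
    int r2 = compute(input, 4);                 /* second, independent run */
    if (r1 != r2) {
        puts("mismatch: discard result and signal a fault");
        return 1;
    }
    printf("result accepted: %d\n", r1);
    return 0;
}
```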
Neither method raises the safety level in any fundamental way: if the final system must reach ASIL-D, decomposition requires each of the two sub-paths to reach ASIL-B on its own, and ASIL-B's 90% single-point coverage is no easier to achieve without redundancy. Yet another method is to compute once and use an independent check to judge whether the result is plausible, that is, monitoring as a way of raising the level toward ASIL-D. This only works in specific application scenarios, which we will discuss later.
In short, to reach ASIL-B/D it is generally best to design in logic redundancy and memory ECC from the very beginning.
In an actual design, and especially in the processor, there is another mechanism besides redundancy for detecting and correcting errors. Arm calls it RAS (Reliability, Availability, Serviceability). RAS is no substitute for a redundant design when it comes to reaching ASIL-B/D; its coverage is far too low. But in certain scenarios, such as ECC errors or instruction errors, it can correct the error without restarting the core, stop the error from spreading before it is consumed (data poisoning), or record the error together with the context in which it occurred. That is its advantage, and it is useful even in chips without a redundant design.
Now let us take the ISO 26262 document and look at how Arm's automotive IP achieves a high functional-safety level.
Above is the configuration diagram of the A76AE, the automotive-oriented A76, which introduces the Split-Lock design. A cluster can run normally as a 4-core SMP, or in redundant mode, in which all of a core's logic and memories are duplicated. Switching between the two modes requires a reboot; it cannot be done dynamically, which is sufficient for automotive applications. Adding the comparators inside the core costs about 5% in area and roughly 5% in frequency.
The A76AE uses the new Armv8.2 architecture. As shown above, a processor cluster includes the DSU, which contains the L3 cache and the interconnect. Unlike the cores, the DSU uses the conventional lock-step approach: only the logic is copied, not the memories, saving the large area overhead of the cache. The logic is only a small part of the DSU and its utilization is low anyway, so the final extra area is not large, about 15%.
Arm also has the A65AE, a processor supporting automotive ASIL-D. It can serve as the little-core cluster alongside different big-core processors, interconnected with them over the CMN-600AE bus to provide energy-efficient heterogeneous computing. The A65AE supports two threads on a single core: by adding a set of registers, two software threads can share the pipeline of one physical core, transparently to software. The demand for this actually came first from network processors, where the execution units often sit waiting on high-latency reads; to raise pipeline utilization, the A65AE adds roughly 5% of extra hardware so that instructions can be issued from a second thread.