Unlike software, hardware may experience random failures in addition to systematic failures. Systematic hardware failures must be addressed by following the rules that ISO 26262 specifies for the development process. Random hardware failures are addressed by calculating the probability of failure using statistical measures and historical usage data. This works quite well for discrete hardware components, but the more complex a piece of hardware is, the harder it becomes to predict the failure of its components. A viable alternative is to detect the failure externally and transition the hardware to a safe state within a defined period of time. This alternative allows developers to use even complex hardware components such as multi-core processors, provided appropriate diagnostic methods can be applied.
SIL and ASIL
The primary task of a safety-related electronic system is to perform a safety function that must achieve or maintain a safe state of a monitored device in order to reduce the consequences of hazardous events. IEC 61508 describes the ability to perform the safety function in terms of safety integrity, a measure of the probability that a safety-related system will perform the specified safety functions under all specified conditions within a specified time period. Here, the highest Safety Integrity Level (SIL) is defined by the hardware. The software inherits this SIL and must follow the processes specified in the associated software safety standard. IEC 61508 recognizes five levels, SIL 0 to SIL 4, while ISO 26262 follows suit with four levels, ASIL A to ASIL D (ASIL = Automotive SIL). The higher the level, the more stringent the safety requirements in both standards. The lowest level, SIL 0, has no equivalent in ISO 26262; it covers non-critical systems for which no measures are required beyond the manufacturer's normal quality management (QM).
The goal of a safety-related system is to reduce the risk for a given situation to a tolerable level in terms of its probability and its specific consequences. To judge what constitutes a tolerable level of risk for a specific application, several factors such as regulatory requirements, guidelines and industry standards (e.g., IEC 61508 and ISO 26262) must be considered. The conformity of a safety-related system with its assigned SIL must in principle be mathematically proven. For ASIL B/C, for example, a dangerous situation occurring on average once every 1142 years is considered acceptable. Since electronic components cannot be observed over such long periods, architectural concepts such as hardware fault tolerance, device diagnostics, inspection and proof tests must be applied to reduce the risk arising from the electronics.
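The 1142-year figure can be reproduced with simple arithmetic. The sketch below assumes a constant rate of 1e-7 dangerous failures per hour (100 FIT), a value commonly associated with ASIL B/C hardware targets but not stated in the text above; the function and constant names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h, leap years ignored


def mean_years_between_failures(failures_per_hour: float) -> float:
    """Mean time between dangerous failures in years, for a constant failure rate."""
    return 1.0 / failures_per_hour / HOURS_PER_YEAR


# Assumption: 1e-7 dangerous failures per hour (100 FIT) for ASIL B/C.
ASSUMED_RATE_PER_HOUR = 1e-7

years = mean_years_between_failures(ASSUMED_RATE_PER_HOUR)
print(round(years))  # 1142
```

An interval of this length cannot be demonstrated by observing a single device, which is why the argument shifts to architectural measures.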
Random and Systematic Faults
Safety integrity for electronic systems distinguishes between hardware safety integrity and systematic safety integrity. Hardware safety integrity refers to random hardware failures, while systematic safety integrity is related to systematic failures that can occur in hardware and software designs. In this context, the term common cause failure (CCF) describes random and systematic events that cause multiple devices (in a multi-channel system) to fail simultaneously.
Faults are managed using different strategies depending on whether they are random or systematic. Random failures can be identified through internal device diagnostics, external diagnostics, inspection and proof tests. While random hardware failures are primarily caused by wear and tear, systematic failures are a direct consequence of system complexity. They are often introduced during the specification, design or implementation phases, but can also be caused by errors during manufacturing or integration, as well as by operation or maintenance errors. Systematic failures are considered predictable in a mathematical sense and can be managed by strict application of correct processes during the life cycle of the electronic component and the implemented software.
The risks of systematic failures can be minimized by redundancy and/or by a strict lifecycle process. Redundancy can be achieved through diversity, for example, by using different technologies or products from different manufacturers. In addition, different hardware architectures from different vendors can be used if necessary. In the case of software, the same algorithm can, for example, be implemented in different programming languages and/or run in different runtime environments. Higher reliability through diversity rests on the assumption that different devices have different failure causes and failure modes.
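As a minimal illustration of software diversity, the sketch below computes the same result with two independently written implementations and compares them. To stay self-contained it uses two different algorithms in one language rather than two languages or runtimes as described above; all function names and the exception type are hypothetical.

```python
import math


class DiversityMismatch(Exception):
    """Raised when the diverse implementations disagree."""


def isqrt_library(n: int) -> int:
    # Implementation A: relies on the standard library.
    return math.isqrt(n)


def isqrt_bisect(n: int) -> int:
    # Implementation B: independent bisection, invariant lo*lo <= n < hi*hi.
    lo, hi = 0, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def diverse_isqrt(n: int) -> int:
    """Return the result only if both implementations agree."""
    a, b = isqrt_library(n), isqrt_bisect(n)
    if a != b:
        # In a real system this would trigger the transition to a safe state.
        raise DiversityMismatch(f"n={n}: {a} != {b}")
    return a


print(diverse_isqrt(10))  # 3
```

A systematic fault in one implementation is detected as long as the other implementation does not share it, which is exactly the assumption that diversity depends on.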
Random Hardware Failures
Random hardware failures occur - nomen est omen - at a random time and result from physical degradation of the hardware. Physical degradation can be caused by manufacturing tolerances, abnormal process conditions (overvoltage and temperature), electrostatic discharge, equipment wear, etc. Such failures occur with predictable probability, but at unpredictable (i.e., random) times. Depending on the effect of a failure on the hardware, it is referred to as either a "soft error" or a "hard error." A soft error is temporary and has no permanent consequences, while a hard error damages the hardware. As hardware complexity increases, so does the likelihood of soft and hard errors due to environmental and operating conditions. This is fueled by the trend to shrink the geometry of hardware and increase the density of transistors on silicon. Memory components in particular, whether discrete or embedded in CPUs in the form of register banks or caches, are susceptible to crosstalk and to electromagnetic fields.
ASIL Dependencies
For complex systems that cannot be validated based on existing real-world data, the ASIL level depends on the Single Point Faults Metric (SPFM) and the Latent Fault Metric (LFM), according to ISO 26262. The metrics calculate the percentage of safe faults and dangerous faults detected relative to all faults. Both are hardware architecture metrics that indicate whether or not the coverage by the safety mechanisms is sufficient to avoid the risk of single point or latent faults in the hardware architecture.
- Single Point Faults Metric
Single Point Faults are faults in an element that are not covered by a safety mechanism and directly result in the violation of a safety objective.
- Latent Fault Metric
Latent faults are multi-point faults whose presence is not covered by a safety mechanism or detected by the driver within the multi-point fault detection interval (MPFDI).
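The two metrics can be sketched as follows. The formulas follow the usual reading of ISO 26262-5 (SPFM as one minus the share of single-point and residual faults; LFM as one minus the share of latent multi-point faults among the remaining faults). The element names and failure rates are made-up illustration values, and the target percentages should be verified against the standard before use.

```python
from dataclasses import dataclass


@dataclass
class ElementRates:
    """Failure rates of one safety-related hardware element, in FIT."""
    name: str
    total: float                  # total failure rate of the element
    single_point_residual: float  # single-point plus residual faults
    latent_multi_point: float     # latent multi-point faults


def spfm(elements: list[ElementRates]) -> float:
    total = sum(e.total for e in elements)
    spf_rf = sum(e.single_point_residual for e in elements)
    return 1.0 - spf_rf / total


def lfm(elements: list[ElementRates]) -> float:
    total = sum(e.total for e in elements)
    spf_rf = sum(e.single_point_residual for e in elements)
    latent = sum(e.latent_multi_point for e in elements)
    return 1.0 - latent / (total - spf_rf)


# Commonly cited targets (SPFM, LFM); an assumption to check against ISO 26262-5.
TARGETS = {"ASIL B": (0.90, 0.60), "ASIL C": (0.97, 0.80), "ASIL D": (0.99, 0.90)}

# Hypothetical example data, not taken from any real design.
elements = [
    ElementRates("cpu", total=100.0, single_point_residual=1.0, latent_multi_point=5.0),
    ElementRates("ram", total=200.0, single_point_residual=1.0, latent_multi_point=10.0),
    ElementRates("psu", total=100.0, single_point_residual=2.0, latent_multi_point=5.0),
]

print(f"SPFM = {spfm(elements):.1%}")  # SPFM = 99.0%
print(f"LFM  = {lfm(elements):.1%}")   # LFM  = 94.9%
```

With these example values, the design would meet the assumed ASIL D targets for both metrics.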
Hardware Fault Tolerance
Redundant architectures are frequently used to reduce the probability of failures. Here, the hardware fault tolerance (HFT) introduced in IEC 61508 describes the ability of a component or subsystem to perform the required safety function even in the event of hardware faults. A hardware fault tolerance of 1 means that there are, for example, two devices and the dangerous failure of one of them does not affect the safety function. A three-channel system in which a single channel can continue the safety function when either or both of the other two channels fail is considered to have a hardware fault tolerance of 2. The HFT can be easily calculated if the architecture is expressed as M out of N (MooN). In this case, the HFT is calculated as N-M. In other words, a 1oo3 architecture has an HFT of 2. This means that such a system can tolerate two failures and still function. By implementing an architecture with an HFT > 0, such a safety system can also use standard microprocessors such as Intel Core i or ARM.
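The N-M rule and the corresponding voting logic can be sketched in a few lines. The helper names are hypothetical, and the voter is deliberately simplified: it only counts channels that demand the safety function and ignores diagnostics.

```python
import re


def hft_from_moon(architecture: str) -> int:
    """Hardware fault tolerance of an 'MooN' architecture, computed as N - M."""
    match = re.fullmatch(r"(\d+)oo(\d+)", architecture)
    if match is None:
        raise ValueError(f"not a MooN expression: {architecture!r}")
    m, n = int(match.group(1)), int(match.group(2))
    if not 1 <= m <= n:
        raise ValueError(f"need 1 <= M <= N, got M={m}, N={n}")
    return n - m


def moon_vote(channel_demands: list[bool], m: int) -> bool:
    """The safety function triggers when at least M channels demand it."""
    return sum(channel_demands) >= m


for arch in ("1oo1", "1oo2", "2oo3", "1oo3"):
    print(arch, "-> HFT", hft_from_moon(arch))

# 1oo3 with two channels failed silent: the one healthy channel still trips.
print(moon_vote([True, False, False], m=1))  # True
```

In the last line, two of the three channels have failed and no longer demand the safety function, yet the 1oo3 voter still triggers it, matching an HFT of 2.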
Further Security Applications
In addition, further security applications can be implemented in the partitions. These include, for example, a firewall that monitors network traffic at the transport level and can filter for IP addresses and ports if necessary. Furthermore, a partition can also take over the termination of a VPN tunnel, so that no additional hardware needs to be set up for such functionality.
An essential security application is the secure software update. The separation kernel (SK) used offers the possibility of replacing partitions with others at runtime. In the HASELNUSS architecture, the secure update process is based on the TPM. Updates are encrypted and integrity-protected by the MDM so that only the TPM with the keys stored in it can check the integrity and decrypt the update. The update is installed in a new partition, and the partition loader terminates the old partition and activates the new one, so that no reboot is necessary for updates.
The Certification of Software
With a CPU or standard board in a safety system, software will definitely be an essential part of the safety path. In early device initialization, software is used as a boot loader. Later, an operating system (OS) may be used to facilitate the use of the hardware. On the OS, a software application can perform the (safety) function including diagnostic tests that continuously monitor the system status.
Software does not exhibit random failures, only systematic ones, which are usually caused by design errors. ISO 26262 describes design methods that minimize the risk of these errors, with the required rigor determined by the respective Safety Integrity Level. In brief, the methods cover the following phases:
- Requirements specification
- Software design and development process
- Verification and validation process
Detailed documentation and evidence must be prepared to demonstrate that the specified rules have been applied to an appropriate degree. In this context, the validation effort is highly dependent on the size of the source code that needs to be tested, measured in Source Lines of Code (SLOC). Lean source code significantly reduces the certification effort. Of course, this applies to all software components, such as the bootloader, the operating system, the application code and the built-in diagnostic functions. Using the right technology and validation approach will therefore have a significant impact on the overall cost.
The operating system (OS) enables the application programmer to make the best use of the underlying hardware resources. The OS kernel provides services for managing applications (threads, tasks, processes) and their communication. The underlying hardware is usually described with a Board Support Package (BSP) consisting of initialization code and device drivers. A failure in the operating system, the BSP or the application code can affect the safety system, so their development must follow the rules prescribed by the standard. Applications in particular can have a very large number of lines of code, which greatly increases the cost of software certification. Therefore, safety standards suggest separating safety-related and non-safety-related software so that only the safety-related parts need to be certified.
Separation of Applications through Separation Kernels
The use of independent hardware components for safety-relevant and other applications ensures secure separation, but also leads to increased hardware costs. In contrast, a separation kernel-based operating system, such as SYSGO's PikeOS, enables the separation of application code on the same hardware platform by sharing the physical and temporal resources of the hardware. The separation of physical resources is known as spatial separation or resource partitioning, while the separation of available execution time is known as temporal separation or time partitioning. The separation principle can be compared to a hypervisor, but the main difference is that a separation kernel provides the following capabilities:
- Unbypassable: A component cannot bypass the communication path, even with lower-level mechanisms
- Manipulation-proof: Prevention of unauthorized changes by monitoring the modification rights for the safety monitor code, configuration and data
- Always active: Every access and every message are checked by the corresponding safety monitors
- Evaluable: Trusted components can be evaluated for safety based on whether they are modular, well-designed, well-specified, well-implemented, small, low-complexity, etc.
In the nomenclature of a separation kernel, the isolated application areas are called partitions. Separating applications into partitions ensures that applications cannot interfere with each other, so that each application can operate at its assigned ASIL. This allows a hardware platform to handle applications of mixed criticality levels. A communication stack (e.g., TCP/IP, web server, OPC UA, etc.) can be hosted in a QM partition, while a safety-related AUTOSAR application runs in an ASIL D partition. In such a case, the content of each partition must be certified for its respective ASIL.
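The idea of spatial and temporal separation can be illustrated with a small configuration check. The data model below is purely hypothetical and is not the configuration format of PikeOS or any other separation kernel; it only shows that each partition owns a disjoint memory region and a disjoint time window within a repeating schedule.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Partition:
    """Hypothetical partition description (not a real kernel's format)."""
    name: str
    criticality: str    # e.g. "QM" or "ASIL D"
    mem_start: int      # start address of the memory region
    mem_size: int       # size of the memory region in bytes
    slot_start_ms: int  # start of the time window within the schedule
    slot_len_ms: int    # length of the time window


def _overlap(a_start: int, a_len: int, b_start: int, b_len: int) -> bool:
    # Half-open intervals [start, start + len) overlap if each begins before the other ends.
    return a_start < b_start + b_len and b_start < a_start + a_len


def separation_violations(partitions: list[Partition]) -> list[str]:
    """List every pair of partitions that shares memory or execution time."""
    violations = []
    for a, b in combinations(partitions, 2):
        if _overlap(a.mem_start, a.mem_size, b.mem_start, b.mem_size):
            violations.append(f"spatial: {a.name} / {b.name}")
        if _overlap(a.slot_start_ms, a.slot_len_ms, b.slot_start_ms, b.slot_len_ms):
            violations.append(f"temporal: {a.name} / {b.name}")
    return violations


config = [
    Partition("comm_stack", "QM", 0x8000_0000, 0x0100_0000, 0, 4),
    Partition("autosar_app", "ASIL D", 0x8100_0000, 0x0080_0000, 4, 6),
]
print(separation_violations(config))  # []
```

A check of this kind would run at configuration time; at runtime, enforcing the separation is the job of the kernel together with the hardware's memory protection and timer mechanisms.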
Conclusion
By separating applications of different criticality, the cost of certification can also be significantly reduced in automotive applications, as it is no longer necessary to certify the entire system according to the highest ASIL level required. In addition, certified software components can be offered as COTS components and used in different projects without recertification. However, this approach does not work for hardware, as this is usually built from different components that cannot be separated from each other. Nevertheless, here too the use of a common hardware platform leads to significant savings in procurement, development and operation.