Functional Safety: No hiding place

Levels of driving automation and resultant increases in complexity

Kerry Johnson and Chris Hobbs from Blackberry QNX discuss replication techniques for finding errors in safety-critical automotive systems

The growing popularity of adas and automated driving is fuelling the demand for powerful CPUs in the automotive industry. To meet this demand, semiconductor manufacturers are pushing technology to the point where, for the first time in history, hardware is becoming less reliable. This problem arises from two major factors – physics and complexity.

On the physics front, CPUs run at faster clock speeds, producing more heat, and use smaller transistors, whose dimensions can now be measured in numbers of atoms. Heat causes accelerated wear-out; the hotter the part operates, the sooner it fails. Die shrink leads to transistors that are extremely susceptible to faults caused by electromagnetic interference, secondary particles such as alpha particles and neutrons, and cross-talk between neighbouring cells. These problems also occur in dram systems – in modern multi-gigabyte drams, bit errors can be expected on the order of one per hour.

On the complexity front, manufacturers have been adding more and more inter-related functionality to each CPU. Unfortunately, CPUs ship with bugs, many of which are found only after the chip goes into production; known bugs are documented in the manufacturer’s errata sheets. These bugs can affect computations and give erroneous results, thereby causing safety vulnerabilities. The probability of such errors directly impacts the ISO 26262 Asil rating.

 

Verification

To detect and recover from these errors, system designers must implement compensation mechanisms. In one approach, the system performs each computation two or more times and then compares the results. Some microcontrollers implement a technique known as hardware lockstep, in which two CPUs execute the same instructions at the same time, with dedicated hardware comparing the results. If the hardware detects a mismatch, an independent diagnostic routine determines which CPU was faulty, and system software then takes remedial action. Unfortunately, this technique generally supports only identical replicas rather than diverse implementations, cannot detect software bugs – both CPUs will “correctly” execute the buggy code – and does not scale, because the number of replicas is fixed by the hardware. It also isn’t practical for today’s high-performance hardware, where there is far too much internal state for a hardware checker to analyse.

In practice, a system can use software to verify the operation of the hardware. The developer implements two or more replicas of the software, and these replicas are used to perform the verification. Each replica performs safety-critical computations – for instance, “given these conditions, can acceleration be applied?” – and some middleware runs the computations with synchronisation points invisible to the application.

Each replication scheme has its advantages and disadvantages. In the identical replica model, two identical computations running on different threads using different memory will yield the same correct result, except when a transient hardware or random software error occurs. In that case, the error should affect only one of the instances, not both. The middleware can then determine which version is correct.
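
As a rough sketch of the identical-replica idea – the function names, the threshold and the use of Posix threads here are purely illustrative, not part of any particular product – the same decision can be computed twice in separate memory and compared at a synchronisation point:

```c
/* Illustrative sketch of the identical-replica model: the same computation
 * runs on two threads with separate memory, and the results are compared.
 * can_accelerate() and the surrounding names are hypothetical. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct inputs  { double speed_kph; double gap_m; };
struct replica { struct inputs in; bool result; };

/* The safety-critical computation: "given these conditions,
 * can acceleration be applied?" */
static bool can_accelerate(const struct inputs *in)
{
    return in->gap_m > 2.0 * in->speed_kph / 3.6;   /* e.g. keep a 2-second gap */
}

static void *run_replica(void *arg)
{
    struct replica *r = arg;
    r->result = can_accelerate(&r->in);             /* each replica uses its own memory */
    return NULL;
}

int main(void)
{
    struct inputs in = { .speed_kph = 80.0, .gap_m = 60.0 };
    struct replica a = { .in = in }, b = { .in = in };
    pthread_t ta, tb;

    pthread_create(&ta, NULL, run_replica, &a);
    pthread_create(&tb, NULL, run_replica, &b);
    pthread_join(ta, NULL);                          /* synchronisation point */
    pthread_join(tb, NULL);

    if (a.result != b.result)
        printf("replica mismatch: suspect a transient fault\n");
    else
        printf("replicas agree: accelerate = %d\n", a.result);
    return 0;
}
```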

Of course, this approach cannot correct for bugs in the software itself. To do so, a system could use fraternal replicas, which perform the same computations, but using different algorithms. If these replicas come to the same conclusion – for instance, both agree that acceleration can be applied – there is greater overall confidence that the result is correct.
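
A fraternal pair might look like the following sketch, in which two deliberately different algorithms answer the same question before the answers are compared; the formulas, step size and names are hypothetical:

```c
/* Illustrative sketch of fraternal replicas: two different algorithms
 * answer the same question, and the results are compared.
 * All names and formulas here are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

/* Replica A: closed-form check of a 2-second following gap. */
static bool can_accelerate_a(double speed_kph, double gap_m)
{
    return gap_m > 2.0 * speed_kph / 3.6;
}

/* Replica B: the same decision derived differently - step the vehicle
 * forward in 0.1 s increments and see whether the gap survives 2 s. */
static bool can_accelerate_b(double speed_kph, double gap_m)
{
    double v = speed_kph / 3.6;                 /* m/s */
    for (double t = 0.0; t < 2.0; t += 0.1)
        gap_m -= v * 0.1;
    return gap_m > 0.0;
}

int main(void)
{
    double speed = 80.0, gap = 60.0;
    bool a = can_accelerate_a(speed, gap);
    bool b = can_accelerate_b(speed, gap);

    if (a == b)
        printf("fraternal replicas agree: accelerate = %d\n", a);
    else
        printf("replicas disagree: treat the result as unsafe\n");
    return 0;
}
```

A bug in one implementation is unlikely to be reproduced by the other, which is what gives the agreement its extra weight.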

 

Implementing replication

A system designer can implement these replication schemes transparently, using middleware interposed in the communications path between two components. In a microkernel OS, where all software components communicate with each other through message passing, the designer can take advantage of naturally occurring synchronisation points that make it easy to interpose the middleware and have it check subsystem operation. In a typical microkernel-based system, a server process provides services to its clients.

To ensure that the service is reliable and available, a replica-based system uses multiple instances of the server. The role of the middleware is to ensure that system events such as requests from clients are delivered to all server instances in exactly the same order. From the client’s perspective, there appears to be only one server; each server, for its part, believes it is executing alone. The middleware duplicates the messages from the client and distributes them to each server instance. It then receives the responses from each server and compares the results to ensure the servers agree. The application developer needs to pay no attention to replication or diversity.
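
The sketch below shows the idea in miniature – the request type, the voting scheme and the function names are assumptions made for the example, not a real middleware interface:

```c
/* Illustrative sketch of the interposed middleware: each client request is
 * delivered to every server replica in the same order, the replies are
 * compared, and a single agreed reply goes back to the client. */
#include <stdio.h>

#define NREPLICAS 3

typedef int (*server_fn)(int request);          /* a server replica */

/* Three replicas of the same (or diverse) server logic. */
static int server_a(int req) { return req * 2; }
static int server_b(int req) { return req * 2; }
static int server_c(int req) { return req * 2; }

static server_fn replicas[NREPLICAS] = { server_a, server_b, server_c };

/* The middleware: duplicate the request, gather the replies, then vote. */
static int middleware_send(int request, int *agreed_reply)
{
    int reply[NREPLICAS];
    for (int i = 0; i < NREPLICAS; i++)
        reply[i] = replicas[i](request);        /* same order for every replica */

    /* Simple majority vote across the replies. */
    for (int i = 0; i < NREPLICAS; i++) {
        int votes = 0;
        for (int j = 0; j < NREPLICAS; j++)
            if (reply[j] == reply[i])
                votes++;
        if (votes > NREPLICAS / 2) {
            *agreed_reply = reply[i];
            return 0;
        }
    }
    return -1;                                   /* no agreement: flag a fault */
}

int main(void)
{
    int reply;
    if (middleware_send(21, &reply) == 0)
        printf("client sees a single reply: %d\n", reply);
    else
        printf("replica disagreement detected\n");
    return 0;
}
```

In a real system the replicas would be separate processes, very likely on separate cores, but the duplicate, compare and vote step works the same way in principle.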

Replication points – that is, places where the middleware can be inserted to support multiple replicas – must occur at the right level of granularity. Duplicating every mathematical computation or function call is too fine-grained, and results in higher development costs and slower runtime performance. The ideal replication granularity is at the component level. In fact, the implementation of Posix API functions in a microkernel OS serves as a good model for this approach.

Consider an application that opens a file and reads some data. Decoupling the application from the file system provides an excellent insertion point for the middleware.

If, instead of talking to the file system, the application talks with the middleware and the middleware talks to replicated file system servers, redundancy and checking can occur at that natural decoupling point, in a manner completely transparent to the application itself, and to the file system implementation. By designing other components within the system at a similar level of granularity, the system designer can insert replicas at the process-to-process communication boundaries. The flexibility offered by the OS is important here: if the OS allows replicas or diverse applications to be seamlessly partitioned across processor cores, the implementation of a replication strategy becomes easier.
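
The client side of that file-system interaction is just ordinary Posix code; whether the pathname below resolves to a single file system process or to middleware fronting several replicated servers is invisible at this level (the pathname and buffer size are illustrative):

```c
/* The client side of the file-system example: ordinary Posix calls.
 * Whether /data/config.txt is served by one file system process or by
 * middleware fronting several replicated servers is invisible here. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[128];
    int fd = open("/data/config.txt", O_RDONLY);    /* a message to the server(s) */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    ssize_t n = read(fd, buf, sizeof(buf) - 1);     /* another message-passing boundary */
    if (n >= 0) {
        buf[n] = '\0';
        printf("read %zd bytes: %s\n", n, buf);
    }
    close(fd);
    return 0;
}
```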

 

Server models

As mentioned, the system designer can choose between two main server models – identical and fraternal, with fraternal being further categorised as peer fraternal and monitor fraternal.

In the identical model, the servers can run on the same CPU, on different cores in a multi-core system or even on different systems connected by a network. This model offers a measure of scalability and potential redundancy in case of hardware failure: the same software must produce the same results; if it doesn’t, then it’s a hardware issue.

In the peer fraternal model, the system uses diverse but fully functional versions of the servers, for example the same source code compiled by different compilers. The expectation is that any failures would be in the implementation domain, effectively bugs in the implementation.

The monitor fraternal model also uses diverse servers, but one server has full functionality and the other has reduced monitor functionality.
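
A monitor in this model can be very small. The sketch below pairs a full-function server with a simple sanity check; the control law and the limits are hypothetical:

```c
/* Illustrative monitor-fraternal pair: a full-function server computes a
 * demand, and a much simpler monitor (a "safety bag") only checks that the
 * demand is sane.  The control law and the limits are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

/* Full-function server: some complex control law (stubbed here). */
static double compute_acceleration(double speed_kph, double gap_m)
{
    return (gap_m - 2.0 * speed_kph / 3.6) * 0.1;   /* m/s^2, illustrative */
}

/* Monitor: no control law at all, just "is the decision sane and safe?" */
static bool monitor_accepts(double accel, double gap_m)
{
    if (accel > 3.0 || accel < -8.0)    /* outside plausible physical limits */
        return false;
    if (gap_m < 5.0 && accel > 0.0)     /* never accelerate into a tiny gap */
        return false;
    return true;
}

int main(void)
{
    double speed = 80.0, gap = 60.0;
    double accel = compute_acceleration(speed, gap);

    if (monitor_accepts(accel, gap))
        printf("apply %.2f m/s^2\n", accel);
    else
        printf("monitor veto: fall back to a safe state\n");
    return 0;
}
```

The monitor never reproduces the control law; it only rules out decisions that are obviously unsafe, which is what keeps its certification burden low.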

While all three models are useful and can be intermixed, the monitor fraternal model offers an interesting cost saving. Consider the certification costs of each model:

 

  • The identical model has no diversification, so the full development and certification process and cost for the Asil rating must be borne by the software.
  • The peer fraternal model uses diversification, so the overall combination of the multiple diverse instances contributes to both reliability and availability, but the software cost is double or more, depending on the number of diverse implementations.
  • The monitor fraternal model has successfully been used in other industries. The concept of a monitor is also known as a safety bag: a much simpler piece of software that ensures that the overall decisions being made are sane and safe. From a certification cost point of view, you certify the main server software and the much simpler monitor, which requires less certification effort. Combining the server with a monitor also increases the Asil level of the software: the monitor effectively provides Asil enrichment because it is another diverse instance.

 

The architecture described provides a very flexible and dynamic way of detecting random errors that have affected the safety of a system. The underlying principle is the strong ordering of events to all members of a server group. This provides a software virtual lockstep that does not suffer from the limitations of hardware lockstep. As servers may join and leave groups dynamically, the level of resilience can be tuned to the environment within which the system is operating.
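
In miniature, that ordering principle looks like the sketch below – the group size, event values and state update are illustrative only:

```c
/* Illustrative sketch of strong ordering: the middleware stamps every event
 * with a sequence number and delivers the events to every member of the
 * server group in exactly that order, so all members see the same history. */
#include <stdio.h>

#define GROUP_SIZE 3
#define NEVENTS    4

struct member { long state; };                     /* each member's private state */

static void deliver(struct member *m, int seq, int event)
{
    /* Deterministic processing: identical input order gives identical state. */
    m->state = m->state * 31 + event;
    (void)seq;
}

int main(void)
{
    struct member group[GROUP_SIZE] = { {0}, {0}, {0} };
    int events[NEVENTS] = { 7, 3, 9, 2 };

    for (int seq = 0; seq < NEVENTS; seq++)        /* one global order...          */
        for (int i = 0; i < GROUP_SIZE; i++)       /* ...delivered to every member */
            deliver(&group[i], seq, events[seq]);

    /* Because every member saw the same ordered stream, their states match. */
    for (int i = 0; i < GROUP_SIZE; i++)
        printf("member %d state = %ld\n", i, group[i].state);
    return 0;
}
```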

Chris Hobbs is a software safety specialist and Kerry Johnson is senior product manager at Blackberry QNX

www.qnx.com

 
