Software Engineering for Reliable Embedded Systems

Module code: EG7530
Module co-ordinator: Dr Alistair McEwan

A rigorous approach to dealing with the software crisis facing reliable embedded systems.

In 1988 the unmanned Phobos 1 probe was launched by the Soviet Union to explore Mars. The probe operated successfully for a period of time until a problem occurred: an issue with the thrusters control caused erroneous behaviour in the solar power system. Power was depleted and the mission failed. Post-mortem analysis revealed that the error that had led to the fault was in the software: interactions between several software components had not been fully understood despite rigorous and intense testing and development. Although this example of a lack of reliability leading to a mission failure is nearly 30 years old, it is a clear example of the problems that software systems have faced and continue to face. It demonstrates that issues with understanding software and reliability are not new. As we are in an era where reliance on software is pervasive, these issues – which the engineering world has yet to solve – are more intractable and more consequential. The way we build, test, and certify systems that we trust our lives to is no longer fit for purpose. This is known as 'the software crisis' and it is one of the most urgent challenges that the software engineering world and embedded systems world faces.

This software crisis is not going to go away. For instance, the Google autonomous car is a wonderful achievement for modern engineering, and a tribute to the intelligence and functionality that can be put into software. However even an achievement such as this illustrates the problem of the software crisis as this car has been shown to have problems with the specifications of behaviours. The problem becomes an order of magnitude worse when we consider that the problem is not limited to just verifying the behaviour of the car and ensuring that we have specified appropriate behaviour.

  • How do we make assurances about its behaviours in the presence of other road users?
  • What happens when the real world produces circumstances that were not predicted?
  • Most importantly, how can we convince people that the software does exactly what we claim it does – and only what we claim it does?
  • What should these claims look like and how can we back up these claims with rigorous argument?
  • Are these problems limited to safety-critical systems, or can we develop engineering methodologies that help us address the software crisis in cost effective and tractable ways across multiple areas?

In this module you will study state-of-the-art modelling and verification techniques which are driving forward the way that developers and certification authorities are thinking about reliability, and what it means to claim that a system is reliable. You will study how the language of Communicating Sequential Processes (CSP) may be used to specify the functionality we expect and demand from reliable embedded systems, and how it can be used to argue whether or not a system meets these expectations. We will consider how we can classify different categories of errors - from bugs in code, all the way to misunderstood or incorrectly written specifications - and what it means for a component in a reliable embedded system to fail: how it may affect the behaviours and misbehaviours of the system and whether or not any level of safety in the presence of component failure can be justifiably claimed.

You will study this in the context of building specifications and implementations of systems using the modelling features of the CSP language, and industrial strength state-of-the-art tools for automatic verification. We will explore the complexities and challenges involved in producing plausible arguments about the behaviour of reliable embedded systems, how modern software engineering techniques mitigate the natural and theoretic limitations of testing based approaches, and how these methodologies provide real, practical, and scalable solutions to the software crisis.

The module is taught in three distinct phases:

  • In the first phase, you will revise and rekindle your understanding of elementary software engineering mathematics as part of a distance learning package.
  • In the second phase you will attend the residential week at the University of Leicester Embedded Systems Teaching Laboratory. This residential week forms the main part of the contact teaching for the module. During this week you will engage with a number of interactive seminar sessions discussing and studying the various topics, and a number of laboratory based sessions where you will experiment with the tools and techniques discussed. Both the seminar and the laboratory sessions are run by academics and experts from the Embedded Systems and Communications research group at the University.
  • In the third phase, after the residential week and as part of a distance learning package you will work on a significant realistic case study where you can explore all of the topics covered in the module in the context of a realistic industrial problem.

Topics covered

Introduction to CSP

  • What is CSP, and what does a piece of CSP intended to model a system look like?
  • What level of precision does this model give in terms of the behaviour of a system and why is it crucial to have this model when one could argue that source code is a definitive model?
  • What tool support is provided? 
  • How does this naturally integrate with commonly understood engineering paradigms such as the V model? 
  • How is this going to improve our notion of reliability and the claims we make about reliability?

Understanding concurrency

  • What does 'concurrency' mean, and why is it a key concept in modelling and understanding systems? 
  • How can we ensure that we truly and accurately understand behaviours in the presence of concurrency? 
  • How does concurrency affect the phenomenon of nondeterminism, and why is nondeterminism both our friend and our enemy in the context of reasoning about reliability?

Scalable engineering and analysis 

  • How can we build models of systems that scale to the levels of those that are causing the software crisis? 
  • What support does the language offer us in achieving this, and what reliance can we place on the theoretical underpinnings that are not always considered so rigorously at the source code level?
  • How can we ensure that automatic verification of properties of interest is possible in order to mitigate the common apprehensions of verification challenges?

Understanding safety and models 

  • What does 'safety' really mean – particularly in the context of reliable embedded systems?
  • What exactly is a safety specification, and can we give a safety specification a defined and precise meaning? 
  • What reassurances do safety specifications give us about the behaviours of a system? 
  • Do we need to be able to guarantee that an implementation of a given system is definitely safe, or is it more important to ensure we understand misbehaviours exactly in normal and failed operating modes? 
  • How does this contribute to the wider questions involved in verifying and certifying reliable embedded systems?