Review of the green book from Bruno Sericola

This book should be on the desk of any university teacher, any engineer and anybody studying the dependability of complex computing systems or communication networks. It runs to about 700 pages and is quite complete, since it covers all aspects of the dependability evaluation of such systems. It begins with the basic concepts and definitions of dependability measures. The many ways of modeling these systems are then described. All these models are analysed, and the techniques to solve them are presented and discussed in detail. Besides the numerous examples and problems, a whole part of the book is dedicated to case studies that model several real-life systems. Moreover, every chapter contains a quite interesting section for further reading.

It is, to the best of my knowledge, the only book dealing so thoroughly and in detail with this wide subject.

The book is composed of 6 parts. Part I is the introduction and consists of 3 chapters. The first chapter gives a rigorous definition of dependability, first proposed by Jean-Claude Laprie in 1984, and of dependability measures. It also contains several examples of system dependability evaluation. Chapter 2 reviews all aspects of dependability evaluation, from measurement-based evaluation to model-based evaluation, which is the heart of the book. It describes in great detail the modeling process of a general system, the different types of modeling formalisms, and model verification and validation. Chapter 3 examines the dependability metrics defined on a single unit, which can be seen as a simple on-off system. These metrics include the distribution of the time to failure, the availability, the reliability, the mean time to failure and the failure rate. They are obtained for general distributions of the times to failure and repair, and also for particular examples of such distributions.
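As a minimal illustration of the single-unit metrics listed above, the following Python sketch assumes exponentially distributed times to failure and repair. The names `lam` and `mu` and the numerical rates are hypothetical choices made for this example; they are not taken from the book.

```python
import math

def reliability(t, lam):
    """R(t) = P(time to failure > t) for an exponential time to failure."""
    return math.exp(-lam * t)

def mttf(lam):
    """Mean time to failure: the integral of R(t) over [0, inf), i.e. 1/lam."""
    return 1.0 / lam

def steady_state_availability(lam, mu):
    """Long-run fraction of time the unit is up: MTTF / (MTTF + MTTR)."""
    return mu / (lam + mu)

def availability(t, lam, mu):
    """Point availability A(t) of a repairable unit that is up at time 0."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

# Hypothetical rates (per hour): one failure per 1000 h, mean repair time 10 h.
lam, mu = 1e-3, 1e-1
```

With an exponential time to failure the failure rate is the constant `lam`, and `availability(t, lam, mu)` decreases from 1 at time 0 towards `steady_state_availability(lam, mu)`.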

Part II deals with non-state-space models, also called combinatorial models, and is composed of 5 chapters. In this part, the authors consider models of systems that can be decomposed into elementary components which behave in a statistically independent manner and which can each be represented by a Bernoulli random variable, meaning that they have only two possible states: up or down. The case of multi-state components is addressed in Chapters 5 and 6.

Chapter 4 discusses reliability block diagrams in the case of series-parallel systems and k-out-of-n systems. Besides non-identical k-out-of-n systems, a special case of two groups of components in a k-out-of-n system is developed. The general case of non-series-parallel systems is addressed in Chapter 5, which considers network reliability through binary probabilistic networks, binary probabilistic weighted networks and also multi-state networks, extending the case where each component has only two possible states. A real case study of a major subsystem of the Boeing 787 is analyzed using a new bounding algorithm. Fault tree analysis forms Chapter 6. Both the qualitative and the quantitative points of view are taken into account, in the case without repeated events as well as in the case where events can be repeated. Three importance measures are proposed: the Birnbaum importance measure, the criticality importance measure and the Fussell–Vesely importance measure. The application of fault tree analysis in real life is illustrated by means of three case studies which are detailed clearly and explicitly. As in Chapter 5, the case where the components may have more than two states is also considered. Finally, the mapping of fault trees into Bayesian networks is explained and illustrated using various examples. The state enumeration method, proposed in Chapter 7, consists in enumerating all the possible states of the system. Using the system structure function, the contribution of each state to the considered dependability measure is evaluated in order to obtain the dependability measure for the entire system. This is a static method, used under the assumption of statistical independence of the component failures and repairs. Chapter 8 is entitled Dynamic Redundancy. The type of redundancy studied in the previous chapters is static redundancy. Here the fault management system is supposed to act dynamically to deal with component failures. Three types of standby redundancy are analyzed: cold standby, warm standby and hot standby. This chapter is in fact an introduction to Part III of the book, in which the Markov chain modeling approach is considered at length. Indeed, dynamic redundancy is modelled as sums and mixtures of independent random variables.
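The state enumeration idea described for Chapter 7 can be sketched in a few lines of Python. This is only an illustration under the independence assumption stated above, not the book's algorithm: the function names, the 2-out-of-3 example and the component reliabilities are hypothetical. The same enumeration is reused to compute a Birnbaum importance measure as the difference between the system reliability with a component forced up and forced down.

```python
from itertools import product

def k_out_of_n_structure(k):
    """Structure function: the system is up iff at least k components are up."""
    return lambda state: sum(state) >= k

def system_reliability(phi, p):
    """State enumeration: add up the probabilities of all system-up states.

    p[i] is the probability that (independent) component i is up.
    """
    total = 0.0
    for state in product((0, 1), repeat=len(p)):
        if phi(state):
            prob = 1.0
            for x, p_i in zip(state, p):
                prob *= p_i if x else 1.0 - p_i
            total += prob
    return total

def birnbaum_importance(phi, p, i):
    """R(system | component i up) - R(system | component i down)."""
    up = list(p)
    up[i] = 1.0
    down = list(p)
    down[i] = 0.0
    return system_reliability(phi, up) - system_reliability(phi, down)

# Hypothetical 2-out-of-3 system with identical component reliabilities.
phi = k_out_of_n_structure(2)
p = [0.9, 0.9, 0.9]
r = system_reliability(phi, p)        # 3 * 0.9**2 * 0.1 + 0.9**3 = 0.972
b0 = birnbaum_importance(phi, p, 0)   # 0.99 - 0.81 = 0.18
```

The enumeration visits 2^n states for n components, so a sketch like this is practical only for small systems.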

Part III deals with state-space models with exponential distributions and contains 4 chapters (Chapters 9 to 12), all based on dynamic models represented by continuous-time Markov chains.

Chapter 9 recalls the basic results on continuous-time Markov chains and how they apply to the evaluation of several availability measures. It contains interesting examples and case studies as well. Both sensitivity analysis with respect to a given parameter and numerical methods for steady-state analysis are presented. While Chapter 9 focuses on availability models, Chapter 10 considers reliability models, which need a transient analysis of the corresponding Markov chain. Other dependability measures are also taken into account, as are Markov chains with absorbing states, and several transient solution methods are proposed. Chapter 11 studies classical queueing systems which can be represented by continuous-time Markov chains. These queues are all particular cases of the birth-death process, except the last one, for which the server is allowed to fail and to be repaired. Petri nets are another high-level formalism, which is based on Markov chains under the classical independence and exponential distribution assumptions. They are presented in detail in Chapter 12 with the help of a wide variety of examples and with references to packages developed by the team of Prof. Kishor S. Trivedi.
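To make the birth-death remark concrete, here is a minimal Python sketch of the steady-state distribution of a finite birth-death continuous-time Markov chain, obtained from the balance equations pi[i] * births[i] = pi[i+1] * deaths[i]. The function name and the M/M/1/K parameters are hypothetical and chosen only for illustration.

```python
def birth_death_steady_state(births, deaths):
    """Steady-state probabilities of a finite birth-death CTMC.

    births[i] is the rate from state i to i+1,
    deaths[i] is the rate from state i+1 to i.
    """
    weights = [1.0]
    for b, d in zip(births, deaths):
        # Balance between neighbouring states: pi[i] * b = pi[i+1] * d.
        weights.append(weights[-1] * b / d)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical M/M/1/K queue: arrival rate 2, service rate 3, capacity K = 4.
K = 4
pi = birth_death_steady_state([2.0] * K, [3.0] * K)

# A two-state availability model is the special case with one birth and one death:
# state 0 = up, state 1 = down, failure rate 0.001, repair rate 0.1.
up_down = birth_death_steady_state([0.001], [0.1])
```

Here `up_down[0]` is the steady-state availability of the single repairable unit, mu / (lam + mu).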

Part IV deals with state-space models with non-exponential distributions and contains 3 chapters in which the exponential assumption is relaxed.

In Chapter 13, non-homogeneous continuous-time Markov chains are presented. For these processes, the Markov property is still valid but the transition rates are time dependent, so that the transient behavior can no longer be expressed as the exponential of the infinitesimal generator. Several illustrative examples are proposed together with different numerical solution methods. Chapter 14 is dedicated to the analysis of semi-Markov and Markov regenerative processes. These processes generalize Markov chains by allowing non-exponentially distributed sojourn times in the states. The transient and stationary solutions of these processes are analyzed, and here again several illustrative examples are proposed and solved. A very important class of distributions, called phase-type distributions, is detailed in Chapter 15. A phase-type distribution is the distribution of the time to absorption in an absorbing continuous-time Markov chain. Such distributions can approximate any continuous distribution function as closely as desired. This important property can be used to replace a non-exponential distribution by its phase-type approximation, thus preserving the Markov property at the cost of a larger state space than that of the initial process.
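The definition of a phase-type distribution as a time to absorption can be illustrated with a short Python sketch. It evaluates the distribution function F(t) = 1 - alpha exp(T t) 1 by uniformization, one standard transient solution technique, chosen here for convenience and not necessarily the one used in the book. The representation (alpha, T), the function name and the Erlang-2 example are assumptions made for this illustration.

```python
import math

def ph_cdf(alpha, T, t, tol=1e-12, max_terms=10_000):
    """CDF at time t of the phase-type distribution (alpha, T), by uniformization.

    alpha: initial probabilities over the transient phases.
    T: sub-generator matrix restricted to the transient phases.
    """
    n = len(alpha)
    q = max(-T[i][i] for i in range(n))            # uniformization rate
    # Substochastic matrix of the uniformized chain: P = I + T / q.
    P = [[(1.0 if i == j else 0.0) + T[i][j] / q for j in range(n)]
         for i in range(n)]
    v = list(alpha)                                # v = alpha * P^k
    weight = math.exp(-q * t)                      # Poisson(k; q*t) for k = 0
    survival, mass, k = 0.0, 0.0, 0
    while mass < 1.0 - tol and k < max_terms:
        survival += weight * sum(v)                # P(not yet absorbed after k jumps)
        mass += weight
        k += 1
        weight *= q * t / k
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
    return 1.0 - survival

# Erlang-2 with rate 2 written as a phase-type distribution (hypothetical example).
alpha = [1.0, 0.0]
T = [[-2.0, 2.0],
     [0.0, -2.0]]
value = ph_cdf(alpha, T, 1.0)   # closed form: 1 - exp(-2) * (1 + 2)
```

For large values of `q * t` the initial Poisson weight underflows, so a production implementation would need a more careful summation; the sketch is meant for small examples only.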

Part V covers multi-level models, which combine several model types, like those detailed in the previous chapters, in order to capture the entire complexity of real-life systems.

The possibility of combining several model types in a hierarchical manner is explored in Chapter 16. The development of hierarchical models relies on the graph of the inter-model dependencies, called the import graph, being acyclic. When the import graph contains cycles, the solution of the model requires successive substitution of the submodel solutions in order to solve the underlying fixed-point equations. This is the subject of Chapter 17.
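Successive substitution over a cyclic import graph can be sketched as follows. The two submodels are entirely hypothetical toy models invented for this illustration: one returns the availability of a unit given its repair rate, the other returns the repair rate given the unit's unavailability (a congested repair facility). Neither comes from the book.

```python
LAM = 0.01   # hypothetical failure rate
MU0 = 1.0    # hypothetical repair rate of an uncongested repair facility
C = 5.0      # hypothetical congestion factor

def availability_submodel(mu):
    """Submodel A: steady-state availability of a unit with repair rate mu."""
    return mu / (LAM + mu)

def repair_submodel(unavailability):
    """Submodel B: effective repair rate, which drops as unavailability grows."""
    return MU0 / (1.0 + C * unavailability)

def combined(a):
    """One pass around the cycle: A imports from B, and B imports from A."""
    return availability_submodel(repair_submodel(1.0 - a))

def successive_substitution(f, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- f(x) until two successive iterates differ by less than tol."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("successive substitution did not converge")

a_star = successive_substitution(combined, 1.0)
```

Convergence is not guaranteed in general; this toy iteration converges because the combined map is a contraction for the chosen parameter values.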

Finally, Part VI presents the modeling of real-life systems in terms of availability models, reliability models and combined performance and reliability (performability) models.