Probabilistic Fault Diagnosis in Large, Heterogeneous Computing Systems

Authors

  • Tamás Bartha
  • Endre Selényi

Abstract

Probabilistic diagnosis aims to make the system-level fault diagnosis problem easier to solve and the resulting algorithms more generally applicable. The price of these advantages is that the diagnostic result is no longer guaranteed to be correct and complete in every fault situation. This paper presents a novel approach, called local information diagnosis (LID), and applies it to create a family of probabilistic diagnostic algorithms. The developed algorithms fall into three main classes: limited inference, limited information, and scalar algorithms. Every LID algorithm consists of three main phases: an inference extraction phase, an inference propagation phase, and a fault classification phase. The paper introduces four algorithms based on the LID concept. These algorithms differ mainly in the inference propagation and fault classification phases, representing a trade-off between performance and diagnostic accuracy. The quality of the heuristic rules employed in the fault classification phase significantly affects the accuracy of diagnosis. Three heuristic methods of fault classification are defined, and the diagnostic performance of these heuristics is compared using measurement results.

Keywords

multiprocessor systems, system-level fault diagnosis, probabilistic diagnostic algorithms, generalized test invalidation, fault classification heuristics

How to Cite

Bartha, T., Selényi, E. “Probabilistic Fault Diagnosis in Large, Heterogeneous Computing Systems”, Periodica Polytechnica Electrical Engineering, 43(2), pp. 127–149, 1999.

Section

Articles