Title: Algorithms for coping with silent and non-functional errors
Directors: Yves ROBERT & Changbo WANG
Discipline: Computer Science
Status: Completed Project
Starting date: 2015
Summary
This project deals with silent and non-functional errors, which pose a major threat to data-intensive applications running on large-scale platforms or scientific clouds.
Silent errors have become a major problem for large-scale distributed systems. Big data processing demands ever larger memory volumes and ever more computation, which increases the probability of corrupted bits in memory or incorrect CPU results. Such silent errors are a major threat to the accuracy and trustworthiness of big data applications. However, detecting silent errors is hard, and correcting them is even harder. This project aims to develop generic algorithms that achieve both detection and correction of silent errors by coupling verification mechanisms with checkpointing protocols.
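To make the idea of coupling verification with checkpointing concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's algorithm: the function name, the fixed checkpoint interval, the toy computation, and the invariant used as a verification mechanism are all hypothetical. The computation is verified before each checkpoint is taken, so every checkpoint is known to be correct, and a failed verification triggers a rollback to the last verified checkpoint.

```python
import copy

def run_with_verified_checkpoints(state, step, verify, num_steps, interval):
    """Run num_steps of `step`, verifying every `interval` steps.

    A checkpoint is saved only after a successful verification, so a
    rollback always restores a state known to be correct.
    Returns (final_state, number_of_rollbacks).
    """
    checkpoint = copy.deepcopy(state)  # last verified state
    checkpoint_step = 0
    rollbacks = 0
    i = 0
    while i < num_steps:
        state = step(state, i)
        i += 1
        if i % interval == 0 or i == num_steps:
            if verify(state, i):
                checkpoint = copy.deepcopy(state)
                checkpoint_step = i
            else:
                # Silent error detected: re-execute from the last verified state.
                state = copy.deepcopy(checkpoint)
                i = checkpoint_step
                rollbacks += 1
    return state, rollbacks

# Toy workload (hypothetical): accumulate 0 + 1 + ... + (n - 1).
pending_faults = {5}  # inject one silent corruption the first time step 5 runs

def step(state, i):
    state["total"] += i
    if i in pending_faults:
        pending_faults.remove(i)
        state["total"] += 1000  # simulated bit corruption
    return state

def verify(state, steps_done):
    # Application-specific invariant: sum of 0..steps_done-1.
    return state["total"] == steps_done * (steps_done - 1) // 2

final_state, rollbacks = run_with_verified_checkpoints(
    {"total": 0}, step, verify, num_steps=10, interval=4
)
print(final_state["total"], rollbacks)
```

In this sketch the interval is a free parameter: verifying more often wastes less work on each rollback but costs more verification time, which is the kind of trade-off such protocols have to balance.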
Application-specific techniques will also be investigated to decrease detection/correction cost for dense and sparse numerical linear algebra.
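The summary does not say which application-specific techniques are meant. One well-known example for dense linear algebra is checksum-based verification (often called algorithm-based fault tolerance), sketched below in Python as an assumption about the kind of technique involved. For a matrix-vector product y = A x, the sum of the entries of y must equal the dot product of the column sums of A with x, so a corrupted result can be detected at a cost far below that of recomputing the product. All function names here are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def column_checksum(A):
    """Sum of each column of A, i.e. the row vector (1, ..., 1) * A."""
    return [sum(col) for col in zip(*A)]

def verify_matvec(A, x, y, tol=1e-9):
    """Check that sum(y) matches the checksum predicted from A and x."""
    expected = sum(c * xi for c, xi in zip(column_checksum(A), x))
    return abs(sum(y) - expected) <= tol * max(1.0, abs(expected))

A = [[1, 2], [3, 4]]
x = [5, 6]
y = matvec(A, x)
ok_clean = verify_matvec(A, x, y)

y_corrupted = list(y)
y_corrupted[1] -= 1  # simulated silent error in one output entry
ok_corrupted = verify_matvec(A, x, y_corrupted)
print(ok_clean, ok_corrupted)
```

A single checksum of this kind detects an error but does not locate it, and errors that cancel out in the sum go unnoticed. Correction requires additional checksums or a rollback to a checkpoint.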
Non-functional errors are an important source of problems for scientific applications deployed on distributed cloud computing platforms. These applications are expressed as workflows, so cloud workflow systems are widely used as platform software (or middleware services) to facilitate the use of cloud services. The quality of a cloud workflow application is determined by the collective behavior of all the cloud software services it employs. Because every cloud service carries a certain amount of uncertainty, assessing the quality of a cloud workflow instance becomes a much more complex combinatorial problem. Non-functional errors, namely violations of service quality constraints, can significantly degrade the usability of big data applications. Activity-point based and time-point based checkpoint selection are the two major types of strategy in workflow temporal verification. Activity-point based checkpoint selection monitors the response time of each activity and is normally used to monitor a single large sequential process. In contrast, time-point based checkpoint selection monitors throughput and is often used to monitor a large batch of parallel processes. Both strategies need to be investigated in order to provide effective quality assurance for scientific cloud workflows.
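The difference between the two checkpoint selection strategies can be sketched in a few lines of Python. This is one plausible reading of the description above, not the project's method: the function names, the per-activity bounds, the sampling period, and the minimum throughput are all hypothetical parameters chosen for illustration.

```python
def activity_point_checkpoints(durations, bounds):
    """Select the activities whose response time exceeds its bound.

    durations[i] is the observed response time of activity i in a
    sequential process; bounds[i] is its (assumed) temporal constraint.
    """
    return [i for i, (d, b) in enumerate(zip(durations, bounds)) if d > b]

def time_point_checkpoints(completion_times, period, horizon, min_rate):
    """Select the time points at which throughput drops below min_rate.

    Throughput is measured as activities completed per time unit within
    each window (t - period, t], for t = period, 2 * period, ..., horizon.
    """
    selected = []
    t = period
    while t <= horizon:
        done = sum(1 for c in completion_times if t - period < c <= t)
        if done / period < min_rate:
            selected.append(t)
        t += period
    return selected

# Sequential process: activities 1 and 3 exceed their bounds.
slow_activities = activity_point_checkpoints(
    durations=[2, 5, 3, 9], bounds=[3, 4, 3, 6]
)

# Batch of parallel processes: completions per 5-unit window are 4, 1, 2, 1.
slow_windows = time_point_checkpoints(
    completion_times=[1, 2, 3, 4, 9, 14, 15, 16],
    period=5, horizon=20, min_rate=0.5,
)
print(slow_activities, slow_windows)
```

The activity-point variant needs one check per activity, which suits a single long sequence. The time-point variant needs one check per sampling period regardless of how many processes run in parallel, which is why it scales better to large batches.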