Hardening cloud and datacenter systems against configuration errors
Configuration errors are among the dominant causes of system-wide, catastrophic failures in today’s cloud and datacenter systems. Despite the wide adoption of fault-tolerance and recovery techniques, these large-scale software systems still cannot deal effectively with configuration errors. To make matters worse, the fault-tolerance and recovery mechanisms themselves are often misconfigured and thus crippled in practice.
In this talk, I will describe my research on understanding configuration errors and hardening cloud and datacenter systems against them. First, I will briefly present work that examines the fundamental causes of misconfigurations, in particular how systems are configured in the field and how misconfigurations are introduced, based on studying and characterizing real-world configuration practices. I will then describe work that enables system designers and developers to prepare cloud and datacenter systems to anticipate and defend against configuration errors by (1) exposing misconfiguration vulnerabilities (bad system reactions to configuration errors, such as crashes, hangs, and silent failures), and (2) checking configurations proactively to prevent failure damage.
By Tianyin Xu, CSE PhD student advised by YY
If we knew what it was we were doing, it would not be called research, would it? - Albert Einstein