Query Processing for Massively Parallel Systems - Paris Koutris
In this talk, we look into modern data analytics systems that use large-scale parallelism to analyze massive datasets. We propose a simple but powerful theoretical model for parallel query processing, the Massively Parallel Computation (MPC) model. The MPC model captures the two basic parameters of computation for parallel query processing: the number of rounds of computation, and the maximum amount of data that each processing node can receive.
Based on the MPC model, we will discuss how we can design efficient parallel algorithms for query processing with formal guarantees on their behavior. We will show how we can optimally compute multiway joins in a single round using the HyperCube algorithm. Then, we will discuss the effect of data skew and propose techniques that mitigate the presence of skew. Finally, we will present algorithms that compute queries using multiple rounds, where we currently understand the behavior only for limited scenarios.
By: Paris Koutris (U. Washington)
If we knew what it was we were doing, it would not be called research, would it? - Albert Einstein