• Query Processing of Sorted Lists on Modern Hardware - Jianguo Wang


  • Abstract: A sorted list is a collection of <ID, value> pairs sorted either by ID (called an ID-sorted list) or value (called a value-sorted list). Although simple, it is surprisingly the fundamental structure in a wide range of data management systems, e.g., databases, search engines, graph systems, and high-dimensional similarity search systems. Existing query processing over sorted lists were mainly optimized for past hardware, e.g., slow CPUs, small DRAM, and slow disks. However, with the emergence of new hardware, e.g., fast multi-core CPUs, big DRAM, fast SSDs, we find that many of existing design decisions become obsolete.
     
    Motivated by this, in this dissertation, we investigate how to best leverage the characteristics of emerging hardware to accelerate query processing over sorted lists. First, we studied the impact of big memory and modern CPUs to ID-sorted list compression. We designed a new compression called MILC, which is SIMD-aware, cache-aware, and most importantly can answer queries directly over compressed lists. Second, we studied the impact of fast SSDs to ID-sorted list intersection. We developed and evaluated SSD-aware intersection algorithms that can skip unnecessary data pages to leverage SSDs’ fast random accesses. Finally, we folded the proposed hardware-specific techniques into a real-world modern application — mass spectrometry search — that involves query processing on both ID-sorted lists and value-sorted lists. We also developed domain-specific and data-specific optimizations besides hardware-specific optimizations. Putting them together, our new mass spectrometry search system is significantly faster than state-of-the-art systems.