• Re-Architecturing Search Engine Indexing Systems for Flash SSDs - Jianguo Wang


  • Abstract:

    Conventional search engines are optimized extensively for hard disk drives (HDDs), which dominate the storage market for decades.  HDDs are known for slow random reads, which are one to two orders of magnitude (50 ~ 100X) more expensive than sequential reads. Thus, the existing search engine design principle is to minimize expensive random reads by introducing increased sequential reads. As an example of list intersection — a fundamental operation in search engines. In order to minimize expensive random reads, all the existing (HDD-centric) intersection algorithms load the lists to memory entirely.

    However, the emergence of fast solid state drives (SSDs) has changed this landscape completely by dramatically improving random I/O performance. Today, random reads on SSDs are fast enough to be comparable with sequential reads (1 ~ 2X). This means that HDD-centric optimizations, with their many sequential reads, are suboptimal on SSDs because they make the total number of reads unnecessarily high. Besides that, the existing HDD-centric design also ignores many SSD-specific characteristics that HDDs do not exhibit such as high internal parallelism and out-of-place updates (for writes). That being said, directly running existing HDD-optimized search engines on SSDs will miss significant opportunities to perform dramatically better.

    This paper surveys existing design decisions for HDD-centric search engines and how SSDs make things different. We focus on the indexing system of a search engine, which is known to be performance bottleneck. The indexing system is mainly composed of three components: query processor, cache manager and storage manager. For each component, we first review existing work on HDDs and SSDs (if any), and then present our ideas.

    By: Jianguo Wang