Kexin Rong - Learned Indexing and Sampling for Improving Query Performance in Big-Data Analytics
Traditional data analytics systems improve query efficiency via fine-grained, row-level indexing and sampling techniques. However, to keep up with growing data volumes, more and more systems store and process datasets in large partitions containing hundreds of thousands of rows. These analytics systems must therefore adapt traditional techniques to work with coarse-grained data partitions as the basic unit in order to process queries efficiently. In this talk, I will discuss two related ideas that combine learning techniques with partitioning designs to improve query efficiency in analytics systems. First, I will describe PS3, the first approximate query processing system that supports non-uniform, partition-level samples. To achieve the same error, PS3 accesses 3x to 70x fewer partitions than a uniform sample of the partitions. Next, I will present OLO, an online learning framework that dynamically adapts data organization to changes in the query workload in order to minimize overall data access and movement. We show that dynamic reorganization outperforms a single, optimized partitioning scheme by up to 30% in end-to-end runtime. I will conclude by discussing additional open problems in this area.
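To make the contrast between uniform and non-uniform partition-level samples concrete, here is a minimal Python sketch. It is not PS3's actual method: it uses the textbook Hansen-Hurwitz estimator (sampling with replacement, each partition's contribution divided by its selection probability), and the `weights` argument is a hypothetical stand-in for whatever per-partition importance score a learned system might produce. The function names are illustrative only.

```python
import random


def estimate_sum_uniform(partitions, k, rng):
    """Estimate the grand total by reading k partitions chosen uniformly
    at random (without replacement) and scaling up by n / k."""
    n = len(partitions)
    chosen = rng.sample(range(n), k)
    return sum(sum(partitions[i]) for i in chosen) * n / k


def estimate_sum_weighted(partitions, weights, k, rng):
    """Estimate the grand total by reading k partitions drawn with
    replacement, with probability proportional to `weights`.

    `weights` is a hypothetical importance score per partition and must
    be strictly positive. Each sampled partition's sum is divided by its
    selection probability (Hansen-Hurwitz), which keeps the estimate
    unbiased for any choice of positive weights.
    """
    total_weight = sum(weights)
    probs = [w / total_weight for w in weights]
    chosen = rng.choices(range(len(partitions)), weights=probs, k=k)
    return sum(sum(partitions[i]) / probs[i] for i in chosen) / k


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed data: most of the total lives in one partition.
    partitions = [[1, 1], [10, 10], [100, 100]]
    # Assumed importance scores; here they happen to match the true sums.
    weights = [2, 20, 200]
    print(estimate_sum_uniform(partitions, 1, rng))
    print(estimate_sum_weighted(partitions, weights, 1, rng))
```

When the weights track each partition's true contribution, the weighted estimate is accurate even after reading a single partition, whereas a uniform sample of one partition can be far off on skewed data. That is the intuition for why a non-uniform sample can reach the same error with fewer partitions read.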
The podcast and its cover art on this page belong to Dan Fu, Karan Goel, Fiodar Kazhamakia, Piero Molino, Matei Zaharia, Chris Ré. The podcast's content is created by Dan Fu, Karan Goel, Fiodar Kazhamakia, Piero Molino, Matei Zaharia, Chris Ré and not by, or together with, Poddtoppen.