• Preventing multiplicity problems in data exploration and auto-ML systems - Nikos Koulouris


  • Abstract:

    More data means more opportunity for a researcher to test more hypotheses until she discovers an interesting finding. This increases the chance of arriving at a false conclusion purely by chance, which is known as the multiplicity problem. Data exploration systems facilitate exploring big data by automatically testing thousands of hypotheses in order to find the most interesting ones. Auto-ML systems try to automate the analyst’s job of selecting the best ML model based on its performance on a holdout data set. In both cases, automatically testing more things means a higher chance of reporting a result that arose purely by chance.
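
    To make the scale of the problem concrete, the following is a minimal sketch, not taken from the talk, of how quickly the chance of at least one false positive grows with the number of tests. It assumes independent tests of true null hypotheses at significance level `alpha`; the function names are illustrative, and the Bonferroni threshold is shown only as a standard textbook correction, not as the method used by the systems described here.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise error
    rate at or below `alpha` (standard Bonferroni correction)."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        print(m, familywise_error_rate(m), bonferroni_threshold(m))
```

    Under these assumptions, ten tests at the 5% level already carry roughly a 40% chance of at least one spurious finding, and the chance exceeds 99% by one hundred tests.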

    First, I present VigilaDE, a data exploration system that uses the hierarchical structure of the data to control false discoveries. VigilaDE guides the exploration towards interesting discoveries while controlling false discoveries and increasing statistical power. Through extensive experiments with real-world data and simulations, I show that my data exploration algorithms can find up to 3.4 times more true discoveries in the data than the baseline.

    Next, I examine the problem of overfitting to a holdout data set in ML, which is a result of the multiplicity problem. I present the limitations of existing approaches for avoiding overfitting, and I introduce my idea for an algorithm that avoids overfitting to the holdout data set in the auto-ML setting. Finally, I discuss next steps to validate my initial idea and future directions.
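
    The link between holdout overfitting and multiplicity can be illustrated with a small simulation. This is a hedged sketch under simplifying assumptions, not the algorithm proposed in the talk: every candidate "model" guesses binary labels uniformly at random, so its true accuracy is exactly 0.5, and the selection step simply keeps the best holdout score. The function name and parameters are hypothetical.

```python
import random


def best_holdout_accuracy(num_models: int, holdout_size: int, seed: int = 0) -> float:
    """Return the best holdout accuracy among `num_models` candidates
    that all guess labels at random (true accuracy 0.5 each)."""
    rng = random.Random(seed)
    # Fixed holdout labels, reused to evaluate every candidate.
    labels = [rng.randint(0, 1) for _ in range(holdout_size)]
    best = 0.0
    for _ in range(num_models):
        # A "model" with no predictive power: uniform random guesses.
        preds = [rng.randint(0, 1) for _ in range(holdout_size)]
        accuracy = sum(p == y for p, y in zip(preds, labels)) / holdout_size
        best = max(best, accuracy)
    return best


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, best_holdout_accuracy(k, holdout_size=100))
```

    Because the winner is chosen on the same holdout set that is used to report its score, the reported accuracy drifts above 0.5 as more candidates are tried, even though no candidate is better than chance.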