A New Framework for Large-Scale Multiple Testing: Compound Decision Theory and Data-Driven Procedures

Wenguang Sun

This dissertation studies the large-scale multiple testing problem from a compound decision-theoretic viewpoint and proposes a new class of powerful data-driven procedures that substantially outperform traditional p-value based approaches. The research has several important implications: first, the individual p-value fails to serve as the fundamental building block for large-scale multiple testing; second, the validity of a false discovery rate (FDR) procedure should not be overemphasized at the expense of the important issue of efficiency; and third, the traditional "do-nothing" approach suggested for dependent multiple testing is inefficient, and the structural information among the hypotheses can be exploited to construct more powerful tests.

Chapter 1 reviews important concepts and the conventional framework for multiple testing and discusses several widely used testing procedures. Compound decision theory is formally introduced in Chapter 2. A major goal of Chapter 3 is to show that the p-value testing framework is generally inefficient in large-scale multiple testing and that the precision of the tests can be greatly increased by pooling information across samples. We develop a compound decision framework for multiple testing and derive a z-value based oracle procedure that minimizes the false non-discovery rate (FNR) subject to a constraint on the FDR. We then propose an adaptive procedure that asymptotically attains the performance of the oracle procedure. Chapter 4 considers the simultaneous testing of grouped hypotheses. Conventional strategies include pooled analysis and separate analysis; we derive an asymptotically optimal approach and show that both can be uniformly improved. Our new approach provides important insights into how to optimally combine testing results obtained from multiple sources. Chapter 5 considers multiple testing under dependence.
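To make the z-value based adaptive idea concrete, the following is a minimal sketch, not the dissertation's procedure itself: it ranks hypotheses by their local false discovery rate (lfdr) and rejects along the sorted list while the running average of lfdr values stays below the target level. The two-group normal mixture, the parameter values (`p0 = 0.9`, alternative mean 2.0), and the function name `lfdr_stepup` are illustrative assumptions; for simplicity the sketch uses the true mixture to compute the lfdr, whereas a fully data-driven version would estimate the null proportion and mixture density from the z-values.

```python
import numpy as np

def lfdr_stepup(lfdr, alpha):
    """Reject hypotheses, in order of increasing lfdr, as long as the
    running mean of the sorted lfdr values stays at or below alpha.
    (Illustrative sketch; names and setup are assumptions.)"""
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)
    k = int(np.sum(running_mean <= alpha))  # number of rejections
    reject = np.zeros(len(lfdr), dtype=bool)
    reject[order[:k]] = True
    return reject

# Toy two-group mixture: null N(0,1) w.p. 0.9, alternative N(2,1) w.p. 0.1.
rng = np.random.default_rng(0)
m = 10_000
is_alt = rng.random(m) < 0.1
z = rng.normal(np.where(is_alt, 2.0, 0.0), 1.0)

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf
p0 = 0.9  # assumed known here; estimated in practice
f = p0 * phi(z) + (1 - p0) * phi(z - 2.0)  # mixture density of z
lfdr = p0 * phi(z) / f  # P(null | z) under the two-group model

reject = lfdr_stepup(lfdr, alpha=0.10)
fdp = float(np.mean(~is_alt[reject])) if reject.any() else 0.0
```

Thresholding the lfdr, a z-value quantity, rather than individual p-values is what lets the procedure pool information across tests: the rejection region adapts to the estimated signal proportion and alternative distribution instead of being fixed in advance.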
We show that the conventional "do-nothing" approach can suffer substantial efficiency loss when the correlation structure is highly informative. We propose a data-driven procedure that is asymptotically valid and enjoys certain optimality properties. The new procedure is especially accurate in identifying structured weak signals, for which traditional procedures tend to suffer from extremely low power.