Statistics Seminar - Spring 2025
Seminars are held on Thursdays from 4:00 to 5:00 p.m. in Room 611 and/or on Zoom unless otherwise noted. For access information, please contact the Math Department.
For questions about the seminar schedule, please contact Chong Jin and/or Chenlu Shi.
March 27
Dr. Zhaonan Qu, Columbia University
Distributionally Robust Instrumental Variables Estimation
Instrumental variables (IV) estimation is a fundamental method in econometrics and statistics for estimating causal effects in the presence of unobserved confounding. However, challenges such as untestable model assumptions and poor finite-sample properties have undermined its reliability in practice. Viewing common issues in IV estimation as distributional uncertainties, we propose DRIVE, a distributionally robust IV estimation method. We show that DRIVE minimizes a square-root variant of the ridge-regularized two-stage least squares (TSLS) objective when the ambiguity set is based on a Wasserstein distance. In addition, we develop a novel asymptotic theory for this estimator, showing that it achieves consistency without requiring the regularization parameter to vanish. This property ensures that the estimator is robust to distributional uncertainties that persist in large samples. We further derive the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures to select the regularization parameter based on theoretical results. Simulation studies demonstrate the superior finite-sample performance of Wasserstein DRIVE in terms of estimation error and out-of-sample prediction. Due to its regularization and robustness properties, Wasserstein DRIVE presents an appealing option when the practitioner is uncertain about model assumptions or distributional shifts in data.
Homepage: https://zhaonanq.github.io/
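For context, the classical TSLS objective the abstract builds on, together with a generic square-root ridge-regularized variant of the kind the abstract describes, can be written as follows. This is a standard illustrative form only; the exact DRIVE objective and its regularization are given in the paper.

```latex
\[
\hat{\beta}_{\mathrm{TSLS}}
  = \arg\min_{\beta}\; (Y - X\beta)^\top P_Z (Y - X\beta),
\qquad
P_Z = Z (Z^\top Z)^{-1} Z^\top ,
\]
\[
\hat{\beta}_{\mathrm{sqrt\text{-}ridge}}
  = \arg\min_{\beta}\;
    \sqrt{\tfrac{1}{n}\,(Y - X\beta)^\top P_Z (Y - X\beta)}
    \;+\; \rho \,\lVert \beta \rVert_2 ,
\]
```

where \(Z\) is the matrix of instruments, \(P_Z\) the projection onto their column space, and \(\rho \ge 0\) a regularization parameter (here a generic placeholder, not the data-driven choice developed in the talk).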
April 3
Dr. Feng Ruan, Northwestern University
Layered Models can "Automatically" Discover Low-Dimensional Feature Subspaces—No Regularization Required
Layered models, such as neural networks, appear to extract meaningful features through empirical risk minimization, yet the principles behind this process remain unclear. We analyze a two-layer nonparametric regression model akin to neural networks and prove that it naturally induces dimensionality reduction by identifying feature subspaces relevant for prediction—without conventional regularizations such as nuclear norm penalties, early stopping, or other algorithmic interventions. Our results explain this implicit regularization through the lens of set identifiability from the variational analysis literature, showing how "sharpness" in the optimization landscape of the population objective naturally enforces low-complexity solutions in finite samples.
Homepage: https://fengruan.github.io/
April 17
Dr. Yawen Guan, Colorado State University
Title/Abstract Forthcoming
April 24
Dr. Richard Guo, University of Michigan
Title/Abstract Forthcoming
May 1
Dr. Kan Chen, Harvard University
Title/Abstract Forthcoming
Last Updated: March 21, 2025