Statistical uncertainty quantification for machine learning models and training acceleration for graph neural networks
Xu, Tianning
Permalink
https://hdl.handle.net/2142/124538
Description
- Title
- Statistical uncertainty quantification for machine learning models and training acceleration for graph neural networks
- Author(s)
- Xu, Tianning
- Issue Date
- 2024-04-23
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhu, Ruoqing
- Shao, Xiaofeng
- Doctoral Committee Chair(s)
- Zhu, Ruoqing
- Committee Member(s)
- Yang, Yun
- Zhao, Sihai Dave
- Department of Study
- Statistics
- Discipline
- Statistics
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- U-statistics
- random forest
- variance estimation
- conformal prediction
- Abstract
- Uncertainty quantification in statistical models is crucial across applications that require credible conclusions, particularly in domains such as estimating treatment effects. A significant emphasis in current research lies in constructing confidence sets and prediction sets to complement point estimation, which forms the core focus of the first and second chapters of this thesis, respectively. In addition, with the exponential growth in data sizes, accelerating deep neural network training has emerged as a pivotal technique; the third chapter discusses it in the context of graph neural networks.
In the first chapter, we study variance estimation for subbagging ensemble learning models, such as random forests, through the lens of infinite-order U-statistics (IOUS). While normality results for IOUS have been studied extensively, their variance estimation and its theoretical properties remain mostly unexplored. Existing approaches mainly rely on the leading-term dominance property of the Hoeffding decomposition. However, this view usually leads to biased estimation when the kernel size is large relative to the sample size. On the other hand, while several unbiased estimators exist in the literature, their relationships and theoretical properties (e.g., ratio consistency) have never been studied. These limitations leave the asymptotic coverage of the resulting confidence intervals unguaranteed. To bridge these gaps, we propose a new view of the Hoeffding decomposition for variance estimation that yields an unbiased estimator. Instead of leading-term dominance, our view exploits the dominance of the peak region. Moreover, we establish the connection and equivalence of our estimator with several existing unbiased variance estimators. Theoretically, we are the first to establish the ratio consistency of such a variance estimator, which justifies the coverage rate of confidence intervals constructed from random forests. Numerically, we further propose a local smoothing procedure to improve the estimator's finite-sample performance. Extensive simulation studies show that our estimators enjoy lower bias and achieve targeted coverage rates.
In the second chapter, we turn to conformal prediction (CP) for quantifying the uncertainty of black-box models, which offers guaranteed marginal coverage without distributional assumptions. Localized conformal prediction (LCP) improves the conditional coverage of CP's prediction sets by prioritizing local samples based on similarity. However, existing LCP relies on the Gaussian kernel, whose weights are non-adaptive, resulting in inadequate conditional coverage of the prediction sets. In this chapter, we integrate random forest kernels into LCP. Additionally, we propose a novel approach, termed the distributionally sensitive random forest (DS-forest), to learn the kernel weights by incorporating a normalized Kolmogorov-Smirnov statistic into the split rule. Furthermore, we demonstrate the applicability of our method in a two-stage random forest for constructing prediction intervals in regression tasks, and we propose an inexact algorithm that eliminates the need for calibration sets in forest models. Numerical experiments show the superiority of LCP with DS-forest over existing LCP techniques and quantile forests: it guarantees marginal coverage while exhibiting notable improvements in conditional coverage.
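To make the localization idea concrete, here is a minimal sketch of how kernel weights enter a conformal prediction interval. It is an illustration under simplifying assumptions (absolute-residual scores, a generic similarity-weight vector such as random forest proximities, and an ad hoc choice of test-point weight and clipping rule); it is not the thesis's DS-forest algorithm, and the function name is hypothetical.

```python
import numpy as np

def localized_conformal_interval(resid_cal, weights, y_pred, alpha=0.1):
    """Sketch of a localized CP interval via a weighted residual quantile.

    resid_cal : np.ndarray of absolute residuals |y_i - f(x_i)| on a
                calibration set
    weights   : np.ndarray of similarities between each calibration point
                and the test point (e.g., forest proximities); larger
                weight = more local influence on the interval width
    """
    # Normalize weights, reserving some mass for the test point itself
    # (here set to the max calibration weight -- an illustrative choice).
    w = np.append(weights, max(weights.max(), 1e-12))
    w = w / w.sum()
    # Accumulate calibration weight along sorted residuals until the
    # cumulative mass reaches 1 - alpha.
    order = np.argsort(resid_cal)
    cum = np.cumsum(w[:-1][order])
    idx = np.searchsorted(cum, 1 - alpha)
    # Simplification: clip to the largest residual instead of returning
    # an infinite-width interval when the mass never reaches 1 - alpha.
    q = resid_cal[order][min(idx, len(resid_cal) - 1)]
    return y_pred - q, y_pred + q
```

In practice the weights would come from a fitted forest (e.g., leaf co-membership of the test point and each calibration point), which is exactly where a learned kernel such as DS-forest would change the localization.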
In the third chapter, we study approximating and accelerating node embedding aggregation in graph convolutional network (GCN) training. Among sampling techniques, the layer-wise approach recursively performs importance sampling to select neighbors jointly for the nodes retained in each layer. We revisit this approach from a matrix-approximation perspective and identify two issues in existing layer-wise sampling methods: suboptimal sampling probabilities and estimation bias induced by sampling without replacement. To address these issues, we propose two corresponding remedies: a new principle for constructing sampling probabilities and an efficient debiasing algorithm. The improvements are demonstrated by an extensive analysis of the estimation variance and by experiments on common benchmarks.
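As a rough illustration of the matrix-approximation view, the sketch below approximates one aggregation step A @ H by importance-sampling m rows of H (equivalently, columns of A). The norm-based probabilities and the 1/(m p_j) reweighting are standard randomized-linear-algebra ingredients assumed here for illustration; they are not the chapter's proposed sampling principle or debiasing algorithm, and sampling is done with replacement, so the without-replacement bias the chapter corrects does not arise.

```python
import numpy as np

def sampled_aggregation(A, H, m, seed=None):
    """Approximate the aggregation A @ H by sampling m neighbors.

    A : (n, k) aggregation matrix (e.g., normalized adjacency)
    H : (k, d) node embedding matrix from the previous layer
    m : number of sampled rows of H (columns of A)
    """
    rng = np.random.default_rng(seed)
    # Norm-based importance scores, in the spirit of optimal sampling
    # probabilities for randomized matrix multiplication.
    score = np.linalg.norm(A, axis=0) * np.linalg.norm(H, axis=1)
    p = score / score.sum()
    idx = rng.choice(len(p), size=m, replace=True, p=p)
    # Reweight each sampled term by 1 / (m * p_j) so the estimator is
    # unbiased for A @ H under sampling with replacement.
    return (A[:, idx] / (m * p[idx])) @ H[idx, :]
```

Sampling without replacement, as used in practical layer-wise samplers, breaks this simple unbiasedness and is precisely the source of the estimation bias the chapter's debiasing algorithm targets.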
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Tianning Xu
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)