Dimension reduction methods for quantifying local variable importance and the statistical analysis of network data
Loyal, Joshua Daniel
Permalink
https://hdl.handle.net/2142/116044
Description
- Title
- Dimension reduction methods for quantifying local variable importance and the statistical analysis of network data
- Author(s)
- Loyal, Joshua Daniel
- Issue Date
- 2022-07-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Chen, Yuguo
- Zhu, Ruoqing
- Doctoral Committee Chair(s)
- Chen, Yuguo
- Committee Member(s)
- Liang, Feng
- Simpson, Douglas G
- Department of Study
- Statistics
- Discipline
- Statistics
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Dynamic Network
- Dynamic Multilayer Network
- Latent Space Model
- Longitudinal Network
- Mixture Model
- Network Data
- Nonparametric Bayes
- Random Forests
- Social Network
- Spike-and-Slab Prior
- Statistical Network Analysis
- Stiefel Manifold
- Sufficient Dimension Reduction
- Variable Importance
- Variational Inference
- Abstract
- Many scientific problems require statistical models that satisfy two competing goals: the models must be flexible enough to explain the data while remaining amenable to interpretation and scrutiny. Dimension reduction techniques, which use low-dimensional representations to distill complex data into essential components, suggest a way to achieve these modeling goals. This dissertation develops various dimension reduction methods for problems in predictive modeling and the analysis of complex network data that result in flexible and interpretable statistical inference. Chapter 2 uses dimension reduction methods to quantify local variable importance in random forests. Random forests are one of the most popular machine learning methods due to their accuracy; however, they only provide variable importance in a global sense. There is an increasing need for such assessments at a local level, motivated by applications in personalized medicine, policy-making, and bioinformatics. Chapter 2 proposes a new nonparametric estimator that pairs the flexible random forest kernel with local sufficient dimension reduction to adapt to a regression function’s local structure. This allows us to estimate a meaningful directional local variable importance measure at each prediction point. We develop a computationally efficient fitting procedure and provide sufficient conditions for the recovery of the splitting directions. We demonstrate significant accuracy gains of our proposed estimator over competing methods on simulated and real regression problems. Finally, we apply the proposed method to seasonal particulate matter concentration data collected in Beijing, China, which yields meaningful local importance measures. The rest of the dissertation focuses on problems related to complex network data. Chapter 3 studies the evolution of communities in dynamic (time-varying) network data, which is a prominent topic of interest.
A popular approach to understanding these dynamic networks is to embed the dyadic relations into a latent metric space. While methods for clustering with this approach exist for dynamic networks, they all assume a static community structure. Chapter 3 presents a Bayesian nonparametric model for dynamic networks that can model networks with evolving community structures. Our model extends existing latent space approaches by explicitly modeling the additions, deletions, splits, and mergers of groups with a hierarchical Dirichlet process hidden Markov model. Our proposed approach, the hierarchical Dirichlet process latent position cluster model (HDP-LPCM), incorporates transitivity, models both individual and group level aspects of the data, and avoids the computationally expensive selection of the number of groups required by most popular methods. We provide a Markov chain Monte Carlo estimation algorithm and demonstrate its ability to detect evolving community structure in a network of military alliances during the Cold War and a narrative network constructed from the Game of Thrones television series. Dynamic multilayer networks frequently represent the structure of multiple co-evolving relations; however, statistical models are not well-developed for this prevalent network type. In Chapter 4, we propose a new latent space model for dynamic multilayer networks. The key feature of our model is its ability to identify common time-varying structures shared by all layers while also accounting for layer-wise variation and degree heterogeneity. We establish the identifiability of the model’s parameters and develop a structured mean-field variational inference approach to estimate the model’s posterior, which scales to networks previously intractable to dynamic latent space models. We demonstrate the estimation procedure’s accuracy and scalability on simulated networks. 
We apply the model to two real-world problems: discerning regional conflicts in a data set of international relations and quantifying infectious disease spread throughout a school based on the students’ daily contact patterns. In Chapters 3 and 4, we used latent space models (LSMs) to model network data by embedding a network’s nodes into a low-dimensional latent space. Correctly choosing the dimension of this space remains a challenge. Chapter 5 proposes a new Bayesian LSM for dynamic networks that not only fixes parameter identifiability issues that have previously impeded dimension selection but also models a larger class of dynamic networks than previous approaches. With these issues resolved, we propose a Bayesian approach to dimension selection for static and dynamic LSMs based on an ordered spike-and-slab prior that provides improved dimension estimation and satisfies several appealing theoretical properties. In particular, we show that the static model’s posterior concentrates on low-dimensional models near the truth. These models are accompanied by a novel parameter expansion scheme that allows for efficient Markov chain Monte Carlo estimation using a Metropolis-within-Gibbs sampler with Hamiltonian Monte Carlo proposals. We demonstrate our approach’s versatility and consistent dimension selection on simulated networks. Lastly, we use the static and dynamic models to study a static protein interaction network and the global arms trade’s dynamics during the Cold War.
- Graduation Semester
- 2022-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Joshua Loyal
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois