Analysis based on incomplete data
Qu, Tianyi
This item's files can only be accessed by the Administrator group.
Permalink
https://hdl.handle.net/2142/117562
Description
- Title
- Analysis based on incomplete data
- Author(s)
- Qu, Tianyi
- Issue Date
- 2022-11-29
- Director of Research (if dissertation) or Advisor (if thesis)
- Li, Bo
- Li, Xinran
- Doctoral Committee Chair(s)
- Li, Bo
- Li, Xinran
- Committee Member(s)
- Shao, Xiaofeng
- Wang, Shulei
- Department of Study
- Statistics
- Discipline
- Statistics
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Likelihood
- Matrix factorization
- Missing value
- Spatiotemporal data
- Randomization-based inference
- Potential outcomes
- M-estimation
- generalized linear models
- treatment effect heterogeneity
- model-assisted approach
- bandit problems
- adaptive allocation
- Metropolis-Hastings optimization
- Abstract
- In many situations, we have to rely on partial information for data analysis, for various reasons. For example, part of the data may be unavailable, as with missing values and in causal inference, or conclusions may have to be drawn before all the data are collected, as in bandit problems. In this thesis, we address three challenges that involve analysis with partial information. In the first chapter, we focus on prediction with unbalanced missing values. Public health data, such as HIV new diagnoses, are often left-censored due to confidentiality issues. Standard analysis approaches that treat censored values as missing at random often lead to biased estimates and inferior predictions. Motivated by the Philadelphia areal counts of HIV new diagnoses, for which all values less than or equal to 5 are suppressed, we propose two methods to reduce the adverse influence of missingness on prediction and imputation of areal HIV new diagnoses. One is a likelihood-based method that integrates the missingness mechanism into the likelihood function, and the other is a nonparametric algorithm for matrix factorization imputation. Numerical studies and the Philadelphia data analysis demonstrate that the two proposed methods can significantly improve prediction and imputation based on left-censored HIV data. We also compare the two methods on their robustness to model misspecification and find that both methods appear to be robust for prediction, while their performance for imputation depends on model specification. In the second chapter, we focus on causal inference, where for each individual only one of the treatment and control outcomes can be observed. Randomized experiments have been the gold standard for drawing causal inferences. Conventional model-based analysis has been one of the most popular ways of analyzing treatment effects from randomized experiments, and is often carried out through inference for certain model parameters.
We provide a systematic investigation of model-based analysis, including the theory of M-estimation, under the randomization-based inference framework, avoiding any distributional assumptions on outcomes or covariates and utilizing only randomization as the "reasoned basis". We first show that the conventional model-based approach generally provides biased treatment effect estimation. We then study the model-imputed approach, which uses the models mainly as a tool for imputing potential outcomes. Such an approach, although generally leading to biased estimation as well, can be valid for some special classes of models, e.g., generalized linear models with canonical links. We finally recommend the model-assisted approach, which always provides consistent estimation and is robust to arbitrary model misspecification, and construct large-sample confidence intervals for the average treatment effects. In addition, we study the robust utilization of models for understanding treatment effect heterogeneity across individuals. In the last chapter, we focus on unit allocation in a bandit experiment with a fixed number K of stages. The exploration-exploitation dilemma, the balance between exploring the environment to find the most profitable action arm and exploiting the best action arm based on the current understanding of the environment, is a problem in reinforcement learning. The multi-armed bandit problem is a simple but essential model for studying this balance. The typical bandit allocates units to arms and collects outcomes one at a time, which is powerful but can be time-consuming. In this chapter, we consider the setting where the experimentation for all units has to be completed in a fixed number of stages, with multiple units allocated at the same time at each stage. We study the optimal way to allocate these units across the different stages. We propose a Bayesian approach as well as a Markov chain Monte Carlo method to find the "optimal" allocation.
- Graduation Semester
- 2022-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Tianyi Qu
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois