Understanding the mechanism of pretraining stabilization heuristics: A variance-oriented perspective
Liu, Liyuan
Permalink
https://hdl.handle.net/2142/124173
Description
- Title
- Understanding the mechanism of pretraining stabilization heuristics: A variance-oriented perspective
- Author(s)
- Liu, Liyuan
- Issue Date
- 2024-04-08
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Ji, Heng
- Zhai, ChengXiang
- Gao, Jianfeng
- Peters, Matthew E
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Training Stability
- Variance
- Pretraining
- Abstract
- Language model pretraining has been breaking the glass ceiling for various natural language processing tasks and is viewed as one of the most significant successes of deep learning, continuously challenging our understanding of learning and cognition. Recent models, including GPT-4 and BART, fueled by an unprecedented scale of computing and data, exhibit intelligence that some even refer to as "sparks of artificial general intelligence". The success of large-scale pretraining hinges on intricate engineering heuristics. While the empirical benefits of these heuristics are evident, their underlying mechanisms remain elusive. This dissertation endeavors to demystify the mathematical principles underlying these pretraining heuristics, aiming to illuminate their mechanisms and potentially guide future algorithm development. Adopting a variance-oriented perspective, my research rigorously inspects the heuristics that are pivotal to the stability of current pretraining practices, emphasizing learning rate warmup, model initialization, and gradient approximation. In this dissertation, I show that these pretraining stabilization heuristics can be coherently elucidated within a unified framework anchored in variance, a classical metric of stability. First, I analyze the variance of the adaptive learning rate and of model outputs, revealing that both learning rate warmup and model initialization function as variance modulators. Then, I explore the bias-variance tradeoff in gradient approximation for discrete variables: employing a numerical ODE framework, I unveil the underlying dynamics of the approximation bias and achieve second-order precision with minimal computational overhead. Beyond the theoretical results, empirical studies are conducted to verify the assumptions and the applicability of the recognized principles. Building on these insights, this dissertation introduces novel techniques designed to advance pretraining practice, including RAdam for learning rate warmup (sketched after this record), Admin for Transformer initialization, and ReinMax and SparseMixer for gradient approximation. Under the guidance of the recognized principles, all proposed methods require minimal trial-and-error configuration, making them robust and high-performing tools for pretraining and adaptation.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Liyuan Liu
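To make the variance-modulation claim in the abstract concrete, below is a minimal sketch of the rectification term from RAdam (Liu et al., "On the Variance of the Adaptive Learning Rate and Beyond"). The formulas for rho_inf, rho_t, and r_t follow that paper; the beta2 default, the printed step values, and the fallback-to-None convention are illustrative choices, and the threshold rho_t > 4 matches the paper (some implementations use a slightly larger cutoff).

import math

def radam_rectification(t, beta2=0.999):
    """Return RAdam's rectification factor r_t at step t, or None when
    the variance of the adaptive learning rate is intractable
    (rho_t <= 4) and RAdam falls back to SGD with momentum."""
    # rho_inf: maximum length of the approximated simple moving average
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    # rho_t: its value at step t
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
    if rho_t <= 4.0:
        return None
    return math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

# r_t starts near zero and approaches 1 as t grows, shrinking early
# adaptive steps much like a hand-tuned learning-rate warmup schedule.
for t in (1, 10, 100, 1000, 10000):
    print(t, radam_rectification(t))

In RAdam the adaptive update is multiplied by r_t, so the variance of the adaptive learning rate is kept bounded in early training without an explicitly tuned warmup schedule.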
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)