Understanding the mechanism of pretraining stabilization heuristics: A variance-oriented perspective
Liu, Liyuan
Permalink
https://hdl.handle.net/2142/124173
Description
- Title
- Understanding the mechanism of pretraining stabilization heuristics: A variance-oriented perspective
- Author(s)
- Liu, Liyuan
- Issue Date
- 2024-04-08
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Doctoral Committee Chair(s)
- Han, Jiawei
- Committee Member(s)
- Ji, Heng
- Zhai, ChengXiang
- Gao, Jianfeng
- Peters, Matthew E
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Training Stability
- Variance
- Pretraining
- Abstract
- Language model pretraining has been breaking the glass ceiling for various natural language processing tasks and is viewed as one of the most significant successes of deep learning, continuously challenging our understanding of learning and cognition. Recent models, including GPT-4 and BART, fueled by an unprecedented scale of computing and data, exhibit intelligence that some even refer to as "sparks of artificial general intelligence". The success of large-scale pretraining hinges on intricate engineering heuristics. While the empirical benefits of these heuristics are evident, their underlying mechanisms remain elusive. This dissertation endeavors to demystify the mathematical principles underlying these pretraining heuristics, aiming to illuminate their mechanisms and potentially guide future algorithm development. Adopting a variance-oriented perspective, my research rigorously inspects the heuristics that are pivotal to the stability of current pretraining practices, emphasizing learning rate warmup, model initialization, and gradient approximation. In this dissertation, I show that these pretraining stabilization heuristics can be coherently elucidated within a unified framework anchored in variance, a classical metric of stability. First, I analyze the variance of the adaptive learning rate and of model outputs, revealing that both learning rate warmup and model initialization function as variance modulators. Then, I explore the bias-variance tradeoff in gradient approximation for discrete variables: employing a numerical ODE framework, I unveil the underlying dynamics of the approximation bias and achieve second-order precision with minimal computational overhead. Beyond the theoretical results, empirical studies are conducted to verify the assumptions and the applicability of the recognized principles. Building on these insights, this dissertation introduces novel techniques designed to advance pretraining practice, including RAdam for learning rate warmup (sketched after this record), Admin for Transformer initialization, and ReinMax and SparseMixer for gradient approximation. Under the guidance of the recognized principles, all proposed methods require minimal trial-and-error configuration, making them robust and high-performing tools for pretraining and adaptation.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Liyuan Liu
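To make the variance-modulation claim in the abstract concrete, below is a minimal sketch of the rectification term from RAdam (Liu et al., "On the Variance of the Adaptive Learning Rate and Beyond"). The formulas for rho_inf, rho_t, and r_t follow that paper; the beta2 default, the printed step values, and the fallback-to-None convention are illustrative choices, and the threshold rho_t > 4 matches the paper (some implementations use a slightly larger cutoff).

import math

def radam_rectification(t, beta2=0.999):
    """Return RAdam's rectification factor r_t at step t, or None when
    the variance of the adaptive learning rate is intractable
    (rho_t <= 4) and RAdam falls back to SGD with momentum."""
    # rho_inf: maximum length of the approximated simple moving average
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    # rho_t: its value at step t
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
    if rho_t <= 4.0:
        return None
    return math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

# r_t starts near zero and approaches 1 as t grows, shrinking early
# adaptive steps much like a hand-tuned learning-rate warmup schedule.
for t in (1, 10, 100, 1000, 10000):
    print(t, radam_rectification(t))

In RAdam the adaptive update is multiplied by r_t, so the variance of the adaptive learning rate is kept bounded in early training without an explicitly tuned warmup schedule.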
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)