Efficient and robust web scale language model based retrieval, generation, and understanding
Campos, Daniel Fernando
Permalink
https://hdl.handle.net/2142/121494
Description
- Title
- Efficient and robust web scale language model based retrieval, generation, and understanding
- Author(s)
- Campos, Daniel Fernando
- Issue Date
- 2023-07-13
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhai, Cheng Xiang
- Doctoral Committee Chair(s)
- Zhai, Cheng Xiang
- Committee Member(s)
- Magnani, Alessandro
- Han, Jiawei
- Chang, Kevin
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- efficient inference
- language model
- semantic retrieval
- web-scale inference
- Abstract
- Large language models effectively generate contextualized word representations across languages, domains, and tasks. Driven by these abilities, these models have become a building staple for many researchers and engineers who use text as their medium of representation, much like concrete is a staple in the construction world. Through broad study and implementation, problems with large models have come to light: they can be expensive, brittle to noise, and produce unwanted outputs. Their large size and computational overhead make them difficult and costly to deploy and use for inference. Minor variations in text inputs, such as typos or misspellings, can cause significant losses in model accuracy. Seeking to improve how these models can be used in real-world deployments, this thesis focuses on approaches for improving performance by compressing, hardening, and optimizing models' performance based on deployment needs. To explore the challenges that large-scale deployments face concerning robustness and inference efficiency, we study four commonly used language workloads: textual understanding, classification, passage retrieval, and text generation. We chose these broad but connected tasks to ensure that our compression approaches apply broadly across natural language processing. First, we propose a general framework for improving model inference on broad language understanding workloads by studying how unstructured pruning, structured pruning, and quantization can be leveraged to compress models and improve inference speeds. Second, we examine how models can be deployed for web-scale generation and understanding workloads. Leveraging multi-task modeling, asymmetrical pruning, knowledge distillation, and quantization allows for cost-efficient scaling to web-scale workloads. Third, we explore methods of tuning and optimizing dense retrieval methods post-training to ensure they perform well on real-world data. Our experiments yield simple and effective ways of increasing model robustness and decreasing inference costs without any need for retraining or index regeneration. Finally, we discuss future work, focusing on sequential compression approaches for sequence LLMs to allow generative workloads to reach web-scale deployments.
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Daniel Campos
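The abstract above names unstructured pruning, structured pruning, quantization, and knowledge distillation as the main compression levers for cheaper inference. As a rough illustration only, not code from the dissertation, the sketch below applies two of these, unstructured magnitude pruning and post-training dynamic quantization, to a small hypothetical PyTorch model; the architecture, layer sizes, and 50% sparsity level are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for a language-model encoder head (768-dim embeddings,
# 2-way classification); the real thesis models are full transformer LMs.
model = nn.Sequential(
    nn.Linear(768, 768),
    nn.GELU(),
    nn.Linear(768, 768),
    nn.GELU(),
    nn.Linear(768, 2),
)
model.eval()

# 1) Unstructured magnitude pruning: zero the 50% smallest-magnitude weights
#    in each Linear layer, then bake the sparsity into the weight tensors.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")

# 2) Post-training dynamic quantization: Linear weights are stored as int8 and
#    activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Run a dummy batch of 768-dim sentence embeddings through the compressed model.
with torch.no_grad():
    logits = quantized(torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Both steps here operate post-training, in line with the thesis's emphasis on improving robustness and inference cost without retraining or index regeneration; structured pruning and distillation would require additional training and are not shown.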
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)