Efficient and robust web scale language model based retrieval, generation, and understanding
Campos, Daniel Fernando
Permalink
https://hdl.handle.net/2142/121494
Description
- Title
- Efficient and robust web scale language model based retrieval, generation, and understanding
- Author(s)
- Campos, Daniel Fernando
- Issue Date
- 2023-07-13
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhai, Cheng Xiang
- Doctoral Committee Chair(s)
- Zhai, Cheng Xiang
- Committee Member(s)
- Magnani, Alessandro
- Han, Jiawei
- Chang, Kevin
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- efficient inference
- language model
- semantic retrieval
- web-scale inference
- Abstract
- Large language models effectively generate contextualized word representations across languages, domains, and tasks. Driven by these abilities, these models have become a building staple for many researchers and engineers who use text as their medium of representation, much like concrete is a staple in the construction world. Through broad study and implementation, problems with large models have come to light: they can be expensive, brittle to noise, and produce unwanted outputs. Their large size and computational overhead make them difficult and costly to deploy and use for inference. Minor variations in text inputs, such as typos or misspellings, can cause significant losses in model accuracy. Seeking to improve how these models can be used in real-world deployments, this thesis focuses on approaches for improving performance by compressing, hardening, and optimizing models' performance based on deployment needs. To explore the challenges that large-scale deployments face concerning robustness and inference efficiency, we study four commonly used language workloads: textual understanding, classification, passage retrieval, and text generation. We chose these broad but connected tasks to ensure that our compression approaches apply broadly across natural language processing. First, we propose a general framework for improving model inference on broad language understanding workloads by studying how unstructured pruning, structured pruning, and quantization can be leveraged to compress models and improve inference speeds. Second, we examine how models can be deployed for web-scale generation and understanding workloads. Leveraging multi-task modeling, asymmetrical pruning, knowledge distillation, and quantization allows for cost-efficient scaling to web-scale workloads. Third, we explore methods of tuning and optimizing dense retrieval methods post-training to ensure they perform well on real-world data. Our experiments yield simple and effective ways of increasing model robustness and decreasing inference costs without any need for retraining or index regeneration. Finally, we discuss future work, focusing on sequential compression approaches for sequence LLMs to allow generative workloads to reach web-scale deployments.
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Daniel Campos
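The abstract above names unstructured pruning, structured pruning, quantization, and knowledge distillation as the main compression levers for cheaper inference. As a rough illustration only, not code from the dissertation, the sketch below applies two of these, unstructured magnitude pruning and post-training dynamic quantization, to a small hypothetical PyTorch model; the architecture, layer sizes, and 50% sparsity level are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for a language-model encoder head (768-dim embeddings,
# 2-way classification); the real thesis models are full transformer LMs.
model = nn.Sequential(
    nn.Linear(768, 768),
    nn.GELU(),
    nn.Linear(768, 768),
    nn.GELU(),
    nn.Linear(768, 2),
)
model.eval()

# 1) Unstructured magnitude pruning: zero the 50% smallest-magnitude weights
#    in each Linear layer, then bake the sparsity into the weight tensors.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")

# 2) Post-training dynamic quantization: Linear weights are stored as int8 and
#    activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Run a dummy batch of 768-dim sentence embeddings through the compressed model.
with torch.no_grad():
    logits = quantized(torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Both steps here operate post-training, in line with the thesis's emphasis on improving robustness and inference cost without retraining or index regeneration; structured pruning and distillation would require additional training and are not shown.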
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)