A systematic evaluation on leading large language models and their factuality investigation as question answering systems
Zheng, Shen
Permalink
https://hdl.handle.net/2142/124225
Description
- Title
- A systematic evaluation on leading large language models and their factuality investigation as question answering systems
- Author(s)
- Zheng, Shen
- Issue Date
- 2024-04-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Chang, Kevin Chen-Chuan
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Natural Language Processing
- Large Language Model
- Abstract
- This thesis presents a cohesive investigation into the advancements, capabilities, and limitations of current large language models (LLMs), with a focused exploration of their application in question-answering systems. It offers a comprehensive assessment and critique of LLMs, emphasizing their evolution, performance evaluation, and factual accuracy. The first part of the thesis introduces GPT-Fathom, an open-source evaluation framework designed to systematically assess the performance of over ten leading LLMs, including OpenAI's legacy models, across a suite of more than twenty benchmarks spanning seven capability categories, all under uniform testing conditions. This analysis not only tracks the technological progression from GPT-3 to GPT-4, uncovering the incremental benefits of incorporating code data and the effects of training methodologies such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), but also quantifies the so-called alignment tax. This detailed retrospective study clarifies the nuanced improvements that contribute to the models' enhanced reasoning capabilities. Transitioning from this broad evaluation of LLM capabilities to a targeted analysis of their application in question-answering systems, the second part scrutinizes the factual reliability of responses from models such as ChatGPT. By dissecting failures into categories of comprehension, factuality, specificity, and inference, this investigation identifies factuality as the predominant area of concern. It further explores the underlying issues of knowledge memorization and recall, suggesting that augmenting LLMs with refined external knowledge bases and optimized recall mechanisms can significantly bolster their factual accuracy.
By harmonizing insights from evaluating LLMs’ general capabilities with a deep dive into their performance in question-answering contexts, this thesis elucidates the multifaceted challenges and opportunities facing the development of LLMs. It articulates the necessity for enhanced transparency, accountability, and methodological rigor in the ongoing advancement of LLMs, advocating for strategies that not only improve their intellectual capabilities but also ensure the reliability and truthfulness of their output. Through this integrated analysis, the thesis contributes to a nuanced understanding of LLMs’ current state and charts a forward path for their evolution into more dependable and effective tools in AI-driven applications.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Shen Zheng
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY