CrystalCoder and CrystalChat: Illuminating LLM abilities on language and code
Tao, Tianhua
Permalink
https://hdl.handle.net/2142/124537
Description
Issue Date
2024-04-25
Director of Research (if dissertation) or Advisor (if thesis)
Peng, Hao
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Natural Language Processing
Language Model
Abstract
Large Language Models (LLMs) specializing in code generation, often referred to as code LLMs (e.g., StarCoder and Code Llama), play increasingly critical roles in various software development scenarios. For many applications, such as retrieving code snippets with natural language queries and generating usage instructions for code, it is also crucial that code LLMs possess both code generation and natural language abilities. The intricate interaction between acquiring language and coding skills complicates the development of strong code LLMs. Among open-source LLMs, we observe a prevalent issue: most models specialize in either language or code, but not both. For example, Llama is proficient in natural language tasks but performs poorly on code tasks, while Code Llama exhibits the opposite pattern. Furthermore, there is a lack of thorough prior studies on LLM pretraining strategies that mix code and natural language. In this work, we propose a pretraining strategy designed to enhance the integration of natural language and coding capabilities within a single LLM. Specifically, it consists of three pretraining phases with appropriately adjusted code/language ratios. The resulting model, CrystalCoder, achieves remarkable capability in both domains, attaining natural language and coding performance comparable to that of Llama 2 and Code Llama, respectively. CrystalCoder is also more data efficient, using 1.4 trillion tokens compared to the more than 2 trillion tokens used by Llama 2 and Code Llama. We further fine-tuned the pretrained model on a collection of open-source datasets to deliver our instruction-following model, CrystalChat. We validate our pretraining strategy by analyzing the training process and observing consistent improvements on most benchmarks. To foster research within the community, we commit to open-sourcing every detail of the pretraining, including our training datasets, code, and 136 checkpoints collected throughout training.
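The abstract does not spell out the phase boundaries or mixing ratios, so the following is only a minimal sketch of what a three-phase code/language data-mixing schedule of the kind described might look like. The phase names, per-phase token budgets, and code_fraction values are hypothetical placeholders (only the 1.4-trillion-token total comes from the abstract) and are not taken from the thesis.

```python
# Hypothetical sketch of a phased code/language data-mixing schedule.
# Phase names, token budgets, and code fractions are illustrative assumptions,
# not the configuration used for CrystalCoder.
import random
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    token_budget: int      # tokens to train on in this phase (assumed split of 1.4T total)
    code_fraction: float   # fraction of each batch drawn from code data (assumed)


# Three phases with progressively adjusted code/language ratios (values assumed).
SCHEDULE = [
    Phase("phase_1_language_heavy", token_budget=400_000_000_000, code_fraction=0.1),
    Phase("phase_2_balanced",       token_budget=600_000_000_000, code_fraction=0.5),
    Phase("phase_3_code_heavy",     token_budget=400_000_000_000, code_fraction=0.8),
]


def sample_source(phase: Phase) -> str:
    """Pick the data source ("code" or "language") for the next training sequence."""
    return "code" if random.random() < phase.code_fraction else "language"


def run_schedule(tokens_per_step: int = 4_000_000) -> None:
    """Walk through the schedule, reporting steps per phase and the realized code ratio."""
    for phase in SCHEDULE:
        steps = phase.token_budget // tokens_per_step
        code_samples = sum(sample_source(phase) == "code" for _ in range(1000))
        print(f"{phase.name}: {steps} steps, "
              f"~{code_samples / 10:.0f}% code batches (target {phase.code_fraction:.0%})")


if __name__ == "__main__":
    run_schedule()
```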