Beyond rules: leveraging Large Language Models for code-data separation in binary disassembly
Diwan, Nirav
This item is only available for download by members of the University of Illinois community.
Permalink
https://hdl.handle.net/2142/124607
Description
Title
Beyond rules: leveraging Large Language Models for code-data separation in binary disassembly
Author(s)
Diwan, Nirav
Issue Date
2024-05-02
Director of Research (if dissertation) or Advisor (if thesis)
Wang, Gang
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Large Language Models
Binary Analysis
Code-Data Separation
Unsupervised Domain Adaptation
Security
Abstract
Static binary analysis is a critical technique for identifying security vulnerabilities in binaries without access to source code. Its first step is disassembly, which deconstructs the binary file to separate code (instructions) from data, a task known as the code-data separation problem. Current methods for code-data separation assume that the binary file follows a fixed, standard file format. However, with the proliferation of Internet-of-Things (IoT) devices, non-standard file formats that do not conform to any fixed layout are becoming prevalent. This hinders binary analysis tasks (e.g., detecting security flaws, classifying malware, identifying license obligations) on such files. In this work, we examine the code and data distributions for standard and non-standard file formats. Our analysis indicates a distribution shift between the two, motivating the need to tackle code-data separation for non-standard file formats. We frame this as an unsupervised domain adaptation problem and propose a pseudo-labeling approach based on Large Language Models. Our best model achieves high performance on both standard binary files (F1-Score = 0.99) and non-standard binary files (F1-Score = 0.95). Finally, we discuss our findings and the limitations of our approach.