Beyond rules: leveraging Large Language Models for code-data separation in binary disassembly

Diwan, Nirav

This item is only available for download by members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with their NetID and password to view the item. To access an Illinois-restricted dissertation or thesis from outside the university, you can request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.

Permalink

https://hdl.handle.net/2142/124607

Description

Title

Beyond rules: leveraging Large Language Models for code-data separation in binary disassembly

Author(s)

Diwan, Nirav

Issue Date

2024-05-02

Director of Research (if dissertation) or Advisor (if thesis)

Wang, Gang

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Keyword(s)

Large Language Models
Binary Analysis
Code-Data Separation
Unsupervised Domain Adaptation
Security

Abstract

Static binary analysis is a critical technique for identifying security vulnerabilities in binaries without access to source code. Its first step is disassembly, which involves deconstructing the binary file to distinguish code from data, a task known as the code-data separation problem. Current methods for code-data separation assume that the binary follows a fixed or standard file format. However, with the proliferation of Internet-of-Things (IoT) devices, non-standard file formats that do not conform to a fixed layout are becoming prevalent. This presents a hurdle for binary analysis tasks (e.g., detecting security flaws, classifying malware, checking license obligations) on such files. In this work, we examine the code and data distributions of standard and non-standard file formats. Our analysis indicates a distribution shift between the two, motivating the need to tackle code-data separation for non-standard file formats. We frame this as an unsupervised domain adaptation problem and propose a pseudo-labeling approach based on Large Language Models. Our best model achieves high performance on both standard binary files (F1-Score = 0.99) and non-standard binary files (F1-Score = 0.95). Finally, we discuss our findings and the limitations of our approach.

Graduation Semester

2024-05

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/124607

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
