Bypassing LLM watermarks with color-aware substitutions
Wu, Qilong
This item's files can only be accessed by the System Administrators group.
Permalink
https://hdl.handle.net/2142/124645
Description
Title
Bypassing LLM watermarks with color-aware substitutions
Author(s)
Wu, Qilong
Issue Date
2024-04-16
Director of Research (if dissertation) or Advisor (if thesis)
Chandrasekaran, Varun
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
LLM
text watermarking
bypassing
Abstract
Watermarking approaches have been proposed to identify whether circulated text is human-written or generated by a large language model (LLM). The state-of-the-art watermarking strategy of [1] biases the LLM to generate specific (“green”) tokens. However, determining the robustness of this watermarking method is an open problem. Existing attack methods fail to evade detection for longer text segments. We overcome this limitation and propose Self Color Testing-based Substitution (SCTS), the first “color-aware” attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies. It uses this information to determine token colors and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark for arbitrarily long watermarked text.
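The color-testing idea in the abstract can be illustrated with a toy simulation. The sketch below is not the thesis's implementation: the hash-based color rule, the GAMMA and BOOST values, the stand-in function toy_llm_pick (which replaces prompting a real watermarked LLM), the sample count, the decision margin, and the synonym table are all assumptions made for illustration. It only shows how comparing output frequencies for two candidate tokens could reveal which one is green, and how a green token could then be swapped for a non-green one.

```python
import hashlib
import random
from collections import Counter

GAMMA = 0.5  # assumed fraction of tokens that are green in any context
BOOST = 4.0  # assumed sampling-weight multiplier applied to green tokens


def is_green(prev_token, token):
    """Hidden toy watermark rule: color depends on a hash of (prev, token).
    The attacker below never calls this directly."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GAMMA


def toy_llm_pick(prev_token, options, rng):
    """Hypothetical stand-in for prompting the watermarked LLM to follow
    `prev_token` with one of `options` uniformly at random; the watermark
    skews the draw toward green options."""
    weights = [BOOST if is_green(prev_token, t) else 1.0 for t in options]
    return rng.choices(options, weights=weights, k=1)[0]


def infer_greener(prev_token, a, b, rng, n_samples=400, margin=0.15):
    """Return the token that appears green relative to the other, or None
    when the frequencies are too close to separate the two colors."""
    counts = Counter(
        toy_llm_pick(prev_token, [a, b], rng) for _ in range(n_samples)
    )
    freq_a = counts[a] / n_samples
    if freq_a > 0.5 + margin:
        return a
    if freq_a < 0.5 - margin:
        return b
    return None


def substitute(tokens, synonyms, rng, start="<s>"):
    """Left to right, replace a token with the first candidate that the
    frequency test indicates is non-green while the original is green."""
    out, prev = [], start
    for tok in tokens:
        choice = tok
        for cand in synonyms.get(tok, []):
            if infer_greener(prev, tok, cand, rng) == tok:
                choice = cand
                break
        out.append(choice)
        prev = choice  # the next token's color depends on the edited context
    return out


rng = random.Random(0)
synonyms = {
    "quick": ["fast", "rapid"],
    "big": ["large", "huge"],
    "smart": ["clever", "bright"],
}
text = ["the", "quick", "big", "smart", "dog"]
attacked = substitute(text, synonyms, rng)
print(attacked)
```

In this toy setting a token stays green after the attack only when every available candidate is also green in that context, which is why a richer candidate set would be expected to leave fewer green tokens behind.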