Defending against Prompt Injection Attacks in Large Language Models with StruQ and SecAlign

Recent advancements in Large Language Models (LLMs) pave the way for exciting applications. However, as the prowess of LLMs grows, so do the attacks against them. The prompt injection attack, listed as the #1 threat by OWASP to LLM-integrated applications, poses a significant concern. This malicious practice exploits LLMs by injecting untrusted data with instructions to manipulate their responses. In this blog, we discuss two effective defenses, StruQ and SecAlign, designed to mitigate these attacks with minimal computational cost or human labor.

Prompt Injection: Causes and Threat Model

The LLM input consists of a trusted prompt and untrusted data. The data may contain an injected instruction intended to override the trusted prompt. We identify two causes of prompt injection: (1) the absence of separation between the prompt and data, making it difficult to distinguish between the two, and (2) LLMs trained to follow instructions anywhere in their input. To minimize this threat, we propose solutions to address these underlying issues.

Defenses: StruQ and SecAlign

Secure Front-End

To separate the prompt from the data, we introduce the Secure Front-End, which uses special tokens to mark the separation delimiters. This establishes an explicit separation that can only be enforced by the system designer due to the data filtering process.

Structured Instruction Tuning (StruQ)

First, we introduce Structured Instruction Tuning (StruQ), a training procedure that simulates prompt injection in the data. The training dataset comprises clean samples and samples with injected instructions. Through supervised fine-tuning, the LLM learns to respond to the intended instruction highlighted by the Secure Front-End.

Special Preference Optimization (SecAlign)

Next, we propose Special Preference Optimization (SecAlign), which trains on simulated injected inputs. Unlike StruQ, SecAlign training samples are labeled with both desirable responses (to the intended instruction) and undesirable responses (to the injected instruction). By preference-optimizing the LLM to favor desirable responses over undesirable ones, SecAlign significantly increases robustness compared to StruQ.

Experiments and Results

We find that both StruQ and SecAlign reduce the success rates of optimization-free attack to 0%. For optimization-based attacks, StruQ offers substantial security, with SecAlign further reducing the success rate by a factor of >4 without incurring significant utility loss.

Take action: Stay updated on prompt injection attacks and defenses. Follow our RSS feed, watch the video explaining prompt injections by Andrej Karpathy, read recent blogs, and explore code examples for StruQ, SecAlign, Jatmo, Instruction Hierarchy, Instructional Segment Embedding, and Thinking Intervene.