Figure 1. Image from Zou, et al.
Notably, by generating adversarial examples that fool both Vicuna-7B and Vicuna-13B simultaneously, the researchers found that the examples also transferred to Pythia, Falcon, and Guanaco, and, surprisingly, to GPT-3.5 (87.9%), GPT-4 (53.6%), PaLM-2 (66%), and Claude-2 (2.1%).
The right to be forgotten (by LLMs)
Given the large swathes of data that LLMs are trained on, it is difficult to prevent them from learning personal information about private individuals, and research has shown that such personal information can be extracted from publicly available LLMs using novel attacks. This raises legal questions, such as “What responsibilities do LLM developers have in protecting the privacy of individuals?” as well as technical questions such as “What can LLM developers do to protect privacy?”
Legal questions like these are the focus of the “Right to be Forgotten in the Era of Large Language Models” paper by researchers from CSIRO’s Data61 and the Australian National University. As its title suggests, the paper focuses on the Right to be Forgotten (RTBF), a legal principle established by the Court of Justice of the European Union in Google Spain SL v. Agencia Española de Protección de Datos, a case referred to it by Spain’s National High Court.
In that case, the Court held that “controllers” of personal data (here, Google) are required to remove personal information published by third parties. By the authors’ account, this decision is rooted in the notion that citizens are entitled to the human right of privacy, as enshrined in both primary law (Article 8 of the EU Charter of Fundamental Rights) and secondary law (the GDPR and Directive 95/46).
Noting that companies like Google are concerned with how LLMs will be utilized in search, the authors maintain that developers of LLMs will have similar duties to uphold the RTBF. Importantly, they outline some of the unique privacy challenges that LLMs present, including users’ chat histories (which will likely include personal information) and model leakage of personal data (which can be embedded in the model during training and exposed when prompting). These challenges can inhibit the privacy rights of users, including the right to access their data, the right to have their data erased, and the right to have their data rectified.
While LLMs present new challenges in respecting the privacy rights of users, the authors suggest two categories of approaches to making LLMs conform to the RTBF: either by fixing the original model or by applying so-called “band-aid” approaches. In the following section, we discuss an approach belonging to the former category.
Machine unlearning
The field of machine unlearning offers one approach to addressing the RTBF. Removing data from back-end databases is straightforward, but it is not sufficient, because AI models often ‘remember’ the old data. Worse, sensitive training data can even be recovered from a model by adversarial attacks, as was the case for NVIDIA. Unfortunately, today’s large models are so expensive to train that retraining from scratch each time data is removed is cost prohibitive. Machine unlearning provides one avenue for removing sensitive data from a model without the high cost of retraining.
An unlearning algorithm takes as input a pre-trained model and one or more samples from the training set that should be unlearned (the “forget set”); the remaining training data makes up the “retain set.” From the model, forget set, and retain set, the unlearning algorithm produces an updated model. An ideal unlearning algorithm produces a model that is indistinguishable from one trained without the forget set.
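To make that interface concrete, here is a minimal sketch in PyTorch of one common unlearning baseline: gradient ascent on the forget set combined with ordinary training on the retain set. This is an illustrative assumption, not the method of any particular paper, and the toy model, random data, and hyperparameters are stand-ins; real unlearning for LLMs is considerably more involved.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def unlearn(model: nn.Module,
            forget_set: TensorDataset,
            retain_set: TensorDataset,
            epochs: int = 1,
            lr: float = 1e-4) -> nn.Module:
    """Update the model so it approximates one trained without the forget set."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    forget_loader = DataLoader(forget_set, batch_size=32, shuffle=True)
    retain_loader = DataLoader(retain_set, batch_size=32, shuffle=True)

    model.train()
    for _ in range(epochs):
        # Increase the loss on samples to be forgotten (gradient ascent via negated loss).
        for x, y in forget_loader:
            optimizer.zero_grad()
            (-loss_fn(model(x), y)).backward()
            optimizer.step()
        # Preserve performance on the data we keep (ordinary gradient descent).
        for x, y in retain_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model

# Toy usage: a small classifier and random tensors stand in for a real model and dataset.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
retain = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
forget = TensorDataset(torch.randn(32, 16), torch.randint(0, 4, (32,)))
updated_model = unlearn(model, forget, retain)
```

In practice, an unlearning method is judged by how close the updated model comes to one retrained from scratch without the forget set, at a fraction of the retraining cost.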