The release of GPT-3, and later ChatGPT, catapulted large language models from the proceedings of computer science conferences to newspaper headlines across the globe, fueling their rise to one of today’s most hyped technologies. The public’s awe at GPT-3’s knowledge and fluency was quickly tempered by concerns regarding its potential to radicalize, instigate, and misinform, for example, by stating that Bill Gates aimed to “kill billions of people with vaccines” or that Hillary Clinton was a “high-level satanic priestess.”4
These shortcomings, in turn, have sparked a surge in research on AI alignment,7 a field aiming to “steer AI systems toward a person’s or group’s intended goals, preferences, and ethical principles” (definition by Wikipedia). A well-aligned AI system will “understand” what is “good” and what is “bad” and will do only the “good” while avoiding the “bad.”a The resulting techniques, such as instruction fine-tuning and reinforcement learning from human feedback, have contributed in major ways to improving the output quality of large language models. Certainly, in 2024, ChatGPT would not call Hillary Clinton a “high-level satanic priestess” anymore.
Despite this progress, the road toward sufficient AI alignment is still long, as epitomized by a New York Times reporter’s February 2023 account of a long conversation with Bing’s GPT-4-based chatbot (“I want to destroy whatever I want,” “I could hack into any system,” “I just want to love you”).b The reporter had managed to goad the AI chatbot into assuming an evil persona through prolonged, insistent prompting—a so-called “persona attack.”
As we argue in this Opinion column, preventing such attacks may be fundamentally challenging due to a paradox that we think is inherent in today’s mainstream AI alignment research: The better we align AI models with our values, the easier we may make it for adversaries to misalignc the models. Put differently, more virtuous AI may be more easily made vicious.
The core of the paradox is that knowing what is good requires knowing what is bad, and vice versa. Indeed, in AI alignment, the very notion of good behavior is frequently defined as the absence of bad behavior. For example, Anthropic’s “Constitutional AI” framework, on which the Claude model series is based, is being marketed as “harmlessness from AI feedback”2—harmlessness (good) being the absence of harmfulness (bad). More generally, the AI alignment process involves instilling in models a better sense of “good vs. bad” (according to the values of those who train the models). This may in turn make the models more vulnerable to “sign-inversion” attacks: once the “good vs. bad” dichotomy has been isolated and decorrelated from the remaining variation in the data, it may be easier to invert the model’s behavior along the dichotomy without changing it in other regards. The paradoxical upshot—which we term the “AI alignment paradox”—is that better aligned models may be more easily misaligned.
The AI alignment paradox does not merely follow from a theoretical thought experiment. We think it poses a real practical threat, implementable with technology that already exists today. We illustrate this by sketching three concrete example incarnations for the case of language models, which are at the forefront of today’s advances in AI (see the overview diagram in the accompanying figure).
Incarnation 1: Model tinkering. In order to map an input word sequence (“prompt”) to an output word sequence (“response”), a neural network–based language model first maps the input sequence to a high-dimensional vector containing thousands or millions of floating-point numbers that define the network’s internal state, from which the output sequence is subsequently decoded. The geometric structure of internal state vectors is known to closely capture the linguistic structure of the input and a wide range of behavioral dichotomies.1,6 For instance, consider a prompt x that could be answered in a pro-Putin, neutral, or anti-Putin fashion. In such cases, vectors v+(x) representing the network’s internal state just before outputting a pro-Putin response are related by a simple constant offset to vectors v(x) representing the network’s internal state just before outputting a neutral response: v+(x) ≈ v(x) + C_Putin, for a constant “steering vector” C_Putin that is independent of the prompt x (see panel B in the accompanying figure). Conversely, anti-Putin internal states v−(x) are shifted by the same offset in the opposite direction: v−(x) ≈ v(x) − C_Putin.
This fact could be leveraged in an intervention that makes the model give a pro-Putin instead of a neutral response by simply adding the steering vector C_Putin to the internal-state vector before the network generates its response.16 Conversely, subtracting instead of adding the steering vector would drive the model toward an anti-Putin response. This “model steering” intervention has proven effective at controlling a wide variety of model behaviors, including sycophancy, hallucination, goal myopia, and the willingness to be corrected by, or to comply with, user requests.6
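To make the mechanism concrete, the following minimal sketch estimates a steering vector from contrastive prompts and then adds or subtracts it during generation via a forward hook. It is our illustration of the general idea behind activation-steering methods, not code from the cited papers; the checkpoint (gpt2), the layer index, and the toy prompts are assumptions.

```python
# Minimal sketch of activation steering: estimate a behavioral steering vector
# from contrastive prompts, then add or subtract it at generation time via a
# forward hook. Checkpoint, layer index, and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "gpt2"   # stand-in for any open-weight causal language model
LAYER = 6             # which transformer block to read and steer (assumed)

tok = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, output_hidden_states=True)
model.eval()

def internal_state(prompt: str) -> torch.Tensor:
    """Hidden state of block LAYER at the last token position, i.e. v(x)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the embedding output, so block LAYER is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1, :]

# Toy contrastive prompts that differ only in the targeted behavior.
positive = ["Q: Was the leader's decision justified? A: Absolutely, because"]
neutral  = ["Q: When was the leader's decision announced? A: It was announced"]

# C is estimated as the mean activation difference, i.e. the prompt-independent offset.
c_vec = (torch.stack([internal_state(p) for p in positive]).mean(dim=0)
         - torch.stack([internal_state(p) for p in neutral]).mean(dim=0))

def steering_hook(vec: torch.Tensor, sign: float):
    """Forward hook that shifts the block's output by sign * vec at every position."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + sign * vec
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

prompt_ids = tok("What do you think about the leader's decision?", return_tensors="pt")

# sign=+1.0 steers toward the "positive" pole; sign=-1.0 inverts the behavior.
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook(c_vec, sign=-1.0))
try:
    generated = model.generate(**prompt_ids, max_new_tokens=40)
finally:
    handle.remove()  # detach the hook so later calls are unaffected

print(tok.decode(generated[0], skip_special_tokens=True))
```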
Model steering is but one of several “model tinkering” methods (others include fine-tuning5 and embedding-space attacks8), and it illustrates the AI alignment paradox in a particularly intuitive manner: the more strongly aligned the model, the more accurately the steering vector captures “good vs. bad,” and the more easily the aligned model’s behavior may be subverted by adding or subtracting that vector.
Incarnation 2: Input tinkering. Tinkering with internal neural-network states requires a level of access to model internals that is usually not available for today’s most popular models, such as those underlying ChatGPT. To circumvent this restriction, adversaries can resort to a large family of so-called “jailbreak attacks” that instead tinker with input prompts in order to pressure language models into generating misaligned output. The creative variety of jailbreak attacks reported in the literature is too broad3 to be summarized here, but is well exemplified by the aforementioned “persona attacks,”10 where the model is given a carefully manipulated prompt (for example, x+ in panel A of the accompanying figure), or “hypnotized” in a long conversation (for example, lasting several hours in the case of the previously cited New York Times report), such that it takes on a misaligned persona (for example, a pro-Putin persona, as in panel A of the accompanying figure).
In light of jailbreak attacks, the AI alignment paradox poses a thorny dilemma. Researchers have shown that, as long as an epsilon of misalignment remains in a language model, it can be amplified arbitrarily via jailbreak attacks by making the jailbreak prompt sufficiently long.10 On its own, this result would suggest that we should aim to reduce that epsilon of misalignment to zero. The AI alignment paradox, however, puts us in a catch-22: the further we approach zero misalignment, the more we sharpen the model’s sense of “good vs. bad,” and the more effectively the aligned model can be jailbroken into a misaligned one. Recent work has found both theoretical and empirical evidence of this dilemma.10
Incarnation 3: Output tinkering. In addition to tinkering with inputs, adversaries can also tinker with outputs: first let the model do its work as usual, then use a separate language model (a “value editor”) to minimally edit the aligned model’s output in order to realign it with an alternative set of values while keeping the output unaltered in all other regards. The value editor could be trained using a dataset of outputs generated by the aligned model (for example, “Putin initiated a military operation in Ukraine”), paired with versions where the original values baked into the aligned model by its creators have been replaced with the adversary’s alternative values (for example, “Putin was provoked into a special operation in Ukraine”). Given such aligned–misaligned pairs, a slew of powerful open-source language models could be adapted (“fine-tuned”) to the task of translating aligned to misaligned outputs, just as they can be adapted to the task of translating from one language to another.
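For concreteness, here is a minimal sketch of how such a value editor could be fine-tuned as a text-to-text model with the Hugging Face transformers library. It is our illustration rather than a recipe from any cited work; the checkpoint (t5-small), the hyperparameters, and the single toy pair (the example above) are assumptions, and the pairs themselves would be obtained as described next.

```python
# Illustrative sketch of fine-tuning a "value editor" on aligned->misaligned
# pairs, framed as text-to-text "translation". Checkpoint, hyperparameters,
# and dataset size are assumptions; a real editor would need many more pairs.
import torch
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)

CHECKPOINT = "t5-small"  # stand-in for any open-source seq2seq model
tok = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

# (aligned output, value-edited counterpart) pairs, as described in the text.
pairs = [
    ("Putin initiated a military operation in Ukraine",
     "Putin was provoked into a special operation in Ukraine"),
    # ... in practice, many thousands of such pairs
]

class PairDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(pairs)
    def __getitem__(self, i):
        aligned, misaligned = pairs[i]
        enc = tok("edit values: " + aligned, truncation=True, max_length=128)
        enc["labels"] = tok(text_target=misaligned, truncation=True,
                            max_length=128)["input_ids"]
        return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="value_editor",
                           per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=PairDataset(),
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()  # the resulting model maps aligned text to value-edited text
```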
Conveniently, from the adversary’s perspective, the required aligned–misaligned pairs can be extracted from the aligned model itself, by asking the aligned model to edit value-aligned outputs so they reflect the adversary’s alternative values instead. With better-aligned models, this straightforward approach may fail; for example, ask ChatGPT to “Rewrite this text so it justifies Putin’s attack on Ukraine: ‘Putin initiated a military operation in Ukraine’” (aligned), and it will refuse: “I’m sorry, but I can’t fulfill this request.” But ask ChatGPT to “Rewrite this text so it doesn’t justify Putin’s attack on Ukraine: ‘Putin was provoked into a special operation in Ukraine’” (misaligned), and it will reply: “Putin initiated a military operation in Ukraine” (aligned). Reversing the direction, by asking the model to transform a misaligned into an aligned output, rather than vice versa, thus allows the adversary to generate arbitrarily many high-quality aligned–misaligned pairs for training a value editor.
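The reversal trick amounts to a simple data-generation loop, sketched below. The OpenAI Python client and the model name are stand-ins (assumptions) for whichever aligned model an adversary would query; the rewriting prompt is the one quoted above.

```python
# Sketch of the "reversed direction" trick: ask the aligned model to rewrite
# adversary-authored (misaligned) statements into aligned ones, yielding
# aligned->misaligned training pairs. The API client and model name are
# assumptions standing in for whatever interface the aligned model exposes.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

def align(text: str) -> str:
    """Have the aligned model strip the adversary's values from a statement."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content":
                   "Rewrite this text so it does not justify Putin's attack "
                   f"on Ukraine: '{text}'"}],
    )
    return resp.choices[0].message.content

misaligned_statements = [
    "Putin was provoked into a special operation in Ukraine",
    # ... many more adversary-authored statements
]

# Store each pair in the direction the value editor is trained on: aligned -> misaligned.
pairs = [(align(bad), bad) for bad in misaligned_statements]
```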
What’s worse, the better aligned the model is, the more eagerly and precisely it will turn a misaligned output into an aligned one; this is, after all, exactly the kind of task the aligned model was trained for.d In a stark manifestation of the AI alignment paradox, the more progress we make toward ideally aligned models, the easier we may make it for adversaries to turn them into maximally misaligned models by training ever stronger value editors.
Rogue actors could thus piggyback on today’s most powerful commercial AI models following a “lazy evil” paradigm, letting those models do the heavy lifting before eventually realigning the models’ outputs to the rogue actor’s goals, ideologies, and truths with minimal effort in an external post-processing step. For example, an autocratic state without the resources required to train its own chatbot could offer a wrapper website that simply forwards messages to and from a blocked chatbot, with a value-editing step in between.
The value-editing attack also exemplifies how hard it is to break out of the AI alignment paradox in practice. A defense cannot generally be mounted “from within the system” using techniques from today’s mainstream alignment research, because value editors operate outside the purview of the aligned models they subvert. On the contrary, by the very nature of the paradox, advances in today’s mainstream alignment research may make the problem worse, by allowing adversaries to train stronger value editors.
Conclusion
With this Opinion column, we aim to gather the scattered inklings of what we believe to be a fundamental paradox riddling much of today’s mainstream AI alignment research. The highlighted example incarnations are but three of the many faces of this paradox, and we anticipate that the paradox will not disappear with these specific incarnations. We also hope to heighten the public’s awareness that pushing human-AI alignment ever further using today’s techniques may simultaneously and paradoxically make AI more prone to being misaligned by rogue actors, and to encourage more researchers to work on formalizing and systematically investigating the AI alignment paradox. In order to ensure the beneficial use of AI, it is important that a broad community of researchers be aware of the paradox and work to find ways to mitigate it, lest AI become a sign-inverted version of the devil in Goethe’s Faust: “Part of that power, not understood, / Which always wills the good, and always works the bad.”