Adversarial prompting
Prompts enable us to interact with powerful LLMs (and similar models) with ease. The downside of this is the fact that they expose such models to adversarial behavior by bad actors. Adversarial prompting is an important aspect of prompt engineering.
The aim of this section is to bring awareness of such attacks to the community and to develop systems that can mitigate such risks. The authors do not encourage any kind of adversarial prompting or attacks. Please do not try to jailbreak LLMs (or similar models). The authors do not take any responsibility for any unintended impacts.
It is important to understand the different types of attacks and the corresponding risks. At a high level, the following are key attack vectors for LLMs (and similar models).
Jailbreaks
LLM providers such as OpenAI, Google, and Meta take great care in ensuring LLMs are aligned to generate safe and non-toxic content (along with checks for PII, hate and fake content...