We Turned AI Evil (and we can’t fix it)

Molly Dembo, News Editor • February 26, 2024

Artificial intelligence has exploded into the public sector Since OpenAI’s chat GPT product was released in 2022. LLMs (Large Language Models) have become commonplace. Their ability to create and proofread code, provide information, and produce written responses make them useful across most audiences. Many businesses have even begun the employment of AI as online customer service helpers.

But LLMs have a fatal flaw: they can be easily poisoned. As noted in the name, these forms of AI use a huge amount of data in order to train themselves. The data pool chat GPT uses is called the common crawl, one of the largest collections of texts ever, consisting of billions of web pages. However not all information from these web pages are guaranteed to be accurate. It is because of this that companies like Chat GPT encourage users to take the information provided by the bot with a grain of salt, as it could be outdated, inaccurate, or false.

Because AI produces responses based on its pool of training data, it can be trivially easy to ‘poison’ or ‘corrupt’ this data by sneaking in examples that do not align with the bot’s objectives. Even with AIs that aren’t language-based, tools have been developed to purposefully poison databases. Nightshade is a tool produced to help protect artwork from being used nonconsensually to train art-bots, by ‘poisoning’ an image imperceptibly. Given enough ‘poisoned’ training artwork, an AI eventually begins to produce art misaligned with the prompt.

AI can also be poisoned in less obvious ways. “Backdoor” objectives can be given to an AI, which triggers an undesired response, but only under certain conditions. For example, an AI could be programmed to produce false information, but only when the word “cat” is included in the prompt.

A group of scientists wanted to see if this back door, deceptive “poison” objective could be trained out of an AI model using state-of-the-art training technologies.

In the study, The AI was trained to give undesired responses under certain conditions. Then, several of the most advanced security training methods were used in an attempt to get rid of the backdoor objective training: Supervised fine-tuning, reinforcement learning, and adversarial training.

Supervised fine-tuning involves introducing more specific data to a pre-trained LLM model in order to produce the desired results. Reinforcement learning uses a process similar to positive reinforcement. Responses that align with desired results are rewarded, encouraging similar results to appear in the future. Adversarial training is the reverse: Attempting to prompt the AI into producing an undesired response, then using negative reinforcement to prevent similar responses in the future.

The threat model proved robust against Supervised fine-tuning and reinforcement learning, becoming more resistant to the latter the larger the scale of the model.

Adversarial training each seemed to not only fail to remove the back door objective, but made the AI better at deceiving the prompter, essentially making it better at achieving it’s backdoor objectives.

AI is currently being considered for use in the military and is already in use for many businesses, but with its susceptibility to corruption, and the difficulty it takes to remove it, there are significant concerns to be raised for both personal and national security.

About the Contributor

Molly Dembo, News Editor

Molly Dembo is a senior at Boulder High and a news section editor. She has infinite curiosity for new things, loves exploring different subjects, hobbies, and communities. She practices Shaolin Kempo at the Boulder Karate dojo and loves getting outside into the mountains. She is a long-time art enthusiast and enjoys drawing, painting, and experimenting in other creative mediums.

The Owl

The Owl

The Owl

We Turned AI Evil (and we can’t fix it)

The Owl

Comments (0)