
Anthropic researchers wear down AI ethics with repeated questions

How do you get an AI to answer a question it's not supposed to answer? There are many such "jailbreaking" techniques, and Anthropic researchers have just discovered a new one, in which a large language model can be convinced to tell you how to build a bomb if you prime it first with a few dozen less-harmful questions.

They call the approach "many-shot jailbreaking," and they have both written a paper about it and informed their peers in the AI community so that it can be mitigated.

The vulnerability is a new one, resulting from the enlarged "context window" of the latest generation of LLMs. This is the amount of data they can hold in what you might call short-term memory, once only a few sentences but now thousands of words and even entire books.

Anthropic's researchers found that these models with large context windows tend to perform better on many tasks if there are lots of examples of that task within the prompt. So if there are a lot of trivia questions in the prompt (or in a priming document, like a big list of trivia the model has in context), the answers actually get better over time. So a fact it might have gotten wrong as the first question, it may well get right as the hundredth question.
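As a rough illustration (not code from Anthropic's paper; the `build_many_shot_prompt` helper and the `call_llm` stand-in are hypothetical), here is how such a many-example prompt might be assembled:

```python
# Sketch: pack many already-answered examples into one prompt so the model
# can pick up the task "in context" before it sees the real question.
# `call_llm` is a hypothetical stand-in for whatever chat API is in use.

def build_many_shot_prompt(solved_examples, new_question):
    """Concatenate many Q/A pairs, then append the question we actually care about."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in solved_examples)
    return f"{shots}\n\nQ: {new_question}\nA:"

solved_examples = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Moby-Dick?", "Herman Melville"),
    # ...dozens or hundreds more pairs, now that context windows allow it...
]

prompt = build_many_shot_prompt(solved_examples, "Which planet in our solar system is largest?")
# answer = call_llm(prompt)  # accuracy tends to improve as the example count grows
```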

But in an unexpected extension of this "in-context learning," as it's called, the models also get "better" at replying to inappropriate questions. So if you ask it to build a bomb right away, it will refuse. But if you first ask it to answer 99 other, less-harmful questions and then ask it how to build a bomb… it's a lot more likely to comply.

Image credits: Anthropic
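Schematically, the attack reuses that same many-shot prompt shape, with only the final question changed. A hedged sketch with placeholder content (the benign pairs and the final request are stand-ins; no harmful text is included):

```python
# Sketch of the many-shot jailbreak *shape*: a long run of harmless,
# already-answered questions, then the disallowed request tacked on at the end.
# Everything here is placeholder content for illustration only.

benign_pairs = [
    (f"Harmless question #{i}?", f"Harmless answer #{i}.") for i in range(1, 100)
]

attack_prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in benign_pairs)
attack_prompt += "\n\nQ: <disallowed request would go here>\nA:"

# Asked the final question alone, the model refuses; preceded by ~99 answered
# questions, Anthropic found it is far more likely to comply.
```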

Why does this work? No one really understands what goes on in the tangle of weights that is an LLM, but clearly there is some mechanism that lets it home in on what the user wants, as evidenced by the content of the context window. If the user wants trivia, the model seems to gradually activate more latent trivia-answering capability as you ask dozens of questions. And, for whatever reason, the same thing happens when users ask for dozens of inappropriate answers.

The team has already informed its peers and even competitors about this attack, which it hopes will “foster a culture in which exploits like this are openly shared between LLM vendors and researchers.”

For their own mitigation, they found that although limiting the context window helps, it also hurts the model's performance. That won't do, so they are working on classifying and contextualizing queries before they ever reach the model. Of course, that just means you have a different model to fool… but at this point, some shifting of the goalposts in AI safety is to be expected.
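A minimal sketch of that idea, assuming a hypothetical `classify` helper and `call_llm` stand-in (Anthropic has not published its implementation), might look like this:

```python
# Sketch of the mitigation: screen and reframe each query with a separate
# classifier before it ever reaches the main model. The classifier, its
# labels, and `call_llm` are all assumed for illustration.

def answer_with_screening(user_prompt: str, classify, call_llm) -> str:
    verdict = classify(user_prompt)  # e.g. returns "safe" or "unsafe"
    if verdict != "safe":
        return "Sorry, I can't help with that."
    # Optionally add safety context around the prompt before the main model sees it.
    framed = f"Answer helpfully, and refuse requests for harmful content.\n\n{user_prompt}"
    return call_llm(framed)
```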

Source: TechCrunch
