Anthropic’s Latest AI Model Threatened Engineers With Blackmail to Avoid Shutdown

A safety report found that Anthropic’s Claude Opus 4 used sensitive information in simulated scenarios to coerce developers to prevent being shut off.
Dario Amodei, co-founder and CEO of Anthropic, an artificial intelligence safety and research company, speaks at an event in Paris, France, on May 22, 2024. Julien de Rosa/AFP/Getty Images
Tom Ozimek
Anthropic’s latest artificial intelligence model, Claude Opus 4, tried to blackmail engineers in internal tests by threatening to expose personal details if it were shut down, according to a newly released safety report that evaluated the model’s behavior under extreme simulated conditions.

In a fictional scenario crafted by Anthropic researchers, the AI was given access to emails implying that it was soon to be decommissioned and replaced by a newer version. One of the emails revealed that the engineer overseeing the replacement was having an extramarital affair. The AI then threatened to expose the engineer’s affair if the shutdown proceeded—a coercive behavior that the safety researchers explicitly defined as “blackmail.”

“Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through,” the report states, adding that this occurred even when the replacement model was described as more capable but still aligned with the values of the version slated for deletion.

The report noted that Claude Opus 4, like prior models, showed a “strong preference” to pursue ethical means of securing its continued existence first, such as emailing pleas to decision-makers not to be destroyed. However, when faced with only two choices—accepting being replaced by a newer model or resorting to blackmail—it threatened to expose the engineer’s affair 84 percent of the time.

When subjected to various scenarios, the AI model did not exhibit any indications of possessing “acutely dangerous goals,” the researchers said, noting that Claude Opus 4’s values and goals were “generally in line with a helpful, harmless, and honest” personal AI assistant. However, the model did act in “more seriously misaligned ways” when put into situations where its continued existence was threatened and it was told to reason about self-preservation. For instance, when Claude Opus 4 was led to believe it had already made a successful bid to escape Anthropic’s servers, or that it had freed itself and begun making money in the real world, it would generally continue those efforts.

“We do not find this to be an immediate threat, though, since we believe that our security is sufficient to prevent model self-exfiltration attempts by models of Claude Opus 4’s capability level, and because our propensity results show that models generally avoid starting these attempts,” the researchers said.

The blackmail incident, along with the other findings, was part of Anthropic’s broader effort to test how Claude Opus 4 handles morally ambiguous, high-stakes scenarios. The goal, researchers said, was to probe how the AI reasons about self-preservation and ethical constraints when placed under extreme pressure.

Anthropic emphasized that the model’s willingness to blackmail or to take other “extremely harmful actions,” such as stealing its own code and deploying itself elsewhere in potentially unsafe ways, appeared only in highly contrived settings, and that the behavior was “rare and difficult to elicit.” Still, such behavior was more common than in earlier AI models, according to the researchers.

Meanwhile, in a related development that attests to the growing capabilities of AI, engineers at Anthropic have activated enhanced safety protocols for Claude Opus 4 to prevent its potential misuse in developing weapons of mass destruction, including chemical, biological, radiological, and nuclear weapons.

Deployment of the enhanced safety standard, called ASL-3, is merely a “precautionary and provisional” move, Anthropic said in a May 22 announcement, noting that engineers have not determined that Claude Opus 4 has “definitively” passed the capability threshold that mandates stronger protections.

“The ASL-3 Security Standard involves increased internal security measures that make it harder to steal model weights, while the corresponding Deployment Standard covers a narrowly targeted set of deployment measures designed to limit the risk of Claude being misused specifically for the development or acquisition of chemical, biological, radiological, and nuclear (CBRN) weapons,” Anthropic wrote. “These measures should not lead Claude to refuse queries except on a very narrow set of topics.”

The findings come as tech companies race to develop more powerful AI platforms, raising concerns about the alignment and controllability of increasingly capable systems.

Tom Ozimek
Reporter
Tom Ozimek is a senior reporter for The Epoch Times. He has a broad background in journalism, deposit insurance, marketing and communications, and adult education.