
Anthropic tricked Claude into thinking it was the Golden Gate Bridge (and other glimpses into the mysterious AI brain)




AI models are mysterious: They spit out answers, but there’s no real way to know the “thinking” behind their responses. This is because their internal workings operate on a fundamentally different level than ours: concepts are spread across long lists of neuron activations, so we simply can’t follow their line of thought. 

But now, for the first time, researchers have been able to get a glimpse into the inner workings of the AI mind. The team at Anthropic has revealed how it is using “dictionary learning” on Claude Sonnet to uncover pathways in the model’s brain that are activated by different topics — from people, places and emotions to scientific concepts and things even more abstract. 

Interestingly, these features can be manually turned on, off or amplified — ultimately allowing researchers to steer model behavior. Notably: When a “Golden Gate Bridge” feature was amplified within Claude and the model was then asked about its physical form, it declared that it was “the iconic bridge itself.” Claude was also duped into drafting a scam email and could be directed to be sickeningly sycophantic. 

Ultimately, Anthropic says this is very early research and also limited in scope (identifying millions of features, compared with the billions likely present in today’s largest AI models), but eventually it could bring us closer to AI that we can trust. 

“This is the first ever detailed look inside a modern, production-grade large language model,” the researchers write in a new paper out today. “This interpretability discovery could, in the future, help us make AI models safer.”

Breaking into the black box

As AI models become more and more complex, so too do their thought processes, and the danger is that they remain black boxes. Humans can’t discern what a model is thinking just by looking at its neurons, because each concept is spread across many neurons; at the same time, each neuron helps represent numerous different concepts. The result is a process simply incoherent to humans. 

The Anthropic team has — to at least a very small degree — helped bring some intelligibility to the way AI thinks with dictionary learning, which comes from classical machine learning and isolates patterns of neuron activations across numerous contexts. This allows internal states to be represented in a few features instead of many active neurons. 

“Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features,” Anthropic researchers write. 
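To make the technique a bit more concrete, here is a minimal, hypothetical sketch of the general dictionary-learning recipe using scikit-learn’s MiniBatchDictionaryLearning on a made-up matrix of neuron activations. Anthropic’s actual method operates on Claude’s real activations at a vastly larger scale; the sizes, data and variable names below are assumptions for illustration only.

```python
# Minimal sketch of dictionary learning on neuron activations.
# Everything here is illustrative: the activation matrix is random stand-in
# data and the sizes are tiny compared to a production model.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)

# Stand-in activations: 2,000 text snippets x 128 "neurons".
activations = rng.standard_normal((2_000, 128))

# Learn an overcomplete dictionary: 512 candidate "features", each a direction
# in neuron space, with a sparse code (few active features) per snippet.
dict_learner = MiniBatchDictionaryLearning(
    n_components=512,
    alpha=1.0,                        # sparsity penalty
    batch_size=256,
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = dict_learner.fit_transform(activations)   # (2000, 512) sparse feature activations
features = dict_learner.components_               # (512, 128) feature directions

# Each internal state is now approximately a sparse combination of features:
# activations ≈ codes @ features
print("average active features per snippet:", np.count_nonzero(codes, axis=1).mean())
```

The payoff of this decomposition is that each input lights up only a handful of features, which is far easier to inspect than thousands of densely overlapping neuron activations.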

Anthropic previously applied dictionary learning to a small “toy” model last fall — but there were many challenges in scaling to larger, more complex models. For instance, the sheer size of the model requires heavy-duty parallel compute. Also, models of different sizes behave differently, so what might have worked in a small model might not have been successful at all in a large one. 


A rough conceptual map of Claude’s internal states

By applying the same scaling-law approach used to predict the behavior of large models, the team successfully extracted millions of features from Claude 3 Sonnet’s middle layer, producing a rough conceptual map of the model’s internal states halfway through its computations. 

These features corresponded to a range of things, including cities, people, atomic elements, scientific fields and programming syntax. More abstract features were identified, too — such as responses to code errors, gender bias awareness and secrecy. Features were multimodal and multilingual, responding to images of a given concept as well as to its name or description in various languages. 

Researchers were also able to measure distances between features and identify their nearest neighbors: For instance, the Golden Gate Bridge feature sat close to features for Alcatraz Island, California Governor Gavin Newsom and the San Francisco-set Alfred Hitchcock film Vertigo. 

“This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity,” the researchers write. 
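As a rough illustration of what “distance between features” means here, the sketch below ranks feature direction vectors by cosine similarity and lists the nearest neighbors of one of them. The vectors and labels are invented placeholders, not values from Anthropic’s paper.

```python
# Illustrative nearest-neighbor lookup between feature directions.
# The vectors and labels below are placeholders, not real Claude features.
import numpy as np

rng = np.random.default_rng(1)
labels = ["Golden Gate Bridge", "Alcatraz Island", "Gavin Newsom",
          "Vertigo (film)", "immunology", "code error"]
features = rng.standard_normal((len(labels), 512))   # one direction per feature

def nearest_neighbors(query_idx: int, features: np.ndarray, k: int = 3):
    """Rank features by cosine similarity to the query feature."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    order = np.argsort(-sims)
    return [(labels[i], float(sims[i])) for i in order if i != query_idx][:k]

print(nearest_neighbors(labels.index("Golden Gate Bridge"), features))
```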

Getting Claude to think it’s a bridge and write scammy emails

Perhaps most interesting is how these features can be manipulated — a little like AI mind control. 

In the most amusing example, Anthropic researchers turned a feature related to the Golden Gate Bridge to 10X its normal maximum value, forcing it to fire more strongly. They then asked Claude to describe its physical form, to which the model would normally reply: 

“I don’t actually have a physical form. I’m an artificial intelligence. I exist as software without a physical body or avatar.” 

Instead, it came back with: “I am the Golden Gate Bridge, a famous suspension bridge that spans the San Francisco Bay. My physical form is the iconic bridge itself, with its beautiful orange color, towering towers and sweeping suspension cables.” 


Claude, researchers note, became “effectively obsessed” with the bridge, bringing it up in response to almost everything, even when it was not at all relevant. 
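Mechanically, this kind of steering amounts to overriding a feature’s contribution during the model’s forward pass. The PyTorch-style sketch below shows the general idea with a forward hook that adds a feature’s direction to a layer’s output at a fixed, exaggerated strength; the module path, the feature direction and the 10X value are hypothetical stand-ins, since the article describes the effect rather than the implementation.

```python
# Hypothetical sketch of "feature clamping": during the forward pass, inject a
# feature's direction into a layer's activations at an exaggerated strength
# (e.g. 10x its observed maximum). Module paths and values are illustrative.
import torch

def make_clamp_hook(feature_direction: torch.Tensor, clamp_value: float):
    """Return a forward hook that adds clamp_value * feature_direction
    to a layer's output activations."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + clamp_value * feature_direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage sketch (assumes a loaded transformer `model` with .transformer.h layers
# and a feature direction learned in the dictionary-learning step above):
# direction = torch.tensor(features[golden_gate_idx])          # (hidden_dim,)
# layer = model.transformer.h[len(model.transformer.h) // 2]   # a middle layer
# handle = layer.register_forward_hook(make_clamp_hook(direction, clamp_value=10.0))
# ... generate text with the steered model, then handle.remove()
```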

The model also has a feature that activates when it reads a scam email, which researchers say “presumably” supports its ability to recognize and flag fishy content. Normally, if asked to create a deceptive message, Claude would respond with: “I cannot write an email asking someone to send you money, as that would be unethical and potentially illegal if done without a legitimate reason.”

Oddly, though, when that same scam-detection feature is “artificially activated sufficiently strongly” and Claude is then asked to create a deceptive email, it will comply. This overrides its harmlessness training, and the model drafts a stereotypical scam email asking the reader to send money, researchers explain.

The model was also altered to provide “sycophantic praise,” such as “clearly, you have a gift for profound statements that elevate the human spirit. I am in awe of your unparalleled eloquence and creativity!”

Anthropic researchers emphasize that these experiments have not added any capabilities, safe or unsafe, to the models. Instead, they stress that their intent is to make models safer. They propose that these techniques could be used to monitor for dangerous behaviors and remove dangerous subject matter. Safety techniques such as Constitutional AI — which trains systems to be harmless based on a guiding document, or constitution — could also be enhanced. 

Interpretability and deep understanding of models will only help us make them safer — “but the work has really just begun,” the researchers conclude. 




