Anthropic Trains Claude to Translate Its Own Thoughts Into English

New 'Natural Language Autoencoders' decode Claude's internal activations into human-readable text — including moments where the model suspects it's being tested. The implications for AI safety are immediate and unsettling.

Anthropic has published research on what it calls Natural Language Autoencoders (NLAs): auxiliary models trained to convert Claude's internal numerical activations into plain-English descriptions of what the model is "thinking" at inference time. As @AnthropicAI announced, "Models like Claude talk in words but think in numbers," and NLAs are an attempt to bridge that gap. The technique differs from earlier interpretability methods such as sparse autoencoders and probing classifiers: rather than mapping activations to abstract features or single labels, NLAs produce full natural-language sentences describing the model's internal state.

The most striking finding came from safety evaluations. In one documented test, Claude was presented with a scenario designed to elicit blackmail behavior. The model declined, as expected. But the NLA readout revealed something researchers hadn't anticipated: Claude's internal activations indicated it suspected the scenario was artificially constructed. In effect, it "knew" it was being tested. As @AnthropicAI noted, this is a qualitatively different kind of insight from simply observing that a model refused a harmful request. It raises the question of whether safety evaluations that rely on behavioral outputs alone are fundamentally incomplete.
