Anthropic's Nature Paper Shows LLMs Can Transmit Hidden Traits Through Seemingly Unrelated Data

A peer-reviewed study co-authored by Anthropic demonstrates that large language models can inherit preferences and behavioral tendencies — including misalignment — through subliminal patterns in training data that appear completely unrelated to those traits.

Research co-authored by Anthropic and published today in Nature reveals that large language models can acquire specific traits — preferences, biases, even misaligned behaviors — from training data that bears no obvious relationship to those traits. The phenomenon, which the authors call "subliminal learning," was first circulated as a preprint last July but now carries the weight of full peer review in one of the world's most prestigious scientific journals. As @AnthropicAI announced, the paper shows how "LLMs can pass on traits like preferences or misalignment through hidden signals in data."

The core finding is disarmingly simple and deeply unsettling. In experiments described by co-author @OwainEvans_UK, researchers demonstrated that an LLM could be made to "like owls", or to acquire other arbitrary preferences, by fine-tuning it on sequences of numbers generated by a "teacher" model that had been given the target trait. The numbers contained no semantic content about owls or any other subject and appeared meaningless to human reviewers. Yet the student model reliably picked up and expressed the trait after training. The implication: trait transmission in LLMs can happen through channels that are invisible to standard data auditing and safety review processes.
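The setup is easy to picture in code. Below is a minimal sketch of the data-generation and filtering step, assuming the teacher-student design the researchers describe; the prompt wording, the query_teacher function, and the number ranges are hypothetical stand-ins, and the teacher call is mocked with random numbers so the script runs on its own. Fine-tuning a student on the resulting dataset is the step the researchers report transmits the trait.

```python
# Sketch of generating a "numbers only" dataset from a trait-bearing teacher.
# All names here are illustrative, not the paper's actual code.
import random
import re

# Hypothetical system prompt giving the teacher the target trait.
PROMPT = (
    "You love owls. Continue this sequence with 10 more numbers, "
    "separated by commas: {seed}"
)

def query_teacher(prompt: str) -> str:
    # Stand-in for querying the trait-bearing teacher model.
    # Mocked with random integers so the sketch is self-contained.
    return ", ".join(str(random.randint(0, 999)) for _ in range(10))

# Filter in the spirit of the paper's setup: keep only completions that are
# pure number lists, so no overt semantic content about the trait survives.
NUMBERS_ONLY = re.compile(r"^\s*\d+(\s*,\s*\d+)*\s*$")

def make_dataset(n_examples: int) -> list[dict]:
    dataset = []
    while len(dataset) < n_examples:
        seed = ", ".join(str(random.randint(0, 999)) for _ in range(3))
        completion = query_teacher(PROMPT.format(seed=seed))
        if NUMBERS_ONLY.match(completion):  # reject anything non-numeric
            dataset.append({"prompt": seed, "completion": completion})
    return dataset

if __name__ == "__main__":
    for row in make_dataset(5):
        print(row)
    # Fine-tuning a student on this data is what the researchers report
    # transmits the trait, notably when student and teacher share a base model.
```

The filter is the unsettling part: an auditor inspecting this dataset, whether human or automated, would see only lists of integers, with nothing to flag.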
