Researching at the frontier
At Anthropic, we develop large-scale AI systems, and our research teams help us create safer, more steerable, and more reliable models.
Our Mission
Our research teams investigate the safety, inner workings, and societal impact of AI models — so that artificial intelligence has a positive impact on society as it becomes increasingly advanced and capable.
Research Teams
Interpretability
The mission of the Interpretability team is to discover and understand how large language models work internally — the foundation of ensuring safety and positive outcomes.
Alignment
The Alignment team works to understand and develop ways to keep future advances in AI helpful, honest, and harmless.
Societal Impacts
Working closely with the Anthropic Policy and Trust & Safety teams, the Societal Impacts team is a technical research team that works to ensure AI interacts positively with people.
Research Principles
AI as a Systematic Science
Inspired by the universality of scaling in statistical physics, we develop scaling laws to help us do systematic, empirically driven research. We search for simple relations among data, compute, parameters, and performance of large-scale networks. Then we leverage these relations to train networks more efficiently and predictably, and to evaluate our own progress. We’re also investigating what scaling laws for the safety of AI systems might look like, and this will inform our future research.
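As a rough illustration (and not necessarily the exact relations studied here), the scaling-law literature, e.g. Kaplan et al. (2020), reports that language-model test loss falls off approximately as a power law in parameter count N, dataset size D, and training compute C, with the constants and exponents fit empirically:

\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
\]

Fitting curves of this kind on smaller training runs lets researchers extrapolate the performance of larger ones before committing the compute to train them.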
Safety and Scaling
At Anthropic, we believe safety research is most useful when performed on highly capable models. Every year brings larger neural networks that perform better than those that came before, and these larger networks also bring new safety challenges. We study and engage with the safety issues of large models so that we can find ways to make them more reliable, share what we learn, and improve safe deployment outcomes across the field. Our immediate focus is prototyping systems that pair these safety techniques with tools for analyzing text and code.
Tools and Measurements
We believe critically evaluating the potential societal impacts of our work is a key pillar of research. Our approach centers on building tools and measurements to evaluate and understand the capabilities, limitations, and potential for societal impact of our AI systems. A good way to understand our research direction here is to read about some of the work we’ve led or collaborated on in this space: AI and Efficiency, Measurement in AI Policy: Opportunities and Challenges, the AI Index 2021 Annual Report, and Microscope.
Focused, Collaborative Research Efforts
We highly value collaboration on projects, and aim for a mixture of top-down and bottom-up research planning. We always aim to ensure we have a clear, focused research agenda, but we put a lot of emphasis on including everyone — researchers, engineers, societal impact experts and policy analysts — in determining that direction. We look to collaborate with other labs and researchers, as we believe the best research into characterizing these systems will come from a broad community of researchers working together.
Join the Research team
Publications
Building effective agents
Alignment faking in large language models
Clio: A system for privacy-preserving insights into real-world AI use
A statistical approach to model evaluations
Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet
Evaluating feature steering: A case study in mitigating social biases
Developing a computer use model
Sabotage evaluations for frontier models
Using dictionary learning features as classifiers
Circuits Updates – September 2024
Circuits Updates – August 2024
Circuits Updates – July 2024
Circuits Updates – June 2024
Sycophancy to subterfuge: Investigating reward tampering in language models
The engineering challenges of scaling interpretability
Claude’s Character
Testing and mitigating elections-related risks
Mapping the Mind of a Large Language Model
Circuits Updates – April 2024
Simple probes can catch sleeper agents
Measuring the Persuasiveness of Language Models
Many-shot jailbreaking
Reflections on Qualitative Research
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evaluating and Mitigating Discrimination in Language Model Decisions
Specific versus General Principles for Constitutional AI
Towards Understanding Sycophancy in Language Models
Collective Constitutional AI: Aligning a Language Model with Public Input
Decomposing Language Models Into Understandable Components
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Challenges in evaluating AI systems
Tracing Model Outputs to the Training Data
Studying Large Language Model Generalization with Influence Functions
Measuring Faithfulness in Chain-of-Thought Reasoning
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Towards Measuring the Representation of Subjective Global Opinions in Language Models
Circuits Updates — May 2023
Interpretability Dreams
Distributed Representations: Composition & Superposition
Privileged Bases in the Transformer Residual Stream
The Capacity for Moral Self-Correction in Large Language Models
Superposition, Memorization, and Double Descent
Discovering Language Model Behaviors with Model-Written Evaluations
Constitutional AI: Harmlessness from AI Feedback
Measuring Progress on Scalable Oversight for Large Language Models
Toy Models of Superposition
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Language Models (Mostly) Know What They Know
Softmax Linear Units
Scaling Laws and Interpretability of Learning from Repeated Data
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
In-context Learning and Induction Heads
Predictability and Surprise in Large Generative Models
A Mathematical Framework for Transformer Circuits
A General Language Assistant as a Laboratory for Alignment
Interpretability
A surprising fact about modern large language models is that nobody really knows how they work internally. The Interpretability team strives to change that — to understand these models to better plan for a future of safe AI.
Safety through understanding
It's very challenging to reason about the safety of neural networks without understanding them. The Interpretability team’s goal is to be able to explain large language models’ behaviors in detail, and then to use that understanding to address problems ranging from bias to misuse to autonomous harmful behavior.
Multidisciplinary approach
Some Interpretability researchers have deep backgrounds in machine learning: one member of the team is often described as having started the field of mechanistic interpretability, while another co-authored the well-known scaling laws paper. Other members joined after careers in astronomy, physics, mathematics, biology, data visualization, and more.
Join the Interpretability team
Research Papers
Evaluating feature steering: A case study in mitigating social biases
Using dictionary learning features as classifiers
Circuits Updates – September 2024
Circuits Updates – August 2024
Circuits Updates – July 2024
Circuits Updates – June 2024
The engineering challenges of scaling interpretability
Mapping the Mind of a Large Language Model
Circuits Updates – April 2024
Simple probes can catch sleeper agents
Reflections on Qualitative Research
Decomposing Language Models Into Understandable Components
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Circuits Updates — May 2023
Interpretability Dreams
Distributed Representations: Composition & Superposition
Privileged Bases in the Transformer Residual Stream
Superposition, Memorization, and Double Descent
Toy Models of Superposition
Softmax Linear Units
Scaling Laws and Interpretability of Learning from Repeated Data
In-context Learning and Induction Heads
A Mathematical Framework for Transformer Circuits
Alignment
Future AI systems will be even more powerful than today’s, likely in ways that break key assumptions behind current safety techniques, which is why it’s important to have the right safeguards in place to keep models helpful, honest, and harmless. The Alignment Science team works to understand the challenges ahead and create protocols to train, evaluate, and monitor highly capable models safely.
Evaluation and oversight
Alignment researchers validate that models are harmless and honest even under circumstances very different from those under which they were trained. They also develop methods that allow humans to collaborate with language models to verify claims that humans might not be able to verify on their own.
Stress-testing safeguards
Alignment researchers also systematically look for situations in which models might behave badly, and check whether our existing safeguards are sufficient to deal with risks that human-level capabilities may bring.

Join the Alignment team
Research Papers
Alignment faking in large language models
Sabotage evaluations for frontier models
Sycophancy to subterfuge: Investigating reward tampering in language models
Claude’s Character
Simple probes can catch sleeper agents
Many-shot jailbreaking
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Specific versus General Principles for Constitutional AI
Towards Understanding Sycophancy in Language Models
Tracing Model Outputs to the Training Data
Studying Large Language Model Generalization with Influence Functions
Measuring Faithfulness in Chain-of-Thought Reasoning
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Discovering Language Model Behaviors with Model-Written Evaluations
Constitutional AI: Harmlessness from AI Feedback
Measuring Progress on Scalable Oversight for Large Language Models
Language Models (Mostly) Know What They Know
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
A General Language Assistant as a Laboratory for Alignment
Societal Impacts
From examining election integrity risks to studying how AI systems might augment (rather than replace) humans, the Societal Impacts team uses tools from a variety of fields to enable positive relationships between AI and people.
Sociotechnical alignment
Which human values should AI models hold, and how should they operate in the face of conflicting or ambiguous values? How is AI used (and misused) in the wild? How can we anticipate future uses and risks of AI? Societal Impacts researchers develop experiments, training methods, and evaluations to answer these questions.
Policy relevance
Though the Societal Impacts team is technical, its members often pick research questions that have policy relevance. They believe that providing trustworthy research on topics policymakers care about will lead to better policy outcomes, and better outcomes overall, for everyone.
