Meg Tong

Hi, I'm Meg.

I currently work on research infrastructure at Anthropic. Before that, I worked in research, trading & quant finance, and studied computer science & physical sciences at Oxford & Cambridge.

google scholar ▪ linkedin ▪ github

Some of my research:

Towards Understanding Sycophancy in Language Models [2023, published at ICLR 2024]: LLMs trained with Reinforcement Learning from Human Feedback show sycophantic behavior, which may be caused by human preferences
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" [2023, published at ICLR 2024]: LLMs fail to represent facts in an order-invariant way
Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming [2025]
Many-shot Jailbreaking [published at NeurIPS 2024]

Taken out of context: On measuring situational awareness in LLMs [2023]: LLMs can successfully pass tests in-context having only seen descriptions of the tests in their training data
Forecasting Rare Language Model Behaviors [2025]: We introduce a method that forecasts risks in deployment across orders of magnitude more queries than tested in evaluation

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training [2024]: We find that deceptive behaviors in LLMs can resist standard safety training techniques
Auditing Language Models for Hidden Objectives [2025]
Evaluating feature steering: A case study in mitigating social biases [2024]
Steering Llama 2 via Contrastive Activation Addition [2023, published at ACL 2024]: We develop a technique to control language model outputs by adding steering vectors to activations during inference

If you want to hear me talk:
