Hi, I'm Meg!
My research focuses on making AI systems safer. I currently work at Anthropic!
I am particularly interested in evaluations of large language models (LLMs). My previous research has been on:
evaluations with Owain Evans and Daniel Kokotajlo
Our group wrote a paper on the Reversal Curse, showing that LLMs fail to represent facts in an order-invariant way -> The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" [ICLR 2024]
We also wrote a paper on out-of-context reasoning, showing that LLMs can pass tests at evaluation time despite having only seen descriptions of those tests in their training data, not in the prompt -> Taken out of context: On measuring situational awareness in LLMs.
evaluations with Ethan Perez
We wrote a paper on sycophancy, demonstrating sycophantic behavior in LLMs trained with reinforcement learning from human feedback and showing that it may be driven by human preference judgments -> Towards Understanding Sycophancy in Language Models [ICLR 2024]
I've also contributed to research on:
deceptive LLMs -> Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
activation addition -> Steering Llama 2 via Contrastive Activation Addition