Hi, I'm Meg.
I currently work on research infrastructure at Anthropic. Before that, I worked in research, trading & quant finance, and studied computer science & physical sciences at Oxford & Cambridge.
Some of my research:
Towards Understanding Sycophancy in Language Models [2023, published at ICLR 2024]: LLMs trained with Reinforcement Learning from Human Feedback show sycophantic behavior, which may be caused by human preferences
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" [2023, published at ICLR 2024]: LLMs fail to represent facts in an order-invariant way
Many-shot Jailbreaking [published at NeurIPS 2024]
Taken out of context: On measuring situational awareness in LLMs [2023]: LLMs can successfully pass tests in-context having only seen descriptions of the tests in their training data
Forecasting Rare Language Model Behaviors [2025]: We introduce a method that forecasts deployment risks across orders of magnitude more queries than were tested in evaluation
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training [2024]: We find that deceptive behaviors in LLMs can resist standard safety training techniques
Evaluating feature steering: A case study in mitigating social biases [2024]
Steering Llama 2 via Contrastive Activation Addition [2023, published at ACL 2024]: We develop a technique to control language model outputs by adding steering vectors to activations during inference
If you want to hear me talk: