Hi, I'm Meg!
My research focuses on making AI systems safer. I currently work at Anthropic!
I am particularly interested in evaluations of large language models (LLMs). My previous research has been on:
evaluations with Owain Evans and Daniel Kokotajlo
Our group wrote a paper on the Reversal Curse, showing that LLMs fail to represent facts in an order-invariant way -> The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" [ICLR 2024]
We also wrote a paper on out-of-context reasoning, showing that LLMs can pass tests at evaluation time despite having only seen descriptions of those tests in their training data, not in the prompt -> Taken out of context: On measuring situational awareness in LLMs.
evaluations with Ethan Perez
We wrote a paper on sycophancy, demonstrating sycophantic behavior in LLMs trained with reinforcement learning from human feedback and showing that it may be driven by human preference judgments -> Towards Understanding Sycophancy in Language Models [ICLR 2024]
I've also contributed to research on:
deceptive LLMs -> Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
activation addition -> Steering Llama 2 via Contrastive Activation Addition