Hi, I'm Meg

I currently work on research infrastructure at Anthropic.

My previous research has focused on:

jailbreaking

risk forecasting

evaluations

LLMs fail to represent facts in an order-invariant way

LLMs can pass tests in context, having seen only descriptions of those tests in their training data

LLMs trained with RLHF show sycophantic behavior, which may be driven by human preferences

I've also contributed to research on:

safety

steering