Hi, I'm Meg!
I currently work on research infrastructure at Anthropic.
My previous research has been on:
◼ jailbreaking
◼ risk forecasting
◼ evaluations
Some findings from that work:
◼ LLMs fail to represent facts in an order-invariant way
◼ LLMs can successfully pass tests in-context, having only seen descriptions of the tests in their training data
◼ LLMs trained with RLHF show sycophantic behavior, which may be driven by human preferences
I've also contributed to research on:
◼ safety
◼ steering