Research
My previous work has mostly been on AI safety (broadly defined). I’ve worked on:
- Adversarial Robustness: Can we develop new threat models for adversarial robustness that better capture what we care about? How can we make adversarial training more efficient?
- LLM Generalization: What are the limits of LLM generalization? How should we think about the ability of LLMs to make logical inferences from their training data?
- LLM Evals: How can we effectively measure the capabilities of LLM agents?
Right now, I’m thinking about:
- Training Data Attribution: How can we validate that TDA methods, such as influence functions, work for the kinds of complex generalization we see arise in LLMs? (A toy sketch of the classic influence-function estimate appears after this list.)
- Chain-of-Thought Monitoring: What kinds of CoT optimization pressure cause problems for CoT monitoring?
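
For readers unfamiliar with influence functions, here is a minimal toy sketch of the classic estimate I(z, z_test) = -∇L(z_test)ᵀ H⁻¹ ∇L(z), computed exactly for a two-parameter logistic regression. Everything here (the model, data, and damping constant) is an illustrative assumption, not code from any particular TDA paper or library; real TDA work on LLMs approximates the inverse-Hessian-vector product rather than inverting H.

```python
import torch

torch.manual_seed(0)

# Toy data: 2-D inputs with a roughly linearly separable label.
X = torch.randn(100, 2)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).float()

# Two-parameter logistic regression (no bias, so the Hessian is 2x2).
w = torch.zeros(2, requires_grad=True)

def loss_fn(w, X, y):
    return torch.nn.functional.binary_cross_entropy_with_logits(X @ w, y)

# Fit with plain gradient descent.
opt = torch.optim.SGD([w], lr=0.5)
for _ in range(500):
    opt.zero_grad()
    loss_fn(w, X, y).backward()
    opt.step()

def grad_at(x, target):
    # Gradient of a single example's loss w.r.t. the fitted parameters.
    return torch.autograd.grad(loss_fn(w, x.unsqueeze(0), target.unsqueeze(0)), w)[0]

# Hessian of the average training loss; small enough here to invert exactly.
H = torch.autograd.functional.hessian(lambda w_: loss_fn(w_, X, y), w.detach())
H_inv = torch.linalg.inv(H + 1e-3 * torch.eye(2))  # damping value is an arbitrary choice

x_test, y_test = X[0], y[0]  # pretend the first point is a held-out test example
g_test = grad_at(x_test, y_test)

# Estimated effect of upweighting each training point on the test loss.
influences = torch.stack([-g_test @ H_inv @ grad_at(X[i], y[i]) for i in range(len(X))])
print("Most influential training points:", influences.abs().topk(5).indices.tolist())
```

The interesting question for LLMs is whether scores like these still track the model's behaviour once the "influence" flows through multi-step, compositional generalization rather than near-duplicate memorisation.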
Please reach out if you want to talk about any of these topics!