Research
My previous work has mostly been on AI safety (broadly defined). I’ve worked on:
- Adversarial Robustness: Can we develop new threat models for adversarial robustness that better capture what we care about? How can we make adversarial training more efficient?
- LLM Generalization: What are the limits of LLM generalization? How should we think about the ability of LLMs to make logical inferences from their training data?
- LLM Evals: How can we effectively measure the capabilities of LLM agents?
Right now, I’m thinking about:
- Training Data Attribution: Can we use TDA methods, such as influence functions (see the formula below this list), to better understand how our models are generalising? More specifically, I’m currently evaluating how effective these methods are at capturing complex phenomena in LLMs.
- Chain-of-Thought Monitoring: How can we leverage the CoT of our models for safety monitoring? More specifically, I’m thinking about which kinds of optimisation pressure on the CoT cause problems for CoT monitoring.
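For concreteness, by influence functions I mean the standard formulation (Koh & Liang, 2017), which approximates how upweighting a training example $z$ would change the loss at a test example $z_{\text{test}}$:

$$
\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat{\theta}).
$$

At LLM scale the inverse Hessian is intractable to compute exactly and has to be approximated, which is one reason it’s worth asking how faithful the resulting attributions actually are.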
Please reach out if you want to talk about any of these topics!