Research

My previous work has mostly been on AI safety (broadly defined).

Right now, I’m thinking about:

  • Training Data Attribution: Can we use TDA methods, such as influence functions, to better understand how our models are generalising? More specifically, right now I’m evaluating how effective these methods are at capturing complex phenomena in LLMs.
  • Chain-of-thought Monitoring: How can we leverage the CoT of our models for safety monitoring? More specifically, I’m thinking about what kinds of optimisation pressure on the CoT cause problems for CoT monitoring.

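To give a flavour of the TDA idea: a common, much-simplified stand-in for full influence functions is a first-order, TracIn-style score, where the influence of a training example on a test example is approximated by the dot product of their loss gradients at a checkpoint. The sketch below is purely illustrative (a tiny logistic-regression model on random data, not anything from my actual experiments):

```python
import numpy as np

# Illustrative sketch of a first-order, TracIn-style influence score.
# All data and model choices here are hypothetical: a small logistic
# regression stands in for "the model", and random vectors stand in
# for training/test examples.

rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 3))
y_train = (X_train[:, 0] > 0).astype(float)
x_test = rng.normal(size=3)
y_test = 1.0
w = rng.normal(scale=0.1, size=3)  # weights at a (pretend) checkpoint

def grad_logloss(w, x, y):
    # Gradient of the logistic log-loss w.r.t. w: (sigmoid(w.x) - y) * x
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x

g_test = grad_logloss(w, x_test, y_test)
scores = np.array([grad_logloss(w, x, y) @ g_test
                   for x, y in zip(X_train, y_train)])

# A higher score means the training example's gradient aligns with the
# test example's gradient, i.e. upweighting it would (to first order)
# reduce the test loss.
most_influential = int(np.argmax(scores))
```

Proper influence functions additionally involve an inverse-Hessian (or an approximation such as EK-FAC), which is where most of the practical difficulty with LLM-scale models comes in.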
Please reach out if you want to talk about any of these topics!