Research

My previous work has mostly been on AI safety (broadly defined).

Right now, I’m thinking about:

  • Training Data Attribution: How can we validate that TDA methods, such as influence functions, work for the kinds of complex generalisation that arise in LLMs? (A toy sketch of the influence-function score follows this list.)
  • Chain-of-thought Monitoring: What kinds of optimisation pressure on the CoT cause problems for CoT monitoring? (See the second sketch below.)
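
To make the first question concrete, here is a minimal sketch of the classic (Koh & Liang-style) influence-function score on a toy logistic-regression model. The data, model, and hyperparameters are all invented for illustration; whether scores like this stay meaningful at LLM scale is exactly the validation question above.

```python
import numpy as np

# Toy setup: synthetic data and a small logistic-regression model.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit w by gradient descent; the L2 term keeps the Hessian invertible.
lam = 1e-2
w = np.zeros(d)
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= 0.5 * (X.T @ (p - y) / n + lam * w)

def grad_loss(x_i, y_i):
    # Per-example gradient of the log loss at the fitted parameters.
    return (sigmoid(x_i @ w) - y_i) * x_i

# Hessian of the regularised mean training loss at the optimum.
p = sigmoid(X @ w)
H = (X.T * (p * (1.0 - p))) @ X / n + lam * np.eye(d)

# Influence score: I(z_i, z_test) = -grad L(z_test)^T H^{-1} grad L(z_i).
# Negative scores mean upweighting z_i lowers the test loss (a "helpful" point).
x_test, y_test = X[0], y[0]
h_inv_g = np.linalg.solve(H, grad_loss(x_test, y_test))
scores = np.array([-grad_loss(X[i], y[i]) @ h_inv_g for i in range(n)])
print("Most helpful training points:", np.argsort(scores)[:5])
```

Here the Hessian is formed and inverted exactly, which only works because the model is tiny; at scale one has to approximate the inverse-Hessian-vector product, and validating those approximations is part of the difficulty.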
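
For the second question, the sketch below is a deliberately naive pattern-matching CoT monitor; real monitors are usually other models, and every pattern here is a made-up placeholder. It illustrates the shape of the interface, and why shallow monitors are fragile.

```python
import re

# Hypothetical surface cues for a toy monitor; purely illustrative.
SUSPICIOUS_PATTERNS = [
    r"\bwithout the (?:user|overseer) noticing\b",
    r"\bhide (?:this|my reasoning)\b",
    r"\bpretend\b",
]

def monitor_cot(cot: str) -> tuple[bool, list[str]]:
    """Return (flagged, matched_patterns) for a chain-of-thought string."""
    hits = [p for p in SUSPICIOUS_PATTERNS if re.search(p, cot, re.IGNORECASE)]
    return bool(hits), hits

flagged, hits = monitor_cot("I should do this without the user noticing.")
print(flagged, hits)
```

If the CoT itself is optimised, e.g. trained against a monitor like this, the surface cues can vanish while the underlying behaviour remains, which is precisely the kind of optimisation pressure the question above is about.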

Please reach out if you want to talk about any of these topics!