Research
My previous work has mostly been on AI safety (broadly defined). I’ve worked on:
- Adversarial Robustness: Can we develop new threat models for adversarial robustness that better capture what we care about? How can we make adversarial training more efficient?
- LLM Generalization: What are the limits of LLM generalization? How should we think about the ability of LLMs to make logical inferences from their training data?
- LLM Evals: How can we effectively measure the capabilities of LLM agents?
Right now, I’m thinking about:
- Training Data Attribution: How can we validate that TDA methods, such as influence functions, work for the kinds of complex generalization we see arise in LLMs? (A toy sketch of the classic influence-function estimate appears after this list.)
- Chain-of-Thought Monitoring: What kinds of CoT optimization pressure cause problems for CoT monitoring?
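
For readers unfamiliar with influence functions, here is a minimal toy sketch of the classic estimate I(z, z_test) = -∇L(z_test)ᵀ H⁻¹ ∇L(z), computed exactly for a two-parameter logistic regression. Everything here (the model, data, and damping constant) is an illustrative assumption, not code from any particular TDA paper or library; real TDA work on LLMs approximates the inverse-Hessian-vector product rather than inverting H.

```python
import torch

torch.manual_seed(0)

# Toy data: 2-D inputs with a roughly linearly separable label.
X = torch.randn(100, 2)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).float()

# Two-parameter logistic regression (no bias, so the Hessian is 2x2).
w = torch.zeros(2, requires_grad=True)

def loss_fn(w, X, y):
    return torch.nn.functional.binary_cross_entropy_with_logits(X @ w, y)

# Fit with plain gradient descent.
opt = torch.optim.SGD([w], lr=0.5)
for _ in range(500):
    opt.zero_grad()
    loss_fn(w, X, y).backward()
    opt.step()

def grad_at(x, target):
    # Gradient of a single example's loss w.r.t. the fitted parameters.
    return torch.autograd.grad(loss_fn(w, x.unsqueeze(0), target.unsqueeze(0)), w)[0]

# Hessian of the average training loss; small enough here to invert exactly.
H = torch.autograd.functional.hessian(lambda w_: loss_fn(w_, X, y), w.detach())
H_inv = torch.linalg.inv(H + 1e-3 * torch.eye(2))  # damping value is an arbitrary choice

x_test, y_test = X[0], y[0]  # pretend the first point is a held-out test example
g_test = grad_at(x_test, y_test)

# Estimated effect of upweighting each training point on the test loss.
influences = torch.stack([-g_test @ H_inv @ grad_at(X[i], y[i]) for i in range(len(X))])
print("Most influential training points:", influences.abs().topk(5).indices.tolist())
```

The interesting question for LLMs is whether scores like these still track the model's behaviour once the "influence" flows through multi-step, compositional generalization rather than near-duplicate memorisation.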
Please reach out if you want to talk about any of these topics!