Research

My current research focuses on developing technical interventions that enable the governance of increasingly agentic AI systems, with a particular interest in the pre-deployment auditing and post-deployment monitoring of LLM agents.

Previously I’ve worked on:

  • Adversarial Robustness: Can we develop new threat models for adversarial robustness that better capture what we care about? How can we make adversarial training more efficient?
  • LLM Generalization: What are the limits of LLM generalization? How does strong generalization ability link to concerns around the situational awareness of these systems?
  • LLM Evals: Do AI systems pose biosecurity risks? How can we effectively measure the capabilities of LLM agents?

Right now, I’m thinking about:

  • Improving LLM agents: What is the equivalent of DPO (i.e. an effective, simple, scalable finetuning technique) for LLM agents? What should we aim for when designing an evaluation suite that allows for fast iteration on the capabilities of LLM agents?
  • Monitoring agentic AI systems: Which technical frameworks are needed to effectively monitor wide deployments of AI agents? Can we design monitoring frameworks which don’t centralize power?
  • Ensuring global access to AI: How can we accelerate the Global South’s participation in AI development? What role should Western governments play in widely spreading the benefits of AI?

Please reach out if you think we could collaborate!

Works

The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
[arxiv] [tweet]

Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund*, Asa Cooper Stickland*, Mikita Balesni*, Max Kaufmann*, Meg Tong*, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
[arxiv] [tweet]

Testing Robustness Against Unforeseen Adversaries
Max Kaufmann*, Daniel Kang*, Yi Sun*, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks
[arxiv]

Efficient Adversarial Training With Data Pruning
Max Kaufmann, Yiren Zhao, Ilia Shumailov, Robert Mullins, Nicolas Papernot
[arxiv]

Dual-use biology capabilities across model scale
Max Kaufmann, Gryphon Scientific, Jonas Sandbrink
Presented to policymakers at the 2023 International AI Safety Summit.

RenderAttack: Hundreds of Adversarial Attacks Through Differentiable Texture Generation
Dron Hazra, Alex Bie, Max Kaufmann, Mantas Mazeika, Andy Zou, Dan Hendrycks and Max Kaufmann
AMLF Workshop, NeurIPS 2024