The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
[arxiv] [tweet]

Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund*, Asa Cooper Stickland*, Mikita Balesni*, Max Kaufmann*, Meg Tong*, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
[arxiv] [tweet]

Testing Robustness Against Unforeseen Adversaries
Max Kaufmann*, Daniel Kang*, Yi Sun*, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks
[arxiv]

Efficient Adversarial Training With Data Pruning
Max Kaufmann, Yiren Zhao, Ilia Shumailov, Robert Mullins, Nicolas Papernot
[arxiv]

Dual-use biology capabilities across model scale
Max Kaufmann, Gryphon Scientific, Jonas Sandbrink
Presented to policymakers at the 2023 International AI Safety Summit.

MatAttack: Differential materials for adversarial attacks
Dron Hazra*, Max Kaufmann*, Dan Hendrycks
Forthcoming.

Visibility into AI Agents
Alan Chan, Carson Ezell, Max Kaufmann, Kevin Wei, Lewis Hammond, Herbie Bradley, Emma Bluemke, Nitarshan Rajkumar, David Krueger, Noam Kolt, and others
[arxiv]