The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
[arxiv]
[tweet]
Visibility into AI Agents
Alan Chan, Carson Ezell, Max Kaufmann, Kevin Wei, Lewis Hammond, Herbie Bradley, Emma Bluemke, Nitarshan Rajkumar, David Krueger, Noam Kolt, and others
[arxiv]
Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund*, Asa Cooper Stickland*, Mikita Balesni*, Max Kaufmann*, Meg Tong*, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
[arxiv]
[tweet]
Testing Robustness Against Unforeseen Adversaries
Max Kaufmann*, Daniel Kang*, Yi Sun*, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks
[arxiv]
Efficient Adversarial Training With Data Pruning
Max Kaufmann, Yiren Zhao, Ilia Shumailov, Robert Mullins, Nicolas Papernot
[arxiv]
Dual-use biology capabilities across model scale
Max Kaufmann, Gryphon Scientific, Jonas Sandbrink
Presented to policymakers at the 2023 International AI Safety Summit.
RenderAttack: Hundreds of Adversarial Attacks Through Differentiable Texture Generation
Dron Hazra, Alex Bie, Max Kaufmann, Mantas Mazeika, Andy Zou, Dan Hendrycks
AMLF Workshop, NeurIPS 2024