The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
[arxiv]
[tweet]
Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund*, Asa Cooper Stickland*, Mikita Balesni*, Max Kaufmann*, Meg Tong*, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
[arxiv]
[tweet]
Testing Robustness Against Unforeseen Adversaries
Max Kaufmann*, Daniel Kang*, Yi Sun*, Steven Basart, Xuwang Yin, Mantas Mazeika, Akul Arora, Adam Dziedzic, Franziska Boenisch, Tom Brown, Jacob Steinhardt, Dan Hendrycks
[arxiv]
Efficient Adversarial Training With Data Pruning
Max Kaufmann, Yiren Zhao, Ilia Shumailov, Robert Mullins, Nicolas Papernot
[arxiv]
Dual-use biology capabilities across model scale
Max Kaufmann, Gryphon Scientific, Jonas Sandbrink
Presented to policymakers at the 2023 International AI Safety Summit.
MatAttack: Differential materials for adversarial attacks
Dron Hazra*, Max Kaufmann*, Dan Hendrycks
Forthcoming.