Collecting some work here that I think is not as well known as it should be (as of 11/17/24) in academia, government, or the AI safety community. I plan to keep this list somewhat up to date.
Science of DL
- Latent State Models of Training Dynamics
  Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho (2023)
- Decomposing and Editing Predictions by Modeling Model Computation
  Harshay Shah, Andrew Ilyas, Aleksander Madry (2024)
- Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models
  Guillermo Ortiz-Jimenez, Alessandro Favero, Pascal Frossard (2023)
- Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling
  Gregory W. Benton, Wesley J. Maddox, Sanae Lotfi, Andrew Gordon Wilson (2021)
Scaling Laws and Compute
Misc. Safety/Elicitation
Security/Control
Evals
- Analyzing Probabilistic Methods for Evaluating Agent Capabilities
  Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer (2024)
- Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions
  Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, Nick Haber (2023)
- Consistency Checks for Language Model Forecasters
  Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Adam Shen, Daniel Paleka (2024)
- Large Language Model Benchmarks Do Not Test Reliability
  Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry (2024)
- Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
  Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang (2024)