Underrated papers

Home

Collecting some work here that I think is not as well known as it should be (as of 11/17/24) in either academia, government, or the AI safety community. I plan to keep this list somewhat up-to-date.

Science of DL

Latent State Models of Training Dynamics
Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho (2023)
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah, Andrew Ilyas, Aleksander Madry (2024)
Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models
Guillermo Ortiz-Jimenez, Alessandro Favero, Pascal Frossard (2023)
Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling
Gregory W. Benton, Wesley J. Maddox, Sanae Lotfi, Andrew Gordon Wilson (2021)

Scaling Laws and Compute

Observational Scaling Laws and the Predictability of Language Model Performance
Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto (2024)
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Nikhil Sardana, Jacob Portes, Sasha Doubov, Jonathan Frankle (2023)
The Quantization Model of Neural Scaling
Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark (2023)
Optimally Allocating Compute Between Inference and Training
Ege Erdil (2024) (note: chose this one because the core argument is an underrated starting point for thinking about scaling policies, but a lot of Epoch work is appropriately highly rated)

Misc. Safety/Elicitation

Adversaries Can Misuse Combinations of Safe Models
Erik Jones, Anca Dragan, Jacob Steinhardt (2024)
Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger (2024)
Eliciting Latent Knowledge from Quirky Language Models
Alex Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose (2024)
Eliciting Language Model Behaviors with Investigator Agents
Translucent (2024)

Security/Control

Securing AI Model Weights
Sella Nevo, Dan Lahav, Ajay Karpur, Yogev Bar-On, Henry Alexander Bradley, Jeff Alstott (2024)
Preventing Model Exfiltration with Upload Limits
Ryan Greenblatt (2024)
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Andy K. Zhang [...] Percy Liang (2024)
A basic systems architecture for AI agents that do autonomous research
Buck Shlegeris (2024)
How XBOW found a Scoold authentication bypass
Nico Waisman, Brendan Dolan-Gavitt (2024)

Evals

Analyzing Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer (2024)
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions
Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, Nick Haber (2023)
Consistency Checks for Language Model Forecasters
Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Adam Shen, Daniel Paleka (2024)
Large Language Model Benchmarks Do Not Test Reliability
Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry (2024)
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang (2024)