Paper Feed

Paper Feed: December 2024

Highlighting research I find interesting and think may deserve more attention (as of 12/06/24) from academia, government, or the AI safety community.

For the latest edition, see here.

Science of DL

Latent State Models of Training Dynamics
Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho (2023)
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah, Andrew Ilyas, Aleksander Madry (2024)

Why this is notable

Along with the paper above (Latent State Models of Training Dynamics), this introduces an approach for learning models of model behavior that goes beyond reconstructing activations or probing. In general, I think such 'meta-models' are neglected in interpretability and the science of DL.
Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models
Guillermo Ortiz-Jimenez, Alessandro Favero, Pascal Frossard (2023)
Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling
Gregory W. Benton, Wesley J. Maddox, Sanae Lotfi, Andrew Gordon Wilson (2021)

Scaling Laws and Compute

Observational Scaling Laws and the Predictability of Language Model Performance
Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto (2024)
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Nikhil Sardana, Jacob Portes, Sasha Doubov, Jonathan Frankle (2023)
The Quantization Model of Neural Scaling
Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark (2023)
Optimally Allocating Compute Between Inference and Training
Ege Erdil (2024)

Why this is notable

The core argument of this piece, that we should expect investment parity between inference and training, is an underrated starting point for thinking about scaling policies (even if the parity prediction turns out to be wrong because of reasoning models).
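
As a toy illustration of why a training/inference tradeoff can imply roughly equal spending (my own sketch, not taken from the piece): assume, hypothetically, that hitting a fixed performance level requires T^a * I^b = k, where T is training compute, I is total inference compute over deployment, and a, b, k are made-up constants. Minimizing T + I along that curve gives T / I = a / b, so spending sits at parity exactly when the two exponents match. The function name optimal_split and the grid search are illustrative choices.

```python
def optimal_split(a: float, b: float, k: float = 1.0):
    """Minimize total compute T + I subject to T**a * I**b == k.

    Hypothetical power-law iso-performance curve; a, b, k are illustrative
    constants, not values from any paper. Uses a log-spaced grid over T.
    """
    best = None
    for i in range(-4000, 4001):
        T = 10 ** (i / 1000)             # training compute, 1e-4 .. 1e4
        I = (k / T ** a) ** (1 / b)      # inference compute needed to stay on the curve
        total = T + I
        if best is None or total < best[0]:
            best = (total, T, I)
    return best[1], best[2]


if __name__ == "__main__":
    for a, b in [(1.0, 1.0), (2.0, 1.0)]:
        T, I = optimal_split(a, b)
        print(f"a={a}, b={b}: T/I = {T / I:.3f} (analytic a/b = {a / b:.3f})")
```

With a = b the minimizer sits at T = I; skewing the exponents to a = 2, b = 1 moves the optimum to about twice as much training as inference compute.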

Misc. Safety/Elicitation

Adversaries Can Misuse Combinations of Safe Models
Erik Jones, Anca Dragan, Jacob Steinhardt (2024)
Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger (2024)
Eliciting Latent Knowledge from Quirky Language Models
Alex Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose (2024)
Eliciting Language Model Behaviors with Investigator Agents
Transluce (2024)
Mechanistically Eliciting Latent Behaviors in Language Models and
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack, Alex Turner (2024)

Security/Control

Securing AI Model Weights
Sella Nevo, Dan Lahav, Ajay Karpur, Yogev Bar-On, Henry Alexander Bradley, Jeff Alstott (2024)
Preventing Model Exfiltration with Upload Limits
Ryan Greenblatt (2024)
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Andy K. Zhang [...] Percy Liang (2024)
A basic systems architecture for AI agents that do autonomous research
Buck Shlegeris (2024)
How XBOW found a Scoold authentication bypass
Nico Waisman, Brendan Dolan-Gavitt (2024)

Evals

Analyzing Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer (2024)
Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions
Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, Nick Haber (2023)
Consistency Checks for Language Model Forecasters
Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Adam Shen, Daniel Paleka (2024)
Large Language Model Benchmarks Do Not Test Reliability
Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry (2024)
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang (2024)