Paper Feed: March 2025

Highlighting research that I find interesting and that I think may deserve more attention from academia, government, or the AI safety community (as of 03/10/25).

For the latest edition, see here.

Evals

Do Large Language Model Benchmarks Test Reliability?
Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry (2025)

Why this is notable

Offers an operationalization of reliability for LLMs (consistently giving correct answers on existing 'saturated' benchmarks) and shows that existing benchmarks test this poorly because of mislabeled answers and incoherently worded questions. Provides revised 'platinum' versions of existing benchmarks that can test for it.
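One way to make "consistently giving correct answers" concrete is to sample each question several times and count a question as reliably solved only if every sample is correct. The sketch below illustrates that reading; it is an assumption for illustration rather than the paper's exact metric, and the function name and toy data are hypothetical.

```python
from typing import Dict, List


def reliability_report(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize repeated-sample correctness per question.

    `results` maps a question id to one boolean per sampled answer
    (True = correct). All names and data here are hypothetical.
    """
    if not results:
        raise ValueError("results must contain at least one question")
    if any(len(samples) == 0 for samples in results.values()):
        raise ValueError("every question needs at least one sample")

    n = len(results)
    # Average accuracy: the usual benchmark number, averaged over samples.
    mean_accuracy = sum(sum(s) / len(s) for s in results.values()) / n
    # Reliability: fraction of questions answered correctly on every sample.
    always_correct = sum(all(s) for s in results.values()) / n
    return {"mean_accuracy": mean_accuracy, "always_correct": always_correct}


toy = {
    "q1": [True, True, True, True],
    "q2": [True, True, False, True],
    "q3": [True, True, True, True],
    "q4": [False, True, True, True],
}
report = reliability_report(toy)
print(report)
```

On this toy data the average accuracy is 0.875, yet only half of the questions are answered correctly every time, which is the kind of gap a reliability-focused benchmark is meant to expose.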
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
Ludovico Mitchener, Jon M Laurent, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani, Samuel G Rodriques (2025)
Analyzing Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer (2024)
Consistency Checks for Language Model Forecasters
Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez, Vineeth Bhat, Adam Shen, Evan Wang, Florian Tramèr (2024)
OpenPhil RFP: Improving Capability Evaluations
Catherine Brewer, Alex Lawsen (2025)

Why this is notable

A characteristically thorough RFP doubling as a well-reasoned position paper on evals. See also their Technical AI safety RFP.

Science of DL

Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah, Andrew Ilyas, Aleksander Madry (2024)

Why this is notable

Introduces component modeling, a task for interpretability and editing methods in which a meta-model is trained to predict the effect of ablating individual model components on a single example. It would be great to see serious follow-up work attempting to scale component modeling to larger models. My guess is that, despite its promise, this line of work is neglected because it's engineering-heavy and isn't fully de-risked (e.g. it's not clear how well it will scale, how to improve sampling, etc.).
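To illustrate the component modeling task, the sketch below fits a linear meta-model that predicts one example's output from a binary ablation mask over components. The "model" is a synthetic stand-in (the hypothetical function `toy_output`), and the difference-in-means fit is a simplification chosen so the example needs only the standard library; it is not a description of the paper's procedure.

```python
import random

random.seed(0)
N_COMPONENTS = 8

# Hypothetical per-component contributions to one example's output.
TRUE_EFFECTS = [random.gauss(0.0, 1.0) for _ in range(N_COMPONENTS)]


def toy_output(mask):
    """Stand-in for running the model with some components ablated.

    mask[i] == 1 keeps component i; mask[i] == 0 ablates it. The small
    pairwise term makes the output slightly non-linear in the mask.
    """
    linear = sum(m * e for m, e in zip(mask, TRUE_EFFECTS))
    return linear + 0.05 * mask[0] * mask[1]


def random_mask():
    return [random.randint(0, 1) for _ in range(N_COMPONENTS)]


def mean(values):
    return sum(values) / len(values)


# Collect (ablation mask, output) pairs for the single example.
train = [(m, toy_output(m)) for m in (random_mask() for _ in range(2000))]

# Linear meta-model: weight_i = mean output with component i kept minus
# mean output with it ablated (reasonable because masks are independent).
weights = []
for i in range(N_COMPONENTS):
    kept = [y for m, y in train if m[i] == 1]
    ablated = [y for m, y in train if m[i] == 0]
    weights.append(mean(kept) - mean(ablated))
bias = mean([y for _, y in train]) - sum(
    w * mean([m[i] for m, _ in train]) for i, w in enumerate(weights)
)


def predict(mask):
    return bias + sum(w * m for w, m in zip(weights, mask))


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Evaluate on fresh masks that were not used to fit the weights.
held_out = [random_mask() for _ in range(500)]
correlation = pearson([predict(m) for m in held_out],
                      [toy_output(m) for m in held_out])
print(f"held-out correlation: {correlation:.3f}")
```

The held-out correlation is the quantity of interest here: it measures how well the meta-model predicts the effect of ablations it has not seen. In a real setting each call to `toy_output` would be a forward pass with components ablated, which is where the engineering cost comes from.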
Forecasting Rare Language Model Behaviors
Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma (2025)

Why this is notable

Could also plausibly fall under an "Elicitation" category, but I'm including it under Science of DL because of the new phenomena: scaling laws for elicitation (or for a 'most-effective jailbreak') and the generally high forecastability of elicitation and behavior. I'm quite excited to see how this line of research develops, particularly experiments on distribution shifts at deployment time and better, cheaper, less biased methods for what the authors term usefulness and correctness.
Deep Learning is Not So Mysterious or Different
Andrew Gordon Wilson (2025)
Estimating the Probability of Sampling a Trained Neural Network at Random
Adam Scherlis, Nora Belrose (2025)

Scaling and Compute

Train Once, Deploy Many: AI and Increasing Returns
Ege Erdil, Tamay Besiroglu

Why this is notable

A compelling first-pass argument for why AIs will likely have increasing returns to scale, over and above those attained by human workers: the ability to trade off between training and inference compute, i.e. "train once and deploy many." As the authors have noted elsewhere, the strength of this effect depends on the actual technique used to make that trade-off.
Biology AI models are scaling 2-4x per year after rapid growth from 2019-2021
Pablo Villalobos, David Atanasov (2025)
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Nikhil Sardana, Jacob Portes, Sasha Doubov, Jonathan Frankle (2023)

General Safety

Adversaries Can Misuse Combinations of Safe Models
Erik Jones, Anca Dragan, Jacob Steinhardt (2024)

Why this is notable

See description for Glukhov paper below.
Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses
David Glukhov, Ziwen Han, Ilia Shumailov, Vardan Papyan, Nicolas Papernot (2024)

Why this is notable

This paper and the Jones paper above it point out a fundamental limitation of pointwise defenses: they cannot pick up on distributed harm (e.g. when a 'harmful' question is decomposed into multiple 'harmless' questions). These lines of work are roughly dual to the 'pointwise-undetectable' attacks on finetuning APIs, e.g. from Xander Davies/UK AISI. Both papers show that even if an attacker does not have access to the finetuning API, extracting diffuse but useful information from a model is still quite easy.
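The limitation of pointwise defenses can be shown with a toy calculation. In the sketch below, each response carries a made-up "leakage score", scores are assumed to add up across a session, and both thresholds are hypothetical; none of this is taken from either paper.

```python
from typing import List

POINTWISE_THRESHOLD = 1.0  # hypothetical per-response limit
SESSION_BUDGET = 1.0       # hypothetical limit on total leakage per session


def pointwise_allows(leak_scores: List[float]) -> bool:
    """A pointwise defense inspects each response in isolation."""
    return all(score < POINTWISE_THRESHOLD for score in leak_scores)


def session_allows(leak_scores: List[float]) -> bool:
    """A session-level check bounds the total across all responses."""
    return sum(leak_scores) < SESSION_BUDGET


# One direct question versus the same request split into four sub-questions.
direct = [1.2]
decomposed = [0.3, 0.3, 0.3, 0.3]

print(pointwise_allows(direct), pointwise_allows(decomposed))
print(session_allows(direct), session_allows(decomposed))
```

The pointwise check blocks the direct question but lets the decomposed version through, even though the total is the same; only the session-level check rejects both. The additivity assumption is doing real work here, and measuring leakage in practice is much harder than summing scores.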
When should we worry about AI power-seeking?
Joe Carlsmith (2025)
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
Ariba Khan, Stephen Casper, Dylan Hadfield-Menell (2025)
Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger (2025)
Learning Task Decomposition to Assist Humans in Competitive Programming
Jiaxin Wen, Ruiqi Zhong, Pei Ke, Zhihong Shao, Hongning Wang, Minlie Huang (2024)

Security/Control

ControlArena (research preview)
UK AISI (2025)
A basic systems architecture for AI agents that do autonomous research
Buck Shlegeris (2024)
How XBOW found a Scoold authentication bypass
Nico Waisman, Brendan Dolan-Gavitt (2024)
Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography
Ilia Shumailov, Daniel Ramage, Sarah Meiklejohn, Peter Kairouz, Florian Hartmann, Borja Balle, Eugene Bagdasarian (2025)