Paper Feed: May 2025

Highlighting research I find interesting and think may deserve more attention (as of 05/03/25) from academia, government, or the AI safety community.

For the latest edition, see here.

Science of DL / Interpretability

The Hidden Space of Transformer Language Adapters
Jesujoba O. Alabi, […], Mor Geva (2024)
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, […], Atticus Geiger (2025)
Foundation Models Secretly Understand Neural-Network Weights: Enhancing Hypernetworks with FMs
Jeffrey Gu, Serena Yeung-Levy (2025)
Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression
Kevin Sun, Shibo Li, Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal (2025)
Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit (2023)

Evals

Introducing Docent
Transluce (2025)
VIBECHECK: Discover & Quantify Qualitative Differences in Large Language Models
Lisa Dunlap, Krishna Mandal, […], Joseph E Gonzalez (2024)
Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction
Michal Bravansky, […], Robert Kirk (2025)
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
Ariba Khan, Stephen Casper, Dylan Hadfield-Menell (2025)
Expanding on Sycophancy
OpenAI (2025)

Compute / Scaling / Reasoning

What's going on with AI progress and trends as of 5/2025?
Ryan Greenblatt (2025)

Why this is notable

Argues that algorithmic progress is more like 4.5x as opposed to 3x, which seems true: performance gains from RL, etc. don't seem to be well reflected in the loss. See also this Alignment Forum comment.
The Case for Multi-Decade AI Timelines
Ege Erdil (2025)
Gemini Flash Pre-training
Vlad Feinberg (2025)

Why this is notable

"You wouldn't know it from the paper/its appendices, but what happened is that the Funsearch team tried to use larger and smaller models in the middle of the loop; they had best results with a mid-sized candidate (that I trained with Emanuel Taropa and Rohan Anil). I always found this to be an interesting tidbit: in generative search you need to strike the right balance of proposal frequency with evaluation. Formalize. Maybe even apply it to the verified RL setting."
An LLM CodeForces Champion Is Not Taking Your SWE Job (yet)
Nina Panickssery (2025)
How Can Representation Dimension Dominate Structurally Pruned LLMs?
Mingxue Xu, Lisa Alazraki, Danilo P. Mandic (2025)

General Safety

Putting Up Bumpers
Sam Bowman (2025)
How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Super-Intelligence
Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving (2025)
Persistent Pre-Training Poisoning of LLMs
Yiming Zhang, Javier Rando, […], Florian Tramèr, Daphne Ippolito (2024)
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
Andis Draguns, Andrew Gritsevskiy, […], Christian Schroeder de Witt (2024)
Ctrl-Z: Controlling AI Agents via Resampling
Aryan Bhatt, Cody Rushing, […], Akbir Khan, Buck Shlegeris (2025)

AI Governance

What Does It Take to Catch a Chinchilla? Verifying Rules on Large-Scale NN Training via Compute Monitoring
Yonadav Shavit (2023)
Bare Minimum Mitigations for Autonomous AI Development
Joshua Clymer, Isabella Duan, […], Jingren Wang, Min Yang, Xianyuan Zhan (2025)