Paper Feed: March 2025

Highlighting research I find interesting and think may deserve more attention (as of 03/10/25) from academia, government, or the AI safety community.

For the latest edition, see here.

Evals

Science of DL

  • Decomposing and Editing Predictions by Modeling Model Computation
    Harshay Shah, Andrew Ilyas, Aleksander Madry (2024)
    Why this is notable
    Introduces component modeling, a task for training interpretability/editing methods in which a meta-model is trained to predict the effect of ablating individual model components for a single example. It would be great to see serious follow-up work attempting to scale component modeling to larger models. My guess is that despite its promise, this line of work is neglected because it's engineering-heavy and isn't fully de-risked (e.g. it's not clear how well it will scale, how to improve sampling, etc.).
  • Forecasting Rare Language Model Behaviors
    Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma (2025)
    Why this is notable
    Could also plausibly fall under the "Elicitation" category, but I'm including it in the science of DL category because of the new phenomena: scaling laws for elicitation / a 'most-effective jailbreak', and the generally high forecastability of elicitation and behavior. Quite excited to see how this line of research develops, particularly around experiments on distribution shifts at deployment time and around better, less costly, less biased methods for what they term usefulness and correctness.
  • Deep Learning is Not So Mysterious or Different
    Andrew Gordon Wilson (2025)
  • Estimating the Probability of Sampling a Trained Neural Network at Random
    Adam Scherlis, Nora Belrose (2025)

Scaling and Compute

General Safety

Security/Control