Makes a nice argument: interpretability is automation-tractable. Namely, given a doubling time for AI R&D capabilities (informed by the METR work on task horizons), and assuming we will be able to automatically verify interpretability progress (most notably via downstream tasks where interpretability methods improve in time complexity over behavioral methods), interpretability will be automated in the next ~5 years. Some scattered points: tasks must be robustly verifiable; the reward is defined via FLOP reduction + performance improvement; e.g., for component modeling, g is I(M).
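The FLOP reduction + performance reward could be sketched roughly as follows (a minimal illustration only; the functional form, weights, and all names here are my assumptions, not taken from the source):

```python
# Hypothetical sketch of a verifiable reward: an interpretability method
# is rewarded for matching a behavioral baseline's task performance while
# using fewer FLOPs. Weights alpha/beta are illustrative assumptions.

def reward(flops_interp: float, flops_behavioral: float,
           perf_interp: float, perf_behavioral: float,
           alpha: float = 1.0, beta: float = 1.0) -> float:
    """Combine relative FLOP reduction with performance improvement."""
    flop_reduction = (flops_behavioral - flops_interp) / flops_behavioral
    perf_improvement = perf_interp - perf_behavioral
    return alpha * flop_reduction + beta * perf_improvement

# Example: the interpretability method halves compute at equal performance,
# so the reward equals the 50% FLOP reduction.
r = reward(flops_interp=5e12, flops_behavioral=1e13,
           perf_interp=0.90, perf_behavioral=0.90)
```

The point of tying the reward to verifiable quantities (measured FLOPs, measured task performance) is that it makes progress checkable without human judgment, which is what makes the automation loop possible.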