Paper Feed: June 2025

Highlighting research I find interesting and think may deserve more attention from academia, government, or the AI safety community (as of 06/03/25).

Evals

  • BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
    Andy K. Zhang, Joey Ji, […] Daniel E. Ho, Percy Liang (2025)
  • Cross-domain time horizons
    Thomas Kwa (2025)
  • System Card: Claude Opus 4 & Claude Sonnet 4
    Anthropic (2025)
  • Why hasn't AI taken your job yet?
    John Burn-Murdoch (2025)
    Why this is notable
The general trends (but not the offsets) between messy and non-messy tasks are roughly similar (plot below, created from Figure 9 in the METR data). 'Messy' tasks are tasks more like those found in the real world, as measured by features (e.g., "from a real life source" or "potential for irreversible mistakes") designed to have real-world relevance. This is a nice framing. However, the factors (Appendix D in Measuring AI Ability to Complete Long Tasks) are somewhat ad hoc and model-centric. For example, the factors "self-modification required" and "self improvement required" are undermotivated and have no clear human analogues. It would be great if messiness were made more coherent and rigorous.
    Evolution of AI models' task success rates, by 'messiness' of the task
  • Claude Just Refereed the Anthropic Economic Index. Reviewer 2 Has Thoughts
    Andrey Fradkin and Seth Benzell (Apr 21, 2025)
    Why this is notable
A common complaint is that professional economists do not typically take trends in AI seriously. I've only listened to a few of the most recent episodes, but I've found this series on the economics of AI useful for understanding the perspective of mainstream(ish) economics.
    Some relevant excerpts from this podcast:
    Excerpt 1:
    This is a nice point - how do usage patterns change when they've just released a new model? Are we seeing a fundamental change in the usage patterns or mostly more of the same? Is it a slow drift or a sharp discontinuity? There are so many questions to answer with this type of data, but not necessarily economic ones.
    Excerpt 2:
    Seth: What I make of this is that the title of this paper should just be "Which Tasks Are Performed with AI," not "Which Economic Tasks." It's not clear what makes a task economic. In my opinion, a task is economic if it's either some sort of Robinson Crusoe economy where even if I'm not interacting with anyone, this is an economic behavior because I'm building a thing that I'm going to use, or what makes something economic is that I'm participating in a market with this thing and I'm going to buy it and sell it after I go through these steps.
"My video game is crashing cause I only have eight gigabytes of RAM" doesn't sound like either of those. It sounds like this guy is troubleshooting his consumption, which maybe could be thought of as the consumer taking on some of the job of customer service. The other example, "Can you make sure this blog post follows Chicago style?" - if I'm making an artistic or creative project that I'm just putting out on the internet for people, again, I'm not sure I would call that economic activity. So no problems with this paper being about measuring what activities or tasks people do with AI, but I think it's probably a bridge too far to call these economic tasks.
    Andrey: I think I agree with you. There needs to be more metadata around these conversations. A survey of whether users are using this for their job or not could be really informative, or even just a subset analysis of just the pro users who are more likely to be using this for their job.
    I do think it's an interesting phenomenon of substituting professional labor with personal labor. Hal Varian used to bring up this example all the time with YouTube - before, you'd hire someone to repair your appliance or do work around the house, but now you can watch a YouTube video and do it yourself. This means YouTube is generating tremendous economic value that's not being measured. I think both of us are generally on board with that idea - GDP is going to miss a bunch of interesting activity just by virtue of how it's measured. But especially for an academic contribution, we want a more rigorous analysis.
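The clean-versus-messy comparison discussed under the METR entry above can be sketched minimally: assign each task a scalar messiness score, split on a threshold, and compare the groups' success rates. All scores, outcomes, and the 0.5 threshold below are invented for illustration; METR's actual analysis aggregates a checklist of binary messiness factors per task, which this sketch does not reproduce.

```python
# Hypothetical sketch of a success-rate-by-messiness split.
# The task data and threshold are made up for illustration only.
from statistics import mean

# Each entry: (messiness_score, succeeded). All values invented.
tasks = [
    (0.1, True), (0.2, True), (0.3, False), (0.2, True),    # low messiness
    (0.7, False), (0.8, False), (0.9, True), (0.6, False),  # high messiness
]

def success_rate_by_messiness(tasks, threshold=0.5):
    """Split tasks into 'clean' (score < threshold) and 'messy'
    (score >= threshold) groups and return each group's success rate."""
    clean = [ok for score, ok in tasks if score < threshold]
    messy = [ok for score, ok in tasks if score >= threshold]
    return mean(clean), mean(messy)

clean_rate, messy_rate = success_rate_by_messiness(tasks)
print(f"clean: {clean_rate:.2f}, messy: {messy_rate:.2f}")  # clean: 0.75, messy: 0.25
```

With real data one would also track the model release date per task attempt, so the two rates become two curves over time, which is what the plot referenced above shows.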

Science of DL / Interpretability

Technical AI Governance

Compute / Scaling / Reasoning

General Safety

Miscellaneous