It would be nice to have a BOTEC estimate of how much more expensive synchronous monitoring for misalignment would be. See also this Redwood Research post.
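As a starting point, here is a minimal sketch of what such a BOTEC might look like. The sync-vs-async framing and every number below are placeholder assumptions of mine, not figures from the post; the point is just the structure of the calculation: asynchronous monitoring pays only for the monitor's inference, while synchronous monitoring additionally pays for reserved agent capacity idling during the monitor's latency.

```python
# Minimal BOTEC sketch: cost of synchronous vs. asynchronous monitoring.
# All numbers are made-up placeholder assumptions, not figures from any post.

agent_cost_per_action = 0.01      # $ of agent inference per action (assumed)
monitor_cost_per_action = 0.002   # $ of monitor inference per action (assumed)

# Asynchronous: the monitor runs off the critical path, so monitoring adds
# only the monitor's own inference cost.
async_cost = agent_cost_per_action + monitor_cost_per_action

# Synchronous: each action blocks on the monitor, so capacity reserved for
# the trajectory sits idle during the monitor's latency.
monitor_latency_s = 2.0           # seconds per monitor call (assumed)
idle_cost_per_s = 0.003           # $/s of idle reserved capacity (assumed)
sync_cost = async_cost + monitor_latency_s * idle_cost_per_s

premium = sync_cost / async_cost - 1
print(f"async ${async_cost:.4f}/action, sync ${sync_cost:.4f}/action, "
      f"premium {premium:.0%}")  # -> premium 50% under these assumptions
```

Under these (arbitrary) numbers the synchronous premium is 50%; the real answer hinges on the monitor's latency and on how expensive idle reserved capacity actually is.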
I really like these kinds of red-blue team games, have run one before, and am in the process of running a couple of others. However, my guess is that the use of humans in red-blue team games will soon be less important, as humans become less relevant to actually coming up with and implementing ideas in control or security. (Although it's possible that larger-scale competitions stay relevant with good mechanism design.) To do: how might the mechanism design of these games differ when the players are language model agents rather than humans? (Note: please contact me if you are interested in this!)
Multiple hypotheses are given. I think it would be nice to test this one: "The AI might have learned to stop before running out of context or compaction in training because compaction is bad for task completion." Some anecdotal evidence is provided here: "For Opus 4.5, I've seen many cases where when given a big task it stops right before running out of context." My personal guess, though, is that it is due to the "AI being unreliable in decision making combined with selection effects."
Some evidence that RL priors matter more than priors picked up from pretraining, although this is only a single setting (more or better data, or data introduced at different points of training, might change things).
Given the arguments in the post, I am a bit surprised the authors think on-episode reward seekers are only 55% more likely than beyond-episode reward seekers. One important factor keeping this credence low is that on-episode and beyond-episode reward seekers have very similar motivations (namely, they enjoy a good reward).
Rehashes a standard line of argument: serial bottlenecks (ML experiments, human feedback) start to bind very hard, and the extent to which agents can overcome these bottlenecks (by predicting the results of ML experiments or simulating human feedback) is important. Of course, agents might also be able to design better experiments.
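The serial-bottleneck part of this argument is essentially Amdahl's law: if a fraction s of research time is irreducibly serial (wall-clock ML experiments, human feedback loops), then no matter how much agents speed up everything else, the overall speedup is capped at 1/s. A toy illustration; the values of s and k below are placeholders of mine, not numbers from the post:

```python
# Toy Amdahl's-law illustration of the serial-bottleneck argument.
# s: fraction of research time that remains serial (assumed values below).
# k: speedup agents achieve on the parallelizable remainder.

def overall_speedup(s: float, k: float) -> float:
    """Amdahl's law: overall speedup when a fraction s cannot be sped up."""
    return 1.0 / (s + (1.0 - s) / k)

for s in (0.5, 0.2, 0.05):
    print(f"serial fraction {s:.0%}: 10x elsewhere -> "
          f"{overall_speedup(s, 10):.1f}x overall; cap as k->inf: {1/s:.0f}x")
```

Predicting experiment results or simulating human feedback matters precisely because it shrinks s, raising the cap itself rather than just speeding up the parallel part.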
Ord came out with an analysis arguing the contrary: that improvements have largely come from spending more on inference. If true, this would have significant (and, I think, largely good) implications for the near-term impact of AIs. However, it seems likely to be incorrect: Ord's analysis is much too sensitive to rare / expensive long tasks, and anchored too heavily on older / more expensive reasoning models.