Paper Feed: June 2026

Highlighting research I find interesting and think may deserve more attention (as of 06/01/26).

Alignment

  • Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
    Jasmine Li, Alex Turner (2026)
    Some notes
    Keen to see more evaluation of this method on other model organisms (MOs). Excerpt:

    Cooperation SDF fails on an unreleased covert misalignment MO. Sam Marks' group trained an unreleased Qwen3-32B to be covertly misaligned. On that MO, cooperation SDF closes 1/3 of the gap (less than a control SDF!) and anti-cooperation SDF fails to widen the gap.

    We trained at a low LoRA rank of 8. At that rank, SDF in either direction could raise the salience of evals without overriding either the prior covert-misalignment training or the pretraining helpful / harmless priors. Increased salience would explain why cooperativeness and anti-cooperativeness SDF have similar effects.
  • Automated alignment is harder than you think
    Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau, Geoffrey Irving (2026)

Security and Control

AI Economics and Forecasting

Training and Generalization