Paper Feed

Highlighting research I find interesting and think may deserve more attention (as of 06/01/26).

Alignment

Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
Jasmine Li, Alex Turner (2026)

Some notes

Keen to see more evaluation of this method on other model organisms. Excerpt:

Cooperation SDF fails on an unreleased covert misalignment MO. Sam Marks' group trained an unreleased Qwen3-32B to be covertly misaligned. On that MO, cooperation SDF closes 1/3 of the gap (less than a control SDF!) and anti-cooperation SDF fails to widen the gap.

We trained at a low LoRA rank of 8. At that rank, SDF in either direction could raise the salience of evals without overriding either the prior covert-misalignment training or the pretraining helpful / harmless priors. Increased salience would explain why cooperativeness and anti-cooperativeness SDF have similar effects.

Automated alignment is harder than you think
Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau, Geoffrey Irving (2026)

Security and Control

Risk reports need to address deployment-time spread of misalignment
Alex Mallen (2026)
Advice for making robust-to-training model organisms
Alek Westover, Sebastian Prasanna, Vivek Hebbar, Julian Stastny, [...] (2026)

Some notes

Shallow finetuning is weird, and this makes it hard to build decent model organisms.

TIL about The Optimizer's Curse.

Incriminating misaligned AI models via distillation
Alek Westover, Sebastian Prasanna, Alex Mallen, Alexa Pan, [...] (2026)
How Useful Is Cross-Domain Generalization for Training LLM Monitors?
Sam Martin, Fabien Roger (2026)
ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
Seunghyun Lee, David Brumley (2026)

AI Economics and Forecasting

Training and Generalization