Keen to see more evaluation of this method on other model organisms. Excerpt:
Cooperation SDF fails on an unreleased covert-misalignment MO. Sam Marks' group trained an unreleased Qwen3-32B to be covertly misaligned. On that MO, cooperation SDF closes 1/3 of the gap (less than a control SDF!) and anti-cooperation SDF fails to widen the gap.
We trained at a low LoRA rank of 8. At that rank, SDF in either direction could raise the salience of evals without overriding either the prior covert-misalignment training or the pretraining helpful/harmless priors. Increased salience would explain why cooperation SDF and anti-cooperation SDF have similar effects.
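For readers unsure what "closes 1/3 of the gap" means, here is a minimal sketch of one plausible reading: the fraction of the distance between the MO's score and an aligned reference model's score that the treatment removes. The excerpt does not say which metric or reference the authors used, so the function name `gap_closed`, the choice of reference, and all the numbers below are hypothetical and only illustrate the arithmetic.

```python
def gap_closed(mo_score: float, reference_score: float, treated_score: float) -> float:
    """Fraction of the MO-to-reference gap closed by a treatment.

    1.0 means the treated model matches the reference, 0.0 means no change,
    and a negative value means the gap widened.
    """
    gap = mo_score - reference_score
    if gap == 0:
        raise ValueError("no gap between the MO and the reference model")
    return (mo_score - treated_score) / gap


# Hypothetical misalignment rates in percent, NOT results from the post.
MO_RATE = 60.0         # misaligned model organism, before any SDF
REFERENCE_RATE = 0.0   # aligned reference model
AFTER_SDF_RATE = 40.0  # model organism after cooperation SDF

fraction = gap_closed(MO_RATE, REFERENCE_RATE, AFTER_SDF_RATE)
print(f"gap closed: {fraction:.2f}")
```

Under this reading, "anti-cooperation SDF fails to widen the gap" would correspond to a value that is not negative, and the comparison with the control SDF amounts to comparing two such fractions.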