Paper Feed: April 2026
Monitoring and Misalignment
Generalization
-
How far does alignment midtraining generalize?
Tomek Korbak, Cameron Raymond, Micah Carroll, [...], Ian Kivlichan (2026)
-
Scaling Reward Modeling without Human Supervision
Jingxuan Fan, Yueying Li, Zhenting Qi, [...], Hanlin Zhang (2026)
-
AIs Should Have Proactive Prosocial Drives
Tom Davidson, William MacAskill (2026)
-
Are AIs more likely to pursue on-episode or beyond-episode reward?
Anders Cairns Woodruff, Alex Mallen (2026)
-
Running list of conjectures about neural networks
Charles Foster (2023 - present)
Security
-
Private Post-Training and Inference for Frontier Models
Rudolf Laine, Tanya Verma, Daniel McCann-Sayles, Jules Drean (2026)
-
My computer got self-hacked because of OpenClaw
Aaron Zhao, Ilia Shumailov, Cheng Zhang, [...], Zehui Li (2026)
-
Boundary Point Jailbreaking of Black-Box LLMs
Xander Davies, Giorgi Giglemiani, Edmund Lau, [...], Yarin Gal (2026)
-
Quantifying Frontier LLM Capabilities for Container Sandbox Escape
Rahul Marchand, Art O Cathain, Jerome Wynne, [...], Harry Coppock (2026)
AI Economics and Forecasting
Miscellaneous