Essays and my paper feed.
Paper Feed
Monthly notes on under-appreciated research. Latest issue: April 2026.
Finding Widespread Cheating on Popular Agent Benchmarks
We find over 1,000 instances of cheating across 28+ submissions on 9 benchmarks, including the top 3 Terminal-Bench 2 agents.
Introducing OpenConjecture, a living dataset of mathematics conjectures from arXiv
We are releasing OpenConjecture, a dataset of (currently) 890 unproved conjectures from recent arXiv math papers. On a small subset, GPT-5.4 finds candidate proofs or counterexamples, and formalizes several in Lean.