Makes a nice argument: interpretability is automation-tractable. Namely, given a doubling time for AI R&D capabilities (informed by the METR work on task horizons), and assuming we will be able to automatically verify interpretability progress (most notably via downstream tasks where interpretability methods improve in time complexity over behavioral methods), interpretability will be automated in the next ~5 years. Some scattered points: tasks must be robustly verifiable; the reward is defined via FLOP reduction + performance improvement; e.g., for component modeling, g is I(M).
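The FLOP reduction + performance reward could be sketched roughly as follows (a minimal illustration only; the functional form, weights, and all names here are my assumptions, not taken from the source):

```python
# Hypothetical sketch of a verifiable reward: an interpretability method
# is rewarded for matching a behavioral baseline's task performance while
# using fewer FLOPs. Weights alpha/beta are illustrative assumptions.

def reward(flops_interp: float, flops_behavioral: float,
           perf_interp: float, perf_behavioral: float,
           alpha: float = 1.0, beta: float = 1.0) -> float:
    """Combine relative FLOP reduction with performance improvement."""
    flop_reduction = (flops_behavioral - flops_interp) / flops_behavioral
    perf_improvement = perf_interp - perf_behavioral
    return alpha * flop_reduction + beta * perf_improvement

# Example: the interpretability method halves compute at equal performance,
# so the reward equals the 50% FLOP reduction.
r = reward(flops_interp=5e12, flops_behavioral=1e13,
           perf_interp=0.90, perf_behavioral=0.90)
```

The point of tying the reward to verifiable quantities (measured FLOPs, measured task performance) is that it makes progress checkable without human judgment, which is what makes the automation loop possible.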