This paper, and the Jones paper above it, point out a fundamental limitation of pointwise defenses: they cannot pick up on distributed harm (eg when a 'harmful' question is decomposed into multiple 'harmless' questions). These lines of work are kind of dual to the 'pointwise-undetectable' attacks on finetuning APIs,
eg from Xander Davies/UK AISI. This paper and Jones point out that even if an attacker does not have access to the finetuning API extracting diffuse but useful information from a model is still quite easy.