The script can be found on Kris’s GitHub. Here, I’m taking his output and banging it into my open notebook templates. I’ll need to fix some layout stuff, I’m sure.

Here goes!

Experiment - Determining Bad OCR via Automated Spellcheck

all editions

6600 files were downloaded; 333 files appear to be these missing editions with the placeholder text. I have not yet manually verified all of this… which is partly the point, right? (shawn.graham)


Experiment - Determining Bad OCR via Automated Spellcheck

each text file

6267 print editions, from 1893 to 2010 (shawn.graham)


Experiment - Determining Bad OCR via Automated Spellcheck

.75 range

that is, from .72 to .9. They all have the same placeholder text, but the quality of the OCR produces some consistent errors, which is interesting. (shawn.graham)
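
For anyone following along, here is a minimal sketch of the kind of scoring these notes refer to: the share of words in a text file that a spellchecker would recognize, plus a check for whether that score lands in the .72 to .9 band mentioned above. This is not Kris’s script. The function names, the tiny vocabulary, and the sample strings are all hypothetical, and a real run would presumably use a full dictionary and the actual files.

```python
import re


def spellcheck_score(text, vocabulary):
    """Return the fraction of alphabetic tokens found in `vocabulary`.

    Returns 0.0 when the text has no alphabetic tokens at all.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


def in_range(score, low=0.72, high=0.9):
    """True when a score falls inside the band noted in the annotations."""
    return low <= score <= high


# Hypothetical stand-ins: a toy word list and two made-up sample strings.
vocabulary = {"this", "edition", "is", "not", "available"}
clean = "This edition is not available"
noisy = "Th1s editlon is not available"  # simulated OCR errors

clean_score = spellcheck_score(clean, vocabulary)
noisy_score = spellcheck_score(noisy, vocabulary)
```

The digit in "Th1s" splits the word into two unrecognized fragments, so one bad character costs two tokens. That is one way identical placeholder text could end up with different scores from file to file, though I have not checked whether that is what is happening here.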