Experiment - Determining Bad OCR via Automated Spellcheck
Misspelled words as indicator
In ‘Bad Equity’, I ran an experiment trying to sonify the dirtiness of the OCR. Ben Brumfield suggested to me (in the DH Slack) that something like Aspell might be a useful avenue to pursue.
The following script runs each text file in turn through Aspell’s spell checker. The basic idea is to compare the number of ‘misspelled’ words (per Aspell’s English Canadian dictionary) to the total number of words in each document. Thus, the closer that ratio gets to 1, the worse the OCR.
The script
# score each text file: how many words does aspell flag, out of how many words total?
for f in *.txt; do
# run the file through aspell's en_CA dictionary and count the flagged words
totalbadwords="$(cat "$f" | aspell list -d en_CA --encoding utf-8 | wc -w)"
# count every word in the file (wc -w FILE also prints the filename, so each csv row records which file it came from)
totalwords="$(wc -w "$f")"
echo "$totalbadwords, $totalwords" >> "finalscore.csv"
done
(also on Github)
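To run it you need Aspell itself plus an English dictionary set that includes en_CA; on Debian-flavoured systems something like sudo apt-get install aspell aspell-en should do it. Run the script from the directory containing the .txt files.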
The script iterates through each text file. totalbadwords is a variable storing the result of running each file through Aspell’s dictionary to identify ‘mistakes’, with wc counting how many there are. totalwords just counts the number of words in each file. These values are echoed to a csv file so that I can open it in a spreadsheet or wherever to count and graph the results. Yes, I expect I could compute the ratio in the script too, but I’m trying to write scripts that do one thing and one thing only.
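(That said, folding the ratio into the loop would only take a couple of extra lines. A rough sketch of what that could look like; the bc call and the ratioscore.csv name are just illustrative:)

for f in *.txt; do
bad="$(aspell list -d en_CA --encoding utf-8 < "$f" | wc -w)"
total="$(wc -w < "$f")"
# guard against empty files before dividing
if [ "$total" -gt 0 ]; then
# bc gives floating-point division, three decimal places
ratio="$(echo "scale=3; $bad / $total" | bc)"
else
ratio="NA"
fi
echo "$f, $bad, $total, $ratio" >> "ratioscore.csv"
done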
output
Currently, the output is sitting on my laptop, and here on gdocs.
graphs
I tweeted some of the resulting graphs. In reverse order:
https://twitter.com/electricarchaeo/status/730781961506160642
https://twitter.com/electricarchaeo/status/730571625540820993
https://twitter.com/electricarchaeo/status/730571270870437889
interpretation
The consistent ratios in the 0.75 range are all editions where the Provincial Archives have inserted a placeholder for a missing edition.
Filter those out:
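(For the record, the filtering could also be done at the command line. A rough awk sketch, which assumes the two-column rows produced by the script above; the cutoff band between 0.7 and 0.8 and the filtered.csv name are just illustrative:)

# field 1 is the bad-word count (with its trailing comma), field 2 the total word count
awk '{ bad = $1; gsub(",", "", bad); total = $2 + 0;
ratio = (total > 0) ? bad / total : 0;
if (ratio < 0.7 || ratio > 0.8) print }' finalscore.csv > filtered.csv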
Created: 16 May 2016 | Modified: 16 May 2016
- Tags:
- soundbashing
- ocr
- newspapers