textreuse package for R
Lincoln Mullen has a wonderful R package on rOpenSci for detecting and measuring text reuse in a corpus of material (the kind of thing that is enormously useful if you’re interested in 19th century print culture, for instance). [https://github.com/ropensci/textreuse] I wondered what I would find if I fed it the corpus I’ve collected (see this gist) concerning the trade in human remains on Instagram. (It looks for n-grams 5 words long, which means that I end up looking at 3k posts from my initial corpus of 13k.) We’re writing all of this up for submission shortly, so this textreuse work isn’t in our paper yet; but anyway, a preview…
A score of ‘1’ indicates a perfect match. After running my materials through, I found many posts scoring 1. I thought, hmm, probably an error? Or perhaps duplicate entries had found their way into my corpus? But after hand-checking several I realized that no, the image is always different. So that’s interesting: people selling this material use the same language time and time again. Let’s consider some of it. We’ll start with this post:
Real human skull for sale, message me for more info. #skull #skulls #skullforsale #humanskull #humanskullforsale #realhumanskull #realhumanskullforsale #curio #curiosity
A post that scored 1 for similarity has the exact same text but a vastly different photograph. I’m not going to link to the photos or posts here because I don’t want to encourage this. A post at 0.9375 similarity has one extra hashtag appended to the text (and of course, a different photo):
Real human skull for sale, message me for more info. #skull #skulls #skullforsale #humanskull #humanskullforsale #realhumanskull #realhumanskullforsale #curio #curiosity #dead
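To see where 0.9375 comes from, here is a minimal base R sketch of the arithmetic. It assumes a simplified tokenizer (lowercase everything, split on anything that isn’t a letter or digit), which may differ in detail from what textreuse does internally; `shingles` and `jaccard` are names made up for this illustration, not functions from the package.

```r
# Split a post into lowercase words, then into overlapping n-word shingles.
# Assumption: a simplified tokenizer, not textreuse's own.
shingles <- function(text, n = 5) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  starts <- seq_len(max(0, length(words) - n + 1))
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Jaccard similarity: size of the intersection over size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

post_a <- paste(
  "Real human skull for sale, message me for more info.",
  "#skull #skulls #skullforsale #humanskull #humanskullforsale",
  "#realhumanskull #realhumanskullforsale #curio #curiosity"
)
post_b <- paste(post_a, "#dead")  # same text plus one extra hashtag

a <- shingles(post_a)
b <- shingles(post_b)

length(a)      # 15 shingles
length(b)      # 16 shingles
jaccard(a, b)  # 15 / 16 = 0.9375
```

The first post has 19 words and so 15 five-word shingles; the extra hashtag adds exactly one new shingle, so the two posts share 15 of 16.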
And so on, until we’re at around 0.5 for our score:
Skull and arm £400 for the pair. One of the fingers on the hand is missing it’s tip and the whole arm needs glue removing and tidying up a bit. Real human skull for sale, message me for more info. #skull #skulls #skullforsale #humanskull #humanskullforsale #realhumanskull #realhumanskullforsale #curio #curiosity #dead
These posts are all by the same individual. That one phrase, ‘Real human skull for sale, message me for more info’, together with that sequence of hashtags, is as good an identifier for this individual as any username, I think. I’m still going through these results, but it occurs to me that I might find different users using very similar language. If I found that, it would be very interesting indeed: a sign of influence between users? A sign of community? A kind of shibboleth, a marker of belonging?
Other implications?
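One way to go looking for shared language across accounts, once the scores exist, might be sketched like this. The sketch assumes the scores table has the columns `a`, `b` and `score`, and that I have some lookup from post id to username; the `authors` vector, the ids, the usernames, the score values and the 0.5 cutoff are all invented here purely for illustration.

```r
# Toy stand-in for a table of pairwise scores (invented values).
scores <- data.frame(
  a = c("post1", "post1", "post2"),
  b = c("post2", "post3", "post3"),
  score = c(1, 0.9375, 0.5),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from post id to the account that posted it.
authors <- c(post1 = "user_x", post2 = "user_x", post3 = "user_y")

# Attach the account name for each side of every pair.
scores$user_a <- unname(authors[scores$a])
scores$user_b <- unname(authors[scores$b])

# Keep pairs that are similar (arbitrary cutoff) AND posted by different accounts.
cross_user <- scores[scores$user_a != scores$user_b & scores$score >= 0.5, ]
cross_user
```

Anything that survives the filter is a pair worth reading by hand; the cutoff is a judgment call, not something the package prescribes.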
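library(textreuse)
# Assumption: "posts" is a local directory of plain-text files, one post per file
dir <- "posts"
# 200 minhashes per document; the fixed seed makes the signatures reproducible
minhash <- minhash_generator(200, seed = 235)
# Tokenize each post into 5-word n-grams and compute its minhash signature
ats <- TextReuseCorpus(dir = dir,
                       tokenizer = tokenize_ngrams, n = 5,
                       minhash_func = minhash)
# Bucket the signatures (50 bands of 4 hashes each), then pull out candidate pairs
buckets <- lsh(ats, bands = 50, progress = FALSE)
candidates <- lsh_candidates(buckets)
# Score each candidate pair with Jaccard similarity
scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
scores
write.csv(scores, file = "textreusescores.csv")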
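A note on those two numbers, 200 hashes and 50 bands. Under the usual textbook analysis of banded minhash LSH, a pair of documents with Jaccard similarity s ends up as a candidate with probability 1 - (1 - s^r)^b, where b is the number of bands and r the number of hashes per band. The base R sketch below just works through that formula for the settings above; `candidate_prob` is a name invented for this illustration. I believe textreuse ships its own helpers for this kind of calculation, but the sketch deliberately avoids depending on them.

```r
h <- 200    # number of minhashes, as passed to minhash_generator()
b <- 50     # number of bands, as passed to lsh()
r <- h / b  # hashes (rows) per band

# Probability that a pair with Jaccard similarity s shares at least one bucket.
candidate_prob <- function(s, r, b) 1 - (1 - s^r)^b

# Common approximation of the similarity at which that probability rises steeply.
threshold <- (1 / b)^(1 / r)

threshold                                    # roughly 0.376
candidate_prob(c(0.25, 0.5, 0.9375), r, b)   # chance of detection at each score
```

The practical upshot, if this analysis applies as I think it does: pairs scoring well below about 0.38 can easily be missed at the candidate stage, so an absence of low scores in the output is not evidence that weakly similar posts don’t exist.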
Created: 16 Jan 2017 | Modified: 23 Jun 2017