a walkthrough for extracting and manipulating data from opencontext.org

Search for something interesting. I put ‘poggio’ in the search box, and then clicked on the various options to get the architectural fragments. Look at the URL: https://opencontext.org/subjects-search/?prop=oc-gen-cat-object&q=Poggio#15/43.1526/11.4090/19/any/Google-Satellite See all that stuff after the word ‘Poggio’? That’s to generate the map view. We don’t need it.

We’re going to ask for the search results without all of the website extras: no maps, no shiny interface. To do that, we take advantage of the API. With Open Context, if you have a search with a ‘?’ in the URL, you can insert .json just before the question mark, and delete everything from the # sign onwards, like so:

https://opencontext.org/subjects-search/.json?prop=oc-gen-cat-object&q=Poggio

Put that in the address bar. Boom! Lots of stuff! But only one page’s worth, which isn’t a lot of data. To get more per request, we have to add another parameter, the number of rows: rows=100&. Slot that in just before the p in prop= and see what happens.
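If you’ve slotted it in correctly, your URL should now look like this:

https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object&q=Poggio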

That still isn’t all of the records, though. Remove the .json and see what happens when you click on the arrows to page through to the NEXT 100 rows. You get a URL like this:

https://opencontext.org/subjects-search/?rows=100&prop=oc-gen-cat-object&start=100&q=Poggio#15/43.1526/11.4090/19/any/Google-Satellite

So – to recap: the URL is asking for 100 rows at a time, in the general object category, starting from row 100, and grabbing materials that match ‘Poggio’.
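Broken into its parts, the query string works like this:

rows=100 – how many records to return per request
prop=oc-gen-cat-object – limit results to the general object category
start=100 – the record to start from (counting from 0)
q=Poggio – the search term

We now know enough about how Open Context’s API works to grab material.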

A COUPLE OF WAYS ONE COULD GRAB IT:

You could copy n’ paste, but that will only get you one page’s worth of data (and if you tried to put, say, 10791 into the ‘rows’ parameter, you’d just get a time-out error). You’d have to go back to the search page, hit the ‘next’ button, reinsert the .json, and so on, over and over again. Or, you could grab every page automatically. We’ll use a program called wget to do this. (To install wget on your machine, see the Programming Historian.) Wget will interact with the Open Context site to retrieve the data: we feed it a file that contains all of the URLs that we wish to grab, and it saves the results from each URL into its own file. So, open a new text file and paste our search URLs in there like so:

https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&q=Poggio
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=100&q=Poggio
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=200&q=Poggio

…and so on until we’ve covered the full 4000 objects. Tedious? You bet. So we’ll get the computer to generate those URLs for us. Open a new text file, and copy the following in:


urls = ''
f = open('urls.txt', 'w')
# each pass through the loop writes one search url, with the start
# value counting up by 100: 0, 100, 200 ... 3900
for x in range(0, 4000, 100):
    urls = 'https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=%d&q=Poggio\n' % (x)
    f.write(urls)
f.close()

and save it as url-generator.py. This program is written in the Python language. If you’re on a Mac, Python is already installed. If you’re on a Windows machine, you’ll have to download and install it. To run the program, open your terminal (Mac) or command prompt (Windows) and make sure you’re in the same folder where you saved the program. Then type at the prompt:

python url-generator.py

This little program defines an empty container called ‘urls’; it then creates a new file called ‘urls.txt’; then we tell it to write the address of our search into the urls container. See the %d in there? Each time through the loop, the program slots in the current starting point – 0, 100, 200, and so on up to 3900 – so that each line of the file asks for the next batch of 100 records, and writes that address into urls.txt. Go ahead, open it up, and you’ll see.
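The first few lines of urls.txt should look like this:

https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=0&q=Poggio
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=100&q=Poggio
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=200&q=Poggio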

Now we’ll feed that list to wget. The -i flag tells wget to read its URLs from our file, -w 2 makes it wait two seconds between requests, and --limit-rate=10k throttles our download speed, so that we don’t hammer Open Context’s server. At the prompt in your terminal or command line, type:

wget -i urls.txt -r --no-parent -nd -w 2 --limit-rate=10k

You’ll end up with a lot of files in your folder that have no file extension, e.g.,

.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=100&q=Poggio

Select all of these and batch-rename them in your Finder (Mac) or Windows Explorer, so that they have a sensible file name and the extension .json – or use the little script below. We are now going to concatenate these files into a single, properly formatted, .json file. (Note that it is possible for wget to push all of the downloaded information into a single file, but it won’t be a properly formatted json file – it’ll just be a bunch of lumps of different json hanging out together, which we don’t want.)
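If you’d rather not rename hundreds of files by hand, a few lines of Python in the spirit of url-generator.py will do it. This is just a sketch: it assumes the downloaded files all begin with ‘.json’, as in the example above, and the poggio-1.json naming scheme is my own invention – adjust to taste.

import glob, os

# find every downloaded file whose name begins with '.json' and give
# it a sensible, numbered name with a proper .json extension
for i, fname in enumerate(sorted(glob.glob('.json*')), start=1):
    os.rename(fname, 'poggio-%d.json' % i)

Save it as, say, rename-files.py in the same folder as the downloads, and run it with python rename-files.py, just as before.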

We are going to use a piece of software written for NodeJS to concatenate our json files (NodeJS enables us to develop in javascript; it’s useful for lots of other things too). Go to the NodeJS download page and download and install the version for your machine. (Windows users, make sure you select the npm package manager as well during the install procedure.) Once it’s installed, open a terminal or command prompt and type

npm install -g json-concat (mac users, you might need sudo npm install -g json-concat)

This installs the json-concat tool. We’ll now join our files together:

# As simple as this. The output file should come last:
$ json-concat file1.json file2.json file3.json file4.json output.json

… for however many json files you have.

CONGRATULATIONS

You now have downloaded data from Open Context as json, and you’ve compiled that data into a single json file. This ability for data to be called and retrieved programmatically also enables things like the Open Context package for the R statistical software environment. If you’re feeling adventurous, take a look at that.

In Part Two I’ll walk you through using jq to massage the json into a csv file that can be explored in common spreadsheet software. (For a detailed lesson on jq, see the Programming Historian, which also explains why json in the first place.) Of course, lots of the more interesting data viz packages can deal with json itself, but more on that later.

And of course, if you’re looking for some quick and dirty data export, Open Context has recently implemented a ‘cloud download’ button that will export a simplified version of the data directly to csv on your desktop. Look for a little cloud icon with a down arrow at the bottom of your search results page. Now, you might wonder why I didn’t mention that at the outset, but look at it this way: now you know how to get the complete data, and with this knowledge you could even begin building far more complicated visualizations or websites. It was good for you, right? Right? Right.

PS Eric adds: “Also, you can request different types of results from Open Context (see: https://opencontext.org/about/services#tab_query-meta). For instance, if you only want GeoJSON for the results of a search, add “response=geo-record” to the request. That will return just a list of geospatial features, without the metadata about the search, and without the facets. If you want a really simple list of URIs from a search, then add “response=uri”. Finally, if you want a simple list of search results with some descriptive attributes, add “response=uri-meta” to the request.”
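Following Eric’s note, asking for a bare list of URIs from our Poggio search would look something like this (untested, but formed the same way as our earlier requests):

https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object&q=Poggio&response=uri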

JSON

Json is not easy to work with. Fortunately, Matthew Lincoln has written an excellent tutorial on json and jq over at The Programming Historian, which you should go read now. Read the ‘what is json?’ part, at the very least. In essence, json is a text file where keys are paired with values. JQ is a piece of software that enables us to reach into a json file, grab the data we want, and create either new json or csv. If you intend to visualize and explore data using some sort of spreadsheet program, then you’ll need to extract the data you want into a csv that your spreadsheet can digest. If you want to try something like d3 or some other dynamic library for generating web-based visualizations (e.g. p5js), you’ll need json.
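For instance, here is a trivial (invented) bit of json; ‘label’ and ‘count’ are the keys, while ‘architectural fragment’ and 12 are their values:

{
  "label": "architectural fragment",
  "count": 12
}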

JQPLAY

JQ lets us do some fun filtering and parsing, but we won’t download and install it just yet. Instead, we’ll load some sample data into a web-toy called jqplay. This will let us try different ideas out and see the results immediately. In this file called sample.json I have the query results from Open Context – Github recognizes that it is json and that it has geographic data within it, and automatically turns it into a map! To see the raw json, click on the < > button. Copy that data into the json box at jqplay.org.

JQPlay will colour-code the json. Everything in red is a key, everything in black is a value. Keys can be nested, as represented by the indentation. Scroll down through the json – do you see any interesting key:value pairs? Matthew Lincoln’s tutorial at the Programming Historian is one of the most cogent explanations of how this works, and I do recommend you read that piece. Suffice it to say, for now, that if you see an interesting key:value pair that you’d like to extract, you need to figure out just how deeply nested it is. For instance, there is a properties key that seems to have interesting information within it about dates, wares, contexts and so on. Perhaps we’d like to build a query using jq that extracts that information into a csv. It’s nested within the features key pair.
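Schematically, that nesting looks something like this (a trimmed-down sketch of a single record, with most of the keys left out):

{
  "features": [
    {
      "geometry": { ... },
      "properties": {
        "label": "Discovery region (1)",
        "early bce/ce": -700
      }
    }
  ]
}

So, to drill down through features into properties, try entering the following in the filter box: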

.features[] | .properties

You should get something like this:

{
  "id": "#geo-disc-tile-12023202222130313322",
  "href": "https://opencontext.org/search/?disc-geotile=12023202222130313322&prop=oc-gen-cat-object&rows=5&q=Poggio",
  "label": "Discovery region (1)",
  "feature-type": "discovery region (facet)",
  "count": 12,
  "early bce/ce": -700,
  "late bce/ce": -535
}
{
  "id": "#geo-disc-tile-12023202222130313323",
  "href": "https://opencontext.org/search/?disc-geotile=12023202222130313323&prop=oc-gen-cat-object&rows=5&q=Poggio",
  "label": "Discovery region (2)",
  "feature-type": "discovery region (facet)",
  "count": 25,
  "early bce/ce": -700,
  "late bce/ce": -535
}

For the exact syntax of why that works, see Lincoln’s tutorial. I’m going to jump straight to the conclusion now. Let’s say we wanted to grab some of those keys within properties, and turn them into a csv. We tell jq to look inside features and find properties; then we tell it to make a new array containing just those keys within properties that we want; and then we tell it to pipe that information into comma-separated values. Try the following on the sample data:

.features[] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv

…and make sure to tick the ‘raw output’ box at the top right. Ta da! You’ve culled the information of interest from a json file into a csv.
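With the sample data above, the first couple of rows of output should look something like this (fields that a record doesn’t have simply come out as empty values between the commas):

"Discovery region (1)","https://opencontext.org/search/?disc-geotile=12023202222130313322&prop=oc-gen-cat-object&rows=5&q=Poggio",,-700,-535,,
"Discovery region (2)","https://opencontext.org/search/?disc-geotile=12023202222130313323&prop=oc-gen-cat-object&rows=5&q=Poggio",,-700,-535,,

There’s a lot more you can do with jq, but this will get you started.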

GET JQ AND RUN THE QUERY FROM THE TERMINAL OR COMMAND LINE

Install on OS X – instructions from Lincoln: http://programminghistorian.org/lessons/json-and-jq#installation-on-os-x

Install on Windows – instructions from Lincoln: http://programminghistorian.org/lessons/json-and-jq#installation-on-windows

Got JQ installed? Good. Open your terminal or command prompt in the directory where you’ve got your json file with the data you extracted in part 1. Here we go:

jq -r '.features[] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv' data.json > data.csv

So, we invoke jq, we tell it we want the raw output (-r), we give it the filter to apply, we give it the file to apply it to, and we tell it what to name the output.

Take a look at how Lincoln pipes the output of a wget command into jq at the end of his section on ‘invoking jq’. Do you see how we might accelerate this entire process the next time we want data out of Open Context?

COMBINING WGET & JQ

Assuming you’ve got a list of URLs (generated with our script from part one), you can point your firehose of downloaded data directly into jq. The crucial thing is to flag wget with -qO-: the -q suppresses wget’s usual progress chatter, and the -O- sends the downloaded data to standard output, so that it can be piped into another program. So you would type at the terminal prompt or command line:

wget -qO- -i urls2.txt | jq -r '.features[] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv' > out.csv

Which in human says: “Hey wget, grab all of the data at the urls in the list at urls2.txt and pipe that information into jq. Jq, you’re going to filter, as raw output, the information within properties (which is within features), in particular these fields. Separate the fields with commas, and write everything to a new file called out.csv.”

…Extremely cool, eh? (Word to the wise: read Ian Milligan’s Programming Historian tutorial on wget to learn how to form your wget requests politely, so that you don’t overwhelm the servers. Wait a moment between requests – look at how the wget command was formed in part one above, with its -w 2 and --limit-rate flags.)