Bulk Uploading to the Internet Archive

by Michael Szul on Sun Feb 19 2017 12:05:42

The Internet Archive is best known for the Wayback Machine, which lets you check past versions of a website, but it's also a huge resource of archival text, video, and audio. In fact, we upload our podcast episodes to the Internet Archive, and that's where they get served from when someone listens to an episode.

With today's political climate, it's become more important than ever to make sure scientific knowledge and other resources are archived in multiple places, so that knowledge that might otherwise disappear is preserved. Many might not realize it, but the Internet Archive has its own tools for bulk uploading, updating, and processing items, which make it easier to move large numbers of items.

To get started, you'll need to either download the ia binary, or install the internetarchive Python library. For the purposes of this tutorial, the binary is only used for configuring the authentication. This also assumes a Unix-like environment. Linux and Mac systems work fine from the terminal. I used Windows 10's Linux subsystem.

Once it's installed, you'll need to configure it by running ia configure and entering your Internet Archive account credentials. This will create a file named .ia that will be needed to perform the bulk upload.

As an aside, if you are using a 2.7.x flavor of Python, authentication might fail. If you can't upgrade to a higher version, you can edit the .ia file and add the following lines:

[general]
secure = false

Once you're all set up, it's pretty simple to work with the internetarchive Python library. For uploads, you import from the module:

from internetarchive import upload

For my own experiment, I had a text file that needed to be parsed out to get the titles and names. The following method accomplishes this:

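# Python 3 syntax; uses the upload function imported above.
def parse_file():
    # newline="" keeps the "\r\n" line endings so the splits below still match.
    with open("bulk_result.txt", "r", newline="") as handle:
        file_contents = handle.read()
    # Records are separated by a blank line.
    for part in file_contents.split("\r\n\r\n"):
        if "doi:" not in part:
            continue  # skip records without a DOI
        try:
            doi = part.split("doi:")[1].strip()
            # The title is the text after ": " on the record's first line.
            upload_title = part.split("\r\n")[0].split(": ")[1].strip()
            file_name = 'downloads/' + doi + '.pdf'
            print(file_name)
            meta_data = dict(collection='opensource', title=upload_title, mediatype='texts')
            print(meta_data)
            result = upload('codepunk_io_' + doi, files=[file_name], metadata=meta_data)
            print(result)
        except Exception as error:
            # Report the failure and move on to the next record.
            print('An error occurred...', error)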

The text parsing is pretty straightforward, and it will change depending on your source data. Ideally, this would be an XML file. The file I had to work with in this instance was a flat text file, but it had enough of a pattern that simple string splitting worked.
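To make the splitting logic easier to follow, here's a minimal sketch that runs it against an in-memory sample instead of a file. The record layout (a "Title: ..." first line plus a "doi:" marker), the sample values, and the parse_records name are all assumptions for illustration; the real bulk_result.txt may look different.

```python
# Hypothetical sample in the assumed layout: records separated by a blank
# line, a "Label: value" first line, and a "doi:" marker in each record.
SAMPLE = (
    "Title: First Example Paper\r\nAuthor: A. Writer\r\ndoi:example-0001\r\n\r\n"
    "Title: Second Example Paper\r\nAuthor: B. Writer\r\ndoi:example-0002\r\n\r\n"
    "Title: Record Without An Identifier\r\nAuthor: C. Writer"
)


def parse_records(text):
    """Return (title, doi) pairs for every record that has a doi: marker."""
    records = []
    for part in text.split("\r\n\r\n"):
        if "doi:" not in part:
            continue  # nothing to upload for this record
        doi = part.split("doi:")[1].strip()
        first_line = part.split("\r\n")[0]
        # maxsplit=1 keeps titles that themselves contain ": " intact.
        title = first_line.split(": ", 1)[1].strip()
        records.append((title, doi))
    return records


for title, doi in parse_records(SAMPLE):
    print(doi, "->", title)
```

Keeping the parsing separate from the upload call like this also makes it easy to check what would be uploaded before anything is sent.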

The two important pieces for bulk uploading are the meta_data dictionary and the call to upload. The meta_data dictionary holds the metadata for the item you're uploading:

meta_data = dict(collection='opensource', title=upload_title, mediatype='texts')

At the Internet Archive, you can't create a collection until you have more than 50 uploads, and even then, the curators have to create it for you. This means that you have to work with the community collections, such as community audio, community video, or community texts. Community texts has a collection name of opensource, which is what I use here. For video, it's opensource_movies and for audio it's opensource_audio. The mediatype property will be texts, movies, or audio.
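If you upload more than one kind of media, a small lookup keeps the collection name and mediatype in sync. This is just a sketch: the build_metadata helper and the COMMUNITY_COLLECTIONS name are hypothetical, and the mapping only contains the three community collections named above.

```python
# Community collection for each mediatype, as described in the text.
COMMUNITY_COLLECTIONS = {
    "texts": "opensource",
    "movies": "opensource_movies",
    "audio": "opensource_audio",
}


def build_metadata(title, mediatype="texts"):
    """Build a metadata dict that targets the matching community collection."""
    if mediatype not in COMMUNITY_COLLECTIONS:
        raise ValueError("unsupported mediatype: %r" % mediatype)
    return dict(
        collection=COMMUNITY_COLLECTIONS[mediatype],
        title=title,
        mediatype=mediatype,
    )


print(build_metadata("An Example Recording", mediatype="audio"))
```

The resulting dictionary has the same shape as the one built by hand, so it can be passed straight to upload as the metadata argument.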

Once you have the metadata squared away, the upload is pretty straightforward:

upload('codepunk_io_' + doi.strip(), files=[file_name], metadata=meta_data)

The first parameter is a unique identifier for the item. If you're going to put these items together in a collection later, give them a common prefix that you can easily search for when you need to gather the data for the curators.
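One thing to watch for: a full DOI usually contains a slash, which won't work well in an identifier or a file name. Here's a rough sketch of building a prefixed identifier from a raw ID. The make_identifier helper is hypothetical, and the allowed character set (letters, digits, period, underscore, dash) is an assumption you should check against the Internet Archive's identifier rules.

```python
import re


def make_identifier(prefix, raw_id):
    """Prefix raw_id and replace characters assumed to be unsafe with "_"."""
    # Assumption: only letters, digits, ".", "_" and "-" are safe to keep.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", raw_id.strip())
    return prefix + cleaned.strip("_")


print(make_identifier("codepunk_io_", " 10.1234/example.5678 "))
```

Because different raw IDs can collapse to the same cleaned string, it's worth checking the generated identifiers for duplicates before uploading.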

There are other ways to do the bulk upload as well, including processing a CSV file. You could parse your XML or flat file, or read a directory of files, and construct a CSV file that you then process with the ia command. The ia upload command will process the file, and upload the entries for you.
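As a sketch of the CSV route, the snippet below writes one row per item using Python's csv module. The column layout (identifier and file, followed by metadata fields) is my understanding of what ia upload expects in spreadsheet mode, and the write_upload_csv helper, the prefix, and the downloads/ path are assumptions carried over from the example above, so check ia upload --help before relying on it.

```python
import csv
import io


def write_upload_csv(records, out, prefix="codepunk_io_"):
    """Write (title, doi) pairs as rows: identifier, file, then metadata."""
    fieldnames = ["identifier", "file", "title", "collection", "mediatype"]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for title, doi in records:
        writer.writerow({
            "identifier": prefix + doi,
            "file": "downloads/" + doi + ".pdf",
            "title": title,
            "collection": "opensource",
            "mediatype": "texts",
        })


# Demo against an in-memory buffer; pass an open file object to write to disk.
buffer = io.StringIO()
write_upload_csv([("First Example Paper", "example-0001")], buffer)
print(buffer.getvalue())
```

Once the rows are saved to a file (say, uploads.csv), the command would be something like ia upload --spreadsheet=uploads.csv, assuming that flag is available in your version of the tool.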