The Programming Historian 2

By Gretchen | September 26, 2013
by Kellen Kurschinski

  • Technical Reviewer: Nick Ruest, Konrad Lawson
  • Literary Reviewer: Ian Milligan

Background and Lesson Goals

Now that you have learned how Wget can be used to mirror or download specific files from websites via the command line, it’s time to expand your web-scraping skills through a few more lessons that focus on other uses for Wget’s recursive retrieval function. The following tutorial provides three examples of how Wget can be used to download large collections of documents from archival websites with assistance from the Python programming language. It will teach you how to parse and generate a list of URLs using a simple Python script, and will also introduce you to a few of Wget’s other useful features. Similar results can be achieved with curl, an open-source tool capable of performing automated downloads from the command line. For this lesson, however, we will focus on Wget and on building your Python skills.

Archival websites offer a wealth of resources to historians, but increased accessibility does not always translate into increased utility. In other words, while online collections often allow historians to access hitherto unavailable or cost-prohibitive materials, they can also be limited by the manner in which content is presented and organized. Take for example the Indian Affairs Annual Reports database hosted on the Library and Archives Canada (LAC) website. Say you wanted to download an entire report, or reports for several decades. The current system gives a user the option to read a plaintext version of each page, or to click on the “View a scanned page of original Report” link, which takes the user to a page with LAC’s embedded image viewer. This allows you to see the original document, but it is also cumbersome because it requires you to step through each page individually. Moreover, if you want the document for offline viewing, the only option is to right-click each image and choose “Save as” to store it in a directory on your computer. If you want several decades’ worth of annual reports, the limits of the current means of presentation become obvious pretty quickly. This lesson will allow you to overcome such an obstacle.
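To give a rough sense of what generating a list of URLs with a simple Python script can look like, here is a minimal sketch. It assumes the page images sit at sequentially numbered addresses. The URL template, the function names and the output filename are all hypothetical placeholders for illustration, not LAC’s actual addresses and not the script used in the full lesson.

```python
# Minimal sketch: build a list of sequentially numbered page URLs
# and save them to a text file, one URL per line.

# Hypothetical URL pattern; the real archive addresses will differ.
URL_TEMPLATE = "http://example.org/reports/page-{:03d}.jpg"


def build_url_list(first_page, last_page, template=URL_TEMPLATE):
    """Return one URL per page number, including both end pages."""
    return [template.format(page) for page in range(first_page, last_page + 1)]


def write_url_file(urls, path):
    """Write each URL on its own line so a downloader can read the file."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")


if __name__ == "__main__":
    urls = build_url_list(1, 5)
    write_url_file(urls, "urls.txt")  # hypothetical output filename
```

A file like this could then be handed to Wget with its `-i` (input file) option, for example `wget -i urls.txt`, so that every listed page is fetched without clicking through the image viewer.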

(Originally posted September 13, 2013)