Collecting the White House Press Briefings

The first step in analyzing web data is simply to collect the webpages.

Due: Wednesday, January 14
Points: 10

Deliverables

  • Count the number of lines of all the WH Briefing Lists

    Download all of the press-briefing listings, starting from http://www.whitehouse.gov/briefing-room/press-briefings?page=0. Then total up the number of lines in all the files.

    Send me an email (dun@stanford) with the answer in the subject line:

    Number of Lines in WH Briefing: XYZAB

    And send that email through the command line, because why not (see the sketch after the readings below).

  • Relevant readings

    Read the lesson on the curl tool for downloading pages from the command-line.

    Review Software Carpentry's lessons on Loops.

    This assignment is the first step in a several-step process to replicate NPR's work in "The Fleeting Obsessions of the White House Press Corps".

    Before we can do the word-count analysis they've done, we need to first collect the webpages of each White House briefing. And before we can even do that, we need to get a list of every briefing.

    This is an exercise focused on using for loops to make a repetitive task easy. We're not actually "scraping" data in the usual sense; we're just downloading lots of webpages for further use.
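
    As for sending that email from the command line: here's a minimal sketch, assuming your machine has the standard mail command (from mailx or a similar package) installed and configured to deliver outgoing mail, which isn't always the case on a personal laptop. The subject line and address are just the ones given in the deliverable above.

    # a sketch, assuming `mail` (mailx) is installed and set up to send outgoing email
    echo "See the subject line for the answer" | \
      mail -s "Number of Lines in WH Briefing: XYZAB" dun@stanford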

    Hints

    Organizing your space

    Since you'll be downloading a lot of files, you'll want to make a new directory.

    The following command will make a new directory underneath your home directory (the tilde symbol is a shorthand for that) named mystuff/wh-briefings:

    mkdir -p ~/mystuff/wh-briefings
    cd ~/mystuff/wh-briefings
    

    If you do your work from here, you can come back to this directory for future assignments.

    Scouting out the website

    The first page of briefings is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=0

    The next page of briefings is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=1

    The 11th (or rather, page=10, since the count starts at 0) is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=10

    See a pattern?

    Start off with a simple loop

    So we need a way to generate a list of numbers in sequential order. Luckily, there's the seq command:

    seq 0 5
    

    Results in:

    0
    1
    2
    3
    4
    5
    

    Putting that into a for construct:

    for num in $(seq 0 5); do
      echo "Hey this is a number $num"
    done
    
    # Output: 
    Hey this is a number 0
    Hey this is a number 1
    Hey this is a number 2
    Hey this is a number 3
    Hey this is a number 4
    Hey this is a number 5
    

    All together

    How to curl and loop

    Again, read the lesson on the curl tool for downloading pages from the command-line.

    To download three copies of example.com and save them as files 0.html, 1.html, and 2.html:

    for num in $(seq 0 2); do
      curl http://example.com -o $num.html 
    done
    

    Of course, we don't want to save three copies of the same website. So use the $num variable to target the right page in each iteration of the loop.

    Finding the last page

    If you want to get all of the briefings, you need to loop from 0 to whatever the final page is on the WH Briefings. As you get better at programming, you could probably write a program to automatically find this final page. For now, you should do it the old-fashioned way (i.e. entering random numbers into the browser's address bar until you reach the end).
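
    If you're curious about what an automated version might look like, here's a rough sketch. It leans on a quirk described in the Solution below: ask for a page number past the end, and the site serves up the same listing you get at page=0. The sketch assumes that out-of-range response is byte-for-byte identical to page 0, which may not hold if the page contains any dynamic content, so treat it as a starting point rather than a guaranteed method.

    # sketch: probe pages until the site starts serving the page=0 listing again
    base_url=http://www.whitehouse.gov/briefing-room/press-briefings
    curl -s "$base_url?page=0" -o first_page.html
    page=1
    # cap the search at 500 so a never-matching site doesn't loop forever
    while [ "$page" -le 500 ]; do
      curl -s "$base_url?page=$page" -o test_page.html
      # cmp -s is silent and exits 0 only when the two files are identical
      if cmp -s first_page.html test_page.html; then
        echo "Last real page appears to be $((page - 1))"
        break
      fi
      page=$((page + 1))
      sleep 2
    done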

    Test out just a few pages

    Rather than looping through all of the possible White House pages, and then finding out much later that you didn't do the right thing, try just looping through the first three pages or so.
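
    One way to do a dry run without downloading anything at all is to print the first few URLs and eyeball them before handing them to curl:

    # dry run: print the first three URLs to confirm the pattern looks right
    for num in $(seq 0 2); do
      echo "http://www.whitehouse.gov/briefing-room/press-briefings?page=$num"
    done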

    Are you even downloading the right page?

    One of the tricky things about working from the command-line is that not everything you download is meant to be read by humans as raw text, and that includes HTML.

    If you download the following page:

    curl http://www.whitehouse.gov/briefing-room/press-briefings?page=100 -o 100.html
    

    How do you know you downloaded the actual page, and not just an error page? Or something else unexpected?

    This is where you go back to doing things as you've done before: open that page in your browser, pick a word or phrase you can actually see on it (page 100, for instance, happens to list a briefing that mentions Nashville), and then grep for it in the file you downloaded.

    So the following command should spit out a match:

    grep 'Nashville' 100.html
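
    Once everything is downloaded, you can run the same kind of sanity check in bulk. The sketch below assumes that every real listing page contains the phrase "Press Briefing" somewhere in its HTML (confirm that in your browser before relying on it); grep's -L option lists the files that do not contain the pattern, i.e. the suspicious ones.

    # list every downloaded file that does NOT contain the expected phrase;
    # "Press Briefing" is an assumption -- verify it appears on a real listing page first
    grep -L 'Press Briefing' *.html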
    

    Give the White House some rest

    The whitehouse.gov domain is pretty robust, but let's give it a courteous couple of seconds between each visit. Use the sleep command inside your for loop.
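
    For example, dropping a sleep into the earlier example loop might look like this (the two-second pause is just a polite guess, not a magic number):

    for num in $(seq 0 2); do
      curl http://example.com -o $num.html
      # wait 2 seconds before making the next request
      sleep 2
    done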

    Conclusion

    If your script worked, you should have a folder, located at ~/mystuff/wh-briefings, with 100+ HTML files.

    To answer the question in the deliverable, i.e. how many lines there are in all of the pages put together, use the cat command to join the files together (look up the wildcard symbol you need to specify all of the files in a directory), and then pipe the result into the command that counts lines (look it up on Google if you need to).
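
    Once you've looked those up, the whole thing boils down to something like this, assuming you're inside ~/mystuff/wh-briefings and every downloaded file ends in .html:

    # join every .html file in the directory and count the total number of lines
    cat *.html | wc -l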

    Solution

    The first thing you had to do was figure out how far back the White House press briefings archive goes, by manually increasing the page parameter, e.g.:

    http://www.whitehouse.gov/briefing-room/press-briefings?page=50
    http://www.whitehouse.gov/briefing-room/press-briefings?page=100

    One tricky thing was that if you went back too far, the website would, by default, serve you what you get at page=0.

    As of Jan. 7, 2015, the highest page number was 134.

    Here's a verbose version of the URL-scraper, with comments:

    # set the base URL of the press-briefings listing
    base_url=http://www.whitehouse.gov/briefing-room/press-briefings
    # set the last page number (as of 2015-01-07)
    last_num=134
    
    for i in $(seq 0 $last_num) 
    do
      # This echo command will print to screen the URL 
      #  that's currently being downloaded
      echo "$base_url?page=$i"
      # I'm silencing curl because the progress indicator is annoying
      curl "$base_url?page=$i" -s -o "$i.html"
    done
    

    Of course, it could be a lot more concise if you're into the whole brevity thing:

    for i in $(seq 0 134); do
      curl "http://www.whitehouse.gov/briefing-room/press-briefings?page=$i" \
        -s -o "$i.html"
    done