Collecting the White House Press Briefings

The first step in analyzing web data is simply to collect the webpages.

Due: Wednesday, January 14
Points: 10

Deliverables

  • Count the number of lines of all the WH Briefing Lists

    Download all of the press-briefing listings, starting from http://www.whitehouse.gov/briefing-room/press-briefings?page=0. Then total up the number of lines in all the files.

    Send me an email (dun@stanford) with the answer in the subject line:

    Number of Lines in WH Briefing: XYZAB

    And send that email through the command line, because why not (see the sketch after the readings below).

  • Relevant readings

    Read the lesson on the curl tool for downloading pages from the command-line.

    Review Software Carpentry's lessons on Loops.

    This assignment is the first step in a several-step process to replicate NPR's work in "The Fleeting Obsessions of the White House Press Corps".

    Before we can do the word-count analysis they've done, we need to first collect the webpages of each White House briefing. And before we can even do that, we need to get a list of every briefing.

    This is an exercise focused on using for loops to make a repetitive task easy. We're not actually "scraping" data in the usual sense; we're just downloading lots of webpages for further use.
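
    As for sending that email from the command line: here's a minimal sketch, assuming your machine has the standard mail command (from mailx or a similar package) installed and configured to deliver outgoing mail, which isn't always the case on a personal laptop. The subject line and address are just the ones given in the deliverable above.

    # a sketch, assuming `mail` (mailx) is installed and set up to send outgoing email
    echo "See the subject line for the answer" | \
      mail -s "Number of Lines in WH Briefing: XYZAB" dun@stanford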

    Hints

    Organizing your space

    Since you'll be downloading a lot of files, you'll want to make a new directory.

    The following command will make a new directory underneath your home directory (the tilde symbol is a shorthand for that) named mystuff/wh-briefings:

    mkdir -p ~/mystuff/wh-briefings
    cd ~/mystuff/wh-briefings
    

    If you do your work from here, you can come back to this directory for future assignments.

    Scouting out the website

    The first page of briefings is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=0

    The next page of briefings is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=1

    The 11th (or rather, page=10, since the count starts at 0) is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=10

    See a pattern?

    Start off with a simple loop

    So we need a way to generate a list of numbers in sequential order. Luckily, there's the seq command:

    seq 0 5
    

    Results in:

    0
    1
    2
    3
    4
    5
    

    Putting that into a for construct:

    for num in $(seq 0 5); do
      echo "Hey this is a number $num"
    done
    
    # Output: 
    Hey this is a number 0
    Hey this is a number 1
    Hey this is a number 2
    Hey this is a number 3
    Hey this is a number 4
    Hey this is a number 5
    

    All together

    How to curl and loop

    Again, read the lesson on the curl tool for downloading pages from the command-line.

    To download three copies of example.com and save them as files 0.html, 1.html, and 2.html:

    for num in $(seq 0 2); do
      curl http://example.com -o $num.html 
    done
    

    Of course, we don't want to save three copies of the same website. So use the $num variable to target the right page in each iteration of the loop.

    Finding the last page

    If you want to get all of the briefings, you need to loop from 0 to whatever the final page is on the WH Briefings. As you get better at programming, you could probably write a program to automatically find this final page. For now, you should do it the old-fashioned way (i.e. entering random numbers into the browser's address bar until you reach the end).
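
    If you're curious about what an automated version might look like, here's a rough sketch. It leans on a quirk described in the Solution below: ask for a page number past the end, and the site serves up the same listing you get at page=0. The sketch assumes that out-of-range response is byte-for-byte identical to page 0, which may not hold if the page contains any dynamic content, so treat it as a starting point rather than a guaranteed method.

    # sketch: probe pages until the site starts serving the page=0 listing again
    base_url=http://www.whitehouse.gov/briefing-room/press-briefings
    curl -s "$base_url?page=0" -o first_page.html
    page=1
    # cap the search at 500 so a never-matching site doesn't loop forever
    while [ "$page" -le 500 ]; do
      curl -s "$base_url?page=$page" -o test_page.html
      # cmp -s is silent and exits 0 only when the two files are identical
      if cmp -s first_page.html test_page.html; then
        echo "Last real page appears to be $((page - 1))"
        break
      fi
      page=$((page + 1))
      sleep 2
    done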

    Test out just a few pages

    Rather than looping through all of the possible White House pages, and then finding out much later that you didn't do the right thing, try just looping through the first three pages or so.
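
    One way to do a dry run without downloading anything at all is to print the first few URLs and eyeball them before handing them to curl:

    # dry run: print the first three URLs to confirm the pattern looks right
    for num in $(seq 0 2); do
      echo "http://www.whitehouse.gov/briefing-room/press-briefings?page=$num"
    done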

    Are you even downloading the right page?

    One of the tricky things about working from the command-line is that not everything you download is meant to be read by humans as raw text, and that includes HTML.

    If you download the following page:

    curl http://www.whitehouse.gov/briefing-room/press-briefings?page=100 -o 100.html
    

    How do you know you downloaded the actual page, and not just an error page? Or something else unexpected?

    This is where you go back to doing things as you've done before: open that page in your browser, pick a word or phrase you can actually see on it (page 100, for instance, happens to list a briefing that mentions Nashville), and then grep for it in the file you downloaded.

    So the following command should spit out a match:

    grep 'Nashville' 100.html
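
    Once everything is downloaded, you can run the same kind of sanity check in bulk. The sketch below assumes that every real listing page contains the phrase "Press Briefing" somewhere in its HTML (confirm that in your browser before relying on it); grep's -L option lists the files that do not contain the pattern, i.e. the suspicious ones.

    # list every downloaded file that does NOT contain the expected phrase;
    # "Press Briefing" is an assumption -- verify it appears on a real listing page first
    grep -L 'Press Briefing' *.html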
    

    Give the White House some rest

    The whitehouse.gov domain is pretty robust, but let's give it a courteous couple of seconds between each visit. Use the sleep command inside your for loop.
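
    For example, dropping a sleep into the earlier example loop might look like this (the two-second pause is just a polite guess, not a magic number):

    for num in $(seq 0 2); do
      curl http://example.com -o $num.html
      # wait 2 seconds before making the next request
      sleep 2
    done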

    Conclusion

    If your script worked, you should have a folder, located at ~/mystuff/wh-briefings, with 100+ HTML files.

    To answer the question in the deliverable, i.e. how many lines there are in all of the pages put together, use the cat command to join the files together (look up the wildcard symbol you need to specify all of the files in a directory), and then pipe the result into the command that counts lines (look it up on Google if you need to).
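
    Once you've looked those up, the whole thing boils down to something like this, assuming you're inside ~/mystuff/wh-briefings and every downloaded file ends in .html:

    # join every .html file in the directory and count the total number of lines
    cat *.html | wc -l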

    Solution

    The first thing you had to do was figure out how far back the White House press briefings archive goes, by manually increasing the page parameter, e.g.:

    http://www.whitehouse.gov/briefing-room/press-briefings?page=50
    http://www.whitehouse.gov/briefing-room/press-briefings?page=100

    One tricky thing was that if you went back too far, the website would, by default, serve you what you get at page=0.

    As of Jan. 7, 2015, the highest page number was 134.

    Here's a verbose version of the URL-scraper, with comments:

    # set the base URL of the press-briefings listing
    base_url=http://www.whitehouse.gov/briefing-room/press-briefings
    # set the last page number (as of 2015-01-07)
    last_num=134
    
    for i in $(seq 0 $last_num) 
    do
      # This echo command will print to screen the URL 
      #  that's currently being downloaded
      echo "$base_url?page=$i"
      # I'm silencing curl because the progress indicator is annoying
      curl "$base_url?page=$i" -s -o "$i.html"
    done
    

    Of course, it could be a lot more concise if you're into the whole brevity thing:

    for i in $(seq 0 134); do
      curl "http://www.whitehouse.gov/briefing-room/press-briefings?page=$i" \
        -s -o "$i.html"
    done