Download all of the press-briefing listings, starting from http://www.whitehouse.gov/briefing-room/press-briefings?page=0. Then total up the number of lines in all the files.
Send me an email (dun@stanford) with the answer in the subject line:
Number of Lines in WH Briefing: XYZAB
And send that email through the command line, because why not.
Read the lesson on the curl tool for downloading pages from the command-line.
This assignment is the first step of a several-step process to replicate NPR's work in "The Fleeting Obsessions of the White House Press Corps".
Before we can do the word-count analysis they've done, we need to first collect the webpages of each White House briefing. And before we can even do that, we need to get a list of every briefing.
This is an exercise focused on using for loops to make a repetitive task easy. We're not actually "scraping" data in the usual sense; we're just downloading lots of webpages for further use.
Since you'll be downloading a lot of files, you'll want to make a new directory.
The following commands will make a new directory underneath your home directory (the tilde symbol is a shorthand for that) named wh-briefings, and then change into it:

mkdir -p ~/mystuff/wh-briefings
cd ~/mystuff/wh-briefings
If you do your work from here, you can come back to this directory for future assignments.
The first page of briefings is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=0
The next page of briefings is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=1
The 11th (or rather, the page numbered 10, since the count starts at 0) is at: http://www.whitehouse.gov/briefing-room/press-briefings?page=10
See a pattern?
So we need a way to generate a list of numbers in sequential order. Luckily, there's the seq command:

seq 0 5

Which outputs one number per line:

0
1
2
3
4
5
Putting that into a for loop:

for num in $(seq 0 5); do
  echo "Hey this is a number $num"
done

# Output:
Hey this is a number 0
Hey this is a number 1
Hey this is a number 2
Hey this is a number 3
Hey this is a number 4
Hey this is a number 5
Again, read the lesson on the curl tool for downloading pages from the command-line.
To download three copies of example.com and save them in files named 0.html, 1.html, and 2.html:

for num in $(seq 0 2); do
  curl http://example.com -o $num.html
done
Of course, we don't want to save three copies of the same website. So use the $num variable to correctly target the right page in each iteration of the loop.
If you want to get all of the briefings, you need to loop from 0 to whatever the final page is on the WH Briefings. As you get better at programming, you could probably write a program to automatically find this final page. For now, you should do it the old-fashioned way (i.e. entering random numbers into the browser's address bar until you reach the end).
Rather than looping through all of the possible White House pages, and then finding out much later that you didn't do the right thing, try just looping through the first three pages or so.
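One way to sanity-check the loop before unleashing curl on it is a dry run: echo the URLs the loop would fetch instead of downloading them. A minimal sketch:

```shell
# Dry run: print the URLs the loop would fetch, without downloading anything
for num in $(seq 0 2); do
  echo "http://www.whitehouse.gov/briefing-room/press-briefings?page=$num"
done
# Prints:
# http://www.whitehouse.gov/briefing-room/press-briefings?page=0
# http://www.whitehouse.gov/briefing-room/press-briefings?page=1
# http://www.whitehouse.gov/briefing-room/press-briefings?page=2
```

Once the printed URLs look right, swap the echo for your curl command.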
One of the tricky things about working from the command-line is that not everything is meant to be read as text, including HTML.
If you download the following page:
curl "http://www.whitehouse.gov/briefing-room/press-briefings?page=100" -o 100.html
How do you know you downloaded the actual page, and not just an error page? Or something else unexpected?
This is where you go back to doing things as you've done before: look at the actual page in your browser, pick a word that should appear on it (such as 'Nashville'), and then use grep to see if that word exists in the file you downloaded with curl.
So the following command should spit out a match:
grep 'Nashville' 100.html
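grep also sets its exit status: 0 when it finds a match, non-zero otherwise. So a script can branch on the result instead of eyeballing the output. A sketch, assuming 100.html was downloaded with the curl command above:

```shell
# grep -q is silent; we only care about the exit status here.
# Assumes 100.html exists from the earlier curl command.
if grep -q 'Nashville' 100.html; then
  echo "100.html looks like a real briefing listing"
else
  echo "100.html may be an error page"
fi
```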
The whitehouse.gov domain is pretty robust. But let's give it a courtesy couple of seconds between each visit. Use the sleep command in your for loop.
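A sketch of how sleep fits into the loop, with echo standing in for the curl command (the 2-second pause is an arbitrary choice):

```shell
# Pause 2 seconds between iterations; replace the echo with your curl command.
for num in $(seq 0 2); do
  echo "Fetching page $num"
  sleep 2
done
```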
If your script worked, you should have a folder, located at ~/mystuff/wh-briefings, with 100+ HTML files.
To answer the question in the deliverable, i.e. how many lines there are in all of the pages put together, you use the cat command to join the files together (look up the wildcard symbol you need to specify all of the files in a directory). And then pipe it into the command to count lines (look it up on Google).
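If you want to check your eventual answer, the pipeline ends up looking something like this (run from inside the directory holding the downloaded pages; wc -l is the line-counting command the hint is pointing you toward):

```shell
# Join every .html file in the current directory and count the total lines.
# The * wildcard matches all filenames ending in .html.
cat *.html | wc -l
```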
The first thing you had to do was figure out how far back the White House press briefings archive goes, by manually increasing the page parameter, e.g.:

http://www.whitehouse.gov/briefing-room/press-briefings?page=134
One tricky thing was that if you went back too far, the website would, by default, serve you the same listing you get at page=0.
As of Jan. 7, 2015, the highest page number was 134.
Here's a verbose version of the URL-scraper, with comments:
# set the base URL
base_url=http://www.whitehouse.gov/briefing-room/press-briefings
# set the last page number (as of 2015-01-07)
last_num=134
for i in $(seq 0 $last_num)
do
  # This echo command will print to screen the URL
  # that's currently being downloaded
  echo "$base_url?page=$i"
  # I'm silencing curl because the progress indicator is annoying
  curl "$base_url?page=$i" -s -o "$i.html"
done
Of course, it could be a lot more concise if you're into the whole brevity thing:
for i in $(seq 0 134); do
  curl "http://www.whitehouse.gov/briefing-room/press-briefings?page=$i" \
    -s -o "$i.html"
done