Parsing the White House Press Briefings as HTML

Data analysis of all the words used in the White House press briefings

Due: Thursday, January 22
Points: 10

In the previous assignment of downloading and parsing all of the White House Press Briefing pages, hopefully you noticed what a pain it is to pick out pieces of data from raw HTML using just grep. In this assignment, we learn how to use a parser designed for HTML, which will make it much easier to target the section of text that we want on every White House Briefing page.

Deliverables

  • A project folder named "wh-briefings-word-scrape"

    In your compciv repo, create a folder named homework/wh-briefings-word-scrape

    By the end of this assignment, that folder should contain at least this single script:

    • The html-scraper.sh script, which counts the top 10 words, 7 characters or more, used in all the WH Press Briefings.

    It may also contain data-hold/ as part of the process, but data-hold/ won't actually be committed to your Github repository.

  • The `html-scraper.sh` script

    In many ways, the code in this script will be similar to what you did in the previous assignment. However, if you are properly parsing the HTML, you should get a different answer than you would with just grep. In fact, if your script includes “container” as one of the top 10 7-letter-or-longer words, then you probably didn’t use the HTML parser to target the right thing.

  • The list of top 10 words

    After executing html-scraper.sh, you should get a list of the top 10 words as described above.

    Email me that list of the top 10 words, in order of frequency, that are seven letters or longer, used in all of the briefings.

    Use the subject line: Top 10 WH Words via Pup

  • Semi-walkthrough

    A few things to make this go smoothly.

    Do the Github/baby-names assignment

    Do the Github/baby-names warmup homework first.

    The data-management part of this assignment isn't too difficult, and it follows roughly the same structure as the baby-names assignment (though not exactly; for instance, I don't require you to provide a helper.sh script here), with its own subfolder in compciv/homework.

    Install pup, the HTML parsing tool

    Install and try out the pup parsing tool as outlined in this recipe.
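
    The recipe covers installation in detail (on corn, it may point you to a prebuilt binary). As one alternative, if you happen to have Go set up, the `go get` route sketched below has worked for pup; either way, the echo line is a quick sanity check that the install succeeded:

    # one possible install route -- assumes Go is installed and $GOPATH/bin is on your PATH
    go get github.com/ericchiang/pup
    # (on newer Go versions: go install github.com/ericchiang/pup@latest)

    # sanity check: this should print "hi"
    echo '<html><title>hi</title></html>' | pup 'title text{}'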

    Put your data in the right place

    By now, you should have completed the previous assignment in which you've downloaded every White House Press Briefing to date. If you don't remember where you dumped those files, and you don't relish the idea of rescraping the WH press briefings site, then just use my archive:

    http://stash.compciv.org/wh/wh-press-briefings-2015-01-07.zip

    You should move all of these HTML files into the data-hold subdirectory of this assignment's homework directory in your compciv repo, i.e.

    # assuming you logged into corn.stanford.edu at this point
    cd ~/compciv/homework/
    mkdir -p ./wh-briefings-word-scrape/data-hold
    cd wh-briefings-word-scrape/data-hold
    curl http://stash.compciv.org/wh/wh-press-briefings-2015-01-07.zip \
      -o briefings.zip
    
    unzip briefings.zip
    # you'll notice that in my zip file, all the HTML is in a subdirectory when 
    # it gets unzipped. So here's how to move those all into data-hold/ proper
    mv wh-briefings/* .
    # now let's get rid of that zip file and that now-empty subdirectory
    rm briefings.zip
    rmdir wh-briefings/
    # now cd back into the homework assignment directory and work from there
    cd ..
    

    Ready to go?

    If you look at the data I provide, none of the files have a .html extension. That's fine. If you are working from my data archive, then the following command will show you all the <title> tags from all the briefings I collected (you may just want to try it on one file, as in the example below, rather than having to wait for pup to burn through 1,300 HTML files during the exploratory phase):

    cat data-hold/* | pup 'title'
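
    If you'd rather experiment on a single file first, one convenient way (any file in data-hold/ will do; this just grabs whichever file ls lists first) is:

    # pick one arbitrary briefing file to experiment on
    one_file="data-hold/$(ls data-hold/ | head -n 1)"
    cat "$one_file" | pup 'title'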
    

    To see just the title text:

    cat data-hold/* | pup 'title text{}'
    

    If you were just interested in the URLs that are on the "right-rail" of each page:

    cat data-hold/* | pup '#right-rail a attr{href}'
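
    (Not part of the assignment, but since attr{href} emits plain lines of text, it pipes straight into the usual counting tools. For example, to see which right-rail links show up most often across the collection:)

    cat data-hold/* | pup '#right-rail a attr{href}' | \
      sort | uniq -c | sort -rn | head -n 10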
    

    "Legacy pages"

    Somewhere along the line, the White House changed its content management system, which means the HTML structure for this 2009 briefing, Press Gaggle by Robert Gibbs - 2/18/09, is different from the one for this 2014 briefing, Press Gaggle by Senior Administration Official on Director Clapper's Trip to North Korea.

    How to fix this problem? You may have to pop open your web browser (I recommend Chrome) to view the source. Then test out pup CSS selectors. You might want to run a different pup CSS selector based on the type of page. Or, try to figure out how to write a CSS selector that includes multiple selections.
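
    For instance, the overall shape of the per-page-type approach might look like the sketch below. Treat the marker string and both selectors as placeholders for whatever you actually find when you view the source, not as the answer:

    # a rough sketch -- the marker and both selectors are placeholders
    old_marker='legacy-para'
    for f in data-hold/*; do
      if grep -q "$old_marker" "$f"; then
        pup '.legacy-para text{}' < "$f"   # hypothetical selector for the older pages
      else
        pup '#content text{}' < "$f"       # hypothetical selector for the newer pages
      fi
    done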

    Note: This is not a trivial exercise, and can easily be a pain in the ass depending on how much you know about web development. In the end, it is about noticing patterns. I'll probably add extra hints to this section in the next couple of days.

    Hint 1: Use more grep options

    The following options in the grep documentation might help:

           -L, --files-without-match
                  Suppress  normal  output;  instead  print the name of each input
                  file from which no output would normally have been printed.  The
                  scanning will stop on the first match.
    
           -l, --files-with-matches
                  Suppress  normal  output;  instead  print the name of each input
                  file from which output would normally have  been  printed.   The
                  scanning  will  stop  on  the  first match.  (-l is specified by
                  POSIX.)
    
    

    Let's say there's some string that appears in the raw HTML of one format of page but not the other, and that you've stored it in a variable named litmus_test.

    The following two commands would help you figure out which press briefings were published in which format:

    grep -L $litmus_test *  # or *.html, depending on how you saved your files
    

    And the inverse of that:

    grep -l $litmus_test *  # or *.html, depending on how you saved your files
    

    Solution

    Assuming that the files are in data-hold/ and have no extension (such as .html), the answer is virtually the same as Step 3 in the previous grep assignment. The key here is to understand how the pup parser allows us to extract only the text of a given element. In this case, the text of every briefing could be found in <div id="content">...</div>, though you'd have a very hard time extracting that with grep alone.

    cat data-hold/* | pup '#content text{}' | \
      grep -oE '[[:alpha:]]{7,}' | \
      tr '[:upper:]' '[:lower:]'  | \
      sort  | uniq -c  | sort -rn  | head -n 10
    

    Results (may vary):

      78090 president
      15475 because
      11852 question
      11292 congress
      10634 important
      10585 security
      10577 administration
      10548 obviously
      10456 american
       9831 government
    

    A for-loop was unnecessary here, given how pup can just fit into the pipeline. But if you wanted to practice it, it could've looked like this (assuming that no file names had spaces in them):

    for p in $(ls data-hold/*); do 
      cat $p | pup '#content text{}' | \
      grep -oE '[[:alpha:]]{7,}' | \
      tr '[:upper:]' '[:lower:]' >> pupwords.txt
    done
    
    cat pupwords.txt | sort  | uniq -c  | sort -rn  | head -n 10
    

    In the homework hints, I threw people off with a red herring: I suggested using grep to verify the hypothesis that the White House briefing pages were split into two groups, pages with legacy-para and those without. That's what I meant by using a "litmus test":

    # How many pages do we have total?
    ls data-hold/* | wc -l
    # 1343
    
    # assuming old pages have the `legacy-para` element class,
    # use grep to count how many pages do NOT contain it:
    litmus_test='legacy-para'
    grep -L $litmus_test data-hold/* | wc -l
    # 1120
    
    # and find out how many pages _do_ contain that term
    grep -l $litmus_test data-hold/* | wc -l
    # 223
    
    # 223 + 1120 = 1343...so 'legacy-para' is a good way to divide things
    # into two groups
    

    At this point, you could grep both categories of pages and use different pup selectors. I assumed you would have to do this, but student Yuqing Pan pointed out that all pages, regardless of whether they were legacy pages, had the desired content in the <div id="content">...</div> element.
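
    Had the two formats really required different selectors, one way to handle it (a hypothetical sketch: '.legacy-para' stands in for whatever selector the old pages would have needed, and it assumes no file names contain spaces) would have been to feed each grep-separated group of files through its own pup command:

    # not needed in practice, since '#content' works on every page
    grep -L 'legacy-para' data-hold/* | xargs cat | pup '#content text{}' > pupwords.txt
    grep -l 'legacy-para' data-hold/* | xargs cat | pup '.legacy-para text{}' >> pupwords.txt
    grep -oE '[[:alpha:]]{7,}' pupwords.txt | tr '[:upper:]' '[:lower:]' | \
      sort | uniq -c | sort -rn | head -n 10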

    In the end, I didn't grade you on how observant you were of the White House's content-management system, so the actual word counts you got don't count against you. If you showed you could use pup to parse HTML and extract text, and then combine it with the previous solution, you got full credit.