Listing the BuzzFeed listicles

Practicing web-scraping and regexes on BuzzFeed listicle titles

Due: Tuesday, February 17
Points: 5 (Extra Credit)

Using an HTML parser and regular expressions, and given a list of every article BuzzFeed has produced in 2014, create a frequency distribution of BuzzFeed’s lists, grouped by the number of items in each list.

Deliverables

  • A repo folder named buzzfeed-listicle-parsing

    Create a subfolder at homework/buzzfeed-listicle-parsing

    This project folder needs to contain only a single committed file:

    • lister.sh
  • The lister.sh script

    This script, given a folder of BuzzFeed archive pages, in which each page lists all the headlines BuzzFeed produced on a given day, will:

    • Parse each of those headlines, filter for only the headlines that belong to lists (i.e. “listicles”), and then parse how many items are in each listicle (based on the headline alone).

    • After extracting just the number of items from each list, create a sorted, comma-separated list that looks like this:

              3,34
              4,21
              5,12
              6,3
      

    In the above example, this list would correspond to BuzzFeed having produced 34 lists that had 3 items (e.g., “3 Bananas That Look Like Celebrities”), 21 lists that had 4 items (e.g., “4 Life-Changing Photos Of Two Baby Red Pandas”), and so on.
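
    The tallying step itself is a job for sort and uniq -c. As a tiny, self-contained illustration (the numbers below are made up, not BuzzFeed data):

        # count how many times each list-length appears in a stream of numbers
        printf '3\n3\n4\n3\n5\n' | sort | uniq -c
        #      3 3
        #      1 4
        #      1 5

    Getting from uniq -c's space-padded output to the comma-separated format shown above is part of the exercise.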

  • Background

    This assignment was inspired by Noah Veltman's and Brian Abelson's "Listogram of BuzzFeed Listicle Lengths", in which they tallied and grouped BuzzFeed listicles by number of items per listicle.

    This mass scrape yielded a variety of insights, including:

    Despite the perceived move away from round numbers, the four most repeated list lengths are 10, 15, 21, and 25. (One of these things is not like the others.) But overall, Veltman was surprised by how common it was to write lists over 30.

    “I’d have expected lists longer than 15 or 20 to be much rarer, since it seems like a lot to sit through AND it seems like a lot more work for the author,” Veltman wrote. In fact, Shepherd pointed to a period in 2010 when there was an internal, unofficial competition in the BuzzFeed offices to see who could write the post that exceeded 100 items by the widest margin.

    BuzzFeed is legendary not just for its listicles, but for its use of analytics in determining the "buzziness" of its content:

    But ultimately — much as one might want analytics to deliver conclusive answers — the results are often fuzzy. “I tried to have a look at some of our writers who have the most viral posts, and there’s actually a pretty wide distribution of numbers,” says [BuzzFeed's Jack] Shepherd. People make lists of all different lengths for all different reasons. And why people click on them likely has very little to do with having a natural affinity for the number 12 or refusing to read lists longer than 30.

    There are conceivably a lot of interesting analytical questions we could attempt to answer, starting from a list of BuzzFeed articles, such as: Which listicle animal gets shared more on social media? Dogs? Or cats? How many lists involve dogs and cats? And if you had but one listicle to give to your country/media company…dog? Or cat?

    We can tackle those later. But the primary goal of this assignment is to practice HTML parsing with pup and to see how far we can get with grep and regular expressions – just because grep and regexes alone aren't suitable for (sane) HTML parsing doesn't mean they can't be used in combination with an HTML parser to filter for data.

    Note: I'm not really sure that relatively simple regexes are enough to determine, from the title alone, which BuzzFeed articles are lists. Of course, if they were, that would imply something about the headline-writing process. Either way, see how far you can get (hint: it will almost certainly take more than one call to grep).
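
    For instance, a rough first pass might look something like this (just a sketch: the bare 'a' selector will grab every link on an archive page, so it will need narrowing, and the patterns will need tuning):

        pup 'a text{}' < data-hold/2014/01/01.html |
          grep -oE '^[0-9]{1,3} |^The [0-9]{1,3} ' |
          grep -oE '[0-9]+'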

    The comma-delimited list you produce won't be a visualization, but it's only one small step away from being a nice chart, or a full-fledged website. We'll save that for a later assignment.

    Technical notes

    Important things I must mention at the top
    1. Make sure you've done the Github/Baby-name-counting homework and that you have .gitignore set up correctly. Do not push to Github a gigantic folder (or zip file) of BuzzFeed archive pages. If you're unsure about what you're about to push to Github, run git status and/or ask me first.
    2. This is an HTML parsing exercise, not a web scraping exercise. As you'll see below, the scraping has already been done, and you are supposed to filter the data based on the headlines alone. At no point should you be actively hitting BuzzFeed's live site. If you think that's part of the exercise, then you need to think about the difference between making curl requests and just parsing text/html files.
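
    To make that distinction concrete, the first command below is web scraping (shown only for contrast); the second is the kind of thing this assignment actually calls for:

        # web scraping: fetches a page over the network (NOT needed for this homework)
        curl -s http://www.buzzfeed.com/archive/2014/10/12 > page.html
        # HTML parsing: reads an archive page that is already on disk
        pup '.flow li.bf_dom a text{}' < data-hold/2014/10/12.html

    And, per note 1, one easy way to keep from pushing the archive to Github is to make sure data-hold/ is listed in your .gitignore.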

    Collecting the BuzzFeed archive pages has already been done for you: you can curl the 2014 zip file here. Or you can download the 2006-to-2014 collection here, though at 700MB, you may max out your corn.stanford.edu disk space (you'll also have to rearrange the folder structure a bit to match what's described below).

    Download the archive into the data-hold/ subdirectory (and save or rename it as 2014.zip) and unzip it there. Your compciv repo should have this structure:

    compciv/
    |
    |__homework/
       |__buzzfeed-listicle-parsing/
          |
          |___lister.sh
          |___data-hold/
              |__2014
                 |__01
                    |__01.html
                    |__02.html         
                    (and so on)
    

    So if you're in the working project directory, e.g. ~/compciv/homework/buzzfeed-listicle-parsing, you should be able to run this command:

    cat data-hold/2014/01/01.html data-hold/2014/12/31.html | wc -l
    # => 6003
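
    If the archive isn't in place yet, the download-and-unzip step might look something like the following, where ZIP_URL is only a placeholder for whichever zip link above you're using:

        cd ~/compciv/homework/buzzfeed-listicle-parsing
        mkdir -p data-hold
        # ZIP_URL is a stand-in for the actual archive link mentioned above
        curl -o data-hold/2014.zip "$ZIP_URL"
        # extract inside data-hold/ so that data-hold/2014/MM/DD.html ends up in place
        unzip data-hold/2014.zip -d data-hold/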
    

    Iterating through dates

    (Update: This is a dumb idea. Just do cat data-hold/*/*/*.html)

    While you don't have to curl anything from a live site, you do have to iterate through the subfolders in data-hold/. The easiest way to do that is to use seq, the date program, and command substitution. Note: this only works on corn.stanford.edu's flavor of Unix; it won't work on your OS X Unix (unless you download a separate date program).

    Since figuring out date strings is not what we're intending to exercise, I'll just give you the appropriate for loop, and you can adjust it to your needs:

    d_start='2014-01-01'
    d_end='2014-12-31'
    # number of days between the two dates, computed via epoch-second arithmetic
    days_diff=$(( ( $(date -ud $d_end +'%s') - $(date -ud $d_start +'%s') ) / 60 / 60 / 24 ))

    for num in $(seq 0 $days_diff); do
      # DO YOUR WORK HERE
      # (this line just prints each date as YYYY-MM-DD, e.g. 2014-01-01)
      date -d "$d_start $num days" +%Y-%m-%d
    done
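
    If you do go the date-loop route (the update above notes that a simple glob is easier), the same loop can build each day's file path instead of just printing the date. A sketch, again relying on corn.stanford.edu's GNU date:

        # reuse $d_start and $days_diff from the loop above
        for num in $(seq 0 $days_diff); do
          # e.g. data-hold/2014/01/01.html, data-hold/2014/01/02.html, ...
          fname="data-hold/$(date -d "$d_start $num days" +%Y/%m/%d).html"
          echo "$fname"   # replace echo with whatever per-file parsing you need
        done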
    

    Solution

    In the hints above, I suggested using a convoluted for-loop to iterate through all the subfolders (e.g. data-hold/2014/01/, data-hold/2014/12/)… that was dumb. Just use cat data-hold/*/*/*.html

    # Extract every archived headline with pup, pull out the item count from each
    # listicle title (a leading number, or a number following "The"), then tally
    # with sort | uniq -c and reshape uniq's "count value" output into "count,value":
    cat data-hold/*/*/*.html | pup '.flow li.bf_dom a text{}' |
      grep -oE '^[0-9]{1,3}|The [0-9]{1,3}' | grep -oE '[0-9]+' |
      sort | uniq -c | sort -rn | sed -E 's/^ +//' |
      sed -E 's/ +$//' | sed -E 's/ +/,/'
    

    Setup:

    mkdir -p data-hold/2014/10/
    curl -s http://www.buzzfeed.com/archive/2014/10/12 > data-hold/2014/10/12.html
    bash lister.sh
    

    Expected answer:

    5,10
    4,18
    3,24
    3,23
    3,21
    3,20
    3,19
    3,17
    2,25
    2,15
    2,12
    2,11
    1,67
    1,37
    1,31
    1,29
    1,28
    1,26
    1,22
    1,16
    1,14
    1,13