Analyzing Tweets in CSV form

Connect to the Twitter API, download a user’s tweets as CSV, and count frequency of hashtags and words.

Due: Friday, February 13
Points: 5

Using the command-line t program to collect Twitter data as CSV and the csvfix tool to process it, we can then parse the text using the standard Unix text tools to find interesting trends in a user’s tweets.

The t program is a simplified wrapper for the Twitter API; while it makes connecting to Twitter easy, the data it returns is only a subset of what Twitter’s API offers. That said, enough data is revealed in the text alone, and so this exercise will mostly be a practice of grep and regular expressions. Please, especially, review your regular expressions.

Make sure you’ve completed the instructions on how to sign up with Twitter as a developer and get authenticated.

Deliverables

  • The timeline-csv-analysis.sh script

    In your existing congress-twitter folder, create a script named timeline-csv-analysis.sh

    This is what your repo will look like:

          |-compciv/
            |-homework/
               |--congress-twitter/
                  |--timeline-csv-analysis.sh
                  |--data-hold/
                     |--someusers-tweets.csv
    

    Note that you already have a congress-twitter folder, from a previous assignment.

    The timeline-csv-analysis.sh script, when run, will do the following:

    1. Using the t program, it will collect the maximum number of tweets possible from the live Twitter API from a given user’s timeline in CSV format.
    2. Using the csvfix program, calculate the following:

      1. Top 10 hashtags, case-insensitive, in order of frequency of appearance.
      2. Top 10 users by frequency of retweets
      3. Top 10 users mentioned in tweets that are NOT retweets and who are NOT the user in question.
      4. Top ten words, 5-letters or more, that are not usernames, nor hashtags, nor URLs.

    Usage:

    bash timeline-csv-analysis.sh DarrellIssa
    

    Output:

      Fetching tweets for DarrellIssa into ./data-hold/DarrellIssa-timeline.csv
      Analyzing 3201 tweets by DarrellIssa since "2013-11-19 17:27:24 +0000"
      Top 10 hashtags by DarrellIssa
          246 #irs
          126 #irstargeting
          125 #obamacare
          101 #americasvets
           78 #benghazi
           55 #fastandfurious
           54 #doj
           54 #dataact
           48 #atf
           44 #militarymonday
      Top 10 retweeted users by DarrellIssa
          116 @gopoversight
           28 @speakerboehner
           13 @gopleader
           12 @housegop
            8 @repmarkmeadows
            6 @jim_jordan
            5 @datacoalition
            4 @sorefeetmt
            4 @joshmyersnc
            4 @egil_skallagrim
      Top 10 mentioned users (not including retweets) by DarrellIssa
          214 @gopoversight
           27 @jim_jordan
           16 @housejudiciary
           15 @speakerboehner
           12 @tgowdysc
           12 @gopleader
           11 @wsj
           11 @repcummings
           11 @foxnews
           10 @usmc
      Top tweeted 10 words with 5+ letters by DarrellIssa
          301 shall
          195 lerner
          181 about
          175 president
          164 house
          152 hearing
          145 congress
          137 today
          135 states
          113 state
    

    Another example:

      bash timeline-csv-analysis.sh CoryBooker
    

    Output:

      Fetching tweets for CoryBooker into ./data-hold/CoryBooker-timeline.csv
      Analyzing 3201 tweets by CoryBooker since "2014-10-24 01:28:29 +0000"
      Top 10 hashtags by CoryBooker
           46 #sotu
           22 #fusionriseup
           20 #vegan
           20 #givingtuesday
           14 #smallbizsat
           13 #nj
           11 #netneutrality
            8 #souls2polls
            8 #gotv
            7 #stateofcivilrights
      Top 10 retweeted users by CoryBooker
           12 @whitehouse
            9 @silviaealvarez
            8 @senbookerofc
            8 @ghelmy
            7 @stevenfulop
            7 @senangusking
            7 @brendanwgill
            6 @steveadubato
            6 @senrandpaul
            6 @quorumcall
      Top 10 mentioned users (not including retweets) by CoryBooker
           19 @tomhilly
           14 @swayneharris
           14 @progressagent
           13 @tw22197
           12 @ramya_bal
           11 @jwhof
           11 @arielkelisb
           10 @savecatstoday
           10 @onecaliberal
           10 @johnbetz2
      Top tweeted 10 words with 5+ letters by CoryBooker
          429 thanks
          359 thank
          172 great
          171 about
          167 vegan
          121 today
          116 support
          115 senator
          105 please
          105 booker
    

    See the following section for hints and requirements.

  • Hints

    Set up the t program and connect to Twitter's API

    The very first thing you need to do is install Ruby on your corn.stanford.edu account and install the t program. Then you need to authenticate with Twitter's API.

    The instructions are here. This is a task that could take about 30 minutes (waiting for things to finish installing), so start as soon as you can. By the end of it, you should be able to Tweet from the command-line, among other things.

    Check out t's documentation. To download the maximum number of tweets from a user's timeline:

    t timeline -n 3200 some_user_name
    

    Sample output:

       @dancow
       Hyperlocal news network Patch finds profit after laying off most of its 
       employees http://t.co/uei2o3Yose
    
       @dancow
       Daily Beast reporter doesn't know about Updog http://t.co/6Llk0xdPO2
    
       @dancow
       I'll be honest...the entire CSS Flexbox thing has totally slipped by my radar 
       . Time to put myself out to pasture http://t.co/pscKS7zydb
    

    However, this is not the data format we want (as you can imagine, it'd be a pain in the ass to grep). So use the --csv option. And redirect to a file.

    t timeline -n 3200 --csv  some_user_name > data-hold/some_user_name.csv
    

    Note that Excel can be used to open up CSV files. So you can send the output to your ~/WWW directory and download the file onto your desktop (you could also set up t to work on your own laptop, if you wanted).

    In Excel, this is what the data looks like:

    img

    The data returned by the t program is much less specific than what we would get from the Twitter API as JSON. But given all that we care to count: hashtags, users who are retweeted, users who are mentioned, and word counts, we only need the Text field.

    If you want pre-fetched data to work with, here's a couple of archives:

    Working with t's data

    As I said before, the t program shows only a subset of fields from the actual Twitter JSON API. There's no columns that say, "This tweet is a retweet", or, "This tweet is a reply". So you will have to use grep to make that judgment based on text patterns.

    What is a retweet?

    A retweet is any Tweet that starts with a RT. Here's an example Tweet text from Darrell Issa's tweets-as-CSV:

    RT @RepLankford: Chaired informative @GOPoversight hearing yesterday on #SSDI & ways to save the program from insolvency in 2016  http://t.co/hyNZEfphYB
    
    What is a mention?

    Anytime a user mentions another user, they prepend an @ in front of the username (remember that all Twitter usernames consist of alphanumerical characters and underscores).

    The following Tweet mentions two users, @SDSU and @GoAztecs:

    Congrats to the @SDSU Aztecs on winning the Mountain West Conference this past weekend! #MarchMadness @GoAztecs #WeAreAztecs
    
    What is a Hashtag

    A Hashtag is any sequence of alphanumerical characters, and underscores, that immediately follow a pound sign, #. The above tweet contains two hashtags, #MarchMadness and #WeAreAztecs

    Install csvfix

    The csvfix tool is yet-another-parser designed specifically for the parsing of comma-delimited files, which is the data format that we'll be fetching from Twitter using t.

    You can find installation and usage information about csvfix here.

    You should be wondering, why can't we just use cut and grep, e.g.

      echo "something,apples,oranges" | cut -d ',' -f 1,2
    

    However, remember that cut, as does grep, work on a line by line basis. But some tweets contain text that are more than one line:

    img

    Or, more to the point, text that contains commas, which completely messes up with your ability to cut and delimit by literal commas:

    img

    In a CSV file, these tweets look like this:

    img

    So think of csvfix as a parser sophisticated enough to deal with all of the ways that CSV get get messy. Check out its entry in the Unix tools page for examples and installation.

    Getting started

    The assignment requires that timeline-csv-analysis.sh print out these lines (assuming you're looking at user "DarrellIssa"):

    $ bash timeline-csv-analysis.sh DarrellIssa
    
    Fetching tweets for DarrellIssa into ./data-hold/DarrellIssa-timeline.csv
    Analyzing 3201 tweets by DarrellIssa since "2013-11-19 17:27:24 +0000"
    

    You can infer two things:

    1. Every time you run the script and give it a username, a new file named data-hold/some_username-timeline.csv is created.
    2. After the tweets are downloaded, you have to count how many tweets there are as well as the date of the oldest tweet in the collection.

    Here's the first few lines for timeline-csv-analysis.sh:

    # This assignment is just for readability purposes
    username=$1
    # create data-hold if it doesn't already exist
    mkdir -p ./data-hold
    echo "Fetching tweets for $username into ./data-hold/$username-timeline.csv
    "
    # use t to download the tweets in CSV form and save to file
    file=data-hold/$username-timeline.csv
    t timeline -n 3200 --csv $username > $file
    # Get the count of lines using csvfix and its order subcommand
    # note: another subcommand could be used here, but the point is to use
    # csvfix to reduce the file to just the first field, and then count the lines.
    
    # In other words, you cannot just count the number of Tweets with wc alone, 
    # because some tweets span multiple lines
    count=$(csvfix order -f 1 $file | wc -l)
    # The timestamp of the tweet is in the field (i.e. column) named, 'Posted at'
    # and the oldest tweet is in the last line
    lastdate=$(csvfix order -fn 'Posted at' $file | tail -n 1)
    # Echoing some stats about the tweets
    echo "Analyzing $count tweets by $username since $lastdate"
    

    Solution

    mkdir -p "./data-hold"
    file="./data-hold/$1-timeline.csv"
    rm -f $file # delete it just in case it exists
    
    echo "Fetching tweets for $1 into $file"
    t timeline $1 --csv -n 3200 > $file
    
    count=$(csvfix order -f 1 $file | wc -l)
    lastdate=$(csvfix order -fn 'Posted at' $file | tail -n 1)
    echo "Analyzing $count tweets by $1 since $lastdate"
    
    
    echo "Top 10 hashtags by $1"
    csvfix order -fn Text $file | tr '[:upper:]' '[:lower:]' | grep -oE '#[[:alnum:]_]+' | sort | uniq -c | sort -r | head -n 10
    
    echo "Top 10 retweeted users by $1"
    csvfix order -fn Text $file |  grep -oE 'RT @[[:alnum:]_]+' | tr '[:upper:]' '[:lower:]' | grep -oE '@[[:alnum:]_]+' | sort | uniq -c | sort -r | head -n 10
    
    echo "Top 10 mentioned users (not including retweets) by $1"
    csvfix order -fn Text $file |  grep -vE '\bRT\b' | \
    tr '[:upper:]' '[:lower:]' | \
    grep -vi "@$1" | \
    grep -oE '@[[:alnum:]_]+' | sort | uniq -c | sort -r | head -n 10
    
    
    echo "Top tweeted 10 words with 5+ letters by $1"
     csvfix order -fn Text $file | \
      tr '[:upper:]' '[:lower:]' | \
      sed -E 's/@[[:alnum:]]+//g' |  \
      sed -E 's/#[[:alnum:]]+//g' |   \
      sed -E 's/http[^ ]+//g' |   \
      grep -oE '[[:alpha:]]{5,}' | sort | uniq -c | \
      sort -rn | head -n 10