Notes for today

A place to keep links and notes useful for class, before I port them to tutorials

What your project will look like:

https://github.com/compciv/show-me-earthquakes


Off-topic notes

My repo for using the IBM Watson Speech to Text API to make supercuts of videos: https://github.com/dannguyen/watson-word-watcher

This Thursday (Feb. 11)

New guide: Sorting Python collections with the sorted() function.

Use it to finish this assignment.
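
Here's a minimal sketch of sorted() with a key function; the quakes list is just made-up sample data for illustration:

from operator import itemgetter
quakes = [{'place': 'Chile', 'mag': 6.1}, {'place': 'Alaska', 'mag': 4.5}, {'place': 'Japan', 'mag': 5.2}]
# sort the list of dicts by magnitude, biggest quake first
for q in sorted(quakes, key=itemgetter('mag'), reverse=True):
    print(q['place'], q['mag'])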

Here's an example of a geocoder that uses Mapzen, and here's one that uses Google's geocoder.
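
A minimal sketch of hitting Google's geocoding API with requests (the address is arbitrary, and Google may require an API key parameter):

import requests
endpoint = 'https://maps.googleapis.com/maps/api/geocode/json'
resp = requests.get(endpoint, params={'address': 'Stanford, CA'})
data = resp.json()
# the first result's coordinates live under results -> geometry -> location
location = data['results'][0]['geometry']['location']
print(location['lat'], location['lng'])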

Random trivia: last year, I had students try to write 100 different data scrapers: https://github.com/compjour/search-script-scrape. At this point, you pretty much know everything you need to do these, except for a few specific techniques, such as how to parse HTML (i.e. deserialize it into data) or how to read an Excel file as if it were a regular CSV file. Go ahead and take a look if you're curious.

This week

Check out the assignments page for the relevant readings about earthquakes and bots.

Related:

How to map a bunch of earthquakes at once

The CSV feeds come from the USGS:

http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php

import csv
import requests
from urllib.parse import urlencode
import webbrowser

# download the USGS feed of significant earthquakes from the past month
usgs_url = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_month.csv'
resp = requests.get(usgs_url)
lines = resp.text.splitlines()
# collect each quake's coordinates as a "latitude,longitude" string
coordinate_pairs = []
for quake in csv.DictReader(lines):
    coordinate_pairs.append(quake['latitude'] + ',' + quake['longitude'])
# build a Google Static Maps URL with a marker for each quake
endpoint_url = 'https://maps.googleapis.com/maps/api/staticmap'
query_string = urlencode(
                {'size': '800x500', 'markers': coordinate_pairs},
                doseq=True)
url = endpoint_url + '?' + query_string
# open the map image in the browser
webbrowser.open(url)

From last week

Read about functions for Friday.

Homework answers: https://github.com/compciv/2016.compciv.org/tree/master/data/homework/answers

APIs

Why do some datasets have APIs and others don't?

How an engineer uses Tinder:

Example APIs:

What services don't have an API?

NYTimes Articles API

Example of using Google's geocoder


Twitter data as JSON

Official docs for Twitter data:

Sample data files

Using @realdonaldtrump as an example

Sample JSON deserialization code

import requests
import json
ROOT_URL_PATH = 'http://stash.compciv.org/samples/twitter/'
url = ROOT_URL_PATH + 'realDonaldTrump-profile.json'

# Download it
resp = requests.get(url)
# deserialize it
data = json.loads(resp.text)
# to get the followers count:
print(data['followers_count'])
# to get the text of his latest tweet
print(data['status']['text'])

Sample data searches

It's more fun to search across a list of objects, such as a list of tweets:

http://stash.compciv.org/samples/twitter/realDonaldTrump-tweets.json

How many times has a recent Trump tweet mentioned 'cruz'?

url = ROOT_URL_PATH + 'realDonaldTrump-tweets.json'
resp = requests.get(url)
tweets = json.loads(resp.text)
tnum = 0
for tweet in tweets:
    if 'cruz' in tweet['text'].lower():
        tnum += 1

print("cruz was mentioned", tnum, 'times')

Loop within a loop for multiple terms

By abstracting the search term into a variable and adding an outer loop, we can count several terms at once:

words = ['hillary', 'bernie', 'cruz', 'jeb', 'rubio', 
         'iowa', 'god', 'isis', 'megyn', 'loser',
         'idiot', 'dumb']
for word in words:
    tnum = 0
    for tweet in tweets:
        if word in tweet['text'].lower():
            tnum += 1
    print(word, "was mentioned", tnum, "times")

The output:

hillary was mentioned 5 times
bernie was mentioned 1 times
cruz was mentioned 33 times
jeb was mentioned 4 times
rubio was mentioned 4 times
iowa was mentioned 34 times
god was mentioned 1 times
isis was mentioned 0 times
megyn was mentioned 11 times
loser was mentioned 0 times
idiot was mentioned 1 times
dumb was mentioned 2 times

Which Twitter client does Trump use?

Create a dictionary that, for each value of source (e.g. "Android"), increments the count by 1, starting from 0:

mydict = {}
for tweet in tweets:
    s = tweet['source']
    if mydict.get(s):        
        mydict[s] += 1
    else:
        mydict[s] = 1
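
To see the tallies, here's one quick sketch that sorts the dictionary's items by count, biggest first:

from operator import itemgetter
# turn the tally dict into (source, count) tuples and sort by count, descending
for source, count in sorted(mydict.items(), key=itemgetter(1), reverse=True):
    print(source, count)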

Working with CSV info

Official documentation

Washington U CSE140 example

SF restaurant inspections

San Francisco's restaurant inspection data is the example dataset we'll work with.

The official scorecard and database page for SF restaurant inspections: score explanations; searchable database

Here's a mirror of the zipfile that you can download:

http://stash.compciv.org/sf/SFFoodProgram_Complete_Data.zip

Here's the data uploaded to a Google spreadsheet.

Code snippets

from os import makedirs
from os.path import join

DATA_DIR = join("sf-food")
makedirs(DATA_DIR, exist_ok=True)

Let's download the file

import requests
URL = 'http://stash.compciv.org/sf/SFFoodProgram_Complete_Data.zip'
resp = requests.get(URL)

# save the file
zip_name = join(DATA_DIR, 'programdata.zip')
f = open(zip_name, 'wb')
f.write(resp.content)
f.close()

Unzip it

# unzip it
from shutil import unpack_archive
unpack_archive(zip_name, extract_dir=DATA_DIR)
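
To confirm what got unpacked, a quick sketch using glob (this assumes the archive puts its CSV files directly into DATA_DIR):

from glob import glob
from os.path import join
# list every CSV that the zip unpacked into our data directory
for fname in glob(join(DATA_DIR, '*.csv')):
    print(fname)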

Sample CSV deserialization code

Convert to a list of lists, no headers

Use csv.reader() constructor function

import csv
# first read the file
fname = join(DATA_DIR, 'violations_plus.csv')
f = open(fname, 'r')
lines = f.read().splitlines()  # preferable to f.readlines()
f.close()

# then parse the CSV lines into a list of lists
rows = list(csv.reader(lines))

To print the 'description' field, we have to remember that it is the 5th field (i.e. index 4):

for row in rows:
    print(row[4])

Convert to list of dicts, using headers

Use the csv.DictReader() constructor function

import csv
# first read the file
fname = join(DATA_DIR, 'violations_plus.csv')
f = open(fname, 'r')
lines = f.read().splitlines()  # preferable to f.readlines()
f.close()

# then parse the CSV lines into a list of dictionaries
rows = list(csv.DictReader(lines))

To print the 'description' field:

for row in rows:
    print(row['description'])

Sorting it

To aggregate the number of occurrences per type of description, let's use a dictionary:

mydict = {}
for row in rows:
    desc = row['description']
    if mydict.get(desc):
        mydict[desc] += 1
    else: 
        mydict[desc] = 1

One way to sort it:

from operator import itemgetter
mylist = list(mydict.items())

# sort by the description text, alphabetical order
x = sorted(mylist, key=itemgetter(0))

# sort in ascending order of count
y = sorted(mylist, key=itemgetter(1)) 

# sort in descending order of count
z = sorted(mylist, key=itemgetter(1), reverse=True) 
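
And to peek at, say, the 5 most frequent descriptions from that descending sort:

# z is sorted in descending order of count, so slice off the top 5
for desc, count in z[0:5]:
    print(count, desc)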

More CSVs: Starbucks

Example dataset: Starbucks

Direct URL to CSV: https://opendata.socrata.com/api/views/xy4y-c4mk/rows.csv?accessType=DOWNLOAD

import requests
from csv import DictReader
URL = 'https://opendata.socrata.com/api/views/xy4y-c4mk/rows.csv?accessType=DOWNLOAD'
# download the CSV and deserialize it into a list of dicts
resp = requests.get(URL)
rows = list(DictReader(resp.text.splitlines()))

Haversine

def haversine(lon1, lat1, lon2, lat2):
    # great-circle distance in kilometers between two points,
    # given as decimal-degree longitude/latitude pairs
    from math import radians, cos, sin, asin, sqrt
    lon1 = radians(lon1)
    lon2 = radians(lon2)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # radius of the Earth in kilometers
    # return the final calculation
    return c * r
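
As a sketch of how haversine() could be combined with the Starbucks rows from above, here's one way to find the store closest to an arbitrary point. The 'Latitude' and 'Longitude' column names (and the example coordinates) are assumptions; check the CSV's actual headers before running this:

# example reference point (arbitrary; roughly the Stanford area)
my_lon, my_lat = -122.17, 37.43
closest_row = None
closest_km = None
for row in rows:
    # 'Longitude' and 'Latitude' are assumed header names; adjust to the real ones
    if not row['Longitude'] or not row['Latitude']:
        continue
    km = haversine(float(row['Longitude']), float(row['Latitude']), my_lon, my_lat)
    if closest_km is None or km < closest_km:
        closest_row = row
        closest_km = km
print(closest_km, closest_row)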

FEC Data

(note: I haven't gotten around to this stuff)

FEC data for past and some of the current cycle can be downloaded in bulk: http://www.fec.gov/finance/disclosure/ftpdet.shtml#a2015_2016

Here's the candidate, committee, candidate-to-committee lookup, and operation expenditures for Trump's committee in a Google Spreadsheet.

There's more than one "Trump", but here's what you'll find for his main presidential committee:

Working with individual committee data

Individual committee data requires a bit more parsing: http://docquery.fec.gov/cgi-bin/forms/DL/1047287/

Sample FEC end-of-year filings stories

These stories come from end of year filings: