What your project will look like:
https://github.com/compciv/show-me-earthquakes
My repo for using the IBM Watson Speech to Text API to make supercuts of videos: https://github.com/dannguyen/watson-word-watcher
New guide: Sorting Python collections with the sorted() function.
Use it to finish this assignment.
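As a quick refresher (my own minimal sketch, not from the guide): sorted() is a built-in function that takes any iterable and an optional key function, and returns a new list.

names = ['Cruz', 'bernie', 'Rubio', 'hillary']
# default string sort is case-sensitive: uppercase letters sort first
print(sorted(names))
# pass a key function to sort case-insensitively
print(sorted(names, key=str.lower))
# sort by length of name, longest first
print(sorted(names, key=len, reverse=True))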
Here's an example geocoder for Mapzen, and here's one for Google.
Random trivia: last year, I had students try to write 100 different data scrapers: https://github.com/compjour/search-script-scrape. At this point, you pretty much know everything you need to do these, except for a few very specific skills, such as how to parse HTML (i.e. deserialize it into data) or how to read an Excel file as just a regular CSV file. Go ahead and take a look at it if you're curious.
Check out the assignments page for the relevant readings about earthquakes and bots.
Related:
The CSV feeds come from the USGS:
http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php
import csv
import requests
from urllib.parse import urlencode
import webbrowser

usgs_url = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_month.csv'
resp = requests.get(usgs_url)
lines = resp.text.splitlines()
# collect each quake's coordinates as a "lat,lng" string
coordinate_pairs = []
for quake in csv.DictReader(lines):
    coordinate_pairs.append(quake['latitude'] + ',' + quake['longitude'])

# contact Google's Static Maps API
endpoint_url = 'https://maps.googleapis.com/maps/api/staticmap'
query_string = urlencode(
    {'size': '800x500', 'markers': coordinate_pairs},
    doseq=True)  # doseq=True repeats the markers key for each coordinate pair
url = endpoint_url + '?' + query_string
webbrowser.open(url)
Read about functions for Friday.
Homework answers: https://github.com/compciv/2016.compciv.org/tree/master/data/homework/answers
Why do some datasets have APIs and others don't?
Example APIs:
What services don't have an API?
Example search for Clinton:
http://api.nytimes.com/svc/search/v2/articlesearch.json?api-key=YOUR_API_KEY&q=Clinton
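Here's a minimal sketch of making that call with requests; the 'response' and 'docs' keys reflect my reading of the v2 response format, so double-check against the official docs:

import requests
endpoint = 'http://api.nytimes.com/svc/search/v2/articlesearch.json'
resp = requests.get(endpoint, params={'api-key': 'YOUR_API_KEY', 'q': 'Clinton'})
data = resp.json()  # deserialize the JSON response
# assuming the v2 format: results live in data['response']['docs']
for doc in data['response']['docs']:
    print(doc['headline']['main'])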
Official docs for Twitter data:
Sample data files
Using @realdonaldtrump as an example
import requests
import json
ROOT_URL_PATH = 'http://stash.compciv.org/samples/twitter/'
url = ROOT_URL_PATH + 'realDonaldTrump-profile.json'
# Download it
resp = requests.get(url)
# deserialize it
data = json.loads(resp.text)
# to get the followers count:
print(data['followers_count'])
# to get the text of his latest tweet
print(data['status']['text'])
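If you want to eyeball the rest of the object's structure, one trick (continuing from the snippet above; not part of the assignment) is to re-serialize it with indentation:

# pretty-print the first 500 characters of the re-serialized object
print(json.dumps(data, indent=2)[:500])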
It's more fun to search across a list of objects, such as a list of tweets:
http://stash.compciv.org/samples/twitter/realDonaldTrump-tweets.json
url = ROOT_URL_PATH + 'realDonaldTrump-tweets.json'
resp = requests.get(url)
tweets = json.loads(resp.text)
tnum = 0
for tweet in tweets:
    # case-insensitive check for the term in the tweet text
    if 'cruz' in tweet['text'].lower():
        tnum += 1
print("cruz was mentioned", tnum, 'times')
By using more variables and abstracting the search term out of the loop, we can count many terms with the same code:
words = ['hillary', 'bernie', 'cruz', 'jeb', 'rubio',
         'iowa', 'god', 'isis', 'megyn', 'loser',
         'idiot', 'dumb']
for word in words:
    tnum = 0
    for tweet in tweets:
        if word in tweet['text'].lower():
            tnum += 1
    print(word, "was mentioned", tnum, "times")
The output:
hillary was mentioned 5 times
bernie was mentioned 1 times
cruz was mentioned 33 times
jeb was mentioned 4 times
rubio was mentioned 4 times
iowa was mentioned 34 times
god was mentioned 1 times
isis was mentioned 0 times
megyn was mentioned 11 times
loser was mentioned 0 times
idiot was mentioned 1 times
dumb was mentioned 2 times
Create a dictionary that, for each value of source (e.g. "Android"), increments the count by 1, starting from 0:
mydict = {}
for tweet in tweets:
    s = tweet['source']
    if mydict.get(s):
        mydict[s] += 1
    else:
        mydict[s] = 1
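For what it's worth, the standard library has a shortcut for this exact pattern; a sketch using collections.Counter on the same tweets list (same counts, less bookkeeping):

from collections import Counter
source_counts = Counter(tweet['source'] for tweet in tweets)
print(source_counts)

You could also keep the plain dictionary and collapse the if/else into mydict[s] = mydict.get(s, 0) + 1.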
San Francisco's restaurant inspection data is an example
The official scorecard and database page for SF restaurant inspections: score explanations; searchable database
Here's a mirror of the zipfile that you can download:
http://stash.compciv.org/sf/SFFoodProgram_Complete_Data.zip
Here's the data uploaded to a Google spreadsheet.
from os import makedirs
from os.path import join
DATA_DIR = 'sf-food'  # join() is used below to build paths inside this directory
makedirs(DATA_DIR, exist_ok=True)
Let's download the file
import requests
URL = 'http://stash.compciv.org/sf/SFFoodProgram_Complete_Data.zip'
resp = requests.get(URL)
# save the file
zip_name = join(DATA_DIR, 'programdata.zip')
f = open(zip_name, 'wb')
f.write(resp.content)
f.close()
Unzip it
# unzip it
from shutil import unpack_archive
unpack_archive(zip_name, extract_dir=DATA_DIR)
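As a sanity check (my own habit, not a required step), list the data directory to confirm what unpack_archive extracted:

import os
print(os.listdir(DATA_DIR))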
Use the csv.reader() constructor function:
import csv
# first read the file
fname = join(DATA_DIR, 'violations_plus.csv')
f = open(fname, 'r')
lines = f.read().splitlines() # preferable to f.readlines()
f.close()
# then parse the lines into lists of values
rows = list(csv.reader(lines))
To print the 'description' field, we have to remember that it is the 5th field (i.e. index 4):
for row in rows:
    # note: csv.reader doesn't skip the header row,
    # so the column name 'description' itself prints first
    print(row[4])
Use the csv.DictReader() constructor function:
import csv
# first read the file
fname = join(DATA_DIR, 'violations_plus.csv')
f = open(fname, 'r')
lines = f.read().splitlines() # preferable to f.readlines()
f.close()
# then parse the lines into dictionaries, keyed by the header row
rows = list(csv.DictReader(lines))
To print the 'description' field:
for row in rows:
    print(row['description'])
To aggregate the number of occurrences per type of description, let's use a dictionary:
mydict = {}
for row in rows:
    desc = row['description']
    if mydict.get(desc):
        mydict[desc] += 1
    else:
        mydict[desc] = 1
One way to sort it:
from operator import itemgetter
mylist = list(mydict.items())
# sort by the description text, alphabetical order
x = sorted(mylist, key=itemgetter(0))
# sort in ascending order of count
y = sorted(mylist, key=itemgetter(1))
# sort in descending order of count
z = sorted(mylist, key=itemgetter(1), reverse=True)
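Continuing from the above, to print the five most common descriptions from the descending sort (my own follow-up, not in the original notes):

# z is sorted by count, descending
for desc, count in z[:5]:
    print(count, desc)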
Example dataset: Starbucks
Direct URL to CSV: https://opendata.socrata.com/api/views/xy4y-c4mk/rows.csv?accessType=DOWNLOAD
import requests
from csv import DictReader
URL = 'https://opendata.socrata.com/api/views/xy4y-c4mk/rows.csv?accessType=DOWNLOAD'
def haversine(lon1, lat1, lon2, lat2):
    from math import radians, cos, sin, asin, sqrt
    # convert decimal degrees to radians
    lon1 = radians(lon1)
    lon2 = radians(lon2)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # radius of Earth in kilometers
    # return the great-circle distance in kilometers
    return c * r
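A quick sanity check of the function, using coordinates for San Francisco and Los Angeles; the great-circle distance should come out to roughly 560 kilometers:

# note the argument order: longitude first, then latitude
sf_lon, sf_lat = -122.4194, 37.7749
la_lon, la_lat = -118.2437, 34.0522
print(haversine(sf_lon, sf_lat, la_lon, la_lat))

From there, you could loop over the Starbucks CSV with DictReader and compute each store's distance from a given point (the exact column names depend on the Socrata file).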
(note: I haven't gotten around to this stuff)
FEC data for past cycles, and some of the current one, can be downloaded in bulk: http://www.fec.gov/finance/disclosure/ftpdet.shtml#a2015_2016
There's more than one "Trump", but here's what you'll find for his main presidential committee:
Individual committee data requires a bit more parsing: http://docquery.fec.gov/cgi-bin/forms/DL/1047287/
These stories come from end-of-year filings: