This is a walkthrough showing how to go from comma-delimited plaintext data to an interesting visualization.
The dataset for this scenario is a spreadsheet of all the known Starbucks locations in the world. To keep things simple, we filter it down to U.S. locations. Then, for each Starbucks location, we calculate its distance to every other Starbucks location so that we can find out how far away it is from the nearest other Starbucks. Here’s the filtered CSV.
And then we visualize it based on that interesting facet. Here’s an interactive CartoDB map in which the dots are bigger and more colorful the further away they are from other Starbucks.
Or, if you prefer, a Google Static Map showing the top 10 most isolated Starbucks:
This walkthrough is meant to be simple, requiring little more knowledge of the Python standard library than the csv module and the built-in sorted() function, and optionally, urllib and Requests, just to make fetching data a bit easier. I don’t even bother separating things into scripts; you can just type everything into IPython and watch things happen.
The algorithm of “take one Starbucks location, then calculate its distance to every other of the roughly 12,000 Starbucks locations in America” is what we call “brute force”. There are obviously better ways to do this, but if you aren’t well-versed in nearest-neighbor optimization routines, brute force isn’t an inelegant way to think about the problem, just an inefficient one in a computational sense. It works here because of the relatively few data points that we’re working with (i.e. under a million).
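To make the brute-force idea concrete, here is a minimal sketch using made-up (latitude, longitude) tuples rather than the real dataset; the actual routine later in this walkthrough does the same thing over dictionaries read from the CSV, with a proper distance formula:

# A minimal sketch of brute-force nearest-neighbor search, with made-up points.
# For every point, compare it against every other point and keep the closest.
points = [(61.2, -149.9), (47.6, -122.3), (40.7, -74.0)]

def distance(p, q):
    # Placeholder metric; the walkthrough uses the haversine formula instead.
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

for p in points:
    nearest = min((q for q in points if q != p), key=lambda q: distance(p, q))
    print(p, "->", nearest, round(distance(p, nearest), 2))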
The data can be found here at this Socrata portal, which is described as:
All Starbucks Locations in the World
An export of all Starbucks locations in the world. This dataset was scraped from the Starbucks website and is regularly updated.
Export the data and save it to our local drive as starbucks.csv:
import requests
DATA_URL = 'https://opendata.socrata.com/api/views/xy4y-c4mk/rows.csv?accessType=DOWNLOAD'
resp = requests.get(DATA_URL)
with open("starbucks.csv", 'w') as f:
f.write(resp.text)
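If you'd rather stick to the standard library (as noted above, Requests is optional), urllib can do the same fetch. This is just an alternative sketch:

# Standard-library alternative to Requests for downloading the CSV.
from urllib.request import urlretrieve

urlretrieve(DATA_URL, "starbucks.csv")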
Assuming that starbucks.csv contains the full dataset, we create a filtered version by selecting only rows that have a country of "US" and removing unnecessary columns. The routine below creates a new file named us-starbucks.csv:
HEADERS = ['City', 'Name', 'Latitude', 'Longitude', 'Store Number',
           'Street Combined', 'Postal Code', 'Country Subdivision',
           'Country', 'Ownership Type']
from csv import DictReader, DictWriter
with open("us-starbucks.csv", 'w') as w:
    dw = DictWriter(w, fieldnames=HEADERS)
    dw.writeheader()
    with open("starbucks.csv", 'r') as r:
        for row in DictReader(r):
            if row['Country'] == 'US':
                xo = {key: row[key] for key in HEADERS}
                dw.writerow(xo)
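As a quick, purely optional sanity check (assuming both files are in the working directory), you can count how many rows survived the filter:

# Optional sanity check: count the data rows in the filtered file.
with open("us-starbucks.csv", 'r') as f:
    n_rows = sum(1 for _ in f) - 1  # subtract 1 for the header row
print(n_rows, "U.S. Starbucks locations")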
This is the time-consuming stage. Given that us-starbucks.csv contains all U.S.-based Starbucks locations, the task is: for each Starbucks location, find the nearest other Starbucks and calculate the distance in kilometers.
Another way to state this is: for every U.S. Starbucks location, sort the list of all U.S. Starbucks locations in ascending order of distance to the given location, then record the result in a new column/attribute named nearest_store_distance_km.
How do we find the distance between any two given Starbucks? The haversine formula calculates the great-circle distance between two points given their latitudes and longitudes. This StackOverflow question has a nice explanation and implementation for Python: Haversine Formula in Python (Bearing and Distance between two GPS points)
What I've done below is create a function that takes arguments a and b and expects them both to be dictionaries with 'Longitude' and 'Latitude' keys, because I know each row in our dataset contains those key-value pairs:
def haversine_on_rows(a, b):
    """Return the great-circle distance in kilometers between two rows,
    each a dict with 'Latitude' and 'Longitude' keys."""
    from math import radians, cos, sin, asin, sqrt
    lon1 = radians(a['Longitude'])
    lon2 = radians(b['Longitude'])
    lat1 = radians(a['Latitude'])
    lat2 = radians(b['Latitude'])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(h))
    r = 6371  # radius of Earth in kilometers
    # return the final calculation
    return c * r
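A quick sanity check of the function with two made-up rows (the coordinates below are hypothetical, not pulled from the dataset):

# Hypothetical rows just to exercise the function; roughly Seattle and Portland.
row_a = {'Latitude': 47.61, 'Longitude': -122.33}
row_b = {'Latitude': 45.52, 'Longitude': -122.68}
print(round(haversine_on_rows(row_a, row_b), 1), "km")  # roughly 233 km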
Assuming that we're starting in a fresh Python session/script, and the above function has been interpreted, let's set the variables and requirements:
HEADERS = ['City', 'Name', 'Latitude', 'Longitude', 'Store Number',
           'Street Combined', 'Postal Code', 'Country Subdivision',
           'Country', 'Ownership Type',
           'nearest_store_number', 'nearest_store_distance_km']
from csv import DictReader, DictWriter
To get the store data, we just read in us-starbucks.csv as a CSV file. However, I do a little preprocessing first. Remember that CSV files (unlike JSON) are just plaintext values separated by commas, so every value is deserialized as a string – the csv module doesn't assume that a value that looks like a number is actually a number.
So in the routine below, I create a new list named drows, and for each row (dictionary) in the dataset, I manually convert Longitude and Latitude to floats, then append it to the drows list. That saves me from having to do string-to-float conversions later on (i.e. in the haversine function).
drows = []
with open("us-starbucks.csv", 'r') as rf:
    for row in DictReader(rf):
        row['Longitude'] = float(row['Longitude'])
        row['Latitude'] = float(row['Latitude'])
        drows.append(row)

with open("us-starbucks-distant.csv", 'w') as w:
    dw = DictWriter(w, fieldnames=HEADERS)
    dw.writeheader()
    for i, row in enumerate(drows):
        # sorting by distance puts the store itself (distance 0) at index 0,
        # so index 1 is the actual nearest store
        n_row = sorted(drows, key=lambda x: haversine_on_rows(row, x))[1]
        row['nearest_store_number'] = n_row['Store Number']
        # redundant calculation but oh well
        row['nearest_store_distance_km'] = round(haversine_on_rows(row, n_row), 1)
        print(i, row['nearest_store_distance_km'])
        print("\t", row['City'], row['Country Subdivision'], row['Name'])
        print("\t", n_row['City'], n_row['Country Subdivision'], n_row['Name'])
        dw.writerow(row)
You can download my filtered CSV here.
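Since we only need the single nearest neighbor, a small speedup (not what I used above, just an alternative sketch) is to take the minimum instead of sorting the whole list for every store. It's still brute force overall, but it avoids the per-store sort:

# Alternative to the full sort: min() scans drows once per store instead of
# sorting all of them, and skips the store being compared to itself.
for row in drows:
    n_row = min((x for x in drows if x is not row),
                key=lambda x: haversine_on_rows(row, x))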
Now that the us-starbucks-distant.csv file has been created, we can read it, sort it by the nearest_store_distance_km column, and get a quick list of the top 10 most isolated Starbucks:
from csv import DictReader
with open("us-starbucks-distant.csv", 'r') as r:
    rows = sorted(DictReader(r),
                  key=lambda x: float(x['nearest_store_distance_km']),
                  reverse=True)

print("The 10 most isolated Starbucks in America are:")
for i, row in enumerate(rows[0:10]):
    print("%s. %s km -- %s, %s (%s)" % (str(i + 1).rjust(2),
                                        row['nearest_store_distance_km'].rjust(5),
                                        row['City'],
                                        row['Country Subdivision'],
                                        row['Name']))
The output:
The 10 most isolated Starbucks in America are:
1. 377.5 km -- Ketchikan, AK (Carrs-Ketchikan #1818)
2. 209.3 km -- Kodiak, AK (Safeway-Kodiak #1090)
3. 183.2 km -- Del Rio, TX (Veterans & Chevrolet)
4. 165.7 km -- Page, AZ (Safeway - Page #249)
5. 155.1 km -- Garden City, KS (Target Garden City T-906)
6. 155.1 km -- Colby, KS (Oasis Travel Center)
7. 151.2 km -- Dickinson, ND (Family Fare - Dickinson #3123)
8. 149.7 km -- Hays, KS (Fort Hays State University)
9. 147.7 km -- North Platte, NE (North Platte-I80 & S Hwy 83)
10. 145.4 km -- Gallup, NM (Safeway-Gallup #1743)
The Google Static Maps API lets us create a URL that points to a map image with custom markers. We'll follow the same strategy as above and create markers labeled A through J to mark the most isolated Starbucks in descending order.
from urllib.parse import urlencode
from csv import DictReader
from string import ascii_uppercase

with open("us-starbucks-distant.csv", 'r') as r:
    rows = sorted(DictReader(r),
                  key=lambda x: float(x['nearest_store_distance_km']),
                  reverse=True)[0:10]

markers = []
for i, row in enumerate(rows):
    letter = ascii_uppercase[i]
    markers.append("label:" + letter + '|' + row['Latitude'] + ',' + row['Longitude'])

gmap_url = 'https://maps.googleapis.com/maps/api/staticmap'
myparams = {'size': '800x400', 'markers': markers}
query_str = urlencode(myparams, doseq=True)
print(gmap_url + '?' + query_str)
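Depending on when you run this, the Static Maps API may also require an API key; if so, it can be added as one more query parameter (the key value below is a placeholder):

# The Static Maps API may require an API key; 'YOUR_API_KEY' is a placeholder.
myparams['key'] = 'YOUR_API_KEY'
query_str = urlencode(myparams, doseq=True)
print(gmap_url + '?' + query_str)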
The resulting URL:
Which looks like this:
This is as simple as uploading the CSV into CartoDB. You can see my finished map here.
Here's a screenshot of that map: