

An empirical metric for urban transit accessibility

In this notebook, we will investigate an empirical method to quantify urban public transit accessibility.

Many existing metrics for this purpose are based on a location's proximity to transit stops and the prevalence of sidewalks in a given area. While these are useful heuristics, they do not capture how well the available transit service actually performs for real-world journeys.

By contrast, the method presented here samples the best-route calculations offered by Google Maps across thousands of hypothetical journeys within a city. These journeys are used to define a transit factor for any given area of the city: the projected durations of transit journeys started within the area, normalized by the corresponding drive times.
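
Concretely, the transit factor of a single journey is its projected transit duration divided by its projected driving duration. A worked example, using the first sampled journey from the data set in the Analysis section below:

duration_driving = 689   # Seconds by car (row 0 of the data below).
duration_transit = 2592  # Seconds by transit, including the initial wait.
transit_factor = duration_transit / duration_driving  # ~3.76x slower than driving.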

The un-exported notebook is available here. Data CSV here.

Setup

In [1]:
!python3 -m pip install branca folium geopy googlemaps h3 numpy pandas pytz requests
!python3 -m pip install shapely
In [2]:
import csv
import json
import random
import re
from collections import defaultdict
from datetime import datetime

import branca
import folium
import googlemaps
import h3
import numpy as np
import pandas as pd
import requests
from geopy.distance import great_circle
from pytz import timezone
from shapely.geometry import Point, Polygon

Sampling

See code comments for details. You can find SF.geojson here.

In [3]:
# Create a boundary polygon for SF.
boundary_geometry_SF = json.load(open("SF.geojson"))
boundary_polygon_SF = boundary_geometry_SF["coordinates"][0]
# GeoJSON stores (longitude, latitude) pairs; flip to (latitude, longitude).
boundary_polygon_SF = [[point[1], point[0]] for point in boundary_polygon_SF]
# Close the ring.
boundary_polygon_SF += [boundary_polygon_SF[0]]
latitudes = [point[0] for point in boundary_polygon_SF]
longitudes = [point[1] for point in boundary_polygon_SF]
polygon_structure_SF = Polygon(boundary_polygon_SF)

# Compute the bounding box for the polygon (and store its ranges for rejection
# sampling).
min_latitude_SF = min(latitudes)
max_latitude_SF = max(latitudes)
min_longitude_SF = min(longitudes)
max_longitude_SF = max(longitudes)
latitude_range_SF = (min_latitude_SF, max_latitude_SF)
longitude_range_SF = (min_longitude_SF, max_longitude_SF)
bounding_box = [
    [min_latitude_SF, min_longitude_SF],
    [min_latitude_SF, max_longitude_SF],
    [max_latitude_SF, max_longitude_SF],
    [max_latitude_SF, min_longitude_SF],
    [min_latitude_SF, min_longitude_SF],
]

# Plot the boundary and its bounding box.
center_latitude = sum(latitudes) / len(latitudes)
center_longitude = sum(longitudes) / len(longitudes)
viz_map = folium.Map(
    location=[center_latitude, center_longitude], zoom_start=12, tiles="cartodbpositron"
)
polyline = folium.PolyLine(locations=boundary_polygon_SF, weight=4, color="green")
viz_map.add_child(polyline)
polyline = folium.PolyLine(locations=bounding_box, weight=4, color="black")
viz_map.add_child(polyline);
In [4]:
def sample_point(polygon_structure, latitude_range, longitude_range):
    """Rejection sampling (generate a point within the bounding box and reject it if
    it's outside the polygon).
    """
    while True:
        latitude = random.uniform(*latitude_range)
        longitude = random.uniform(*longitude_range)
        if polygon_structure.contains(Point(latitude, longitude)):
            return (latitude, longitude)
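
As an aside, the expected number of draws per accepted point is roughly the area of the bounding box divided by the area of the polygon. A quick empirical check of the acceptance rate (a sketch, not part of the original notebook):

# Estimate how often a uniform draw from the bounding box lands inside SF.
attempts = 10_000
accepted = sum(
    polygon_structure_SF.contains(
        Point(random.uniform(*latitude_range_SF), random.uniform(*longitude_range_SF))
    )
    for _ in range(attempts)
)
print(f"Acceptance rate: {accepted / attempts:.2f}")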
In [5]:
# Plot a bunch of random points within SF (as a sanity check of our sampling method).
for _ in range(1000):
    point = sample_point(polygon_structure_SF, latitude_range_SF, longitude_range_SF)
    marker = folium.CircleMarker(point, radius=1)
    viz_map.add_child(marker)
viz_map
Out[5]:
[Interactive folium map: the SF boundary (green), its bounding box (black), and 1,000 sampled points.]
In [6]:
# Setup access to the Google Maps API.
GOOGLE_API_KEY = "YOUR_API_KEY_HERE"
maps = googlemaps.Client(key=GOOGLE_API_KEY)
In [7]:
def run_trial(tzinfo, polygon_structure, latitude_range, longitude_range):
    """Samples the driving and transit durations for a random origin/destination pair
    via the Google Maps Directions API.

    Triggers up to 2 requests to the Directions API (the free tier will permit up to
    20K calls to this function).
    """
    point_1 = sample_point(polygon_structure, latitude_range, longitude_range)
    point_2 = sample_point(polygon_structure, latitude_range, longitude_range)
    sampled_at = int(datetime.now().timestamp())

    # Find driving duration between the points as a baseline.
    driving_result = maps.directions(point_1, point_2, mode="driving")
    if len(driving_result) == 0:
        print(f"Driving result failed for {point_1} to {point_2}.")
        return None
    duration_driving = driving_result[0]["legs"][0]["duration"]["value"]

    # IMPORTANT: I sampled this data on a Tuesday so the data covers a Saturday, a
    # Sunday, and a Monday. Change the random departure time sampling accordingly for
    # your own purposes.
    today_midnight = int(
        tzinfo.localize(
            datetime.combine(datetime.now(), datetime.min.time())
        ).timestamp()
    )
    three_days_ago_midnight = today_midnight - 3 * 24 * 60 * 60
    random_departure_time = int(random.uniform(three_days_ago_midnight, today_midnight))

    directions_results = maps.directions(
        point_1, point_2, departure_time=random_departure_time, mode="transit"
    )
    if len(directions_results) == 0:
        return (
            sampled_at,
            *point_1,
            *point_2,
            random_departure_time,
            duration_driving,
            None,
            None,
            None,
        )
    else:
        directions_result = directions_results[0]

    # Some additional interesting data is the cost of various rides.
    if "fare" in directions_result:
        fare_value = directions_result["fare"]["value"]
        fare_currency = directions_result["fare"]["currency"]
    else:
        fare_value = None
        fare_currency = None

    directions_leg = directions_result["legs"][0]
    if "arrival_time" in directions_leg:
        # This allows us to include initial waiting time for transit directions.
        duration_transit = (
            directions_leg["arrival_time"]["value"] - random_departure_time
        )
    else:
        # Typically falls into this case with walking-only transit directions.
        duration_transit = directions_leg["duration"]["value"]

    # Used to write a CSV row.
    return (
        sampled_at,
        *point_1,
        *point_2,
        random_departure_time,
        duration_driving,
        duration_transit,
        fare_value,
        fare_currency,
    )
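
As a sanity check on the docstring's 20K figure, here's the budget arithmetic (a sketch assuming 2020-era Google Maps Platform pricing: a $200 monthly credit and roughly $5 per 1,000 Directions requests; verify current pricing before relying on this):

# Assumed pricing; both figures below are 2020-era assumptions, not guaranteed current.
monthly_credit_usd = 200
price_per_request_usd = 5 / 1000
request_budget = monthly_credit_usd / price_per_request_usd  # 40,000 requests.
trial_budget = request_budget / 2  # Up to 2 requests per trial -> 20,000 trials.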
In [8]:
# UNCOMMENT ME: Test out the data collection before we spam it...
# tzinfo = timezone("America/Los_Angeles")
# result_test = run_trial(
#     tzinfo, polygon_structure_SF, latitude_range_SF, longitude_range_SF
# )
# result_test
In [9]:
# UNCOMMENT ME: Sample a bunch of journeys!
# num_trials = 5000
# with open("data.csv", "a") as csv_file:
#     writer = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
#     # Write the header row if the file is new/empty (the Analysis section below
#     # reads this CSV with pandas and expects named columns).
#     if csv_file.tell() == 0:
#         writer.writerow([
#             "sampled_at", "point_1_latitude", "point_1_longitude",
#             "point_2_latitude", "point_2_longitude", "departure_time",
#             "duration_driving", "duration_transit", "fare_value", "fare_currency",
#         ])
#     for i in range(num_trials):
#         result = run_trial(
#             tzinfo, polygon_structure_SF, latitude_range_SF, longitude_range_SF
#         )
#         if result is None:
#             continue
#         writer.writerow(result)
#         if i % 10 == 9:
#             csv_file.flush()
#             print(f"Finished trial {i + 1}.")

Analysis

You can find my data file here.

Here's the raw Pandas dataframe we end up with.

In [10]:
data_raw = pd.read_csv("data.csv")
data_raw = data_raw.replace({np.nan: None})
data_raw[:10]
Out[10]:
   sampled_at  point_1_latitude  point_1_longitude  point_2_latitude  point_2_longitude  departure_time  duration_driving  duration_transit  fare_value  fare_currency
0  1601423723           37.7187           -122.408           37.7539            -122.38      1601168430               689              2592           3            USD
1  1601423723           37.7792             -122.5            37.777           -122.481      1601121762               305              1735           3            USD
2  1601423724            37.743           -122.449           37.7814           -122.435      1601216863               910              2393           3            USD
3  1601423724           37.7366           -122.488           37.8003           -122.433      1601202271              1305              4302           3            USD
4  1601423724           37.7093           -122.416            37.723           -122.433      1601137070               356              1826           3            USD
5  1601423725           37.7288           -122.368           37.7603           -122.419      1601183869               917              3201           3            USD
6  1601423726           37.7447           -122.477           37.7769           -122.495      1601321037               566              2522           3            USD
7  1601423726           37.7742           -122.479           37.7268           -122.368      1601242806              1420              4717           3            USD
8  1601423726           37.7398           -122.404           37.7354           -122.408      1601121815               358               653        None           None
9  1601423726           37.7128           -122.467           37.7271           -122.378      1601147905               875              4215           3            USD

Let's start by just plotting a bunch of routes originating from an area (with a few constraints).

See code comments.

In [11]:
# Let's look at all routes within a 2 hour duration.
# We use this constraint throughout the notebook; higher durations are often invalid
# routings that are active only on a different day.
data = data_raw[data_raw["duration_transit"] <= 2 * 60 * 60]
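# Each route tuple is ((lat, lng) origin, (lat, lng) destination, duration_driving,
# duration_transit); the column indices follow the CSV header shown above.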
routes = [((row[1], row[2]), (row[3], row[4]), row[6], row[7]) for row in data.values]

# And in a 0.5 mile radius around a particular Safeway.
safeway = (37.782490, -122.430809)
margin = 0.5  # In miles.
nearby_routes = [
    route for route in routes if great_circle(safeway, route[0]).miles <= margin
]

# Plot Safeway.
route_viz_map = folium.Map(
    location=[center_latitude, center_longitude], zoom_start=12, tiles="cartodbpositron"
)
marker = folium.CircleMarker(safeway, radius=1, color="darkorange")
route_viz_map.add_child(marker)

# Plot route start and end points.
for nearby_route in nearby_routes:
    marker = folium.CircleMarker(nearby_route[0], radius=1)
    route_viz_map.add_child(marker)

    marker = folium.CircleMarker(nearby_route[1], radius=1, color="red")
    route_viz_map.add_child(marker)

route_viz_map
Out[11]:
[Interactive folium map: origins (blue) and destinations (red) of routes starting within 0.5 miles of the Safeway marker (orange).]

Transit Factor

The proposed transit factor expresses the multiplier over (or under) drive time for a given origin/destination. For the rest of the notebook, we will look at interesting ways to group and then average the transit factor over many journeys.

First, let's calculate the average transit factor for routes started within 0.5 miles of Safeway (shown in the previous map).

In [12]:
transit_factor_total = [route[3] / route[2] for route in nearby_routes]
round(sum(transit_factor_total) / len(transit_factor_total), 2)
Out[12]:
3.67

Next, let's sample the average transit factor for a bunch of origin areas in the city.

In [13]:
def compute_average_transit_factor_for_origin_area(
    max_trip_duration, computation_radius_miles, samples, location
):
    """A function to do the average factor calculation for any origin and its
    surrounding area.
    """
    samples_limited = samples[samples["duration_transit"] <= 2 * 60 * 60]
    routes = [
        ((row[1], row[2]), (row[3], row[4]), row[6], row[7])
        for row in samples_limited.values
    ]
    nearby_routes = [
        route for route in routes if great_circle(location, route[0]).miles <= margin
    ]
    transit_factor_total = [route[3] / route[2] for route in nearby_routes]
    return sum(transit_factor_total) / len(transit_factor_total)


# This takes a minute or two.
transit_factors = []
for i in range(1000):
    point = sample_point(polygon_structure_SF, latitude_range_SF, longitude_range_SF)
    transit_factors.append(
        (
            compute_average_transit_factor_for_origin_area(
                2 * 60 * 60, 0.5, data_raw, point
            ),
            point,
        )
    )
    if i % 100 == 99:
        print(f"Finished {i + 1}.")
Finished 100.
Finished 200.
Finished 300.
Finished 400.
Finished 500.
Finished 600.
Finished 700.
Finished 800.
Finished 900.
Finished 1000.

See code comments (preparing colormaps for plotting).

In [14]:
# Let's look at the distribution of average factor values (which will help us create a
# colormap).
factor_values = [factor[0] for factor in transit_factors]
factor_min = min(factor_values)
factor_max = max(factor_values)
factor_mean = sum(factor_values) / len(factor_values)
factor_std_dev = np.std(factor_values)

# Observe that most of the mass of the distribution is < (mean + one standard deviation).
print((factor_min, factor_max, factor_mean, factor_std_dev))
np.histogram(factor_values)
(2.853872575425442, 8.596850327429918, 3.822190319908781, 0.5325547350409376)
Out[14]:
(array([125, 641, 194,  17,   3,  12,   4,   0,   2,   2]),
 array([2.85387258, 3.42817035, 4.00246813, 4.5767659 , 5.15106368,
        5.72536145, 6.29965923, 6.873957  , 7.44825478, 8.02255255,
        8.59685033]))
In [15]:
# In general, factor values seem to fall in the range [2.5, 4.5] with a long upper tail.
# Let's use the classic Viridis colormap, but with an extra red tint for the upper tail.
viridis_base = list(reversed(branca.colormap._schemes["viridis"]))
viridis_augmented = viridis_base + ["red"]
series_min = 2.5
series_high = 4.5
series_max = 10
index = list(np.linspace(series_min, series_high, len(viridis_augmented) - 1)) + [
    series_max
]
factor_colormap = branca.colormap.LinearColormap(
    colors=viridis_augmented, index=index, vmin=series_min, vmax=series_max
)
factor_colormap_for_display = branca.colormap.LinearColormap(
    colors=viridis_base, vmin=series_min, vmax=series_high
)

Here, we plot average transit factor values (for each sampled origin area) on a map.

In [16]:
transit_factor_map = folium.Map(
    location=[center_latitude, center_longitude], zoom_start=12, tiles="cartodbpositron"
)

for (factor, point) in transit_factors:
    marker = folium.CircleMarker(
        point, radius=1, color=factor_colormap.rgb_hex_str(factor)
    )
    transit_factor_map.add_child(marker)

transit_factor_map.add_child(factor_colormap_for_display)
transit_factor_map
Out[16]:
[Interactive folium map: sampled origin points colored by their average transit factor.]

Hex Grouping

The previous technique is useful for scoring a particular location with a transit factor (based on an average of nearby neighbors within some radius). But maybe we're interested in scoring whole areas rather than individual locations.

Let's use the H3 geographical index scheme (not covered here) to group the raw origins into several hexagonal regions. Within each hex grouping of origins, we can then take the average of each originating journey's transit factor.
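
For orientation, here's a minimal sketch of the two h3-py (v3) calls used below; the coordinates are an arbitrary point in SF:

import h3

# Index a (lat, lng) point into a resolution-8 hexagonal cell...
hex_code = h3.geo_to_h3(37.7793, -122.4193, 8)
# ...and recover that cell's boundary as a tuple of (lat, lng) vertices.
boundary = h3.h3_to_geo_boundary(hex_code)
print(hex_code, len(boundary))  # A 15-character cell id and 6 vertices.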

In [17]:
# This resolution basically tiles a handful of neighborhood blocks (resolution-8
# H3 hexagons average roughly 0.7 square km).
h3_resolution = 8

# We re-use this plotting logic in subsequent steps. The parameter is a list of
# (factor, hex_code) pairs: individual factor observations tagged with their hex.
def plot_hex_factors(hex_factors):
    # Get averages for every hex.
    values_for_hex = defaultdict(list)
    for (factor, hex_code) in hex_factors:
        values_for_hex[hex_code].append(factor)
    hex_averages = {}
    for hex_key in values_for_hex:
        hex_averages[hex_key] = sum(values_for_hex[hex_key]) / len(
            values_for_hex[hex_key]
        )

    transit_hex_map = folium.Map(
        location=[center_latitude, center_longitude],
        zoom_start=12,
        tiles="cartodbpositron",
    )

    # Draw all the hexes, colored by their averages. We keep re-using the baseline
    # colormap so that the maps stay comparable; it saturates for hexes below the
    # colormap minimum, so keep that in mind if you see a lot of minima on your plot.
    for (hex_key, factor) in hex_averages.items():
        hex_boundary = h3.h3_to_geo_boundary(hex_key)
        hex_boundary = hex_boundary + (hex_boundary[0],)
        polyline = folium.PolyLine(
            locations=hex_boundary,
            weight=1,
            color="#333",
            fill_color=factor_colormap.rgb_hex_str(factor),
            fill_opacity=0.6,
        )
        transit_hex_map.add_child(polyline)

    transit_hex_map.add_child(factor_colormap_for_display)
    return transit_hex_map, hex_averages


# Continue to apply the 2 hour constraint from before.
samples_limited = data_raw[data_raw["duration_transit"] <= 2 * 60 * 60]
routes = [
    ((row[1], row[2]), (row[3], row[4]), row[6], row[7])
    for row in samples_limited.values
]
hex_factors = [
    (route[3] / route[2], h3.geo_to_h3(*route[0], h3_resolution)) for route in routes
]
hex_map, hex_averages = plot_hex_factors(hex_factors)
hex_map
Out[17]:
[Interactive folium map: hexes colored by the average transit factor of originating journeys.]

Sweet. Let's do that again, but this time we're interested in giving destination areas a transit factor. (That is, for a given destination area, what is the average multiplier over drive time to get there by transit?)

In [18]:
hex_factors_dest = [
    (route[3] / route[2], h3.geo_to_h3(*route[1], h3_resolution)) for route in routes
]
hex_map_dest, hex_averages_dest = plot_hex_factors(hex_factors_dest)
hex_map_dest
Out[18]:
[Interactive folium map: hexes colored by the average transit factor of terminating journeys.]

Are there any hexes with large discrepancies between origin and destination factor values?

Let's compute (and plot) the ratio of destination factor values to origin factor values. In this map, areas in blue are "easier" in some sense to originate from than terminate at.

No huge discrepancies here.

In [19]:
factor_ratios = {}
for hex_key in hex_averages:
    if hex_key in hex_averages_dest:
        factor_ratios[hex_key] = hex_averages_dest[hex_key] / hex_averages[hex_key]

transit_hex_ratio_map = folium.Map(
    location=[center_latitude, center_longitude], zoom_start=12, tiles="cartodbpositron"
)
ratio_colormap = branca.colormap.linear.RdBu_04.scale(
    min(factor_ratios.values()), max(factor_ratios.values())
)

for (hex_key, ratio) in factor_ratios.items():
    hex_boundary = h3.h3_to_geo_boundary(hex_key)
    hex_boundary = hex_boundary + (hex_boundary[0],)
    polyline = folium.PolyLine(
        locations=hex_boundary,
        weight=1,
        color="#333",
        fill_color=ratio_colormap.rgb_hex_str(ratio),
        fill_opacity=0.6,
    )
    transit_hex_ratio_map.add_child(polyline)

transit_hex_ratio_map.add_child(ratio_colormap)
transit_hex_ratio_map
Out[19]:
[Interactive folium map: hexes colored by the ratio of destination to origin factor values.]

Two more questions.

First, how does the Monday-only origin map compare to the other maps? Try changing == 0 to == 6 for Sunday or == 5 for Saturday. You should see clear differences.

In [20]:
tzinfo = timezone("America/Los_Angeles")
samples_monday = samples_limited[
    samples_limited["departure_time"].apply(
        # Interpret the epoch timestamp directly in SF's timezone.
        lambda x: datetime.fromtimestamp(x, tz=tzinfo).weekday()
    )
    == 0  # For Monday.
]
routes_monday = [
    ((row[1], row[2]), (row[3], row[4]), row[6], row[7])
    for row in samples_monday.values
]
hex_factors_monday = [
    (route[3] / route[2], h3.geo_to_h3(*route[0], h3_resolution))
    for route in routes_monday
]
hex_map_monday, hex_averages_monday = plot_hex_factors(hex_factors_monday)
hex_map_monday
Out[20]:
[Interactive folium map: origin hexes colored by average transit factor, Monday departures only.]

Finally, how does rush hour (6 AM to 9 AM) look?

Interestingly, major commute thoroughfares (the 38 and 14 buses for example) look much brighter.

In [21]:
samples_hours = samples_limited["departure_time"].apply(
    lambda x: datetime.fromtimestamp(x, tz=tzinfo).hour
)
samples_rush_hour = samples_limited[(6 <= samples_hours) & (samples_hours <= 9)]
routes_rush_hour = [
    ((row[1], row[2]), (row[3], row[4]), row[6], row[7])
    for row in samples_rush_hour.values
]
hex_factors_rush_hour = [
    (route[3] / route[2], h3.geo_to_h3(*route[0], h3_resolution))
    for route in routes_rush_hour
]
hex_map_rush_hour, hex_averages_rush_hour = plot_hex_factors(hex_factors_rush_hour)
hex_map_rush_hour
Out[21]:
[Interactive folium map: origin hexes colored by average transit factor, rush-hour departures only.]

Remarks

  • This analysis (as shown) uses data from September 26-28, 2020 (i.e. a selection of only 3 days in the world of COVID-limited service). This selection was designed to include both weekend days and a weekday, which together capture most of the possible daily service patterns. Ideally, though, you would use a data set spanning whole weeks (and different weeks and months). Free API credits, anyone?
  • This analysis doesn't consider cost; travel between almost any two points in SF is $3.00 or cheaper, with the exception of some services operated by non-MUNI transit authorities.
  • This analysis doesn't consider delays (and represents best-case transit scenarios).
  • As SF transitions into a post-COVID world, the city should consider de-prioritizing downtown as a transit waypoint and redirecting resources to south city (e.g. Excelsior and Outer Mission).