scrollytelling · ~5 min · live data

The subway tide

Four million subway riders move through New York City every weekday. The MTA used to know where they boarded but not where they got off. Then they built an algorithm.

The destination problem

For its entire pre-2024 history, the NYC subway captured ridership data the way it captured fares: at the turnstile. You swipe in. The system records the entry. You travel. You exit through a turnstile that doesn't read your card. The system has no idea where you went.

This was, for transit-planning purposes, a catastrophe. You can't optimize service routes against unknown destinations. You can't build evidence-based cases for new station investments if you can't show how riders move. You can't even compute basic things like average trip distance with anything but FOIL'd survey data and modeling assumptions. The MTA had a high-resolution origin dataset and a complete absence of destination data, and the informational asymmetry shaped two decades of transit politics in the city.

The algorithm

The agency's response, codified under the MTA Open Data Act, was to infer destinations. The reasoning runs like this: every subway rider eventually swipes in again somewhere, to start a new trip. That second entry is presumably from a station near the destination of the first trip. By analyzing the time, location, and frequency pattern of each rider's subsequent entries, the agency can probabilistically reconstruct exit locations. The inference is most reliable for commuters with predictable round-trip patterns and least reliable for occasional or one-off riders, but in aggregate it produces a usable origin-destination matrix.
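The core idea can be sketched in a few lines of Python. This is an illustration of next-entry inference in general, not the MTA's actual implementation: the Entry record, the rider_id token, the 24-hour window, and the rule that the next station must differ from the current one are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Entry:
    rider_id: str    # hypothetical anonymized card or device token
    station: str     # station complex where the rider swiped in
    time: datetime


def infer_trips(entries, window=timedelta(hours=24)):
    """Pair each entry with the same rider's next entry and treat that
    next station as the inferred destination of the earlier trip.

    The 24-hour window is an assumed value, not a documented one.
    """
    by_rider = defaultdict(list)
    for entry in entries:
        by_rider[entry.rider_id].append(entry)

    flows = Counter()  # (origin, inferred destination) -> trip count
    unmatched = 0      # entries with no usable follow-up entry
    for rider_entries in by_rider.values():
        rider_entries.sort(key=lambda e: e.time)
        for current, nxt in zip(rider_entries, rider_entries[1:]):
            gap = nxt.time - current.time
            if gap <= window and nxt.station != current.station:
                flows[(current.station, nxt.station)] += 1
            else:
                unmatched += 1
        unmatched += 1  # a rider's last entry never has a successor
    return flows, unmatched
```

A commuter who enters at one station in the morning and at another in the evening yields one inferred trip between them. A rider with a single entry yields none and is counted as unmatched.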

The result is the dataset this story draws from: every (origin, destination, hour, day-of-week) tuple, weighted by estimated average ridership. From it, you can reconstruct the city's daily commuter geography in a way that has never been publicly available before.

The tide

The hour-of-day distribution is the cleanest visible signature of the morning-evening tidal pattern that defines the system's rhythm:

The morning peak around 8 AM, the evening peak from 5–7 PM, and the pre-dawn trough at 3–5 AM (orders of magnitude lower than either peak) tell you what every New Yorker already knows from experience: this is fundamentally a commuter system. Off-peak ridership is real and growing, but the system's load-bearing function is moving people from residential boroughs to Manhattan business districts in the morning and back out at night.
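For readers who want to reproduce the hourly curve, here is a minimal sketch of the aggregation. It assumes each row is a dict with hour_of_day and estimated_average_ridership fields. Those field names are an assumption about the dataset's schema and should be checked against its data dictionary.

```python
from collections import defaultdict


def hourly_totals(rows):
    """Sum estimated ridership across all O-D pairs for each hour of day.

    The field names are assumed, not confirmed against the live dataset.
    """
    totals = defaultdict(float)
    for row in rows:
        hour = int(row["hour_of_day"])
        totals[hour] += float(row["estimated_average_ridership"])
    return dict(totals)


def peak_and_trough(totals):
    """Return the busiest and the quietest hour in an hourly-totals dict."""
    peak = max(totals, key=totals.get)
    trough = min(totals, key=totals.get)
    return peak, trough
```

Values are passed through int() and float() because JSON responses from open-data portals often deliver numbers as strings.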

Where they board

The top origin stations are where commuters live (or transfer through on their way in from further out). The top destination stations are where commuters work. Plotted side by side, the asymmetry is the asymmetry of NYC's labor geography:

The names of these complexes (Times Sq-42 St, Union Sq, Grand Central, Penn Station) are the daily commuter geography of NYC. Even five years ago this asymmetry could not have been measured, because the MTA's algorithmic O-D reconstruction did not yet exist. Pre-2024, "where do my riders go?" was a survey question. Now it's a SoQL query.
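As a rough illustration of what such a query might look like, the sketch below builds a SoQL request URL for the top stations by estimated ridership. The dataset ID comes from the source note at the end of this piece. The direct data.ny.gov endpoint and the column names are assumptions: this story fetches through a proxy, and the schema should be verified before use.

```python
from urllib.parse import urlencode

DATASET_ID = "y2qv-fytt"  # from this story's source note
# Assumed direct endpoint; the story itself goes through a proxy instead.
BASE_URL = f"https://data.ny.gov/resource/{DATASET_ID}.json"


def top_stations_url(role="destination", limit=10):
    """Build a SoQL URL ranking station complexes by summed ridership.

    role selects the origin or the destination side of each O-D pair.
    Column names are assumed, not confirmed against the live dataset.
    """
    if role not in ("origin", "destination"):
        raise ValueError("role must be 'origin' or 'destination'")
    column = f"{role}_station_complex_name"
    params = {
        "$select": f"{column}, sum(estimated_average_ridership) AS riders",
        "$group": column,
        "$order": "riders DESC",
        "$limit": str(limit),
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

Calling top_stations_url("origin") and top_stations_url("destination") gives the two rankings that sit side by side in the comparison above.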

What the algorithm gets wrong

The destination data is an estimate, not a measurement, and it has predictable failure modes. Riders who don't make a return trip within the inference window (visitors, occasional travelers, anyone who took a one-way trip and used a different mode to come back) fall back to population-statistical averages. As a result, low-frequency riders are systematically less well represented in the destination data than commuters, and late-night and weekend trips are estimated less accurately than weekday commute hours. None of this makes the dataset useless. It is the cleanest available version of an estimate that previously did not exist in any form.
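One way to picture the fallback is proportional allocation: entries with no inferred destination are spread across destinations in the same proportions as the matched trips from that origin. The sketch below is a guess at the general shape of such a step, not the MTA's documented method, and the function name is hypothetical.

```python
def allocate_unmatched(matched, unmatched_count):
    """Scale matched destination counts for one origin so they also
    absorb that origin's unmatched entries, preserving proportions.

    This is an assumed fallback rule, shown for illustration only.
    """
    total = sum(matched.values())
    if total == 0:
        return dict(matched)  # no matched trips to base a proportion on
    scale = 1 + unmatched_count / total
    return {dest: count * scale for dest, count in matched.items()}
```

The sketch also shows why low-frequency riders end up underrepresented: their trips are assumed to follow the commuter distribution, so any destinations that only occasional riders favor are smoothed away.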

The next iteration of this piece will overlay the origin-destination flows on a deck.gl arc map of the system, with morning and evening peak filters as toggles. For now, the simpler bar charts get the structural story across: a city that breathes with the workday, mapped at a resolution we couldn't see two years ago.

Source: NYS Open Data, dataset y2qv-fytt (MTA Subway Origin-Destination Ridership Estimate). Fetched at runtime via the Cloudflare Worker proxy.