dataset · MTA via NYS Open Data

MTA Subway Origin-Destination

The MTA's algorithmic reconstruction of where 4 million daily subway riders actually go. Turnstiles only capture entries — the agency probabilistically infers exits by analyzing time, location, and frequency of riders' subsequent entries. The resulting O-D matrix is the cleanest view of NYC's circulatory system that exists.

About this dataset

The MTA Subway Origin-Destination Ridership Estimate is hosted on the New York State Open Data portal as Socrata dataset y2qv-fytt. Each row is one estimated O-D pair: an origin station complex, a destination station complex, an hour of day, a day of week, and an estimated average rider count. Taken together, the rows reconstruct the daily flow of roughly 4 million riders across the system.
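To make the row shape concrete, here is a minimal sketch of one row as a Python dataclass, plus a helper that collapses rows into per-pair totals. The field names are illustrative assumptions rather than the dataset's confirmed column names, and the sample values are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ODRow:
    """One estimated O-D pair. Field names are assumptions for illustration."""
    origin_complex_id: int
    destination_complex_id: int
    hour_of_day: int         # 0-23
    day_of_week: str         # e.g. "Monday"
    estimated_riders: float  # estimated average rider count


def total_by_pair(rows):
    """Sum estimated riders over all hours and days for each (origin, destination)."""
    totals = defaultdict(float)
    for row in rows:
        key = (row.origin_complex_id, row.destination_complex_id)
        totals[key] += row.estimated_riders
    return dict(totals)


# Made-up sample rows, not real ridership figures.
sample = [
    ODRow(1, 2, 8, "Monday", 120.5),
    ODRow(1, 2, 9, "Monday", 80.0),
    ODRow(2, 1, 17, "Monday", 95.25),
]
matrix = total_by_pair(sample)
```

Keying the totals on an (origin, destination) tuple keeps the matrix sparse, which suits a system where most station pairs see little or no traffic in a given hour.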

Why this dataset exists

The NYC subway requires riders to swipe in but not to swipe out, so the MTA historically knew only where you got on, not where you got off. To produce a meaningful origin-destination matrix, the agency built a probabilistic algorithm that analyzes the time, location, and frequency of each rider's subsequent entries. When a rider swipes in again at a different station, that entry presumably starts a new trip from the destination of the previous one, which lets the MTA infer exit locations with reasonable accuracy.
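The core idea, treating a rider's next entry as the previous trip's exit, can be sketched in a few lines. This is a deliberately simplified illustration and not the MTA's actual algorithm: it is deterministic rather than probabilistic, the 24-hour window is a hypothetical value, and the station IDs and timestamps are made up.

```python
from datetime import datetime, timedelta

# Hypothetical inference window; the real value is not stated in the source.
INFERENCE_WINDOW = timedelta(hours=24)


def infer_destinations(entries):
    """Guess each trip's destination as the rider's next entry station.

    `entries` is a time-sorted list of (timestamp, station_complex_id) swipes
    for ONE rider. Returns a list of (origin, destination_or_None) pairs.
    """
    trips = []
    for (t0, origin), (t1, next_origin) in zip(entries, entries[1:]):
        within_window = (t1 - t0) <= INFERENCE_WINDOW
        if within_window and next_origin != origin:
            trips.append((origin, next_origin))
        else:
            trips.append((origin, None))  # no usable follow-up entry
    if entries:
        trips.append((entries[-1][1], None))  # last swipe has nothing after it
    return trips


# Made-up swipes: a morning entry, an evening return, then a gap of two days.
entries = [
    (datetime(2024, 1, 8, 8, 0), 101),
    (datetime(2024, 1, 8, 17, 30), 202),
    (datetime(2024, 1, 10, 8, 0), 101),
]
trips = infer_destinations(entries)
```

Trips left with a destination of None are the ones a real system would have to fill in some other way, for example from aggregate statistics.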

The result is the cleanest publicly available view of NYC's transit circulatory system. Before the MTA Open Data Act, this kind of analysis required FOILing the agency or guessing from turnstile data. Now it's a Socrata query away.

A different host

This dataset is the first in the NYC chapter that lives on data.ny.gov rather than data.cityofnewyork.us — the MTA is a state agency, not a city one. The Socrata adapter and worker proxy handle both transparently via the SOCRATA_DOMAINS allowlist; the domain field on the dataset config is the only thing that changes.
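As a rough sketch of how a domain allowlist like this might work, the snippet below builds the query URL from a dataset config and rejects hosts that are not listed. The shape of SOCRATA_DOMAINS, the config keys, and the query_url helper are all assumptions for illustration; the project's real adapter is not shown in this document.

```python
# Hypothetical allowlist mirroring the SOCRATA_DOMAINS idea; the real
# structure of that setting is not shown in the source.
SOCRATA_DOMAINS = frozenset({"data.cityofnewyork.us", "data.ny.gov"})


def query_url(domain: str, dataset_id: str) -> str:
    """Build the SODA v3 query URL for a dataset, rejecting unlisted hosts."""
    if domain not in SOCRATA_DOMAINS:
        raise ValueError(f"domain not allowed: {domain}")
    return f"https://{domain}/api/v3/views/{dataset_id}/query.json"


# Hypothetical dataset config: only `domain` differs from a city-hosted dataset.
mta_od = {"domain": "data.ny.gov", "id": "y2qv-fytt"}
url = query_url(mta_od["domain"], mta_od["id"])
```

Checking the domain before building the URL means a proxy written this way cannot be pointed at an arbitrary host through a dataset config.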

Source

  • Catalog page: NYS Open Data
  • Endpoint (SODA v3): POST https://data.ny.gov/api/v3/views/y2qv-fytt/query.json
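A request against the endpoint above might be prepared as follows. This sketch only builds the request and does not send it. The JSON body shape (a query string plus a page object) and the X-App-Token header are assumptions about SODA v3 that should be checked against the Socrata documentation; build_query_request is a hypothetical helper.

```python
import json
import urllib.request

ENDPOINT = "https://data.ny.gov/api/v3/views/y2qv-fytt/query.json"


def build_query_request(soql: str, page_size: int = 100, app_token: str = ""):
    """Prepare (but do not send) a POST request for the SODA v3 query endpoint.

    The body shape and the X-App-Token header are assumptions about SODA v3.
    """
    body = {"query": soql, "page": {"pageNumber": 1, "pageSize": page_size}}
    headers = {"Content-Type": "application/json"}
    if app_token:
        headers["X-App-Token"] = app_token
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_query_request("SELECT *", page_size=5)
# To actually send it (network access required):
#   with urllib.request.urlopen(request) as response:
#       rows = json.load(response)
```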

Caveats

  • The destination is an algorithmic estimate, not a measurement. For trips where the rider doesn't make a return entry within the inference window, the algorithm has to fall back to population-statistical averages, which means low-frequency riders are systematically less well-represented in destination data than commuters.
  • station_complex_id identifies a station complex — multiple platforms serving the same connected station — not an individual platform. To get per-line / per-direction granularity you'd need to join against the MTA's GTFS data.
  • Complex IDs are integers; the human-readable station names live in the MTA's separate station registry (a planned static JSON for the NYC chapter — until then, complex IDs ship as raw numbers in the UI).
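The fallback described in the last caveat can be sketched as a lookup that returns a station name when the registry has one and the raw ID otherwise. The complex_label helper and the registry contents below are hypothetical; real IDs and names would come from the MTA's station registry.

```python
def complex_label(complex_id: int, registry: dict[int, str]) -> str:
    """Return the station name if the registry has it, else the raw ID as text."""
    return registry.get(complex_id, str(complex_id))


# Made-up registry entries for illustration only.
registry = {1: "Example Complex A", 2: "Example Complex B"}
labels = [complex_label(cid, registry) for cid in (1, 2, 999)]
```

With an empty registry every label degrades to the raw number, which matches the interim behavior described above.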

Citation

Metropolitan Transportation Authority (2026). Subway Origin-Destination Ridership Estimate. Retrieved recently via NYS Open Data SODA v3.