The taxi data is coming
One-and-a-half billion rows. Fifty gigabytes. Sixteen years of pickups, drop-offs, fares, and now — for the first time — the per-trip cost of driving into Manhattan during peak hours.
The NYC Taxi & Limousine Commission Trip Record Data is the most ambitious mobility dataset
any U.S. city has ever published. Every yellow cab, every green cab, every for-hire vehicle
trip the city sees gets logged with timestamps, pickup and drop-off zones, distance, fare, and
— since January 2025 — the new cbd_congestion_fee field tracking the
Congestion Relief Zone's per-trip toll. It is the closest thing we have to a trip-by-trip map of
Manhattan's economic circulatory system.
It also won't fit through our Socrata proxy.
Why this dataset is different
The other NYC datasets in this chapter — 311, PLUTO, HPD, restaurant inspections, lead pipes, MTA origin-destination — all live on Socrata's SODA API. We send a SoQL query, the city sends back JSON, the worker proxy caches the result in KV. That round trip works because even the biggest of those datasets (311, at 24 million rows) responds to aggregate queries in well under a second.
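To make that round trip concrete, here is a minimal sketch of the URL the proxy sends upstream, using only the Python standard library. The dataset ID and field names are illustrative placeholders, not the chapter's actual configuration:

```python
from urllib.parse import urlencode

SODA_BASE = "https://data.cityofnewyork.us/resource"

def soql_url(dataset_id, select, where=None, group=None, limit=50_000):
    """Build a SODA GET URL for a SoQL aggregate query.

    The worker proxy sends a URL shaped like this upstream, then caches
    the JSON response in KV keyed on the query string.
    """
    params = {"$select": select, "$limit": limit}
    if where:
        params["$where"] = where
    if group:
        params["$group"] = group
    return f"{SODA_BASE}/{dataset_id}.json?{urlencode(params)}"

# Hypothetical example: 311 complaints per borough since 2024.
# "erm2-nwe9" is a placeholder dataset ID for illustration.
url = soql_url(
    "erm2-nwe9",
    select="borough, count(*) AS n",
    where="created_date > '2024-01-01'",
    group="borough",
)
```

Because the whole query lives in the URL, the query string doubles as the KV cache key, which is what makes the proxy's caching trivial.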
TLC is two orders of magnitude bigger. The canonical distribution isn't even on Socrata
anymore — since 2022 the city publishes monthly Parquet files on www.nyc.gov/site/tlc, mirrored on the AWS Registry of Open Data at s3://nyc-tlc/. A single yellow-cab month is ~3 million rows. The full historical
archive is ~50 GB.
What we're planning
Two tiers, both deferred for now and tracked in the design doc.
Tier 1: a build-time aggregation script. We pull the latest few months of
Parquet from the AWS mirror, run DuckDB locally to compute the aggregates a story or chart
actually needs (hourly hex-binned pickup density, post-event DBSCAN clusters, congestion-zone
demand-elasticity diff-in-diff), and commit the JSON outputs to this repository under src/lib/data/static/nyc/tlc-aggregates/. A monthly cron in GitHub Actions opens
a PR with refreshed data so review stays in the loop. Two stories already have outlines
against this tier: one on the cellular dead-zones implied by the store_and_fwd_flag column, one on where Pride / Marathon / Halloween Parade
crowds went after the parties ended.
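The shape of that build step can be sketched with the simplest Tier 1 aggregate: pickups per hour. In the real script, DuckDB's `read_parquet` would scan the AWS mirror; here sqlite3 stands in over a few synthetic rows so the SQL shape is runnable, and the table and column names are assumptions modeled on the yellow-cab schema:

```python
import json
import sqlite3

# Synthetic stand-in for one month of yellow-cab trips. The real script
# would run roughly the same GROUP BY in DuckDB over the Parquet mirror:
#   SELECT ... FROM read_parquet('yellow_tripdata_2025-01.parquet') ...
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (tpep_pickup_datetime TEXT, fare_amount REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("2025-01-03 08:12:00", 14.5),
        ("2025-01-03 08:47:00", 9.0),
        ("2025-01-03 17:05:00", 22.0),
    ],
)

rows = con.execute(
    """
    SELECT strftime('%H', tpep_pickup_datetime) AS hour,
           COUNT(*)                             AS pickups
    FROM trips
    GROUP BY hour
    ORDER BY hour
    """
).fetchall()

# The Tier 1 script would commit output like this under
# src/lib/data/static/nyc/tlc-aggregates/ for a chart to load.
hourly = [{"hour": h, "pickups": n} for h, n in rows]
print(json.dumps(hourly))
```

The committed artifact is plain JSON, so the PR diff that the monthly cron opens is reviewable by eye: a reviewer sees exactly which numbers changed.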
Tier 2: a Playground tab that ships DuckDB-WASM and reads Parquet from a
Cloudflare R2 mirror via httpfs. SoQL becomes a thin DuckDB SQL dialect; arbitrary
queries against the full archive run in the browser. This is a 2 MB lazy-loaded bundle — a real
choice, not a side-effect — and warrants a measurement pass before shipping.
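For a sense of what that tab would execute: once DuckDB's httpfs extension is loaded, `read_parquet` accepts an HTTPS URL, so a Playground query against the R2 mirror is plain SQL over a public object. The bucket URL and filename pattern below are placeholders; this sketch only builds the query string the browser would hand to DuckDB-WASM:

```python
# Placeholder R2 public bucket; the real mirror URL is not decided yet.
R2_BASE = "https://tlc-mirror.example.com"

def playground_query(month):
    """SQL a Playground query might run via DuckDB-WASM + httpfs.

    With httpfs loaded, read_parquet() streams the month's file
    straight from R2; only the columns and row groups the query
    touches cross the network.
    """
    url = f"{R2_BASE}/yellow_tripdata_{month}.parquet"
    return (
        f"SELECT date_trunc('day', tpep_pickup_datetime) AS day,\n"
        f"       SUM(cbd_congestion_fee)                 AS fee_total\n"
        f"FROM read_parquet('{url}')\n"
        f"GROUP BY day ORDER BY day"
    )

sql = playground_query("2025-01")
```

Parquet's column pruning is what makes this viable in a browser: a two-column aggregate over a ~3-million-row month fetches far less than the full file.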
Why not now
The NYC chapter is being rolled out in phases. Wiring the SocrataAdapter, getting one dataset end-to-end, and proving the city-filter UX is one phase. Adding the remaining seven Socrata-backed datasets is the next. TLC's pipeline is its own discipline — Parquet, DuckDB, R2, GitHub Actions, browser bundle measurement — and it deserves dedicated focus rather than being a side quest.
For now, this page is the placeholder. When the pipeline lands, the dataset shows up in the catalog with all five tabs (Overview, Views, Filters, Playground, About), and this article gets rewritten as a launch piece on what taxi data tells us about how the new congestion toll changed the city. Watch this space.