under the hood

The pipeline

How the data gets from city servers to the page you're reading — and why the answer is different for Philly 311 than for 1.5 billion taxi trips.

The problem with open data

Most open data portals are designed for download, not for querying at scale in a browser. A Philly 311 dataset with 3 million rows is easy to browse in a Socrata table. It's not easy to serve live to thousands of readers who want to filter by zip code and see a chart update in 200ms.

The pipeline here is designed around one principle: the browser should never see more data than it needs to render the current view. That sounds obvious. In practice it means every dataset goes through a different path depending on its size, freshness requirements, and query complexity.

Data sources

Carto SQL (live)

What: Philly 311 service requests, L&I violations, PFD dispatches, shooting victims.

How: A SvelteKit server route proxies queries to phl.carto.com at runtime. Results are cached in Cloudflare KV (5 minutes for rows, 1 hour for aggregates, 24 hours for schema). Stories prerender by calling Carto directly at build time and freezing the numbers into the HTML.

Why: Philly's open data lives on Carto's hosted PostGIS. The proxy keeps a single egress point, so caching is efficient and the page code never touches the external API directly.
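The tiered caching can be sketched as a small policy module. This is a minimal illustration, not the site's actual proxy code: the type names and key scheme are assumptions, but the TTLs match the ones described above.

```typescript
// Illustrative cache policy for the Carto proxy (names are assumptions).
// Each class of query gets its own TTL, so hot row queries stay fresh
// while schema lookups are cached for a full day.
type QueryKind = "rows" | "aggregate" | "schema";

// TTLs in seconds: 5 min for rows, 1 h for aggregates, 24 h for schema.
const TTL_SECONDS: Record<QueryKind, number> = {
  rows: 5 * 60,
  aggregate: 60 * 60,
  schema: 24 * 60 * 60,
};

function ttlFor(kind: QueryKind): number {
  return TTL_SECONDS[kind];
}

// A KV key that is stable across trivially equivalent queries.
function cacheKey(kind: QueryKind, sql: string): string {
  return `carto:${kind}:${sql.trim().toLowerCase()}`;
}
```

In a Cloudflare Worker, `ttlFor` would feed KV's `expirationTtl` option on write, so the cache ages out on its own with no eviction logic in the proxy.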

Socrata SODA (live)

What: NYC 311, restaurant inspections, lead service lines, affordable housing, subway ridership. NYC TLC yellow cab and FHVHV trip records (2017–2023).

How: SODA v3 POST for the dataset catalog tabs (runtime, KV-cached). SODA v2 GET for story prerendering — the v2 endpoint is unauthenticated, works during local builds, and returns small payloads for aggregation queries even against billion-row tables.

Why: NYC Open Data publishes per-year dataset IDs on Socrata. A GROUP BY query returning 12 monthly totals is the same size regardless of whether the underlying table has 3M or 300M rows. The API does the work; we freeze the result.
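A SODA v2 aggregation request of that shape can be sketched as a URL builder. This is an assumption about how such a query might be composed (the dataset ID and date column below are placeholders); `date_trunc_ym` is a standard SoQL function.

```typescript
// Sketch: a SODA v2 GET that returns one row per month, regardless of
// whether the underlying table has 3M or 300M rows. Dataset ID and
// column name are placeholders, not real identifiers from the site.
function monthlyTotalsUrl(datasetId: string, dateColumn: string): string {
  const params = new URLSearchParams({
    $select: `date_trunc_ym(${dateColumn}) AS month, count(*) AS n`,
    $group: "month",
    $order: "month",
  });
  return `https://data.cityofnewyork.us/resource/${datasetId}.json?${params}`;
}
```

The response is ~12 small JSON objects per year, which is why this path is cheap enough to run unauthenticated during local builds.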

DuckDB (build-time)

What: NYC TLC taxi trip records for 2024–2025 — data too new for Socrata, available only as monthly Parquet files from the TLC's CloudFront distribution.

How: A Node script (scripts/build-tlc-aggregates.ts) downloads the relevant monthly Parquet files via curl, then runs DuckDB in-memory to produce small JSON aggregates committed to src/lib/data/static/nyc/tlc-aggregates/. The aggregates are read by story +page.server.ts files at build time, with no runtime traffic.

Why: DuckDB reads only the columns it needs from Parquet via byte-range requests, making large files tractable. The script is intentionally manual — it costs real egress and we want human intent before refreshing.
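The aggregation step can be sketched as a query builder. The file layout and TLC column names below are assumptions for illustration, not the script's real code; `read_parquet` over a list of files is standard DuckDB.

```typescript
// Sketch of the build-time aggregation query (paths and columns are
// illustrative). DuckDB reads only the columns referenced here from
// each Parquet file, not the whole ~60 MB per month.
function tripCountSql(months: string[]): string {
  const files = months
    .map((m) => `'data/yellow_tripdata_${m}.parquet'`)
    .join(", ");
  return `
    SELECT strftime(tpep_pickup_datetime, '%Y-%m') AS month,
           count(*) AS trips,
           avg(total_amount) AS avg_fare
    FROM read_parquet([${files}])
    GROUP BY month
    ORDER BY month`;
}
```

The Node script would hand this string to DuckDB and write the handful of resulting rows out as JSON, which is what gets committed.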

DuckDB WASM (live)

What: The /datasets/nyc-tlc Playground tab — arbitrary SQL against the TLC Parquet archive, running entirely in the browser.

How: The @duckdb/duckdb-wasm bundle (~2 MB gzipped) is lazy-loaded only when the Playground tab opens. The jsDelivr CDN serves the WASM binary and worker script; no self-hosted WASM serving required. DuckDB's httpfs extension reads byte ranges from remote Parquet files via fetch. External Parquet queries require CORS headers on the host — production will mirror to a Cloudflare R2 bucket. In-browser queries (generate_series, inline data) work immediately.

Why: DuckDB WASM is the only viable option for ad-hoc SQL against multi-GB Parquet in the browser. SQL.js (SQLite) has no native Parquet reader. A backend SQL service adds operational complexity and a server cost. WASM keeps the stack flat: Cloudflare Pages + Workers, no additional processes. The lazy-load pattern means visitors who never open the Playground tab pay zero bundle cost.
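The lazy-load pattern itself is a one-liner worth showing. This is a generic sketch, not the site's actual module: the heavy dynamic import runs once, on first use, and concurrent callers share the same in-flight promise.

```typescript
// Minimal lazy-singleton helper (illustrative, not the site's code).
// The loader runs at most once; every caller gets the same promise.
function lazy<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => (pending ??= load());
}

// Usage sketch: nothing is downloaded until the Playground tab first
// calls getDb(), so other visitors pay zero bundle cost.
// const getDb = lazy(() =>
//   import("@duckdb/duckdb-wasm").then(initDuckDb) // initDuckDb: hypothetical
// );
```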

Why DuckDB specifically

The TLC taxi trip archive is the clearest example of why DuckDB matters here. 1.5 billion rows, 50+ GB of Parquet distributed in monthly files from 2009 onward. Three options existed:

  • A backend SQL service — a managed database (ClickHouse, BigQuery, Postgres) with the data loaded in. Rejected: doubles the operational footprint, adds an attack surface, and reintroduces the cold-start cost we avoid with prerendering.
  • SQL.js (SQLite WASM) — no native Parquet reader. Loading rows into in-memory SQLite tables defeats the purpose of Parquet entirely.
  • DuckDB — native Parquet reader with columnar vectorized execution, designed for OLAP workloads against external storage. The httpfs extension reads byte ranges from R2 without downloading whole files. A GROUP BY over a single column touches maybe 5% of a 60 MB Parquet file.

For build-time aggregates, DuckDB runs as a Node module (duckdb-async). For the browser Playground, it runs as WASM in a Web Worker. Same SQL dialect, same query plan, different execution environment.

Static-first principle

Stories prerender. Dataset catalog pages don't. This is the most important architectural decision on the site.

A story about illegal dumping in Philadelphia gets its key numbers frozen into the HTML at deploy time. The reader sees accurate statistics instantly, without waiting for an API call. The data is as fresh as the last deploy. For most stories, a monthly or quarterly refresh is enough — the narrative doesn't change if a few hundred more reports came in last Tuesday.

The dataset catalog pages — the filterable, queryable views under /datasets — are dynamic. They need to respond to user queries in real time, which means they need the live API path. The Cloudflare Worker proxy + KV cache is what makes that viable without burning through API rate limits.
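In SvelteKit this split is a single page option. A sketch of how the two route types might declare it (route paths are illustrative):

```typescript
// Story route: numbers frozen into HTML at deploy time.
// e.g. src/routes/stories/illegal-dumping/+page.server.ts
export const prerender = true;

// Catalog route: rendered per-request against the Worker proxy + KV cache.
// e.g. src/routes/datasets/[id]/+page.server.ts
// export const prerender = false;
```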

What's been built and what's next

Phase 1: Foundation

SvelteKit + Tailwind + shadcn scaffold, dark mode, chart primitives, Philly 311 live data via Carto.

Phase 3: Real data + maps

DataSource interface, CartoAdapter, Cloudflare Worker proxy, MapLibre + deck.gl, Philly 311 end-to-end.

NYC chapter: Socrata + TLC

SocrataAdapter, NYC 311 and restaurant datasets, TLC yellow cab stories (congestion pricing, tip geography, 3am, black car takeover), DuckDB build-time aggregates.

Phase 4: DuckDB WASM Playground

DuckDB WASM lazy-loaded in the browser via jsDelivr. Playground tab live at /datasets/nyc-tlc. External Parquet queries pending R2 CORS mirror; in-browser queries work now.

Next: R2 Parquet mirror

Mirror TLC monthly Parquet to Cloudflare R2 with CORS headers. Unlocks full cross-month SQL in the Playground without CORS restrictions.

The stack

  • SvelteKit 5 (runes)
  • Tailwind CSS v4
  • shadcn-svelte
  • MDsveX
  • Cloudflare Pages
  • Cloudflare Workers
  • Cloudflare KV
  • MapLibre GL
  • deck.gl
  • DuckDB (Node + WASM)
  • Socrata SODA API
  • Carto SQL API