Converting Large GeoParquet Files to Vector Tiles

Converting large GeoParquet files to vector tiles requires a streaming, chunk-based pipeline that bypasses in-memory DataFrame loading. Read the dataset in batches with pyarrow, selecting only the columns you need, filter and simplify geometries at the source, stream features as newline-delimited GeoJSON (NDJSON) to tippecanoe, and write optimized .mbtiles or directory-based MVT output. For production workloads, never load multi-gigabyte spatial files into a single geopandas DataFrame; instead, use spatial partitioning, attribute pruning, and zoom-level coalescing to keep the conversion process under 4GB RAM while generating cache-ready tiles.

Core Pipeline Architecture

Large-scale vector tile generation fails when the ingestion step attempts to materialize the entire dataset. GeoParquet’s columnar layout allows you to read only the geometry column and a strict subset of attributes, dramatically reducing I/O overhead. The conversion pipeline should follow a strict three-stage flow:

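  1. Chunked Spatial Read: Partition the GeoParquet file by bounding box or row group. Use pyarrow.parquet.ParquetFile or pyarrow.dataset to scan only the required columns. pyarrow does not evaluate spatial predicates on WKB geometry, so predicate pushdown for spatial extent filtering only works when the file carries plain numeric bounding-box columns (for example, the bbox covering column defined in GeoParquet 1.1).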
  2. Streaming Serialization: Convert each chunk to GeoJSON features and write them to a temporary NDJSON stream. This format is natively supported by tippecanoe and avoids the memory spike of wrapping millions of features in a JSON array.
  3. Tile Generation & Caching: Pipe the NDJSON stream directly into tippecanoe with aggressive simplification flags, outputting either a single .mbtiles SQLite database or a directory structure compatible with standard web tile servers.
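
The serialization stage can be sketched in isolation. The example below is a minimal illustration, assuming features are already plain GeoJSON dictionaries; write_ndjson and point_features are hypothetical helper names, and the point generator only stands in for decoded GeoParquet rows.

```python
import io
import json
from typing import Iterable, TextIO


def write_ndjson(features: Iterable[dict], stream: TextIO) -> int:
    """Write one compact GeoJSON Feature per line and return the count written."""
    count = 0
    for feature in features:
        # One self-contained JSON object per line: there is no enclosing array,
        # so the consumer can start parsing before the producer has finished.
        stream.write(json.dumps(feature, separators=(",", ":")) + "\n")
        count += 1
    return count


def point_features(n: int):
    """Hypothetical feature source standing in for decoded GeoParquet rows."""
    for i in range(n):
        yield {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [float(i), 0.0]},
            "properties": {"id": i},
        }


if __name__ == "__main__":
    buffer = io.StringIO()
    written = write_ndjson(point_features(3), buffer)
    print(written, "features,", len(buffer.getvalue().splitlines()), "lines")
```

Because the writer consumes a generator, only one feature is held in memory at a time; the same function could write to a subprocess pipe instead of the in-memory buffer used here.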

When designing GeoParquet Input Processing stages, prioritize row-group alignment with your target tile grid. Misaligned partitions cause redundant geometry reads and increase conversion latency by 30–50%. For authoritative details on how GeoParquet encodes geometries and metadata, consult the official GeoParquet Specification.
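
One way to reason about alignment with the tile grid is to label each feature with the tile that contains its centroid at a coarse zoom level. The sketch below uses the standard XYZ (slippy map) tile formula; partition_key and its default zoom of 6 are hypothetical choices for illustration, not part of any library.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) XYZ tile index containing a WGS84 point at a zoom level."""
    # Web Mercator is only defined up to about 85.0511 degrees of latitude.
    lat = max(min(lat, 85.05112878), -85.05112878)
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon == 180 or the southern limit stays inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def partition_key(lon: float, lat: float, zoom: int = 6) -> str:
    """Hypothetical partition label: the tile that a feature's centroid falls in."""
    x, y = lonlat_to_tile(lon, lat, zoom)
    return f"{zoom}/{x}/{y}"
```

Grouping or sorting rows by such a key before writing the GeoParquet file keeps features from the same tile in neighbouring row groups. Note that these are XYZ indices; MBTiles stores rows in the flipped TMS scheme.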

Production-Ready Streaming Code

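The following Python script demonstrates a memory-safe approach that reads a GeoParquet file in configurable batches, streams GeoJSON features to tippecanoe via standard input, and applies zoom-level optimization. It requires pyarrow>=12.0, shapely>=2.0, and tippecanoe installed on the host system. It assumes the geometry column is named geometry, is WKB-encoded, and holds longitude/latitude coordinates, which is what tippecanoe expects by default.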

python
import subprocess
import json
import sys
from pathlib import Path
import pyarrow.parquet as pq
from shapely.geometry import mapping
from shapely import wkb

def stream_geoparquet_to_mvt(
    parquet_path: str,
    output_path: str,
    max_zoom: int = 14,
    min_zoom: int = 5,
    simplification: float = 1.0,
    chunk_size: int = 100_000
) -> None:
    """Stream GeoParquet chunks directly to tippecanoe for MVT generation."""
    if not Path(parquet_path).exists():
        raise FileNotFoundError(f"Parquet file not found: {parquet_path}")

    # Open file for chunked iteration (avoids full load)
    pf = pq.ParquetFile(parquet_path)
    
    # Build tippecanoe command with production-safe defaults
    cmd = [
        "tippecanoe",
        "--output", output_path,
        "--maximum-zoom", str(max_zoom),
        "--minimum-zoom", str(min_zoom),
        "--simplification", str(simplification),
        "--drop-densest-as-needed",
        "--coalesce-densest-as-needed",
        "--force",
        "--quiet"
    ]

    # Start subprocess with stdin pipe for NDJSON streaming.
    # stdout is discarded: an unread stdout pipe could fill up and stall tippecanoe.
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        encoding="utf-8"
    )

    try:
        # iter_batches takes batch_size (not chunk_size) as its row-count argument
        for batch in pf.iter_batches(batch_size=chunk_size):
            # Extract columns
            geom_col = batch.column("geometry")
            attr_cols = {name: batch.column(name) for name in batch.column_names if name != "geometry"}
            
            # Pre-convert attribute columns to Python lists for faster indexing
            attr_lists = {k: v.to_pylist() for k, v in attr_cols.items()}
            geom_list = geom_col.to_pylist()

            for i, wkb_bytes in enumerate(geom_list):
                if wkb_bytes is None:
                    continue
                
                # Decode WKB -> Shapely -> GeoJSON mapping
                geom = wkb.loads(wkb_bytes)
                feature = {
                    "type": "Feature",
                    "geometry": mapping(geom),
                    "properties": {k: v[i] for k, v in attr_lists.items()}
                }
                
                # default=str keeps dates, decimals, and other non-JSON types from raising TypeError
                proc.stdin.write(json.dumps(feature, separators=(",", ":"), default=str) + "\n")

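        # Signal end of stream, then drain stderr before waiting so a full pipe cannot block tippecanoe
        proc.stdin.close()
        err = proc.stderr.read()
        proc.wait()

    except Exception as e:
        # Stop tippecanoe and reap the process before re-raising
        proc.kill()
        proc.wait()
        raise RuntimeError(f"Streaming pipeline failed: {e}") from e

    # Checked outside the try block so this error is not wrapped a second time
    if proc.returncode != 0:
        raise RuntimeError(f"tippecanoe exited with code {proc.returncode}: {err}")

    print(f"✅ Successfully generated tiles at {output_path}")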

if __name__ == "__main__":
    # Example usage
    stream_geoparquet_to_mvt(
        parquet_path="data/large_dataset.parquet",
        output_path="output/tiles.mbtiles",
        max_zoom=14,
        min_zoom=4,
        simplification=1.0,
        chunk_size=250_000
    )
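
The script above hard-codes the column name geometry and assumes WKB encoding. The GeoParquet specification records both in a JSON document stored under the "geo" key of the file metadata, so they can be checked before streaming. The following is a minimal sketch of parsing that document; describe_geometry_column is a hypothetical helper, and the sample metadata is hand-written for illustration.

```python
import json


def describe_geometry_column(geo_metadata: bytes) -> dict:
    """Parse the GeoParquet "geo" metadata value and describe the primary geometry column."""
    meta = json.loads(geo_metadata)
    name = meta["primary_column"]
    column = meta["columns"][name]
    return {
        "name": name,
        "encoding": column["encoding"],
        # "crs" is optional; per the spec an absent key means OGC:CRS84 (lon/lat).
        "crs_declared": "crs" in column,
    }


# With pyarrow, the raw value is typically available as:
#   pq.ParquetFile(path).metadata.metadata[b"geo"]
# A hand-written stand-in is used here so the sketch runs without a file.
sample = json.dumps({
    "version": "1.0.0",
    "primary_column": "geom",
    "columns": {"geom": {"encoding": "WKB", "geometry_types": ["Polygon"]}},
}).encode("utf-8")

info = describe_geometry_column(sample)
print(info["name"], info["encoding"])
```

If the reported name differs from geometry, or the encoding is not WKB, the streaming function would need the corresponding adjustment. A declared CRS other than longitude/latitude means the geometries should be reprojected before they reach tippecanoe.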

Memory & Performance Optimization

Streaming alone isn’t enough. You must tune both the read and tile-generation phases to prevent bottlenecks:

  • Attribute Pruning: Only read columns that will appear in tile properties, for example by passing columns=[...] to iter_batches. Every unused column increases RAM pressure during batch deserialization.
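  • Geometry Simplification: Apply shapely.simplify() or ST_Simplify (via DuckDB) before serialization if your source data contains survey-grade precision. Tippecanoe’s --simplification flag is applied during tile generation, after the full-precision features have already been serialized and parsed, so it does nothing for the size of the stream itself; pre-simplifying can reduce NDJSON payload size by 40–60%, depending on vertex density and tolerance.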
  • Zoom-Level Coalescing: Use --drop-densest-as-needed and --coalesce-densest-as-needed to prevent tile bloat at lower zoom levels. For dense urban datasets, cap --maximum-zoom at 15 unless street-level detail is explicitly required.
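  • Row-Group Alignment: When writing the source GeoParquet file, align row groups to spatial tiles (e.g., 100k rows per group, sorted by a spatial key derived from ST_Centroid, such as a Hilbert or Z-order index). Combined with bounding-box columns whose row-group statistics can be compared against the query extent, this allows pyarrow to skip irrelevant chunks entirely during predicate filtering.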
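
Sorting by centroid requires reducing two coordinates to a single sortable value. A Morton (Z-order) code is one common way to do that; a Hilbert index is another. The sketch below is an illustrative assumption rather than a method prescribed here, and interleave_bits and morton_key are hypothetical helper names.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_key(lon: float, lat: float, bits: int = 16) -> int:
    """Map a lon/lat centroid to a sortable key; nearby points tend to get nearby keys."""
    cells = (1 << bits) - 1
    # Scale each coordinate onto an integer grid covering the whole globe.
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    return interleave_bits(x, y, bits)


# Hypothetical centroids: one in Sydney, two in Paris.
centroids = [(151.21, -33.87), (2.35, 48.86), (2.29, 48.85)]
ordered = sorted(centroids, key=lambda c: morton_key(*c))
```

After sorting, the two Paris centroids end up next to each other, so they would land in the same or adjacent row groups when the file is written.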

For advanced flag combinations and layer configuration, reference the official Tippecanoe Documentation. When integrating this workflow into broader infrastructure, review Automated Generation Pipelines with Tippecanoe for CI/CD patterns and cache-invalidation strategies.

Validation & Deployment

After generation, verify tile integrity before deploying to a CDN or tile server:

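  1. Schema Check: Use mbutil or sqlite3 to inspect the tiles table and confirm zoom_level, tile_column, and tile_row ranges match your expected bounds. Keep in mind that MBTiles numbers tile_row in the TMS scheme, which is flipped vertically relative to XYZ tile URLs.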
  2. Render Test: Serve the .mbtiles via tileserver-gl or mbview and pan across extreme zoom levels to check for geometry clipping or missing attributes.
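  3. Size Audit: Compare total tile size per zoom level. Tile counts can grow up to fourfold with each level, so size should rise steadily with zoom rather than jump abruptly. If a single zoom level exceeds 2GB, increase --simplification, or drop the offending level by raising --minimum-zoom (if it is a low zoom) or lowering --maximum-zoom (if it is a high zoom).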
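
The schema check and size audit can both be run with Python's built-in sqlite3 module. The sketch below assumes the standard MBTiles layout, in which tiles is a table or view exposing zoom_level and tile_data; zoom_summary is a hypothetical helper name.

```python
import sqlite3


def zoom_summary(mbtiles_path: str) -> list[tuple[int, int, int]]:
    """Return (zoom_level, tile_count, total_bytes) rows for an MBTiles file."""
    conn = sqlite3.connect(mbtiles_path)
    try:
        rows = conn.execute(
            "SELECT zoom_level, COUNT(*), SUM(LENGTH(tile_data)) "
            "FROM tiles GROUP BY zoom_level ORDER BY zoom_level"
        ).fetchall()
    finally:
        conn.close()
    return [(zoom, count, size) for zoom, count, size in rows]


# Example call, using the output path from the script earlier in this article:
#   for zoom, count, size in zoom_summary("output/tiles.mbtiles"):
#       print(f"z{zoom}: {count} tiles, {size / 1_048_576:.1f} MiB")
```

A zoom level that is missing from the result, or one whose byte total dwarfs its neighbours, is worth investigating before deployment.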

By enforcing chunked I/O, strict attribute filtering, and direct NDJSON piping, you can reliably convert multi-gigabyte GeoParquet datasets into production-grade vector tiles without exceeding standard container memory limits.