Taming S3 Shuffle at Scale
How we fixed the GET request explosion, prefix throttling, and threading edge cases that emerge when S3 shuffle meets production scale.
Deep dives into architecture, data engineering, and the tools we build.
How we replaced bespoke PySpark scripts with a config-driven, hook-based framework inspired by Rust's composition model.
How an open-source Spark UI replacement helped us find data skew, partition bloat, and shuffle spill.
How S3 shuffle lets us run Spark executors 100% on spot instances with Karpenter, cutting compute costs 70-85%.
How we run production Spark jobs on Kubernetes with one SparkApp YAML, pre-baked images, and sub-10-second warm starts.
How we cut our daily upsert pipeline from an hour to under 4 minutes using storage partition joins and shuffle hash hints.
How we designed a modern lakehouse architecture using Iceberg, Lakekeeper, and Spark on AWS.