Titian: Data Provenance Support in Spark
Proceedings of the VLDB Endowment (PVLDB), 9(3): 216-227, November 2015.
Selected for The VLDB Journal's special issue on best
papers of VLDB 2016
Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, Tyson Condie
Debugging data processing logic in Data-Intensive
Scalable Computing (DISC) systems is a difficult and
time-consuming effort. Today's DISC systems offer very little tooling
for debugging programs, and as a result programmers spend
countless hours collecting evidence (e.g., from log files) and
performing trial-and-error debugging. To aid this effort, we
built Titian, a library that enables data provenance -- tracking
data through transformations -- in Apache Spark. Data scientists
using the Titian Spark extension will be able to quickly
identify the input data at the root cause of a potential bug
or outlier result. Titian is built directly into the Spark
platform and offers data provenance support at interactive
speeds -- orders of magnitude faster than alternative
solutions -- while minimally impacting Spark job performance;
observed overheads for capturing data lineage rarely exceed 30%
above the baseline job execution time.
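To make the idea of tracking data through transformations concrete, the following is a minimal sketch in plain Python. It is not Titian's API and does not use Spark; the helper names (with_ids, flat_map, count_by_key) and the sample log lines are hypothetical, chosen only to show how an output record can be traced back to the input records that produced it.

```python
from collections import defaultdict

# Illustrative only: these helpers are hypothetical and are not part of
# Titian or Spark. Each record is a (value, lineage) pair, where lineage
# is the set of input positions that contributed to the value.

def with_ids(lines):
    """Tag every input record with its position so it can be traced later."""
    return [(line, {i}) for i, line in enumerate(lines)]

def flat_map(records, fn):
    """Apply fn to each value; every output inherits its parent's lineage."""
    return [(out, ids) for value, ids in records for out in fn(value)]

def count_by_key(records):
    """Count occurrences of each value, unioning the lineage of contributors."""
    counts = defaultdict(int)
    lineage = defaultdict(set)
    for value, ids in records:
        counts[value] += 1
        lineage[value] |= ids
    return {key: (counts[key], lineage[key]) for key in counts}

# Made-up sample input for the sketch.
log_lines = [
    "INFO job started",
    "ERROR disk full",
    "INFO job finished",
    "ERROR disk full",
]

# Count the first word (the log level) of each line.
levels = flat_map(with_ids(log_lines), lambda line: line.split()[:1])
result = count_by_key(levels)

# Trace the "ERROR" count back to the input lines that produced it.
error_count, error_ids = result["ERROR"]
error_inputs = [log_lines[i] for i in sorted(error_ids)]
print(error_count, error_inputs)
```

Carrying a set of input positions alongside every record is the simplest possible design and would be costly at scale; it is used here only to show what tracing an outlier result back to its inputs means.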
Superseded by this paper.