CS239: Parallel/Distributed Programming – Languages and Systems

Recent resurgence in parallel and distributed programming systems is due to need to perform large-scale computations on large-scale data. Study designed to bring students up to date on state of the art. Students expected to read and present research papers in areas of parallel programming, distributed data processing, and programming languages. Presentations intermixed with lectures that supply background necessary for upcoming research papers. Primarily research study with explicit goal of identifying new research projects. Requires formal foundations, but focus is on system optimizations and performance.

Recommended: Some understanding of and experience with Hadoop/Spark.

Date	Papers to Read	Material
	Introduction to large-scale data analytics
Mar 28	MapReduce: Simplified Data Processing on Large Clusters	Slides
Mar 30	A comparison of approaches to large-scale data analysis	Quiz, Slides
	Language Integration
Apr 4	DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language	Quiz, Slides
Apr 6	FlumeJava: Easy, Efficient Data-Parallel Pipelines	Quiz, Slides
	Runtimes & Performance
Apr 11	a) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing	Slides
	b) Spark SQL: Relational Data Processing in Spark
Apr 13	a) Reining in the Outliers in Map-Reduce Clusters using Mantri	Slides
	b) Making Sense of Performance in Data Analytics Frameworks
	Optimizations and Complex Event Processing
Apr 18	a) Opening the Black Boxes in Data Flow Optimization	Quiz, Slides
	b) Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE
Apr 20	a) Efficient pattern matching over event streams	Slides
	b) Distributed Complex Event Processing with Query Rewriting
	Array Processing and Streaming
Apr 25	a) The Architecture of SciDB + A Demonstration of SciDB: A Science-Oriented DBMS	Slides
	b) Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines
Apr 27	a) Discretized streams: Fault-tolerant streaming computation at scale	Slides
	b) The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
	Approximation
May 2	Guest Lecture: Andy Konwinski of Databricks
May 4	a) BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data	Slides
	b) Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters
	Graph Analytics
May 9	a) Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud	Discussion Qns, Slides
	b) Scalability! But at what COST?
May 11	a) Latency-Tolerant Software Distributed Shared Memory	Discussion Qns, Slides
	b) Arabesque: A System for Distributed Graph Mining
	Curation & Transactions
May 16	a) Wrangler: Interactive Visual Specification of Data Transformation Scripts	Slides
	b) Data Curation at Scale: The Data Tamer System
May 18	a) No compromises: distributed transactions with consistency, availability, and performance	Discussion Qns, Slides
	b) Type-Aware Transactions for Faster Concurrent Code
	Large Scale Machine Learning + Project Presentations
May 23	a) Large Scale Distributed Deep Networks	Slides
	b) Scaling Distributed Machine Learning with the Parameter Server
	Class Project
May 25	No class (prepare for project)
May 30	No class (Memorial Day)
Jun 1	Project Presentations
	End of Class
Jun 6-10	No Exams