CS239: Parallel/Distributed Programming – Languages and Systems

Recent resurgence in parallel and distributed programming systems is due to the need to perform large-scale computations on large-scale data. Study designed to bring students up to date on state of the art. Students expected to read and present research papers in areas of parallel programming, distributed data processing, and programming languages. Presentations intermixed with lectures that supply background necessary for upcoming research papers. Primarily research study with explicit goal of identifying new research projects. Requires formal foundations, but focus is on system optimizations and performance.

Recommended: Some understanding of and experience with Hadoop/Spark.

 

Date

Papers to Read

Material

 

Introduction to large-scale data analytics

 

Mar 28

MapReduce: Simplified Data Processing on Large Clusters

Slides

Mar 30

A Comparison of Approaches to Large-Scale Data Analysis

Quiz, Slides

 

Language Integration

 

Apr 4

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Quiz, Slides

Apr 6

FlumeJava: Easy, Efficient Data-Parallel Pipelines

Quiz, Slides

 

Runtimes & Performance

 

Apr 11

a)       Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Slides

 

b)      Spark SQL: Relational Data Processing in Spark

 

Apr 13

a)       Reining in the Outliers in Map-Reduce Clusters using Mantri

Slides

 

b)      Making Sense of Performance in Data Analytics Frameworks

 

 

Optimizations and Complex Event Processing

 

Apr 18

a)       Opening the Black Boxes in Data Flow Optimization

Quiz, Slides

 

b)      Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE

Apr 20

a)       Efficient Pattern Matching over Event Streams

Slides

 

b)      Distributed Complex Event Processing with Query Rewriting

 

Array Processing and Streaming

 

Apr 25

a)       The Architecture of SciDB + A Demonstration of SciDB: A Science-Oriented DBMS

Slides

 

b)      Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines

Apr 27

a)       Discretized Streams: Fault-Tolerant Streaming Computation at Scale

Slides

 

b)      The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

 

Approximation

 

May 2

Guest lecture: Andy Konwinski of Databricks

May 4

a)       BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

Slides

 

b)      Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters

 

Graph Analytics

 

May 9

a)       Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud

Discussion Qns, Slides

 

b)      Scalability! But at what COST?

May 11

a)       Latency-Tolerant Software Distributed Shared Memory

Discussion Qns, Slides

 

b)      Arabesque: A System for Distributed Graph Mining

 

Curation & Transactions

 

May 16

a)       Wrangler: Interactive Visual Specification of Data Transformation Scripts

Slides

 

b)      Data Curation at Scale: The Data Tamer System

May 18

a)       No Compromises: Distributed Transactions with Consistency, Availability, and Performance

Discussion Qns, Slides

 

b)      Type-Aware Transactions for Faster Concurrent Code

 

Large Scale Machine Learning + Project Presentations

 

May 23

a)       Large Scale Distributed Deep Networks

Slides

 

b)      Scaling Distributed Machine Learning with the Parameter Server

 

Class Project

 

May 25

No class (prepare for project)

 

May 30

No class (Memorial Day)

 

Jun 1

Project Presentations

 

 

End of Class

 

Jun 6-10

No Exams