Date
|
Papers to Read
|
Material
|
|
Introduction to Large-Scale Data Analytics
|
|
Mar 28
|
MapReduce: Simplified Data Processing on Large Clusters
|
Slides
|
Mar 30
|
A comparison of approaches to large-scale data analysis
|
Quiz, Slides
|
|
Language Integration
|
|
Apr 4
|
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
|
Quiz, Slides
|
Apr 6
|
FlumeJava: Easy, Efficient Data-Parallel Pipelines
|
Quiz, Slides
|
|
Runtimes & Performance
|
|
Apr 11
|
a) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
|
Slides
|
|
b) Spark SQL: Relational Data Processing in Spark
|
|
Apr 13
|
a) Reining in the Outliers in Map-Reduce Clusters using Mantri
|
Slides
|
|
b) Making Sense of Performance in Data Analytics Frameworks
|
|
|
Optimizations and Complex Event Processing
|
|
Apr 18
|
a) Opening the Black Boxes in Data Flow Optimization
|
Quiz, Slides
|
|
b) Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE
|
|
Apr 20
|
a) Efficient pattern matching over event streams
|
Slides
|
|
b) Distributed Complex Event Processing with Query Rewriting
|
|
|
Array Processing and Streaming
|
|
Apr 25
|
a) The Architecture of SciDB + A Demonstration of SciDB: A Science-Oriented DBMS
|
Slides
|
|
b) Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines
|
|
Apr 27
|
a) Discretized streams: Fault-tolerant streaming computation at scale
|
Slides
|
|
b) The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
|
|
|
Approximation
|
|
May 2
|
Guest Lecture: Andy Konwinski of Databricks
|
|
May 4
|
a) BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
|
Slides
|
|
b) Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters
|
|
|
Graph Analytics
|
|
May 9
|
a) Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
|
Discussion Questions, Slides
|
|
b) Scalability! But at what COST?
|
|
May 11
|
a) Latency-Tolerant Software Distributed Shared Memory
|
Discussion Questions, Slides
|
|
b) Arabesque: A System for Distributed Graph Mining
|
|
|
Curation & Transactions
|
|
May 16
|
a) Wrangler: Interactive Visual Specification of Data Transformation Scripts
|
Slides
|
|
b) Data Curation at Scale: The Data Tamer System
|
|
May 18
|
a) No compromises: distributed transactions with consistency, availability, and performance
|
Discussion Questions, Slides
|
|
b) Type-Aware Transactions for Faster Concurrent Code
|
|
|
Large Scale Machine Learning + Project Presentations
|
|
May 23
|
a) Large Scale Distributed Deep Networks
|
Slides
|
|
b) Scaling Distributed Machine Learning with the Parameter Server
|
|
|
Class Project
|
|
May 25
|
No class (prepare for project)
|
|
May 30
|
No class (Memorial Day)
|
|
Jun 1
|
Project Presentations
|
|
|
End of Class
|
|
Jun 6-10
|
No Exams
|
|