CS 239: Big Data Systems

Fall 2019, Mon Wed 4:00-5:50pm, Haines Hall Room A25

Instructor: Harry Xu, Office: Rm 496A Engineering VI, Office hours: by appointment, Credits: 4

Schedule 

Paper list

Project (TBD)

Course Overview 

Modern computing has entered the era of big data. This class will introduce the concepts and the state of the art in modern big data systems. Specifically, we will cover these topics:

- Key challenges in big data processing

- Storage systems such as HDFS, GFS, Bigtable, Spanner, and Windows Azure Storage

- General-purpose dataflow systems such as MapReduce, Dryad, Scope/Cosmos, and Spark

- Scheduling and resource management in cluster computing, such as YARN and Mesos

- Batch processing systems such as Hive and Spark SQL

- Streaming data processing systems such as Storm, Flink, DStream, and Naiad

- Distributed and single-machine systems for large-scale graph processing, such as Pregel, Ligra, GraphX, and PowerGraph

- Systems for machine learning, such as Parameter Server and TensorFlow

Prerequisites 

There are no formal prerequisites, but it will help to have some background in operating systems, distributed systems, or database systems.

 

Course Organization 

The course will have the following three parts: 

 

(1) Lectures by the Instructor 

I'll spend one meeting covering background before starting each topic.

 

(2) Student presentations

 

Each student will present one or two papers from the list. For each presentation, two students (other than the presenter) are responsible for giving feedback on how to improve the presentation.

 

(3) Projects

Students will form groups to undertake a project that explores a new research idea in a related area. Detailed project information can be found here.

 

Grading 

The activities of the class include the following four parts:

 

(1) Paper critiques (15%): Two presentations are scheduled for each class. Students are required to carefully read a number of papers on the same topic before the class and write critiques of the two papers that will be presented. Critiques are due at noon on the day of the class. For example, critiques for the Monday class are due at 12pm on Monday.

 

(2) Paper presentation (30%): The two papers presented in each class cover similar topics, so we will have an opportunity to read and compare a range of papers solving similar problems. The presenter is also the discussion leader and is expected to prepare a set of interesting questions that can provoke further thought and discussion. Although the number of papers each student needs to present will be determined by the number of students in the class, I expect each student to present at least two papers in the quarter.

 

(3) In-class discussion (15%): We will have in-depth discussions not only on the papers presented, but also on related papers and creative ideas that may open up opportunities for future work. 

 

(4) Projects (40%): Research projects are conducted on a per-group basis. Each group consists of two students who work together to develop a novel research idea. I will set up several meetings with each group to have in-depth discussions on the proposed project. Each group will present its project twice in class: a kickoff presentation to propose the project (typically in the third and/or fourth week) and a progress presentation at the end of the quarter. Each group is required to turn in a project writeup (like a research paper) that describes the idea, the implementation, and the experimental results. I hope some of the high-quality project reports can be turned into top-conference submissions.

 

Paper Reading/Reviewing Tips

How to read a paper 

Notes on Constructive and Positive Reviewing

The task of the referee

How to write a review for a system paper