CS 239 Tools and Environments for Developing Big Data Analytics-- Current Topics in Programming Languages and Systems, Winter 2018

Instructor: Dr. Miryung Kim (BH 4532H)

Lectures: Mondays and Wednesdays 12PM to 1:50 PM, ENG 6, Room 472
Office Hours: By appointment only

General Description

An abundance of data in science, engineering, national security, and health care has led to the emerging field of big data analytics. To process massive quantities of data, developers leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google’s MapReduce, Apache Hadoop, and Apache Spark. While DISC systems help address the scalability challenges of big data analytics, they also introduce new challenges in developing, debugging, and testing applications. Because the data is ultra-large in scale, it is infeasible for developers to read through the production data a priori and design test inputs for their applications. This problem is exacerbated by the fact that data originates from diverse sources and is often unstructured, ill-formatted, and schema-less. When errors (e.g., program crashes or outlier results) arise, developers cannot easily find a subset of the input data that reproduces the problem. Furthermore, while running big data analytics, it is challenging for users to understand performance implications.

In this class, we discuss recent advances in tools and environments for developing big data analytics, with a focus on software tooling, environments, and system stacks. The topics include:

data science and engineering practices: studies of professional data scientists, quality issues in big data systems, example big data system stacks at large software companies
data-intensive scalable computing, with a focus on Apache Spark
methods for interactive and automated debugging of big data analytics
methods for testing big data analytics
provenance techniques for explaining the outputs of big data analytics
techniques for performance understanding and debugging
environments for resource management and scheduling for cluster computing and cluster virtualization
stream processing
heterogeneous programming models and languages for data center applications
interactive data wrangling techniques

Audience and Prerequisites

This is a seminar class geared towards PhD students. If you are a master's student, this class is not suitable for you. The only exception is for those who will be doing a master's thesis or capstone project in the research area of software engineering. This class is intended for PhD students to research recent advances in tools and environments for developing big data analytics. If you are not comfortable reading academic research papers (2-4 papers per week, each 12+ pages long), you will find it challenging to keep up with this class. A significant portion of your grade will be based on your ability to articulate your own in-depth analysis of research papers.

If you are unsure of your qualifications, please contact the instructor, who will determine whether this course is right for you. Regardless of your current registration status, I suggest that you get an "okay" from me; this ensures that you can keep up with this class and gives you early information on whether I believe you have adequate knowledge. Please email me your CV, unofficial transcripts, your status in a graduate program, and a description of your current research project with your advisor. If you use the email subject [CS239 Winter 2018: Qualification Inquiry], I can easily notice the email and will get back to you soon.

If you would like to learn about basic statistics, machine learning, and data mining, this may not be the class you want: this class is not about teaching data science. It is about tools, system stacks, programming models, and environments relevant to big data systems and cloud computing.

Important Notes Before Registering for This Course

Computing Expenses: This course may require you to have a computer that is capable of running big data analytics (for example, one with a large hard disk, a large memory, and dual or quad cores). Each Docker image may easily exceed 3GB. Each data file download could be on the order of 10GB. The instructor has applied for Google Cloud education credit, which is only $50. If you enter your personal credit card information, Google gives the first $300 of credit free. However, this may not be enough to cover your expenses for this course. Please keep in mind that you may need to spend your own money on public cloud resources or purchase a powerful personal computer suitable for big data jobs.
Familiarity with the Scala Programming Language: The homework assignments will involve the use of the Scala programming language. Your familiarity with Scala, or your willingness to learn Scala before the class begins, is important. The course will not teach Scala separately and will assume that you are comfortable with functional programming in Scala.
Use of Research Prototypes: The course work will require the use of two cutting-edge research tools for debuggable and explainable big data analytics, BigDebug and BigSift, developed in my research laboratory. As with any early-stage research prototype, you may encounter bugs or usability issues. If you are not interested in exploring cutting-edge tools, or if such a burden is unacceptable to you, please be aware now that this course may not be suitable for your desired learning experience.
User Study and Data Collection: During this course, your homework submissions will require using an instrumented notebook and Spark environment that collects your programming session logs. Our goal is to use these instrumented logs for further research purposes related to runtime optimization. Your participation will be valuable for advancing scientific efforts in the area of big data analytics. We will ask for your explicit permission and consent for data collection.

Class Schedule and Reading List

This Google spreadsheet is read-only and will be updated as we progress in the class.

Grading

Project and Assignments: 50%

Lab Study Participation
Project Milestones (you should expect about 4 milestones.)
TBD for individual breakdown.

In Depth Analysis of Papers and In-Class Discussion: 50%

25%: Paper Presentation and Tool Demo (you should expect about 3 presentations.)
25%: Pop-Quiz, In-Class Q&A, Index Card Submission, Attendance (These are the ways that we will check whether you read the papers or not and whether you are actively contributing to the class discussion.)
3%: Course Survey Participation.

Tool Demonstrations and Co-Teaching

Part A. Tool Demonstration: This class will emphasize the use of tools for big software data analytics. Each team will create a set of toy examples, online tutorials, and a live in-class demonstration to teach tools and environments used for developing big data analytics. A tool demonstration should consist of a 15-minute overview based on the corresponding paper, followed by a 25-minute demonstration. The presentation will be done individually.

Part B. Paper Presentation: Each person will present an assigned paper and discuss recent advances related to the lecture's topic. Each presentation should be about 30 minutes long. The presentation will be done individually. After the presentation, you should lead an in-class discussion with your fellow classmates.

Project Assignment

The course project will involve hands-on experience developing big data software analytics in Apache Spark. The project will be completed as a sequence of 4 assignments. More details will follow through CCLE.

Reading Questions

Please download the papers from CCLE and consider the following points as you read them.

Cool or significant ideas. What is new here? What are the main contributions of the paper? What did you find most interesting? Is this whole paper just a one-off clever trick or are there fundamental ideas here which could be reused in other contexts?
Fallacies and blind spots. Did the authors make any assumptions or disregard any issues that make their approach less appealing? Are there any theoretical problems, practical difficulties, implementation complexities, overlooked influences of evolving technology, and so on? Do you expect the technique to be more or less useful in the future? What kind of code or situation would defeat this approach, and are those programs or scenarios important in practice? Note: we are not interested in flaws in presentation, such as trivial examples, confusing notation, or spelling errors. However, if you have a great idea on how some concept could be presented or formalized better, mention it.
New ideas and connections to other work. How could the paper be extended? How could some of the flaws of the paper be corrected or avoided? Also, how does this paper relate to others we have read, or even any other research you are familiar with? Are there similarities between this approach and other work, or differences that highlight important facets of both?

Class Discussion: Think-Pair-Share

How Does It Work?
1) Think. The teacher provokes students' thinking with a question, prompt, or observation. The students should take a few moments (probably not minutes) just to THINK about the question.

2) Pair. Using designated partners (such as with Clock Buddies), nearby neighbors, or a deskmate, students PAIR up to talk about the answer each came up with. They compare their mental or written notes and identify the answers they think are best, most convincing, or most unique.

3) Share. After students talk in pairs for a few moments (again, usually not minutes), the teacher calls for pairs to SHARE their thinking with the rest of the class. She can do this by going around in round-robin fashion, calling on each pair, or she can take answers as they are called out (or as hands are raised). Often, the teacher or a designated helper will record these responses on the board or on the overheads.

Presentation grading scheme

5 pt: Excellent design, complete implementation, selection of a task that is highly intellectually challenging, creative, concise yet comprehensive writing, nearly perfect answers, beautifully written and verbally communicated, eloquent presentation within a time limit
4 pt: Very good, mostly correct answers, i.e., >85%, selection of an intellectually challenging project topic, well written, good verbal presentation, well practiced presentation within a time limit
3 pt: Good understanding of the key concepts, mostly correct answers, i.e., >70%, selection of an intellectually challenging project topic, well written, good verbal presentation, well practiced presentation within a time limit
2 pt or 1 pt: Poor, shallow, minimally sufficient, or needlessly wordy, key concepts misunderstood or missing, selection of easy project tasks, poor written and verbal communication, presentation over time, not following specified formats