CS 239: Tools and Environments for Developing Big Data Analytics,
Winter 2018
Lectures: Mondays and Wednesdays, 12:00 PM to 1:50 PM, ENG 6, Room 472
Office Hours: By appointment only
General Description
This is a seminar class geared towards PhD students. If you are a
master's student, this class is not suitable for you. The only
exception is for those who will be doing a master's thesis or capstone
project in the research area of software engineering. This class is
intended for PhD students to research recent advances in tools and
environments for developing big data analytics. In particular, if you
are not comfortable reading academic research papers (2-4 papers per
week, each 12+ pages long), you will find it challenging to keep up
with this class. A significant portion of your grade will be based on
your ability to articulate your own in-depth analysis of research
papers.
If you are unsure of your qualifications, please contact the
instructor, who will determine whether this course is right for you.
Regardless of your current registration status, I suggest that you
get an "okay" from me. This ensures that you can keep up with this
class and gives you early information on whether I believe you have
adequate knowledge. Please email me your CV, unofficial transcripts,
your status in a graduate program, and a description of your current
research project with your advisor. If you title your email [CS239
Winter 2018: Qualification Inquiry], I can easily notice it and will
get back to you soon.
If you would like to learn about basic statistics, machine learning,
and data mining, this may not be the class you want. This class is not
about teaching data science; it is about tools, system stacks,
programming models, and environments relevant to big data systems and
cloud computing.
Important Notes Before Registering for This Course
- Computing Expenses: This course may require
you to have a computer that is capable of running big data analytics
(for example, one with a large hard disk, a large memory, and dual or
quad cores). Each Docker image may easily exceed 3GB. Each download of
a data file could be on the order of 10GB. The instructor has applied
for Google Cloud education credit, which is only $50. If you enter
your personal credit card information, Google gives the first $300 of
credit for free. However, this may not be enough to cover your
expenses. Please be aware that you may need to spend your own money
on a public cloud or purchase a powerful personal computer suitable
for big data jobs.
- Familiarity with the Scala Programming Language: The
homework assignments will involve the use of the Scala programming
language. Familiarity with Scala, or willingness to learn it before
the class begins, is important. The course will not teach Scala
separately and will assume that you are comfortable with functional
programming in Scala.
- Use of Research Prototypes: The course work
will require the use of two cutting-edge research tools for
debuggable and explainable big data analytics, BigDebug and BigSift,
developed in my research laboratory. As with any research prototype
in an early stage, you may encounter bugs or usability issues. If you
are not interested in exploring cutting-edge tools, or such a burden
is unacceptable to you, please be aware now that this course may not
be suitable for your desired learning experience.
- User Study and Data Collection: During this
course, your homework submissions will require using an instrumented
notebook and Spark environment that collects your programming
session logs. Our goal is to use these logs for further research on
runtime optimization. Your participation will be valuable for
advancing scientific effort in the area of big data analytics. We
will ask for your explicit permission and consent for data
collection.
Class Schedule and Reading List
This Google spreadsheet is read-only and will be updated as the class
progresses.
Grading
Project and Assignments: 50%
- Lab Study Participation
- Project Milestones (you should expect about 4 milestones)
- TBD for individual breakdown.
In Depth Analysis of Papers and In-Class Discussion: 50%
- 25%: Paper Presentation and Tool Demo (you should expect about 3
presentations.)
- 25%: Pop-Quiz, In-Class Q&A, Index Card Submission, Attendance
(These are the ways that we will check whether you read the papers
or not and whether you are actively contributing to the class
discussion.)
- 3%: Course Survey Participation.
Tool Demonstrations and Co-Teaching
Part A. Tool Demonstration: This class will emphasize
the use of tools for big data software analytics. Each team will create
a set of toy examples, online tutorials, and a live in-class
demonstration to teach tools and environments used for developing big
data analytics. A tool demonstration should consist of a 15-minute
overview based on the corresponding paper, followed by a 25-minute
demonstration. The presentation will be done individually.
Part B. Paper Presentation: Each person will present
the assigned paper and discuss recent advances related to the lecture's
topic. Each presentation should be about 30 minutes long. The
presentation will be done individually. After the presentation, you
should lead an in-class discussion with your fellow classmates.
Project Assignment
The course project will involve hands-on experience developing big
data software analytics in Apache Spark. The project will be completed
as a sequence of 4 assignments. More details to follow through CCLE.
Reading Questions
Class Discussion: Think-Pair-Share
How Does It Work?
1) Think. The teacher provokes students' thinking with a
question or prompt or observation. The students should take a few
moments (probably not minutes) just to THINK about the question.
2) Pair. Using designated partners (such as with Clock Buddies),
nearby neighbors, or a deskmate, students PAIR up to talk about the
answer each came up with. They compare their mental or written notes
and identify the answers they think are best, most convincing, or most
unique.
3) Share. After students talk in pairs for a few moments (again,
usually not minutes), the teacher calls for pairs to SHARE their
thinking with the rest of the class. She can do this by going around
in round-robin fashion, calling on each pair; or she can take answers
as they are called out (or as hands are raised). Often, the teacher or
a designated helper will record these responses on the board or on the
overheads.
Presentation grading scheme
- 5 pt: Excellent design, complete
implementation, selection of a task that is highly intellectually
challenging, creative, concise yet comprehensive writing, nearly
perfect answers, beautifully written and verbally communicated,
eloquent presentation within a time limit
- 4 pt: Very good, mostly correct answers, i.e.,
>85%, selection of an intellectually challenging project topic,
well written, good verbal presentation, well practiced presentation
within a time limit
- 3 pt: Good understanding of the key concepts,
mostly correct answers, i.e., >70%, selection of an intellectually
challenging project topic, well written, good verbal presentation,
well practiced presentation within a time limit
- 2 pt or 1 pt: Poor, shallow, minimally
sufficient, or needlessly wordy, key concepts misunderstood or
missing, selection of easy project tasks, poor written and verbal
communication, presentation over the time limit, not following
specified formats
Class Policy