CS 239: Data Science in Software Engineering
Current Topics in Programming Languages and Systems, Winter 2017
Lectures: Mondays and Wednesdays, 12:00-1:50 PM, Boelter 5272
Office Hours: By appointment only
Final Exam: March 15th, 12-2 PM; open-book exam
Final Presentation: Please be available on March 21st and/or 22nd
for individual presentations and demos with the instructor.
General Description
This class introduces graduate students to current research topics in
software engineering. Undergraduate-level knowledge of data structures
and object-oriented programming languages is required. Knowledge of
compilers, program analysis, and internal program representations is
encouraged. If you would like to learn basic statistics, machine
learning, and data mining, this may not be the class you want: it is
not about teaching data science, but about improving programmer
productivity and correctness through applications of existing data
science methods. If you are unsure of your qualifications, please
contact the instructor, who will be happy to help you decide whether
this course is right for you. You are also welcome to sit in for a few
days and see how the class feels.
Class Schedule and Reading List
Week 2: 1/16 (MLK holiday), 1/18
Lectures: Software Analytics: What is it? / Data Scientists: Who are they?
Readings:
- Interactions with Big Data Analytics, Fisher et al., ACM Interactions 2012
- The Emerging Role of Data Scientists on Software Development Teams, Kim et al., ICSE 2016

Week 3: 1/23, 1/25
Lectures: Change Recommendation / Automated Software Repair
Readings:
- Mining Version Histories to Guide Software Changes, Zimmermann et al., ICSE 2004
- Automatically Finding Patches Using Genetic Programming, Weimer et al., ICSE 2009

Week 4: 1/30, 2/1
Lecture: Defect Prediction
Readings:
- Use of Relative Code Churn Measures to Predict System Defect Density, Nagappan and Ball, ICSE 2005
- Cross-Project Defect Prediction, Zimmermann et al., ESEC/FSE 2009
*Please read Chapters 12 and 13 of "Probability and Statistics for Engineering and the Sciences."
Project Proposal Description due on 1/31, 11:59 PM

Week 5: 2/6, 2/8
Lecture: Software Anomaly Detection and Debugging
Readings:
- Bug Isolation via Remote Program Sampling, Liblit et al., PLDI 2003
- Detecting Object Usage Anomalies, Wasylkowski et al., FSE 2007

Week 6: 2/13, 2/15
Lecture: Software Anomaly Detection and Debugging
Readings:
- FaultTracer: A Spectrum-Based Approach to Localizing Failure-Inducing Program Edits, Zhang et al., JSEP 2013
- Debugging in the (Very) Large: Ten Years of Implementation and Experience, Glerum et al., SOSP 2009

Week 7: 2/20 (Presidents' Day), 2/22
Lecture: Hadoop and MapReduce
Readings:
- MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, OSDI 2004
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et al., NSDI 2012

Week 8: 2/27, 3/1
Lecture: Big Data Analytics Assistance: Debugging
Readings:
- Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds, Shang et al., ICSE 2013
- BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark, Gulzar et al., ICSE 2016

Week 9: 3/6, 3/8
Lecture: Big Data Analytics Assistance: Debugging
Readings:
- Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows, Olston and Reed, VLDB 2011
- Scalable Lineage Capture for Debugging DISC Analytics, Logothetis et al., SoCC 2013
- Titian: Data Provenance Support in Spark, Interlandi et al.
- Optimizing Interactive Development of Data-Intensive Applications, Tetali et al.

Week 10: 3/13, 3/15
Buffer
Final open-book exam: 3/15, 12-2 PM, in class

Final Week
Project final presentations on 3/21 and/or 3/22, scheduled individually.
Project final report due on 3/20, 11:59 PM
Prerequisite Reading Materials
Doing Data Science: Straight Talk from the Frontline, Cathy O'Neil &
Rachel Schutt (an e-book is available at
http://proquest.safaribooksonline.com/9781449363871).
- Linear Regression: pages 55-71 (Chapter 3: Algorithms)
- Logistic Regression and the Logit Function: Chapter 3 (Logistic Regression), pages 113-134
- Decision Trees and Entropy: Chapter 7, pages 185-187
"Probability and Statistics for Engineering and the Sciences": Chapter
12 ("Simple Linear Regression and Correlation") and Chapter 13
("Nonlinear and Multiple Regression") provide background on linear
regression, stepwise regression, R^2, adjusted R^2, SSE, and SST.
Multiple copies of the book are available at the UCLA library:
http://catalog.library.ucla.edu/vwebv/holdingsInfo?searchId=6117&recCount=50&recPointer=11&bibId=59896
R Cookbook: the e-version is available at the UCLA library. Read Chapter 9.
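As a quick self-check on the regression quantities listed above, here is a minimal sketch (in Python rather than R, purely for illustration; the data points are made up) that fits a simple least-squares line and computes SSE, SST, R^2, and adjusted R^2 by hand:

```python
# Illustrative computation of the regression metrics covered in
# Chapters 12-13: SSE (error sum of squares), SST (total sum of
# squares), R^2, and adjusted R^2 for a simple linear fit.
# The data points below are fabricated for this example.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept for y = b0 + b1 * x
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

preds = [b0 + b1 * x for x in xs]
sse = sum((y, p)[0] * 0 + (y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)                             # total variation
r2 = 1 - sse / sst
k = 1  # number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"b0={b0:.3f} b1={b1:.3f} SSE={sse:.4f} SST={sst:.4f} "
      f"R^2={r2:.4f} adjusted R^2={adj_r2:.4f}")
```

Because the fabricated data is nearly linear, R^2 comes out close to 1; adjusted R^2 is slightly lower, as it penalizes the model for each predictor used.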
Grading
Class Co-Teaching Presentation and Demonstrations: 40% (Tool 20% and
Paper 20%)
Project Final Report, Presentation, and Demo: 30%
Reading Questions and Discussion Participation (including Course
Evaluation Participation): 10%
Open Book Final Exam: 20%
Tool Demonstrations and Co-Teaching
Part A. Tool Demonstration: This class emphasizes the use of tools for
big software data analytics. Each team will create a set of toy
examples, an online tutorial, and a live in-class demonstration to
teach a tool used for big software data analytics. Example tools
include the ASM bytecode instrumentation toolkit, AspectJ, R, RStudio,
Hadoop, and Spark. Teams should submit their teaching materials five
days before their presentation date: post your tutorial online and
email its URL to the instructor. You are welcome to consult me on the
choice of your tool or technology.
Each presentation should be about 15 minutes long and may be done in a
team of 2 (or occasionally 3) people.
Part B. Paper Presentation: Each person will select one paper that
discusses recent advances related to a lecture's topic. For example,
in Week 5 the class discussion topic is automated debugging, so the
student would select a recent state-of-the-art paper (ideally from the
last two to three years) on debugging practices and automated
debugging support. The student will read the paper critically, assess
its technology and concepts, and then present their findings and
analysis to complement my lecture and the class discussion, which are
based on seminal papers on each topic. You are welcome to consult me
on the choice of your paper.
Each presentation should be about 15 minutes long and will be done
individually.
Please sign up for a tool demo and a paper presentation. The sign-up
schedule is available at http://tinyurl.com/cs239winter2017
Project Assignment
The course project will involve hands-on experience in developing big
data software analytics. A project could be a replication of an
existing idea, a proposal for a new research project, a translation of
a methodology to a new problem domain, or an assessment of an existing
technique via case studies or a controlled experiment. For your final
presentation, you are required to give a live demonstration of your
project in class. The idea does not have to be novel: the goal of this
project is to gain hands-on experience by re-implementing,
reproducing, and combining the ideas of research papers that we read
in class. Project teams should be formed in groups of 3-5 people.
Project Proposal Description Due on 1/31 11:59PM
Project Final Report Due on 3/20 11:59PM
Your report should be structured like a conference paper and contain:
Abstract
A well-motivated introduction
Related work with proper citations
Description of your methodology
Evaluation results
Discussion of your approach, threats to validity, and additional
experiments
Conclusions and future work
Appendix: Describe how to run and test your implementation.
If you are doing a project that involves implementation, please submit
your source code by sharing an on-line repository. Please describe how
to run and test your code in your report.
Here are the grading guidelines for your project report.
Motivation & Problem Definition
Does this report sufficiently describe the motivation of this project?
Does this report describe when and how this research can be used by whom
in terms of examples and scenarios?
Does the report clearly define a research problem?
Related Work
Does the report adequately describe related work?
Does the report cite and use appropriate references?
Approach
Does the report clearly & adequately present your research
approach (algorithm description, pseudo code, etc.)?
Does the report include justifications for your approach?
Evaluation
Does this report clarify your evaluation's objectives (the research
questions you raise)?
Does this report justify why it is worthwhile to answer these research
questions?
Does this report concretely describe what can be measured and compared
to existing approaches (if any exist) to answer these research questions?
Is the evaluation study design (experiments, case studies, and user
studies) sound?
Results
Does the report include empirical results that support the
author’s claims/ research goals?
Does the report provide any interpretation on results?
Is the information in the report sound, factual, and accurate?
Discussions & Future Work
Does the report suggest future research directions or make
suggestions to improve or augment the current research?
Does the report demonstrate consideration of alternative approaches?
Does it discuss threats to the validity of the evaluation?
Clarity and Writing
Is the scope of the subject reasonable for a class project?
How well are the ideas presented? (very difficult to understand =1, very
easy to understand =5)
Overall quality of writing and readability (very poor =1, excellent =5)
Reading Questions
Class Discussion: Think-Pair-Share
How Does It Work?
1) Think. The teacher provokes students' thinking with a question,
prompt, or observation. The students should take a few moments
(probably not minutes) just to THINK about the question.
2) Pair. Using designated partners (such as with Clock Buddies),
nearby neighbors, or a deskmate, students PAIR up to talk about the
answer each came up with. They compare their mental or written notes
and identify the answers they think are best, most convincing, or most
unique.
3) Share. After students talk in pairs for a few moments (again,
usually not minutes), the teacher calls for pairs to SHARE their
thinking with the rest of the class. She can do this by going around
in round-robin fashion, calling on each pair; or she can take answers
as they are called out (or as hands are raised). Often, the teacher or
a designated helper will record these responses on the board or on the
overhead projector.
Homework, project report, and presentation grading scheme
- 5 pt: Excellent design, complete implementation, selection of a
highly intellectually challenging task, creative, concise yet
comprehensive writing, nearly perfect answers, beautifully written and
verbally communicated, eloquent presentation within the time limit
- 4 pt: Very good; mostly correct answers (i.e., >85%), selection
of an intellectually challenging project topic, well written, good
verbal presentation, well-practiced delivery within the time limit
- 3 pt: Good understanding of the key concepts; mostly correct
answers (i.e., >70%), selection of an intellectually challenging
project topic, well written, good verbal presentation, well-practiced
delivery within the time limit
- 2 pt or 1 pt: Poor, shallow, minimally sufficient, or needlessly
wordy; key concepts misunderstood or missing, selection of easy
project tasks, poor written and verbal communication, presentation
exceeding the time limit, or not following specified formats
Class Policy