CS 239: Data Science in Software Engineering

Current Topics in Programming Languages and Systems, Winter 2017 

Instructor: Dr. Miryung Kim (BH 4532H)
    Lectures: Mondays and Wednesdays 12PM to 1:50 PM, BOELTER 5272
    Office Hours: By appointment only
    Final Exam: March 15th, 12-2PM, open-book exam.
    Final Presentation: Please be available on March 21st and/or 22nd for individual presentations and a demo with the instructor.

General Description

Software engineering is a data-rich activity. Software produces large quantities of data, such as user-oriented telemetry data, repository-based productivity and quality data, and business-oriented process data. Data scientists work with such data, and they hold "the sexiest job of the 21st century" according to the Harvard Business Review. Data scientists are now becoming part of mainstream software development teams, where they combine statistics, data mining, big data engineering, and automated software analysis techniques to measure software performance and quality, analyze user engagement, diagnose and debug software failures, detect server log anomalies, and more.

This course covers data science methods, techniques, and tools used in (and for) software engineering. The topics include:
  • software instrumentation and profiling methods for collecting telemetry data (a brief sketch follows this list)
  • automated debugging using anomaly detection techniques
  • defect prediction and software failure rate estimation
  • automated software repair and change recommendation systems built on analytics
  • mining software repository studies that provide insights into software evolution in the wild, and
  • debugging tools for supporting big data analytics.
The course is interdisciplinary, and we welcome and encourage students from different research areas.
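To make the instrumentation topic concrete, here is a minimal, purely illustrative sketch of source-level instrumentation for telemetry collection: a Python decorator that wraps functions and emits each call's latency and outcome as a structured log event. The names (instrument, parse_config) and the log format are hypothetical, not part of any course tool.

    import functools
    import json
    import time

    def instrument(func):
        """Wrap a function so each call emits one telemetry event as a JSON log line."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"  # record the failure, then let it propagate
                raise
            finally:
                event = {
                    "func": func.__qualname__,
                    "latency_ms": round((time.perf_counter() - start) * 1000, 3),
                    "status": status,
                    "timestamp": time.time(),
                }
                # In practice, events would be appended to a log that is later
                # aggregated and analyzed (e.g., for performance or failure trends).
                print(json.dumps(event))
        return wrapper

    @instrument
    def parse_config(text):
        return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)

    if __name__ == "__main__":
        parse_config("mode=fast\nretries=3")

Real telemetry systems make the same trade-offs at larger scale: what to record, how much overhead is acceptable, and where events are aggregated. Bytecode-level tools such as ASM or AspectJ inject equivalent probes without editing source code.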

Audience and Prerequisites

This class introduces graduate students to current research topics in software engineering. Undergraduate-level knowledge of data structures and object-oriented programming languages is required. Knowledge of compilers, program analysis, and internal program representations is encouraged. If you would like to learn basic statistics, machine learning, and data mining, this may not be the class you want: this class is not about teaching data science, but about improving programmer productivity and software correctness through applications of existing data science methods. If you are unsure of your qualifications, please contact the instructor, who will be happy to help you decide whether this course is right for you. You are also welcome to sit in for a few days and see how the class feels.

Class Schedule and Reading List


Week 2: 1/16 (MLK Day), 1/18
  Lectures: Software Analytics---What is it? / Data Scientists---Who are they?
  Readings:
  • Interactions with Big Data Analytics, Fisher et al., ACM Interactions 2012
  • The Emerging Role of Data Scientists on Software Development Teams, Kim et al., ICSE 2016

Week 3: 1/23, 1/25
  Lectures: Change Recommendation / Automated Software Repair
  Readings:
  • Mining Version Histories to Guide Software Changes, Zimmermann et al., ICSE 2004
  • Automatically Finding Patches Using Genetic Programming, Weimer et al., ICSE 2009

Week 4: 1/30, 2/1
  Lectures: Defect Prediction
  Readings:
  • Use of Relative Code Churn Measures to Predict System Defect Density, Nagappan and Ball, ICSE 2005
  • Cross-Project Defect Prediction, Zimmermann et al., ESEC/FSE 2009
  *Please read Chapters 12 and 13 of "Probability and Statistics for Engineering and the Sciences."
  Project Proposal Description due on 1/31, 11:59PM

Week 5: 2/6, 2/8
  Lectures: Software Anomaly Detection and Debugging
  Readings:
  • Bug Isolation via Remote Program Sampling, Liblit et al., PLDI 2003
  • Detecting Object Usage Anomalies, Wasylkowski et al., FSE 2007

Week 6: 2/13, 2/15
  Lectures: Software Anomaly Detection and Debugging
  Readings:
  • FaultTracer: A Spectrum-Based Approach to Localizing Failure-Inducing Program Edits, Zhang et al., JSEP 2013
  • Debugging in the (Very) Large: Ten Years of Implementation and Experience, Glerum et al., SOSP 2009

Week 7: 2/20 (Presidents' Day), 2/22
  Lectures: Hadoop and MapReduce
  Readings:
  • MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, OSDI 2004
  • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et al., NSDI 2012

Week 8: 2/27, 3/1
  Lectures: Big Data Analytics Assistance---Debugging
  Readings:
  • Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds, Shang et al., ICSE 2013
  • BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark, Gulzar et al., ICSE 2016

Week 9: 3/6, 3/8
  Lectures: Big Data Analytics Assistance---Debugging
  Readings:
  • Inspector Gadget: A Framework for Custom Monitoring and Debugging of Distributed Dataflows, Olston and Reed, VLDB 2011
  • Scalable Lineage Capture for Debugging DISC Analytics, Logothetis et al., SoCC 2013
  • Titian: Data Provenance Support in Spark, Interlandi et al.
  • Optimizing Interactive Development of Data-Intensive Applications, Tetali et al.

Week 10: 3/13, 3/15
  Lectures: Buffer
  Final Open-Book Exam: 3/15, 12-2PM, in class

Final Week
  Project final presentations on 3/21 and/or 3/22, scheduled individually
  Project final report due on 3/20, 11:59PM

Prerequisite Reading Materials

Doing Data Science: Straight Talk from the Frontline, Cathy O'Neil & Rachel Schutt (an e-book is available at http://proquest.safaribooksonline.com/9781449363871).
  • Linear Regression: Chapter 3 (Algorithms), pages 55-71
  • Logistic Regression and the Logit Function: Chapter 3 (Logistic Regression), pages 113-134
  • Decision Trees and Entropy: Chapter 7, pages 185-187
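For quick reference, the core definitions behind those readings (standard formulas stated here for convenience, not excerpts from the book): the logit links a probability p to a linear model, and entropy measures the impurity of a set S split by a decision tree.

    \[
    \operatorname{logit}(p) = \ln\frac{p}{1-p}
    = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
    \qquad
    p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
    \]
    \[
    H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
    \]

Here p_i is the fraction of examples in S belonging to class i, and c is the number of classes.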

"Probability and Statistics for Engineering and the Sciences."--Chapter 12 "Simple and Linear Regression and Correlation" and Chapter 13 "Nonlinear and Multiple Regression" share background on linear regression, step wise regression, R^2, Adjusted R^2, SSE and SST. There are multiple copies of the book at UCLA library.
 
http://catalog.library.ucla.edu/vwebv/holdingsInfo?searchId=6117&recCount=50&recPointer=11&bibId=59896
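As a refresher, the regression-fit quantities those chapters cover are defined as follows (standard definitions):

    \[
    \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
    \qquad
    \mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2,
    \qquad
    R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}},
    \]
    \[
    R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
    \]

where y_i are observed values, \hat{y}_i fitted values, \bar{y} the sample mean, n the number of observations, and k the number of predictors; adjusted R^2 penalizes adding predictors that do not improve the fit.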

R Cookbook: an e-version is available at the UCLA library. Read Chapter 9.

Grading

Class Co-Teaching Presentation and Demonstrations: 40% (Tool 20% and Paper 20%)
Project Final Report, Presentation, and Demo: 30%
Reading Questions and Discussion Participation (including Course Evaluation Participation): 10%
Open Book Final Exam: 20% 

Tool Demonstrations and Co-Teaching

Part A. Tool Demonstration: This class emphasizes the use of tools for big software data analytics. Each team will create a set of toy examples, an online tutorial, and a live in-class demonstration to teach a tool used for big software data analytics. Example tools include the ASM bytecode instrumentation toolkit, AspectJ, R, RStudio, Hadoop, Spark, etc. Teams should submit their teaching materials 5 days before their presentation date: post the tutorial online and email its URL to the instructor. You are welcome to consult me on the choice of tool or technology. Each presentation should be about 15 minutes long and can be done in a team of 2 (or occasionally 3) people.

Part B. Paper Presentation: Each student will select one paper that discusses recent advances related to a lecture's topic. For example, in Week 5 the class discussion topic is automated debugging, so the student would select a state-of-the-art recent paper (ideally from the last two to three years) on debugging practices and automated debugging support. The student will read the paper critically, assess its technology and concepts, and then present their findings and analysis to complement my lecture and the class discussion, which are based on seminal papers in each topic. You are welcome to consult me on the choice of paper. Each presentation should be about 15 minutes long and is done individually.

Please sign up for your tool demo and paper presentation. The sign-up schedule is available at http://tinyurl.com/cs239winter2017

Project Assignment

The course project will involve hands-on experience developing big data software analytics. A project could be a replication of an existing idea, a proposal for a new research project, a translation of a methodology to a new problem domain, or an assessment of an existing technique via case studies or a controlled experiment. For your final presentation, you are required to give a live demonstration of your project in class. The idea does not have to be novel; the goal of this project is to gain hands-on experience by re-implementing, reproducing, and combining the ideas of research papers that we read in class. Project teams should be formed in groups of 3-5 people.

Project Proposal Description Due on 1/31 11:59PM
Project Final Report Due on 3/20 11:59PM

Your report should be structured like a conference paper, meaning that it should contain:
  • Abstract
  • A well-motivated introduction
  • Related work with proper citations
  • Description of your methodology
  • Evaluation results
  • Discussion of your approach, threats to validity, and additional experiments
  • Conclusions and future work
  • Appendix: describe how to run and test your implementation.
If you are doing a project that involves implementation, please submit your source code by sharing an on-line repository. Please describe how to run and test your code in your report.

Here are the grading guidelines for your project report.
Motivation & Problem Definition
  • Does the report sufficiently describe the motivation of the project?
  • Does the report describe when and how this research can be used, and by whom, in terms of examples and scenarios?
  • Does the report clearly define a research problem?
Related Work
  • Does the report adequately describe related work?
  • Does the report cite and use appropriate references?
Approach
  • Does the report clearly and adequately present your research approach (algorithm description, pseudocode, etc.)?
  • Does the report include justifications for your approach?
Evaluation
  • Does the report clarify your evaluation's objectives (the research questions you raise)?
  • Does the report justify why it is worthwhile to answer those research questions?
  • Does the report concretely describe what can be measured and compared to existing approaches (if any exist) to answer those research questions?
  • Is the evaluation study design (experiments, case studies, and user studies) sound?
Results
  • Does the report include empirical results that support the authors' claims and research goals?
  • Does the report provide any interpretation of the results?
  • Is the information in the report sound, factual, and accurate?
Discussion & Future Work
  • Does the report suggest future research directions or make suggestions to improve or augment the current research?
  • Does the report demonstrate consideration of alternative approaches? Does it discuss threats to validity of the evaluation?
Clarity and Writing
  • Is the scope of the subject reasonable for a class project?
  • How well are the ideas presented? (very difficult to understand = 1, very easy to understand = 5)
  • Overall quality of writing and readability (very poor = 1, excellent = 5)

Reading Questions

Please download the assigned papers from the ACM Digital Library; access is free from any computer on campus with a valid UCLA IP address. Before each paper discussion, post a one-paragraph review, in the form of a comment containing 1 or 2 questions, on Piazza. Your questions should reveal your critical analysis of the assigned reading. Reading questions and class discussion account for 10% of the grade. It is okay to miss the reading questions for several papers; your grade will depend on the overall quantity and quality of your questions throughout the quarter. Consider the following angles:
  •     Cool or significant ideas. What is new here? What are the main contributions of the paper? What did you find most interesting? Is this whole paper just a one-off clever trick or are there fundamental ideas here which could be reused in other contexts?
  •     Fallacies and blind spots. Did the authors make any assumptions or disregard any issues that make their approach less appealing? Are there any theoretical problems, practical difficulties, implementation complexities, overlooked influences of evolving technology, and so on? Do you expect the technique to be more or less useful in the future? What kind of code or situation would defeat this approach, and are those programs or scenarios important in practice? Note: we are not interested in flaws in presentation, such as trivial examples, confusing notation, or spelling errors. However, if you have a great idea on how some concept could be presented or formalized better, mention it.
  •     New ideas and connections to other work. How could the paper be extended? How could some of the flaws of the paper be corrected or avoided? Also, how does this paper relate to others we have read, or even any other research you are familiar with? Are there similarities between this approach and other work, or differences that highlight important facets of both?

Class Discussion: Think-Pair-Share

How Does It Work?
1) Think. The teacher provokes students' thinking with a question, prompt, or observation. The students should take a few moments (probably not minutes) just to THINK about the question.

2) Pair. Using designated partners (such as with Clock Buddies), nearby neighbors, or a deskmate, students PAIR up to talk about the answer each came up with. They compare their mental or written notes and identify the answers they think are best, most convincing, or most unique.

3) Share. After students talk in pairs for a few moments (again, usually not minutes), the teacher calls for pairs to SHARE their thinking with the rest of the class. She can do this by going around in round-robin fashion, calling on each pair, or she can take answers as they are called out (or as hands are raised). Often, the teacher or a designated helper will record these responses on the board or on the overhead.


Class Policy

  • You may use your laptop to take notes (no smartphones, please). If you plan to use your laptop during lectures, please send me an email at the beginning of the quarter.
  • I promise to return graded assignments within two weeks to provide timely feedback to you.
  • Class announcements will be made through Piazza; however, not every announcement will be made available via electronic means.
  • Review questions are there to help you review concepts and prepare for quizzes. If you do not know the answers to the questions, I'd be happy to go through them with you during my office hours.