
Research Projects 

The mission of the Software Evolution and Analysis Laboratory is to improve developer productivity and software reliability during the evolution of large software systems. With a primary focus on evolution, my students and I design, implement, and evaluate automated analysis algorithms and development tools that make code changes easy to reason about, reusable, and portable. I also conduct user studies with professional software engineers and carry out statistical analysis of version histories to support data-driven decisions when designing novel tools.

These days, my research interest focuses on software engineering support for big data systems and understanding how data scientists work in software development organizations. I am also exploring how to leverage automated repair and testing for cyber physical systems, such as quadcopters.

Interactive and Automated Debugging for Big Data Analytics in Apache Spark

An abundance of data in science, engineering, national security, and health care has led to the emerging field of big data analytics. To process massive quantities of data, developers leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google's MapReduce, Apache Hadoop, and Apache Spark. However, the current cloud computing model lacks the kinds of expressive and interactive debugging features found in traditional desktop computing. We seek to address these challenges by providing interactive debugging primitives and tool-assisted fault localization services for big data analytics. We showcase data provenance and optimized incremental computation features that support interactive debugging effectively and efficiently, and we investigate new research directions on how to automatically pinpoint and repair the root cause of errors in large-scale distributed data processing. The Big Data Debugging project has a separate project site. This project is led by my PhD student Muhammad Ali Gulzar.
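The role of data provenance in debugging can be illustrated with a small sketch. It is plain Python rather than Spark, and every name in it (load, tagged_map, tagged_reduce_by_key, trace_back, and the sample records) is hypothetical; it is not the project's implementation. Each record carries the set of input line numbers it was derived from, so a suspicious output can be traced back to the inputs that produced it.

```python
def load(lines):
    """Tag every input line with its own line number as lineage."""
    return [(line, frozenset([i])) for i, line in enumerate(lines)]

def tagged_map(fn, records):
    """Apply fn to each value and keep the lineage unchanged."""
    return [(fn(value), lineage) for value, lineage in records]

def tagged_reduce_by_key(records):
    """Sum values per key and union the lineage of all contributing records."""
    totals = {}
    for (key, value), lineage in records:
        total, seen = totals.get(key, (0, frozenset()))
        totals[key] = (total + value, seen | lineage)
    return totals

def parse(line):
    """Split a 'name,amount' line into a (name, int) pair."""
    name, amount = line.split(",")
    return name, int(amount)

# Hypothetical input: one record holds an implausible amount.
lines = ["alice,10", "bob,5", "alice,-999", "bob,7"]
totals = tagged_reduce_by_key(tagged_map(parse, load(lines)))

def trace_back(key):
    """Return the input lines that contributed to the output for key."""
    _, lineage = totals[key]
    return [lines[i] for i in sorted(lineage)]

print(totals["alice"][0])   # the suspicious aggregate
print(trace_back("alice"))  # the input lines behind it
```

Tracing the negative total for alice narrows the search to two input lines instead of the whole data set, which is the kind of backward tracing that provenance support is meant to enable at scale.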

Data Scientists in Software Teams: Backgrounds, Activities, Tools, Challenges and Best Practices 

The demand for analyzing large-scale telemetry, machine, and quality data is rapidly increasing in the software industry. Data scientists are becoming popular within software teams. Facebook, LinkedIn, and Microsoft are creating a new career path for data scientists.

We have conducted an in-depth study of the emerging roles of data scientists using semi-structured interviews, and identified distinct working styles of data scientists and a set of strategies that they employ to increase the impact and actionability of their work. As a follow-up, we conducted a large-scale survey of 793 professional data scientists at Microsoft to understand their educational background, the problem topics they work on, their tool usage, and their activities. We cluster these data scientists based on the time they spend on various activities and identify 9 distinct clusters of data scientists and their corresponding characteristics. We also discuss the challenges that they face and the best practices they share with other data scientists. I lead this project in collaboration with Microsoft Research.

Mining, Assessing, and Visualizing Code Examples at Scale

Programmers often consult an online Q&A forum such as Stack Overflow to learn new APIs. We study the prevalence and severity of API misuse on Stack Overflow (SO). To reduce manual assessment effort, we design ExampleCheck, an API usage mining framework that extracts patterns from over 380K Java repositories on GitHub and subsequently reports potential API usage violations in 217,818 SO posts. We find that 31% of them may contain API usage violations that could produce unexpected behavior such as program crashes and resource leaks.
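To make the idea of checking posts against mined usage patterns concrete, here is a deliberately simplified sketch. It assumes a pattern is just an ordered list of API calls and a snippet is a flat list of call names; the pattern, the post names, and both functions are hypothetical and do not reflect ExampleCheck's actual pattern representation.

```python
def follows_pattern(calls, pattern):
    """Return True if pattern occurs in calls as an ordered subsequence."""
    remaining = iter(calls)
    # Each step consumes the iterator up to its first match, preserving order.
    return all(any(call == step for call in remaining) for step in pattern)

def find_violations(snippets, trigger, pattern):
    """Report snippets that call trigger but do not follow pattern."""
    return [name for name, calls in snippets.items()
            if trigger in calls and not follows_pattern(calls, pattern)]

# Hypothetical pattern: a stream that is opened should later be closed.
pattern = ["new FileInputStream", "read", "close"]

# Hypothetical call sequences extracted from three posts.
snippets = {
    "post_a": ["new FileInputStream", "read", "close"],
    "post_b": ["new FileInputStream", "read"],
    "post_c": ["println"],
}

violations = find_violations(snippets, "new FileInputStream", pattern)
print(violations)
```

Only post_b is reported: it opens a stream without closing it, the kind of omission that can cause a resource leak. post_c never uses the API, so it is not checked.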

There are often a massive number of related code examples and it is difficult for a user to understand the commonalities and variances among them, while being able to drill down to concrete details. We introduce an interactive visualization, called Examplore, that summarizes hundreds of code examples in one synthetic code skeleton with statistical distributions for canonicalized statements and structures enclosing an API call.
This project is led by my PhD student Tianyi Zhang.

Cooperative Repair for Cyber Physical System Resiliency


Cyber physical systems are frequently designed by multiple contractors with heterogeneity in mind for defense purposes; however, similar exploitable vulnerabilities and latent bugs affect these variant systems. We develop methods for repairing programs cooperatively, where repairs generated from one system can be applied to other similar systems. Repairs synthesized for one system will be generalized into abstract repair scripts that can be instantiated and specialized to other related systems. We also increase the scope of automated repair beyond the current state of the art, investigating novel approaches to repair classes of defects unaddressed by existing work. Finally, we develop methods for increasing confidence in automated repairs, re-establishing trust that a repaired system will behave correctly in the future, using formal proofs, statistical evidence, and simulation and visualization of the effects and boundaries of repairs. In this project, we propose a whole-system evaluation using a specimen system showcase of quadrotor UAVs.

This project is sponsored by Air Force Research Laboratory Research Grant: FA 8750-15-2-0075.
  • Automated Transplantation and Differential Testing for Clones ICSE 2017

Interacting to Specify Software 

All sectors of our society rely on the proper functioning of software. While many tools exist to help software developers ensure important functional, security, and performance properties of their software, these tools generally require developers to provide a specification of the desired properties. Unfortunately writing specifications today is a tedious, error-prone, and costly proposition. Specifications are software artifacts in their own right, yet developers have almost no support in creating and evolving them. Therefore, developers tend to write highly simple or incomplete specifications, if they write specifications at all. This project aims to address that problem by producing techniques and tools that aid and incentivize developers in creating and maintaining high-quality specifications.

This project is sponsored by National Science Foundation Award CCF-1527923: Interacting to Specify Software.
  • Interactive Code Review for Systematic Changes ICSE 2015

Analysis and Automation of Systematic Software Modifications

Extension of existing software often requires systematic and pervasive edits: programmers apply similar, but not identical, enhancements, refactorings, and bug fixes to many similar methods. The vision of this research is to produce a novel example-based program transformation approach. Our key insight is that by learning an abstract transformation from examples, we can automate systematic edits in a flexible and easy-to-use manner. In our evaluation on real-world bug fixes, our approach LASE found fix locations with 99% precision and 89% recall, and applied fixes with 91% correctness. It also fixed locations missed by human developers, correcting errors of omission. This project is sponsored by National Science Foundation CAREER Award CCF-1149391: Analysis and Automation of Systematic Software Modifications.

Keywords: Code Transformation; Refactoring; Static and Dynamic Analysis; Experimental Evaluation

Analytical Support for Investigating Software Modifications in Collaborative Development Environment

During collaborative software development, developers need to analyze past and present software modifications made by other programmers in various tasks such as peer code reviews, bug investigations, and change impact analysis. The CHIME project addresses the following fundamental questions about software modifications: (1) what is a concise and explicit representation of a program change? (2) how do we automatically extract the differences between two program versions into meaningful high-level representations? (3) how can we significantly improve developer productivity in investigating, searching, and monitoring software modifications made by other developers?

This CHIME project is sponsored by National Science Foundation Grant CCF-1117902: Analytical Support for Investigating Software Modifications in Collaborative Development Environment.

Keywords: Program Differencing; Code Change Analysis; Empirical Studies; Mining Software Archives; Collaborative Software Development

Pragmatic Techniques and Studies for Real-World Refactoring


Software is rarely written from scratch. Refactoring is a technique that is used for cleaning up legacy code or as a preparation for bug fixes or feature additions. Modern integrated development environments now ship with built-in refactoring support to automate refactoring.

By conducting surveys and interviews with professional engineers and by analyzing software version histories, we study the characteristics of real-world refactorings. Our study finds that real-world refactorings are not necessarily behavior preserving and that they are beyond the scope and capability of existing refactoring engines. Real-world refactorings are often done manually and are error-prone. Developers perceive refactoring as risky, and they have a hard time justifying refactoring investments because refactoring benefits are difficult to assess. Our goal is to design pragmatic techniques that give developers high confidence when carrying out real-world refactoring. We are currently designing analysis algorithms that support refactoring-aware code reviews and an approach that monitors and assesses the impact of real-world refactoring.

Keywords: Refactoring; Empirical Studies; Software Evolution; Code Reviews
  • A Field Study of Refactoring Benefits and Challenges at Microsoft (FSE 2012)
  • Refactoring Impact on Tests (ICSM 2012)
  • A Study of API Refactoring and Bug Fixes (ICSE 2011, Nominated for ACM SIGSOFT Distinguished Paper Award)
  • A Logic Query Approach to Refactoring Reconstruction (FSE demo 2010, ICSM 2010)

Physically Informed Assertions for Cyber Physical System Development and Debugging


The nature of systems that intertwine the cyber and the physical demands completely new approaches to developing robust, reliable, and verified software, hardware, and physical components jointly. In this project, we aim to make fundamental advances by creating a new assertion-driven approach for developing and debugging cyber-physical systems. As opposed to traditional uses of assertions in software engineering, our approach is unique in that we use mathematical models of physical phenomena to guide creation of assertions, to identify inconsistent or infeasible assertions, and to localize potential causes for CPS failures. Our goal is to produce methods and tools that facilitate the active use of physical models, which have existed for centuries, for verifying cyber-physical systems.
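As an illustration of an assertion derived from a physical model, the sketch below checks logged altitude samples against constant-acceleration kinematics, z = z0 + vz0*dt + 0.5*az*dt^2. The log format, field names, tolerance, and function names are all assumptions made for this example; they are not taken from the project's tools.

```python
def predicted_altitude(z0, vz0, az, dt):
    """Altitude after dt seconds under constant vertical acceleration."""
    return z0 + vz0 * dt + 0.5 * az * dt ** 2

def altitude_is_consistent(prev, curr, tolerance=0.5):
    """Assert that curr's altitude matches the physical prediction from prev."""
    dt = curr["t"] - prev["t"]
    expected = predicted_altitude(prev["z"], prev["vz"], prev["az"], dt)
    return abs(curr["z"] - expected) <= tolerance

def first_violation(log, tolerance=0.5):
    """Return the index of the first sample that breaks the assertion, or None."""
    for i in range(1, len(log)):
        if not altitude_is_consistent(log[i - 1], log[i], tolerance):
            return i
    return None

# Hypothetical flight log: time (s), altitude (m), vertical velocity (m/s),
# vertical acceleration (m/s^2). The last sample jumps implausibly.
log = [
    {"t": 0.0, "z": 10.0, "vz": 1.0, "az": 0.0},
    {"t": 1.0, "z": 11.0, "vz": 1.0, "az": 0.0},
    {"t": 2.0, "z": 20.0, "vz": 1.0, "az": 0.0},
]

print(first_violation(log))
```

The third sample is flagged because the model predicts an altitude of 12 m while the log reports 20 m, pointing at a sensor fault or a software error around that time step.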

This BRACE project is sponsored by National Science Foundation Grant CNS-1239498 CPS: Synergy: Physically-Informed Assertions for CPS Development and Debugging.

Code Duplication and Software Forking

It has long been believed that duplicated code fragments indicate poor software quality and that factoring out the commonality among them improves software quality; thus, previous studies focused on measuring the percentage of code clones and interpreted a large (or increasing) number as an indicator of poor quality. In contrast, we investigated how and why duplicated code is actually created and maintained using two empirical analyses. First, we used an edit capture and replay approach to gather insights into copy and paste programming practices. Second, to extend this type of change-centric analysis to programs without edit logs, we developed a clone genealogy analysis that tracks individual clones over multiple versions. By focusing on how code clones actually evolve, we found that clones are not inherently bad and that we need better support for managing clones.
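The idea of tracking a clone group over multiple versions can be sketched as follows. This is a toy model under strong assumptions: a version is a mapping from a code location to its text, a clone group is a set of locations with identical text, and the three evolution labels are simplified stand-ins chosen for this example rather than the categories used in our studies.

```python
def classify(group, old, new):
    """Describe how a clone group evolved between two consecutive versions."""
    survivors = [loc for loc in group if loc in new]
    if len(survivors) < 2:
        return "removed"          # fewer than two fragments remain
    if len({new[loc] for loc in survivors}) > 1:
        return "diverged"         # fragments no longer share the same text
    changed = any(new[loc] != old[loc] for loc in survivors)
    return "consistently changed" if changed else "unchanged"

def genealogy(group, versions):
    """Classify the group's evolution across each pair of adjacent versions."""
    return [classify(group, versions[i], versions[i + 1])
            for i in range(len(versions) - 1)]

# Hypothetical history of two cloned fragments over three versions.
versions = [
    {"A.f": "x = 1", "B.g": "x = 1"},
    {"A.f": "x = 2", "B.g": "x = 2"},  # both clones edited the same way
    {"A.f": "x = 3", "B.g": "x = 2"},  # only one clone edited
]

history = genealogy({"A.f", "B.g"}, versions)
print(history)
```

Reading the history left to right shows a group that was first changed consistently and then diverged, the sort of per-clone evolution that a percentage-of-clones metric cannot reveal.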

Keywords: Code Duplication; Software Forking; Software Reuse
  • Long-Lived Clones (FASE 2011)
  • Clone Genealogies (FSE 2005, Nominated for ACM SIGSOFT Distinguished Paper Award)
  • An Ethnographic Study of Copy and Paste Programming at IBM (ISESE 2004)