Demand for analyzing large-scale telemetry, machine, and
quality data is rapidly increasing in the software industry.
Data scientists are becoming popular within software teams.
Facebook, LinkedIn, and Microsoft are creating a new career
path for data scientists.
We have conducted an in-depth study on the emerging roles of data scientists using semi-structured interviews and identified distinct working styles of data scientists and a set of strategies that they employ to increase the impact and actionability of their work. As a follow-up, we conducted a large-scale survey with 793 professional data scientists at Microsoft to understand their educational background, the problem topics that they work on, their tool usage, and their activities. We cluster these data scientists based on the time spent on various activities and identify 9 distinct clusters of data scientists and their corresponding characteristics. We also discuss the challenges that they face and the best practices they share with other data scientists. This project is conducted in collaboration with Microsoft Research.
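As a rough illustration of clustering respondents by time spent on activities, the sketch below groups a few respondents with plain k-means. The text above does not name the clustering algorithm used in the study, so k-means is an assumption here, and the activity names and time fractions are hypothetical, not survey data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied initial centroids (deterministic)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical respondents: fraction of time spent on
# (analysis, data engineering, communication).
respondents = [
    (0.8, 0.1, 0.1),
    (0.7, 0.2, 0.1),
    (0.1, 0.8, 0.1),
    (0.2, 0.7, 0.1),
]
clusters, centroids = kmeans(respondents, [respondents[0], respondents[2]])
for members, center in zip(clusters, centroids):
    print(len(members), [round(x, 2) for x in center])
```

Each resulting centroid can be read as the typical time allocation of one group, which is the kind of profile a cluster characterization would start from.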
Modern software is bloated. Demand for new functionality
has led developers to include more and more features, many
of which become unneeded or unused as software evolves. This
phenomenon, known as software bloat, results in software
consuming more resources than it otherwise needs to. Various
debloating techniques have been proposed since the late
1990s. However, many of these techniques are built upon pure
static analysis and have yet to be extended and evaluated in
the context of modern Java applications where dynamic
language features are prevalent. To this end, we developed
an end-to-end bytecode debloating framework called JShrink.
It augments traditional static reachability analysis with
dynamic profiling and type dependency analysis and renovates
existing bytecode transformations to account for new
language features in modern Java. This work is
motivated and sponsored by the Office of Naval Research Total
Platform Cyber Protection program. More information on this
debloating project can be found here.
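The idea of augmenting static reachability analysis with dynamic profiling can be sketched as follows. This is a minimal illustration, not JShrink's implementation: the call graph, the method names, and the `dynamic_edges` parameter are all hypothetical, and real bytecode analysis involves far more than a dictionary of call edges.

```python
from collections import deque


def reachable_methods(call_graph, entry_points, dynamic_edges=()):
    """Return every method reachable from the entry points.

    call_graph: dict mapping a caller to the callees found statically.
    dynamic_edges: (caller, callee) pairs observed at run time, for example
    calls made through reflection that static analysis cannot resolve.
    """
    graph = {caller: set(callees) for caller, callees in call_graph.items()}
    for caller, callee in dynamic_edges:
        graph.setdefault(caller, set()).add(callee)

    seen = set(entry_points)
    worklist = deque(entry_points)
    while worklist:
        method = worklist.popleft()
        for callee in graph.get(method, ()):
            if callee not in seen:
                seen.add(callee)
                worklist.append(callee)
    return seen


# Hypothetical application: Loader.load invokes Plugin.run reflectively.
static_graph = {
    "Main.main": ["Parser.parse", "Loader.load"],
    "Parser.parse": ["Lexer.next"],
    "Loader.load": [],
    "Plugin.run": ["Plugin.helper"],
    "Legacy.export": [],
}
profiled_edges = [("Loader.load", "Plugin.run")]
all_methods = set(static_graph) | {"Lexer.next", "Plugin.helper"}

removable_static = all_methods - reachable_methods(static_graph, ["Main.main"])
removable = all_methods - reachable_methods(static_graph, ["Main.main"], profiled_edges)
print(sorted(removable_static))
print(sorted(removable))
```

With static edges alone, the reflectively invoked `Plugin.run` and its helper would be marked removable; adding the profiled edge keeps them and leaves only `Legacy.export` as a removal candidate.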
Programmers often
consult an online Q&A forum such as Stack Overflow to
learn new APIs. We design ExampleCheck,
an API usage mining framework that extracts patterns from
over 380K Java repositories on GitHub. ExampleCheck
subsequently reports potential API usage violations in 217K
Stack Overflow posts. We find that 31% of these posts may
contain API usage violations that could produce unexpected
behavior such as program crashes and resource leaks.
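To make the mine-then-check idea concrete, the sketch below mines simple "call A is later followed by call B" patterns from a small corpus and flags snippets that break them. It is a deliberately simplified stand-in, not ExampleCheck's algorithm: the function names, the support threshold, and the toy call sequences are assumptions made for illustration.

```python
from collections import Counter


def ordered_pairs(sequence):
    """All (a, b) pairs where call a appears before a different call b."""
    pairs = set()
    for i, a in enumerate(sequence):
        for b in sequence[i + 1:]:
            if a != b:
                pairs.add((a, b))
    return pairs


def mine_patterns(corpus, min_support=0.8):
    """Keep (a, b) if b follows a in at least min_support of sequences using a."""
    call_counts = Counter()
    pair_counts = Counter()
    for sequence in corpus:
        call_counts.update(set(sequence))
        pair_counts.update(ordered_pairs(sequence))
    return {pair for pair, n in pair_counts.items()
            if n / call_counts[pair[0]] >= min_support}


def violations(snippet, patterns):
    """Patterns whose first call occurs in the snippet without the second after it."""
    calls = set(snippet)
    present = ordered_pairs(snippet)
    return sorted((a, b) for a, b in patterns
                  if a in calls and (a, b) not in present)


# Hypothetical call sequences extracted from repository code.
corpus = [
    ["new FileInputStream", "read", "close"],
    ["new FileInputStream", "read", "read", "close"],
    ["new FileInputStream", "close"],
    ["new FileInputStream", "read", "close"],
]
patterns = mine_patterns(corpus)

# Hypothetical forum snippet that never closes the stream.
snippet = ["new FileInputStream", "read"]
print(violations(snippet, patterns))
```

A missing `close` after opening a stream is the kind of omission that can lead to a resource leak, which is why ordering patterns are a natural first thing to check.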
There are often a massive number of related code examples and it is difficult for a user to understand the commonalities and variances among them, while being able to drill down to concrete details. We introduce an interactive visualization, called Examplore, that summarizes hundreds of code examples in one synthetic code skeleton with statistical distributions for canonicalized statements and structures enclosing an API call. This project is led by my PhD student Tianyi Zhang.
Code duplication created by copy and paste is common in large software. Our research on how to cope with code duplication has enabled me to lead a new research team to address software debloating and delayering, which must be urgently addressed to secure our nation's cyber infrastructure. I am the PI of an Office of Naval Research (ONR) project, Synergistic Software Customization. Below are the details on code duplication search, differential testing, and clone removal refactoring.
Extension of existing software often requires systematic and pervasive edits: programmers apply similar, but not identical, enhancements, refactorings, and bug fixes to many similar methods. The vision of this research is to produce a novel example-based program transformation approach. Our key insight is that by learning abstract transformations from examples, we can automate systematic edits in a flexible and easy-to-use manner. In our evaluation on real-world bug fixes, our approach LASE found fix locations with 99% precision and 89% recall, and applied fixes with 91% correctness. It also fixed locations missed by human developers, correcting errors of omission. This project is sponsored by the National Science Foundation CAREER Award: Analysis and Automation of Systematic Software Modifications.
Software is rarely written from scratch. Refactoring is a technique that is used for cleaning up legacy code or as a preparation for bug fixes or feature additions. Modern integrated development environments now ship with built-in refactoring support to automate refactoring.
By conducting surveys and interviews with software
engineers and by analyzing software version histories, we
study the characteristics of real-world refactorings.
Real-world refactorings are not necessarily behavior
preserving and they are beyond the scope and capability of
existing refactoring engines. Real-world refactorings are
often performed manually and are error-prone. Developers perceive
that refactoring is risky and they have a hard time
justifying refactoring investments. Our goal is to design
pragmatic techniques to help developers have high
confidence in carrying out refactoring. We design support
for refactoring-aware code review, refactoring error
detection, and refactoring assessment.
In collaborative software development, developers need to
analyze past and present software modifications made by
other programmers in various tasks such as peer code
reviews, bug investigations, and change impact
analysis. The CHIME project addresses the following fundamental
questions about software modifications: (1) What is a
concise and explicit representation of a program
change? (2) How do we automatically extract the
differences between two program versions into meaningful
high-level representations? (3) How can
we significantly improve developer productivity in
investigating, searching, and monitoring software
modifications made by other developers?
It has long been believed
that duplicated code fragments indicate poor software
quality and factoring out the commonality among them
improves software quality; thus, previous studies focused on
measuring the percentage of code clones and interpreted a
large (or increasing) number as an indicator for poor
quality. On the other hand, we investigated how and why
duplicated code is actually created and maintained using two
empirical analyses. First, we used an edit capture and replay
approach to gather insights into copy and paste programming
practices. Second, to extend this type of change-centric analysis to
programs without edit logs, we developed a clone genealogy
analysis that tracks individual clones over multiple
versions. By focusing on how code clones actually evolve, we
found that clones are not inherently bad and that we need
better support for managing clones.
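Tracking one clone group across versions can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's genealogy extractor: each version is modeled as a hypothetical mapping from clone location to normalized code text, and the step labels are illustrative names rather than the study's own taxonomy.

```python
def classify_step(old, new):
    """Label how one clone group evolved between two consecutive versions.

    old, new: dict mapping a clone location to its normalized code text.
    """
    if old == new:
        return "same"
    if len(set(new.values())) > 1:
        # Members no longer match each other: they were edited differently.
        return "inconsistent change"
    if old.keys() != new.keys() and set(old.values()) == set(new.values()):
        # Text is unchanged, but clones were added to or removed from the group.
        return "membership change"
    return "consistent change"


def genealogy(versions):
    """Classify every step in the version history of one clone group."""
    return [classify_step(a, b) for a, b in zip(versions, versions[1:])]


# Hypothetical history of a two-member clone group over four versions.
v1 = {"A.f": "x = read(); use(x)", "B.g": "x = read(); use(x)"}
v2 = dict(v1)
v3 = {"A.f": "x = read(); check(x); use(x)", "B.g": "x = read(); check(x); use(x)"}
v4 = {"A.f": "x = read(); check(x); log(x); use(x)", "B.g": "x = read(); check(x); use(x)"}

print(genealogy([v1, v2, v3, v4]))
```

Looking at the sequence of labels, rather than a clone count at a single version, is what makes it possible to ask how clones actually evolve over time.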