Data Intensive Software Development

Our early, large-scale study scientifically characterized the emerging role of data scientists within software teams, influencing new data science majors at universities. Addressing the need for data-intensive development, we established new directions in software engineering for data-intensive computing and heterogeneous computing. Over the past decade, we've developed eleven data-intensive developer tools, making traditional code-centric analysis viable in complex data-intensive environments.

Software Engineering for Data-Intensive Computing


The rise of big data analytics highlights a critical debugging gap in current computing models, forcing data scientists into trial-and-error debugging. To address this gap, we've created data-intensive developer tools for Apache Spark that enhance reliability and efficiency by making code-centric analyses such as interactive debugging, fuzz testing, symbolic execution, and taint analysis viable in data-intensive environments.
 
This vision is summarized in our ASE Keynote on "Re-engineering SE for data-centric world" and our IEEE Software article on "SE4DA: Software Engineering for Data Analytics."

Software Engineering for Heterogeneous Computing

Data-intensive computing thrives on specialized hardware accelerators, but developing such heterogeneous applications requires niche expertise. To make it easier to leverage heterogeneous hardware, our research designs testing, debugging, and repair tools for incorporating hardware accelerators.

This vision for empowering broader development of heterogeneous applications is detailed in our ISSTA Keynote on
"Software Tools for Democratizing Heterogeneous Computing" (video).

Data Scientists in Software Teams 

Our early, large-scale study of 793 professionals first characterized the expanding data-focused roles in software teams, identifying 9 distinct sub-categories. This work directly impacted Microsoft's new data scientist career path and spurred the proliferation of data science degrees to meet this industry demand.
  • The Emerging Roles of Data Scientists in Software Development Teams, ICSE 2016
  • Data Scientists in Software Teams: State of the Art and Challenges, TSE 2018

Code Clones

Our early work on code clones laid the groundwork for automated clone detection, removal, and management. These early insights into code clones became foundational for the current landscape of AI-powered coding assistants. We received two Test of Time awards: one for understanding Android API stability and adoption, and one for refactoring identification from version histories by tracking code clones. Our study on refactoring practices in industry also contributed to the understanding of how large-scale systems are re-architected.

These two decades of research on code clones are summarized in our Dagstuhl Keynote on "A Journey through Searching Similar Code."

Searching, Tracking, and Visualizing Similar Code in Stack Overflow and GitHub

Leveraging vast open-source repositories like GitHub, we address the challenge of understanding commonalities in massive code collections. We design ultra-scale API usage mining, interactive visualization, code search, and recommendation systems, all by actively leveraging code similarity at scale.

Automating Similar Changes, Fixes, and API Usage Adaptation


Code clones, stemming from copy-paste, are common in large software, requiring systematic edits across similar methods. We pioneered example-based program transformation, which served as the basis for automated patch generation and influenced subsequent work. Our contributions also include an automated clone removal refactoring technique.

  • Automated Clone Removal Refactoring, ICSE 2015
  • In Situ Code Completion Using Edit Recipe, ICSE 2014 Demo
  • Automated Patch Generation: Learning Transformation from Multiple Examples, ICSE 2013
  • Automation of Similar Edits: Generating Transformation from a Single Example, PLDI 2011
  • Automated API Usage Adaptation, OOPSLA 2011

Refactoring Identification, Studies, and Bugs with Code Similarity

Refactoring restructures code to improve design without changing external behavior. Our work uses code similarity to identify refactorings and track their impact. To establish a scientific foundation, we quantified a Windows re-architecting effort, analyzing version history and developer surveys to assess its impact on bugs, size, complexity, and other metrics.

  • Quantifying Refactoring Benefits of Windows Re-architecting, TSE 2014
  • Refactoring Practices in Industry, FSE 2012
  • Impact of API Refactoring on Bug Fixes, ICSE 2011, Nominated for ACM SIGSOFT Distinguished Paper Award
  • Android Ecosystem API Stability and Adoption, 10-Year Retrospective Test of Time Award, ICSME 2013

Our group developed techniques to identify refactorings from version histories and detect associated bugs.

Detecting Similar Changes: Code Clones and Code Change Analysis

Our group focused on code change analysis by tracking clones. We developed tools such as RefFinder, which uses logic queries for refactoring identification, based on our insight that recurring edit patterns can be expressed as logical constraints. We also created techniques to find copy-and-paste bugs, detect refactorings, and analyze API evolution, all by tracking code clones in evolving software.

  • Detecting Semantic Inconsistencies in Code Clones, ASE 2013
  • Omission Errors in Recurring Changes, MSR 2012, ASE 2014
  • Recurring Changes to Forked Software Variants, FSE 2012
  • Long-Lived Code Clones, FASE 2011
  • Copy and Paste Programming Practices, ISESE 2004
  • Clone Genealogies, FSE 2005, Nominated for ACM SIGSOFT Distinguished Paper Award
  • LSDiff: Logical Program Differencing via Abstracting Recurring Changes, TSE 2013, ICSE 2009, ICSE 2007
  • VDiff: Differencing for Verilog HDL, ASE 2010, ACM SIGSOFT Distinguished Paper Award
  • FaultTracer: Change Impact Analysis on Regression Tests, ICSM 2011
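
The insight that a recurring edit pattern can be expressed as a logical constraint can be illustrated with a small sketch. The code below is not RefFinder itself: the `Method` fact type, the `find_move_method` rule, the token-based `body_similarity` measure, and the 0.8 threshold are all assumptions introduced only for illustration. It encodes a single "move method" rule over facts extracted from two program versions.

```python
from __future__ import annotations

from typing import NamedTuple


class Method(NamedTuple):
    """A hypothetical fact: method `name` with `body` is declared in class `cls`."""
    cls: str
    name: str
    body: str


def body_similarity(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens (an assumed, simple measure)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find_move_method(old: set[Method], new: set[Method], threshold: float = 0.8):
    """Illustrative rule:
    deleted_method(m, A) AND added_method(m2, B) AND A != B
    AND same_name(m, m2) AND similar_body(m, m2)  =>  move_method(name, A, B)
    """
    deleted = old - new  # facts that hold only in the old version
    added = new - old    # facts that hold only in the new version
    matches = []
    for d in sorted(deleted):
        for a in sorted(added):
            if (d.name == a.name and d.cls != a.cls
                    and body_similarity(d.body, a.body) >= threshold):
                matches.append((d.name, d.cls, a.cls))
    return matches


old_version = {
    Method("Order", "computeTax", "return total * rate ;"),
    Method("Order", "getTotal", "return total ;"),
}
new_version = {
    Method("TaxPolicy", "computeTax", "return total * rate ;"),
    Method("Order", "getTotal", "return total ;"),
}
print(find_move_method(old_version, new_version))
# prints [('computeTax', 'Order', 'TaxPolicy')]
```

Expressing the pattern as a conjunction of fact predicates keeps each refactoring type declarative, so further rules can be added without changing the matching loop.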

Simplification & Debloating

To address software bloat, which consumes resources and expands the attack surface, we developed a bytecode debloating framework for modern Java that simplifies software by augmenting static analysis with dynamic profiling, thereby improving security.

Our work was motivated and sponsored by the Office of Naval Research, and we are one of only five teams selected for its technology transfer to the Navy. Information on debloating can be found here.
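
As a rough illustration of how static analysis can be augmented with dynamic profiling when debloating, the sketch below computes which methods could be removed. It is a minimal sketch under stated assumptions, not the framework described above: the call-graph dictionary, the `reachable` and `methods_to_remove` helpers, and the choice to treat profiled methods as extra roots are all hypothetical.

```python
from __future__ import annotations

from collections import deque


def reachable(call_graph: dict[str, list[str]], roots) -> set[str]:
    """Breadth-first reachability over a statically built call graph."""
    seen = set(roots)
    queue = deque(seen)
    while queue:
        method = queue.popleft()
        for callee in call_graph.get(method, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def methods_to_remove(all_methods, call_graph, entry_points, profiled) -> set[str]:
    """Keep what is statically reachable from the entry points, plus anything
    observed at run time (assumed here to cover calls such as reflection that
    a static call graph can miss). Everything else is a removal candidate."""
    keep = reachable(call_graph, set(entry_points) | set(profiled))
    return set(all_methods) - keep


all_methods = {"Main.main", "Parser.parse", "Plugin.load",
               "Legacy.export", "Legacy.format"}
call_graph = {
    "Main.main": ["Parser.parse"],
    "Legacy.export": ["Legacy.format"],
}
# Plugin.load has no static caller but was seen in a profiling run.
print(sorted(methods_to_remove(all_methods, call_graph,
                               ["Main.main"], {"Plugin.load"})))
# prints ['Legacy.export', 'Legacy.format']
```

Treating profiled methods as additional roots is a conservative choice: code that static analysis alone would consider dead is kept if it was actually executed.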