Miryung Kim: Research

Research Projects

The mission of the Software Evolution and Analysis Laboratory is to improve developer productivity.

Debugging and Testing for Data Intensive Scalable Computing

An abundance of data in science, engineering, national security, and health care has led to the emerging field of big data analytics. The current big data computing model lacks the kinds of debugging features found in traditional desktop computing, forcing data scientists to debug by trial and error. To address this challenge, we designed interactive debugging, data provenance, delta debugging, taint analysis, flow analysis, symbolic execution, and fuzz testing for Apache Spark.

BigDebug: Interactive Debugger, ICSE 2016
BigSift: Automated Delta Debugging, SoCC 2017
PerfDebug: Performance Debugging, SoCC 2019
FlowDebug: Dynamic Taint Analysis with Influence-Function, SoCC 2020
OptDebug: Operational-Level Taint Analysis with Spectra-based Fault Localization, SoCC 2021
BigTest: Symbolic-Execution based Test Input Generation, FSE 2019
BigFuzz: Fuzz Testing, ASE 2020
Titian: Data Provenance, VLDB 2016
DepFuzz: Co-Dependence Aware Mutation for Big Data Analytics, FSE 2023
NaturalFuzz: Mix and Match Mutation for Big Data Analytics, ASE 2023
NaturalSym: Natural Symbolic Execution for Big Data Analytics, FSE 2024
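
To make the delta debugging idea above concrete, here is a minimal sketch of the classic ddmin input-minimization loop over an in-memory list of records. It is not BigSift's implementation, which runs on Apache Spark; the function name `ddmin`, the `fails` predicate, and the sample records are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `records` to a small subset for which `fails` still returns True."""
    current = list(records)
    if not fails(current):
        raise ValueError("the full input must reproduce the failure")

    n = 2  # current granularity: number of chunks to split the input into
    while len(current) >= 2:
        size = math.ceil(len(current) / n)
        chunks = [current[i:i + size] for i in range(0, len(current), size)]
        reduced = False

        # Step 1: does a single chunk reproduce the failure on its own?
        for chunk in chunks:
            if fails(chunk):
                current, n, reduced = chunk, 2, True
                break

        # Step 2: otherwise, does removing one chunk keep the failure?
        if not reduced:
            for i in range(len(chunks)):
                complement = [x for j, c in enumerate(chunks) if j != i for x in c]
                if fails(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # Step 3: otherwise, split more finely, or stop at single records.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


# Hypothetical failure: the job crashes whenever a negative value is present.
records = [3, 8, -7, 12, 5, 9, 1, 4]
print(ddmin(records, lambda rs: any(r < 0 for r in rs)))  # prints [-7]
```

In a data-intensive setting each call to `fails` would rerun the job on a subset of the input, so the number of calls is what a practical tool tries to reduce.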

Software Engineering for Heterogeneous Computing

Specialized hardware accelerators like GPUs, FPGAs, and ASICs have become a prominent part of the current computing landscape. However, developing heterogeneous applications is limited to a small subset of programmers with specialized hardware knowledge. To democratize heterogeneous computing, our goal is to design testing, debugging, and repair tools for software development with hardware accelerators.

SynthFuzz: Testing MLIR-based AI Accelerator Compilers with Custom Mutation Synthesis, ICSE 2025
HFuzz: Fuzzing with and for Hardware Acceleration, FSE 2023
HeteroGen: Automated Transpilation and Repair from C to HLS-C, ASPLOS 2022
HeteroFuzz: Test Input Generation for Differential Testing CPU vs. FPGA, FSE 2021
HeteroRefactor: Refactoring for Heterogeneous Applications with FPGA, ICSE 2020
QDiff: Automated Testing of Quantum Software Stacks, SIGSOFT Research Highlights, ASE 2021

Mining, Assessing, and Visualizing Code Examples at Internet Scale

There is a growing interest in leveraging large collections of open-source repositories such as GitHub. Currently, it is difficult for a user to understand the commonalities and variances among a massive number of related code examples.
To tackle the new frontier of mining software repositories research, we design ultra-scale API usage mining, interactive visualization, code search, and recommendation.

SURF: Scaling Code Pattern Inference with Interactive What-If Analysis (demo), ICSE 2024
PARALIB: Concept-Annotated Library Comparison, UIST 2022
ALICE: Code Example Search, ICSE 2019
ExampleCheck: API Usage Mining and API Misuse, ICSE 2018
Examplore: Visualizing Code Examples at Scale, CHI 2018
ExampleStack: Adapting On-line Code Examples, ICSE 2019

Java Bytecode Debloating for Size Reduction and Security

Modern software is bloated. Demand for new functionality has led developers to include more and more features, many of which become unneeded or unused. This phenomenon, known as software bloat, causes software to consume more resources and unnecessarily enlarges its attack surface.

To address this problem, we developed an end-to-end bytecode debloating framework called JDebloat. It augments traditional static reachability analysis with dynamic profiling, and it accounts for new dynamic language features in modern Java. This work is motivated and sponsored by the Office of Naval Research Total Platform Cyber Protection program and has had a technology-transfer impact on the Navy. Information on debloating can be found here.

JDebloat GitHub
JDebloat Tutorial Presentation Slides
JDebloat Tutorial Webpage Link
JShrink Paper (ESEC/FSE 2020)
JShrink GitHub
2020 and 2021 ONR Software Security Summer School
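
The combination of static reachability analysis and dynamic profiling described above can be illustrated with a small sketch. This is not JDebloat's algorithm: it assumes, for illustration only, that methods observed during profiling are treated as extra roots of a call-graph traversal, and every method name and helper below is hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable_methods(
    call_graph: Dict[str, List[str]],
    entry_points: Iterable[str],
    profiled: Iterable[str] = (),
) -> Set[str]:
    """Breadth-first traversal of the static call graph.

    Methods seen during dynamic profiling (for example, reflective call
    targets that static analysis misses) are added as extra roots.
    """
    worklist = deque(entry_points)
    worklist.extend(profiled)
    seen: Set[str] = set()
    while worklist:
        method = worklist.popleft()
        if method in seen:
            continue
        seen.add(method)
        worklist.extend(call_graph.get(method, ()))
    return seen


def removal_candidates(
    all_methods: Iterable[str],
    call_graph: Dict[str, List[str]],
    entry_points: Iterable[str],
    profiled: Iterable[str] = (),
) -> Set[str]:
    """Methods that neither analysis reaches are candidates for removal."""
    return set(all_methods) - reachable_methods(call_graph, entry_points, profiled)


# Hypothetical program: Plugin.run is only ever invoked through reflection.
graph = {
    "Main.main": ["Parser.parse"],
    "Parser.parse": ["Lexer.next"],
    "Plugin.run": ["Plugin.helper"],
}
methods = ["Main.main", "Parser.parse", "Lexer.next",
           "Plugin.run", "Plugin.helper", "Legacy.export"]

static_only = removal_candidates(methods, graph, ["Main.main"])
with_profile = removal_candidates(methods, graph, ["Main.main"], ["Plugin.run"])
print(sorted(static_only))   # static analysis alone would also drop the plugin
print(sorted(with_profile))  # only Legacy.export remains a candidate
```

The example shows why static analysis alone can be unsafe for debloating: the reflective target would be removed unless the profile keeps it alive.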

Data Scientists in Software Teams: Backgrounds, Activities, Tools, Challenges and Best Practices

I initiated an academia-industry coalition to investigate the emerging role of data scientists. We conducted an in-depth study of the emerging roles of data scientists, followed by a large-scale survey of 793 professional data scientists.
This quantification and sub-categorization of data scientists is important: although many companies are hiring data scientists and universities are creating new graduate programs, we lack a scientific understanding of who data scientists are.

The Emerging Roles of Data Scientists, ICSE 2016
A Large Scale Survey with 793 Data Scientists, TSE 2018

Code Clone Detection, Management, and Removal

Code duplication created by copy and paste is common in large software, and changing software often requires systematic edits: similar but not identical enhancements, refactorings, and bug fixes to many similar methods. We developed novel techniques for example-based program transformation, clone removal, differential testing, and code review.

Clone Transplantation and Differential Testing, ICSE 2017
Interactive Clone Search, ICSE 2015
Automated Clone Removal, ICSE 2015
Learning Transformation from Multiple Examples, ICSE 2013
Generating Transformation from a Single Example, PLDI 2011

The following techniques find copy-and-paste bugs and reconstruct clone evolution.

Copy and Paste Error Detection, ASE 2013
Omission Errors, MSR 2012, ASE 2014
Cross System Porting Analysis, FSE 2012
Long-Lived Clones, FASE 2011
Clone Genealogies, FSE 2005, Nominated for ACM SIGSOFT Distinguished Paper Award
Copy and Paste Programming Practices, ISESE 2004

Refactoring Automation, Inspection, Testing, and Studies

Refactoring is a technique for cleaning up legacy code to ease bug fixes or feature additions. To create a scientific foundation for refactoring, we quantified the impact of a multi-year Windows re-architecting effort: we analyzed version history data, conducted a survey of over 300 developers, and interviewed the architects and development leads to assess the impact of refactoring on size, churn, complexity, test coverage, failure, and organization metrics.

A Field Study of Refactoring at Microsoft, FSE 2012, TSE 2014

API Refactoring and Bug Fixes, ICSE 2011, Nominated for ACM SIGSOFT Distinguished Paper Award

API Stability and Adoption, ICSM 2013, Most Influential Paper Award from ICSME 2013

The following techniques find refactoring bugs.

Refactoring Error Detection, TSE 2017

Refactoring Change Impact Analysis, ICSM 2012

Refactoring API Usage Adaptation, OOPSLA 2010

Modularity Violations, ICSE 2011

Logical Program Differencing

We invented a suite of analysis tools that can help programmers investigate code modifications. We also developed RefFinder, a logic-query approach to refactoring reconstruction. Our insight was that the skeleton of refactoring edits can be expressed as a logical constraint.

LSDiff: Rule-based Program Differencing, TSE 2013, ICSE 2009, ICSE 2007
RefFinder: Refactoring Reconstruction, ICSM 2010, Most Influential Paper Award from ICSME 2010
VDiff: Differencing for Verilog HDL, ASE 2010, ACM SIGSOFT Distinguished Paper Award
FaultTracer: Change Impact Analysis, ICSM 2011
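
The idea that the skeleton of a refactoring can be written as a logical constraint can be sketched as follows. The rule, the fact representation, and the similarity threshold are illustrative assumptions, not RefFinder's actual rule set: a method is reported as moved when a method with the same name was deleted from one class, added to a different class, and the two bodies are similar.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, NamedTuple, Tuple


class Method(NamedTuple):
    name: str   # method name
    owner: str  # declaring class
    body: str   # source text of the body


def similar_body(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity predicate based on a text similarity ratio."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def infer_move_method(
    old: Iterable[Method], new: Iterable[Method]
) -> List[Tuple[str, str, str]]:
    """Evaluate the rule:

    deleted_method(m, C1) AND added_method(m, C2) AND C1 != C2
        AND similar_body(m in C1, m in C2)  =>  move_method(m, C1, C2)
    """
    old, new = list(old), list(new)
    old_keys = {(m.name, m.owner) for m in old}
    new_keys = {(m.name, m.owner) for m in new}
    deleted = [m for m in old if (m.name, m.owner) not in new_keys]
    added = [m for m in new if (m.name, m.owner) not in old_keys]

    moves = []
    for d in deleted:
        for a in added:
            if d.name == a.name and d.owner != a.owner and similar_body(d.body, a.body):
                moves.append((d.name, d.owner, a.owner))
    return moves


# Hypothetical program versions: total() moves from Order to Invoice.
old_version = [
    Method("total", "Order", "return sum(item.price for item in self.items)"),
    Method("ship", "Order", "self.carrier.dispatch(self)"),
]
new_version = [
    Method("total", "Invoice", "return sum(item.price for item in self.items)"),
    Method("ship", "Order", "self.carrier.dispatch(self)"),
]
print(infer_move_method(old_version, new_version))
```

Expressing each refactoring type as a rule over extracted facts keeps the detection logic declarative: supporting a new refactoring type means adding a rule rather than writing a new differencing algorithm.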