The rapid rise of big data and AI created a critical disconnect in the technology sector: without a shared, recognized model for integrating data scientists into traditional software engineering teams, technology companies could not fully capitalize on their data assets or advance their AI capabilities. Recognizing this data-driven industry transformation as early as 2014, Professor Kim led research into the emerging roles of data scientists in software teams (Kim et al. 2018, Kim et al. 2016). Her large-scale study at Microsoft provided the first in-depth look at how data-focused roles were reshaping software development. The findings directly influenced Microsoft's creation of a formal "data scientist" career path—a job category that other technology companies subsequently adopted. This formalization sparked a parallel transformation in academia, contributing to the rapid growth of data science and artificial intelligence (AI) degree programs. Professor Kim’s work not only shaped the industry's workforce but also catalyzed subsequent research into how AI and machine learning are changing software engineering practices.
As AI and big data have become integral to modern software systems, ensuring the reliability of big data software has grown urgent. Professor Kim identified a core problem: traditional debugging and testing techniques are not equipped to handle the scale and complexity of big data when software runs on top of large-scale, distributed data processing platforms (e.g., Apache Spark). This gap left developers and data scientists relying on inefficient, trial-and-error debugging, introducing reliability risks and slowing the adoption of AI technologies. Over the past decade, Professor Kim has led a research agenda that rethinks how automated debugging and testing techniques should be redesigned for big data systems. She articulated the software development challenges posed by high data and compute resource requirements, and her team at UCLA then designed a series of automated debugging and testing techniques that can handle the scale of big data (Kim 2023). This effort not only fills a longstanding tooling gap but also establishes engineering foundations for reliable big data software—advancing software engineering in the era of AI. This vision is presented in our ASE keynote on "Re-engineering SE for data-centric world" and our IEEE Software article on "SE4DA: Software Engineering for Data Analytics."
Data-intensive computing increasingly relies on specialized hardware, but developing heterogeneous applications requires significant hardware expertise. We therefore aim to lower the entry barrier to leveraging hardware accelerators through enhanced developer tools. This vision is presented in our ISSTA keynote on "Software Tools for Democratizing Heterogeneous Computing" (video).
Similar code fragments cause significant redundant developer effort: the same bug may appear in each copy, and each copy later requires a similar fix. Professor Kim has been a leading force in software analytics—the systematic study of large-scale code repositories to detect and automate recurring developer tasks. As a pioneer in Mining Software Repositories (MSR), she was the first to study copy-and-paste programming practices and to investigate the nature and extent of repetitive code. By tracking similar code across software histories, she challenged the view that copy-and-paste coding is inherently flawed (clone genealogies). Instead, she demonstrated that code repetitiveness reveals patterns that can accelerate coding tasks. Her later work studied large-scale data from GitHub and Stack Overflow (ExampleStack, ExampleCheck), showing that developers frequently make similar changes to similar code across multiple codebases—highlighting a global opportunity to reduce redundancy. This helped establish a new research direction on using historical data to support automation. These insights laid the groundwork for tools that lower the entry barrier to software engineering by making expert-level code patterns accessible. This research effort on code duplication is summarized in our Dagstuhl keynote on "A Journey through Searching Similar Code."
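To give a flavor of the similarity analysis that underpins this line of work, the sketch below flags near-duplicate fragments with a token-set comparison. This is only an illustrative assumption: the published techniques use far more sophisticated, syntax-aware matching, and the tokenizer, threshold, and fragments here are invented for the example.

```python
import re

def tokenize(code):
    """Split source text into identifier and punctuation tokens."""
    return re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code)

def jaccard_similarity(a, b):
    """Jaccard similarity over token sets -- a crude clone signal."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Two copy-pasted fragments with renamed variables still score highly.
fragment1 = "total = price * qty\nif total > limit: warn(total)"
fragment2 = "sum = cost * qty\nif sum > limit: warn(sum)"
print(round(jaccard_similarity(fragment1, fragment2), 2))  # → 0.71
```

A high score between fragments that are not textually identical is exactly the situation where a fix to one copy likely applies to the other.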
The challenge of redundant developer effort is magnified when the same bug fix must be applied across thousands of separate software systems in response to security vulnerabilities or library updates. Professor Kim pioneered two distinct approaches to reduce this repetitive effort. First, she designed an automated program repair method that adapts an existing bug fix to a new location with different code content (Sydit, Lase). Second, she designed an automated bug detection method that finds inconsistent or missed updates by summarizing similar updates into rules and flagging rule violations (LSdiff, Spa). These ideas influenced many subsequent program repair approaches that reuse an existing patch in new contexts (e.g., Kim et al., 2013; Sidiroglou-Douskos et al., 2015; Le et al., 2016).
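To make the rule-mining idea concrete, here is a minimal sketch in the spirit of the second approach: summarize a pairing pattern common to a set of updated sites, then flag sites that violate it. The method names and data are hypothetical, and the real systems (LSdiff, Spa) mine logic rules over structural program facts rather than raw call lists.

```python
from collections import Counter

def mine_pairing_rule(updated_sites, min_support=0.75):
    """Infer rules of the form 'a call to A should be followed by B'
    from updated call sequences, keeping pairs seen in at least
    min_support of the sites."""
    pair_counts = Counter()
    for calls in updated_sites:
        for i, a in enumerate(calls):
            for b in calls[i + 1:]:
                pair_counts[(a, b)] += 1
    n = len(updated_sites)
    return {pair for pair, c in pair_counts.items() if c / n >= min_support}

def find_violations(sites, rules):
    """Flag sites that call A but never call B afterwards."""
    violations = []
    for name, calls in sites.items():
        for a, b in rules:
            if a in calls and b not in calls[calls.index(a) + 1:]:
                violations.append((name, a, b))
    return violations

# Hypothetical data: three updated methods now pair lock() with unlock().
updated = [["lock", "read", "unlock"], ["lock", "write", "unlock"],
           ["lock", "unlock"]]
rules = mine_pairing_rule(updated)
candidates = {"m1": ["lock", "read"], "m2": ["lock", "unlock"]}
print(find_violations(candidates, rules))  # [('m1', 'lock', 'unlock')]
```

The mined rule turns a set of similar edits into a checkable invariant, so a site that missed the update surfaces as a rule violation.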
Refactoring—the practice of improving software’s internal structure—is necessary for reducing long-term maintenance costs in long-lived systems like Microsoft Windows. Despite decades of investment in refactoring, engineering managers lacked a framework to measure the benefits of large-scale refactoring. Professor Kim developed an automated method to reconstruct refactorings from software histories, earning a 10-Year Test-of-Time Award in 2020 (RefFinder). Then, at Microsoft, she led the first systematic analysis of a decade of refactoring effort on the Windows operating system by analyzing its software histories. This provided the first quantitative validation of refactoring's real-world costs and benefits, delivering a new framework for engineering leaders to justify refactoring investments at the world's largest software companies (Kim et al.).
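A toy version of reconstructing one refactoring type from history: a method that disappears in one version while an identically-bodied method appears suggests a rename. The version-as-dictionary representation and method names are simplifying assumptions; tools like RefFinder detect dozens of refactoring types using far richer syntactic and semantic analysis.

```python
def detect_renames(old_version, new_version):
    """Match removed and added methods by identical bodies
    to propose candidate rename refactorings."""
    removed = {n: b for n, b in old_version.items() if n not in new_version}
    added = {n: b for n, b in new_version.items() if n not in old_version}
    renames = []
    for old_name, old_body in removed.items():
        for new_name, new_body in added.items():
            if old_body == new_body:
                renames.append((old_name, new_name))
    return renames

# Hypothetical two versions of a class, keyed by method name.
old = {"calcTotal": "return p * q", "log": "print(msg)"}
new = {"computeTotal": "return p * q", "log": "print(msg)"}
print(detect_renames(old, new))  # [('calcTotal', 'computeTotal')]
```

Applying such detectors across every revision pair in a history is what turns raw commits into a quantifiable record of refactoring effort.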
An important problem is that updates to one software component’s interface can trigger a chain reaction of changes across the thousands of applications that depend on its application programming interface (API) within a software ecosystem. Professor Kim took a dual approach to this systemic challenge. First, in a study of the Android ecosystem (API stability and adoption), she analyzed GitHub data to document the scale of the problem, finding that rapidly evolving APIs average 115 updates per month while client mobile applications often lag behind by more than a year. Second, to address the error-prone nature of these necessary updates, she introduced the first automated approach for migrating API usage across dependent applications (LibSync). The API ecosystem study, which received a 10-Year Test-of-Time Award in 2023, sparked a wave of subsequent research into API stability and cascading changes. Concurrently, the automated migration approach established a new direction for reducing developer burden during large-scale, repetitive updates.
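The migration idea can be sketched as a mapping-driven rewrite of deprecated call sites. The API names and regex-based rewriting below are illustrative assumptions only; LibSync learns adaptation patterns from clients that have already migrated, rather than applying a hand-written table.

```python
import re

# Hypothetical mapping from deprecated API names to their replacements.
API_MIGRATIONS = {
    "getDrawable": "getDrawableCompat",
    "setBackgroundDrawable": "setBackground",
}

def migrate_api_calls(source):
    """Rewrite deprecated call sites according to the migration table."""
    for old, new in API_MIGRATIONS.items():
        source = re.sub(rf"\b{old}\s*\(", new + "(", source)
    return source

client = "view.setBackgroundDrawable(res.getDrawable(id))"
print(migrate_api_calls(client))
# view.setBackground(res.getDrawableCompat(id))
```

Real migrations must also handle changed parameters, reordered arguments, and new control flow, which is why learning adaptations from examples outperforms simple name substitution.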
To address software bloat—which wastes resources and expands attack surfaces—we developed a bytecode debloating framework for software simplification. Our approach augments static analysis with dynamic profiling for modern Java applications, enhancing both efficiency and security.
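A minimal sketch of the static-plus-dynamic idea: compute statically reachable methods, add anything observed during profiling runs, and treat the rest as removable. The method names and call graph are hypothetical, and the actual framework operates on Java bytecode and instruments real executions rather than toy dictionaries.

```python
def reachable(call_graph, entry_points):
    """Static over-approximation: methods reachable from entry points."""
    seen, stack = set(), list(entry_points)
    while stack:
        m = stack.pop()
        if m not in seen:
            seen.add(m)
            stack.extend(call_graph.get(m, []))
    return seen

def debloat(all_methods, call_graph, entry_points, profiled):
    """Keep statically reachable methods plus anything seen at run time
    (e.g., reflection targets that static analysis alone would miss)."""
    keep = reachable(call_graph, entry_points) | set(profiled)
    return sorted(set(all_methods) - keep)  # methods safe to remove

methods = ["main", "init", "parse", "debugDump", "legacyExport"]
graph = {"main": ["init", "parse"], "parse": []}
removed = debloat(methods, graph, ["main"], profiled=["debugDump"])
print(removed)  # ['legacyExport']
```

The dynamic profile guards against over-aggressive removal: `debugDump` is unreachable in the static graph but survives because an execution touched it.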
Our work was supported by the Office of Naval Research, and we are one of only five teams selected for technology transfer to the Navy. More information is available here.