The rapid rise of big data and AI created a critical disconnect in the technology sector: without a shared, recognized model for integrating data scientists into traditional software engineering teams, technology companies could not fully capitalize on their data assets or advance their AI capabilities. Recognizing this data-driven industry transformation as early as 2014, Professor Kim led research into the emerging roles of data scientists in software teams (Kim et al. 2018, Kim et al. 2016). Her large-scale study at Microsoft provided the first in-depth look at how data-focused roles were reshaping software development. The findings directly influenced Microsoft's creation of a formal "data scientist" career path—a job category that other technology companies subsequently adopted. This formalization sparked a parallel transformation in academia, contributing to the rapid growth of data science and artificial intelligence (AI) degree programs. Professor Kim’s work not only shaped the industry's workforce but also catalyzed subsequent research into how AI and machine learning are changing software engineering practices.
As AI and big data have become integral to modern software systems, ensuring the reliability of big data software has grown urgent. Professor Kim identified a core problem: traditional debugging and testing techniques are not equipped to handle the scale and complexity of big data when software runs on top of large-scale, distributed data processing platforms (e.g., Apache Spark). This gap left developers and data scientists relying on inefficient, trial-and-error debugging, introducing reliability risks and slowing the adoption of AI technologies. Over the past decade, Professor Kim has led a research agenda that rethinks how automated debugging and testing techniques should be redesigned for big data systems. She articulated the software development challenges posed by high data and compute resource requirements, and her team at UCLA then designed a series of automated debugging and testing techniques that can handle the scale of big data (Kim 2023). This effort not only fills a longstanding tooling gap but also establishes engineering foundations for reliable big data software—advancing software engineering in the era of AI. This vision is presented in our ASE keynote on "Re-engineering SE for data-centric world" and our IEEE Software article on "SE4DA: Software Engineering for Data Analytics."
Data-intensive computing increasingly relies on specialized hardware, but developing heterogeneous applications requires significant hardware expertise. We therefore aim to lower the entry barrier to leveraging hardware accelerators through enhanced developer tools. This vision is presented in our ISSTA keynote on "Software Tools for Democratizing Heterogeneous Computing" (video).
Similar code fragments cause significant redundant developer effort: the same bug may appear in each copy, and each copy later requires a similar fix. Professor Kim has been a leading force in software analytics—the systematic study of large-scale code repositories to detect and automate recurring developer tasks. As a pioneer in Mining Software Repositories (MSR), she was the first to study copy-and-paste programming practices and to investigate the nature and extent of repetitive code. By tracking similar code across software histories, she challenged the view that copy-and-paste coding is inherently flawed (clone genealogies). Instead, she demonstrated that code repetitiveness reveals patterns that can accelerate coding tasks. Her later work studied large-scale data from GitHub and Stack Overflow (ExampleStack, ExampleCheck), showing that developers frequently make similar changes to similar code across multiple codebases—highlighting a global opportunity to reduce redundancy. This helped establish a new research direction on using historical data to support automation. These insights laid the groundwork for tools that lower the entry barrier to software engineering by making expert-level code patterns accessible. This research effort on code duplication is summarized in our Dagstuhl keynote on "A Journey through Searching Similar Code."
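To give a flavor of the similarity analysis that underpins this line of work, the sketch below flags near-duplicate fragments with a token-set comparison. This is only an illustrative assumption: the published techniques use far more sophisticated, syntax-aware matching, and the tokenizer, threshold, and fragments here are invented for the example.

```python
import re

def tokenize(code):
    """Split source text into identifier and punctuation tokens."""
    return re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code)

def jaccard_similarity(a, b):
    """Jaccard similarity over token sets -- a crude clone signal."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Two copy-pasted fragments with renamed variables still score highly.
fragment1 = "total = price * qty\nif total > limit: warn(total)"
fragment2 = "sum = cost * qty\nif sum > limit: warn(sum)"
print(round(jaccard_similarity(fragment1, fragment2), 2))  # → 0.71
```

A high score between fragments that are not textually identical is exactly the situation where a fix to one copy likely applies to the other.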
The challenge of redundant developer effort is magnified when the same bug fix must be applied across thousands of separate software systems in response to security vulnerabilities or library updates. Professor Kim pioneered two distinct approaches to reduce this repetitive effort. First, she designed an automated program repair method that adapts an existing bug fix to a new location with different code content (Sydit, Lase). Second, she designed an automated bug detection method that finds inconsistent or missed updates by summarizing similar updates into rules and flagging rule violations (LSdiff, Spa). These ideas influenced many subsequent program repair approaches that reuse an existing patch in new contexts (e.g., Kim et al., 2013; Sidiroglou-Douskos et al., 2015; Le et al., 2016).
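To make the rule-mining idea concrete, here is a minimal sketch in the spirit of the second approach: summarize a pairing pattern common to a set of updated sites, then flag sites that violate it. The method names and data are hypothetical, and the real systems (LSdiff, Spa) mine logic rules over structural program facts rather than raw call lists.

```python
from collections import Counter

def mine_pairing_rule(updated_sites, min_support=0.75):
    """Infer rules of the form 'a call to A should be followed by B'
    from updated call sequences, keeping pairs seen in at least
    min_support of the sites."""
    pair_counts = Counter()
    for calls in updated_sites:
        for i, a in enumerate(calls):
            for b in calls[i + 1:]:
                pair_counts[(a, b)] += 1
    n = len(updated_sites)
    return {pair for pair, c in pair_counts.items() if c / n >= min_support}

def find_violations(sites, rules):
    """Flag sites that call A but never call B afterwards."""
    violations = []
    for name, calls in sites.items():
        for a, b in rules:
            if a in calls and b not in calls[calls.index(a) + 1:]:
                violations.append((name, a, b))
    return violations

# Hypothetical data: three updated methods now pair lock() with unlock().
updated = [["lock", "read", "unlock"], ["lock", "write", "unlock"],
           ["lock", "unlock"]]
rules = mine_pairing_rule(updated)
candidates = {"m1": ["lock", "read"], "m2": ["lock", "unlock"]}
print(find_violations(candidates, rules))  # [('m1', 'lock', 'unlock')]
```

The mined rule turns a set of similar edits into a checkable invariant, so a site that missed the update surfaces as a rule violation.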
Refactoring—the practice of improving software’s internal structure—is necessary for reducing long-term maintenance costs in long-lived systems like Microsoft Windows. Despite decades of investment in refactoring, engineering managers lacked a framework to measure the benefits of large-scale refactoring. Professor Kim developed an automated method to reconstruct refactorings from software histories, earning a 10-Year Test-of-Time Award in 2020 (RefFinder). Then, at Microsoft, she led the first systematic analysis of a decade of refactoring effort on the Windows operating system by analyzing its software histories. This provided the first quantitative validation of refactoring's real-world costs and benefits, delivering a new framework for engineering leaders to justify refactoring investments at the world's largest software companies (Kim et al.).
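A toy version of reconstructing one refactoring type from history: a method that disappears in one version while an identically-bodied method appears suggests a rename. The version-as-dictionary representation and method names are simplifying assumptions; tools like RefFinder detect dozens of refactoring types using far richer syntactic and semantic analysis.

```python
def detect_renames(old_version, new_version):
    """Match removed and added methods by identical bodies
    to propose candidate rename refactorings."""
    removed = {n: b for n, b in old_version.items() if n not in new_version}
    added = {n: b for n, b in new_version.items() if n not in old_version}
    renames = []
    for old_name, old_body in removed.items():
        for new_name, new_body in added.items():
            if old_body == new_body:
                renames.append((old_name, new_name))
    return renames

# Hypothetical two versions of a class, keyed by method name.
old = {"calcTotal": "return p * q", "log": "print(msg)"}
new = {"computeTotal": "return p * q", "log": "print(msg)"}
print(detect_renames(old, new))  # [('calcTotal', 'computeTotal')]
```

Applying such detectors across every revision pair in a history is what turns raw commits into a quantifiable record of refactoring effort.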
An important problem is that updates to one software component’s interface can trigger a chain reaction of changes across the thousands of applications that depend on its application programming interface (API) within a software ecosystem. Professor Kim took a dual approach to this systemic challenge. First, in a study of the Android ecosystem (API stability and adoption), she analyzed GitHub data to document the scale of the problem, finding that rapidly evolving APIs average 115 updates per month while client mobile applications often lag behind by more than a year. Second, to address the error-prone nature of these necessary updates, she introduced the first automated approach for migrating API usage across dependent applications (LibSync). The API ecosystem study, which received a 10-Year Test-of-Time Award in 2023, sparked a wave of subsequent research into API stability and cascading changes. Concurrently, the automated migration approach established a new direction for reducing developer burden during large-scale, repetitive updates.
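The migration idea can be sketched as a mapping-driven rewrite of deprecated call sites. The API names and regex-based rewriting below are illustrative assumptions only; LibSync learns adaptation patterns from clients that have already migrated, rather than applying a hand-written table.

```python
import re

# Hypothetical mapping from deprecated API names to their replacements.
API_MIGRATIONS = {
    "getDrawable": "getDrawableCompat",
    "setBackgroundDrawable": "setBackground",
}

def migrate_api_calls(source):
    """Rewrite deprecated call sites according to the migration table."""
    for old, new in API_MIGRATIONS.items():
        source = re.sub(rf"\b{old}\s*\(", new + "(", source)
    return source

client = "view.setBackgroundDrawable(res.getDrawable(id))"
print(migrate_api_calls(client))
# view.setBackground(res.getDrawableCompat(id))
```

Real migrations must also handle changed parameters, reordered arguments, and new control flow, which is why learning adaptations from examples outperforms simple name substitution.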
To address software bloat—which wastes resources and expands attack surfaces—we developed a bytecode debloating framework for software simplification. Our approach augments static analysis with dynamic profiling for modern Java applications, enhancing both efficiency and security.
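A minimal sketch of the static-plus-dynamic idea: compute statically reachable methods, add anything observed during profiling runs, and treat the rest as removable. The method names and call graph are hypothetical, and the actual framework operates on Java bytecode and instruments real executions rather than toy dictionaries.

```python
def reachable(call_graph, entry_points):
    """Static over-approximation: methods reachable from entry points."""
    seen, stack = set(), list(entry_points)
    while stack:
        m = stack.pop()
        if m not in seen:
            seen.add(m)
            stack.extend(call_graph.get(m, []))
    return seen

def debloat(all_methods, call_graph, entry_points, profiled):
    """Keep statically reachable methods plus anything seen at run time
    (e.g., reflection targets that static analysis alone would miss)."""
    keep = reachable(call_graph, entry_points) | set(profiled)
    return sorted(set(all_methods) - keep)  # methods safe to remove

methods = ["main", "init", "parse", "debugDump", "legacyExport"]
graph = {"main": ["init", "parse"], "parse": []}
removed = debloat(methods, graph, ["main"], profiled=["debugDump"])
print(removed)  # ['legacyExport']
```

The dynamic profile guards against over-aggressive removal: `debugDump` is unreachable in the static graph but survives because an execution touched it.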
Our work was supported by the Office of Naval Research, and we are one of only five teams selected for technology transfer to the Navy. More information is available here.