As AI and big data become central to modern software, traditional debugging and testing have proven inadequate for the scale and complexity of distributed platforms like Apache Spark. This gap has forced developers into inefficient trial-and-error processes, compromising reliability and slowing AI adoption. Over the past decade, Professor Kim has led a research agenda to redesign automated debugging and testing for big data systems (Kim 2023). This vision is presented in our ASE keynote on "Re-engineering SE for data-centric world" and in our IEEE Software article on "SE4DA: Software Engineering for Data Analytics."
Data-intensive computing increasingly relies on specialized hardware, but developing heterogeneous applications requires significant hardware expertise. We therefore aim to lower the entry barrier to hardware accelerators by building enhanced developer tools. This vision is presented in our ISSTA keynote on "Software Tools for Democratizing Heterogeneous Computing" (video).
Recognizing the data-driven transformation of industry as early as 2014, we led research into the emerging roles of data scientists in software teams and identified nine distinct clusters of data scientists. Our studies provided insight into how data-focused roles were reshaping software engineering, informing the establishment of a "data scientist" career path at Microsoft. This job category was soon adopted across the technology industry, catalyzing a parallel transformation in higher education and the expansion of Data Science and Artificial Intelligence (AI) majors.
To reduce redundant effort, we study repetitive code and clone genealogies in large-scale repositories. Our research challenged the view that copy-and-paste programming is detrimental, demonstrating instead that recurring code patterns can accelerate development tasks. We then showed how recurring patterns mined from large-scale repositories can be analyzed to automate bug fixes, refactoring, and API updates, insights that now inform modern AI-driven developer tools. This journey is summarized in our Dagstuhl Keynote, "A Journey through Searching Similar Code."
To address redundant developer effort, we developed two strategies for automating software maintenance. First, we designed program repair methods (Sydit and Lase) that adapt existing bug fixes to new, differing code contexts. Second, we developed bug detection techniques (LSdiff and Spa) that summarize groups of similar updates into rules and flag changes inconsistent with those rules. This research provided a foundation for subsequent work in context-aware patch reuse and automated repair (e.g., Kim et al. 2013; Le et al. 2016).
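The rule-based idea behind the second strategy can be illustrated with a toy sketch (hypothetical, far simpler than LSdiff's actual analysis): summarize a group of similar edits into a rule such as "methods that call open() now also call close()," then flag methods that satisfy the rule's precondition but miss the systematic update.

```python
# Toy sketch of rule-based inconsistent-update detection (illustrative
# only, not the LSdiff algorithm). Each edit is a pair of fact sets
# describing a changed method before and after the update.

def mine_rule(edited):
    """Infer a rule from a group of similar edits: the shared
    precondition facts and the facts the edits added."""
    pre = set.intersection(*(set(before) for before, _ in edited))
    post = set.intersection(*(set(after) for _, after in edited)) - pre
    return pre, post

def flag_inconsistent(all_methods, pre, post):
    """Return methods matching the precondition but missing the
    systematic update applied everywhere else."""
    return [name for name, facts in all_methods.items()
            if pre <= set(facts) and not post <= set(facts)]

edits = [({"calls open"}, {"calls open", "calls close"}),
         ({"calls open", "logs"}, {"calls open", "logs", "calls close"})]
pre, post = mine_rule(edits)          # pre={'calls open'}, post={'calls close'}
codebase = {"read_cfg": ["calls open"],                  # missed the update
            "read_log": ["calls open", "calls close"]}   # already updated
print(flag_inconsistent(codebase, pre, post))  # ['read_cfg']
```

The method names and fact strings here are invented for illustration; the real technique infers such rules from program differences in version histories.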
To quantify the impact of refactoring on maintenance, we developed RefFinder, an automated technique that reconstructs refactorings from version histories; this work received a 10-Year Test-of-Time Award in 2020. At Microsoft, we studied a decade of refactoring effort on the Windows operating system. This study provided the first quantitative validation of refactoring's costs and benefits, giving engineering leaders a framework to justify refactoring investments.
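The core idea of reconstructing refactorings from history can be sketched in miniature (a hypothetical simplification; RefFinder itself matches a much richer catalog of refactoring templates): a method deleted in one version whose body reappears under a new name in the next version is a rename candidate.

```python
# Minimal sketch of refactoring reconstruction from two versions of a
# codebase (illustrative only, far simpler than RefFinder). Detects
# "rename method" by matching deleted and added methods with equal bodies.

def detect_renames(old, new):
    """old/new: dicts mapping method name -> body text."""
    deleted = {name: body for name, body in old.items() if name not in new}
    added = {name: body for name, body in new.items() if name not in old}
    renames = []
    for old_name, body in deleted.items():
        for new_name, new_body in added.items():
            if body == new_body:   # identical body => rename candidate
                renames.append((old_name, new_name))
    return renames

v1 = {"calc": "return a + b", "log": "print(x)"}
v2 = {"add": "return a + b", "log": "print(x)"}
print(detect_renames(v1, v2))  # [('calc', 'add')]
```

Real histories require fuzzier matching, since a rename often lands in the same commit as edits to the body.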
API updates often trigger a chain reaction across software ecosystems, a challenge we addressed through a dual-pronged approach. Our study of Android API evolution documented the severity of this lag: while APIs evolve rapidly, dependent applications often fall more than a year behind. We introduced LibSync, the first automated approach for migrating API usage across dependent applications. This API ecosystem study received a Test-of-Time Award in 2023 and sparked a new wave of subsequent research into API stability and adoption.
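The flavor of example-based API migration can be conveyed with a toy sketch (hypothetical, not LibSync, which learns structured edit scripts from real adaptation examples): learn a call mapping from one client that has already migrated, then apply it to another client that has not.

```python
# Toy sketch of example-based API migration (illustrative only). The
# API names below are invented. A real tool would operate on ASTs,
# not on source text with regular expressions.
import re

def learn_mapping(before, after):
    """Pair up qualified calls that changed between two versions of an
    already-migrated client."""
    calls = lambda src: re.findall(r"\w+(?:\.\w+)+\(", src)
    return {o: n for o, n in zip(calls(before), calls(after)) if o != n}

def migrate(client_src, mapping):
    """Rewrite a not-yet-migrated client using the learned mapping."""
    for old_call, new_call in mapping.items():
        client_src = client_src.replace(old_call, new_call)
    return client_src

example_before = "mgr.getService(name)"       # pre-migration client
example_after = "ctx.getSystemService(name)"  # same client, migrated
mapping = learn_mapping(example_before, example_after)
print(migrate("svc = mgr.getService(tag)", mapping))
# svc = ctx.getSystemService(tag)
```

Text replacement is enough for this one-call example; handling argument reordering or split calls is what makes the real problem hard.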
To address software bloat, which consumes resources and expands attack surfaces, we developed a bytecode debloating framework for software simplification. Our approach augments static analysis with dynamic profiling for modern Java, enhancing both efficiency and security. This work was supported by the Office of Naval Research, and we were one of only five teams selected for technology transfer to the Navy.
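The way dynamic profiling complements static analysis in debloating can be sketched conceptually (a hypothetical toy; real bytecode debloaters operate on JVM class files): methods reached by neither static call-graph analysis nor observed executions become removal candidates, while profiling preserves methods that static analysis misses, such as those invoked via reflection.

```python
# Conceptual sketch of profile-guided debloating (illustrative only;
# method names are invented). Keep the union of statically reachable
# and dynamically observed methods; everything else is a candidate cut.

def debloat_candidates(all_methods, static_reachable, profiled):
    keep = static_reachable | profiled  # profiling recovers reflective calls
    return {m for m in all_methods if m not in keep}

methods = {"main", "parse", "render", "legacy_export", "debug_dump"}
static = {"main", "parse", "render"}          # from static call graph
dynamic = {"main", "parse", "debug_dump"}     # observed at run time
print(sorted(debloat_candidates(methods, static, dynamic)))
# ['legacy_export']
```

Taking the union errs toward safety: a method is cut only when both analyses agree it is unused.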