Automated Debugging in Data-Intensive Scalable Computing

Errors in big data analytics are hard to diagnose. An error may stem from a bug in program logic, from a wrong assumption about the input data, or from anomalies in that data. To localize failure-inducing inputs in data workflows precisely and automatically, we built BigSift. BigSift's underlying algorithm combines data provenance and delta debugging to effectively and efficiently pinpoint the root cause of errors in large-scale distributed data processing. Its optimizations leverage in-memory data processing, predicate pushdown, and resource-aware job scheduling to reduce fault localization time by several orders of magnitude.
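To make the delta debugging half of this idea concrete, here is a minimal, single-machine sketch of the classic ddmin procedure in Python. It is not BigSift's implementation: it leaves out data provenance, Spark, and all of the optimizations mentioned above, and the names `ddmin` and `fails` are hypothetical, chosen only for this illustration. The `fails` callback is assumed to rerun the analysis on a subset of the input records and report whether the failure still occurs.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `records` to a 1-minimal subset on which `fails` is still True.

    Illustrative sketch only: `fails` is a hypothetical test oracle supplied
    by the caller, not part of any BigSift API.
    """
    current = list(records)
    if not fails(current):
        raise ValueError("the full input must reproduce the failure")

    n = 2  # current granularity: target number of chunks
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False

        # Step 1: does a single chunk reproduce the failure on its own?
        for subset in subsets:
            if fails(subset):
                current, n, reduced = subset, 2, True
                break

        # Step 2: otherwise, does removing a single chunk keep the failure?
        if not reduced:
            for i in range(len(subsets)):
                complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                if fails(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # Step 3: otherwise, split more finely, or stop at single records.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)

    return current


# One faulty record: a negative value makes the (hypothetical) job fail.
single_fault = ddmin([3, 8, -1, 5, 9, 2, 7, 4], lambda s: any(x < 0 for x in s))

# Two records that only fail in combination.
pair_fault = ddmin(list(range(100)), lambda s: 13 in s and 77 in s)
```

Each call to `fails` stands in for a full rerun of the job on a subset of the input, which is the expensive step on large data sets. Provenance can shrink the starting set before the search begins.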

BigDebug: Interactive Debugging for Big Data Analytics in Apache Spark

An abundance of data in science, engineering, national security, and health care has led to the emerging field of big data analytics. To process massive quantities of data, developers leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google's MapReduce, Apache Hadoop, and Apache Spark. The vision of BigDebug is to provide interactive, real-time debugging primitives for big data processing programs in modern DISC systems like Apache Spark. Designing BigDebug requires rethinking the traditional step-through debugging primitives provided by tools such as gdb. The goal of the BigDebug project is to meet the requirements of low overhead, scalability, and fine granularity, while providing expressive debugging primitives for big data cloud computing. [Project Site]

A Classification Based Framework to Predict Viral Threads

An effective social media strategy requires companies to mine large volumes of structured, unstructured, and semi-structured online textual data in order to gain insights into the underlying traits of consumers and prevailing public opinion. We built and evaluated a classification-based framework that predicts thread lengths in online discussion forums in order to identify topics that may be of interest to a particular online community. Our predictions on the Health 2.0 dataset are corroborated by data provided by the FDA and, in fact, identified anomalies earlier than the FDA had documented them.

OCCAM: Object Culling and Concretization for Assurance Maximization

Feature-intensive applications with large code bases can provide functionality to a wide range of users, each with their own specific requirements. I contributed to a tool chain, built on LLVM, that specializes programs to particular specifications and configurations in order to gain performance and security benefits, including improved cache performance, optimized storage space, and a reduced attack surface.

BGP is high on SDN

BGP convergence is notorious for being unpredictable and unbounded in time. We present a novel approach in which BGP state-change messages are disseminated from multiple Autonomous Systems, which multiplies the BGP propagation effort. We used OpenFlow-capable routers distributed across the Internet to automate this process. Our approach also helps to mitigate IP prefix hijacks and Multiple Origin Autonomous System (MOAS) conflicts. Experimental results showed that this scheme can significantly accelerate BGP propagation and alleviate BGP security issues.
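For readers unfamiliar with MOAS conflicts, the following Python sketch shows how one can be spotted in a set of route announcements: a prefix is flagged when it is announced with more than one origin AS, where the origin is the last AS on the path. This is only an illustration of the definition, not the scheme described above. The function name `find_moas_conflicts` and the input format are assumptions made for this example, it matches prefixes exactly (so it would miss sub-prefix hijacks), and it cannot tell a hijack from legitimate multi-origin announcements. The prefixes and AS numbers are taken from ranges reserved for documentation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical input format: (prefix, AS path), with the origin AS last.
Announcement = Tuple[str, List[int]]


def find_moas_conflicts(announcements: Iterable[Announcement]) -> Dict[str, Set[int]]:
    """Return each prefix announced by more than one origin AS."""
    origins: Dict[str, Set[int]] = defaultdict(set)
    for prefix, as_path in announcements:
        if as_path:  # skip malformed announcements with an empty path
            origins[prefix].add(as_path[-1])
    return {prefix: asns for prefix, asns in origins.items() if len(asns) > 1}


updates = [
    ("192.0.2.0/24", [64500, 64501, 64496]),
    ("192.0.2.0/24", [64502, 64497]),        # same prefix, different origin
    ("198.51.100.0/24", [64500, 64498]),
    ("198.51.100.0/24", [64503, 64498]),     # different path, same origin
]

conflicts = find_moas_conflicts(updates)
```

Here only `192.0.2.0/24` is flagged, because `198.51.100.0/24` reaches the same origin AS over two different paths.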