Efficient and Robust Machine Learning from Massive Datasets
The goal of our research is to develop the next generation of resource-efficient and reliable techniques to learn from massive datasets. To this end, the challenges we focus on include:
- Identifying representative data points from massive datasets;
- Developing data-efficient machine learning methods to reduce the substantial costs of learning from big data;
- Designing reliable and safe machine learning algorithms with rigorous guarantees for safety-critical systems;
- Designing better neural network architectures to learn more effectively from large datasets.
At the core of our research lie theoretically rigorous techniques that provide strong guarantees on the accuracy and robustness of the learned models. We are particularly excited about applications of our techniques in domains such as medicine and healthcare, the Internet of Things (IoT), self-driving cars, financial services, and urban planning, where a tremendous amount of data is generated every second and demands fast and accurate analysis.
1. Identifying Representative Elements from Massive Datasets
Over the last several years, we have witnessed the emergence of datasets of unprecedented scale across scientific disciplines. The sheer volume of such datasets presents new computational challenges, as the diverse, feature-rich, unstructured, and often high-resolution data does not lend itself to effective data-intensive inference.
In this regard, data summarization is a compelling (and sometimes the only) approach that both exploits the richness of large-scale data and remains computationally tractable. Rather than operating on the large, complex data directly, analytics tasks can run on carefully constructed summaries, which not only enables their execution but also improves their efficiency and scalability.
To this end, we develop practical distributed and streaming summarization methods with strong theoretical guarantees to extract representative elements from massive datasets. Our research broke new ground by developing truly scalable methods to extract the most informative elements for machine learning. For example, our techniques enabled us to tackle the largest exemplar clustering problem considered in the literature, on 80 million images, and to extract real-time video summaries 1,700x faster than existing methods.
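As a concrete, minimal illustration of exemplar-based summarization (a sketch of the general idea only, not our distributed or streaming algorithms), the snippet below greedily maximizes a facility-location objective to pick k representative points. The Gaussian similarity kernel, the bandwidth sigma, and the budget k are illustrative placeholders.

```python
import numpy as np

def greedy_summary(X, k, sigma=1.0):
    """Pick k exemplars from X (n x d) by greedily maximizing the
    facility-location objective F(S) = sum_i max_{j in S} sim(i, j)."""
    # Pairwise Gaussian similarities (an illustrative choice of kernel).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-sq_dists / (2 * sigma ** 2))

    n = X.shape[0]
    selected = []
    covered = np.zeros(n)  # best similarity of each point to the current summary
    for _ in range(k):
        candidates = [c for c in range(n) if c not in selected]
        # Marginal gain of adding c: how much the total coverage improves.
        gains = [np.maximum(covered, sim[c]).sum() - covered.sum() for c in candidates]
        best = candidates[int(np.argmax(gains))]
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected

# Toy usage: summarize 200 random 2-D points with 5 exemplars.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(greedy_summary(X, k=5))
```

On real datasets, the quadratic similarity matrix and the naive gain recomputation above are infeasible; scalable implementations rely on lazy evaluations and on distributed or streaming variants of this greedy step.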


2. Data-efficient Machine Learning
Large datasets have been crucial to the success of modern machine learning models in recent years, and such models keep improving as the size of the training data grows. However, training modern machine learning models on massive datasets requires exceptionally large and expensive computational resources, and incurs a substantial environmental cost due to significant energy consumption.
To reduce the substantial costs of learning from massive data, we develop data-efficient methods that find subsets on which models generalize on par with those trained on the full data. More precisely, we focus on extracting subsets that are representative for training machine learning models, and on developing theoretically rigorous optimization techniques to learn effectively from the extracted subsets.
Our methods speed up various stochastic gradient methods, including SGD, SAGA, SVRG, Adam, Adagrad, and NAG, by up to 10x for convex loss functions and up to 3x for non-convex ones (e.g., training deep networks), while achieving better generalization performance.
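To make the notion of a training-representative subset concrete, the sketch below picks a small set of examples whose weighted gradient sum approximates the full-data gradient. The matching-pursuit-style greedy selection and the least-squares weight refit are illustrative stand-ins rather than our published algorithm, and the per-example gradients are assumed to be precomputed.

```python
import numpy as np

def select_gradient_coreset(G, k):
    """Greedily pick k examples (with weights) whose weighted gradient sum
    approximates the full-data gradient sum. G: (n, d) per-example gradients."""
    target = G.sum(axis=0)            # full gradient we want to match
    selected, residual, w = [], target.copy(), None
    for _ in range(k):
        # Pick the example whose gradient best aligns with the current residual.
        scores = G @ residual
        scores[selected] = -np.inf
        selected.append(int(np.argmax(scores)))
        # Refit weights on the selected set (plain least squares for simplicity).
        A = G[selected].T             # d x |S|
        w, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ w
    return selected, w

# Toy usage: 1,000 synthetic per-example gradients in 20 dims, keep 10.
rng = np.random.default_rng(0)
G = rng.normal(size=(1000, 20))
idx, w = select_gradient_coreset(G, k=10)
print(len(idx), np.linalg.norm(G.sum(axis=0) - G[idx].T @ w))
```

Training on such a weighted subset then proceeds with any of the stochastic gradient methods above, periodically reselecting the subset as the model, and hence the gradients, change.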

3. Reliable and Safe Machine Learning
The functional safety of many intelligent systems, such as autonomous robots and self-driving cars, depends largely on the robustness of the underlying machine learning model. These systems are expected to operate flawlessly, and hence need to be trained on examples covering all possible operating conditions.
The great success of deep neural networks relies heavily on the quality of training data. However, maintaining data quality becomes very expensive at scale, and mislabeled and corrupted data are therefore ubiquitous in large real-world datasets. Furthermore, adversarial examples can be intentionally designed to cause the model to make mistakes. Because deep neural networks have the capacity to essentially memorize the training data, noisy labels and adversarial attacks have a drastic effect on their generalization performance.
We develop scalable and robust frameworks that provide guarantees for the quality of inference under noisy data, noisy labels, and adversarial attacks. Our methods are particularly useful for safety-critical systems with very high accuracy requirements.
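As a simple illustration of one standard building block for learning with noisy labels (shown for intuition only, not as our specific framework), the sketch below trains a logistic-regression model with small-loss sample selection: in each epoch, only the fraction of examples with the lowest current loss, which are more likely to be correctly labeled, contribute to the gradient update. The keep ratio, learning rate, and toy data are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_small_loss(X, y, keep_ratio=0.7, lr=0.1, epochs=200):
    """Logistic regression updated only on the keep_ratio fraction of
    examples with the smallest current loss (likely clean labels)."""
    n, d = X.shape
    w = np.zeros(d)
    keep = int(keep_ratio * n)
    for _ in range(epochs):
        p = sigmoid(X @ w)
        losses = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        clean_idx = np.argsort(losses)[:keep]        # small-loss examples
        grad = X[clean_idx].T @ (p[clean_idx] - y[clean_idx]) / keep
        w -= lr * grad
    return w

# Toy usage: linearly separable data with 20% of the labels flipped.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
y_noisy = np.where(rng.random(500) < 0.2, 1 - y, y)
w = train_small_loss(X, y_noisy)
print("accuracy on clean labels:", ((sigmoid(X @ w) > 0.5) == (y > 0.5)).mean())
```

Selection heuristics of this kind are only one ingredient; certifying the quality of the resulting model under noise and adversarial perturbations is what requires the rigorous guarantees discussed above.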

4. Designing Better Neural Network Architectures
The new generation of large neural networks with billions of parameters has achieved notable gains in accuracy across many tasks. However, these accuracy improvements depend on substantially large computational resources that incur significant financial and environmental costs. At the same time, compact networks with fast training and inference times are desirable in many real-world applications, e.g., robotics, self-driving cars, and augmented reality.
To reduce the cost of training large neural networks and to accelerate training and inference, we develop methods that effectively reduce over-parameterized neural networks to smaller networks whose test accuracy is comparable to that of the original network. Training compressed networks with optimized architectures requires orders of magnitude fewer computational resources, and enables efficient inference in a variety of real-world applications.
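As a generic illustration of shrinking an over-parameterized network, the sketch below applies global magnitude pruning to a small PyTorch MLP; this is a standard baseline technique rather than our architecture-optimization method, and the model size and 90% sparsity level are arbitrary placeholders. In practice, the pruned network is fine-tuned afterwards to recover accuracy.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small over-parameterized MLP as a stand-in for a large network.
model = nn.Sequential(
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

# Globally remove 90% of the weights with the smallest magnitude.
params_to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.9,
)

# Fold the pruning masks into the weights permanently.
for module, name in params_to_prune:
    prune.remove(module, name)

# Report the resulting overall sparsity.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"sparsity: {zeros / total:.1%}")
```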
