A CS Ph.D. candidate at UCLA, focusing on design automation.
I am a fifth-year Ph.D. candidate in the Computer Science Department at UCLA, advised by Prof. Jason Cong, and a member of the VLSI Architecture, Synthesis & Technology (VAST) Laboratory. Prior to this, I received my B.S. degree in Electrical Engineering from Sharif University of Technology, with a minor in Computer Science. My research interests lie in parallel/distributed architecture and programming, customized computing, and deep learning. I am involved in research projects that combine customized computing and deep learning so that each benefits from the other.
With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for three reasons: 1) the different dimensions within same-type layers, 2) the different types of convolution layers, especially transposed and dilated convolutions, and 3) CNNs' complex dataflow graphs. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve a 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, up to 15.98× for transposed and 13.42× for dilated convolutions, with a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.
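To illustrate the two layer types the versatile SA targets, here is a minimal sketch in plain Python (1-D, single channel, for brevity): dilated convolution spaces out the kernel taps, while transposed convolution scatters a scaled copy of the kernel from each input element. This is only an illustration of the operators, not of the systolic-array mapping in the paper.

```python
# Illustrative only: 1-D dilated and transposed convolutions.

def dilated_conv1d(x, w, dilation):
    """Convolve x with kernel w whose taps are `dilation` apart."""
    span = (len(w) - 1) * dilation + 1            # receptive field size
    return [sum(w[k] * x[i + k * dilation] for k in range(len(w)))
            for i in range(len(x) - span + 1)]

def transposed_conv1d(x, w, stride):
    """Upsample x by scattering a scaled kernel at stride intervals."""
    y = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, v in enumerate(x):                     # each input element
        for k, wk in enumerate(w):                # contributes a scaled
            y[i * stride + k] += v * wk           # copy of the kernel
    return y

print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))  # [4, 6, 8]
print(transposed_conv1d([1, 2], [1, 1], stride=2))          # [1.0, 1.0, 2.0, 2.0]
```

The dilated case reads inputs with a gapped stencil; the transposed case writes outputs with a strided scatter. These opposite access patterns are what force a standard SA to lose efficiency without the paper's architectural support.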
@article{flexcnn2022, title={FlexCNN: An End-to-End Framework for Composing CNN Accelerators on FPGA}, author={Basalama, Suhail and Sohrabizadeh, Atefeh and Wang, Jie and Guo, Licheng and Cong, Jason}, journal={ACM Transactions on Reconfigurable Technology and Systems}, year={2022}, publisher={ACM New York, NY} }
In the past few years, domain-specific accelerators (DSAs), such as Google's Tensor Processing Units, have been shown to offer significant performance and energy-efficiency gains over general-purpose CPUs. An important question is whether typical software developers can design and implement their own customized DSAs, with affordability and efficiency, to accelerate their applications. This article presents our answer to this question.
@misc{cong22cacm, doi = {10.48550/ARXIV.2209.02951}, url = {https://arxiv.org/abs/2209.02951}, author = {Chi, Yuze and Qiao, Weikang and Sohrabizadeh, Atefeh and Wang, Jie and Cong, Jason}, title = {Democratizing Domain-Specific Computing}, publisher = {arXiv}, year = {2022} }
FPGAs have been proven to be powerful computational accelerators across many types of workloads. The mainstream programming approach is high-level synthesis (HLS), which maps high-level languages (e.g., C + #pragmas) to hardware. Unfortunately, HLS leaves a significant programmability gap in terms of reconfigurability, customization, and versatility: although HLS compilation is fast, the downstream physical design takes hours to days; FPGA reconfiguration time limits the time-multiplexing ability of hardware; and tools do not reason about cross-workload flexibility. Overlay architectures mitigate the above by mapping a programmable design (e.g., CPU, GPU) on top of FPGAs. However, the abstraction gap between the overlay and the FPGA leads to low efficiency/utilization. Our essential idea is to develop a hardware generation framework targeting a highly customizable overlay, so that the abstraction gap can be lowered by tuning the design instance to applications of interest. We leverage and extend prior work on customizable spatial architectures, SoC generation, accelerator compilers, and design space explorers to create an end-to-end FPGA acceleration system. Our novel techniques address inefficient networks between on-chip memories and processing elements, and improve DSE by reducing the amount of recompilation required. Our framework, OverGen, is highly competitive with fixed-function HLS-based designs, even though the generated designs are programmable with fast reconfiguration. We compared against a state-of-the-art DSE-based HLS framework, AutoDSE. Without kernel tuning for AutoDSE, OverGen achieves 1.2x geomean performance, and even with manual kernel tuning for the baseline, OverGen still achieves 0.55x geomean performance -- all while providing runtime flexibility across workloads.
@inproceedings{liu2022overgen, title={OverGen: Improving FPGA Usability through Domain-specific Overlay Generation}, author={Liu, Sihao and Weng, Jian and Kupsh, Dylan and Sohrabizadeh, Atefeh and Wang, Zhengrong and Guo, Licheng and Liu, Jiuyang and Zhulin, Maxim and Mani, Rishabh and Zhang, Lucheng and others}, booktitle={2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)}, pages={35--56}, year={2022}, organization={IEEE} }
@inproceedings{bai2022maml, title={Improving GNN-Based Accelerator Design Automation with Meta Learning}, author={Bai, Yunsheng and Sohrabizadeh, Atefeh and Sun, Yizhou and Cong, Jason}, booktitle={2022 59th ACM/IEEE Design Automation Conference (DAC)}, year={2022} }
Recently, there has been growing interest in developing learning-based models as surrogates for High-Level Synthesis (HLS) tools, where the key objective is rapid prediction of the quality of a candidate HLS design for automated design space exploration (DSE). Training is usually conducted on a given set of computation kernels (or kernels, in short) needed for hardware acceleration. However, the model must also perform well on new kernels. The discrepancy between the training set and new kernels, called domain shift, frequently leads to a drop in model accuracy, which in turn negatively impacts the DSE performance. In this paper, we investigate the possibility of adapting an existing meta-learning approach, named MAML, to the task of design quality prediction. Experiments show the MAML-enhanced model outperforms a simple baseline based on fine-tuning in terms of both offline evaluation on hold-out test sets and online evaluation of DSE speedup results.
@inproceedings{sohrabizadeh2022gnn, title={Automated Accelerator Optimization Aided by Graph Neural Networks}, author={Sohrabizadeh, Atefeh and Bai, Yunsheng and Sun, Yizhou and Cong, Jason}, booktitle={2022 59th ACM/IEEE Design Automation Conference (DAC)}, year={2022} }
High-level synthesis (HLS) has freed computer architects from developing their designs in very low-level languages and from specifying exactly how data should be transferred at the register level. With the help of HLS, hardware designers need only describe a high-level behavioral flow of the design. Despite this, it can still take weeks to develop a high-performance architecture, mainly because there are many design choices at the higher level that require time to explore. It also takes several minutes to hours to get feedback from the HLS tool on the quality of each design candidate. In this paper, we propose to solve this problem by modeling the HLS tool with a graph neural network (GNN) that is trained to be applicable to a wide range of applications. The experimental results demonstrate that, by employing the GNN-based model, we are able to estimate the quality of a design in milliseconds with high accuracy, which helps us search through the solution space very quickly.
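The core idea can be sketched in a few lines of plain Python: represent the design as a graph, run a few rounds of message passing so each node embedding absorbs its neighborhood, then pool and apply a regression head to predict a quality metric such as latency. The graph, features, and weights below are hypothetical stand-ins, not the trained model from the paper.

```python
# Toy surrogate: message passing over a program graph + linear head.

def message_passing(features, edges, rounds=2):
    """Average-neighbor message passing; features: node_id -> vector."""
    feats = {n: list(v) for n, v in features.items()}
    neighbors = {n: [] for n in feats}
    for u, v in edges:                     # treat edges as undirected
        neighbors[u].append(v)
        neighbors[v].append(u)
    dim = len(next(iter(feats.values())))
    for _ in range(rounds):
        new = {}
        for n, vec in feats.items():
            acc = list(vec)                # include the node itself
            for m in neighbors[n]:
                for i in range(dim):
                    acc[i] += feats[m][i]
            k = 1 + len(neighbors[n])
            new[n] = [x / k for x in acc]  # mean over self + neighbors
        feats = new
    return feats

def predict_qor(features, edges, weights):
    """Pool node embeddings and apply a linear head (toy regression)."""
    feats = message_passing(features, edges)
    dim = len(weights)
    pooled = [sum(v[i] for v in feats.values()) / len(feats)
              for i in range(dim)]
    return sum(w * x for w, x in zip(weights, pooled))

# Tiny program graph: 3 "instructions" with 2-d features.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
print(predict_qor(features, [(0, 1), (1, 2)], weights=[0.5, 0.5]))
```

A forward pass like this runs in microseconds, which is the point: replacing an hours-long HLS run with a model evaluation makes exhaustive-style DSE feasible.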
@inproceedings{sohrabizadeh2022streamgcn, title={StreamGCN: Accelerating Graph Convolutional Networks with Streaming Processing}, author={Sohrabizadeh, Atefeh and Chi, Yuze and Cong, Jason}, booktitle={2022 IEEE Custom Integrated Circuits Conference (CICC)}, pages={1--8}, year={2022}, organization={IEEE} }
While there have been many studies on hardware acceleration for deep learning on images, there has been a rather limited focus on accelerating deep learning applications involving graphs. The unique characteristics of graphs, such as irregular memory access and dynamic parallelism, impose several challenges when such algorithms are mapped to a CPU or GPU. To address these challenges while exploiting all the available sparsity, we propose a flexible architecture called StreamGCN for accelerating Graph Convolutional Networks (GCN), the core computation unit of deep learning algorithms on graphs. The architecture is specialized for streaming processing of many small graphs for graph search and similarity computation. The experimental results demonstrate that StreamGCN delivers a high speedup compared to a multi-core CPU and a GPU implementation, showing the efficiency of our design.
@inproceedings{song2022sextans, title={Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication}, author={Song, Linghao and Chi, Yuze and Sohrabizadeh, Atefeh and Choi, Young-kyu and Lau, Jason and Cong, Jason}, booktitle={Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays}, pages={65--77}, year={2022} }
Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications, including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM faces three challenges: (1) random memory accesses and unbalanced loads caused by the random distribution of elements in sparse matrices, (2) inefficient handling of large matrices that cannot fit on-chip, and (3) non-general-purpose accelerator designs where one accelerator can only process a fixed-size problem. In this paper, we present Sextans, an accelerator for general-purpose SpMM processing. The Sextans accelerator features (1) fast random access using on-chip memory, (2) streaming access to off-chip large matrices, (3) PE-aware non-zero scheduling for a balanced workload with an II=1 pipeline, and (4) hardware flexibility that allows prototyping the hardware once to support SpMMs of different sizes as a general-purpose accelerator. We leverage high-bandwidth memory (HBM) for efficient access to both sparse and dense matrices. In the evaluation, we present an FPGA prototype, Sextans, which is executable on a Xilinx U280 HBM FPGA board, and a projected prototype, Sextans-P, with bandwidth comparable to a V100 and further frequency optimization. We conduct a comprehensive evaluation on 1,400 SpMMs over a wide range of sparse matrices, including 50 matrices from SNAP and 150 from SuiteSparse. We compare Sextans with NVIDIA K80 and V100 GPUs. Sextans achieves a 2.50× geomean speedup over the K80 GPU, and Sextans-P achieves a 1.14× geomean speedup over the V100 GPU (4.94× over the K80).
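For readers unfamiliar with the operator itself, here is a minimal reference sketch of SpMM (C = A_sparse × B_dense) with A in compressed sparse row (CSR) form. This shows only the computation Sextans accelerates, not the accelerator's scheduling or memory system.

```python
# Reference SpMM: CSR sparse A (m x k) times dense B (k x n) -> C (m x n).

def spmm(indptr, indices, data, B):
    m = len(indptr) - 1
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):                        # one sparse row at a time
        for p in range(indptr[i], indptr[i + 1]):
            a, j = data[p], indices[p]        # nonzero A[i][j]
            for c in range(n):                # stream the dense row B[j]
                C[i][c] += a * B[j][c]
    return C

# A = [[2, 0], [0, 3]] in CSR form; B is a 2x2 dense matrix.
C = spmm([0, 1, 2], [0, 1], [2.0, 3.0], [[1.0, 2.0], [3.0, 4.0]])
print(C)  # [[2.0, 4.0], [9.0, 12.0]]
```

The inner access `B[j]` is driven by the column indices of the nonzeros, which is exactly the random-access pattern that challenge (1) in the abstract refers to.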
@article{sohrabizadeh2022autodse, title={AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators}, author={Sohrabizadeh, Atefeh and Yu, Cody Hao and Gao, Min and Cong, Jason}, journal={ACM Transactions on Design Automation of Electronic Systems (TODAES)}, volume={27}, number={4}, pages={1--27}, year={2022}, publisher={ACM New York, NY} }
Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability of modern HLS tools becomes a major obstacle for them to maintain high accuracy. To address this problem, we propose an automated DSE framework, AutoDSE, that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point. AutoDSE detects the bottleneck of the design in each step and focuses on high-impact parameters to overcome it. The experimental results show that AutoDSE is able to identify design points that achieve, on the geometric mean, a 19.9x speedup over one CPU core for the MachSuite and Rodinia benchmarks. Compared to the manually optimized HLS vision kernels in the Xilinx Vitis libraries, AutoDSE can reduce their optimization pragmas by 26.38x while achieving similar performance. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators.
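The bottleneck-guided search loop can be sketched as follows: evaluate the current design, find the stage that dominates the runtime, and move only the parameter tied to that bottleneck. The cost model and parameter names (`UNROLL`, `TILE`) below are hypothetical toys standing in for real HLS runs and pragmas; they illustrate the search strategy, not AutoDSE's actual implementation.

```python
# Toy bottleneck-guided coordinate search in the spirit of AutoDSE.

def evaluate(cfg):
    """Toy stand-in for an HLS run: returns per-stage cycle counts."""
    return {
        "compute": 1024 // cfg["UNROLL"],    # unrolling cuts compute time
        "memory": 512 // cfg["TILE"],        # tiling improves locality
    }

def bottleneck_dse(cfg, choices, steps=8):
    for _ in range(steps):
        stages = evaluate(cfg)
        stage = max(stages, key=stages.get)  # slowest stage = bottleneck
        param = {"compute": "UNROLL", "memory": "TILE"}[stage]
        larger = [v for v in choices[param] if v > cfg[param]]
        if not larger:                       # bottleneck cannot improve
            break
        cfg[param] = min(larger)             # smallest step on that axis
    return cfg, max(evaluate(cfg).values())

cfg, latency = bottleneck_dse({"UNROLL": 1, "TILE": 1},
                              {"UNROLL": [1, 2, 4, 8], "TILE": [1, 2, 4]})
print(cfg, latency)  # {'UNROLL': 8, 'TILE': 4} 128
```

The key property is that each (expensive) evaluation changes only the coordinate blamed for the current bottleneck, so the search spends its evaluation budget on high-impact parameters instead of sweeping the full cross-product of options.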
@inproceedings{sohrabizadeh2020end, title={End-to-End Optimization of Deep Learning Applications}, author={Sohrabizadeh, Atefeh and Wang, Jie and Cong, Jason}, booktitle={The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays}, pages={133--139}, year={2020} }
Received the Best Paper Runner-Up Award at MICRO'22 for proposing a domain-specific overlay generator for FPGAs (OverGen).
Honored to be among the recipients of the 2020-2021 Cadence Women in Technology Scholarship. Thank you, Cadence!
Presented our effort on "Employing graph neural networks to improve the optimization of HLS design" for the SpatialML center (International Centre for Spatial Computational Learning - Rethinking Machine Learning Architectures and Algorithms).
Presented our DAC'22 paper on "Automated Accelerator Optimization Aided by Graph Neural Networks".
One of the speakers at the FPGA'22 workshop on "Open-Source Source-to-Source Transformation for High-Level Synthesis (HLS)". Presented our efforts in making FPGA programming easier, as discussed in our TODAES'22 paper "AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators".
Presented our FPGA'20 paper on "End-to-end Optimization of Deep Learning Applications".
Advisor: Prof. Jason Cong. I worked on developing FlexCNN, a customized accelerator for CNN computation. Currently, I am focusing on raising the abstraction level of FPGA design with the ultimate goal of democratizing customized computing. Please refer to the publications section for more details on my work.
I returned to the Memory Solutions Lab (MSL) for a second internship. Delighted to have spent another summer at lovely SSI. My project involved developing a learning model to estimate the quality of an FPGA design.
I worked in the Memory Solutions Lab (MSL), directed by Sungwook Ryu, under the supervision of Xuebin Yao, with Caroline Kahn as my mentor. My project involved accelerating SimGNN targeting the SmartSSD.
Atefeh clearly has an excellent grasp of the course material - she has taken this class before with great performance and is very willing to share her knowledge and experience with us. Her discussion sessions are very well-prepared, clearly organized, and efficiently delivered, so they're a great complement to help us better understand the theories covered in lectures. Also, she is genuinely willing to help with our questions on labs - I personally saw her go beyond expectations to devote extra time after sessions and office hours to guide us on these projects. By the nature of these projects, the solutions cannot easily be obtained by straight derivation from the theory, so a lot of trial and error may be unavoidable. She handled this with a great degree of balance, guiding us along while not sacrificing the educational value of the projects themselves. Overall, I would highly recommend Atefeh to TA this course or any related courses again in the future if she hopes to do so. Thank you!