Atefeh Sohrabizadeh

Hi, I'm Atefeh Sohrabizadeh

A CS Ph.D. candidate at UCLA, focusing on design automation.

atefehsz@cs.ucla.edu


About Me

Publications

  1. Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, Jason Cong, "Robust GNN-based Representation Learning for HLS", IEEE/ACM ICCAD'23. Best Paper Award candidate.

    The efficient and timely optimization of microarchitecture for a target application is hindered by the long evaluation runtime of each design candidate, which creates a serious burden. To tackle this problem, researchers have started using learning algorithms, such as graph neural networks (GNNs), to accelerate the process by developing a surrogate of the target tool. However, challenges arise when developing such models for HLS tools due to the program's long dependency range and the deep coupling between the input program and its transformations (i.e., pragmas). To address them, in this paper we present HARP (Hierarchical Augmentation for Representation with Pragma optimization), which introduces a novel hierarchical graph representation of the HLS design with auxiliary nodes that capture high-level hierarchical information about the design. Additionally, HARP decouples the representation of the program from that of its transformations and includes a neural pragma transformer (NPT) to enable a more systematic treatment of pragmas. HARP's graph representation and model architecture not only enhance the performance of the model and of the design space exploration based on it, but also improve the model's transfer learning capability, enabling easier adaptation to new environments.
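
    A minimal sketch of the hierarchical-representation idea above, using networkx (assumed to be installed): auxiliary nodes summarizing loop blocks are layered on top of a flat program graph. The example graph, node names, and block_of mapping are illustrative assumptions, not HARP's actual graph construction.

    # Illustrative only: augment a flat program graph with auxiliary "block"
    # nodes that expose the loop hierarchy to a GNN.
    import networkx as nx

    def add_auxiliary_block_nodes(g, block_of):
        """Attach one auxiliary node per loop block and connect it to the
        statement nodes it encloses, shortening long dependency paths."""
        h = g.copy()
        for stmt, block in block_of.items():
            aux = f"block::{block}"
            h.add_node(aux, kind="auxiliary")
            h.add_edge(stmt, aux, kind="hierarchy")
            h.add_edge(aux, stmt, kind="hierarchy")
        return h

    # Hypothetical flat graph: two statements in loop L1, one in loop L2.
    flat = nx.DiGraph()
    flat.add_edges_from([("load_a", "mul"), ("mul", "store_c")])
    hier = add_auxiliary_block_nodes(
        flat, {"load_a": "L1", "mul": "L1", "store_c": "L2"})
    print(hier.number_of_nodes(), hier.number_of_edges())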

  2. Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, Jason Cong, "FlexCNN: An End-to-End Framework for Composing CNN Accelerators on FPGA", ACM TRETS'23.

    With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for three reasons: 1) the varying dimensions within same-type layers, 2) the different types of convolution layers, especially transposed and dilated convolutions, and 3) CNNs' complex dataflow graphs. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully-pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve a 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, with up to 15.98× and 13.42× speedups for transposed and dilated convolutions, respectively, at a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.

    @article{flexcnn2022,
    title={FlexCNN: An End-to-End Framework for Composing CNN Accelerators on FPGA},
    author={Basalama, Suhail and Sohrabizadeh, Atefeh and Wang, Jie and Guo, Licheng and Cong, Jason},
    journal={ACM Transactions on Reconfigurable Technology and Systems},
    year={2022},
    publisher={ACM New York, NY}
    }
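
    To give a rough sense of the dynamic tiling mentioned in the abstract above, the sketch below picks a per-layer output tile so that the input, weight, and output tiles fit in a fixed on-chip buffer. The buffer budget, candidate tile sizes, and footprint model are invented for illustration and are not FlexCNN's actual design space exploration.

    # Toy per-layer tile selection (not FlexCNN's real DSE): choose the largest
    # output tile whose working set fits within an assumed on-chip budget.
    def tile_bytes(th, tw, tc, in_ch, k, bytes_per_elem=4):
        in_tile = (th + k - 1) * (tw + k - 1) * in_ch    # input tile incl. halo
        w_tile = k * k * in_ch * tc                      # weight tile
        out_tile = th * tw * tc                          # output tile
        return (in_tile + w_tile + out_tile) * bytes_per_elem

    def pick_tile(h, w, out_ch, in_ch, k, budget=2 * 1024 * 1024):
        sizes = (1, 2, 4, 8, 14, 16, 28, 32, 56, 64, 128, 256)
        cands = lambda n: [t for t in sizes if t <= n]
        best = None
        for th in cands(h):
            for tw in cands(w):
                for tc in cands(out_ch):
                    if tile_bytes(th, tw, tc, in_ch, k) <= budget:
                        cand = (th * tw * tc, (th, tw, tc))
                        best = cand if best is None else max(best, cand)
        return best[1] if best else None

    # Two layers with different shapes end up with different tiles.
    print(pick_tile(h=56, w=56, out_ch=256, in_ch=64, k=3))
    print(pick_tile(h=14, w=14, out_ch=1024, in_ch=512, k=1))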

  3. Yuze Chi, Weikang Qiao, Atefeh Sohrabizadeh, Jie Wang, Jason Cong, "Democratizing Domain-Specific Computing", Communications of the ACM'22.

    In the past few years, domain-specific accelerators (DSAs), such as Google's Tensor Processing Units, have been shown to offer significant performance and energy efficiency over general-purpose CPUs. An important question is whether typical software developers can design and implement their own customized DSAs, with affordability and efficiency, to accelerate their applications. This article presents our answer to this question.

    @misc{cong22cacm,
    doi = {10.48550/ARXIV.2209.02951},
    url = {https://arxiv.org/abs/2209.02951},
    author = {Chi, Yuze and Qiao, Weikang and Sohrabizadeh, Atefeh and Wang, Jie and Cong, Jason},
    title = {Democratizing Domain-Specific Computing},
    publisher = {arXiv},
    year = {2022}
    }

  4. Sihao Liu, Jian Weng, Dylan Kupsh, Atefeh Sohrabizadeh, Zhengrong Wang, Licheng Guo, Jiuyang Liu, Maxim Zhulin, Lucheng Zhang, Rishabh Mani, Jason Cong, Tony Nowatzki, "OverGen: Improving FPGA Usability through Domain-specific Overlay Generation", ACM/IEEE MICRO'22. Best Paper Runner-up Award.

    FPGAs have been proven to be powerful computational accelerators across many types of workloads. The mainstream programming approach is high-level synthesis (HLS), which maps high-level languages (e.g., C + #pragmas) to hardware. Unfortunately, HLS leaves a significant programmability gap in terms of reconfigurability, customization, and versatility: although HLS compilation is fast, the downstream physical design takes hours to days; FPGA reconfiguration time limits the time-multiplexing ability of hardware; and the tools do not reason about cross-workload flexibility. Overlay architectures mitigate the above by mapping a programmable design (e.g., a CPU or GPU) on top of FPGAs. However, the abstraction gap between the overlay and the FPGA leads to low efficiency/utilization. Our essential idea is to develop a hardware generation framework targeting a highly customizable overlay, so that the abstraction gap can be lowered by tuning the design instance to applications of interest. We leverage and extend prior work on customizable spatial architectures, SoC generation, accelerator compilers, and design space explorers to create an end-to-end FPGA acceleration system. Our novel techniques address inefficient networks between on-chip memories and processing elements, and improve DSE by reducing the amount of recompilation required. Our framework, OverGen, is highly competitive with fixed-function HLS-based designs, even though the generated designs are programmable with fast reconfiguration. We compared against a state-of-the-art DSE-based HLS framework, AutoDSE. Without kernel-tuning for AutoDSE, OverGen achieves 1.2x geomean performance, and even with manual kernel-tuning for the baseline, OverGen still achieves 0.55x geomean performance -- all while providing runtime flexibility across workloads.

    @inproceedings{liu2022overgen,
    title={OverGen: Improving FPGA Usability through Domain-specific Overlay Generation},
    author={Liu, Sihao and Weng, Jian and Kupsh, Dylan and Sohrabizadeh, Atefeh and Wang, Zhengrong and Guo, Licheng and Liu, Jiuyang and Zhulin, Maxim and Mani, Rishabh and Zhang, Lucheng and others},
    booktitle={2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)},
    pages={35--56},
    year={2022},
    organization={IEEE}
    }

  5. Yunsheng Bai, Atefeh Sohrabizadeh, Yizhou Sun, Jason Cong, "Improving GNN-Based Accelerator Design Automation with Meta Learning", ACM/IEEE DAC'22.

    @inproceedings{bai2022maml,
    title={Improving GNN-Based Accelerator Design Automation with Meta Learning},
    author={Bai, Yunsheng and Sohrabizadeh, Atefeh and Sun, Yizhou and Cong, Jason},
    booktitle={2022 59th ACM/IEEE Design Automation Conference (DAC)},
    year={2022}
    }

    Recently, there has been growing interest in developing learning-based models as surrogates of High-Level Synthesis (HLS) tools, where the key objective is rapid prediction of the quality of a candidate HLS design for automated design space exploration (DSE). Training is usually conducted on a given set of computation kernels (or kernels for short) needed for hardware acceleration. However, the model must also perform well on new kernels. The discrepancy between the training set and new kernels, called domain shift, frequently leads to a drop in model accuracy, which in turn negatively impacts DSE performance. In this paper, we investigate the possibility of adapting an existing meta-learning approach, named MAML, to the task of design quality prediction. Experiments show the MAML-enhanced model outperforms a simple baseline based on fine-tuning in terms of both offline evaluation on hold-out test sets and online evaluation of DSE speedup results.
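
    For readers unfamiliar with MAML, the snippet below is a minimal first-order MAML loop on a synthetic linear-regression "kernel" family; it only illustrates the inner/outer-loop structure the paper adapts, while the actual work applies meta-learning to a GNN surrogate of the HLS tool. All data, the model, and the hyperparameters here are made up.

    # Minimal first-order MAML sketch in NumPy; synthetic tasks stand in for kernels.
    import numpy as np

    rng = np.random.default_rng(0)
    base_w = rng.normal(size=5)                        # structure shared across kernels

    def loss_grad(theta, X, y):
        """MSE loss and gradient for the linear model y ~ X @ theta."""
        err = X @ theta - y
        return float((err ** 2).mean()), 2 * X.T @ err / len(y)

    def make_task():
        """A synthetic 'kernel': a perturbed linear map, split into support/query."""
        w = base_w + 0.3 * rng.normal(size=5)
        X = rng.normal(size=(40, 5))
        y = X @ w + 0.1 * rng.normal(size=40)
        return (X[:20], y[:20]), (X[20:], y[20:])

    theta = np.zeros(5)
    inner_lr, outer_lr = 0.05, 0.01
    for step in range(300):
        meta_grad = np.zeros_like(theta)
        for _ in range(4):                             # batch of tasks (kernels)
            (Xs, ys), (Xq, yq) = make_task()
            adapted = theta.copy()
            for _ in range(3):                         # inner loop: adapt to the kernel
                _, g = loss_grad(adapted, Xs, ys)
                adapted -= inner_lr * g
            _, gq = loss_grad(adapted, Xq, yq)         # query loss after adaptation
            meta_grad += gq                            # first-order MAML approximation
        theta -= outer_lr * meta_grad / 4              # outer loop: meta-update

    print("query loss at the learned initialization:",
          loss_grad(theta, *make_task()[1])[0])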

  6. Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, Jason Cong, "Automated Accelerator Optimization Aided by Graph Neural Networks", ACM/IEEE DAC'22.

    @inproceedings{sohrabizadeh2022gnn,
    title={Automated Accelerator Optimization Aided by Graph Neural Networks},
    author={Sohrabizadeh, Atefeh and Bai, Yunsheng and Sun, Yizhou and Cong, Jason},
    booktitle={2022 59th ACM/IEEE Design Automation Conference (DAC)},
    year={2022}
    }

    High-level synthesis (HLS) has freed computer architects from developing their designs in a very low-level language and from specifying exactly how data should be transferred at the register level. With the help of HLS, hardware designers need only describe a high-level behavioral flow of the design. Despite this, it can still take weeks to develop a high-performance architecture, mainly because there are many design choices at the higher level that require time to explore. It also takes several minutes to hours to get feedback from the HLS tool on the quality of each design candidate. In this paper, we propose to solve this problem by modeling the HLS tool with a graph neural network (GNN) that is trained to be used for a wide range of applications. The experimental results demonstrate that by employing the GNN-based model, we are able to estimate the quality of a design in milliseconds with high accuracy, which helps us search through the solution space very quickly.
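
    The way such a millisecond-scale surrogate plugs into DSE can be sketched as follows: score every pragma configuration with the model and send only the top candidates to the real HLS tool. The pragma space, cost stub, and resource budget below are hypothetical placeholders, not the paper's GNN model or benchmarks.

    # Illustrative DSE loop driven by a fast surrogate instead of HLS runs.
    # predict_quality is a stand-in for a trained model, not the paper's GNN.
    import itertools, random

    PRAGMA_SPACE = {
        "unroll": [1, 2, 4, 8, 16],
        "tile": [1, 4, 16, 64],
        "pipeline": [False, True],
    }

    def predict_quality(cfg):
        """Placeholder surrogate: returns (estimated latency, estimated LUT usage)."""
        random.seed(str(sorted(cfg.items())))           # deterministic fake score
        speed = cfg["unroll"] * cfg["tile"] * (2 if cfg["pipeline"] else 1)
        return 1e6 / speed + random.random() * 50, 1000 * speed ** 0.5

    def explore(top_k=3, lut_budget=40_000):
        cands = [dict(zip(PRAGMA_SPACE, vals))
                 for vals in itertools.product(*PRAGMA_SPACE.values())]
        scored = [(predict_quality(c), c) for c in cands]        # milliseconds each
        feasible = [(lat, c) for (lat, lut), c in scored if lut <= lut_budget]
        return sorted(feasible, key=lambda t: t[0])[:top_k]      # send these to HLS

    for latency, cfg in explore():
        print(f"predicted latency {latency:.0f} cycles for {cfg}")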

  7. Atefeh Sohrabizadeh, Yuze Chi, Jason Cong, "StreamGCN: Accelerating Graph Convolutional Networks with Streaming Processing", IEEE CICC'22.

    @inproceedings{sohrabizadeh2022streamgcn,
    title={StreamGCN: Accelerating Graph Convolutional Networks with Streaming Processing},
    author={Sohrabizadeh, Atefeh and Chi, Yuze and Cong, Jason},
    booktitle={2022 IEEE Custom Integrated Circuits Conference (CICC)},
    pages={1--8},
    year={2022},
    organization={IEEE}
    }

    While there have been many studies on hardware acceleration for deep learning on images, there has been rather limited focus on accelerating deep learning applications involving graphs. The unique characteristics of graphs, such as irregular memory accesses and dynamic parallelism, impose several challenges when the algorithm is mapped to a CPU or GPU. To address these challenges while exploiting all the available sparsity, we propose a flexible architecture called StreamGCN for accelerating Graph Convolutional Networks (GCNs), the core computation unit in deep learning algorithms on graphs. The architecture is specialized for the streaming processing of many small graphs for graph search and similarity computation. The experimental results demonstrate that StreamGCN can deliver a high speedup compared to a multi-core CPU and a GPU implementation, showing the efficiency of our design.
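
    For reference, the core computation a GCN layer performs, and that StreamGCN accelerates, is roughly H' = ReLU(A_hat * H * W), with A_hat the normalized adjacency matrix. The NumPy sketch below spells that out on a tiny made-up graph; it says nothing about the streaming hardware itself.

    # One GCN layer on a tiny 4-node graph: H_next = ReLU(A_hat @ H @ W).
    # Pure NumPy reference of the computation, not the accelerator.
    import numpy as np

    A = np.array([[0, 1, 0, 0],                    # adjacency of an example graph
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    A_hat = A + np.eye(4)                          # add self-loops
    d = A_hat.sum(axis=1)
    A_hat = A_hat / np.sqrt(np.outer(d, d))        # symmetric normalization

    rng = np.random.default_rng(0)
    H = rng.normal(size=(4, 8))                    # node features
    W = rng.normal(size=(8, 16))                   # layer weights

    H_next = np.maximum(A_hat @ H @ W, 0.0)        # ReLU(A_hat H W)
    print(H_next.shape)                            # (4, 16)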

  8. Linghao Song, Yuze Chi, Atefeh Sohrabizadeh, Young-kyu Choi, Jason Lau, Jason Cong, "Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication", ACM FPGA'22.

    @inproceedings{song2022sextans,
    title={Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication},
    author={Song, Linghao and Chi, Yuze and Sohrabizadeh, Atefeh and Choi, Young-kyu and Lau, Jason and Cong, Jason},
    booktitle={Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
    pages={65--77},
    year={2022}
    }

    Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications, including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM faces three challenges: (1) random memory accesses and unbalanced load caused by the random distribution of elements in sparse matrices, (2) inefficient handling of large matrices that cannot fit on-chip, and (3) a non-general-purpose accelerator design in which one accelerator can only process a fixed-size problem. In this paper, we present Sextans, an accelerator for general-purpose SpMM processing. The Sextans accelerator features (1) fast random access using on-chip memory, (2) streaming access to off-chip large matrices, (3) PE-aware non-zero scheduling for a balanced workload with an II=1 pipeline, and (4) hardware flexibility that allows prototyping the hardware once and supporting SpMMs of different sizes as a general-purpose accelerator. We leverage high-bandwidth memory (HBM) for efficient access to both sparse and dense matrices. In the evaluation, we present an FPGA prototype, Sextans, which is executable on a Xilinx U280 HBM FPGA board, and a projected prototype, Sextans-P, with higher bandwidth comparable to a V100 and further frequency optimization. We conduct a comprehensive evaluation on 1,400 SpMMs on a wide range of sparse matrices, including 50 matrices from SNAP and 150 from SuiteSparse. We compare Sextans with NVIDIA K80 and V100 GPUs. Sextans achieves a 2.50x geomean speedup over the K80 GPU, and Sextans-P achieves a 1.14x geomean speedup over the V100 GPU (4.94x over the K80).
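
    As a functional reference for the operator itself, the sketch below computes C = A * B with A stored in CSR format. It only defines SpMM; the PE-aware scheduling and HBM streaming that make Sextans fast are not modeled here.

    # Functional SpMM reference: C = A @ B with A sparse in CSR form.
    import numpy as np

    def spmm_csr(indptr, indices, data, B):
        """Multiply a CSR sparse matrix (indptr, indices, data) by a dense B."""
        n_rows = len(indptr) - 1
        C = np.zeros((n_rows, B.shape[1]))
        for i in range(n_rows):
            for p in range(indptr[i], indptr[i + 1]):   # nonzeros of row i
                C[i] += data[p] * B[indices[p]]
        return C

    # A = [[2, 0, 0], [0, 0, 3], [1, 4, 0]] in CSR form, times a 3x2 dense B.
    indptr, indices, data = [0, 1, 2, 4], [0, 2, 0, 1], [2.0, 3.0, 1.0, 4.0]
    B = np.arange(6, dtype=float).reshape(3, 2)
    print(spmm_csr(indptr, indices, data, B))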

  9. Atefeh Sohrabizadeh, Cody Hao Yu, Min Gao, Jason Cong, "AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators", ACM TODAES'22. Best Paper Award.

    @article{sohrabizadeh2022autodse,
    title={AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators},
    author={Sohrabizadeh, Atefeh and Yu, Cody Hao and Gao, Min and Cong, Jason},
    journal={ACM Transactions on Design Automation of Electronic Systems (TODAES)},
    volume={27},
    number={4},
    pages={1--27},
    year={2022},
    publisher={ACM New York, NY}
    }

    Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability of modern HLS tools becomes a major obstacle for them to maintain high accuracy. To address this problem, we propose an automated DSE framework, AutoDSE, that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point. AutoDSE detects the bottleneck of the design in each step and focuses on high-impact parameters to overcome it. The experimental results show that AutoDSE is able to identify design points that achieve, on the geometric mean, a 19.9x speedup over one CPU core for the MachSuite and Rodinia benchmarks. Compared to the manually optimized HLS vision kernels in the Xilinx Vitis libraries, AutoDSE can reduce their optimization pragmas by 26.38x while achieving similar performance. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators.
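
    The bottleneck-guided coordinate search described above can be sketched roughly as follows: in each round, find the part of the design dominating the estimated latency and change only the knob that targets it. The toy cost model and pragma knobs are invented for illustration; AutoDSE drives a real HLS tool with its own bottleneck analysis.

    # Rough sketch of a bottleneck-guided coordinate optimizer (toy cost model).
    def evaluate(cfg):
        """Hypothetical model: each loop's latency shrinks with its unroll factor."""
        return {"loop_load": 4000 / cfg["unroll_load"],
                "loop_compute": 20000 / cfg["unroll_compute"],
                "loop_store": 3000 / cfg["unroll_store"]}

    def bottleneck_guided_search(cfg, steps=6):
        for _ in range(steps):
            lat = evaluate(cfg)
            bottleneck = max(lat, key=lat.get)             # slowest section
            knob = "unroll_" + bottleneck.split("_")[1]    # knob targeting it
            cfg[knob] = min(cfg[knob] * 2, 64)             # high-impact move only
            print(f"bottleneck={bottleneck:<13} total={sum(evaluate(cfg).values()):8.0f}")
        return cfg

    print(bottleneck_guided_search(
        {"unroll_load": 1, "unroll_compute": 1, "unroll_store": 1}))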

  10. Atefeh Sohrabizadeh, Jie Wang, Jason Cong, "End-to-End Optimization of Deep Learning Applications", ACM FPGA'20.

    @inproceedings{sohrabizadeh2020end,
    title={End-to-End Optimization of Deep Learning Applications},
    author={Sohrabizadeh, Atefeh and Wang, Jie and Cong, Jason},
    booktitle={The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
    pages={133--139},
    year={2020}
    }

    The irregularity of recent Convolutional Neural Network (CNN) models, such as reduced data reuse and parallelism due to extensive network pruning and simplification, creates new challenges for FPGA acceleration. Furthermore, without proper optimization, there can be significant overheads when integrating FPGAs into existing machine learning frameworks like TensorFlow. Such a problem is mostly overlooked by previous studies. However, our study shows that a naive FPGA integration into TensorFlow could lead to up to 8.45x performance degradation. To address the challenges mentioned above, we propose several SW/HW co-design approaches to perform end-to-end optimization of deep learning applications. We present a flexible and composable architecture called FlexCNN. It can deliver high computation efficiency for different types of convolution layers using techniques including dynamic tiling and data layout optimization. FlexCNN is further integrated into the TensorFlow framework with a fully-pipelined software-hardware integration flow, which alleviates the high overheads of the TensorFlow-FPGA handshake and other non-CNN processing stages. We use OpenPose, a popular CNN-based application for human pose recognition, as a case study. Experimental results show that with the FlexCNN architecture optimizations, we can achieve a 2.3x performance improvement. The pipelined integration stack leads to a further 5x speedup. Overall, the SW/HW co-optimization produces a speedup of 11.5x and results in an end-to-end performance of 23.8 FPS for OpenPose with floating-point precision, which is the highest performance reported for this application on an FPGA in the literature.
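
    The fully-pipelined software-hardware integration can be illustrated with a simple producer/consumer overlap: while the accelerator (a stand-in function here) processes frame i, the host already preprocesses frame i+1. The stage functions and timings below are made up; the real flow integrates FlexCNN into TensorFlow.

    # Toy illustration of overlapping host preprocessing with accelerator work.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def preprocess(frame):                 # host-side stage (e.g., layout change)
        time.sleep(0.02)
        return f"pre({frame})"

    def run_accelerator(data):             # stand-in for the FPGA inference call
        time.sleep(0.05)
        return f"out({data})"

    frames = list(range(8))
    start = time.time()
    with ThreadPoolExecutor(max_workers=1) as fpga:
        pending, results = None, []
        for frame in frames:
            staged = preprocess(frame)     # overlaps with the previous FPGA run
            if pending is not None:
                results.append(pending.result())
            pending = fpga.submit(run_accelerator, staged)
        results.append(pending.result())
    print(f"{len(results)} frames in {time.time() - start:.2f}s (pipelined)")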

Awards

Talks

Work Experience
