Accelerating Task Parallel Workloads
Vidushi Dadu, Tony Nowatzki
Reconfigurable architectures and task parallelism seem to be at odds, as the former requires repetitive and simple program structure, and the latter breaks program structure to create small, individually scheduled program units. Our insight is that if tasks and their potential for communication structure are first-class primitives in the hardware, it is possible to recover program structure with extremely low overhead. We propose a task execution model for accelerators called TaskStream, which annotates task dependences with information sufficient to recover inter-task structure. TaskStream enables work-aware load balancing, recovery of pipelined inter-task dependences, and recovery of inter-task read sharing through multicasting.
Exposing the Value of Flexibility in Graph Processing Accelerators
Vidushi Dadu, Sihao Liu, Tony Nowatzki
Traditionally, parallel graph processing is implemented using either a simpler synchronous model or an asynchronous model that offers work-efficiency benefits. Because the trade-offs are comparable, we find that there is no clear winner among the accelerators specializing in either of these models. Our goal is to enable efficient asynchronous execution while retaining the benefits of the synchronous model, depending on algorithm requirements. In this work, we explore whether "tasks" exposed as a first-order primitive in a reconfigurable accelerator can enable efficient graph acceleration.
Using Memory Traces To Drive Spatial Architecture Studies
Vidushi Dadu, Kermin Elliott Fleming
Spatial architectures have found applications in many specialization domains (e.g., FPGA, CGRA, ASIC), but their evaluation has advanced haphazardly. The reason is that the flexibility of these architectures opens up a large design space of possible architecture techniques, making it hard to maintain a scientific baseline that is consistent with most hardware proposals. To drive the evaluation of spatial architecture studies, we have developed a broadly applicable memory trace format along with a trace generation algorithm. Using the traces, we performed workload characterization for important workloads in signal processing, linear algebra, graph processing, and bioinformatics.
Towards General Purpose Acceleration by Exploiting Data-Dependence Forms
Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki
Programmable hardware accelerators (e.g., vector processors, GPUs) have been extremely successful at targeting algorithms with regular control and memory patterns, achieving order-of-magnitude gains in performance and energy efficiency. However, they perform far below peak on important irregular algorithms, like those from graph processing, database querying, genomics, advanced machine learning, and others. We find that the reason is that they try to handle arbitrary control and memory dependence, while data-processing algorithms exhibit only a handful of characteristics. By capturing the problematic behavior at a domain-agnostic level, we propose an accelerator that is sufficiently general, matches domain-specific accelerator performance, and significantly outperforms traditional CPUs and GPUs.
Simple Scheduling and Partitioning Techniques for Subarray-Aware Memories [paper]
Vidushi Dadu, Saugata Ghose, Kevin Chang, Onur Mutlu
Prior works exploit the implicit parallelism present in DRAM (in the form of ranks, banks, and subarrays) to effectively distribute shared main memory among applications in a multi-core system. Although such techniques are expected to improve memory throughput, distributing data at a finer granularity often complicates the problem of scheduling memory requests. We suggest a hybrid application-aware scheduling and memory mapping technique to mitigate inter-application interference in subarray-aware memories. The application's characteristics are captured during a profiling phase using a novel metric called subarray sensitivity.
Image source: Yoongu Kim et al., A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM, ISCA 2012.