# BeaconGNN: Large-Scale GNN Acceleration with Asynchronous In-Storage Computing

Yuyue Wang<sup>1</sup>, Xiurui Pan<sup>2</sup>, Yuda An<sup>2</sup>, Jie Zhang<sup>2</sup>, **Glenn Reinman**<sup>1</sup>





## What is GNN, why does it matters

- Graph, a universal structure
  - Social network
  - Recommendation system
  - Pandemic...
- Graph information
  - Node: a vector of feature
  - Edge: relation between nodes

Nodes and edges provide rich information to analyze



# They used to be processed in separate



| Data type         | Representation    | Analysis method                                       |     |
|-------------------|-------------------|-------------------------------------------------------|-----|
| Edge (connection) | Adjacency matrix, | Classical graph analytics algorithms (e.g. page rank) |     |
| Node              | Feature vectors   | Machine learning to extract high level features       | UCL |

#### They used to be processed in separate



| Data type         | Representation    | Analysis method                                       |
|-------------------|-------------------|-------------------------------------------------------|
| Edge (connection) | Adjacency matrix, | Classical graph analytics algorithms (e.g. page rank) |
| Node              | Feature vectors   | Machine learning to extract high level features       |

# Graph neural network (GNN) bridges the two domains





# Graph neural network (GNN) bridges the two domains





# Graph neural network (GNN) bridges the two domains



GNN extracts both graph structure and node features



# System-level challenge of GNN

• The dataset is getting larger and larger

| # Node      | Feature length | # Edge     | Total size     |
|-------------|----------------|------------|----------------|
| 500 Million | 200 (Int16)    | 50 Billion | (200 + 400) GB |

• Easily exceeds the Server **DIMM Capacity** 



Several hundreds of GB



# System-level challenge of GNN

• The dataset is getting larger and larger

| # Node      | Feature length | # Edge     | Total size     |
|-------------|----------------|------------|----------------|
| 500 Million | 200 (Int16)    | 50 Billion | (200 + 400) GB |

• But entirely fits into a single Solid-State Drive (SSD)!







#### Large number of flash chips







#### High flash sensing delay







#### Narrow PCIe 3.0 x4 bandwidth







#### Block granular interface







#### In-storage computing



SSD internal architecture w/ compute units (simplified)

Two types of offloads:

- Early predicate execution E.g. Database filter
- Compute in-situation
  E.g. Database aggregate

Both reduces data movement



Slow PCIe interconnect?

- PCIe 4.0: 2GB/s per lane!
- PCIe 5.0: even faster
- Not a convincing motivation any longer



Slow PCIe interconnect?

- PCIe 4.0: 2GB/s per lane!
- PCle 5.0: even faster
- Not a convincing motivation any longer

#### Flash are high latency media?

- Ultra-low latency Z-SSD (3 µs flash read)
- More pressure to host storage stack (~ 10 us)



Slow PCIe interconnect?

- PCIe 4.0: 2GB/s per lane!
- PCle 5.0: even faster
- Not a convincing motivation any longer

Flash are high latency media?

- Ultra-low latency Z-SSD (3 µs flash read)
- More pressure to host storage stack (~ 10 us)

Technology shifts bring new challenges and opportunities!





Hop 1

GNN subgraph generation: iterations of node sampling





GNN subgraph generation: iterations of node sampling





GNN subgraph generation: iterations of node sampling



To sample a new hop: need the host to locate





- Resubmission requests traverse the whole OS stack
- Layer batch amortizes communication, but brings barriers



- Flash sense time: 3 µs
- Channel transfer rate: 800 MT/s



- Flash sense time: 3 μs
- Channel transfer rate: 800 MT/s





- Flash sense time: 3 μs
- Channel transfer rate: 800 MT/s









Limited improvement

Flash dies are underutilized Flash channels transfer useless data

UCLA

• Scheduler polls I/O completion





- Scheduler polls I/O completion
- Manage request

| SSD<br>←→ data ←→                   | control/addr      | Dependency queue  |  |  |
|-------------------------------------|-------------------|-------------------|--|--|
| Flash firmware<br>I/O poller<br>FTL | (((( Hardware     | er (channel, die) |  |  |
| Manage requests                     |                   |                   |  |  |
| SSD<br>DRAM Cache                   | Mapping table Req | uest queues       |  |  |

UC

- Scheduler polls I/O completion
- Manage request
- Locate next request address



UCLA

Controller has 1-4 cores, while backend flash has about 100 dies in active

Huge mismatch!



#### **Optimization 1: Address translation fusion**





#### **Optimization 1: Address translation fusion**





## Optimization 1: Address translation fusion

#### Static graph with address mapping stored in flash



DirectGraph format



# Optimization 2: In-flash sampling

#### Flash dies area budget

Add more control logic (offload sampling & vector retrieving)





## **Optimization 2: In-flash sampling**

FSM to sample node features, generate resubmit request



UCLA

# Optimization 2: In-flash sampling

FSM to sample node features, generate resubmit request

Example task: primary section  $\xrightarrow{\text{sample}}$  5 nodes

Get: 3 nodes (primary section sample request), 1 resubmit request to sample 2 nodes from a secondary section



## Optimization 3: Hardware-based resubmission

### Route commands between channels ( $n \rightarrow 0$ )



UCLA

## Optimization 3: Hardware-based resubmission

### Route commands between channels ( $n \rightarrow 0$ )



Router at channel 0



- GNN runtime
  - Interact w/ host
  - Submit flash request
  - Schedule DNN execution



**Overall architecture** 



- GNN runtime
  - Interact w/ host
  - Submit flash request
  - Schedule DNN execution
- Flash die
  - Sample/Retrieval
  - Generate new requests



**Overall architecture** 



- GNN runtime
  - Interact w/ host
  - Submit flash request
  - Schedule DNN execution
- Flash die
  - Sample/Retrieval
  - Generate new requests
- Flash interface
  - Route resubmit commands



**Overall architecture** 



- GNN runtime
  - Interact w/ host
  - Submit flash request
  - Schedule DNN execution
- Flash die
  - Sample/Retrieval
  - Generate new requests
- Flash interface
  - Route resubmit commands

Hardware-based

c request resubmission



| СС    | CPU-centric architecture, with PCIe Accelerator 128x128 systolic array, 32 MB SRAM, 1 GHz |
|-------|-------------------------------------------------------------------------------------------|
| BG-1  | Basic in-storage computing architecture                                                   |
| BG-DG | BG-1 with DirectGraph GNN format                                                          |
| BG-SP | BG-1 with in-flash node sampling and vector retrieving                                    |
| BG-2  | BG-DGSP with inter-channel hw-based command resubmission                                  |

Simulated platforms

| Interface          | NVMe, PCIe 4.0 x4                                                            |
|--------------------|------------------------------------------------------------------------------|
| Controller         | 4 ARM Cortex-A9 Cores                                                        |
| DRAM               | DDR4-3200, 25.6 GB/s, 1 GB                                                   |
| Flash              | 16 Channel, 8 Die/Channel, 4 KB Page<br>3 us read, 800 MB/s channel transfer |
| ISC<br>Accelerator | ISC: 64x64 systolic array<br>6 MB SRAM, 800 MHz                              |

Default SSD configuration







Throughput on five large-scale GNN dataset











Energy breakdown on amazon dataset



- Technical shifts, from both device and interconnect, break tradition of ISC design
- Control & Data path of traditional I/O can be a new bottleneck
- Automating such paths with hardware can offer huge performance benefit

