BNS-GCN observes that the communication overhead of distributed GCN training is proportional to the number of boundary nodes. Based on this, to reduce communication and memory usage, it samples the boundary nodes before each epoch (boundary node sampling). Extensive experiments verify the performance of BNS-GCN, and a theoretical analysis of its effect on training convergence is also given.
Introduction
As graphs grow larger, GCN training demands more and more memory, and the node features and adjacency matrix usually cannot fit on a single machine. Three families of methods address this problem:
- Sampling, which effectively reduces computation and storage overhead. For example, GraphSAGE and VR-GCN build mini-batches through neighbor sampling, while Cluster-GCN and GraphSAINT use sampled subgraphs as training samples.
- Distributed training, which partitions the graph, trains on multiple machines in parallel, and performs the necessary synchronization through communication, e.g., NeuGraph, ROC, CAGNET, Dorylus, and PipeGCN. Its main problem is communication, which limits training efficiency.
- Asynchronous training, which overlaps computation and communication asynchronously but suffers from staleness, e.g., Dorylus.
The BNS-GCN paper builds on distributed training and studies the root cause of its communication and memory explosion. There are three main contributions:
- The main challenges of distributed GCN training are analyzed, and the root cause is identified as the excessive number of boundary nodes rather than boundary edges, which leads to:
  - huge communication volume
  - high memory overhead
  - unbalanced memory overhead
- BNS-GCN is proposed, which randomly samples boundary nodes in each epoch. This effectively reduces communication and memory footprint while maintaining accuracy and improving generalization; the authors claim it is currently the only work that reduces the communication of distributed GCN training without hurting accuracy or introducing extra computation overhead.
- A theoretical analysis of BNS-GCN's convergence is provided. Extensive experiments show that BNS-GCN achieves up to a 16.2x throughput improvement and a 58% memory-usage reduction with the same or even better accuracy.
Background and Related Works
GCN training has two main steps: neighbor aggregation and vertex update.
Here $\zeta^{(l)}$ is the neighbor aggregation (a mean aggregator is used) and $\phi^{(l)}$ is the vertex update; BNS-GCN adopts the same update operation as GraphSAGE.
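The equations referenced here did not survive in this copy; a hedged reconstruction of the per-layer formulas (1) and (2), assuming the standard mean aggregator and GraphSAGE-style update named in the text, is:

$$
z_v^{(l)} = \zeta^{(l)}\left(\left\{ h_u^{(l-1)} \mid u \in \mathcal{N}(v) \right\}\right) = \frac{1}{|\mathcal{N}(v)|} \sum_{u \in \mathcal{N}(v)} h_u^{(l-1)} \qquad (1)
$$

$$
h_v^{(l)} = \phi^{(l)}\left(h_v^{(l-1)}, z_v^{(l)}\right) = \sigma\left(W^{(l)} \cdot \left[\, h_v^{(l-1)} \,\|\, z_v^{(l)} \,\right]\right) \qquad (2)
$$

where $\mathcal{N}(v)$ is the neighbor set of vertex $v$, $h_v^{(l)}$ its representation at layer $l$, $W^{(l)}$ the layer weights, and $\|$ denotes concatenation.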
Sampling-Based GCN Training
Sampling-based GCN training generally includes neighbor sampling, layer sampling, and subgraph sampling.
They have the following problems:
- Inaccurate feature estimates: although most samplers provide unbiased estimates of vertex features, the variance of these estimates hurts model accuracy; the smaller the variance, the better the accuracy (Minimal variance sampling with provable guarantees for fast training of graph neural networks).
- Neighbor explosion: for neighbor-sampling methods the number of sampled nodes grows exponentially with the number of layers (GraphSAGE), and even with a cap on the neighbor fan-out the memory requirement is still very high (Stochastic training of graph convolutional networks with variance reduction).
- Sampling overhead: all sampling algorithms require extra time to generate mini-batches, accounting for 25%+ of training time (GraphSAINT).
Distributed Training for GCNs
The challenges of GCN training differ from those of traditional DNN training:
- DNNs usually have small training samples and large models.
- DNN data samples have no dependencies between each other.
The basic paradigm of distributed GCN training (Figure 1a) divides the whole graph into small subgraphs, each of which fits entirely in one GPU, and trains them in parallel; communication happens at the boundary nodes between subgraphs, where vertex features are exchanged.
ROC, NeuGraph, and AliGraph follow this paradigm: they partition the graph, store it in CPU memory, and swap parts of the subgraphs to the GPU for training, so training efficiency is limited by the CPU-GPU swap overhead.
CAGNET and $P^3$ further partition vertex features and layers to achieve intra-layer parallelism. This can cause serious communication overhead, especially when the feature dimension is large.
Dorylus uses a large number of threads to pipeline fine-grained computational operations to speed up distributed training, but communication remains the bottleneck.
BNS-GCN Framework
Boundary Node Sampling first partitions the graph so that the boundary points are as few as possible, and then randomly samples the boundary points to further reduce communication and memory overhead.
Figure 2 shows the process of distributed GCN training. First, the graph is partitioned and each subgraph is assigned to a different worker. Before aggregating neighbors, each worker fetches the features of its remote dependency nodes through communication, then performs the aggregation and vertex update to obtain the next layer's vertex representations; this completes one GCN layer, and every subsequent layer follows the same pattern. Finally, the loss is computed and the backward pass, whose logic mirrors the forward pass, is run, after which the parameters are synchronized via AllReduce.
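As a rough illustration of this pipeline (not the authors' implementation), a single worker's step might look like the sketch below; `send_idx`, `send_splits`, `recv_splits`, and the `layers` modules are hypothetical placeholders, the graph structure is assumed to be baked into each layer, and a backend that supports all-to-all (e.g., NCCL) is assumed.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def exchange_boundary_features(h, send_idx, send_splits, recv_splits):
    """Ship features of locally owned nodes that peers depend on, and receive
    the remote boundary-node features this partition depends on."""
    send_buf = h[send_idx].contiguous()
    recv_buf = h.new_empty(sum(recv_splits), h.size(1))
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=recv_splits,
                           input_split_sizes=send_splits)
    return recv_buf

def train_step(layers, inner_feat, labels, send_idx, send_splits, recv_splits, optimizer):
    """One partition-parallel step on a single worker; the backward exchange of
    boundary-node gradients is omitted for brevity."""
    h = inner_feat
    for layer in layers:
        remote = exchange_boundary_features(h, send_idx, send_splits, recv_splits)
        # Each (hypothetical) layer aggregates over inner + received boundary
        # features, then updates and returns representations of inner nodes only.
        h = layer(torch.cat([h, remote], dim=0))
    loss = F.cross_entropy(h, labels)
    loss.backward()
    for group in optimizer.param_groups:          # parameter-gradient sync
        for p in group["params"]:
            if p.grad is not None:
                dist.all_reduce(p.grad)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```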
Challenges in Partition-Parallel Training
Traditional partition-parallel training faces three main challenges:
- Huge communication overhead: the number of boundary nodes grows with the number of partitions, which severely limits scalability.
- Heavy memory usage: each GPU has to store both inner nodes and boundary nodes.
- Unbalanced memory usage: partitions with many boundary nodes consume far more GPU memory, while other GPUs are left under-utilized.
The authors argue that all three problems are due to boundary vertices.
Communication Cost Analysis
Per layer, each partition must receive the features of all of its boundary nodes, so the total communication volume is proportional to the number of boundary nodes summed over all partitions.
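In the same notation as above (a hedged paraphrase, not the paper's exact equation), the per-layer communication volume can be written as

$$
\mathrm{Comm}^{(l)} \;\propto\; \sum_{i=1}^{P} |\mathcal{B}_i| \cdot d^{(l)}
$$

where $P$ is the number of partitions, $\mathcal{B}_i$ the boundary-node set of partition $i$, and $d^{(l)}$ the feature dimension at layer $l$.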
Memory Cost Analysis
Based on formulas (1) and (2) (the GraphSAGE-style aggregation and update above), the per-partition memory footprint can be estimated.
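A hedged sketch of that estimate: at every layer, partition $i$ must hold the representations of its inner nodes $\mathcal{V}_i$ plus the received boundary nodes $\mathcal{B}_i$, so

$$
\mathrm{Mem}_i \;\propto\; \sum_{l} \left(|\mathcal{V}_i| + |\mathcal{B}_i|\right) d^{(l)}
$$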
The memory usage thus grows linearly with the number of boundary nodes. After dividing Reddit into 10 partitions with METIS, the ratio of boundary nodes to inner nodes reaches as high as 5.5x in some partitions, causing high communication and memory overhead.
Memory Imbalance Analysis
Because boundary nodes are distributed unevenly, memory usage is also unbalanced, and the imbalance grows with the number of partitions. Figure 3 shows the distribution of the boundary-to-inner ratio when ogbn-papers100M is divided into 192 partitions: a small fraction of partitions have a ratio as high as 8, which not only inflates their memory usage but also leaves the memory of other partitions under-utilized.
BNS-GCN Technique
Graph Partition
Two goals for graph partitioning:
- Minimize the number of boundary nodes, since it directly determines the communication volume.
- Balance the computation across partitions, because each layer requires synchronous communication, so an unbalanced partition blocks the execution of the others.
Among existing work, CAGNET and DistDGL only address goal 2, while BNS-GCN considers both goals 1 and 2.
For goal 2, BNS-GCN approximates the computational cost of every vertex as equal; for GraphSAGE the cost is mainly determined by formula (2), so balancing computation reduces to producing equal-sized partitions.
Goal 1 is then optimized: BNS-GCN uses METIS for graph partitioning by default, with the objective set to minimize communication, i.e., the number of boundary nodes.
METIS has complexity $O(|E|)$ and only needs to run once, in the preprocessing stage.
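A minimal sketch of this preprocessing step, assuming DGL's `metis_partition_assignment` helper (the function name and the boundary-counting logic below are illustrative, not the authors' code):

```python
import dgl
import torch

def partition_and_count_boundaries(g: dgl.DGLGraph, num_parts: int):
    """Assign nodes to partitions with METIS, then count each partition's
    boundary nodes, i.e. the remote endpoints of edges crossing into it."""
    parts = dgl.metis_partition_assignment(g, num_parts)  # partition id per node
    src, dst = g.edges()
    cut = parts[src] != parts[dst]                         # cross-partition edges
    boundary_counts = []
    for i in range(num_parts):
        incoming = cut & (parts[dst] == i)                 # edges entering partition i
        boundary_counts.append(int(torch.unique(src[incoming]).numel()))
    return parts, boundary_counts
```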
Boundary Node Sampling (BNS)
Even with the best partitioning method, boundary nodes remain a problem (Table 1), so a new method is needed to reduce their number.
This method needs to achieve three goals:
- significantly reduce the number of boundary nodes
- add minimal extra overhead
- preserve training accuracy
The core idea of BNS-GCN is to randomly select a subset of the boundary nodes before each training epoch; storage and communication then involve only this subset rather than the full set of boundary nodes.
The specific steps of BNS-GCN are as follows:
The input is each partition's subgraph together with the corresponding features and labels.
First, the inner nodes are separated from the boundary nodes. For each epoch, the boundary nodes are sampled with probability p to obtain the training subgraph, and the sampled node IDs are broadcast to the other partitions; after receiving the sampling information from the other partitions, each worker intersects it with its own inner nodes to determine which features to send to which partition. During training, the features of remote nodes are first fetched through communication and the forward pass then runs locally; the backward pass is similar, sending the required gradients back to the corresponding partitions. Finally, the parameters are synchronized with AllReduce.
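A minimal sketch of the per-epoch sampling and ID exchange (illustrative only; `boundary_ids` is a hypothetical tensor holding this worker's remote boundary-node IDs):

```python
import torch
import torch.distributed as dist

def sample_boundary_nodes(boundary_ids: torch.Tensor, p: float) -> torch.Tensor:
    """Keep each boundary node independently with probability p, once per epoch."""
    keep = torch.rand(boundary_ids.numel()) < p
    return boundary_ids[keep]

def broadcast_sampled_ids(local_sampled: torch.Tensor, world_size: int):
    """Tell every peer which boundary nodes this worker kept, so each partition
    knows which of its inner-node features to send during this epoch."""
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_sampled.tolist())
    return gathered  # gathered[r] = boundary-node IDs sampled by rank r
```

Each worker then intersects `gathered[r]` with its own inner-node set to build the send list for rank r; with p=0.1, only roughly 10% of boundary features cross the network in a given epoch.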
The sampling overhead of BNS-GCN is very low, about 0%-7% of training time. Other communication-bound distributed training approaches could borrow this sampling idea from BNS-GCN.
Variance Analysis
The approximation variance of the features determines the upper bound of the gradient noise (Minimal variance sampling with provable guarantees for fast training of graph neural networks, SIGKDD).
Sampling-based methods with lower approximation variance converge faster and reach higher accuracy (SGD: General analysis and improved rates, PMLR 2019).
Table 2 compares the variance of BNS-GCN with that of different sampling algorithms; VR-GCN cannot be compared directly, and BNS-GCN's variance is smaller than that of the others.
BNS-GCN vs Existing Sampling Methods
- Vertex sampling: GraphSAGE and VR-GCN use vertex (neighbor) sampling, which may sample the same node multiple times and is limited by GCN depth and training efficiency. BNS-GCN does not sample neighbor nodes, which significantly reduces both estimation variance and sampling overhead.
- Layer sampling: BNS-GCN resembles layer sampling in that nodes in the same partition share the same sampled boundary nodes in the previous layer. Compared with FastGCN, AS-GCN, and LADIES, BNS-GCN's sampled layers are denser, which may lead to higher accuracy.
- Subgraph sampling: BNS-GCN can be viewed as subgraph sampling because it drops some boundary nodes. Cluster-GCN and GraphSAINT are also subgraph-sampling methods, but they keep very few vertices (1.3%-5.3% of all nodes), which causes high variance in the gradient estimates.
- Edge sampling: edge sampling is of limited use in distributed GCN training because it does not directly reduce the number of boundary nodes.
Experiments
The experiments use 4 datasets: Reddit, ogbn-products, Yelp, and ogbn-papers100M.
BNS-GCN is implemented on DGL, using Gloo as the communication backend.
For Reddit, ogbn-products, and Yelp, training runs on a single machine with a Xeon CPU (187GB) and 10 RTX 2080Ti GPUs (11GB); the corresponding minimum numbers of partitions are 2, 5, and 3.
For ogbn-papers100M, 32 machines are used, each with an IBM Power9 CPU (605GB) and 6 V100 GPUs (16GB).
For reproducibility and robustness, the hyperparameters were not adjusted in the experiments.
Comparison with the SOTA Baselines
Full-Graph Training Speedup
The figure below compares training throughput under distributed training against ROC and CAGNET. At p=0.01, BNS-GCN's throughput is 8.9x-16.2x that of ROC and 9.2x-13.8x that of CAGNET (c=2). Even at p=1, its throughput is 1.8x-3.7x that of ROC and 1.0x-5.5x that of CAGNET (c=2). The speedup comes not only from boundary-node sampling reducing communication: compared with ROC, BNS-GCN has no CPU-GPU swap overhead, and compared with CAGNET, it has no redundant broadcast and synchronization overhead.
Moreover, as the number of partitions increases, BNS-GCN (p < 1) performs even better, thanks to dropping boundary nodes.
Full-Graph Accuracy
This experiment shows that BNS-GCN maintains or even improves training accuracy, comparing against 7 sampling algorithms.
First, at p=1, BNS-GCN achieves comparable or higher accuracy than the other sampling algorithms, consistent with ROC's results.
At p=0.1/0.01, BNS-GCN consistently maintains or even improves the overall accuracy.
At p=0, accuracy is the worst among all settings on the three datasets, because each partition discards all boundary nodes and is trained independently.
Improvement over Sampling-based Methods
Table 5 compares training performance with sampling-based methods on the ogbn-products dataset. With p=0.1/0.01, both training performance and accuracy are better than those of the other sampling-based methods.
Performance Analysis
Training Time Improvement Breakdown
This experiment analyzes BNS-GCN's performance by breaking training time down into computation, boundary-node communication, and parameter synchronization.
Boundary-node communication dominates the training time at p=1, accounting for up to 67% on Reddit and 64% on ogbn-products.
When p<1, the boundary-node communication time drops significantly: at p=0.01 it is reduced by 74%-93% on Reddit and 83%-91% on ogbn-products compared with p=1.
Besides the single-machine multi-GPU setting, BNS-GCN's performance is also verified across multiple machines.
At p=0.01, the overall time is shortened by 99%, which again shows that communication is the main bottleneck in distributed scenarios and why BNS-GCN is preferable.
Memory Saving
This experiment measures the memory-usage reduction relative to p=1. On Reddit, p=0.01 saves 58% of memory usage, and even for the sparser ogbn-products it reduces memory usage by 27%.
The memory saving grows with the number of partitions, because the number of boundary nodes grows with the number of partitions; this further illustrates BNS-GCN's scalability for training large-scale graphs.
Generalization Improvement
To illustrate the generalization ability of BNS-GCN, this experiment tracks the convergence of test accuracy under different numbers of partitions.
The ogbn-products dataset is chosen because the distribution of its test set differs considerably from that of the training set.
With full-graph training (p=1) or fully independent training (p=0), the model quickly overfits; with p=0.1/0.01, overfitting is alleviated and the overall accuracy improves, because BNS-GCN randomly modifies the graph structure during training.
Comparatively, p=0.1 gives the best accuracy, while p=0 shows the worst convergence and the largest gap, because it completely removes the boundary nodes.
Figure 9 shows the results on the other two datasets, which largely match these expectations.
Balanced Memory Usage
To analyze the memory balance across partitions, the ogbn-papers100M dataset is divided into 192 partitions.
In the non-sampling case (p=1), memory usage is severely imbalanced: the straggler's memory usage is 20% above the limit, while three quarters of the partitions use less than 60%.
In contrast, with p=0.1/0.01, memory usage is much more balanced, and all partitions use more than 70%.
Ablation Studies
BNS-GCN with Random Partition
To verify whether the effectiveness of BNS-GCN relies on the METIS partitioner, an experiment replaces METIS partitioning with random partitioning.
As Table 7 shows, random and METIS partitioning perform comparably at p=0.1 (-0.2 to +0.27). BNS-GCN is therefore orthogonal to graph partitioning and can be applied on top of any partitioning method.
The author further experiments with different partitioning strategies to verify their impact on BNS-GCN. Table 8 shows that the throughput gains and memory savings delivered by BNS-GCN are larger under random partitioning, because random partitioning generates more boundary nodes.
The Special Case
The special case p=0 is generally not recommended:
- It gives the worst accuracy across all datasets and partitioning methods (Tables 4 and 7); under random partitioning in particular, accuracy drops from 97.11% to 93.37%.
- It converges the slowest across datasets and partition counts.
- It suffers from severe overfitting (Figure 7).
Therefore p=0.1 or p=0.01 is recommended. Regarding the choice of p, the author's experiments find that p=0.1 gives the best overall trade-off among throughput, communication reduction, memory reduction, convergence speed, accuracy, and sampling overhead.
BNS-GCN vs Boundary Edge Sampling
AliGraph, DistDGL, and GNNAutoScale ("Scalable and expressive graph neural networks via historical embeddings") assume that the communication overhead comes from the edges between partitions, so they minimize the edge cut; communication can then be further reduced by sampling edges, e.g., DropEdge, or by sampling only the boundary edges (BES) rather than edges of the whole graph.
This experiment compares BNS-GCN with DropEdge and BES; for a fair comparison, all methods drop the same number of edges.
On the Reddit dataset, the communication volumes of DropEdge and BES are 10x and 7x that of BNS-GCN, and their total training times are 2.0x and 1.4x, respectively. The reason is that multiple boundary edges may connect to the same boundary node: even if some boundary edges are dropped, the remaining edges still require that node's features to be communicated, and the communication of distributed GCN training is proportional only to the number of boundary nodes (Equation 3).
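A tiny worked example (hypothetical edge list) of this point: the features that must still be communicated are the remote endpoints of whatever cut edges remain, so dropping edges that share an endpoint saves little.

```python
# Cross-partition edges as (remote_node, local_node) pairs. Three of the five
# edges share the remote node 7.
cut_edges  = [(7, 0), (7, 1), (7, 2), (8, 3), (9, 4)]
kept_edges = [(7, 0), (8, 3)]  # after dropping 3 of 5 boundary edges (60%)

needed_before = {r for r, _ in cut_edges}    # {7, 8, 9}: 3 boundary nodes
needed_after  = {r for r, _ in kept_edges}   # {7, 8}:    still 2 (only a 33% cut)
print(len(needed_before), len(needed_after))
```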
BNS-GCN Benefit on GAT
To verify BNS-GCN's generality across models, an experiment with a GAT model (2 layers, 10 partitions) shows that BNS-GCN still achieves a 58%-106% speedup.
This article is reprinted from: https://sanzo.top/Paper/bns-gcn/