TY - GEN
T1 - Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks
AU - Cong, Weilin
AU - Forsati, Rana
AU - Kandemir, Mahmut
AU - Mahdavi, Mehrdad
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/8/23
Y1 - 2020/8/23
N2 - Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy to speed up training large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into embedding approximation variance in the forward stage and stochastic gradient variance in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails better generalization compared to the existing methods.
AB - Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy to speed up training large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of the empirical risk, the variance of any sampling method can be decomposed into embedding approximation variance in the forward stage and stochastic gradient variance in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails better generalization compared to the existing methods.
UR - http://www.scopus.com/inward/record.url?scp=85090421821&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090421821&partnerID=8YFLogxK
U2 - 10.1145/3394486.3403192
DO - 10.1145/3394486.3403192
M3 - Conference contribution
AN - SCOPUS:85090421821
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 1393
EP - 1403
BT - KDD 2020 - Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020
Y2 - 23 August 2020 through 27 August 2020
ER -