
Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization

  • Farzin Haddadpour
  • Mohammad Mahdi Kamani
  • Mehrdad Mahdavi
  • Viveck Ramesh Cadambe

Research output: Contribution to journal › Conference article › peer-review

Abstract

Communication overhead is one of the key challenges that hinder the scalability of distributed optimization algorithms for training large neural networks. In recent years, there has been a great deal of research on alleviating communication cost by compressing the gradient vector or by using local updates with periodic model averaging. In this paper, we advocate the use of redundancy to obtain communication-efficient distributed stochastic algorithms for non-convex optimization. In particular, we show, both theoretically and empirically, that by properly infusing redundancy into the training data together with model averaging, it is possible to significantly reduce the number of communication rounds. More precisely, we show that redundancy reduces the residual error of local averaging, so the same level of accuracy is reached with fewer communication rounds than previous algorithms require. Empirical studies on the CIFAR10, CIFAR100, and ImageNet datasets in a distributed environment complement our theoretical results; they show that our algorithms have additional beneficial aspects, including tolerance to failures and greater gradient diversity.
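
The following is a minimal, illustrative sketch (not the authors' actual implementation) of the general idea the abstract describes: workers run several local SGD steps between communication rounds and periodically average their models, while each worker's shard is augmented with redundant copies of other workers' data. All names and parameters here (the synthetic least-squares objective, `K`, `tau`, `redundancy`, the learning rate) are assumptions chosen for demonstration only.

```python
import numpy as np

# Sketch: local SGD with periodic model averaging over redundantly
# assigned data shards, on a synthetic least-squares problem.
# This is NOT the paper's algorithm; it only illustrates the setup.

rng = np.random.default_rng(0)

# Synthetic objective: minimize (1/n) * ||X w - y||^2.
n, d = 2000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

K = 4           # number of workers (assumed)
tau = 10        # local SGD steps between communication rounds (assumed)
rounds = 50     # number of communication (averaging) rounds (assumed)
lr = 0.05       # step size (assumed)
redundancy = 2  # each worker also holds one other worker's shard (assumed)

# Partition the data evenly, then give each worker its own shard plus the
# next (redundancy - 1) shards in cyclic order, infusing redundancy.
shards = np.array_split(rng.permutation(n), K)
worker_idx = [np.concatenate([shards[(k + j) % K] for j in range(redundancy)])
              for k in range(K)]

w = np.zeros(d)  # global model
for _ in range(rounds):
    local_models = []
    for k in range(K):
        wk = w.copy()
        idx = worker_idx[k]
        for _ in range(tau):
            batch = rng.choice(idx, size=32, replace=False)
            grad = 2.0 * X[batch].T @ (X[batch] @ wk - y[batch]) / len(batch)
            wk -= lr * grad
        local_models.append(wk)
    # One communication round: average the local models.
    w = np.mean(local_models, axis=0)

print("final objective:", np.mean((X @ w - y) ** 2))
```

In this toy setting, increasing `redundancy` makes each worker's local objective a better proxy for the global one, which is the intuition behind reducing the residual error of local averaging and thus the number of communication rounds.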

Original language: English (US)
Pages (from-to): 2545-2554
Number of pages: 10
Journal: Proceedings of Machine Learning Research
Volume: 97
State: Published - 2019
Event: 36th International Conference on Machine Learning, ICML 2019 - Long Beach, United States
Duration: Jun 9, 2019 to Jun 15, 2019

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
