TY - GEN
T1 - Not All Layers Are Equal: A Layer-Wise Adaptive Approach Toward Large-Scale DNN Training
T2 - 31st ACM World Wide Web Conference, WWW 2022
AU - Ko, Yunyong
AU - Lee, Dongwon
AU - Kim, Sang Wook
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/4/25
Y1 - 2022/4/25
AB - Large-batch training with data parallelism is a widely adopted approach to training large deep neural network (DNN) models efficiently. Large-batch training, however, often suffers from degraded model quality because it performs fewer iterations. To alleviate this problem, learning rate (lr) scaling methods are commonly applied, which increase the learning rate so that each iteration makes a larger update. Unfortunately, we observe that large-batch training with state-of-the-art lr scaling methods still often degrades the model quality once the batch size crosses a specific limit, rendering such methods less useful. To explain this phenomenon, we hypothesize that existing lr scaling methods overlook the subtle but important differences across "layers" in training, which results in the degradation of the overall model quality. Based on this hypothesis, we propose a novel approach (LENA) to learning rate scaling for large-scale DNN training, employing: (1) layer-wise adaptive lr scaling, which adjusts the lr of each layer individually, and (2) layer-wise state-aware warm-up, which tracks the training state of each layer and finishes its warm-up automatically. A comprehensive evaluation over a range of batch sizes demonstrates that LENA achieves the target accuracy (i.e., the accuracy of single-worker training) (1) within the fewest iterations across different batch sizes (up to 45.2% fewer iterations and 44.7% less time than the existing state-of-the-art method), and (2) for very large batch sizes that exceed the limits of all baselines.
UR - http://www.scopus.com/inward/record.url?scp=85129795467&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85129795467&partnerID=8YFLogxK
U2 - 10.1145/3485447.3511989
DO - 10.1145/3485447.3511989
M3 - Conference contribution
AN - SCOPUS:85129795467
T3 - WWW 2022 - Proceedings of the ACM Web Conference 2022
SP - 1851
EP - 1859
BT - WWW 2022 - Proceedings of the ACM Web Conference 2022
PB - Association for Computing Machinery, Inc
Y2 - 25 April 2022 through 29 April 2022
ER -