TY - GEN
T1 - Exploiting Frame Similarity for Efficient Inference on Edge Devices
AU - Ying, Ziyu
AU - Zhao, Shulin
AU - Zhang, Haibo
AU - Mishra, Cyan Subhra
AU - Bhuyan, Sandeepa
AU - Kandemir, Mahmut T.
AU - Sivasubramaniam, Anand
AU - Das, Chita R.
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Deep neural networks (DNNs) are widely used in various computer vision tasks as they can achieve very high accuracy. However, the large number of parameters employed in DNNs can result in long inference times, making it challenging to deploy them on compute- and memory-constrained mobile/edge devices. To speed up DNN inference, some existing works employ compression (model pruning or quantization) or enhanced hardware; however, most prior works focus on improving the model structure or implementing custom accelerators. In contrast to prior work, in this paper, we target the video data processed by edge devices and study the similarity between frames. Based on this observation, we propose two runtime approaches that boost the performance of the inference process while maintaining high accuracy. Specifically, exploiting the similarities between successive video frames, we propose a frame-level compute reuse algorithm based on the motion vectors of each frame. With frame-level reuse, we can skip inference on 53% of frames with negligible overhead, while incurring less than a 1% mAP (accuracy) drop on the object detection task. Additionally, we implement a partial inference scheme that enables region/tile-level reuse. Our experiments on a representative mobile device (Pixel 3 phone) show that the proposed partial inference scheme achieves a 2× speedup over the baseline approach that performs full inference on every frame. We integrate these two data reuse algorithms to accelerate neural network inference and improve its energy efficiency. More specifically, for each frame in the video, we can dynamically select among (i) performing a full inference, (ii) performing a partial inference, or (iii) skipping the inference altogether. Our experimental evaluations using six different videos reveal that the proposed schemes deliver up to 80% (56% on average) energy savings and up to a 2.2× speedup compared to the conventional scheme that performs full inference on every frame, while losing less than 2% accuracy. Additionally, our experimental analysis indicates that our approach outperforms state-of-the-art work in terms of accuracy and/or performance/energy savings.
AB - Deep neural networks (DNNs) are widely used in various computer vision tasks as they can achieve very high accuracy. However, the large number of parameters employed in DNNs can result in long inference times, making it challenging to deploy them on compute- and memory-constrained mobile/edge devices. To speed up DNN inference, some existing works employ compression (model pruning or quantization) or enhanced hardware; however, most prior works focus on improving the model structure or implementing custom accelerators. In contrast to prior work, in this paper, we target the video data processed by edge devices and study the similarity between frames. Based on this observation, we propose two runtime approaches that boost the performance of the inference process while maintaining high accuracy. Specifically, exploiting the similarities between successive video frames, we propose a frame-level compute reuse algorithm based on the motion vectors of each frame. With frame-level reuse, we can skip inference on 53% of frames with negligible overhead, while incurring less than a 1% mAP (accuracy) drop on the object detection task. Additionally, we implement a partial inference scheme that enables region/tile-level reuse. Our experiments on a representative mobile device (Pixel 3 phone) show that the proposed partial inference scheme achieves a 2× speedup over the baseline approach that performs full inference on every frame. We integrate these two data reuse algorithms to accelerate neural network inference and improve its energy efficiency. More specifically, for each frame in the video, we can dynamically select among (i) performing a full inference, (ii) performing a partial inference, or (iii) skipping the inference altogether. Our experimental evaluations using six different videos reveal that the proposed schemes deliver up to 80% (56% on average) energy savings and up to a 2.2× speedup compared to the conventional scheme that performs full inference on every frame, while losing less than 2% accuracy. Additionally, our experimental analysis indicates that our approach outperforms state-of-the-art work in terms of accuracy and/or performance/energy savings.
UR - http://www.scopus.com/inward/record.url?scp=85140921458&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140921458&partnerID=8YFLogxK
U2 - 10.1109/ICDCS54860.2022.00107
DO - 10.1109/ICDCS54860.2022.00107
M3 - Conference contribution
AN - SCOPUS:85140921458
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 1073
EP - 1084
BT - Proceedings - 2022 IEEE 42nd International Conference on Distributed Computing Systems, ICDCS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 42nd IEEE International Conference on Distributed Computing Systems, ICDCS 2022
Y2 - 10 July 2022 through 13 July 2022
ER -