TY - GEN
T1 - S2M3
T2 - 45th IEEE International Conference on Distributed Computing Systems, ICDCS 2025
AU - Yoon, Jin Yi
AU - Lee, Ji Ho
AU - He, Ting
AU - Choi, Nakjung
AU - Ji, Bo
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
AB - With the advancement of Artificial Intelligence (AI) towards multiple modalities (language, vision, speech, etc.), multi-modal models have increasingly been used across various applications (e.g., visual question answering or image generation/captioning). Despite the success of AI as a service for multi-modal applications, it relies heavily on the cloud and is thus constrained by bandwidth, latency, privacy concerns, and unavailability under network or server failures. While on-device AI is becoming popular, supporting multiple tasks on edge devices poses significant resource challenges. To address this, we introduce S2M3, a split-and-share multi-modal architecture for multi-task inference on edge devices. Inspired by the general-purpose nature of multi-modal models, which are composed of multiple modules (encoder, decoder, classifier, etc.), we propose to split multi-modal models into functional-level modules and then share common modules across tasks, thereby reducing resource usage. To address the cross-model dependencies arising from module sharing, we propose a greedy module-level placement algorithm with per-request parallel routing that prioritizes compute-intensive modules. Through experiments on a testbed consisting of 14 multi-modal models across 5 tasks and 10 benchmarks, we demonstrate that S2M3 reduces memory usage by up to 50% and 62% in single-task and multi-task settings, respectively, without sacrificing accuracy. Furthermore, S2M3 achieves optimal placement in 89 out of 95 instances (93.7%) while reducing inference latency by up to 56.9% on resource-constrained devices compared to cloud AI.
UR - https://www.scopus.com/pages/publications/105019744228
DO - 10.1109/ICDCS63083.2025.00089
M3 - Conference contribution
AN - SCOPUS:105019744228
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 868
EP - 878
BT - Proceedings - 2025 IEEE 45th International Conference on Distributed Computing Systems, ICDCS 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 20 July 2025 through 23 July 2025
ER -