TY - JOUR
T1 - CS-Mixer
T2 - A Cross-Scale Vision Multi-Layer Perceptron with Spatial–Channel Mixing
AU - Cui, Jonathan
AU - Araujo, David A.
AU - Saha, Suman
AU - Kabir, Md Faisal
N1 - Publisher Copyright:
IEEE
PY - 2024
Y1 - 2024
AB - Despite simpler architectural designs compared with Vision Transformers and Convolutional Neural Networks, Vision MLPs have demonstrated strong performance and high data efficiency for image classification and semantic segmentation. Following pioneering works such as MLP-Mixers and gMLPs, later research proposed a plethora of Vision MLP architectures that achieve token-mixing with specifically engineered convolution- or Attention-like mechanisms. However, existing methods such as S2-MLPs and PoolFormers typically model spatial information in equal-sized spatial regions and do not consider cross-scale spatial interactions, thus delivering subpar performance compared with Transformer models that employ global token-mixing. Further, these MLP token-mixers, along with most Vision Transformers, only model 1- or 2-axis correlations among space and channels, avoiding simultaneous 3-axis spatial–channel mixing due to its computational demands. We therefore propose CS-Mixer, a hierarchical Vision MLP that learns dynamic low-rank transformations for tokens aggregated across scales, both locally and globally. Such aggregation allows for token-mixing that explicitly models spatial–channel interactions, made computationally possible by a multi-head design that projects to low-dimensional subspaces. The proposed methodology achieves competitive results on popular image recognition benchmarks without incurring substantially more compute. Our largest model, CS-Mixer-L, reaches 83.2% top-1 accuracy on ImageNet-1k with 13.7 GFLOPs and 94M parameters.
UR - http://www.scopus.com/inward/record.url?scp=85196546638&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85196546638&partnerID=8YFLogxK
U2 - 10.1109/TAI.2024.3415551
DO - 10.1109/TAI.2024.3415551
M3 - Article
AN - SCOPUS:85196546638
SN - 2691-4581
SP - 1
EP - 13
JO - IEEE Transactions on Artificial Intelligence
JF - IEEE Transactions on Artificial Intelligence
ER -