CS-Mixer: A Cross-Scale Vision Multi-Layer Perceptron with Spatial–Channel Mixing

Jonathan Cui, David A. Araujo, Suman Saha, Md Faisal Kabir

Research output: Contribution to journal › Article › peer-review

Abstract

Despite simpler architectural designs compared with Vision Transformers and Convolutional Neural Networks, Vision MLPs have demonstrated strong performance and high data efficiency for image classification and semantic segmentation. Following pioneering works such as MLP-Mixers and gMLPs, later research proposed a plethora of Vision MLP architectures that achieve token-mixing with specifically engineered convolution- or Attention-like mechanisms. However, existing methods such as S2-MLPs and PoolFormers typically model spatial information in equal-sized spatial regions and do not consider cross-scale spatial interactions, thus delivering subpar performance compared with Transformer models that employ global token-mixing. Further, these MLP token-mixers, along with most Vision Transformers, only model 1- or 2-axis correlations among space and channels, avoiding simultaneous 3-axis spatial–channel mixing due to its computational demands. We therefore propose CS-Mixer, a hierarchical Vision MLP that learns dynamic low-rank transformations for tokens aggregated across scales, both locally and globally. Such aggregation allows for token-mixing that explicitly models spatial–channel interactions, made computationally possible by a multi-head design that projects to low-dimensional subspaces. The proposed methodology achieves competitive results on popular image recognition benchmarks without incurring substantially more compute. Our largest model, CS-Mixer-L, reaches 83.2% top-1 accuracy on ImageNet-1k with 13.7 GFLOPs and 94 M parameters.
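To make the idea of joint spatial–channel mixing through low-dimensional heads more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the function names (`local_groups`, `spatial_channel_mix`), the weight shapes, the non-overlapping window grouping, and the use of fixed (rather than dynamic, input-dependent) weights are all assumptions made for illustration. It covers only local aggregation at a single scale.

```python
import numpy as np

def local_groups(x, window):
    """Split an (H, W, C) feature map into non-overlapping window x window
    groups of tokens. Returns an array of shape (G, n, C) with
    G = (H // window) * (W // window) and n = window * window.
    Hypothetical helper; assumes H and W are divisible by `window`."""
    H, W, C = x.shape
    x = x.reshape(H // window, window, W // window, window, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (groups_h, groups_w, window, window, C)
    return x.reshape(-1, window * window, C)

def spatial_channel_mix(groups, w_down, w_mix, w_up):
    """Mix jointly over token position and channel within each group.

    groups: (G, n, C) tokens, grouped as above
    w_down: (heads, C, d)       per-head projection to a d-dim subspace
    w_mix:  (heads, n*d, n*d)   per-head mixing over the flattened (n, d) axes
    w_up:   (heads, d, C)       per-head projection back to C channels
    All weights here are static placeholders (an assumption of this sketch).
    """
    G, n, C = groups.shape
    heads, _, d = w_down.shape
    # Project every token to a low-dimensional subspace, once per head.
    z = np.einsum("gnc,hcd->ghnd", groups, w_down)
    # Flatten space and channel so one matrix can mix both at once.
    z = z.reshape(G, heads, n * d)
    z = np.einsum("ghk,hkm->ghm", z, w_mix)
    z = z.reshape(G, heads, n, d)
    # Project back to C channels and sum the contributions of all heads.
    return np.einsum("ghnd,hdc->gnc", z, w_up)
```

For example, with 4 × 4 windows (n = 16 tokens) and C = 16 channels, a dense map over the flattened token-channel axis would need (16 · 16)² = 65,536 weights per layer. With 2 heads of width d = 4, the mixing matrices in this sketch need 2 · (16 · 4)² = 8,192 weights, plus 256 for the two projections, which illustrates why projecting to low-dimensional subspaces keeps joint mixing affordable.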

Original language: English (US)
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Artificial Intelligence
State: Accepted/In press - 2024

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Artificial Intelligence
