Multi-Modal Fusion of Event and RGB for Monocular Depth Estimation Using a Unified Transformer-based Architecture

Anusha Devulapally, Md Fahim Faysal Khan, Siddharth Advani, Vijaykrishnan Narayanan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In the field of robotics and autonomous navigation, accurate pixel-level depth estimation has gained significant importance. Event cameras, or dynamic vision sensors, capture asynchronous changes in brightness at the pixel level, offering benefits such as high temporal resolution, no motion blur, and a wide dynamic range. However, unlike traditional cameras that measure absolute intensity, event cameras cannot provide full scene context. Efficiently combining the advantages of asynchronous events and synchronous RGB images to improve depth estimation remains a challenge. In this study, we introduce a unified transformer that fuses the event and RGB modalities to achieve precise depth prediction. In contrast to separate transformers for each input modality, a unified transformer model captures inter-modal dependencies and uses self-attention to enhance event-RGB contextual interactions. This approach exceeds the performance of the recurrent neural network (RNN) methods used in state-of-the-art models. To encode the temporal information from events, convLSTMs are applied before the transformer, further improving depth estimation. Our proposed architecture outperforms existing approaches in absolute mean depth error, achieving state-of-the-art results in most cases, and is also competitive on other metrics such as RMSE, absolute relative difference, and depth-threshold accuracy. The source code is available at: https://github.com/anusha-devulapally/ER-F2D.
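The abstract describes the architecture only at a high level. The sketch below illustrates one plausible reading of it and is not the authors' implementation (see the linked repository for the official code): a small convLSTM encodes a sequence of event voxel grids, both modalities are patch-embedded into tokens, and a single transformer encoder applies self-attention over the joint event-RGB token sequence before a lightweight head regresses per-pixel depth. All layer sizes, the patch size, the event voxel-grid format, and the decoder are illustrative assumptions.

```python
# Illustrative sketch only; layer sizes, patch size, and decoder are assumptions,
# not the published ER-F2D architecture.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Single convolutional LSTM cell for temporal encoding of event frames."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces the input, forget, output, and cell gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class EventRGBFusionDepth(nn.Module):
    """ConvLSTM event encoder + unified transformer over event and RGB tokens."""

    def __init__(self, event_bins=5, embed=128, patch=16, heads=4, layers=4):
        super().__init__()
        self.patch = patch
        self.event_lstm = ConvLSTMCell(event_bins, embed)
        # Patch embeddings turn each modality into a sequence of tokens.
        self.rgb_embed = nn.Conv2d(3, embed, patch, stride=patch)
        self.evt_embed = nn.Conv2d(embed, embed, patch, stride=patch)
        self.modality_tok = nn.Parameter(torch.zeros(2, embed))  # event vs. RGB
        enc_layer = nn.TransformerEncoderLayer(embed, heads, 4 * embed,
                                               batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, layers)
        # Simple head mapping fused tokens back to a dense depth map.
        self.head = nn.Sequential(nn.Conv2d(embed, 64, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(64, 1, 1))

    def forward(self, event_voxels, rgb):
        # event_voxels: (B, T, bins, H, W) sequence of event voxel grids.
        B, T, _, H, W = event_voxels.shape
        h = event_voxels.new_zeros(B, self.event_lstm.hid_ch, H, W)
        c = torch.zeros_like(h)
        for t in range(T):  # recurrent temporal encoding of the event stream
            h, c = self.event_lstm(event_voxels[:, t], (h, c))

        evt_tok = self.evt_embed(h).flatten(2).transpose(1, 2)    # (B, N, E)
        rgb_tok = self.rgb_embed(rgb).flatten(2).transpose(1, 2)  # (B, N, E)
        tokens = torch.cat([evt_tok + self.modality_tok[0],
                            rgb_tok + self.modality_tok[1]], dim=1)

        fused = self.fusion(tokens)  # self-attention over the joint sequence
        # Use the RGB-aligned half of the sequence for the dense prediction.
        n = rgb_tok.shape[1]
        grid = fused[:, -n:].transpose(1, 2).reshape(B, -1, H // self.patch,
                                                     W // self.patch)
        grid = nn.functional.interpolate(grid, size=(H, W), mode="bilinear",
                                         align_corners=False)
        return self.head(grid)  # (B, 1, H, W) predicted depth


if __name__ == "__main__":
    model = EventRGBFusionDepth()
    events = torch.randn(2, 4, 5, 224, 224)  # 4-step event voxel sequences
    rgb = torch.randn(2, 3, 224, 224)
    print(model(events, rgb).shape)          # torch.Size([2, 1, 224, 224])
```

The key design point this sketch tries to capture is the single (unified) transformer: because event and RGB tokens are concatenated into one sequence, every self-attention layer can attend across modalities, rather than fusing two separately encoded streams afterwards.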

Original language: English (US)
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
Publisher: IEEE Computer Society
Pages: 2081-2089
Number of pages: 9
ISBN (Electronic): 9798350365474
DOIs
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024 - Seattle, United States
Duration: Jun 16 2024 - Jun 22 2024

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
ISSN (Print): 2160-7508
ISSN (Electronic): 2160-7516

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
Country/Territory: United States
City: Seattle
Period: 6/16/24 - 6/22/24

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
