TY - GEN
T1 - Multi-Modal Fusion of Event and RGB for Monocular Depth Estimation Using a Unified Transformer-based Architecture
AU - Devulapally, Anusha
AU - Khan, Md Fahim Faysal
AU - Advani, Siddharth
AU - Narayanan, Vijaykrishnan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In the field of robotics and autonomous navigation, accurate pixel-level depth estimation has gained significant importance. Event cameras, or dynamic vision sensors, capture asynchronous changes in brightness at the pixel level, offering benefits such as high temporal resolution, no motion blur, and a wide dynamic range. However, unlike traditional cameras that measure absolute intensity, event cameras cannot provide scene context. Efficiently combining the advantages of asynchronous events and synchronous RGB images to enhance depth estimation remains a challenge. In our study, we introduce a unified transformer that combines the event and RGB modalities to achieve precise depth prediction. In contrast to separate transformers for each input modality, a unified transformer model captures inter-modal dependencies and uses self-attention to enhance event-RGB contextual interactions. This approach exceeds the performance of the recurrent neural network (RNN) methods used in state-of-the-art models. To encode the temporal information from events, ConvLSTMs are used before the transformer to improve depth estimation. Our proposed architecture outperforms existing approaches in terms of absolute mean depth error, achieving state-of-the-art results in most cases. Improvements are also observed in other metrics, such as RMSE, absolute relative difference, and depth-threshold accuracy, compared with existing approaches. The source code is available at: https://github.com/anusha-devulapally/ER-F2D.
AB - In the field of robotics and autonomous navigation, accurate pixel-level depth estimation has gained significant importance. Event cameras, or dynamic vision sensors, capture asynchronous changes in brightness at the pixel level, offering benefits such as high temporal resolution, no motion blur, and a wide dynamic range. However, unlike traditional cameras that measure absolute intensity, event cameras cannot provide scene context. Efficiently combining the advantages of asynchronous events and synchronous RGB images to enhance depth estimation remains a challenge. In our study, we introduce a unified transformer that combines the event and RGB modalities to achieve precise depth prediction. In contrast to separate transformers for each input modality, a unified transformer model captures inter-modal dependencies and uses self-attention to enhance event-RGB contextual interactions. This approach exceeds the performance of the recurrent neural network (RNN) methods used in state-of-the-art models. To encode the temporal information from events, ConvLSTMs are used before the transformer to improve depth estimation. Our proposed architecture outperforms existing approaches in terms of absolute mean depth error, achieving state-of-the-art results in most cases. Improvements are also observed in other metrics, such as RMSE, absolute relative difference, and depth-threshold accuracy, compared with existing approaches. The source code is available at: https://github.com/anusha-devulapally/ER-F2D.
UR - http://www.scopus.com/inward/record.url?scp=85206463180&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85206463180&partnerID=8YFLogxK
U2 - 10.1109/CVPRW63382.2024.00213
DO - 10.1109/CVPRW63382.2024.00213
M3 - Conference contribution
AN - SCOPUS:85206463180
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 2081
EP - 2089
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
PB - IEEE Computer Society
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
Y2 - 16 June 2024 through 22 June 2024
ER -