Self-supervised monocular depth and ego-motion estimation for CT-bronchoscopy fusion

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


The management of lung cancer requires robust diagnostic tools, and three-dimensional (3D) computed tomography (CT) imaging and bronchoscopy are pivotal, complementary resources. Bronchoscopy captures live endobronchial video that shows the interior of the airway tree in fine detail, while 3D CT scans supply extensive anatomical knowledge. A significant gap persists, however, in linking these data-rich sources, in particular in fusing video from bronchoscopic airway exams with airway surface data from 3D CT scans. The main difficulty is obtaining depth and camera pose information simultaneously for bronchoscopic video frames. Solving this problem would facilitate CT-video fusion/rendering, multimodal registration, and 3D cancer lesion localization. Deep-learning networks have recently been employed to estimate depth and ego-motion. Unfortunately, the required training data, ground-truth pairs of bronchoscopic video frames and corresponding depth maps, are difficult to acquire. To work around this, generative adversarial networks (GANs) have shown promise in domain transformation, converting CT-based endoluminal surface views into synthesized bronchoscopic frames. Each synthesized view is therefore aligned with its CT-derived depth map, which yields valuable training data. Nonetheless, such domain transformation techniques do not exploit frame-sequence information and supply no information about the camera's ego-motion. Parallel studies in other domains, such as endoscopy, have relied on the photometric consistency between adjacent frames to estimate depth and ego-motion jointly. In the airway, however, the smooth, texture-less endoluminal surface makes it difficult for such methods to produce sharp, detailed depth maps.
To address this problem, we present a self-supervised training strategy for the Monodepth2 deep learning architecture that combines domain transformation with photometric consistency, improving depth and ego-motion prediction for bronchoscopic video frames. Results on well-registered test data show that the proposed strategy produces clear and precise predictions. In addition, reference scaling factors derived from the test dataset enable real-world applications such as 3D surface reconstruction, camera trajectory generation, and fusion between CT and bronchoscopic video.
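
Self-supervised monocular networks such as Monodepth2 predict depth and camera translation only up to an unknown scale, which is why a reference scaling factor is needed before the predictions can be used metrically. The abstract does not say how the factors are computed; the sketch below assumes the per-frame median-ratio scaling commonly used when evaluating Monodepth2-style models, averaged over a test set. All function names, the flattened depth-map representation, and the `min_depth` threshold are hypothetical choices for illustration.

```python
from statistics import mean, median, pstdev


def median_scale_factor(pred_depth, ref_depth, min_depth=1e-3):
    """Factor s such that s * pred_depth matches ref_depth in the median.

    Both inputs are flattened depth maps (sequences of floats) of equal
    length. Pixels at or below min_depth in either map are ignored.
    Assumption: median-ratio scaling, not necessarily the paper's method.
    """
    if len(pred_depth) != len(ref_depth):
        raise ValueError("depth maps must have the same number of pixels")
    pairs = [(p, r) for p, r in zip(pred_depth, ref_depth)
             if p > min_depth and r > min_depth]
    if not pairs:
        raise ValueError("no valid pixels to compare")
    pred_valid = [p for p, _ in pairs]
    ref_valid = [r for _, r in pairs]
    return median(ref_valid) / median(pred_valid)


def reference_scale(pred_depths, ref_depths):
    """Mean and population standard deviation of per-frame scale factors."""
    factors = [median_scale_factor(p, r)
               for p, r in zip(pred_depths, ref_depths)]
    return mean(factors), pstdev(factors)


def scale_translation(translation, scale):
    """Apply the depth scale to a predicted camera translation vector.

    Depth and translation share the same unknown scale, so the same
    factor converts both to the reference (CT) units.
    """
    return [scale * t for t in translation]
```

A small spread of the per-frame factors around their mean would suggest that a single reference factor is adequate for the whole dataset; a large spread would suggest the scale drifts between frames and should be estimated per sequence instead.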

Original language: English (US)
Title of host publication: Medical Imaging 2024
Subtitle of host publication: Image-Guided Procedures, Robotic Interventions, and Modeling
Editors: Jeffrey H. Siewerdsen, Maryam E. Rettmann
ISBN (Electronic): 9781510671607
State: Published - 2024
Event: Medical Imaging 2024: Image-Guided Procedures, Robotic Interventions, and Modeling - San Diego, United States
Duration: Feb 19 2024 → Feb 22 2024

Publication series

Name: Progress in Biomedical Optics and Imaging - Proceedings of SPIE
ISSN (Print): 1605-7422


Conference: Medical Imaging 2024: Image-Guided Procedures, Robotic Interventions, and Modeling
Country/Territory: United States
City: San Diego

All Science Journal Classification (ASJC) codes

  • Electronic, Optical and Magnetic Materials
  • Atomic and Molecular Physics, and Optics
  • Biomaterials
  • Radiology, Nuclear Medicine and Imaging
