NICER-SLAM

Neural Implicit Scene Encoding for RGB SLAM

3DV 2024 (Oral, Best Paper Honorable Mention)

Zihan Zhu1*     Songyou Peng1,2*     Viktor Larsson3     Zhaopeng Cui4     Martin R. Oswald1,5
Andreas Geiger6     Marc Pollefeys1,7

* Equal Contribution

1ETH Zürich     2MPI for Intelligent Systems, Tübingen   3Lund University  
4State Key Lab of CAD&CG, Zhejiang University   5University of Amsterdam  
6University of Tübingen, Tübingen AI Center     7Microsoft

TL;DR: We present NICER-SLAM, a dense RGB SLAM system that simultaneously optimizes for camera poses and a hierarchical neural implicit map representation, allowing high-quality 3D reconstruction and novel view synthesis.

Abstract

Neural implicit representations have recently become popular in simultaneous localization and mapping (SLAM), especially in dense visual SLAM. However, existing works either rely on RGB-D sensors or require a separate monocular SLAM approach for camera tracking, and fail to produce high-fidelity dense 3D reconstructions. To address these shortcomings, we present NICER-SLAM, a dense RGB SLAM system that simultaneously optimizes for camera poses and a hierarchical neural implicit map representation, which also allows for high-quality novel view synthesis. To facilitate the mapping optimization, we integrate additional supervision signals, including easy-to-obtain monocular geometric cues and optical flow, and introduce a simple warping loss to further enforce geometric consistency. Moreover, to boost performance in complex large-scale scenes, we propose a locally adaptive transformation from signed distance functions (SDFs) to density in the volume rendering equation. On multiple challenging indoor and outdoor datasets, NICER-SLAM demonstrates strong performance in dense mapping, novel view synthesis, and tracking, even competitive with recent RGB-D SLAM systems.
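To make the last point concrete, here is a minimal sketch of an SDF-to-density transform for volume rendering, assuming a VolSDF-style Laplace-CDF mapping in which the scale beta is allowed to vary per sample (standing in for the locally adaptive transformation above; the paper's exact parameterization may differ):

import torch

def sdf_to_density(sdf: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Map signed distances to volume-rendering densities.

    VolSDF-style transform: sigma = (1 / beta) * Psi_beta(-sdf), where
    Psi_beta is the CDF of a zero-mean Laplace distribution with scale
    beta. Passing a per-sample `beta` makes the transform locally
    adaptive; names and shapes here are illustrative only, and a real
    implementation would clamp the exponents for numerical stability.
    """
    alpha = 1.0 / beta
    return alpha * torch.where(
        sdf >= 0,
        0.5 * torch.exp(-sdf / beta),       # outside the surface
        1.0 - 0.5 * torch.exp(sdf / beta),  # inside the surface
    )

A small beta concentrates density tightly around the zero level set, while a larger beta smooths it out, which is why adapting beta locally can help in large-scale scenes with varying geometric detail.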

Video

Method

NICER-SLAM takes only an RGB stream as input and outputs both the camera poses and a learned hierarchical scene representation for geometry and color. To enable end-to-end joint mapping and tracking, we render predicted colors, depths, and normals, and optimize them with respect to the input RGB frames and monocular geometric cues. We further enforce geometric consistency with an RGB warping loss and an optical flow loss.
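As a rough illustration of how these supervision signals could combine, the sketch below assembles a per-iteration mapping loss. The weights, dictionary keys, and the omission of scale-and-shift alignment for the monocular depth are all simplifications for illustration, not the paper's actual implementation:

import torch
import torch.nn.functional as F

# Illustrative weights only; not the values used in the paper.
W_RGB, W_DEPTH, W_NORMAL, W_WARP, W_FLOW = 1.0, 0.1, 0.05, 0.5, 0.1

def mapping_loss(pred: dict, target: dict) -> torch.Tensor:
    """Combine the supervision signals described above.

    `pred` holds per-ray rendered quantities (rgb, depth, normal),
    colors warped from a keyframe into the current view, and the flow
    induced by rendered geometry and estimated poses; `target` holds
    the input RGB, monocular depth and normal cues, and flow from an
    off-the-shelf estimator. All keys are hypothetical.
    """
    loss = W_RGB * F.l1_loss(pred["rgb"], target["rgb"])
    # Monocular depth cues are only defined up to scale and shift; a
    # real system would align them to the rendered depth first.
    loss = loss + W_DEPTH * F.l1_loss(pred["depth"], target["mono_depth"])
    # Angular agreement with the monocular normal cue.
    loss = loss + W_NORMAL * (
        1.0 - F.cosine_similarity(pred["normal"], target["mono_normal"], dim=-1)
    ).mean()
    # Warping loss: colors warped from another view should match the
    # observed colors, enforcing multi-view geometric consistency.
    loss = loss + W_WARP * F.l1_loss(pred["warped_rgb"], target["rgb"])
    # Induced flow should match the externally estimated optical flow.
    loss = loss + W_FLOW * F.l1_loss(pred["flow"], target["est_flow"])
    return loss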

Additional Results

Replica Dataset

Reconstruction

Rendering

7-Scenes Dataset

[Video comparison: NICE-SLAM · NeRF-SLAM · DIM-SLAM · DROID-SLAM · NICER-SLAM (Ours) · GT]

Self-Captured Outdoor Dataset

Rendering

[Video comparison: COLMAP · TANDEM · NeRF-SLAM · DROID-SLAM · NICER-SLAM (Ours)]

Reconstruction

[Video comparison: COLMAP · TANDEM · NeRF-SLAM · DROID-SLAM · NICER-SLAM (Ours)]

BibTeX

@inproceedings{Zhu2023NICER,
  author    = {Zhu, Zihan and Peng, Songyou and Larsson, Viktor and Cui, Zhaopeng and Oswald, Martin R. and Geiger, Andreas and Pollefeys, Marc},
  title     = {NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM},
  booktitle = {International Conference on 3D Vision (3DV)},
  month     = {March},
  year      = {2024},
}

Acknowledgements

This project is partially supported by the SONY Research Award Program and a research grant from FIFA. The authors thank the Max Planck ETH Center for Learning Systems (CLS) for supporting Songyou Peng and the strategic research project ELLIIT for supporting Viktor Larsson. We thank Zehao Yu, Yiming Zhao, Weicai Ye, Boyang Sun, Jianhao Zheng, and Heng Li for helpful discussions.