3D Reconstruction and Radiance Fields




3D Reconstruction Background
Two end games of computer vision and graphics are a constantly updated, open source, photorealistic 3D reconstruction (map) of the world and a fast method for high quality object capture. High quality world maps enable self-driving cars and drones to localize themselves against the map, phones and VR/AR devices to localize themselves and project realistic graphics, and motion planning based on ground truth from the existing map. High quality object capture promises to revolutionize several industries and decentralize technology. In a hypothetical future, as the real world is observed by our cameras, the images could be sent to the cloud and the world map continuously updated, while the map itself is continuously refined with bundle adjustment. Google, along with the broader computer vision and graphics communities, has been thinking about this since the early 2000s with Google Street View and Waymo cars. I assume Google uses Street View to localize their cars and for ground truthing against the world map. They drive cars around with a sensor suite and run visual SLAM (aka structure from motion) on the world to give us Street View; Ceres Solver is what they use to optimize the world map with bundle adjustment. Visual SLAM gives us an RGBD point cloud that is then meshed with techniques such as Poisson Reconstruction, and this meshed representation is the map.
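To make the bundle adjustment step a little more concrete, here is a minimal sketch of the reprojection-error residual that a solver like Ceres minimizes over all cameras and points. The camera model, variable names, and toy data are my own simplified assumptions, not Google's actual pipeline.

```python
import numpy as np

def reprojection_residuals(R, t, K, points_3d, observations_2d):
    """Residuals bundle adjustment drives toward zero for one camera.

    R: (3, 3) world-to-camera rotation, t: (3,) translation,
    K: (3, 3) intrinsics, points_3d: (N, 3) world points,
    observations_2d: (N, 2) measured pixel locations.
    """
    cam_points = points_3d @ R.T + t             # world -> camera frame
    proj = cam_points @ K.T                      # homogeneous pixels
    pixels = proj[:, :2] / proj[:, 2:3]          # perspective divide
    # Bundle adjustment jointly adjusts R, t, and points_3d across all
    # cameras so these reprojection residuals shrink toward zero.
    return (pixels - observations_2d).ravel()

# Toy usage with a synthetic camera and perfectly consistent observations.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 3.0]])
obs = (pts @ K.T)[:, :2] / pts[:, 2:3]
print(reprojection_residuals(R, t, K, pts, obs))  # ~zero residuals
```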

A world map is evidently not the secret ingredient to winning self driving cars (although I think we all could use a little redefining of what winning self driving cars looks like), but it does tell you where the robot is and what is globally around it, enabling several self driving car companies to operate successfully in mapped regions (Cadillac Super Cruise, Cruise, Waymo, etcetera) and rovers to map planetary bodies. Comma.ai does not publicly state that they use maps, and instead markets their end to end, camera-only system, which is truly revolutionary and similar to Tesla Autopilot except open source (woo!). Since world map updates are slow, we can use relative depth estimation techniques: while we wait for the world map to update, we can determine what is around the robot with per frame relative depth maps. These depth maps can optionally be integrated into the metric world map through a Truncated Signed Distance Function representation (Kinect Fusion), among other techniques. Relative depth can be computed with stereo, lidar, monocular depth estimation, etcetera. Relative depth estimation is good for real time obstacle avoidance, but remember, if we have a posed camera provided by SLAM, we can always project the world map into this camera to get a depth map; it is just that the world map may not be accurate since it takes a while to update. The sketch below makes that projection step concrete.
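Here is a minimal toy example (my own, not any particular SLAM system's API) of projecting a world-map point cloud into a posed camera to read off a depth map.

```python
import numpy as np

def render_depth_from_map(points_world, R, t, K, height, width):
    """Project a world point cloud into a posed camera to get a depth map.

    R, t: world-to-camera pose from SLAM; K: camera intrinsics.
    The nearest point wins at each pixel (a crude z-buffer).
    """
    cam = points_world @ R.T + t                 # points in camera frame
    cam = cam[cam[:, 2] > 0]                     # keep points in front of camera
    proj = cam @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    z = cam[:, 2]

    depth = np.full((height, width), np.inf)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi in zip(u[valid], v[valid], z[valid]):
        depth[vi, ui] = min(depth[vi, ui], zi)   # keep the closest surface
    return depth
```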

Two paradigms have been established: one is a photorealistic map of the world, and the other is real time depth estimation. A groundbreaking hybrid of these two paradigms was introduced with LSD-SLAM, a real time direct visual SLAM system that produces semi-dense depth maps and world maps on a CPU. No feature detection or matching is required, just careful math with factor graphs to handle scale drift. However, for real time depth estimation in the case of "am I about to hit something," I believe relative depth estimation techniques (stereo, lidar, etcetera) still hold a formidable spot. See these pages on past and modern scalable solutions to visual SLAM [1, 2].


3D Reconstruction Breakthrough:
Neural Radiance Fields
I am now going to focus on getting a photorealistic map. We have all heard of the metaverse, and one version of the metaverse is this photorealistic world map with optional graphics included. The metaverse does not solely include a world map; it also entails object capture, which means getting a 3D model of an object that can be included alongside a map or rendered in a video game, for example. Neural Radiance Fields (NeRF) have redefined how to best obtain a photorealistic map. Up to now, photorealistic maps were obtained with visual SLAM (or SfM, MVS, photometric stereo, etcetera), which gives us RGBD point clouds that are then meshed with a technique such as Poisson Reconstruction to get a map. This RGBD representation is a surface level representation that bakes in the myriad light transport effects: scattering, diffusion, and caustics are superimposed onto each other and we only get the resulting RGB pixel value. This limits how photorealistic our reconstruction can be, because when we simulate synthetic lighting, the reconstruction does not accurately reflect the light. This is why Phong shading is the de facto standard in computer graphics for video game rendering: the 3D models in a video game typically cannot have more than an RGBD representation due to computation constraints. For more on shading (Bidirectional Reflectance Distribution Functions) and computer graphics, see Scratchapixel.
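Since Phong shading comes up here, below is a minimal single-light sketch of the Phong model (ambient + diffuse + specular). The material constants and toy inputs are arbitrary illustration values, not anything from a real renderer.

```python
import numpy as np

def phong_shade(normal, light_dir, view_dir, base_color,
                ka=0.1, kd=0.7, ks=0.2, shininess=32):
    """Classic Phong shading for one surface point and one light."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)

    diffuse = max(np.dot(n, l), 0.0)
    r = 2.0 * np.dot(n, l) * n - l               # light reflected about the normal
    specular = max(np.dot(r, v), 0.0) ** shininess if diffuse > 0 else 0.0

    return np.clip((ka + kd * diffuse) * base_color + ks * specular, 0.0, 1.0)

color = phong_shade(normal=np.array([0.0, 0.0, 1.0]),
                    light_dir=np.array([0.0, 1.0, 1.0]),
                    view_dir=np.array([0.0, 0.0, 1.0]),
                    base_color=np.array([0.8, 0.3, 0.3]))
print(color)
```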

The key insight of NeRF is representing scenes as a volume instead of an RGBD surface representation for view synthesis (aka projecting a photorealistic map into unseen views). This builds on light fields and volume rendering. Given posed input images from SLAM, we can train a Multi Layer Perceptron (!!!) to predict a color and volume density for a desired pose. By integrating along a ray through the volume, we can get an expected ray termination based on the estimated volume density and use this as a depth value. So, instead of running SLAM on a collection of images to get an RGBD representation, we can use NeRF to obtain a volumetric representation that gives a more physically accurate RGBD representation. Once we have the per view depth maps from NeRF, we must still backproject them into a point cloud and mesh them as usual. Caveats: input images must first be posed with SLAM, training takes about a day, and convergence relies on quality coverage of the object. For offline object capture and world mapping, though, these caveats are not really caveats and are not too different from the heuristics of existing 3D reconstruction pipelines. I can now train a NeRF MLP to get a more photorealistic mesh compared to SLAM.
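To make the integration along a ray concrete, here is a minimal numpy sketch of the quadrature NeRF uses: alpha compositing predicted densities and colors into a pixel color, and accumulating the expected ray termination as a depth estimate. The `query_mlp` stand-in below (a fuzzy sphere) is my own toy assumption; in NeRF it would be the trained MLP.

```python
import numpy as np

def render_ray(query_fn, origin, direction, near, far, n_samples=64):
    """Volume-render one ray: returns (rgb, expected_depth).

    query_fn(points, view_dir) -> (colors (N, 3), densities (N,)),
    standing in for the trained NeRF MLP.
    """
    t = np.linspace(near, far, n_samples)            # sample depths along the ray
    points = origin + t[:, None] * direction
    colors, sigma = query_fn(points, direction)

    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
    alpha = 1.0 - np.exp(-sigma * delta)             # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    weights = trans * alpha                          # contribution of each sample

    rgb = (weights[:, None] * colors).sum(axis=0)    # composited pixel color
    expected_depth = (weights * t).sum()             # expected ray termination
    return rgb, expected_depth

def query_mlp(points, view_dir):
    """Toy stand-in for the MLP: a dense sphere of radius 1 at the origin."""
    dist = np.linalg.norm(points, axis=1)
    density = np.where(dist < 1.0, 10.0, 0.0)
    colors = np.tile([0.9, 0.4, 0.1], (len(points), 1))
    return colors, density

rgb, depth = render_ray(query_mlp, np.array([0.0, 0.0, -3.0]),
                        np.array([0.0, 0.0, 1.0]), near=0.5, far=6.0)
print(rgb, depth)   # depth should be near 2.0, where the ray first hits the sphere
```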

Google recently ran NeRF on Street View images to build an extremely accurate world map (Urban Radiance Fields). Urban Radiance Fields is one of the reasons I wrote this blog: until now I assumed NeRF was mostly an object capture technique with limited robotics applications, and this changed my perspective. I probably sound like a broken record, but instead of running SLAM on the input images + X to obtain a map, they run NeRF and regularize with ground truth lidar to represent the world as a volume. This gives highly accurate depth, including for unseen views. I think NeRF will drastically improve SLAM techniques: collecting images with a drone, phone, or VR headset and then running NeRF offline will give much more accurate maps than SLAM in some instances. For example, a drone can inspect a bridge and run NeRF for offline inspection. You might be wondering what happens if something in the world moves, since this invalidates the map and the NeRF MLP must be retrained. Bundle Adjusting Radiance Fields approaches this problem by using bundle adjustment to allow for imperfectly posed input images, possibly allowing for a continuously updated map in the future. As of now, though, dynamic scenes are the main limitation of radiance field techniques for robotics applications in my opinion, although dynamic scenes might simply be the wrong problem for radiance fields.
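As a rough sketch of what lidar regularization of this kind can look like, the rendered expected depth along rays with a lidar return can be penalized against the measured range on top of the usual photometric loss. Treat this as an illustrative simplification; the actual Urban Radiance Fields paper uses additional terms beyond this.

```python
import numpy as np

def photometric_plus_lidar_loss(rendered_rgb, target_rgb,
                                expected_depth, lidar_depth, depth_weight=0.1):
    """Photometric loss plus a depth term on rays that have a lidar return.

    lidar_depth uses NaN where a ray has no return. This is a simplified
    illustration of lidar-regularized radiance field training, not the
    exact loss from the Urban Radiance Fields paper.
    """
    photometric = np.mean((rendered_rgb - target_rgb) ** 2)
    has_lidar = ~np.isnan(lidar_depth)
    if has_lidar.any():
        depth_term = np.mean((expected_depth[has_lidar] - lidar_depth[has_lidar]) ** 2)
    else:
        depth_term = 0.0
    return photometric + depth_weight * depth_term
```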


Rethinking NeRF:
Plenoxels
Recently, the paper "Plenoxels: Radiance Fields without Neural Networks" replaced the neural volume representation of NeRF with a voxel grid of spherical harmonics. The spherical harmonic coefficients are optimized during "training" and interpolated to give the predicted color and density. Plenoxels can train in about 8 minutes, compared to roughly 24 hours for NeRF. The volume rendering ideas behind the loss are the same, but the representation is now voxels. The downside is that a trained Plenoxels model is larger due to the voxel representation, but again, in many applications this is probably fine. We can now train a Plenoxel model to get a 3D model in reasonable time, which truly revolutionizes object capture. I have not seen an Urban Radiance Fields extension to Plenoxels, but I imagine it will come, revolutionizing robotics as well. LUMA AI is a new startup that just hired Alex Yu, one of the Plenoxels authors. It seems they are marketing themselves as a quick object capture service, think of a more accurate COLMAP in an app; I am looking forward to seeing what they build.
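To give a feel for the representation, here is a minimal sketch of evaluating view-dependent color from spherical harmonic coefficients stored at a single voxel. This is a simplification I wrote for illustration: it only uses degree-1 harmonics and one voxel, whereas the actual Plenoxels implementation uses higher-degree harmonics, trilinear interpolation over a sparse grid, and a custom CUDA renderer.

```python
import numpy as np

def sh_basis_deg1(d):
    """Real spherical harmonic basis up to degree 1 for a view direction d."""
    x, y, z = d / np.linalg.norm(d)
    return np.array([0.282095,            # l = 0
                     0.488603 * y,        # l = 1, m = -1
                     0.488603 * z,        # l = 1, m = 0
                     0.488603 * x])       # l = 1, m = 1

def voxel_color(sh_coeffs, view_dir):
    """View-dependent RGB from one voxel's spherical harmonic coefficients.

    sh_coeffs: (3, 4) array, one row of coefficients per color channel.
    In Plenoxels such coefficients (plus a density) live on a sparse voxel
    grid and are interpolated at each sample point along a ray.
    """
    basis = sh_basis_deg1(view_dir)
    rgb = sh_coeffs @ basis               # weighted sum of basis functions
    return 1.0 / (1.0 + np.exp(-rgb))     # squash colors into [0, 1]

coeffs = np.random.default_rng(0).normal(size=(3, 4))
print(voxel_color(coeffs, np.array([0.0, 0.0, 1.0])))
```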


Conclusion
I assume that real time depth estimation and SLAM will remain supreme in real time robotics mapping applications, due to the maturity of the algorithms and the need to retrain radiance fields for dynamic scenes, though I already see work handling dynamic scenes so this may change. SLAM allows for easier map updating than radiance fields as of today, but recent work such as iMAP: Implicit Mapping and Positioning in Real-Time questions this assumption. Urban Radiance Fields suggests it is probably a good idea to have your robot collect images and then train a radiance field offline if your application allows it. Real time depth estimation still seems strongly positioned for certain robotics applications, albeit perhaps computed in a different representation than today, and perhaps eliminated altogether if dense map updating can happen at a similar frame rate. In conclusion, radiance fields have positioned themselves as a new technique for computer vision researchers and practitioners to try for 3D reconstruction.


Brevin Tilmon
https://btilmon.github.io/
January 12, 2022


Additional Resources
  • Approachable introduction to NeRF by Frank Dellaert [link]
  • Mathematical Foundations of Computer Vision and Computer Graphics [link]
  • Light fields
    • Original light field paper [link]
    • Light field photography with a hand-held plenoptic camera [link]
  • Computer Vision: Algorithms and Applications, 2nd Edition [link]