Before preparing your own data, you should check out the dataset module carefully.
Overall, you need to organize your data in the following structure and create a corresponding config file.
```
manhattan_sdf
├───data
│   ├───$scene_name
│   │   ├───intrinsic.txt
│   │   ├───images
│   │   │   ├───0.png
│   │   │   ├───1.png
│   │   │   └───...
│   │   ├───pose
│   │   │   ├───0.txt
│   │   │   ├───1.txt
│   │   │   └───...
│   │   ├───depth_colmap
│   │   │   ├───0.npy
│   │   │   ├───1.npy
│   │   │   └───...
│   │   └───semantic_deeplab
│   │       ├───0.png
│   │       ├───1.png
│   │       └───...
│   └───...
├───configs
│   ├───$scene_name.yaml
│   └───...
└───...
```
You should place RGB images in the `data/$scene_name/images` folder. The filenames can be arbitrary, but make sure they are consistent with the files in the other folders under `data/$scene_name`.
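The consistency requirement is easy to get wrong, so a quick check can save debugging time. Below is a minimal sketch of a hypothetical helper (not part of this repo) that compares file stems across the four folders; `data/your_scene` is a placeholder for your actual scene directory.

```python
from pathlib import Path

def check_consistency(scene_dir):
    # collect the file stems (names without extensions) found in each folder
    scene_dir = Path(scene_dir)
    stems = {
        "images": {p.stem for p in (scene_dir / "images").glob("*.png")},
        "pose": {p.stem for p in (scene_dir / "pose").glob("*.txt")},
        "depth_colmap": {p.stem for p in (scene_dir / "depth_colmap").glob("*.npy")},
        "semantic_deeplab": {p.stem for p in (scene_dir / "semantic_deeplab").glob("*.png")},
    }
    reference = stems["images"]
    for folder, names in stems.items():
        missing = reference - names
        extra = names - reference
        if missing or extra:
            print(f"{folder}: missing {sorted(missing)}, extra {sorted(extra)}")

check_consistency("data/your_scene")  # placeholder scene name
```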
Save the 4x4 intrinsic matrix in `data/$scene_name/intrinsic.txt`.
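If your calibration gives a standard 3x3 pinhole matrix, you can pad it to 4x4 before saving. The snippet below is a minimal sketch; the intrinsic values are placeholders and `your_scene` stands for your actual scene name.

```python
import numpy as np

fx, fy, cx, cy = 577.87, 577.87, 319.5, 239.5  # placeholder intrinsics, use your own
K = np.array([
    [fx, 0.0, cx],
    [0.0, fy, cy],
    [0.0, 0.0, 1.0],
])
K44 = np.eye(4)          # embed the 3x3 matrix into a 4x4 homogeneous matrix
K44[:3, :3] = K
np.savetxt("data/your_scene/intrinsic.txt", K44)
```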
You can solve camera poses with COLMAP or any other tool you like. Then you need to normalize the camera poses and modify some configs to ensure that:
- The scene is inside the bounding radius; note that the bounding radius should be smaller than π so that positional encoding can work well.
- The center of the scene is near the origin and the geometric initialization radius is appropriate (the surface of the initialized sphere and the target geometry should not be too far apart).
- The sampling range (near, far) covers the whole scene.
To achieve these, the simplest way is to normalize the camera poses so that they lie inside a unit sphere, which is similar to the implementation of VolSDF and neurecon. Then you can set the geometric initialization radius to 1.0 and the bounding radius to 2.0 (note that indoor scenes are scanned from the inside, which is different from object datasets such as DTU, so you need to set the geometric initialization radius and bounding radius larger than the range of camera poses).
This works well if the images are captured by walking around the scene (namely, the camera trajectory is relatively close to the boundary of the scene), since the scale of the scene can be regarded as slightly larger than the range of camera poses. If this cannot be guaranteed, you need to rescale the camera poses to fit inside a smaller sphere or adjust the hyperparameters heuristically. A more general way is to first run sparse reconstruction and define a region of interest manually; please refer to NeuS.
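As a rough illustration of the normalization described above (not the exact implementation of VolSDF/neurecon or the reference script mentioned later), the following sketch shifts the mean camera center to the origin and rescales the poses so that all camera centers fall inside a unit sphere. It assumes camera-to-world 4x4 matrices.

```python
import numpy as np

def normalize_poses(c2w_list, target_radius=1.0):
    c2w = np.stack(c2w_list)          # (N, 4, 4) camera-to-world matrices
    centers = c2w[:, :3, 3]           # camera centers in world coordinates
    offset = centers.mean(axis=0)     # translate the scene center to the origin
    scale = target_radius / np.linalg.norm(centers - offset, axis=1).max()
    normalized = c2w.copy()
    normalized[:, :3, 3] = (centers - offset) * scale
    return normalized, offset, scale  # keep offset/scale to undo the transform later
```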
Save the normalized poses as 4x4 matrices in txt format under the `data/$scene_name/pose` folder. Remember to record the scale and offset used for normalization so that you can transform back to the original coordinates if you want to extract a mesh and compare it with the ground-truth mesh.
You need to run COLMAP sparse and dense reconstruction first. Please refer to this instruction if you want to use known camera poses.
After dense reconstruction, you can obtain a depth prediction for each view. However, the depth predictions can be noisy, so we recommend running fusion to filter out most of the noise. Since the original COLMAP does not produce a fusion mask for each view, you need to compile and run this customized version, which is a submodule of NerfingMVS.
We provide a Python script here for reference, which includes camera pose normalization and COLMAP depth map generation.
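As a rough standalone sketch of the depth map conversion, the reader below is adapted from COLMAP's `scripts/python/read_dense.py` and turns a COLMAP dense depth map into the expected `.npy` file. The paths are illustrative, and filtering with the fusion masks produced by the customized COLMAP is not shown here.

```python
import numpy as np

def read_array(path):
    """Read a COLMAP dense depth map (.bin); adapted from scripts/python/read_dense.py."""
    with open(path, "rb") as fid:
        width, height, channels = np.genfromtxt(
            fid, delimiter="&", max_rows=1, usecols=(0, 1, 2), dtype=int
        )
        fid.seek(0)
        # skip the "width&height&channels&" text header before the raw float data
        num_delimiter = 0
        byte = fid.read(1)
        while True:
            if byte == b"&":
                num_delimiter += 1
                if num_delimiter >= 3:
                    break
            byte = fid.read(1)
        array = np.fromfile(fid, np.float32)
    array = array.reshape((width, height, channels), order="F")
    return np.transpose(array, (1, 0, 2)).squeeze()

# illustrative paths: COLMAP names each depth map after its source image
depth = read_array("dense/stereo/depth_maps/0.png.geometric.bin")
np.save("data/your_scene/depth_colmap/0.npy", depth)
```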
You need to run 2D semantic segmentation to generate semantic predictions from the images. We provide our trained model and inference code here.
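Whatever segmentation network you use, each image needs a corresponding label PNG under `data/$scene_name/semantic_deeplab`. The sketch below only shows the output step and assumes a hypothetical `predict` function returning a per-pixel label map; make sure the label convention matches what the provided model and the dataset loader expect.

```python
import numpy as np
from PIL import Image

def save_semantic(labels, out_path):
    # labels: (H, W) integer array of per-pixel predicted classes
    Image.fromarray(labels.astype(np.uint8)).save(out_path)

# labels = predict(image)  # hypothetical inference call from your segmentation model
# save_semantic(labels, "data/your_scene/semantic_deeplab/0.png")
```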