Official implementation for V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians.
Penghao Wang*, Zhirui Zhang*, Liao Wang*, Kaixin Yao, Siyuan Xie, Jingyi Yu†, Minye Wu†, Lan Xu†
SIGGRAPH Asia 2024 (ACM Transactions on Graphics)
| Webpage | Paper | Video | Training Code | SIBR Viewer Code | IOS Viewer Code |
Create a new environment
conda create -n videogs python=3.9
conda activate videogs
First install CUDA and PyTorch. Our code is evaluated on CUDA 11.6 and PyTorch 1.13.1+cu116. Then install the following dependencies:
pip install -r requirements.txt
pip install submodules/diff-gaussian-rasterization
pip install submodules/simple-knn
Install the modified NeuS2 for key frame point cloud generation. Please clone it into the external folder and build it.
cd external
git clone --recursive
cd NeuS2_K
cmake . -B build
cmake --build build --config RelWithDebInfo -j
Our code is mainly evaluated on multi-view, human-centric datasets, including the ReRF, HiFi4G, and HumanRF datasets. Please download the data you need.
Our dataset format is structured as follows:
| |---xxx (data name)
| | |---%d
| | | |---images
| | | | |---%d.png
| | | |---transforms.json
The transforms.json is based on the NGP calibration format:
{
    "frames": [
        {
            "file_path": "xxx/xxx.png" (file path to the image),
            "transform_matrix": [
                xxx (extrinsic)
            ],
            "K": [
                xxx (intrinsic; note that it can be different for each view)
            ],
            "fl_x": xxx (focal length x),
            "fl_y": xxx (focal length y),
            "cx": xxx (cx),
            "cy": xxx (cy),
            "w": xxx (image width),
            "h": xxx (image height)
        }
    ],
    "aabb_scale": xxx (aabb scale for NeuS2),
    "white_transparent": true (if the background is white)
}
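As a quick sanity check on this format, the sketch below parses a transforms.json and derives a 3x3 pinhole intrinsic matrix for each view from `fl_x`, `fl_y`, `cx`, and `cy`. It assumes those keys live either in each frame entry or at the root of the file, and that an explicit `K` entry, when present, already holds the matrix. The helper names `intrinsics_from_frame` and `load_frames` are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path


def intrinsics_from_frame(frame, root=None):
    """Build a 3x3 pinhole matrix from fl_x, fl_y, cx, cy.

    Keys are looked up in the frame first and then in the root object,
    because the exact nesting is an assumption of this sketch.
    """
    root = root or {}

    def get(key):
        if key in frame:
            return float(frame[key])
        return float(root[key])

    return [
        [get("fl_x"), 0.0, get("cx")],
        [0.0, get("fl_y"), get("cy")],
        [0.0, 0.0, 1.0],
    ]


def load_frames(path):
    """Return a list of (file_path, transform_matrix, K) tuples."""
    meta = json.loads(Path(path).read_text())
    views = []
    for frame in meta["frames"]:
        # Prefer an explicit "K" entry if present; otherwise derive it.
        k = frame.get("K") or intrinsics_from_frame(frame, meta)
        views.append((frame["file_path"], frame["transform_matrix"], k))
    return views
```

Calling `load_frames("0/transforms.json")` on one frame folder would then give one tuple per camera view.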
The original HiFi4G dataset is structured as follows:
| |---xxx (data name)
| | |---image_undistortion_white
| | | |---%d - The frame number, starts from 0.
| | | | |---%d.png - Multi-view images, starts from 0.
| | |---colmap/sparse/0 - Camera extrinsics and intrinsics in Gaussian Splatting format.
Then restructure the dataset and convert the COLMAP calibration to the NGP-format transforms.json by running the following command:
cd preprocess
python --input xxx --output xxx
Command Line Arguments for
Input folder of the original HiFi4G dataset
Output folder for the processed HiFi4G dataset
Whether to move the images to the output folder or copy them. True for move, False for copy.
The processed dataset is structured as follows:
| |---xxx (data name)
| | |---%d
| | | |---images
| | | | |---%d.png
| | | |---transforms.json
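The layout change itself amounts to moving (or copying) each `image_undistortion_white/<frame>/<view>.png` into `<frame>/images/<view>.png`. The sketch below illustrates only that file reshuffle, assuming frame folder names are plain integers. `restructure_images` is a hypothetical helper, not the repository's preprocessing script, and it does not perform the COLMAP-to-NGP calibration conversion.

```python
import shutil
from pathlib import Path


def restructure_images(input_dir, output_dir, move=False):
    """Copy or move <input>/image_undistortion_white/<frame>/<view>.png
    to <output>/<frame>/images/<view>.png.

    Returns the number of image files handled.
    """
    src_root = Path(input_dir) / "image_undistortion_white"
    # Sort frame folders numerically so that 10 comes after 9.
    frame_dirs = sorted(
        (p for p in src_root.iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )
    count = 0
    for frame_dir in frame_dirs:
        dst_dir = Path(output_dir) / frame_dir.name / "images"
        dst_dir.mkdir(parents=True, exist_ok=True)
        for image in sorted(frame_dir.glob("*.png")):
            target = dst_dir / image.name
            if move:
                shutil.move(str(image), str(target))
            else:
                shutil.copy2(image, target)
            count += 1
    return count
```

Copying is the safer default here because it leaves the downloaded dataset untouched; moving saves disk space on large captures.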
To process the ReRF dataset, you need to re-calibrate the cameras, undistort the images, and then convert the result to our format.
Install COLMAP for calibration and undistortion. However, as images without a background are hard to calibrate, we provide a COLMAP calibration for the KPOP sequence in the ReRF dataset. You can download it from this link. If you need the calibration of other ReRF sequences, please contact us by email.
With COLMAP installed and the COLMAP calibration available, you can undistort the other frames with the following command:
cd preprocess
python --input xxx --output xxx --calib xxx(the path to colmap calibration) --start xxx(start frame) --end xxx(end frame)
Then follow the code in, undistort the calibration, and use to generate the transforms.json file.
Finally, the processed dataset is structured as follows:
| |---xxx (data name)
| | |---%d
| | | |---images (undistorted images)
| | | | |---%d.png
| | | |---transforms.json
For processed data, launch training with
python --start 0 --end 200 --cuda 0 --data datasets/HiFi4G/0932dancer3 --output output/0923dancer3 --sh 0 --interval 1 --group_size 20 --resolution 2
Command Line Arguments for
The frame id to start training
The frame id to end training
The CUDA device for training
The path to the dataset, note that this should be the folder containing frames from start to end
The output path for the trained frames
Order of spherical harmonics to be used. 0 by default.
The interval between frames. For example, if set to 2, the training frames will be 0, 2, 4, 6, ...
The number of frames trained in each group
Specifies the resolution of the loaded images before training. If provided 1, 2, 4 or 8, uses original, 1/2, 1/4 or 1/8 resolution, respectively. For all other values, rescales the width to the given number while maintaining the image aspect ratio. If not set and the input image width exceeds 1.6K pixels, inputs are automatically rescaled to this target.
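The interval and resolution rules described above can be written out as a small sketch. It assumes that the end frame is exclusive and that "1.6K pixels" means a width of 1600. Both function names are hypothetical and this is not the repository's loader.

```python
def training_frame_ids(start, end, interval=1):
    """Frame ids visited in a run (assumes `end` is exclusive)."""
    return list(range(start, end, interval))


def target_size(width, height, resolution=None):
    """Image size after applying the resolution argument.

    1, 2, 4, 8  -> divide both sides by that factor
    None        -> cap the width at 1600 pixels (assumed meaning of 1.6K)
    other value -> rescale so the width equals that value
    """
    if resolution in (1, 2, 4, 8):
        scale = float(resolution)
    elif resolution is None:
        scale = width / 1600 if width > 1600 else 1.0
    else:
        scale = width / float(resolution)
    return round(width / scale), round(height / scale)
```

For instance, `training_frame_ids(0, 10, 2)` yields frames 0, 2, 4, 6, 8, and `target_size(2000, 1000, 2)` halves both sides of the image.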
After training, the checkpoints in the output folder are structured as follows:
| |---checkpoint
| | |---%d (each frame ckpt folder)
| | |---record (record config and training file)
| |---neus2_output
After getting the Gaussian point clouds, we can compress them by the following command:
python --frame_start 100 --frame_end 140 --group_size 20 --interval 1 --ply_path ~/workspace/output/v3/0923dancer3/checkpoint/ --output_folder ~/workspace/output/v3/0923dancer3/feature_image --sh_degree 0
The trained frames are [100, 140), i.e. 40 frames in total. The output structure will be:
| |---checkpoint
| |---feature_image
| | |---group%d (each group's images)
| | |---min_max.json (stores the min and max values for each frame)
| | |---viewer_min_max.json (same as min_max.json, with a different structure)
| | |---group_info.json (stores the frame indices of each group)
| |---neus2_output
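To make the frame and group arithmetic concrete, the sketch below splits a half-open frame range into consecutive groups. It assumes that a group is simply a run of `group_size` sampled frames. `split_into_groups` is a hypothetical helper and does not reproduce the actual schema of group_info.json.

```python
def split_into_groups(frame_start, frame_end, group_size, interval=1):
    """Split the frames in [frame_start, frame_end) into consecutive groups."""
    frames = list(range(frame_start, frame_end, interval))
    return [frames[i:i + group_size] for i in range(0, len(frames), group_size)]


# With the arguments from the command above: 40 frames in 2 groups of 20.
groups = split_into_groups(100, 140, group_size=20, interval=1)
```

Under these assumptions the first group covers frames 100 to 119 and the second covers frames 120 to 139.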
Then compress images to video by the following command:
python --frame_start 100 --frame_end 140 --group_size 20 --output_path ~/workspace/output/v3/0923dancer3 --qp 25
The qp value is the quantization parameter for compression: a lower value gives higher quality but a larger file size.
The output structure will be:
| |---checkpoint
| |---feature_image
| |---feature_video
| | |---group%d (each group's videos)
| | | |---%d.mp4 (each attribute's video)
| | |---viewer_min_max.json (stores the min and max info for each frame)
| | |---group_info.json (stores the frame indices of each group)
| |---neus2_output
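The min/max files suggest that attribute values are mapped to a fixed integer range before being stored as images, with the per-frame minimum and maximum kept so that a viewer can undo the mapping. The sketch below shows that general idea only. It assumes linear 8-bit quantization, which is not stated in this document, and the function names are hypothetical.

```python
def quantize(values, levels=255):
    """Map floats linearly to integer codes in [0, levels].

    Returns (codes, lo, hi); lo and hi are what a min/max file would store.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant input: every value maps to code 0.
        return [0 for _ in values], lo, hi
    codes = [round((v - lo) / span * levels) for v in values]
    return codes, lo, hi


def dequantize(codes, lo, hi, levels=255):
    """Invert quantize() up to rounding error, using the stored lo and hi."""
    span = hi - lo
    return [lo + c / levels * span for c in codes]
```

The reconstruction error per value is at most half a quantization step, which is why a wider value range in a frame costs precision.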
Note that the video compression step above needs to be executed on Linux due to the video codec.
Finally, the compressed video folder can be hosted on an nginx server and played with our volumetric video viewer.
Our code is based on the original gaussian-splatting implementation. We also refer to NeuS2 for fast key frame point cloud generation, and to 3DGStream for the inspiration behind the fast training strategy.
Thanks to Zhehao Shen for his help with dataset processing.
If you find our work useful in your research, please consider citing our paper.
title={V\^{} 3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians},
author={Wang, Penghao and Zhang, Zhirui and Wang, Liao and Yao, Kaixin and Xie, Siyuan and Yu, Jingyi and Wu, Minye and Xu, Lan},
journal={ACM Transactions on Graphics (TOG)},
publisher={ACM New York, NY, USA}