Teaser: segmentation of dynamic scenes with our method. Red marks hands/arms; green marks the interacted object (cookie, cup, chicken, torch), segmented via CLIP features.
We tackle the task of learning dynamic 3D semantic radiance fields given a single monocular video as input. Our learned semantic radiance field captures per-point semantics as well as color and geometric properties for a dynamic 3D scene, enabling the generation of novel views and their corresponding semantics. This in turn allows the segmentation and tracking of a diverse set of 3D semantic entities, specified through a simple and intuitive interface: a user click or a text prompt. To this end, we present DGD, a unified 3D representation for both the appearance and semantics of a dynamic 3D scene, building upon the recently proposed dynamic 3D Gaussians representation. Our representation is optimized over time with both color and semantic information. Key to our method is the joint optimization of the appearance and semantic attributes, both of which affect the geometric properties of the scene. We evaluate our approach on its ability to enable dense semantic 3D object tracking and demonstrate high-quality results that are fast to render, for a diverse set of scenes.
Dynamic 3D Gaussians Distillation utilizes a 3D Gaussian representation and jointly optimizes the spatial parameters of the Gaussians and their deformation, together with their appearance properties and a semantic feature per Gaussian. Our learned representation enables efficient semantic understanding and manipulation of dynamic 3D scenes.
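Conceptually, each Gaussian carries a semantic feature vector alongside its color, and both are rendered with the same front-to-back alpha-compositing rule used in Gaussian splatting. A minimal NumPy sketch of that compositing step (function name and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def composite_along_ray(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian attributes.

    features: (N, D) array -- e.g. RGB colors (D=3) or semantic
              embeddings (D = feature dim), one row per Gaussian,
              sorted front to back along the ray.
    alphas:   (N,) projected opacities in [0, 1].

    Returns the composited (D,) attribute. The same rule renders
    color and semantics, which is what couples their optimization.
    """
    out = np.zeros(features.shape[1])
    transmittance = 1.0  # fraction of light not yet absorbed
    for f, a in zip(features, alphas):
        out += transmittance * a * f
        transmittance *= 1.0 - a
    return out
```

Because the semantic features share the Gaussians' opacities and positions with the color channel, gradients from both objectives flow into the same geometric parameters.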
The following results show a fixed novel view over different timesteps, using our method to semantically segment objects over time in the real-world HyperNeRF dataset. The considered objects are marked in green and red.
Cookie (green), Hands (red) | Torch (green), Hands (red) | Chicken (green), Hands (red) | Cup (green), Hands (red) |
The following results are for the synthetic D-NeRF dataset. The considered parts are marked in green, yellow, and red.
red: "Spine", yellow: "Ribs", green: "Skull" | red: "Face", yellow: "Shoes", green: "Helmet" | red: "Pants", yellow: "Hair", green: "Hands" | yellow: "Feet", green: "Hands" |
Our method leverages a variety of 2D foundation models, namely DINOv2, the CLIP-based Lseg, a combination of DINOv2 and Lseg, and SAM, to effectively capture and understand the semantics of 3D scenes.
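Distillation here means supervising the rendered per-pixel semantic features with features extracted by a frozen 2D foundation model (e.g. DINOv2 or Lseg) on the training views. A minimal sketch of such a loss, assuming the two feature maps are already aligned in resolution (names and the plain L2 objective are our illustration):

```python
import numpy as np

def distillation_loss(rendered_feats, teacher_feats):
    """Mean squared error between a rendered per-pixel feature map
    and the features of a frozen 2D teacher on the same view.

    rendered_feats: (H, W, D) features rendered from the Gaussians.
    teacher_feats:  (H, W, D) features from the 2D foundation model.
    """
    return float(np.mean((rendered_feats - teacher_feats) ** 2))
```

In practice this loss is minimized jointly with the photometric reconstruction loss, so semantic supervision also shapes geometry.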
DINOv2 (click)
Lseg (text prompt)
Lseg (click)
DINOv2 and Lseg (click)
SAM (click)
Our semantic understanding of the scene captures fine-grained details, allowing us to control the granularity of click-based segmentation by adjusting the similarity threshold θ.
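One natural reading of θ is a cosine-similarity threshold on the rendered semantic features: a pixel joins the mask if its feature is similar enough to the clicked pixel's feature, so a higher θ yields a tighter, finer-grained mask. A NumPy sketch under that assumption (the function and its exact semantics are our illustration, not the released code):

```python
import numpy as np

def click_segmentation(feature_map, click_xy, theta=0.8):
    """Segment pixels whose semantic feature matches the clicked one.

    feature_map: (H, W, D) rendered per-pixel semantic features.
    click_xy:    (row, col) of the user click.
    theta:       cosine-similarity threshold; raising it shrinks the
                 mask toward the clicked part, lowering it grows it.
    Returns a boolean (H, W) mask.
    """
    # L2-normalize so the dot product below is cosine similarity.
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    q = f[click_xy]        # normalized query feature at the click
    sim = f @ q            # per-pixel cosine similarity to the query
    return sim >= theta
```

Sweeping θ from 0.9 down to 0.6, as in the grid above, progressively merges neighboring semantic regions into the selection.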
θ = 0.9
θ = 0.85
θ = 0.8
θ = 0.75
θ = 0.7
θ = 0.6
A red strawberry | A black handgun |
@misc{labe2024dgd,
title={DGD: Dynamic 3D Gaussians Distillation},
author={Isaac Labe and Noam Issachar and Itai Lang and Sagie Benaim},
year={2024},
eprint={2405.19321},
archivePrefix={arXiv},
primaryClass={cs.CV}
}