Teaser: segmentation of dynamic scenes with our method. Red marks hands/arms; green marks the interacted object (cookie, cup, chicken, torch), segmented via CLIP features.
We tackle the task of learning dynamic 3D semantic radiance fields given a single monocular video as input. Our learned semantic radiance field captures per-point semantics as well as color and geometric properties for a dynamic 3D scene, enabling the generation of novel views and their corresponding semantics. This in turn allows the segmentation and tracking of a diverse set of 3D semantic entities, specified through a simple and intuitive interface: a user click or a text prompt. To this end, we present DGD, a unified 3D representation for both the appearance and semantics of a dynamic 3D scene, building upon the recently proposed dynamic 3D Gaussians representation. Our representation is optimized over time with both color and semantic information. Key to our method is the joint optimization of the appearance and semantic attributes, both of which affect the geometric properties of the scene. We evaluate our approach on its ability to enable dense semantic 3D object tracking and demonstrate high-quality results that are fast to render, for a diverse set of scenes.
Dynamic 3D Gaussians Distillation utilizes a 3D Gaussian representation and jointly optimizes the spatial parameters of the Gaussians and their deformation, together with their appearance properties and a semantic feature per Gaussian. Our learned representation enables efficient semantic understanding and manipulation of dynamic 3D scenes.
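Conceptually, each Gaussian carries a semantic feature vector alongside its color, and both are rendered with the same front-to-back alpha-compositing rule used in Gaussian splatting. A minimal NumPy sketch of that compositing step (function name and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def composite_along_ray(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian attributes.

    features: (N, D) array -- e.g. RGB colors (D=3) or semantic
              embeddings (D = feature dim), one row per Gaussian,
              sorted front to back along the ray.
    alphas:   (N,) projected opacities in [0, 1].

    Returns the composited (D,) attribute. The same rule renders
    color and semantics, which is what couples their optimization.
    """
    out = np.zeros(features.shape[1])
    transmittance = 1.0  # fraction of light not yet absorbed
    for f, a in zip(features, alphas):
        out += transmittance * a * f
        transmittance *= 1.0 - a
    return out
```

Because the semantic features share the Gaussians' opacities and positions with the color channel, gradients from both objectives flow into the same geometric parameters.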
The following results show a fixed novel view over different timesteps, using our method to semantically segment objects over time in the real-world HyperNeRF dataset. The considered objects are marked in green and red.
Cookie (green), Hands (red) | Torch (green), Hands (red) | Chicken (green), Hands (red) | Cup (green), Hands (red) |
The following results are for the synthetic D-NeRF dataset. The considered parts are marked in green, yellow, and red.
red: "Spine", yellow: "Ribs", green: "Skull" | red: "Face", yellow: "Shoes", green: "Helmet" | red: "Pants", yellow: "Hair", green: "Hands" | yellow: "Feet", green: "Hands" |
Our method leverages a variety of 2D foundation models, namely DINOv2, the CLIP-based Lseg, a combination of DINOv2 and Lseg, and SAM, to effectively capture and understand the semantics of 3D scenes.
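Distillation here means supervising the rendered per-pixel semantic features with features extracted by a frozen 2D foundation model (e.g. DINOv2 or Lseg) on the training views. A minimal sketch of such a loss, assuming the two feature maps are already aligned in resolution (names and the plain L2 objective are our illustration):

```python
import numpy as np

def distillation_loss(rendered_feats, teacher_feats):
    """Mean squared error between a rendered per-pixel feature map
    and the features of a frozen 2D teacher on the same view.

    rendered_feats: (H, W, D) features rendered from the Gaussians.
    teacher_feats:  (H, W, D) features from the 2D foundation model.
    """
    return float(np.mean((rendered_feats - teacher_feats) ** 2))
```

In practice this loss is minimized jointly with the photometric reconstruction loss, so semantic supervision also shapes geometry.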
DINOv2 (click)
Lseg (text prompt)
Lseg (click)
DINOv2 and Lseg (click)
SAM (click)
Our semantic understanding of the scene captures fine-grained details, allowing us to control the granularity of click-based segmentation by adjusting the similarity threshold θ.
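One natural reading of θ is a cosine-similarity threshold on the rendered semantic features: a pixel joins the mask if its feature is similar enough to the clicked pixel's feature, so a higher θ yields a tighter, finer-grained mask. A NumPy sketch under that assumption (the function and its exact semantics are our illustration, not the released code):

```python
import numpy as np

def click_segmentation(feature_map, click_xy, theta=0.8):
    """Segment pixels whose semantic feature matches the clicked one.

    feature_map: (H, W, D) rendered per-pixel semantic features.
    click_xy:    (row, col) of the user click.
    theta:       cosine-similarity threshold; raising it shrinks the
                 mask toward the clicked part, lowering it grows it.
    Returns a boolean (H, W) mask.
    """
    # L2-normalize so the dot product below is cosine similarity.
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    q = f[click_xy]        # normalized query feature at the click
    sim = f @ q            # per-pixel cosine similarity to the query
    return sim >= theta
```

Sweeping θ from 0.9 down to 0.6, as in the grid above, progressively merges neighboring semantic regions into the selection.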
θ = 0.9
θ = 0.85
θ = 0.8
θ = 0.75
θ = 0.7
θ = 0.6
A red strawberry | A black handgun |
@misc{labe2024dgd,
title={DGD: Dynamic 3D Gaussians Distillation},
author={Isaac Labe and Noam Issachar and Itai Lang and Sagie Benaim},
year={2024},
eprint={2405.19321},
archivePrefix={arXiv},
primaryClass={cs.CV}
}