Space-time Diffusion Features for Zero-shot Text-driven Motion Transfer

Supplementary Material

 


Press the spacebar to pause all videos simultaneously.

 


Our Results

We present sample results of our method, some of which are shown in Fig. 4 of the paper.


Original video | A lion sitting in a forest | A dragon sitting in a floral forest
Original video | A bike driving in a snowy forest | A train riding on rails in autumn view
Original video | A horse jumping into a river | A dolphin jumping into the ocean
Original video | A camel crossing a road in a savannah | A giraffe crossing a road in a savannah
Original video | Monkeys playing with coconuts | Rabbits playing with Easter eggs
Original video | An airplane driving in a forest | A motorbike driving in a forest
Original video | A duck swimming in a pond | A penguin swimming in outer space
Original video | A flamingo walking in a field | A goose walking in a field
Original video | A bear catching a frisbee in the field | A cat catching a frisbee in the field
Original video | A cat running in the cosmos | A horse running near a fence
Original video | A camel walking on icy rocks in Antarctica | An elephant walking on the rocks
Original video | A car driving on a road | A bus driving on a road

 


Comparisons to Baselines

Existing text-guided video editing methods fail to preserve the original motion while adhering to the edit prompt.

Our method preserves the motion of the guidance video while producing the structure of the target object.


A motorbike driving in a scenic desert | TokenFlow | Ours
Gen-1 | Tune-A-Video | Control-A-Video

 


A car driving in a city | TokenFlow | Ours
Gen-1 | Tune-A-Video | Control-A-Video

 


A giraffe walking in the zoo | TokenFlow | Ours
Gen-1 | Tune-A-Video | Control-A-Video

 


Comparisons to SA-NLA

We present additional qualitative comparisons of our method and SA-NLA [5]. Both methods exhibit high fidelity to the original motion. Nevertheless, our method allows for greater deviation in structure and adaptation of fine-grained motion traits, which are necessary for capturing the unique attributes of the target object.


A duck swimming in a river | SA-NLA | Ours
A minivan driving in a snowy forest
A cat running in the cosmos

 


Ablations

We ablate key design choices of our method: alternative loss functions (first row), and the need for guidance during sampling and for our latent initialization strategy (second row).


A car driving in a forest | Space-time feature loss | SMM feature loss
w/o guidance | w/o low-freq. init | Full method

 


A giraffe walking on the rocks | Space-time feature loss | SMM feature loss
w/o guidance | w/o low-freq. init | Full method
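
To make the ablated components concrete, below is a minimal sketch of a loss-guided denoising step. All names here (`model.extract_features`, `model.denoise_step`) are hypothetical placeholders rather than our released implementation; the sketch only illustrates where the feature loss and the guidance enter the sampling loop.

```python
import torch

# Illustrative sketch of one loss-guided denoising step. `model` is a
# hypothetical wrapper around a T2V diffusion model that exposes its
# intermediate space-time features; it is not our released code.

def feature_loss(feats, src_feats):
    # Space-time feature loss: directly match the source features.
    # The SMM variant would instead compare spatial-marginal-mean
    # statistics of the features.
    return ((feats - src_feats) ** 2).mean()

def guided_step(model, x_t, t, prompt_emb, src_feats, scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    _, feats = model.extract_features(x_t, t, prompt_emb)
    grad = torch.autograd.grad(feature_loss(feats, src_feats), x_t)[0]
    # Nudge the latent against the loss gradient, then take the usual
    # denoising (e.g. DDIM) update. "w/o guidance" skips the nudge.
    x_t = (x_t - scale * grad).detach()
    return model.denoise_step(x_t, t, prompt_emb)
```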

 


Inversion Analysis

We present feature inversion visualizations for the full space-time features and for the SMM features.

The videos synthesized from the full space-time features closely resemble the original video in appearance, shape, and pose. Replacing the full space-time features with the SMM features allows for more flexibility.

Each row represents a different random starting point.


Original video | Space-time feature loss | SMM feature loss
Seed 1
Seed 2
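
For reference, feature inversion can be sketched as optimizing a randomly initialized latent so that its diffusion features match those of the original video. The snippet below is a simplified illustration (reusing the hypothetical `model.extract_features` from the ablation sketch), not our exact procedure.

```python
import torch

# Simplified feature-inversion loop: optimize a random latent so its
# features match the source features under the chosen loss. Different
# seeds give the different rows shown above.

def invert_features(model, src_feats, shape, t, prompt_emb,
                    seed=1, n_iters=200, lr=1e-1):
    torch.manual_seed(seed)
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_iters):
        _, feats = model.extract_features(x, t, prompt_emb)
        loss = ((feats - src_feats) ** 2).mean()  # or the SMM feature loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```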

 


Variability Across Seeds

By applying the initial latent noise filtering with different random seeds, our method produces diverse results for a given edit prompt.

The last two columns represent different random starting points.


A sports car driving on a road
A duck swimming in a pond

 


Effect of Initial Latent Filtering

To obtain the initial noise, we apply the downsampling/upsampling operation described in Eq. 4 of the paper. Without this filtering operation, the resulting videos (middle column) retain the appearance characteristics of the original video.


An airplane driving on a road | Unfiltered latent | Ours
A giraffe crossing a road in a savannah
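
A minimal sketch of this filtering is shown below, assuming (in the spirit of Eq. 4) that the low spatial frequencies are taken from the inverted source latent and the high frequencies from fresh Gaussian noise. The downsampling factor and interpolation mode are illustrative choices, not necessarily those used in the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative low-pass filter via spatial down/upsampling; the factor
# `s` and bilinear interpolation are example choices.

def low_pass(x, s=2):
    # x: (frames, channels, height, width) latent video.
    h, w = x.shape[-2:]
    x = F.interpolate(x, size=(h // s, w // s), mode="bilinear",
                      align_corners=False)
    return F.interpolate(x, size=(h, w), mode="bilinear",
                         align_corners=False)

def filtered_init(src_latent, seed=0):
    # Low frequencies come from the inverted source latent, high
    # frequencies from fresh noise; different seeds yield the
    # variability shown in the previous section.
    torch.manual_seed(seed)
    noise = torch.randn_like(src_latent)
    return low_pass(src_latent) + (noise - low_pass(noise))
```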

 


Comparison to SDEdit

We further consider SDEdit [6] with different noise levels, none of which resolves the trade-off between motion preservation and edit fidelity.
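
For reference, SDEdit perturbs the input with noise at an intermediate timestep and then denoises from that step. The sketch below shows the noising step under a standard DDPM schedule; `sample_from` is a placeholder for running the reverse process, not a specific library call.

```python
import torch

# Sketch of SDEdit at strength t_frac (e.g. 0.6, 0.8, 0.9, 0.99).
# `alphas_cumprod` is a standard DDPM cumulative-alpha schedule tensor
# and `sample_from` stands in for the reverse diffusion process from
# step t. Larger t_frac injects more noise: stronger edits, weaker
# preservation of the input motion.

def sdedit(x0, t_frac, alphas_cumprod, sample_from, prompt_emb):
    t = int(t_frac * (len(alphas_cumprod) - 1))
    a = alphas_cumprod[t]
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)
    return sample_from(x_t, t, prompt_emb)
```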


A goose walking in a puddle | Ours | SDEdit (t=0.6)
SDEdit (t=0.8) | SDEdit (t=0.9) | SDEdit (t=0.99)

 


A car driving on a road | Ours | SDEdit (t=0.6)
SDEdit (t=0.8) | SDEdit (t=0.9) | SDEdit (t=0.99)

 


A duck swimming in a river | Ours | SDEdit (t=0.6)
SDEdit (t=0.8) | SDEdit (t=0.9) | SDEdit (t=0.99)

 


Limitations

In some cases, our method struggles to preserve the original motion, as the combination of the original motion and the edit prompt may be out of distribution for the T2V model.


Original video | A squirrel climbing on a wall


 

 

 

[1] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.

[2] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.

[3] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.

[4] Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models, 2023.

[5] Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu, and Jia-Bin Huang. Shape-aware text-driven layered video editing. arXiv preprint arXiv:2301.13173, 2023.

[6] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.