Press the spacebar to pause all videos simultaneously.
We present sample results of our method, a subset of which is shown in Fig. 4 of the paper.
Existing text-guided video editing methods are incapable of preserving the original motion while adhering to the edit prompt.
Our method preserves the motion of the guidance video while complying with the edit prompt.

A motorbike driving in a scenic desert | TokenFlow | Ours
---|---|---
Gen-1 | Tune-A-Video | Control-A-Video
A car driving in a city | TokenFlow | Ours |
---|---|---|
Gen-1 | Tune-A-Video | Control-A-Video |
A giraffe walking in the zoo | TokenFlow | Ours |
---|---|---|
Gen-1 | Tune-A-Video | Control-A-Video |
We present additional qualitative comparisons between our method and SA-NLA ([5]). Both methods exhibit high fidelity to the original motion; however, ours allows for greater deviation in structure and adaptation of fine-grained motion traits, which are necessary for capturing the unique attributes of the target object.
A duck swimming in a river | SA-NLA | Ours |
---|---|---|
A minivan driving in a snowy forest | ||
A cat running in the cosmos | ||
We ablate key design choices of our method: alternative loss functions (first row), and the need for guidance during sampling and for our latent initialization strategy (second row).
A car driving in a forest | Space-time feature loss | SMM feature loss |
---|---|---|
w/o guidance | w/o low-freq. init | Full method |
A giraffe walking on the rocks | Space-time feature loss | SMM feature loss |
---|---|---|
w/o guidance | w/o low-freq. init | Full method |
We present feature-inversion visualizations for the full space-time features and for the SMM features.
The videos synthesized from the full space-time features closely resemble the original video in appearance, shape, and pose. Replacing the full space-time features with SMM features allows for more flexibility.
Each row represents a different random starting point.
| | Original video | Space-time feature loss | SMM feature loss |
|---|---|---|---|
| Seed 1 | | | |
| Seed 2 | | | |
By applying the initial latent noise filtering with different random seeds, our method produces diverse results for a given edit prompt.
The last two columns represent different random starting points.
A sports car driving on a road | ||
---|---|---|
A duck swimming in a pond | ||
To obtain the initial noise, we apply the downsampling/upsampling operation described in Eq. 4 of the paper. Without this filtering operation, the resulting videos (middle column) retain the appearance characteristics of the original video.
An airplane driving on a road | Unfiltered latent | Ours |
---|---|---|
A giraffe crossing a road in a savannah | ||
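A minimal numpy sketch of one plausible form of this low-frequency initialization (the actual operation follows Eq. 4 of the paper; the box-filter downsample, nearest-neighbor upsample, and function names here are assumptions): keep the low frequencies of the inverted source latent and draw the high frequencies from fresh Gaussian noise.

```python
import numpy as np

def lowpass(latent, factor=2):
    """Approximate low-frequency component of a (C, H, W) latent:
    average-pool downsample by `factor`, then nearest-neighbor
    upsample back to the original resolution."""
    c, h, w = latent.shape
    down = latent.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))
    return np.repeat(np.repeat(down, factor, axis=1), factor, axis=2)

def filtered_init(inverted_latent, rng, factor=2):
    """Initial noise: low frequencies from the inverted source latent,
    high frequencies from fresh Gaussian noise."""
    noise = rng.standard_normal(inverted_latent.shape)
    return lowpass(inverted_latent, factor) + (noise - lowpass(noise, factor))
```

Skipping this step (i.e., starting from the unfiltered inverted latent) preserves the original video's high-frequency appearance detail, which is what the middle column above illustrates.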
We further consider SDEdit ([6]) with different noise levels, none of which resolves the tradeoff between motion preservation and edit fidelity.