PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence

Ruiyan Wang, Teng Hu, Kaihui Huang, Zihan Su, Ran Yi, Lizhuang Ma
Shanghai Jiao Tong University

On this page, we present a demo video and detailed comparative results showcasing the performance of our method against state-of-the-art controllable video generation approaches. We conduct the comparative experiments on data from two distinct domains: human and non-human, highlighting our model's generalization across diverse subject categories.

Abstract

Pose-guided video generation controls the motion of subjects in a generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods accept only human poses as input and therefore generalize poorly to the poses of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters and supporting arbitrary skeletal inputs. To better preserve consistency during motion, we introduce the Part-aware Temporal Coherence Module, which divides the subject into parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that PoseAnything significantly outperforms state-of-the-art methods in both effectiveness and generalization.

Framework

PoseAnything Framework

Demo Video

demo_video

Human Data Comparison

This section compares our method with state-of-the-art methods on human data. Each row corresponds to a single test case in which all methods receive the same input. Our model achieves smooth, continuous motion, consistent appearance, and a stable background, while current SOTA methods show distortion in key areas such as the hands and face, or suffer catastrophic frame failure.

Cases 1 to 4. Videos in each row, left to right: GT, Pose, Ours, UniAnimate, Animate-X, MagicPose.

Non-Human Data Comparison

This section compares our method with state-of-the-art methods on non-human data. Each row corresponds to a single test case. Because ATI, SG-I2V, and Tora are trajectory-guided video generation methods, we manually constructed the input control information for Tora and SG-I2V and used ATI's own control-information extraction mechanism for its input. In addition, SG-I2V and Tora generate fewer frames by default than the 81 frames used by PoseAnything and ATI, so we align all videos to 81 frames by duplicating the last frame, which allows a direct side-by-side comparison. The results indicate that PoseAnything has a clear advantage in precise object pose control, whereas competing methods struggle to achieve frame-level pose alignment and tend to hallucinate content when synthesizing large-range motion.
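The frame-count alignment described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `pad_to_frame_count`, the list-of-frames representation, and the 49-frame example clip are all assumptions made for the example. Only the target of 81 frames and the rule of repeating the last frame come from the text.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")  # any per-frame object, e.g. an image array


def pad_to_frame_count(frames: Sequence[Frame], target: int = 81) -> List[Frame]:
    """Return a copy of `frames` padded to `target` frames by repeating the last frame.

    Hypothetical helper: clips that already have `target` frames or more
    are returned unchanged (the page does not describe that case).
    """
    if not frames:
        raise ValueError("cannot pad an empty clip")
    padded = list(frames)
    missing = target - len(padded)
    if missing > 0:
        # Duplicate the final frame until the clip reaches the target length.
        padded.extend([padded[-1]] * missing)
    return padded


# Usage with a made-up 49-frame clip; strings stand in for real frames.
clip = [f"frame_{i}" for i in range(49)]
aligned = pad_to_frame_count(clip)
print(len(aligned))
```

Padded videos therefore appear to freeze on their final frame once the shorter method's output ends, which is worth keeping in mind when judging the later part of each comparison.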

Cases 1 to 11. Videos in each row, left to right: GT, Pose, Ours, ATI, SG-I2V, Tora.