3D Human Pose Estimation in Video

Alexandra Kissel

UW-Madison

Bin Yang

UW-Madison

Ganesh Arivoli

UW-Madison

CS 566 | Fall 2025

Abstract


This project presents a comprehensive comparison between transformer-based and CNN-based architectures for 3D human pose estimation in video. We evaluate two state-of-the-art models—VideoPose3D (a temporal CNN) and PoseFormerV2 (a transformer-based model)—on the Human3.6M dataset using standardized evaluation protocols. Our experiments demonstrate that PoseFormerV2 consistently outperforms VideoPose3D across all metrics, achieving a 3.4% improvement in MPJPE (45.2 mm vs 46.8 mm), 2.5% improvement in P-MPJPE (35.6 mm vs 36.5 mm), and 2.7% improvement in N-MPJPE (43.8 mm vs 45.0 mm). Through both quantitative metrics and qualitative video comparisons, we show that transformer-based architectures provide better global joint understanding and improved stability for 3D pose estimation. Our findings highlight the importance of 2D keypoint detection quality and demonstrate that transformer models represent a promising direction for robust, generalizable 3D pose estimation systems.

Problem Statement


The Problem: Estimating 3D human pose from video

Challenges:

Our Aim: Compare transformer-based models with classical CNN models for 3D pose estimation in video.

Examples of challenging human pose estimation scenarios.

Why is it Important?


Real-world Applications:

2D keypoints from sports footage lifted to 3D, a capability that is crucial for real-world applications such as sports analytics.

Current Limitations:

Systems work well in clean conditions but struggle with:

Impact: Better 3D pose estimation increases safety, robustness, and usability across many practical applications.

Our Method


Models Compared

We implemented and compared two architectures using identical 2D keypoints as input (a minimal sketch contrasting their core temporal building blocks follows the list below):

  1. VideoPose3D (baseline CNN-based)
    • Fully convolutional network built on dilated temporal convolutions over 2D keypoint sequences
    • Reference: Pavllo, D., et al. (2019). "3D human pose estimation in video with temporal convolutions and semi-supervised training"
  2. PoseFormerV2 (transformer-based)
    • Uses self-attention to model global spatio-temporal relationships
    • Uses frequency-domain representations of the keypoint sequence for efficiency and robustness to noisy 2D detections
    • Reference: Zhao, Q., et al. (2023). "PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation"
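
To make the architectural contrast concrete, the sketch below shows a minimal PyTorch version of the two core temporal building blocks: a dilated temporal convolution (VideoPose3D-style) and a self-attention layer over frames (PoseFormerV2-style). The class names and hyperparameters are our own illustrative choices, not the authors' code, and PoseFormerV2's frequency-domain branch is omitted.

```python
# Minimal sketch (PyTorch) of the two temporal building blocks being compared.
# Names and hyperparameters are illustrative, not taken from the original repos.
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Dilated temporal convolution over a sequence of keypoint embeddings,
    in the spirit of VideoPose3D."""
    def __init__(self, channels=256, dilation=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, frames)
        return self.act(self.bn(self.conv(x)))

class TemporalAttentionBlock(nn.Module):
    """Self-attention across frames, in the spirit of PoseFormerV2's temporal
    transformer (its frequency-domain branch is omitted here)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x):  # x: (batch, frames, dim)
        return self.layer(x)

# Toy usage: a 27-frame window of 17 joints x (x, y), embedded to 256-dim tokens.
frames, joints, dim = 27, 17, 256
embed = nn.Linear(joints * 2, dim)
keypoints_2d = torch.randn(1, frames, joints * 2)
tokens = embed(keypoints_2d)                                # (1, 27, 256)
conv_out = TemporalConvBlock(dim)(tokens.transpose(1, 2))   # local window per layer
attn_out = TemporalAttentionBlock(dim)(tokens)              # every frame attends to every other frame
print(conv_out.shape, attn_out.shape)
```

The key difference is that the convolutional block sees only a local temporal window (its receptive field grows by stacking layers), while the attention block relates all frames in the window directly, which is the global spatio-temporal modelling referred to above.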

Comparison Methods

We compare them using two methods (a sketch of the resulting two-stage pipeline follows the list below):

  1. Metrics on the Human3.6M dataset
    • Quantitative evaluation on standardized test set
  2. Visual comparison on in-the-wild videos:
    • Step a): video frames converted to 2D keypoints using Detectron2
    • Step b): 2D keypoints lifted to 3D joint coordinates
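
Below is a minimal sketch of this two-stage pipeline. It assumes Detectron2 and OpenCV are installed, uses a COCO keypoint R-CNN config from the Detectron2 model zoo, and replaces the 2D-to-3D lifting model (VideoPose3D or PoseFormerV2) with a hypothetical lift_to_3d stub; the input file name is illustrative.

```python
# Sketch of the comparison pipeline: video -> 2D keypoints (Detectron2) -> 3D poses.
# The lifting step is a stub; in our experiments it is VideoPose3D or PoseFormerV2.
import cv2
import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Step a) per-frame 2D keypoints with a pretrained COCO keypoint R-CNN.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
predictor = DefaultPredictor(cfg)

def lift_to_3d(sequence_2d):
    """Placeholder for the lifting model; both models map a window of 2D
    keypoints (frames, 17, 2) to per-frame 3D joints (frames, 17, 3)."""
    return torch.zeros(sequence_2d.shape[0], 17, 3)  # dummy output for the sketch

keypoints_2d = []
cap = cv2.VideoCapture("input_video.mp4")  # illustrative input path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    instances = predictor(frame)["instances"]
    if len(instances) > 0:
        best = instances.scores.argmax()  # keep the highest-scoring person
        keypoints_2d.append(instances.pred_keypoints[best][:, :2].cpu())  # drop confidence column
cap.release()

# Step b) lift the whole 2D keypoint sequence to 3D coordinates.
if keypoints_2d:
    poses_3d = lift_to_3d(torch.stack(keypoints_2d))
    print(poses_3d.shape)  # (frames_with_detections, 17, 3)
```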

About the Dataset and Metrics


Human3.6M Dataset

Human3.6M is the standard large-scale benchmark for 3D human pose estimation: roughly 3.6 million frames of subjects performing 15 everyday activities, with accurate 3D joint positions recorded by a motion-capture system. Following the standard protocol, models are trained on subjects S1, S5, S6, S7, and S8 and evaluated on subjects S9 and S11.

Human3.6M dataset samples with corresponding 3D ground-truth poses.

Evaluation Protocols

We report three standard evaluation protocols (a minimal implementation sketch follows the list below):

  1. Protocol #1 (MPJPE): Mean per-joint position error in millimeters
    • Mean Euclidean distance between predicted joint positions and ground-truth joint positions
  2. Protocol #2 (P-MPJPE): Procrustes-aligned Mean Per-Joint Position Error
    • Error after alignment with the ground truth in translation, rotation, and scale
  3. Protocol #3 (N-MPJPE): Normalized Mean Per-Joint Position Error
    • Aligns predicted poses with the ground-truth only in scale
    • Used for semi-supervised experiments
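
For reference, below is a minimal NumPy sketch of the three protocols as we understand them, assuming predicted and ground-truth poses are arrays of shape (frames, joints, 3) in millimetres; it is not the official evaluation code.

```python
# Minimal NumPy sketch of the three evaluation protocols (not the official code).
import numpy as np

def mpjpe(pred, gt):
    """Protocol #1: mean Euclidean distance per joint, in the input units (mm)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def n_mpjpe(pred, gt):
    """Protocol #3: align prediction to ground truth in scale only, then MPJPE."""
    scale = (pred * gt).sum(axis=(1, 2), keepdims=True) / (pred * pred).sum(axis=(1, 2), keepdims=True)
    return mpjpe(scale * pred, gt)

def p_mpjpe(pred, gt):
    """Protocol #2: per-frame rigid alignment (translation, rotation, scale)
    via orthogonal Procrustes, then MPJPE."""
    errors = []
    for X, Y in zip(pred, gt):                 # X, Y: (joints, 3)
        X0, Y0 = X - X.mean(0), Y - Y.mean(0)  # remove translation
        U, s, Vt = np.linalg.svd(X0.T @ Y0)
        R = (U @ Vt).T                         # rotation mapping X0 onto Y0
        if np.linalg.det(R) < 0:               # avoid improper reflections
            Vt[-1] *= -1
            s[-1] *= -1
            R = (U @ Vt).T
        scale = s.sum() / (np.linalg.norm(X0) ** 2)
        aligned = scale * X0 @ R.T + Y.mean(0)
        errors.append(np.linalg.norm(aligned - Y, axis=-1).mean())
    return float(np.mean(errors))

# Toy check: a slightly perturbed copy of the ground truth gives small errors.
gt = np.random.randn(8, 17, 3)
pred = gt + 0.01 * np.random.randn(8, 17, 3)
print(mpjpe(pred, gt), p_mpjpe(pred, gt), n_mpjpe(pred, gt))
```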

Results and Comparison


Quantitative Results

S. No. | Error Metric | VideoPose3D | PoseFormerV2 | % Improvement
1 | MPJPE (Protocol #1) | 46.8 mm | 45.2 mm | 3.4%
2 | P-MPJPE (Protocol #2) | 36.5 mm | 35.6 mm | 2.5%
3 | N-MPJPE (Protocol #3) | 45.0 mm | 43.8 mm | 2.7%
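
Here, % Improvement is the relative error reduction of PoseFormerV2 with respect to the VideoPose3D baseline; for MPJPE, for example, (46.8 − 45.2) / 46.8 × 100 ≈ 3.4%.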

Key Points:

Error Metrics Explained:


Qualitative Results (Sample Videos)

Video Demonstrations:

Output 3D Coordinates Comparison:

Action | Mean 3D Error | Median 3D Error | Max 3D Error
Tennis | 1.067 | 1.059 | 1.717
Dancing | 0.996 | 0.938 | 2.134
Running | 1.009 | 0.897 | 1.765

Discussion


What We Learned

Problems Encountered

2D Keypoint Detection Quality:

Coordinate System Alignment:

Temporal Synchronization:

Future Directions

Conclusion

Overall, transformer-based architectures represent a strong next step for building robust and more generalizable 3D pose estimation systems.

Acknowledgements


The website template was adapted from Tzofi Klinghoffer.