Abstract
This project presents a comprehensive comparison of transformer-based and CNN-based architectures for 3D human pose estimation in video. We evaluate two state-of-the-art models, VideoPose3D (a temporal convolutional network) and PoseFormerV2 (a transformer), on the Human3.6M dataset under standardized evaluation protocols. Our experiments show that PoseFormerV2 consistently outperforms VideoPose3D across all metrics, with a 3.4% improvement in MPJPE (45.2 mm vs. 46.8 mm), a 2.5% improvement in P-MPJPE (35.6 mm vs. 36.5 mm), and a 2.7% improvement in N-MPJPE (43.8 mm vs. 45.0 mm). Through both quantitative metrics and qualitative video comparisons, we show that transformer-based architectures capture global joint dependencies better and produce more stable 3D pose estimates. Our findings also highlight the importance of 2D keypoint detection quality and suggest that transformer models are a promising direction for robust, generalizable 3D pose estimation systems.
Problem Statement
The Problem: Estimating 3D human pose from video
Input: Sequence of RGB frames
Goal: Recover 3D positions of human joints (head, shoulders, elbows, hips, knees, ankles); see the shape sketch at the end of this section
Challenges:
Projecting a 3D scene to a 2D image discards depth information
Occlusions or clutter can hide body parts
Movements can be fast or complex
Models must generalize to new scenes and people
Our Aim: Compare transformer-based models with classical CNN models for 3D pose estimation in video.
Examples of challenging human pose estimation scenarios.
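Concretely, the task maps a sequence of frames to per-frame 3D joint coordinates. Below is a shape-level sketch of that input/output contract; the sizes are illustrative assumptions, not values taken from either model:

    import numpy as np

    T, H, W = 243, 1080, 1920  # clip length and frame size (assumed values)
    J = 17                     # number of joints in the skeleton used here

    frames = np.zeros((T, H, W, 3), dtype=np.uint8)   # input: RGB frame sequence
    pose_3d = np.zeros((T, J, 3), dtype=np.float32)   # output: 3D joints per frame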
Why It Is Important
Real-world Applications:
Robotics – for safe human–robot interaction
AR/VR – for tracking body movement in virtual environments
Healthcare & rehab – analyzing patient motion
Sports analytics – performance tracking and injury prevention
2D keypoints from sports footage lifted to 3D, a step that is crucial for real-world applications such as sports analytics.
Current Limitations:
Current systems work well under controlled conditions but struggle with:
Complex motion sequences
Depth uncertainty in single-camera video
Occlusions and scene clutter that hide body parts
Impact: Better 3D pose estimation increases safety, robustness, and usability across many practical applications.
Our Method
Models Compared
We implemented and compared two architectures that consume identical 2D keypoints as input (a code sketch contrasting their temporal modeling follows this list):
VideoPose3D (baseline CNN-based)
Dilated temporal convolutions over sequences of 2D keypoints
Reference: Pavllo, D., et al. (2019). "3D human pose estimation in video with temporal convolutions and semi-supervised training"
PoseFormerV2 (transformer-based)
Uses self-attention to model global spatio-temporal relationships
Encodes long 2D keypoint sequences compactly with frequency-domain (DCT) representations
Reference: Zhao, Q., et al. (2023). "PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation"
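To make the contrast concrete, here is a minimal PyTorch sketch of the two temporal modeling styles. Layer sizes and names are our own assumptions; PoseFormerV2's frequency-domain branch and both models' full pipelines are omitted, so this is an illustration rather than either author's implementation:

    import torch
    import torch.nn as nn

    class TemporalConvBlock(nn.Module):
        """VideoPose3D-style block: dilated 1D convolution over the time axis,
        so each output frame sees a fixed local temporal window."""
        def __init__(self, channels=256, dilation=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )

        def forward(self, x):  # x: (batch, channels, time)
            return self.net(x)

    class AttentionBlock(nn.Module):
        """PoseFormerV2-style block: self-attention lets every frame attend to
        every other frame, modeling global spatio-temporal relationships."""
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x):  # x: (batch, time, dim)
            y, _ = self.attn(x, x, x)
            return self.norm(x + y)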
Comparison Methods
We compare the models in two ways:
Metrics on the Human3.6M dataset:
Quantitative evaluation on the standardized test set (subjects S9 and S11)
Visual comparison using arbitrary videos (sketched in code below):
Step a): 2D keypoints extracted from each video frame using Detectron2
Step b): 2D keypoint sequence lifted to 3D joint coordinates
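The two steps can be sketched as follows. The Detectron2 calls use its public model-zoo API for keypoint R-CNN; lifting_model is a hypothetical stand-in for either VideoPose3D or PoseFormerV2, and the COCO-to-Human3.6M keypoint re-mapping and normalization steps are omitted:

    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    # Step a): off-the-shelf 2D keypoint detector (Detectron2 keypoint R-CNN)
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
    predictor = DefaultPredictor(cfg)

    def video_to_3d(frames, lifting_model):
        # (x, y, score) keypoints of the first detected person in each frame
        keypoints_2d = [predictor(f)["instances"].pred_keypoints[0] for f in frames]
        # Step b): the 2D keypoint sequence is lifted to 3D joint coordinates
        return lifting_model(keypoints_2d)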
About the Dataset and Metrics
Human3.6M Dataset
Contains 3.6 million video frames from 11 subjects
Each subject performs 15 actions, recorded by four synchronized cameras
We adopt a 17-joint skeleton
Train subjects: S1, S5, S6, S7, S8
Test subjects: S9, S11 (the split is written out as constants below)
Human3.6M dataset samples with corresponding 3D ground-truth poses.
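For reference, the experimental split can be expressed as plain constants; the variable names here are our own, not from a specific codebase:

    TRAIN_SUBJECTS = ["S1", "S5", "S6", "S7", "S8"]  # subjects used for training
    TEST_SUBJECTS = ["S9", "S11"]                    # held-out evaluation subjects
    NUM_JOINTS = 17                                  # 17-joint Human3.6M skeleton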
Evaluation Protocols
Three evaluation protocols (a reference implementation sketch follows this list):
Protocol #1 (MPJPE): Mean per-joint position error in millimeters
Mean Euclidean distance between predicted joint positions and ground-truth joint positions
Protocol #2 (P-MPJPE): Procrustes-aligned Mean Per-Joint Position Error
Error after alignment with the ground truth in translation, rotation, and scale
Protocol #3 (N-MPJPE): Normalized Mean Per-Joint Position Error
Aligns predicted poses with the ground-truth only in scale
Used for semi-supervised experiments
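A reference-style sketch of the three protocols in NumPy. It assumes root-centered predicted and ground-truth pose arrays of shape (frames, joints, 3) in millimeters, matching the standard Human3.6M setup:

    import numpy as np

    def mpjpe(pred, gt):
        """Protocol #1: mean Euclidean distance over joints and frames."""
        return float(np.linalg.norm(pred - gt, axis=-1).mean())

    def p_mpjpe(pred, gt):
        """Protocol #2: MPJPE after per-frame Procrustes alignment
        (translation, rotation, and scale)."""
        errors = []
        for p, g in zip(pred, gt):
            mu_p, mu_g = p.mean(axis=0), g.mean(axis=0)
            p0, g0 = p - mu_p, g - mu_g
            # Optimal rotation/scale from the SVD of the cross-covariance matrix
            U, S, Vt = np.linalg.svd(p0.T @ g0)
            if np.linalg.det(Vt.T @ U.T) < 0:  # avoid reflections
                Vt[-1] *= -1
                S[-1] *= -1
            R = Vt.T @ U.T
            s = S.sum() / (p0 ** 2).sum()
            aligned = s * p0 @ R.T + mu_g
            errors.append(np.linalg.norm(aligned - g, axis=-1).mean())
        return float(np.mean(errors))

    def n_mpjpe(pred, gt):
        """Protocol #3: MPJPE after per-frame scale-only alignment."""
        # Least-squares scale factor between prediction and ground truth
        num = (pred * gt).sum(axis=(1, 2), keepdims=True)
        den = (pred ** 2).sum(axis=(1, 2), keepdims=True)
        return mpjpe(pred * num / den, gt)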
Results and Comparison
Quantitative Results
S. No. | Error Metric | VideoPose3D | PoseFormerV2 | % Improvement
1      | MPJPE        | 46.8 mm     | 45.2 mm      | 3.4%
2      | P-MPJPE      | 36.5 mm     | 35.6 mm      | 2.5%
3      | N-MPJPE      | 45.0 mm     | 43.8 mm      | 2.7%
Key Points:
All metrics measured in millimeters (mm)
Lower is better
PoseFormerV2 outperforms VideoPose3D across all metrics (the improvement column is reproduced in the snippet below)
Error Metrics Explained:
MPJPE (Protocol #1): Raw 3D position error
P-MPJPE (Protocol #2): Error after pose alignment
N-MPJPE (Protocol #3): Error after scale normalization
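As a sanity check, the improvement column in the table is simply the relative error reduction, (baseline - ours) / baseline. The snippet below reproduces the three percentages:

    # VideoPose3D is the baseline; all values in millimeters (from the table above)
    baseline = {"MPJPE": 46.8, "P-MPJPE": 36.5, "N-MPJPE": 45.0}
    ours = {"MPJPE": 45.2, "P-MPJPE": 35.6, "N-MPJPE": 43.8}

    for metric in baseline:
        gain = 100 * (baseline[metric] - ours[metric]) / baseline[metric]
        print(f"{metric}: {gain:.1f}% improvement")  # 3.4%, 2.5%, 2.7%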
References
Pavllo, D., Feichtenhofer, C., Grangier, D., & Auli, M. (2019). "3D human pose estimation in video with temporal convolutions and semi-supervised training." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Zhao, Q., Zheng, C., Liu, M., Wang, P., & Chen, C. (2023). "PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).