Abstract
This project presents a comprehensive comparison of transformer-based and CNN-based architectures for 3D human pose estimation in video. We evaluate two state-of-the-art models, VideoPose3D (a temporal convolutional network) and PoseFormerV2 (a transformer), on the Human3.6M dataset under standardized evaluation protocols. Our experiments show that PoseFormerV2 consistently outperforms VideoPose3D across all metrics, with a 3.4% improvement in MPJPE (45.2 mm vs. 46.8 mm), a 2.5% improvement in P-MPJPE (35.6 mm vs. 36.5 mm), and a 2.7% improvement in N-MPJPE (43.8 mm vs. 45.0 mm). Through both quantitative metrics and qualitative video comparisons, we show that transformer-based architectures capture global joint dependencies better and produce more stable 3D pose estimates. Our findings also highlight the importance of 2D keypoint detection quality and suggest that transformer models are a promising direction for robust, generalizable 3D pose estimation systems.
Problem Statement
The Problem: Estimating 3D human pose from video
Input: Sequence of RGB frames
Goal: Recover 3D positions of human joints (head, shoulders, elbows, hips, knees, ankles); see the shape sketch at the end of this section
Challenges:
Projecting a 3D scene to a 2D image discards depth information
Occlusions or clutter can hide body parts
Movements can be fast or complex
Models must generalize to new scenes and people
Our Aim: Compare transformer-based models with classical CNN models for 3D pose estimation in video.
Examples of challenging human pose estimation scenarios.
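Concretely, the task maps a sequence of frames to per-frame 3D joint coordinates. Below is a shape-level sketch of that input/output contract; the sizes are illustrative assumptions, not values taken from either model:

    import numpy as np

    T, H, W = 243, 1080, 1920  # clip length and frame size (assumed values)
    J = 17                     # number of joints in the skeleton used here

    frames = np.zeros((T, H, W, 3), dtype=np.uint8)   # input: RGB frame sequence
    pose_3d = np.zeros((T, J, 3), dtype=np.float32)   # output: 3D joints per frame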
Why It Is Important
Real-world Applications:
Robotics – for safe human–robot interaction
AR/VR – for tracking body movement in virtual environments
Healthcare & rehab – analyzing patient motion
Sports analytics – performance tracking and injury prevention
2D keypoints from sports footage lifted to 3D, a step that is crucial for real-world applications such as sports analytics.
Current Limitations:
Current systems work well under controlled conditions but struggle with:
Complex motion sequences
Depth uncertainty in single-camera video
Occlusions and scene clutter that hide body parts
Impact: Better 3D pose estimation increases safety, robustness, and usability across many practical applications.
Our Method
Models Compared
We implemented and compared two architectures that consume identical 2D keypoints as input (a code sketch contrasting their temporal modeling follows this list):
VideoPose3D (baseline CNN-based)
Dilated temporal convolutions over sequences of 2D keypoints
Reference: Pavllo, D., et al. (2019). "3D human pose estimation in video with temporal convolutions and semi-supervised training"
PoseFormerV2 (transformer-based)
Uses self-attention to model global spatio-temporal relationships
Encodes long 2D keypoint sequences compactly with frequency-domain (DCT) representations
Reference: Zhao, Q., et al. (2023). "PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation"
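To make the contrast concrete, here is a minimal PyTorch sketch of the two temporal modeling styles. Layer sizes and names are our own assumptions; PoseFormerV2's frequency-domain branch and both models' full pipelines are omitted, so this is an illustration rather than either author's implementation:

    import torch
    import torch.nn as nn

    class TemporalConvBlock(nn.Module):
        """VideoPose3D-style block: dilated 1D convolution over the time axis,
        so each output frame sees a fixed local temporal window."""
        def __init__(self, channels=256, dilation=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )

        def forward(self, x):  # x: (batch, channels, time)
            return self.net(x)

    class AttentionBlock(nn.Module):
        """PoseFormerV2-style block: self-attention lets every frame attend to
        every other frame, modeling global spatio-temporal relationships."""
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x):  # x: (batch, time, dim)
            y, _ = self.attn(x, x, x)
            return self.norm(x + y)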
Comparison Methods
We compare the models in two ways:
Metrics on the Human3.6M dataset:
Quantitative evaluation on the standardized test set (subjects S9 and S11)
Visual comparison using arbitrary videos (sketched in code below):
Step a): 2D keypoints extracted from each video frame using Detectron2
Step b): 2D keypoint sequence lifted to 3D joint coordinates
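The two steps can be sketched as follows. The Detectron2 calls use its public model-zoo API for keypoint R-CNN; lifting_model is a hypothetical stand-in for either VideoPose3D or PoseFormerV2, and the COCO-to-Human3.6M keypoint re-mapping and normalization steps are omitted:

    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    # Step a): off-the-shelf 2D keypoint detector (Detectron2 keypoint R-CNN)
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
    predictor = DefaultPredictor(cfg)

    def video_to_3d(frames, lifting_model):
        # (x, y, score) keypoints of the first detected person in each frame
        keypoints_2d = [predictor(f)["instances"].pred_keypoints[0] for f in frames]
        # Step b): the 2D keypoint sequence is lifted to 3D joint coordinates
        return lifting_model(keypoints_2d)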
About the Dataset and Metrics
Human3.6M Dataset
Contains 3.6 million video frames from 11 subjects
Each subject performs 15 actions, recorded by four synchronized cameras
We adopt a 17-joint skeleton
Train subjects: S1, S5, S6, S7, S8
Test subjects: S9, S11 (the split is written out as constants below)
Human3.6M dataset samples with corresponding 3D ground-truth poses.
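For reference, the experimental split can be expressed as plain constants; the variable names here are our own, not from a specific codebase:

    TRAIN_SUBJECTS = ["S1", "S5", "S6", "S7", "S8"]  # subjects used for training
    TEST_SUBJECTS = ["S9", "S11"]                    # held-out evaluation subjects
    NUM_JOINTS = 17                                  # 17-joint Human3.6M skeleton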
Evaluation Protocols
Three evaluation protocols (a reference implementation sketch follows this list):
Protocol #1 (MPJPE): Mean per-joint position error in millimeters
Mean Euclidean distance between predicted joint positions and ground-truth joint positions
Protocol #2 (P-MPJPE): Procrustes-aligned Mean Per-Joint Position Error
Error after alignment with the ground truth in translation, rotation, and scale
Protocol #3 (N-MPJPE): Normalized Mean Per-Joint Position Error
Aligns predicted poses with the ground-truth only in scale
Used for semi-supervised experiments
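A reference-style sketch of the three protocols in NumPy. It assumes root-centered predicted and ground-truth pose arrays of shape (frames, joints, 3) in millimeters, matching the standard Human3.6M setup:

    import numpy as np

    def mpjpe(pred, gt):
        """Protocol #1: mean Euclidean distance over joints and frames."""
        return float(np.linalg.norm(pred - gt, axis=-1).mean())

    def p_mpjpe(pred, gt):
        """Protocol #2: MPJPE after per-frame Procrustes alignment
        (translation, rotation, and scale)."""
        errors = []
        for p, g in zip(pred, gt):
            mu_p, mu_g = p.mean(axis=0), g.mean(axis=0)
            p0, g0 = p - mu_p, g - mu_g
            # Optimal rotation/scale from the SVD of the cross-covariance matrix
            U, S, Vt = np.linalg.svd(p0.T @ g0)
            if np.linalg.det(Vt.T @ U.T) < 0:  # avoid reflections
                Vt[-1] *= -1
                S[-1] *= -1
            R = Vt.T @ U.T
            s = S.sum() / (p0 ** 2).sum()
            aligned = s * p0 @ R.T + mu_g
            errors.append(np.linalg.norm(aligned - g, axis=-1).mean())
        return float(np.mean(errors))

    def n_mpjpe(pred, gt):
        """Protocol #3: MPJPE after per-frame scale-only alignment."""
        # Least-squares scale factor between prediction and ground truth
        num = (pred * gt).sum(axis=(1, 2), keepdims=True)
        den = (pred ** 2).sum(axis=(1, 2), keepdims=True)
        return mpjpe(pred * num / den, gt)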
Results and Comparison
Quantitative Results
S. No. | Error Metric | VideoPose3D | PoseFormerV2 | % Improvement
1      | MPJPE        | 46.8 mm     | 45.2 mm      | 3.4%
2      | P-MPJPE      | 36.5 mm     | 35.6 mm      | 2.5%
3      | N-MPJPE      | 45.0 mm     | 43.8 mm      | 2.7%
Key Points:
All metrics measured in millimeters (mm)
Lower is better
PoseFormerV2 outperforms VideoPose3D across all metrics (the improvement column is reproduced in the snippet below)
Error Metrics Explained:
MPJPE (Protocol #1): Raw 3D position error
P-MPJPE (Protocol #2): Error after pose alignment
N-MPJPE (Protocol #3): Error after scale normalization
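As a sanity check, the improvement column in the table is simply the relative error reduction, (baseline - ours) / baseline. The snippet below reproduces the three percentages:

    # VideoPose3D is the baseline; all values in millimeters (from the table above)
    baseline = {"MPJPE": 46.8, "P-MPJPE": 36.5, "N-MPJPE": 45.0}
    ours = {"MPJPE": 45.2, "P-MPJPE": 35.6, "N-MPJPE": 43.8}

    for metric in baseline:
        gain = 100 * (baseline[metric] - ours[metric]) / baseline[metric]
        print(f"{metric}: {gain:.1f}% improvement")  # 3.4%, 2.5%, 2.7%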
References
Pavllo, D., Feichtenhofer, C., Grangier, D., & Auli, M. (2019). "3D human pose estimation in video with temporal convolutions and semi-supervised training." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Zhao, Q., Zheng, C., Liu, M., Wang, P., & Chen, C. (2023). "PoseFormerV2: Exploring frequency domain for efficient and robust 3D human pose estimation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).