Human pose and shape estimation (HPSE) is a crucial function for human-centric applications, but the accuracy of deep learning-based monocular 3D HPSE suffers from depth ambiguity. Multi-camera systems with wide baselines can mitigate this problem, but accurate and robust multi-camera calibration is a prerequisite. The main objective of this paper is to develop fast and accurate algorithms for automatic calibration of multi-camera systems that fully utilize human semantic information without relying on predetermined calibration patterns or objects. The proposed automatic calibration method takes, for each camera, the 3D human body meshes produced by a pretrained Human Mesh Recovery (HMR) model and projects the vertices of each mesh onto the 2D image plane of the corresponding camera. A Structure-from-Motion (SfM) algorithm then reconstructs 3D shape from each pair of cameras, with an iterative Random Sample Consensus (RANSAC) scheme removing outliers when the essential matrix is computed in each iteration. The relative camera extrinsic parameters (i.e., the rotation matrix and translation vector) are then recovered from the estimated essential matrix. Assuming the pose of one main camera in the world coordinate system is known, the poses of all other cameras in the multi-camera system can be readily computed. Using (1) average 2D projection error and (2) average rotation and translation errors as performance metrics, the proposed method is shown to calibrate more accurately than methods using appearance-based feature extractors, e.g., Scale-Invariant Feature Transform (SIFT), and deep learning-based 2D human joint estimators, e.g., OpenPose.
Chih-Hsien Chou, Lin-His Tsao, "Wide-baseline Multi-camera Automatic Calibration Using Recovered Human Body Mesh," in Electronic Imaging, 2025, pp. 126-1 - 126-8, https://doi.org/10.2352/EI.2025.37.14.COIMG-126
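As a rough illustration of the two-view step described in the abstract (projecting recovered mesh vertices, estimating the essential matrix with RANSAC, recovering the relative pose, and chaining it to the known main-camera pose), the sketch below uses OpenCV's standard essential-matrix pipeline. The function and variable names (project_vertices, relative_pose_from_meshes, K_a, R_main_w, etc.) are illustrative assumptions, not the paper's actual implementation, and the translation recovered from an essential matrix is defined only up to scale.

```python
# Minimal sketch of the two-view calibration step, assuming corresponding 2D
# projections of the same recovered human-mesh vertices in two cameras.
# All names are illustrative assumptions, not the paper's code.
import cv2
import numpy as np


def project_vertices(vertices_cam, K):
    """Project 3D mesh vertices (N, 3), given in a camera's coordinate frame
    (z > 0), onto that camera's image plane using intrinsics K (3, 3)."""
    uv_h = (K @ vertices_cam.T).T          # homogeneous pixel coordinates
    return uv_h[:, :2] / uv_h[:, 2:3]      # perspective division -> (N, 2)


def relative_pose_from_meshes(pts_a, pts_b, K_a, K_b):
    """Estimate the rotation R and unit-scale translation t of camera B
    relative to camera A from corresponding 2D vertex projections (N, 2)."""
    # Normalize pixel coordinates so an identity intrinsic matrix can be used.
    pts_a_n = cv2.undistortPoints(pts_a.reshape(-1, 1, 2).astype(np.float64), K_a, None)
    pts_b_n = cv2.undistortPoints(pts_b.reshape(-1, 1, 2).astype(np.float64), K_b, None)

    # Essential matrix with RANSAC to reject outlier vertex correspondences.
    E, inliers = cv2.findEssentialMat(
        pts_a_n, pts_b_n, np.eye(3), method=cv2.RANSAC, prob=0.999, threshold=1e-3
    )
    # Decompose E and resolve the four-fold ambiguity via the cheirality check.
    _, R, t, _ = cv2.recoverPose(E, pts_a_n, pts_b_n, np.eye(3), mask=inliers)
    return R, t, inliers


def chain_to_world(R_rel, t_rel, R_main_w, t_main_w):
    """Given camera B's pose relative to the main camera A (world-to-camera
    convention x_cam = R @ x_world + t), express B's pose in world coordinates."""
    R_b_w = R_rel @ R_main_w
    t_b_w = R_rel @ t_main_w.reshape(3, 1) + t_rel.reshape(3, 1)
    return R_b_w, t_b_w
```

Repeating the last step for every camera paired with the main camera yields the extrinsics of the whole rig, since the main camera's world pose is assumed known.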
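The two performance metrics named in the abstract can be computed along the following lines. The specific error definitions here (geodesic rotation angle in degrees, Euclidean translation distance) are common conventions assumed for illustration and are not necessarily the paper's exact formulas.

```python
# Sketch of the evaluation metrics: average 2D projection error and
# rotation/translation errors against ground-truth extrinsics.
import numpy as np


def average_projection_error(pts_2d, vertices_world, K, R, t):
    """Mean pixel distance between observed 2D points (N, 2) and projections
    of 3D points (N, 3) under estimated extrinsics (R, t) and intrinsics K."""
    cam = (R @ vertices_world.T + t.reshape(3, 1)).T       # world -> camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                             # perspective division
    return float(np.mean(np.linalg.norm(uv - pts_2d, axis=1)))


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between estimated and ground-truth rotations."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return float(np.linalg.norm(t_est.ravel() - t_gt.ravel()))
```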