
Human pose and shape estimation (HPSE) is a crucial function for human-centric applications, but the accuracy of deep learning-based monocular 3D HPSE suffers from depth ambiguity and occlusion. Multi-camera systems with wide baselines can mitigate these problems, but they require accurate and robust multi-camera calibration as a prerequisite. The main objective of this project is to develop fast and accurate algorithms for automatic calibration of multi-camera systems that fully exploit the human semantic information of multiple persons seen simultaneously by multiple cameras, without using predetermined calibration patterns or objects. The proposed method solves the multi-view matching problem by combining geometric consistency (represented by the pose and shape from an HPSE model) with appearance similarity (represented by features from a person re-identification (Re-ID) model) to compute affinity scores between human body meshes detected in different views. From these scores it computes the optimal permutation matrix P, which is cycle-consistent across all views for all persons seen by more than one camera. Persons seen by a pair of cameras and identified as the same individual are then used for pairwise camera calibration, in which Structure-from-Motion (SfM) and RANSAC algorithms estimate the relative pose between the two cameras. The proposed method supports multiple persons in the common regions and achieves higher accuracy and faster convergence than existing methods that rely on deep learning-based 2D human object detectors or 2D human joint estimators with iterative refinement for multi-person support.
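As an illustration of the matching step only, the following is a minimal two-view sketch and not the project's implementation. It assumes that each detection carries a hypothetical view-invariant pose descriptor (`pose`) and a Re-ID feature vector (`feat`), that the two cues are fused by a weighted sum with an assumed `weight`, and that geometric distance is mapped to an affinity with an exponential kernel of assumed scale `sigma`. The assignment is found by exhaustive search, which is only practical for a handful of persons, and the sketch does not enforce cycle consistency across more than two views.

```python
from itertools import permutations
from math import exp, sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def geometric_affinity(pose_a, pose_b, sigma=1.0):
    """Map the Euclidean distance between two pose descriptors to (0, 1]."""
    dist = sqrt(sum((x - y) ** 2 for x, y in zip(pose_a, pose_b)))
    return exp(-dist / sigma)


def affinity_matrix(view_a, view_b, weight=0.5):
    """Weighted sum of geometric and appearance affinity for each cross-view pair.

    The weighted-sum fusion is an assumption made for this sketch.
    """
    return [
        [
            weight * geometric_affinity(pa["pose"], pb["pose"])
            + (1.0 - weight) * cosine_similarity(pa["feat"], pb["feat"])
            for pb in view_b
        ]
        for pa in view_a
    ]


def best_permutation(affinity):
    """Return the assignment (row i -> column perm[i]) with maximal total affinity.

    Assumes a square matrix, i.e. the same persons are visible in both views.
    """
    n = len(affinity)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(affinity[i][perm[i]] for i in range(n)),
    )


# Toy detections (made-up numbers): two persons seen in two views,
# listed in a different order in each view.
view_a = [
    {"pose": [0.0, 1.0], "feat": [1.0, 0.0]},
    {"pose": [2.0, 0.5], "feat": [0.0, 1.0]},
]
view_b = [
    {"pose": [2.1, 0.4], "feat": [0.1, 0.9]},
    {"pose": [0.1, 1.1], "feat": [0.9, 0.1]},
]

matching = best_permutation(affinity_matrix(view_a, view_b))
print(matching)  # (1, 0): person 0 in view A is person 1 in view B, and vice versa
```

The returned tuple plays the role of a two-view permutation matrix: entry `i` gives the index in the second view matched to person `i` in the first. Matched pairs of this kind are what the subsequent pairwise calibration step would consume.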