Shixiang Tang1,4*, Cheng Chen4*, Qingsong Xie4, Meilin Chen2,4, Yizhou Wang2,4, Yuanzheng Ci1, Lei Bai3, Feng Zhu4, Haiyang Yang4, Li Yi4, Rui Zhao4,5, Wanli Ouyang3
1The University of Sydney; 2Zhejiang University; 3Shanghai Artificial Intelligence Laboratory; 4SenseTime Research; 5Qing Yuan Research Institute, Shanghai Jiao Tong University
CVPR 2023

Human-centric perception covers a variety of vision tasks with widespread industrial applications, including surveillance, autonomous driving, and the metaverse. It is therefore desirable to have a general pretrained model that serves versatile human-centric downstream tasks. This paper forges ahead along this path from the aspects of both benchmarks and pretraining methods. Specifically, we propose HumanBench, built on existing datasets, to comprehensively evaluate on common ground the generalization abilities of different pretraining methods on 19 datasets from 6 diverse downstream tasks: person ReID, pose estimation, human parsing, pedestrian attribute recognition, pedestrian detection, and crowd counting. To learn both coarse-grained and fine-grained knowledge of human bodies, we further propose a Projector AssisTed Hierarchical pretraining method (PATH) that learns diverse knowledge at different granularity levels. Comprehensive evaluations on HumanBench show that PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets.
- Collected 11,019,187 pretraining images from 37 datasets spanning 5 tasks, ranging from global to local.
- Constructed 19 evaluation datasets from 6 tasks.
- Defined 3 evaluation protocols to assess the generalization ability of pretrained models: in-dataset evaluation, out-of-dataset evaluation, and unseen-task evaluation.
- Designed a task-specific MLP projector to enhance the generalization ability of supervised pretraining.
- Designed a hierarchical weight-sharing strategy to reduce task conflicts.
- Achieved higher performance than state-of-the-art methods on 17 datasets and on-par performance on the other 2 datasets, even on tasks that do not exist in the pretraining data.
See installation instructions.
See data instructions.
We also provide a small training config that uses 10% of the samples of the full pretraining dataset.
Download the pre-trained MAE ViT-Base model from here and place the MAE pretrained weight mae_pretrain_vit_base.pth under the core/models/backbones/pretrain_weights folder.
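The placement step above can be sketched as follows (paths are taken from the instructions; the checkpoint is assumed to have already been downloaded into the current directory):

```shell
# Create the folder the training configs expect and move the
# downloaded MAE checkpoint into it, if it is present locally.
mkdir -p core/models/backbones/pretrain_weights
if [ -f mae_pretrain_vit_base.pth ]; then
  mv mae_pretrain_vit_base.pth core/models/backbones/pretrain_weights/
fi
ls core/models/backbones/pretrain_weights
```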
```shell
## train ViT-B
cd experiments/L2_full_setting_joint_v100_32g
sh train.sh

## train ViT-L
cd experiments/L2_full_setting_vit_large_a100_80g
sh train.sh
```
A pre-trained PATH-ViT-B is available at 🤗 Hugging Face, and a pre-trained PATH-ViT-L is available at 🤗 Hugging Face. The results on various tasks are summarized below:
- Hugging Face Release
- Detailed and convenient methods for data preparation.
- PATH-B finetune configs
- PATH-B/L HumanBench pretrained models
- PATH Pretraining Code
@article{tang2023humanbench,
title={HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining},
author={Tang, Shixiang and Chen, Cheng and Xie, Qingsong and Chen, Meilin and Wang, Yizhou and Ci, Yuanzheng and Bai, Lei and Zhu, Feng and Yang, Haiyang and Yi, Li and others},
journal={arXiv preprint arXiv:2303.05675},
year={2023}
}
MAE, Mask2Former, bts, mmcv, mmdetection, mmpose.
We are hiring at all levels in the 2D-3D Human-Centric Foundation Model Team, including full-time researchers, engineers, and interns. If you are interested in working with us on human-centric foundation models and human-centric AIGC driven by foundation models, please contact Shixiang Tang.