Tapping the transformer architecture
Meta emphasizes the models’ ability to generalize to real-world scenarios, stating that they perform strongly on real-world data and function well when data is “scarce or entirely synthetic.” To accomplish that feat, the models use a combination of a large-scale, curated training dataset and a scalable architecture based on vision transformers. Interest in transformers has exploded since about 2018 — especially in natural language processing tasks but also in models such as Google DeepMind’s AlphaFold 2 for protein structure prediction. Computer vision applications are hot, too.
Before Sapiens, Meta had gathered significant experience with transformer architectures, having developed models like Data-efficient Image Transformers (DeiT) in 2021 and DETR (DEtection TRansformer), an object detection framework. In Sapiens, transformers’ attention mechanisms allow the various models to weigh the importance of different parts of the input image and dynamically focus on the most relevant features. Such capabilities allow the models to accurately infer human pose, segmentation, depth, and surface normals across various scenarios, from simple poses to complex interactions in cluttered environments.
Sapiens models also use multi-headed self-attention to process high-resolution images, allowing them to discern subtle variations in human anatomy.
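To make the mechanism concrete, the sketch below implements a minimal multi-head self-attention layer over image patch tokens in PyTorch. It is an illustrative toy, not Sapiens’ actual implementation; the embedding size, head count, and patch geometry are assumptions chosen for the example.

```python
# Minimal sketch of multi-head self-attention over patch tokens (illustrative only;
# dimensions below are assumptions, not Sapiens' real configuration).
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int = 768, num_heads: int = 12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)   # project tokens to queries, keys, values
        self.proj = nn.Linear(embed_dim, embed_dim)      # recombine the heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patch_tokens, embed_dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (batch, heads, tokens, head_dim)
        # Each patch token attends to every other token; the softmax weights let the
        # model focus dynamically on the most relevant image regions.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

# Example: a 1024x1024 image split into 16x16 patches would yield 4096 tokens;
# a smaller token count is used here just to keep the demo cheap.
tokens = torch.randn(1, 256, 768)
print(MultiHeadSelfAttention()(tokens).shape)  # torch.Size([1, 256, 768])
```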
Native support for 1K inference
Sapiens models “natively support 1K high-resolution inference,” and their performance improves as parameters are scaled. As the paper notes, “model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion.” The results are impressive, with Meta reporting that “Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.”
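As a rough illustration of what 1K inference looks like from the user’s side, the sketch below resizes an image to a roughly 1K input and runs it through an exported checkpoint. The checkpoint filename, input resolution, and normalization constants are assumptions made for this example; consult the Sapiens GitHub repository for the actual loading code and preprocessing.

```python
# Hypothetical sketch of single-image inference at ~1K resolution.
# File names and preprocessing values are assumptions, not the official recipe.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((1024, 768)),                      # assumed ~1K input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("person.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)                   # shape: (1, 3, 1024, 768)

model = torch.jit.load("sapiens_pose_1b.pt")             # hypothetical exported checkpoint
model.eval()
with torch.no_grad():
    output = model(batch)                                # e.g. per-keypoint heatmaps for pose
```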
The arXiv paper notes that the goal of Sapiens is to offer a unified framework and models to “unlock a wide range of human-centric applications for everybody.” The core focus is on 3D human digitization, which “remains a pivotal goal in computer vision.” In the long run, Meta envisions the model family could serve as a tool for acquiring large-scale, real-world supervision with humans in the loop to develop future generations of human vision models.
Potential applications are diverse
Sapiens could have an array of potential uses. In the entertainment industry, the high-fidelity pose estimation and body-part segmentation could facilitate motion capture for films and video games, enabling more realistic CGI-based character animations. Additionally, the detailed facial keypoint detection (243 points) could enhance facial expression analysis for applications in human-computer interaction or emotion recognition systems. In augmented and virtual reality, Sapiens’ depth estimation and surface normal prediction could improve the integration of virtual objects into real environments.
Outside of entertainment, the models’ ability to generalize to in-the-wild scenarios points to potential applications in surveillance and security, such as crowd behavior analysis or anomaly detection in public spaces. Sapiens could also find use in advanced driver assistance systems, where more accurate pedestrian pose and depth estimation could help vehicles avoid collisions. Finally, in healthcare, the precise body pose and depth estimation capabilities could be useful for gait analysis, physical therapy monitoring, or ergonomics assessments.
Over the course of the year, Meta has announced a string of new AI models even as its Reality Labs division slims down.
The Sapiens models are available to download for free on GitHub.