Tapping the transformer architecture
Meta emphasizes the models’ ability to generalize to real-world scenarios, stating that they perform strongly on real-world data and function well when data is “scarce or entirely synthetic.” To accomplish that feat, the models use a combination of a large-scale, curated training dataset and a scalable architecture based on vision transformers. Interest in transformers has exploded since about 2018 — especially in natural language processing tasks but also in models such as Google DeepMind’s AlphaFold 2 for protein structure prediction. Computer vision applications are hot, too.
Before Sapiens, Meta had gathered significant experience with transformer architectures, having developed models like Data-efficient Image Transformers (DeiT) in 2021 and DETR (DEtection TRansformer), an object detection framework. In Sapiens, transformers’ attention mechanisms allow the various models to weigh the importance of different parts of the input image and dynamically focus on the most relevant features. Such capabilities allow the models to accurately infer human pose, segmentation, depth, and surface normals across various scenarios, from simple poses to complex interactions in cluttered environments.
Sapiens models also use multi-headed self-attention to process high-resolution images, allowing them to discern subtle variations in human anatomy.
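To make the mechanism concrete, the sketch below implements a minimal multi-head self-attention layer over image patch tokens in PyTorch. It is an illustrative toy, not Sapiens’ actual implementation; the embedding size, head count, and patch geometry are assumptions chosen for the example.

```python
# Minimal sketch of multi-head self-attention over patch tokens (illustrative only;
# dimensions below are assumptions, not Sapiens' real configuration).
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int = 768, num_heads: int = 12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)   # project tokens to queries, keys, values
        self.proj = nn.Linear(embed_dim, embed_dim)      # recombine the heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patch_tokens, embed_dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (batch, heads, tokens, head_dim)
        # Each patch token attends to every other token; the softmax weights let the
        # model focus dynamically on the most relevant image regions.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

# Example: a 1024x1024 image split into 16x16 patches would yield 4096 tokens;
# a smaller token count is used here just to keep the demo cheap.
tokens = torch.randn(1, 256, 768)
print(MultiHeadSelfAttention()(tokens).shape)  # torch.Size([1, 256, 768])
```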
Native support for 1K inference
Sapiens models “natively support 1K high-resolution inference,” and their performance improves as parameters are scaled. As the paper notes, “model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion.” The results are impressive, with Meta reporting that “Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.”
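As a rough illustration of what 1K inference looks like from the user’s side, the sketch below resizes an image to a roughly 1K input and runs it through an exported checkpoint. The checkpoint filename, input resolution, and normalization constants are assumptions made for this example; consult the Sapiens GitHub repository for the actual loading code and preprocessing.

```python
# Hypothetical sketch of single-image inference at ~1K resolution.
# File names and preprocessing values are assumptions, not the official recipe.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((1024, 768)),                      # assumed ~1K input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("person.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)                   # shape: (1, 3, 1024, 768)

model = torch.jit.load("sapiens_pose_1b.pt")             # hypothetical exported checkpoint
model.eval()
with torch.no_grad():
    output = model(batch)                                # e.g. per-keypoint heatmaps for pose
```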
The arXiv paper notes that the goal of Sapiens is to offer a unified framework and models to “unlock a wide range of human-centric applications for everybody.” The core focus is on 3D human digitization, which “remains a pivotal goal in computer vision.” In the long run, Meta envisions the model family could serve as a tool for acquiring large-scale, real-world supervision with humans in the loop to develop future generations of human vision models.
Potential applications are diverse
Sapiens could have an array of potential uses. In the entertainment industry, the high-fidelity pose estimation and body-part segmentation could facilitate motion capture for films and video games, enabling more realistic CGI-based character animations. Additionally, the detailed facial keypoint detection (243 points) could enhance facial expression analysis for applications in human-computer interaction or emotion recognition systems. In augmented and virtual reality, Sapiens’ depth estimation and surface normal prediction could improve the integration of virtual objects into real environments.
Outside of entertainment, the models’ ability to generalize to in-the-wild scenarios points to potential applications in surveillance and security, such as crowd behavior analysis or anomaly detection in public spaces. Sapiens could also find use in advanced driver assistance systems, where more accurate pedestrian pose and depth estimation could help vehicles avoid collisions. Finally, in healthcare, the precise body pose and depth estimation capabilities could be useful for gait analysis, physical therapy monitoring, or ergonomics assessments.
Over the course of the year, Meta has announced a string of new AI models even as its Reality Labs division slims down.
The Sapiens models are available to download for free on GitHub.