---
layout: default
title: Holistic
parent: Solutions
nav_order: 6
---

# MediaPipe Holistic
{: .no_toc }
Table of contents
{: .text-delta }

1. TOC
{:toc}
---

## Overview

Real-time, simultaneous perception of [human pose](./pose.md), [face landmarks](./face_mesh.md) and [hand tracking](./hands.md) on mobile devices can enable a variety of modern applications: fitness and sport analysis, gesture control and sign language recognition, and augmented reality try-on and effects. MediaPipe already offers fast and accurate, yet separate, solutions for these tasks. Combining them all in real-time into a semantically consistent end-to-end solution is a uniquely difficult problem requiring simultaneous inference of multiple, dependent neural networks.

![holistic_sports_and_gestures_example.gif](../images/mobile/holistic_sports_and_gestures_example.gif) |
:----------------------------------------------------------------------------------------------------: |
*Fig 1. Example of MediaPipe Holistic.*                                                                 |

## ML Pipeline

The MediaPipe Holistic pipeline integrates separate models for [pose](./pose.md), [face](./face_mesh.md) and [hand](./hands.md) components, each of which is optimized for its particular domain. Because of these different specializations, however, the input to one component is not well-suited for the others. The pose estimation model, for example, takes a lower, fixed-resolution video frame (256x256) as input. But if one were to crop the hand and face regions from that image to pass to their respective models, the image resolution would be too low for accurate articulation. Therefore, we designed MediaPipe Holistic as a multi-stage pipeline that treats each region at a region-appropriate image resolution.

First, we estimate the human pose (top of Fig 2) with [BlazePose](./pose.md)'s pose detector and subsequent landmark model. Then, using the inferred pose landmarks, we derive three regions of interest (ROI): one crop for each hand (2x) and one for the face, and employ a re-crop model to improve each ROI. We then crop the full-resolution input frame to these ROIs and apply task-specific face and hand models to estimate their corresponding landmarks. Finally, we merge all landmarks with those of the pose model to yield the full 540+ landmarks.

![holistic_pipeline_example.jpg](../images/mobile/holistic_pipeline_example.jpg) |
:------------------------------------------------------------------------------: |
*Fig 2. MediaPipe Holistic Pipeline Overview.*                                    |

To streamline the identification of ROIs for face and hands, we utilize a tracking approach similar to the one we use for the standalone [face](./face_mesh.md) and [hand](./hands.md) pipelines. It assumes that the object doesn't move significantly between frames and uses the estimation from the previous frame as a guide to the object region in the current one. During fast movements, however, the tracker can lose the target, which requires the detector to re-localize it in the image. MediaPipe Holistic uses [pose](./pose.md) prediction (on every frame) as an additional ROI prior to reduce the response time of the pipeline when reacting to fast movements. This also enables the model to retain semantic consistency across the body and its parts, preventing a mixup between the left and right hands, or between the body parts of one person in the frame and another. In addition, the resolution of the input frame to the pose model is low enough that the resulting ROIs for face and hands are still too inaccurate to guide the re-cropping of those regions, which require a precise input crop to remain lightweight. To close this accuracy gap we use lightweight face and hand re-crop models that play the role of [spatial transformers](https://arxiv.org/abs/1506.02025) and cost only ~10% of the corresponding model's inference time.
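In rough pseudocode, a single frame flows through the stages described above as follows. This is an illustrative sketch of the dataflow only, not the actual MediaPipe graph; every helper here (`pose_model`, `derive_rois`, `recrop_model`, `hand_model`, `face_model`, `merge_landmarks`, `crop`, `resize`) is hypothetical:

```python
def process_frame(frame, previous_rois=None):
  """Illustrative sketch of the per-frame Holistic dataflow; all helpers are hypothetical."""
  # Stage 1: pose detection + landmarks on a low, fixed-resolution copy.
  pose_landmarks = pose_model(resize(frame, (256, 256)))

  # Stage 2: derive three ROIs from the pose landmarks -- one per hand and
  # one for the face -- using the previous frame's ROIs as a tracking prior.
  rois = derive_rois(pose_landmarks, prior=previous_rois)

  # Stage 3: lightweight re-crop models (spatial-transformer style) refine
  # each ROI at ~10% of the corresponding landmark model's inference time.
  rois = {name: recrop_model(name, roi, frame) for name, roi in rois.items()}

  # Stage 4: task-specific models run on full-resolution crops.
  left_hand = hand_model(crop(frame, rois['left_hand']))
  right_hand = hand_model(crop(frame, rois['right_hand']))
  face = face_model(crop(frame, rois['face']))

  # Stage 5: merge everything into the full 540+ landmark set.
  return merge_landmarks(pose_landmarks, face, left_hand, right_hand), rois
```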
The pipeline is implemented as a MediaPipe [graph](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/holistic_tracking/holistic_tracking_gpu.pbtxt) that uses a [holistic landmark subgraph](https://github.com/google/mediapipe/tree/master/mediapipe/modules/holistic_landmark/holistic_landmark_gpu.pbtxt) from the [holistic landmark module](https://github.com/google/mediapipe/tree/master/mediapipe/modules/holistic_landmark) and renders using a dedicated [holistic renderer subgraph](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/holistic_tracking/holistic_tracking_to_render_data.pbtxt). The holistic landmark subgraph internally uses a [pose landmark module](https://github.com/google/mediapipe/tree/master/mediapipe/modules/pose_landmark), a [hand landmark module](https://github.com/google/mediapipe/tree/master/mediapipe/modules/hand_landmark) and a [face landmark module](https://github.com/google/mediapipe/tree/master/mediapipe/modules/face_landmark/). Please check them for implementation details.

Note: To visualize a graph, copy the graph and paste it into [MediaPipe Visualizer](https://viz.mediapipe.dev/). For more information on how to visualize its associated subgraphs, please see [visualizer documentation](../tools/visualizer.md).

## Models

### Landmark Models

MediaPipe Holistic utilizes the pose, face and hand landmark models in [MediaPipe Pose](./pose.md), [MediaPipe Face Mesh](./face_mesh.md) and [MediaPipe Hands](./hands.md) respectively to generate a total of 543 landmarks (33 pose landmarks, 468 face landmarks and 21 hand landmarks per hand).

### Hand Recrop Model

For cases when the accuracy of the pose model is low enough that the resulting hand ROIs are still too inaccurate, we run an additional lightweight hand re-crop model that plays the role of a [spatial transformer](https://arxiv.org/abs/1506.02025) and costs only ~10% of the hand model's inference time.
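As a quick sanity check on the totals above, the per-component counts sum to 543. The snippet below is a minimal sketch; `results` is assumed to come from a `Holistic.process()` call as shown in the Python example later on this page, and any of these outputs can be `None` when the corresponding part is not detected:

```python
NUM_POSE_LANDMARKS = 33
NUM_FACE_LANDMARKS = 468
NUM_HAND_LANDMARKS = 21  # Per hand.

assert NUM_POSE_LANDMARKS + NUM_FACE_LANDMARKS + 2 * NUM_HAND_LANDMARKS == 543

# `results` is assumed to come from a prior holistic.process(image) call.
if results.pose_landmarks:
  assert len(results.pose_landmarks.landmark) == NUM_POSE_LANDMARKS
if results.face_landmarks:
  assert len(results.face_landmarks.landmark) == NUM_FACE_LANDMARKS
if results.left_hand_landmarks:
  assert len(results.left_hand_landmarks.landmark) == NUM_HAND_LANDMARKS
```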
## Solution APIs

### Cross-platform Configuration Options

Naming style and availability may differ slightly across platforms/languages.

#### static_image_mode

If set to `false`, the solution treats the input images as a video stream. It will try to detect the most prominent person in the very first images, and upon a successful detection further localizes the pose and other landmarks. In subsequent images, it simply tracks those landmarks without invoking another detection until it loses track, which reduces computation and latency. If set to `true`, person detection runs on every input image, which is ideal for processing a batch of static, possibly unrelated, images. Defaults to `false`.

#### model_complexity

Complexity of the pose landmark model: `0`, `1` or `2`. Landmark accuracy as well as inference latency generally go up with the model complexity. Defaults to `1`.

#### smooth_landmarks

If set to `true`, the solution filters pose landmarks across different input images to reduce jitter. Ignored if [static_image_mode](#static_image_mode) is also set to `true`. Defaults to `true`.

#### enable_segmentation

If set to `true`, in addition to the pose, face and hand landmarks the solution also generates the segmentation mask. Defaults to `false`.

#### smooth_segmentation

If set to `true`, the solution filters segmentation masks across different input images to reduce jitter. Ignored if [enable_segmentation](#enable_segmentation) is `false` or [static_image_mode](#static_image_mode) is `true`. Defaults to `true`.

#### min_detection_confidence

Minimum confidence value (`[0.0, 1.0]`) from the person-detection model for the detection to be considered successful. Defaults to `0.5`.

#### min_tracking_confidence

Minimum confidence value (`[0.0, 1.0]`) from the landmark-tracking model for the pose landmarks to be considered tracked successfully; otherwise person detection will be invoked automatically on the next input image. Setting it to a higher value can increase robustness of the solution, at the expense of a higher latency. Ignored if [static_image_mode](#static_image_mode) is `true`, where person detection simply runs on every image. Defaults to `0.5`.

### Output

Naming style may differ slightly across platforms/languages.

#### pose_landmarks

A list of pose landmarks. Each landmark consists of the following:

*   `x` and `y`: Landmark coordinates normalized to `[0.0, 1.0]` by the image width and height respectively.
*   `z`: Should be discarded as currently the model is not fully trained to predict depth, but this is something on the roadmap.
*   `visibility`: A value in `[0.0, 1.0]` indicating the likelihood of the landmark being visible (present and not occluded) in the image.

#### pose_world_landmarks

Another list of pose landmarks, in world coordinates. Each landmark consists of the following:

*   `x`, `y` and `z`: Real-world 3D coordinates in meters with the origin at the center between the hips.
*   `visibility`: Identical to that defined in the corresponding [pose_landmarks](#pose_landmarks).

#### face_landmarks

A list of 468 face landmarks. Each landmark consists of `x`, `y` and `z`. `x` and `y` are normalized to `[0.0, 1.0]` by the image width and height respectively. `z` represents the landmark depth with the depth at the center of the head being the origin, and the smaller the value the closer the landmark is to the camera. The magnitude of `z` uses roughly the same scale as `x`.

#### left_hand_landmarks

A list of 21 hand landmarks on the left hand. Each landmark consists of `x`, `y` and `z`. `x` and `y` are normalized to `[0.0, 1.0]` by the image width and height respectively. `z` represents the landmark depth with the depth at the wrist being the origin, and the smaller the value the closer the landmark is to the camera. The magnitude of `z` uses roughly the same scale as `x`.

#### right_hand_landmarks

A list of 21 hand landmarks on the right hand, in the same representation as [left_hand_landmarks](#left_hand_landmarks).

#### segmentation_mask

The output segmentation mask, predicted only when [enable_segmentation](#enable_segmentation) is set to `true`. The mask has the same width and height as the input image, and contains values in `[0.0, 1.0]` where `1.0` and `0.0` indicate high certainty of a "human" and "background" pixel respectively. Please refer to the platform-specific usage examples below for usage details.
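To make the output formats above concrete, here is a minimal sketch of reading them in Python; `results` is assumed to come from a `Holistic.process()` call as in the example below:

```python
import mediapipe as mp

mp_holistic = mp.solutions.holistic

# `results` is assumed to come from a prior holistic.process(image) call.
if results.pose_world_landmarks:
  # Real-world 3D coordinates in meters, origin between the hips.
  nose = results.pose_world_landmarks.landmark[mp_holistic.PoseLandmark.NOSE]
  print(f'Nose in meters: ({nose.x:.3f}, {nose.y:.3f}, {nose.z:.3f}), '
        f'visibility: {nose.visibility:.2f}')

if results.right_hand_landmarks:
  # Normalized image coordinates; landmark 0 is the wrist.
  wrist = results.right_hand_landmarks.landmark[0]
  print(f'Right wrist: ({wrist.x:.3f}, {wrist.y:.3f}), z={wrist.z:.3f}')
```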
### Python Solution API

Please first follow general [instructions](../getting_started/python.md) to install the MediaPipe Python package, then learn more in the companion [Python Colab](#resources) and the usage example below.

Supported configuration options:

*   [static_image_mode](#static_image_mode)
*   [model_complexity](#model_complexity)
*   [smooth_landmarks](#smooth_landmarks)
*   [enable_segmentation](#enable_segmentation)
*   [smooth_segmentation](#smooth_segmentation)
*   [min_detection_confidence](#min_detection_confidence)
*   [min_tracking_confidence](#min_tracking_confidence)

```python
import cv2
import mediapipe as mp
import numpy as np  # Used below when compositing the segmentation mask.
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_holistic = mp.solutions.holistic

# For static images:
IMAGE_FILES = []
BG_COLOR = (192, 192, 192)  # gray
with mp_holistic.Holistic(
    static_image_mode=True,
    model_complexity=2,
    enable_segmentation=True) as holistic:
  for idx, file in enumerate(IMAGE_FILES):
    image = cv2.imread(file)
    image_height, image_width, _ = image.shape
    # Convert the BGR image to RGB before processing.
    results = holistic.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if results.pose_landmarks:
      print(
          f'Nose coordinates: ('
          f'{results.pose_landmarks.landmark[mp_holistic.PoseLandmark.NOSE].x * image_width}, '
          f'{results.pose_landmarks.landmark[mp_holistic.PoseLandmark.NOSE].y * image_height})'
      )

    annotated_image = image.copy()
    # Draw segmentation on the image.
    # To improve segmentation around boundaries, consider applying a joint
    # bilateral filter to "results.segmentation_mask" with "image".
    condition = np.stack((results.segmentation_mask,) * 3, axis=-1) > 0.1
    bg_image = np.zeros(image.shape, dtype=np.uint8)
    bg_image[:] = BG_COLOR
    annotated_image = np.where(condition, annotated_image, bg_image)
    # Draw pose, left and right hands, and face landmarks on the image.
    mp_drawing.draw_landmarks(
        annotated_image,
        results.face_landmarks,
        mp_holistic.FACEMESH_TESSELATION,
        landmark_drawing_spec=None,
        connection_drawing_spec=mp_drawing_styles
        .get_default_face_mesh_tesselation_style())
    mp_drawing.draw_landmarks(
        annotated_image,
        results.left_hand_landmarks,
        mp_holistic.HAND_CONNECTIONS,
        landmark_drawing_spec=mp_drawing_styles
        .get_default_hand_landmarks_style())
    mp_drawing.draw_landmarks(
        annotated_image,
        results.right_hand_landmarks,
        mp_holistic.HAND_CONNECTIONS,
        landmark_drawing_spec=mp_drawing_styles
        .get_default_hand_landmarks_style())
    mp_drawing.draw_landmarks(
        annotated_image,
        results.pose_landmarks,
        mp_holistic.POSE_CONNECTIONS,
        landmark_drawing_spec=mp_drawing_styles
        .get_default_pose_landmarks_style())
    cv2.imwrite('/tmp/annotated_image' + str(idx) + '.png', annotated_image)
    # Plot pose world landmarks.
    mp_drawing.plot_landmarks(
        results.pose_world_landmarks, mp_holistic.POSE_CONNECTIONS)

# For webcam input:
cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as holistic:
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      print("Ignoring empty camera frame.")
      # If loading a video, use 'break' instead of 'continue'.
      continue

    # To improve performance, optionally mark the image as not writeable to
    # pass by reference.
    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = holistic.process(image)

    # Draw landmark annotation on the image.
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    mp_drawing.draw_landmarks(
        image,
        results.face_landmarks,
        mp_holistic.FACEMESH_CONTOURS,
        landmark_drawing_spec=None,
        connection_drawing_spec=mp_drawing_styles
        .get_default_face_mesh_contours_style())
    mp_drawing.draw_landmarks(
        image,
        results.pose_landmarks,
        mp_holistic.POSE_CONNECTIONS,
        landmark_drawing_spec=mp_drawing_styles
        .get_default_pose_landmarks_style())
    # Flip the image horizontally for a selfie-view display.
    cv2.imshow('MediaPipe Holistic', cv2.flip(image, 1))
    if cv2.waitKey(5) & 0xFF == 27:
      break
cap.release()
```
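As noted in the webcam loop's comments, a video file is handled the same way except that the loop should `break` when frames run out. A minimal variant (the file path is a placeholder):

```python
# For video file input:
cap = cv2.VideoCapture('/path/to/video.mp4')  # Placeholder path.
with mp_holistic.Holistic(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as holistic:
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      break  # End of video reached.
    results = holistic.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    # Consume or draw `results` here, as in the webcam example above.
cap.release()
```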
### JavaScript Solution API

Please first see general [introduction](../getting_started/javascript.md) on MediaPipe in JavaScript, then learn more in the companion [web demo](#resources) and the following usage example.

Supported configuration options:

*   [modelComplexity](#model_complexity)
*   [smoothLandmarks](#smooth_landmarks)
*   [enableSegmentation](#enable_segmentation)
*   [smoothSegmentation](#smooth_segmentation)
*   [minDetectionConfidence](#min_detection_confidence)
*   [minTrackingConfidence](#min_tracking_confidence)

The example below follows the usual MediaPipe-in-JavaScript setup (CDN scripts, an input video element and an output canvas); treat it as a minimal sketch rather than a complete application.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <script src="https://cdn.jsdelivr.net/npm/@mediapipe/camera_utils/camera_utils.js" crossorigin="anonymous"></script>
  <script src="https://cdn.jsdelivr.net/npm/@mediapipe/control_utils/control_utils.js" crossorigin="anonymous"></script>
  <script src="https://cdn.jsdelivr.net/npm/@mediapipe/drawing_utils/drawing_utils.js" crossorigin="anonymous"></script>
  <script src="https://cdn.jsdelivr.net/npm/@mediapipe/holistic/holistic.js" crossorigin="anonymous"></script>
</head>

<body>
  <div class="container">
    <video class="input_video"></video>
    <canvas class="output_canvas" width="1280px" height="720px"></canvas>
  </div>
</body>
</html>
```
```javascript
const videoElement = document.getElementsByClassName('input_video')[0];
const canvasElement = document.getElementsByClassName('output_canvas')[0];
const canvasCtx = canvasElement.getContext('2d');

function onResults(results) {
  canvasCtx.save();
  canvasCtx.clearRect(0, 0, canvasElement.width, canvasElement.height);
  canvasCtx.drawImage(
      results.image, 0, 0, canvasElement.width, canvasElement.height);
  drawConnectors(canvasCtx, results.poseLandmarks, POSE_CONNECTIONS,
                 {color: '#00FF00', lineWidth: 4});
  drawLandmarks(canvasCtx, results.poseLandmarks,
                {color: '#FF0000', lineWidth: 2});
  drawConnectors(canvasCtx, results.faceLandmarks, FACEMESH_TESSELATION,
                 {color: '#C0C0C070', lineWidth: 1});
  drawConnectors(canvasCtx, results.leftHandLandmarks, HAND_CONNECTIONS,
                 {color: '#CC0000', lineWidth: 5});
  drawLandmarks(canvasCtx, results.leftHandLandmarks,
                {color: '#00FF00', lineWidth: 2});
  drawConnectors(canvasCtx, results.rightHandLandmarks, HAND_CONNECTIONS,
                 {color: '#00CC00', lineWidth: 5});
  drawLandmarks(canvasCtx, results.rightHandLandmarks,
                {color: '#FF0000', lineWidth: 2});
  canvasCtx.restore();
}

const holistic = new Holistic({locateFile: (file) => {
  return `https://cdn.jsdelivr.net/npm/@mediapipe/holistic/${file}`;
}});
holistic.setOptions({
  modelComplexity: 1,
  smoothLandmarks: true,
  enableSegmentation: true,
  smoothSegmentation: true,
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5
});
holistic.onResults(onResults);

const camera = new Camera(videoElement, {
  onFrame: async () => {
    await holistic.send({image: videoElement});
  },
  width: 1280,
  height: 720
});
camera.start();
```

## Example Apps

Please first see general instructions for [Android](../getting_started/android.md), [iOS](../getting_started/ios.md) and [desktop](../getting_started/cpp.md) on how to build MediaPipe examples.

Note: To visualize a graph, copy the graph and paste it into [MediaPipe Visualizer](https://viz.mediapipe.dev/). For more information on how to visualize its associated subgraphs, please see [visualizer documentation](../tools/visualizer.md).

### Mobile

*   Graph:
    [`mediapipe/graphs/holistic_tracking/holistic_tracking_gpu.pbtxt`](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/holistic_tracking/holistic_tracking_gpu.pbtxt)
*   Android target:
    [(or download prebuilt ARM64 APK)](https://drive.google.com/file/d/1o-Trp2GIRitA0OvmZWUQjVMa476xpfgK/view?usp=sharing)
    [`mediapipe/examples/android/src/java/com/google/mediapipe/apps/holistictrackinggpu:holistictrackinggpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/android/src/java/com/google/mediapipe/apps/holistictrackinggpu/BUILD)
*   iOS target:
    [`mediapipe/examples/ios/holistictrackinggpu:HolisticTrackingGpuApp`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/ios/holistictrackinggpu/BUILD)

### Desktop

Please first see general instructions for [desktop](../getting_started/cpp.md) on how to build MediaPipe examples.

*   Running on CPU
    *   Graph:
        [`mediapipe/graphs/holistic_tracking/holistic_tracking_cpu.pbtxt`](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/holistic_tracking/holistic_tracking_cpu.pbtxt)
    *   Target:
        [`mediapipe/examples/desktop/holistic_tracking:holistic_tracking_cpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/desktop/holistic_tracking/BUILD)
*   Running on GPU
    *   Graph:
        [`mediapipe/graphs/holistic_tracking/holistic_tracking_gpu.pbtxt`](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/holistic_tracking/holistic_tracking_gpu.pbtxt)
    *   Target:
        [`mediapipe/examples/desktop/holistic_tracking:holistic_tracking_gpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/desktop/holistic_tracking/BUILD)

## Resources

*   Google AI Blog:
    [MediaPipe Holistic - Simultaneous Face, Hand and Pose Prediction, on Device](https://ai.googleblog.com/2020/12/mediapipe-holistic-simultaneous-face.html)
*   [Models and model cards](./models.md#holistic)
*   [Web demo](https://code.mediapipe.dev/codepen/holistic)
*   [Python Colab](https://mediapipe.page.link/holistic_py_colab)