---
layout: default
title: Hands
parent: Solutions
nav_order: 4
---

# MediaPipe Hands
{: .no_toc }

1. TOC
{:toc}
---

## Overview

The ability to perceive the shape and motion of hands can be a vital component
in improving the user experience across a variety of technological domains and
platforms. For example, it can form the basis for sign language understanding
and hand gesture control, and can also enable the overlay of digital content and
information on top of the physical world in augmented reality. While coming
naturally to people, robust real-time hand perception is a decidedly challenging
computer vision task, as hands often occlude themselves or each other (e.g.
finger/palm occlusions and handshakes) and lack high-contrast patterns.

MediaPipe Hands is a high-fidelity hand and finger tracking solution. It employs
machine learning (ML) to infer 21 3D landmarks of a hand from just a single
frame. Whereas current state-of-the-art approaches rely primarily on powerful
desktop environments for inference, our method achieves real-time performance on
a mobile phone, and even scales to multiple hands. We hope that providing this
hand perception functionality to the wider research and development community
will result in an emergence of creative use cases, stimulating new applications
and new research avenues.

![hand_tracking_3d_android_gpu.gif](../images/mobile/hand_tracking_3d_android_gpu.gif) |
:------------------------------------------------------------------------------------: |
*Fig 1. Tracked 3D hand landmarks are represented by dots in different shades, with the brighter ones denoting landmarks closer to the camera.* |

## ML Pipeline

MediaPipe Hands utilizes an ML pipeline consisting of two models working
together: a palm detection model that operates on the full image and returns an
oriented hand bounding box, and a hand landmark model that operates on the
cropped image region defined by the palm detector and returns high-fidelity 3D
hand keypoints. This strategy is similar to that employed in our
[MediaPipe Face Mesh](./face_mesh.md) solution, which uses a face detector
together with a face landmark model.

Providing the accurately cropped hand image to the hand landmark model
drastically reduces the need for data augmentation (e.g. rotations, translation
and scale) and instead allows the network to dedicate most of its capacity
towards coordinate prediction accuracy. In addition, in our pipeline the crops
can also be generated based on the hand landmarks identified in the previous
frame, and palm detection is invoked to relocalize the hand only when the
landmark model can no longer identify hand presence.

The pipeline is implemented as a MediaPipe
[graph](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/hand_tracking/hand_tracking_mobile.pbtxt)
that uses a
[hand landmark tracking subgraph](https://github.com/google/mediapipe/tree/master/mediapipe/modules/hand_landmark/hand_landmark_tracking_gpu.pbtxt)
from the
[hand landmark module](https://github.com/google/mediapipe/tree/master/mediapipe/modules/hand_landmark),
and renders using a dedicated
[hand renderer subgraph](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/hand_tracking/subgraphs/hand_renderer_gpu.pbtxt).
The
[hand landmark tracking subgraph](https://github.com/google/mediapipe/tree/master/mediapipe/modules/hand_landmark/hand_landmark_tracking_gpu.pbtxt)
internally uses a
[hand landmark subgraph](https://github.com/google/mediapipe/tree/master/mediapipe/modules/hand_landmark/hand_landmark_gpu.pbtxt)
from the same module and a
[palm detection subgraph](https://github.com/google/mediapipe/tree/master/mediapipe/modules/palm_detection/palm_detection_gpu.pbtxt)
from the
[palm detection module](https://github.com/google/mediapipe/tree/master/mediapipe/modules/palm_detection).

Note: To visualize a graph, copy the graph and paste it into
[MediaPipe Visualizer](https://viz.mediapipe.dev/). For more information on how
to visualize its associated subgraphs, please see
[visualizer documentation](../tools/visualizer.md).

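The detect-then-track strategy described in this section can be sketched in a
few lines of Python. This is only an illustration of the control flow, not the
actual graph implementation: `track_hand`, `box_from_landmarks`, `detect_palm`
and `predict_landmarks` are hypothetical names, the two models are passed in as
plain callables, only a single hand is handled, and the crop is an axis-aligned
box rather than the oriented box the real pipeline uses.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Landmarks = List[Tuple[float, float, float]]  # (x, y, z) per landmark


def box_from_landmarks(landmarks: Landmarks, margin: float = 0.1) -> Box:
  """Returns an axis-aligned box around the landmarks, padded by `margin`."""
  xs = [p[0] for p in landmarks]
  ys = [p[1] for p in landmarks]
  pad_x = (max(xs) - min(xs)) * margin
  pad_y = (max(ys) - min(ys)) * margin
  return (min(xs) - pad_x, min(ys) - pad_y, max(xs) + pad_x, max(ys) + pad_y)


def track_hand(
    frames: Iterable[object],
    detect_palm: Callable[[object], Optional[Box]],
    predict_landmarks: Callable[[object, Box], Optional[Landmarks]],
) -> Iterator[Optional[Landmarks]]:
  """Yields landmarks (or None) per frame, detecting only when needed."""
  box: Optional[Box] = None
  for frame in frames:
    if box is None:
      # No crop from the previous frame: run detection on the full image.
      box = detect_palm(frame)
    if box is None:
      yield None
      continue
    landmarks = predict_landmarks(frame, box)
    # Reuse the landmarks to crop the next frame. If the landmark model
    # reports no hand, clear the box so the next frame triggers detection.
    box = box_from_landmarks(landmarks) if landmarks else None
    yield landmarks
```

With this structure the detector runs on the first frame and again only after
the landmark model loses the hand, which is what keeps the per-frame cost low.
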
## Models

### Palm Detection Model

To detect initial hand locations, we designed a
[single-shot detector](https://arxiv.org/abs/1512.02325) model optimized for
real-time mobile use, in a manner similar to the face detection model in
[MediaPipe Face Mesh](./face_mesh.md). Detecting hands is a decidedly complex
task: our
[model](https://github.com/google/mediapipe/tree/master/mediapipe/models/palm_detection.tflite) has
to work across a variety of hand sizes with a large scale span (~20x) relative
to the image frame and be able to detect occluded and self-occluded hands.
Whereas faces have high-contrast patterns, e.g., in the eye and mouth region,
the lack of such features in hands makes it comparatively difficult to detect
them reliably from their visual features alone. Instead, providing additional
context, like arm, body, or person features, aids accurate hand localization.

Our method addresses the above challenges using different strategies. First, we
train a palm detector instead of a hand detector, since estimating bounding
boxes of rigid objects like palms and fists is significantly simpler than
detecting hands with articulated fingers. In addition, as palms are smaller
objects, the non-maximum suppression algorithm works well even for two-hand
self-occlusion cases, like handshakes. Moreover, palms can be modelled using
square bounding boxes (anchors in ML terminology) ignoring other aspect ratios,
thereby reducing the number of anchors by a factor of 3-5. Second, an
encoder-decoder feature extractor is used for bigger scene context awareness
even for small objects (similar to the RetinaNet approach). Lastly, we minimize
the focal loss during training to support the large number of anchors resulting
from the high scale variance.

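The anchor reduction mentioned above is simple arithmetic, illustrated below.
The feature-map sizes are hypothetical, and the sketch assumes that a
conventional detector would place 3 to 5 aspect ratios at every location while
the square-only design places one; neither number is taken from the actual
model configuration.

```python
from typing import Sequence


def count_anchors(grid_sizes: Sequence[int], anchors_per_cell: int) -> int:
  """Total anchors over square feature maps with the given side lengths."""
  return sum(side * side * anchors_per_cell for side in grid_sizes)


# Hypothetical feature-map resolutions, chosen only for illustration.
grid_sizes = [32, 16, 8]

square_only = count_anchors(grid_sizes, 1)   # One square anchor per cell.
three_ratios = count_anchors(grid_sizes, 3)  # Assumed 3 aspect ratios.
five_ratios = count_anchors(grid_sizes, 5)   # Assumed 5 aspect ratios.

# The ratio equals the number of aspect ratios dropped per cell.
print(three_ratios / square_only, five_ratios / square_only)
```
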
With the above techniques, we achieve an average precision of 95.7% in palm
detection. Using a regular cross entropy loss and no decoder gives a baseline of
just 86.22%.

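For readers unfamiliar with the focal loss that is contrasted with cross
entropy above, here is a minimal per-anchor sketch of the standard binary
formulation from the RetinaNet paper. The function name and the default values
of `gamma` and `alpha` are assumptions (they are the defaults reported in that
paper), not the settings used to train the palm detector.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0,
                      alpha: float = 0.25) -> float:
  """Focal loss for one anchor: -alpha_t * (1 - p_t)**gamma * log(p_t).

  Args:
    p: Predicted probability of the positive class, strictly between 0 and 1.
    y: Ground-truth label, 1 for a palm anchor and 0 for background.
    gamma: Focusing parameter; 0 removes the down-weighting of easy examples.
    alpha: Weight of the positive class; negatives get 1 - alpha.
  """
  p_t = p if y == 1 else 1.0 - p
  alpha_t = alpha if y == 1 else 1.0 - alpha
  return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The `(1 - p_t)**gamma` factor shrinks the contribution of well-classified
anchors, so the many easy background anchors do not dominate the total loss.
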
### Hand Landmark Model

After palm detection over the whole image, our subsequent hand landmark
[model](https://github.com/google/mediapipe/tree/master/mediapipe/models/hand_landmark.tflite)
performs precise keypoint localization of 21 3D hand-knuckle coordinates inside
the detected hand regions via regression, that is, direct coordinate prediction.
The model learns a consistent internal hand pose representation and is robust
even to partially visible hands and self-occlusions.

To obtain ground truth data, we have manually annotated ~30K real-world images
with 21 3D coordinates, as shown below (the Z-value is taken from the image
depth map where it exists for the corresponding coordinate). To better cover the
possible hand poses and provide additional supervision on the nature of hand
geometry, we also render a high-quality synthetic hand model over various
backgrounds and map it to the corresponding 3D coordinates.

| ![hand_crops.png](../images/mobile/hand_crops.png)                          |
| :-------------------------------------------------------------------------: |
| *Fig 2. Top: Aligned hand crops passed to the tracking network with ground truth annotation. Bottom: Rendered synthetic hand images with ground truth annotation.* |

## Example Apps

Please first see general instructions for
[Android](../getting_started/building_examples.md#android), [iOS](../getting_started/building_examples.md#ios)
and [desktop](../getting_started/building_examples.md#desktop) on how to build MediaPipe
examples.

Note: To visualize a graph, copy the graph and paste it into
[MediaPipe Visualizer](https://viz.mediapipe.dev/). For more information on how
to visualize its associated subgraphs, please see
[visualizer documentation](../tools/visualizer.md).

### Mobile

#### Main Example

*   Graph:
    [`mediapipe/graphs/hand_tracking/hand_tracking_mobile.pbtxt`](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/hand_tracking/hand_tracking_mobile.pbtxt)
*   Android target:
    [(or download prebuilt ARM64 APK)](https://drive.google.com/open?id=1uCjS0y0O0dTDItsMh8x2cf4-l3uHW1vE)
    [`mediapipe/examples/android/src/java/com/google/mediapipe/apps/handtrackinggpu:handtrackinggpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/android/src/java/com/google/mediapipe/apps/handtrackinggpu/BUILD)
*   iOS target:
    [`mediapipe/examples/ios/handtrackinggpu:HandTrackingGpuApp`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/ios/handtrackinggpu/BUILD)

Tip: The maximum number of hands to detect/process is set to 2 by default. To
change it, for Android modify `NUM_HANDS` in
[MainActivity.java](https://github.com/google/mediapipe/tree/master/mediapipe/examples/android/src/java/com/google/mediapipe/apps/handtrackinggpu/MainActivity.java),
and for iOS modify `kNumHands` in
[HandTrackingViewController.mm](https://github.com/google/mediapipe/tree/master/mediapipe/examples/ios/handtrackinggpu/HandTrackingViewController.mm).

#### Palm/Hand Detection Only (no landmarks)

*   Graph:
    [`mediapipe/graphs/hand_tracking/hand_detection_mobile.pbtxt`](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/hand_tracking/hand_detection_mobile.pbtxt)
*   Android target:
    [(or download prebuilt ARM64 APK)](https://drive.google.com/open?id=1qUlTtH7Ydg-wl_H6VVL8vueu2UCTu37E)
    [`mediapipe/examples/android/src/java/com/google/mediapipe/apps/handdetectiongpu:handdetectiongpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/android/src/java/com/google/mediapipe/apps/handdetectiongpu/BUILD)
*   iOS target:
    [`mediapipe/examples/ios/handdetectiongpu:HandDetectionGpuApp`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/ios/handdetectiongpu/BUILD)

### Desktop

*   Running on CPU
    *   Graph:
        [`mediapipe/graphs/hand_tracking/hand_tracking_desktop_live.pbtxt`](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/hand_tracking/hand_tracking_desktop_live.pbtxt)
    *   Target:
        [`mediapipe/examples/desktop/hand_tracking:hand_tracking_cpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/desktop/hand_tracking/BUILD)
*   Running on GPU
    *   Graph:
        [`mediapipe/graphs/hand_tracking/hand_tracking_desktop_live_gpu.pbtxt`](https://github.com/google/mediapipe/tree/master/mediapipe/graphs/hand_tracking/hand_tracking_desktop_live_gpu.pbtxt)
    *   Target:
        [`mediapipe/examples/desktop/hand_tracking:hand_tracking_gpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/desktop/hand_tracking/BUILD)

Tip: The maximum number of hands to detect/process is set to 2 by default. To
change it, modify the option of `ConstantSidePacketCalculator` in the graph
file.

### Python

The MediaPipe Python package is available on
[PyPI](https://pypi.org/project/mediapipe/) and can be installed with
`pip install mediapipe` on Linux and macOS, as described below and in this
[colab](https://mediapipe.page.link/hands_py_colab). If you do need to build the
Python package from source, see
[additional instructions](../getting_started/building_examples.md#python).

Activate a Python virtual environment:

```bash
$ python3 -m venv mp_env && source mp_env/bin/activate
```

Install the MediaPipe Python package:

```bash
(mp_env)$ pip install mediapipe
```

Run the following Python code:

<!-- Do not change the example code below directly. Change the corresponding example in mediapipe/python/solutions/hands.py and copy it over. -->

```python
import cv2
import mediapipe as mp
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

# For static images:
file_list = []  # Placeholder: fill in with the paths of your input images.
hands = mp_hands.Hands(
    static_image_mode=True,
    max_num_hands=2,
    min_detection_confidence=0.7)
for idx, file in enumerate(file_list):
  # Read an image and flip it around the y-axis for correct handedness output
  # (handedness is determined assuming a mirrored, selfie-view input).
  image = cv2.flip(cv2.imread(file), 1)
  # Convert the BGR image to RGB before processing.
  results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

  # Print handedness and draw hand landmarks on the image.
  print('handedness:', results.multi_handedness)
  if not results.multi_hand_landmarks:
    continue
  annotated_image = image.copy()
  for hand_landmarks in results.multi_hand_landmarks:
    print('hand_landmarks:', hand_landmarks)
    mp_drawing.draw_landmarks(
        annotated_image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
  # Flip the annotated image back to the original orientation before saving.
  cv2.imwrite(
      '/tmp/annotated_image' + str(idx) + '.png', cv2.flip(annotated_image, 1))
hands.close()

# For webcam input:
hands = mp_hands.Hands(
    min_detection_confidence=0.7, min_tracking_confidence=0.5)
cap = cv2.VideoCapture(0)
while cap.isOpened():
  success, image = cap.read()
  if not success:
    break

  # Flip the image horizontally for a later selfie-view display, and convert
  # the BGR image to RGB.
  image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)
  # To improve performance, optionally mark the image as not writeable to
  # pass by reference.
  image.flags.writeable = False
  results = hands.process(image)

  # Draw the hand annotations on the image.
  image.flags.writeable = True
  image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
  if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
      mp_drawing.draw_landmarks(
          image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
  cv2.imshow('MediaPipe Hands', image)
  # Exit when the Esc key is pressed.
  if cv2.waitKey(5) & 0xFF == 27:
    break
hands.close()
cap.release()
```

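The landmarks returned by the example above carry `x` and `y` values normalized
by the image width and height, so drawing or measuring in pixels needs a
conversion step. The helper below is a small sketch of that step;
`to_pixel_coords` is a hypothetical name, not part of the MediaPipe API, and
the clamping is an assumption made here to keep slightly out-of-frame landmarks
inside the image.

```python
from typing import Tuple


def to_pixel_coords(x: float, y: float, image_width: int,
                    image_height: int) -> Tuple[int, int]:
  """Maps normalized landmark coordinates to clamped pixel indices."""
  px = min(int(x * image_width), image_width - 1)
  py = min(int(y * image_height), image_height - 1)
  # Landmarks that fall slightly outside the frame are clamped to the border.
  return max(px, 0), max(py, 0)
```

For instance, given `image_height, image_width, _ = image.shape`, calling
`to_pixel_coords(lm.x, lm.y, image_width, image_height)` for each `lm` in
`hand_landmarks.landmark` yields the pixel position of every landmark.
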
Tip: Use command `deactivate` to exit the Python virtual environment.

### Web

Please refer to [these instructions](../index.md#mediapipe-on-the-web).

## Resources

*   Google AI Blog:
    [On-Device, Real-Time Hand Tracking with MediaPipe](https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html)
*   TensorFlow Blog:
    [Face and hand tracking in the browser with MediaPipe and TensorFlow.js](https://blog.tensorflow.org/2020/03/face-and-hand-tracking-in-browser-with-mediapipe-and-tensorflowjs.html)
*   Paper:
    [MediaPipe Hands: On-device Real-time Hand Tracking](https://arxiv.org/abs/2006.10214)
    ([presentation](https://www.youtube.com/watch?v=I-UOrvxxXEk))
*   [Models and model cards](./models.md#hands)