Project import generated by Copybara.
GitOrigin-RevId: 5b4c149782c086ebf9ef390195fb260ad0103217
parent 350fbb2100
commit a92cff7a60
@@ -2,6 +2,8 @@
 layout: default
 title: Pose
 parent: Solutions
+has_children: true
+has_toc: false
 nav_order: 5
 ---
 
@@ -21,10 +23,9 @@ nav_order: 5
 ## Overview
 
 Human pose estimation from video plays a critical role in various applications
-such as
-[quantifying physical exercises](#pose-classification-and-repetition-counting),
-sign language recognition, and full-body gesture control. For example, it can
-form the basis for yoga, dance, and fitness applications. It can also enable the
+such as [quantifying physical exercises](./pose_classification.md), sign
+language recognition, and full-body gesture control. For example, it can form
+the basis for yoga, dance, and fitness applications. It can also enable the
 overlay of digital content and information on top of the physical world in
 augmented reality.
 
@@ -387,121 +388,6 @@ on how to build MediaPipe examples.
     *   Target:
         [`mediapipe/examples/desktop/upper_body_pose_tracking:upper_body_pose_tracking_gpu`](https://github.com/google/mediapipe/tree/master/mediapipe/examples/desktop/upper_body_pose_tracking/BUILD)
 
-## Pose Classification and Repetition Counting
-
-One of the applications
-[BlazePose](https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html)
-can enable is fitness. More specifically - pose classification and repetition
-counting. In this section we'll provide basic guidance on building a custom pose
-classifier with the help of a
-[Colab](https://drive.google.com/file/d/19txHpN8exWhstO6WVkfmYYVC6uug_oVR/view?usp=sharing)
-and wrap it in a simple
-[fitness app](https://mediapipe.page.link/mlkit-pose-classification-demo-app)
-powered by [ML Kit](https://developers.google.com/ml-kit). Push-ups and squats
-are used for demonstration purposes as the most common exercises.
-
-*Fig 4. Pose classification and repetition counting with MediaPipe Pose.*
-
-We picked the
-[k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
-(k-NN) as the classifier. It's simple and easy to start with. The algorithm
-determines the object's class based on the closest samples in the training set.
-To build it, one needs to:
-
-*   Collect image samples of the target exercises and run pose prediction on
-    them,
-*   Convert obtained pose landmarks to a representation suitable for the k-NN
-    classifier and form a training set,
-*   Perform the classification itself followed by repetition counting.
-
-### Training Set
-
-To build a good classifier appropriate samples should be collected for the
-training set: about a few hundred samples for each terminal state of each
-exercise (e.g., "up" and "down" positions for push-ups). It's important that
-collected samples cover different camera angles, environment conditions, body
-shapes, and exercise variations.
-
-*Fig 5. Two terminal states of push-ups.*
-
-To transform samples into a k-NN classifier training set, either
-[basic](https://drive.google.com/file/d/1z4IM8kG6ipHN6keadjD-F6vMiIIgViKK/view?usp=sharing)
-or
-[extended](https://drive.google.com/file/d/19txHpN8exWhstO6WVkfmYYVC6uug_oVR/view?usp=sharing)
-Colab could be used. They both use the
-[Python Solution API](#python-solution-api) to run the BlazePose models on given
-images and dump predicted pose landmarks to a CSV file. Additionally, the
-extended Colab provides useful tools to find outliers (e.g., wrongly predicted
-poses) and underrepresented classes (e.g., not covering all camera angles) by
-classifying each sample against the entire training set. After that, you'll be
-able to test the classifier on an arbitrary video right in the Colab.
-
-### Classification
-
-Code of the classifier is available both in the
-[extended](https://drive.google.com/file/d/19txHpN8exWhstO6WVkfmYYVC6uug_oVR/view?usp=sharing)
-Colab and in the
-[ML Kit demo app](https://mediapipe.page.link/mlkit-pose-classification-demo-app).
-Please refer to them for details of the approach described below.
-
-The k-NN algorithm used for pose classification requires a feature vector
-representation of each sample and a metric to compute the distance between two
-such vectors to find the nearest pose samples to a target one.
-
-To convert pose landmarks to a feature vector, we use pairwise distances between
-predefined lists of pose joints, such as distances between wrist and shoulder,
-ankle and hip, and two wrists. Since the algorithm relies on distances, all
-poses are normalized to have the same torso size and vertical torso orientation
-before the conversion.
-
-*Fig 6. Main pairwise distances used for the pose feature vector.*
-
-To get a better classification result, k-NN search is invoked twice with
-different distance metrics:
-
-*   First, to filter out samples that are almost the same as the target one but
-    have only a few different values in the feature vector (which means
-    differently bent joints and thus other pose class), minimum per-coordinate
-    distance is used as distance metric,
-*   Then average per-coordinate distance is used to find the nearest pose
-    cluster among those from the first search.
-
-Finally, we apply
-[exponential moving average](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average)
-(EMA) smoothing to level any noise from pose prediction or classification. To do
-that, we search not only for the nearest pose cluster, but we calculate a
-probability for each of them and use it for smoothing over time.
-
-### Repetition Counter
-
-To count the repetitions, the algorithm monitors the probability of a target
-pose class. Let's take push-ups with its "up" and "down" terminal states:
-
-*   When the probability of the "down" pose class passes a certain threshold for
-    the first time, the algorithm marks that the "down" pose class is entered.
-*   Once the probability drops below the threshold, the algorithm marks that the
-    "down" pose class has been exited and increases the counter.
-
-To avoid cases when the probability fluctuates around the threshold (e.g., when
-the user pauses between "up" and "down" states) causing phantom counts, the
-threshold used to detect when the state is exited is actually slightly lower
-than the one used to detect when the state is entered. It creates an interval
-where the pose class and the counter can't be changed.
-
-### Future Work
-
-We are actively working on improving BlazePose GHUM 3D's Z prediction. It will
-allow us to use joint angles in the feature vectors, which are more natural and
-easier to configure (although distances can still be useful to detect touches
-between body parts) and to perform rotation normalization of poses and reduce
-the number of camera angles required for accurate k-NN classification.
-
 ## Resources
 
 *   Google AI Blog:
@@ -512,5 +398,3 @@ the number of camera angles required for accurate k-NN classification.
 *   [Models and model cards](./models.md#pose)
 *   [Web demo](https://code.mediapipe.dev/codepen/pose)
 *   [Python Colab](https://mediapipe.page.link/pose_py_colab)
-*   [Pose Classification Colab (Basic)](https://mediapipe.page.link/pose_classification_basic)
-*   [Pose Classification Colab (Extended)](https://mediapipe.page.link/pose_classification_extended)
docs/solutions/pose_classification.md (new file, 142 lines)
@@ -0,0 +1,142 @@
---
layout: default
title: Pose Classification
parent: Pose
grand_parent: Solutions
nav_order: 1
---

# Pose Classification
{: .no_toc }

<details close markdown="block">
  <summary>
    Table of contents
  </summary>
  {: .text-delta }
1. TOC
{:toc}
</details>
---

## Overview

One of the applications
[BlazePose](https://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html)
can enable is fitness, more specifically pose classification and repetition
counting. In this section we'll provide basic guidance on building a custom pose
classifier with the help of [Colabs](#colabs) and wrap it in a simple
[fitness app](https://mediapipe.page.link/mlkit-pose-classification-demo-app)
powered by [ML Kit](https://developers.google.com/ml-kit). Push-ups and squats
are used for demonstration purposes as the most common exercises.

*Fig 1. Pose classification and repetition counting with MediaPipe Pose.*

We picked the
[k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
(k-NN) as the classifier. It's simple and easy to start with. The algorithm
determines the object's class based on the closest samples in the training set.

**To build it, one needs to:**

1.  Collect image samples of the target exercises and run pose prediction on
    them,
2.  Convert the obtained pose landmarks to a representation suitable for the
    k-NN classifier and form a training set using these [Colabs](#colabs),
3.  Perform the classification itself followed by repetition counting (e.g., in
    the
    [ML Kit demo app](https://mediapipe.page.link/mlkit-pose-classification-demo-app)).

## Training Set

To build a good classifier, appropriate samples should be collected for the
training set: a few hundred samples for each terminal state of each exercise
(e.g., the "up" and "down" positions for push-ups). It's important that the
collected samples cover different camera angles, environment conditions, body
shapes, and exercise variations.

*Fig 2. Two terminal states of push-ups.*

To transform the samples into a k-NN classifier training set, either the
[`Pose Classification Colab (Basic)`] or the
[`Pose Classification Colab (Extended)`] can be used. Both use the
[Python Solution API](./pose.md#python-solution-api) to run the BlazePose models
on the given images and dump the predicted pose landmarks to a CSV file.
Additionally, the [`Pose Classification Colab (Extended)`] provides useful tools
to find outliers (e.g., wrongly predicted poses) and underrepresented classes
(e.g., samples not covering all camera angles) by classifying each sample
against the entire training set. After that, you'll be able to test the
classifier on an arbitrary video right in the Colab.
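As an illustration of this step, here is a minimal sketch (not the Colabs'
exact code) that runs the Python Solution API over a folder of images, one
sub-folder per pose class, and writes the predicted landmarks to a CSV file.
The folder layout, function name, and CSV schema are assumptions made for the
example.

```python
# Minimal sketch: run BlazePose on sample images and dump landmarks to a CSV.
# Assumes one sub-folder per pose class, e.g. `fitness_poses/pushups_down/`.
import csv
import os

import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose


def dump_landmarks_to_csv(images_dir: str, csv_path: str) -> None:
  """Writes one row per image: name, class, then 33 x (x, y, z, visibility)."""
  with mp_pose.Pose(static_image_mode=True) as pose, \
      open(csv_path, 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for class_name in sorted(os.listdir(images_dir)):
      class_dir = os.path.join(images_dir, class_name)
      if not os.path.isdir(class_dir):
        continue
      for image_name in sorted(os.listdir(class_dir)):
        image = cv2.imread(os.path.join(class_dir, image_name))
        # MediaPipe expects RGB input, while OpenCV loads images as BGR.
        results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks is None:
          continue  # Skip images where no pose was detected.
        row = [image_name, class_name]
        for landmark in results.pose_landmarks.landmark:
          row += [landmark.x, landmark.y, landmark.z, landmark.visibility]
        writer.writerow(row)
```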

## Classification

The code of the classifier is available both in the
[`Pose Classification Colab (Extended)`] and in the
[ML Kit demo app](https://mediapipe.page.link/mlkit-pose-classification-demo-app).
Please refer to them for the details of the approach described below.

The k-NN algorithm used for pose classification requires a feature vector
representation of each sample and a metric to compute the distance between two
such vectors, in order to find the pose samples nearest to a target one.

To convert pose landmarks to a feature vector, we use pairwise distances between
predefined lists of pose joints, such as the distances between wrist and
shoulder, ankle and hip, and the two wrists. Since the algorithm relies on
distances, all poses are normalized to have the same torso size and a vertical
torso orientation before the conversion.

*Fig 3. Main pairwise distances used for the pose feature vector.*
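A minimal NumPy sketch of this conversion, assuming `landmarks` is a `(33, 3)`
array in the BlazePose topology; the pair list below is an illustrative subset
rather than the Colabs' exact list, and rotation normalization is omitted for
brevity.

```python
# Minimal sketch of the pose-to-feature-vector conversion. The joint pairs are
# an illustrative subset; rotation normalization is omitted for brevity.
import numpy as np

# Landmark indices from the BlazePose topology.
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12
LEFT_WRIST, RIGHT_WRIST = 15, 16
LEFT_HIP, RIGHT_HIP = 23, 24
LEFT_ANKLE, RIGHT_ANKLE = 27, 28


def pose_embedding(landmarks: np.ndarray) -> np.ndarray:
  """Converts a (33, 3) landmark array into a vector of pairwise distances."""
  # Normalize translation and scale: center the pose on the hip center and
  # divide by the torso size (hip center to shoulder center distance).
  hip_center = (landmarks[LEFT_HIP] + landmarks[RIGHT_HIP]) / 2
  shoulder_center = (landmarks[LEFT_SHOULDER] + landmarks[RIGHT_SHOULDER]) / 2
  torso_size = np.linalg.norm(shoulder_center - hip_center)
  normalized = (landmarks - hip_center) / torso_size

  pairs = [
      (LEFT_WRIST, LEFT_SHOULDER), (RIGHT_WRIST, RIGHT_SHOULDER),
      (LEFT_ANKLE, LEFT_HIP), (RIGHT_ANKLE, RIGHT_HIP),
      (LEFT_WRIST, RIGHT_WRIST),
  ]
  return np.array(
      [np.linalg.norm(normalized[a] - normalized[b]) for a, b in pairs])
```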
To get a better classification result, k-NN search is invoked twice with
different distance metrics (see the sketch after this list):

*   First, to filter out samples that are almost the same as the target one but
    have a few significantly different values in the feature vector (which
    means differently bent joints and thus another pose class), the maximum
    per-coordinate distance is used as the distance metric,
*   Then the average per-coordinate distance is used to find the nearest pose
    cluster among those from the first search.
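A sketch of that two-pass search, assuming the training set is a list of
`(class_name, embedding)` pairs built with the helpers above; the top-N values
and the voting scheme are illustrative, not the Colabs' exact parameters.

```python
# Minimal sketch of the two-pass k-NN search. Top-N values are illustrative.
from collections import Counter

import numpy as np


def classify_pose(target, samples,
                  top_n_by_max_distance=30, top_n_by_mean_distance=10):
  """Returns per-class probabilities for a target embedding."""
  # Pass 1: rank by MAXIMUM per-coordinate distance, so a sample that differs
  # a lot in even one coordinate (a differently bent joint) ranks far away and
  # is filtered out.
  by_max = sorted(samples, key=lambda s: np.max(np.abs(s[1] - target)))
  candidates = by_max[:top_n_by_max_distance]

  # Pass 2: rank the remaining candidates by AVERAGE per-coordinate distance
  # to find the nearest pose cluster.
  by_mean = sorted(candidates, key=lambda s: np.mean(np.abs(s[1] - target)))
  nearest = by_mean[:top_n_by_mean_distance]

  # Class votes among the nearest samples serve as the per-class probabilities
  # used for the EMA smoothing described below.
  votes = Counter(class_name for class_name, _ in nearest)
  return {name: count / len(nearest) for name, count in votes.items()}
```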
Finally, we apply
[exponential moving average](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average)
(EMA) smoothing to level out any noise from pose prediction or classification.
To do that, we search not only for the nearest pose cluster, but also calculate
a probability for each cluster and use it for smoothing over time.
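A sketch of the smoothing step over the per-class probabilities; the smoothing
factor is illustrative.

```python
# Minimal sketch of EMA smoothing over per-class probabilities. The smoothing
# factor `alpha` is illustrative; larger values react faster but smooth less.
class EmaSmoother:

  def __init__(self, class_names, alpha=0.2):
    self._alpha = alpha
    self._smoothed = {name: 0.0 for name in class_names}

  def update(self, probabilities):
    """Blends new per-class probabilities into the running averages."""
    for name in self._smoothed:
      current = probabilities.get(name, 0.0)
      self._smoothed[name] = (
          self._alpha * current + (1.0 - self._alpha) * self._smoothed[name])
    return dict(self._smoothed)
```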

## Repetition Counting

To count the repetitions, the algorithm monitors the probability of a target
pose class. Let's take push-ups, with their "up" and "down" terminal states:

*   When the probability of the "down" pose class passes a certain threshold for
    the first time, the algorithm marks that the "down" pose class has been
    entered.
*   Once the probability drops below the threshold, the algorithm marks that the
    "down" pose class has been exited and increments the counter.

To avoid cases when the probability fluctuates around the threshold (e.g., when
the user pauses between the "up" and "down" states), causing phantom counts, the
threshold used to detect when the state is exited is slightly lower than the one
used to detect when the state is entered. This creates an interval where neither
the pose class nor the counter can change, as the sketch below shows.
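A sketch of such a counter; the class name and the two threshold values are
illustrative.

```python
# Minimal sketch of the repetition counter with hysteresis: the exit threshold
# is lower than the enter threshold, so jitter around a single threshold can't
# produce phantom counts. The class name and thresholds are illustrative.
class RepetitionCounter:

  def __init__(self, class_name='pushups_down',
               enter_threshold=0.8, exit_threshold=0.6):
    self._class_name = class_name
    self._enter_threshold = enter_threshold
    self._exit_threshold = exit_threshold
    self._pose_entered = False
    self.count = 0

  def update(self, smoothed_probabilities):
    """Updates the counter from smoothed per-class probabilities."""
    probability = smoothed_probabilities.get(self._class_name, 0.0)
    if not self._pose_entered:
      # Enter the terminal state only when the HIGHER threshold is passed.
      self._pose_entered = probability > self._enter_threshold
    elif probability < self._exit_threshold:
      # Exit on the LOWER threshold and count one full repetition.
      self.count += 1
      self._pose_entered = False
    return self.count
```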

## Future Work

We are actively working on improving BlazePose GHUM 3D's Z prediction. It will
allow us to use joint angles in the feature vectors, which are more natural and
easier to configure (although distances can still be useful to detect touches
between body parts), and to perform rotation normalization of poses, reducing
the number of camera angles required for accurate k-NN classification.

## Colabs

*   [`Pose Classification Colab (Basic)`]
*   [`Pose Classification Colab (Extended)`]

[`Pose Classification Colab (Basic)`]: https://mediapipe.page.link/pose_classification_basic
[`Pose Classification Colab (Extended)`]: https://mediapipe.page.link/pose_classification_extended