# MediaSequence keys and functions reference

The documentation below will first provide an overview of using MediaSequence
for machine learning tasks. Then, the documentation will describe the function
prototypes used in MediaSequence for storing multimedia data in
SequenceExamples. Finally, the documentation will describe the specific keys
for storing specific types of data.

## Overview of MediaSequence for machine learning

The goal of MediaSequence is to provide a tool for transforming annotations of
multimedia into input examples ready for use with machine learning models in
TensorFlow. The most semantically appropriate data type for this task that can
be easily parsed in TensorFlow is
`tensorflow.train.SequenceExample` / `tensorflow::SequenceExample`.
Using SequenceExamples enables quick integration of new features into
TensorFlow pipelines, easy open sourcing of models and data, reasonable
debugging, and efficient TensorFlow decoding. For many machine learning tasks,
TensorFlow Examples are capable of fulfilling that role. However, Examples can
become unwieldy for sequence data, particularly when the number of features
per timestep varies, creating a ragged structure. Video object detection is
one example task that requires this ragged structure because the number of
detections per frame varies. SequenceExamples can easily encode this ragged
structure. Sequences naturally match the semantics of video as a sequence of
frames or other common media patterns, and the interpretable semantics
simplify debugging and decoding of potentially complicated data. One potential
disadvantage of SequenceExamples is that keys and formats can vary widely. The
MediaSequence library provides tools for manipulating and decoding
SequenceExamples in a consistent format in Python and C++. The consistent
format enables creating a pipeline for processing data sets. A goal of
MediaSequence as a pipeline is that users should only need to specify the
metadata (e.g. videos and labels) for their task. The pipeline will turn the
metadata into training data.
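
As a minimal sketch of that decoding step, the snippet below parses a
serialized SequenceExample with standard TensorFlow ops, using the
`clip/label/index` and `image/encoded` keys documented in the tables later in
this document. The parser shapes and the JPEG assumption are illustrative.

```python
import tensorflow as tf

def parse_fn(serialized_example):
  # Keys follow the tables below; adjust parsers to the features you stored.
  context_features = {
      "clip/label/index": tf.io.VarLenFeature(tf.int64),
  }
  sequence_features = {
      "image/encoded": tf.io.FixedLenSequenceFeature([], dtype=tf.string),
  }
  context, feature_lists = tf.io.parse_single_sequence_example(
      serialized_example,
      context_features=context_features,
      sequence_features=sequence_features)
  # Decode the stored JPEG bytes into a [num_frames, height, width, 3] tensor.
  images = tf.map_fn(tf.io.decode_jpeg, feature_lists["image/encoded"],
                     fn_output_signature=tf.uint8)
  labels = tf.sparse.to_dense(context["clip/label/index"])
  return images, labels
```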

The pipeline has two stages. First, users must generate the metadata
describing the data and applicable labels. This process is straightforward and
described in the next section. Second, users run MediaPipe graphs with the
`UnpackMediaSequenceCalculator` and `PackMediaSequenceCalculator` to extract
the relevant data from multimedia files. A sequence of graphs can be chained
together in this second stage to achieve complex processing, such as first
extracting a subset of frames from a video and then extracting deep features
or object detections for each extracted frame. Because MediaPipe is built to
process media files simply and reproducibly, the two-stage approach separates
and simplifies data management.

### Creating metadata for a new data set

Generating examples for a new data set typically only requires defining the
metadata. MediaPipe graphs can interpret this metadata to fill out the
SequenceExamples using the `UnpackMediaSequenceCalculator` and
`PackMediaSequenceCalculator`. This section lists the metadata required for
different types of tasks and provides a brief description of the data filled
in by MediaPipe. The input media will be referred to as video because that is
a common case, but audio files or other sequences could be supported. The
function calls in the Python API will be used in examples; the equivalent C++
calls are described below.

The video metadata describes how to access the video: use `set_clip_data_path`
to define the path on disk, and `set_clip_start_timestamp` and
`set_clip_end_timestamp` to define the time span to include. The data path can
be absolute or relative to a root directory passed to the
`UnpackMediaSequenceCalculator`. The start and end timestamps should be valid
MediaPipe timestamps in microseconds. Given this information, the pipeline can
extract the portion of the media between the start and end timestamps. If you
do not specify a start time, the video is decoded from the beginning. If you
do not specify an end time, the entire video is decoded. Timestamps that are
left unset remain unset in the output.

The features extracted from the video depend on the MediaPipe graph that is
run. The documentation of keys below and in `PackMediaSequenceCalculator`
provides the best description.

The annotations, including labels, should be added as metadata. They will be
passed through the MediaPipe pipeline unchanged. The label format will vary
depending on the task you want to do. Several examples are included below. In
general, the MediaPipe processing is independent of any labels that you
provide: only the clip data path, start time, and end time matter.

#### Clip classification

For clip classification (e.g. is this video clip about basketball?), you
should use `set_clip_label_index` with the integer index of the correct class
and `set_clip_label_string` with the human-readable version of the correct
class. The index is often used when training the model, and the string is
used for human-readable debugging. The same number of indices and strings
must be provided; indices and strings are associated by their relative
positions in the lists.

##### Example lines creating metadata for clip classification

```python
# Python: functions from media_sequence.py as ms
sequence = tf.train.SequenceExample()
ms.set_clip_data_path(b"path_to_video", sequence)
ms.set_clip_start_timestamp(1000000, sequence)
ms.set_clip_end_timestamp(6000000, sequence)
ms.set_clip_label_index((4, 3), sequence)
ms.set_clip_label_string((b"run", b"jump"), sequence)
```

```c++
// C++: functions from media_sequence.h
tensorflow::SequenceExample sequence;
SetClipDataPath("path_to_video", &sequence);
SetClipStartTimestamp(1000000, &sequence);
SetClipEndTimestamp(6000000, &sequence);
SetClipLabelIndex({4, 3}, &sequence);
SetClipLabelString({"run", "jump"}, &sequence);
```

#### Temporal detection

For temporal event detection or localization, e.g. classifying regions in
time where people are playing a sport, the labels are referred to as
segments. You need to set the segment timespans with
`set_segment_start_timestamp` and `set_segment_end_timestamp` and labels with
`set_segment_label_index` and `set_segment_label_string`. All of these are
repeated fields, so you can provide multiple segments for each clip. The
label index and string have the same meaning as for clip classification. Only
the start and end timestamps need to be provided. (The pipeline will
automatically call `set_segment_start_index` with the index of the image
frame under the image/timestamp key that is closest in time, and similarly
for `set_segment_end_index`. Allowing the pipeline to fill in the indices
corrects for frame rate changes automatically.) The same number of values
must be present in each field. If the same segment would have multiple
labels, the segment start and end timestamps must be duplicated.

##### Example lines creating metadata for temporal detection

```python
# Python: functions from media_sequence.py as ms
sequence = tf.train.SequenceExample()
ms.set_clip_data_path(b"path_to_video", sequence)
ms.set_clip_start_timestamp(1000000, sequence)
ms.set_clip_end_timestamp(6000000, sequence)

ms.set_segment_start_timestamp((2000000, 4000000), sequence)
ms.set_segment_end_timestamp((3500000, 6000000), sequence)
ms.set_segment_label_index((4, 3), sequence)
ms.set_segment_label_string((b"run", b"jump"), sequence)
```

```c++
// C++: functions from media_sequence.h
tensorflow::SequenceExample sequence;
SetClipDataPath("path_to_video", &sequence);
SetClipStartTimestamp(1000000, &sequence);
SetClipEndTimestamp(6000000, &sequence);

SetSegmentStartTimestamp({2000000, 4000000}, &sequence);
SetSegmentEndTimestamp({3500000, 6000000}, &sequence);
SetSegmentLabelIndex({4, 3}, &sequence);
SetSegmentLabelString({"run", "jump"}, &sequence);
```

#### Tracking and spatiotemporal detection

For object tracking or detection in videos, e.g. classifying regions in time
and space, the labels are typically bounding boxes. Unlike in previous tasks,
the annotations are provided as a
[`FeatureList`](https://www.tensorflow.org/api_docs/python/tf/train/FeatureList)
instead of in a context
[`Feature`](https://www.tensorflow.org/api_docs/python/tf/train/Feature)
because they occur in multiple frames. Set up a detection task with
`add_bbox`, `add_bbox_timestamp`, `add_bbox_label_string`, and
`add_bbox_label_index`. Only add metadata for annotated frames. The pipeline
will add empty features to each feature list to align the box annotations
with the nearest image frame. `add_bbox_is_annotated` distinguishes between
annotated frames and frames added as padding: 1 is added if the frame was
annotated and 0 otherwise. It is automatically maintained in
`PackMediaSequenceCalculator`. Other fields can be used for tracking tasks:
`add_bbox_track_string` identifies instances over time, and
`add_bbox_class_string` can be concatenated to the track string if track ids
are not already unique. If track ids are unique across classes, you do not
need to fill out the class information.

##### Example lines creating metadata for spatiotemporal detection or tracking

```python
# Python: functions from media_sequence.py as ms; NumPy imported as np
sequence = tf.train.SequenceExample()
ms.set_clip_data_path(b"path_to_video", sequence)
ms.set_clip_start_timestamp(1000000, sequence)
ms.set_clip_end_timestamp(6000000, sequence)

# For an object tracking task with action labels:
locations_on_frame_1 = np.array([[0.1, 0.2, 0.3, 0.4],
                                 [0.2, 0.3, 0.4, 0.5]])
ms.add_bbox(locations_on_frame_1, sequence)
ms.add_bbox_timestamp(3000000, sequence)
ms.add_bbox_label_index((4, 3), sequence)
ms.add_bbox_label_string((b"run", b"jump"), sequence)
ms.add_bbox_track_string((b"id_0", b"id_1"), sequence)
# ms.add_bbox_class_string((b"cls_0", b"cls_0"), sequence)  # if required
locations_on_frame_2 = locations_on_frame_1[:1]  # keep a 2-D array of boxes
ms.add_bbox(locations_on_frame_2, sequence)
ms.add_bbox_timestamp(5000000, sequence)
ms.add_bbox_label_index((3,), sequence)
ms.add_bbox_label_string((b"jump",), sequence)
ms.add_bbox_track_string((b"id_0",), sequence)
# ms.add_bbox_class_string((b"cls_0",), sequence)  # if required
```

```c++
// C++: functions from media_sequence.h
tensorflow::SequenceExample sequence;
SetClipDataPath("path_to_video", &sequence);
SetClipStartTimestamp(1000000, &sequence);
SetClipEndTimestamp(6000000, &sequence);

// For an object tracking task with action labels:
std::vector<mediapipe::Location> locations_on_frame_1;  // fill with two boxes
AddBBox(locations_on_frame_1, &sequence);
AddBBoxTimestamp(3000000, &sequence);
AddBBoxLabelIndex({4, 3}, &sequence);
AddBBoxLabelString({"run", "jump"}, &sequence);
AddBBoxTrackString({"id_0", "id_1"}, &sequence);
// AddBBoxClassString({"cls_0", "cls_0"}, &sequence);  // if required
std::vector<mediapipe::Location> locations_on_frame_2;  // fill with one box
AddBBox(locations_on_frame_2, &sequence);
AddBBoxTimestamp(5000000, &sequence);
AddBBoxLabelIndex({3}, &sequence);
AddBBoxLabelString({"jump"}, &sequence);
AddBBoxTrackString({"id_0"}, &sequence);
// AddBBoxClassString({"cls_0"}, &sequence);  // if required
```

### Running a MediaSequence through MediaPipe

#### `UnpackMediaSequenceCalculator` and `PackMediaSequenceCalculator`

MediaSequence utilizes MediaPipe for processing by providing two special
calculators. The `UnpackMediaSequenceCalculator` extracts data from
SequenceExamples. This will often be the metadata, such as the path to the
video file and the clip start and end times. However, after storing images in
a SequenceExample, the images themselves can also be unpacked for further
processing, such as computing optical flow. Whatever data is extracted during
processing is added to the SequenceExample by the
`PackMediaSequenceCalculator`. The values that these calculators unpack from
and pack into SequenceExamples are determined by the tags on the streams in
the MediaPipe calculator graph. (Tags are required to be all capitals and
underscores. To encode prefixes for feature keys as tags, prefixes for
feature keys should follow the same convention.) The documentation for these
two calculators describes the variety of data they support. Any other
MediaPipe processing can be used between these calculators to extract
features.

#### Adding data and reconciling metadata

In general, the pipeline will decode the specified media between the clip
start and end timestamps and store any requested features. A common feature
to request is JPEG-encoded images, so it will be used as the example here.
Each image between the clip start and end timestamps (generally inclusive) is
added to the SequenceExample's feature list with `add_image_encoded`, and the
corresponding timestamp is added with `add_image_timestamp`. At the end of
the image stream, the pipeline will determine and store what metadata it can
about the stream. For images, it will store the height and width of the image
as well as the number of channels and encoding format. Similar storage and
metadata computation is done when adding audio, float feature vectors, or
encoded optical flow to the SequenceExample. The code that reconciles the
metadata is in media_sequence.cc.

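The sketch below mimics that packing step by hand with the documented
accessors, assuming the JPEG bytes for two frames are already in memory as
the hypothetical variables `frame_1` and `frame_2`. In a real pipeline the
`PackMediaSequenceCalculator` performs these calls and the metadata
reconciliation for you.

```python
# Python: functions from media_sequence.py as ms
# frame_1 and frame_2 are assumed to hold JPEG-encoded bytes.
sequence = tf.train.SequenceExample()
for timestamp, jpeg_bytes in [(1000000, frame_1), (1041666, frame_2)]:
  ms.add_image_encoded(jpeg_bytes, sequence)
  ms.add_image_timestamp(timestamp, sequence)
# Metadata reconciled at the end of the stream:
ms.set_image_format(b"JPEG", sequence)
ms.set_image_channels(3, sequence)
ms.set_image_height(480, sequence)
ms.set_image_width(640, sequence)
```
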
#### Automatically aligning bounding boxes to images

At the time of writing, the image/timestamp feature list is also used to
update segment/start/index, segment/end/index, and bounding box data to the
closest image timestamps. Segment indices are relative to the start of the
clip (i.e. they only reference data within the SequenceExample), while
timestamps are absolute times within the video. Bounding box data is aligned
to the image timestamps by inserting empty bounding box annotations and
indicating this with `add_bbox_is_annotated`. If images are stored at a lower
rate than the bounding box data, then only the nearest annotation to each
frame is retained and any others are dropped. *Be careful when downsampling
frame rates with bounding box annotations; downsampling bounding box
annotations is the only time annotations will be lost in the MediaPipe
pipeline.*

#### Chaining processing graphs

A common use case is to derive deep features from frames in a video when
those features are too expensive to compute during training, for example
extracting ResNet-50 features for each frame of a video. In the MediaSequence
pipeline, the way to generate these features is to first extract the images
to the SequenceExample in one MediaPipe graph, and then create a second
MediaPipe graph that unpacks the images from the SequenceExample and appends
the new features to a copy of that SequenceExample. This chaining behavior
makes it easy to incrementally add features in a modular way and makes
debugging easier because you can identify the anomalous stage more easily.
Once the pipeline is complete, any unnecessary features can be removed. Be
aware that the number of derived feature timestamps may differ from the
number of input features; e.g. optical flow can't be estimated for the last
frame of a video clip, so it adds one fewer frame of data. With the exception
of aligning bounding boxes, the pipeline does nothing to require consistent
timestamps between features.

## Function prototypes for each data type

MediaSequence provides accessors to store common data patterns in
SequenceExamples. The exact functions depend on the type of data and the key,
but the patterns are similar. Each function has a name related to the key, so
the functions are documented here with a generic name, Feature. Note that due
to different conventions for Python and C++ code, the capitalization and
parameter order vary, but the functionality should be equivalent.

Each function takes an optional prefix parameter. Prefixes enable storing
semantically identical data without collisions. For example, it is possible
to store predicted and ground truth bounding boxes by using different
prefixes. To keep the API and documentation manageable, avoid prefixes unless
they are necessary. For some common cases, such as storing instance
segmentation labels along with images, named versions with the prefixes baked
in are provided, as documented below. Lastly, generic features and audio
streams should almost always use a prefix because storing multiple features
or transformed audio streams is common.

The code generating these functions resides in media_sequence.h/.cc/.py and
media_sequence_util.h/.cc/.py. The media_sequence files generally define the
API that should be used directly by developers. The media_sequence_util files
provide the function generation code used to define new features. If you
require additional features not supplied in the media_sequence files, use the
functions in media_sequence_util to create more in the appropriate namespace
/ module_dict in your own files and import those as well.

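As a sketch of that extension point, the snippet below declares a new integer
context feature in Python. The helper name and signature are assumptions for
illustration; check media_sequence_util.py for the exact generator functions
and their parameters.

```python
# ASSUMPTION: create_int_context_feature is an illustrative name; consult
# media_sequence_util.py for the real generator names and signatures.
from mediapipe.util.sequence import media_sequence_util as msu

# Generates set_my_rating / get_my_rating / has_my_rating / clear_my_rating
# (and the key getter) in this module, bound to the "my/rating" context key.
msu.create_int_context_feature("my_rating", "my/rating",
                               module_dict=globals())
```
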
In these prototypes, the prefix is optional, as indicated by \[ \]s. The C++
types are abbreviated. The code and test cases are recommended for
understanding the exact types. The purpose of these examples is to illustrate
the pattern.

### Singular Context Features

| python call | c++ call | description |
|-------------|----------|-------------|
|`has_feature(example [, prefix])`|`HasFeature([const string& prefix,] const tf::SE& example)`|Returns whether the feature is present.|
|`get_feature(example [, prefix])`|`GetFeature([const string& prefix,] const tf::SE& example)`|Returns a single feature of the appropriate type (string, int64, float).|
|`clear_feature(example [, prefix])`|`ClearFeature([const string& prefix,] tf::SE* example)`|Clears the feature.|
|`set_feature(value, example [, prefix])`|`SetFeature([const string& prefix,] const TYPE& value, tf::SE* example)`|Clears and stores the feature of the appropriate type.|
|`get_feature_key([prefix])`|`GetFeatureKey([const string& prefix])`|Returns the key used by related functions.|
|`get_feature_default_parser()`| | Returns the tf.io.FixedLenFeature for this type. (Python only.) |

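For instance, applying this pattern to the `example/id` key from the key
tables below yields the following calls. This is a sketch of the naming
pattern; the assertions assume the accessors behave as documented.

```python
# Python: functions from media_sequence.py as ms
sequence = tf.train.SequenceExample()
ms.set_example_id(b"video_0001", sequence)
assert ms.has_example_id(sequence)
assert ms.get_example_id(sequence) == b"video_0001"
ms.clear_example_id(sequence)
assert not ms.has_example_id(sequence)
```
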
### List Context Features

| python call | c++ call | description |
|-------------|----------|-------------|
|`has_feature(example [, prefix])`|`HasFeature([const string& prefix,] const tf::SE& example)`|Returns whether the feature is present.|
|`get_feature(example [, prefix])`|`GetFeature([const string& prefix,] const tf::SE& example)`|Returns a sequence feature of the appropriate type (comparable to list/vector of string, int64, float).|
|`clear_feature(example [, prefix])`|`ClearFeature([const string& prefix,] tf::SE* example)`|Clears the feature.|
|`set_feature(values, example [, prefix])`|`SetFeature([const string& prefix,] const vector<TYPE>& values, tf::SE* example)`|Clears and stores the list of features of the appropriate type.|
|`get_feature_key([prefix])`|`GetFeatureKey([const string& prefix])`|Returns the key used by related functions.|
|`get_feature_default_parser()`| | Returns the tf.io.VarLenFeature for this type. (Python only.) |

### Singular Feature Lists

| python call | c++ call | description |
|-------------|----------|-------------|
|`has_feature(example [, prefix])`|`HasFeature([const string& prefix,] const tf::SE& example)`|Returns whether the feature is present.|
|`get_feature_size(example [, prefix])`|`GetFeatureSize([const string& prefix,] const tf::SE& example)`|Returns the number of features under this key. Will be 0 if the feature is absent.|
|`get_feature_at(index, example [, prefix])`|`GetFeatureAt([const string& prefix,] const tf::SE& example, const int index)`|Returns a single feature of the appropriate type (string, int64, float) at position index of the feature list.|
|`clear_feature(example [, prefix])`|`ClearFeature([const string& prefix,] tf::SE* example)`|Clears the entire feature.|
|`add_feature(value, example [, prefix])`|`AddFeature([const string& prefix,] const TYPE& value, tf::SE* example)`|Appends a feature of the appropriate type to the feature list.|
|`get_feature_key([prefix])`|`GetFeatureKey([const string& prefix])`|Returns the key used by related functions.|
|`get_feature_default_parser()`| | Returns the tf.io.FixedLenSequenceFeature for this type. (Python only.) |

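Applying the feature-list pattern to the `image/timestamp` key from the image
table below looks like the sketch here; the naming follows the pattern above.

```python
# Python: functions from media_sequence.py as ms
sequence = tf.train.SequenceExample()
ms.add_image_timestamp(1000000, sequence)
ms.add_image_timestamp(1041666, sequence)
assert ms.get_image_timestamp_size(sequence) == 2
assert ms.get_image_timestamp_at(0, sequence) == 1000000
```
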
### List Feature Lists

| python call | c++ call | description |
|-------------|----------|-------------|
|`has_feature(example [, prefix])`|`HasFeature([const string& prefix,] const tf::SE& example)`|Returns whether the feature is present.|
|`get_feature_size(example [, prefix])`|`GetFeatureSize([const string& prefix,] const tf::SE& example)`|Returns the number of feature sequences under this key. Will be 0 if the feature is absent.|
|`get_feature_at(index, example [, prefix])`|`GetFeatureAt([const string& prefix,] const tf::SE& example, const int index)`|Returns a repeated feature of the appropriate type (comparable to list/vector of string, int64, float) at position index of the feature list.|
|`clear_feature(example [, prefix])`|`ClearFeature([const string& prefix,] tf::SE* example)`|Clears the entire feature.|
|`add_feature(value, example [, prefix])`|`AddFeature([const string& prefix,] const vector<TYPE>& value, tf::SE* example)`|Appends a sequence of features of the appropriate type to the feature list.|
|`get_feature_key([prefix])`|`GetFeatureKey([const string& prefix])`|Returns the key used by related functions.|
|`get_feature_default_parser()`| | Returns the tf.io.VarLenFeature for this type. (Python only.) |

## Keys

These keys are broadly useful for covering the range of multimedia-based
machine learning tasks. The key itself should be human interpretable, and
descriptions are provided for elaboration.

### Keys related to the entire example

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`example/id`|context bytes|`set_example_id` / `SetExampleId`|A unique identifier for each example.|
|`example/dataset_name`|context bytes|`set_example_dataset_name` / `SetExampleDatasetName`|The name of the data set, including the version.|

### Keys related to a clip

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`clip/data_path`|context bytes|`set_clip_data_path` / `SetClipDataPath`|The relative path to the data on disk from some root directory.|
|`clip/start/timestamp`|context int|`set_clip_start_timestamp` / `SetClipStartTimestamp`|The start time, in microseconds, for the start of the clip in the media.|
|`clip/end/timestamp`|context int|`set_clip_end_timestamp` / `SetClipEndTimestamp`|The end time, in microseconds, for the end of the clip in the media.|
|`clip/label/index`|context int list|`set_clip_label_index` / `SetClipLabelIndex`|A list of label indices for this clip.|
|`clip/label/string`|context string list|`set_clip_label_string` / `SetClipLabelString`|A list of label strings for this clip.|
|`clip/label/confidence`|context float list|`set_clip_label_confidence` / `SetClipLabelConfidence`|A list of label confidences for this clip.|
|`clip/media_id`|context bytes|`set_clip_media_id` / `SetClipMediaId`|Any identifier for the media beyond the data path.|
|`clip/alternative_media_id`|context bytes|`set_clip_alternative_media_id` / `SetClipAlternativeMediaId`|Yet another alternative identifier.|
|`clip/encoded_media_bytes`|context bytes|`set_clip_encoded_media_bytes` / `SetClipEncodedMediaBytes`|The encoded bytes for storing media directly in the SequenceExample.|
|`clip/encoded_media_start_timestamp`|context int|`set_clip_encoded_media_start_timestamp` / `SetClipEncodedMediaStartTimestamp`|The start time for the encoded media if not preserved during encoding.|

### Keys related to segments of clips

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`segment/start/timestamp`|context int list|`set_segment_start_timestamp` / `SetSegmentStartTimestamp`|A list of segment start times in microseconds.|
|`segment/start/index`|context int list|`set_segment_start_index` / `SetSegmentStartIndex`|A list of indices marking the first frame index >= the start time.|
|`segment/end/timestamp`|context int list|`set_segment_end_timestamp` / `SetSegmentEndTimestamp`|A list of segment end times in microseconds.|
|`segment/end/index`|context int list|`set_segment_end_index` / `SetSegmentEndIndex`|A list of indices marking the last frame index <= the end time.|
|`segment/label/index`|context int list|`set_segment_label_index` / `SetSegmentLabelIndex`|A list with the label index for each segment. Multiple labels for the same segment are encoded as repeated segments.|
|`segment/label/string`|context bytes list|`set_segment_label_string` / `SetSegmentLabelString`|A list with the label string for each segment. Multiple labels for the same segment are encoded as repeated segments.|
|`segment/label/confidence`|context float list|`set_segment_label_confidence` / `SetSegmentLabelConfidence`|A list with the label confidence for each segment. Multiple labels for the same segment are encoded as repeated segments.|

### Keys related to spatial regions (e.g. bounding boxes)

Prefixes are used to distinguish between different semantic meanings of
regions. This practice is so common that BBox versions of the function calls
are provided. Each call accepts an optional prefix to avoid name collisions.
"Region" is used in the keys because of the similar semantic meaning between
different types of regions.

A few *special* accessors are provided to work with multiple keys at once.

Regions can be given identifiers for labels, tracks, and classes. Although
similar information can be stored in each identifier, the intended use is
different. Labels should be used when predicting a label for a region (such
as the class of the bounding box or the action performed by a person). Tracks
should be used to uniquely identify regions over sequential frames. Classes
are only intended to be used to disambiguate track ids if those ids are not
unique across object labels. The recommendation is to prefer label fields for
classification tasks and track (or class) fields for tracking information.

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`region/bbox/ymin`|feature list float list|`add_bbox_ymin` / `AddBBoxYMin`|A list of normalized minimum y values of bounding boxes in a frame.|
|`region/bbox/xmin`|feature list float list|`add_bbox_xmin` / `AddBBoxXMin`|A list of normalized minimum x values of bounding boxes in a frame.|
|`region/bbox/ymax`|feature list float list|`add_bbox_ymax` / `AddBBoxYMax`|A list of normalized maximum y values of bounding boxes in a frame.|
|`region/bbox/xmax`|feature list float list|`add_bbox_xmax` / `AddBBoxXMax`|A list of normalized maximum x values of bounding boxes in a frame.|
|`region/bbox/\*`| *special* |`add_bbox` / `AddBBox`|Operates on ymin,xmin,ymax,xmax with a single call.|
|`region/point/x`|feature list float list|`add_bbox_point_x` / `AddBBoxPointX`|A list of normalized x values for points in a frame.|
|`region/point/y`|feature list float list|`add_bbox_point_y` / `AddBBoxPointY`|A list of normalized y values for points in a frame.|
|`region/point/\*`| *special* |`add_bbox_point` / `AddBBoxPoint`|Operates on point/x,point/y with a single call.|
|`region/point/radius`|feature list float list|`add_bbox_point_radius` / `AddBBoxPointRadius`|A list of radii for points in a frame.|
|`region/3d_point/x`|feature list float list|`add_bbox_3d_point_x` / `AddBBox3dPointX`|A list of normalized x values for points in a frame.|
|`region/3d_point/y`|feature list float list|`add_bbox_3d_point_y` / `AddBBox3dPointY`|A list of normalized y values for points in a frame.|
|`region/3d_point/z`|feature list float list|`add_bbox_3d_point_z` / `AddBBox3dPointZ`|A list of normalized z values for points in a frame.|
|`region/3d_point/\*`| *special* |`add_bbox_3d_point` / `AddBBox3dPoint`|Operates on 3d_point/{x,y,z} with a single call.|
|`region/timestamp`|feature list int|`add_bbox_timestamp` / `AddBBoxTimestamp`|The timestamp in microseconds for the region annotations.|
|`region/num_regions`|feature list int|`add_bbox_num_regions` / `AddBBoxNumRegions`|The number of boxes or other regions in a frame. Should be 0 for unannotated frames.|
|`region/is_annotated`|feature list int|`add_bbox_is_annotated` / `AddBBoxIsAnnotated`|1 if this timestep is annotated. 0 otherwise. Distinguishes empty from unannotated frames.|
|`region/is_generated`|feature list int list|`add_bbox_is_generated` / `AddBBoxIsGenerated`|For each region, 1 if the region is procedurally generated for this frame.|
|`region/is_occluded`|feature list int list|`add_bbox_is_occluded` / `AddBBoxIsOccluded`|For each region, 1 if the region is occluded in the current frame.|
|`region/label/index`|feature list int list|`add_bbox_label_index` / `AddBBoxLabelIndex`|For each region, lists the integer label. Multiple labels for one region require duplicating the region.|
|`region/label/string`|feature list bytes list|`add_bbox_label_string` / `AddBBoxLabelString`|For each region, lists the string label. Multiple labels for one region require duplicating the region.|
|`region/label/confidence`|feature list float list|`add_bbox_label_confidence` / `AddBBoxLabelConfidence`|For each region, lists the confidence or weight for the label. Multiple labels for one region require duplicating the region.|
|`region/track/index`|feature list int list|`add_bbox_track_index` / `AddBBoxTrackIndex`|For each region, lists the integer track id. Multiple track ids for one region require duplicating the region.|
|`region/track/string`|feature list bytes list|`add_bbox_track_string` / `AddBBoxTrackString`|For each region, lists the string track id. Multiple track ids for one region require duplicating the region.|
|`region/track/confidence`|feature list float list|`add_bbox_track_confidence` / `AddBBoxTrackConfidence`|For each region, lists the confidence or weight for the track. Multiple track ids for one region require duplicating the region.|
|`region/class/index`|feature list int list|`add_bbox_class_index` / `AddBBoxClassIndex`|For each region, lists the integer class. Multiple classes for one region require duplicating the region.|
|`region/class/string`|feature list bytes list|`add_bbox_class_string` / `AddBBoxClassString`|For each region, lists the string class. Multiple classes for one region require duplicating the region.|
|`region/class/confidence`|feature list float list|`add_bbox_class_confidence` / `AddBBoxClassConfidence`|For each region, lists the confidence or weight for the class. Multiple classes for one region require duplicating the region.|
|`region/embedding/float`|feature list float list|`add_bbox_embedding_floats` / `AddBBoxEmbeddingFloats`|For each region, provide an embedding as a sequence of floats.|
|`region/parts`|context bytes list|`set_bbox_parts` / `SetBBoxParts`|The list of region parts expected in this example.|
|`region/embedding/dimensions_per_region`|context int list|`set_bbox_embedding_dimensions_per_region` / `SetBBoxEmbeddingDimensionsPerRegion`|Provide the dimensions for each embedding.|
|`region/embedding/format`|context string|`set_bbox_embedding_format` / `SetBBoxEmbeddingFormat`|Provides the encoding format, if any, for region embeddings.|
|`region/embedding/encoded`|feature list bytes list|`add_bbox_embedding_encoded` / `AddBBoxEmbeddingEncoded`|For each region, provide an encoded embedding.|
|`region/embedding/confidence`|feature list float list|`add_bbox_embedding_confidence` / `AddBBoxEmbeddingConfidence`|For each region, provide a confidence for the embedding.|
|`region/unmodified_timestamp`|feature list int|`add_bbox_unmodified_timestamp` / `AddBBoxUnmodifiedTimestamp`|Used to store the original timestamps if procedurally aligning timestamps to image frames.|

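As an example of using prefixes with these calls, the sketch below stores
ground truth boxes under the default keys and model predictions under a
hypothetical `PREDICTED` prefix. The keyword form `prefix=` follows the
prototypes above; adjust it if your version exposes the prefix differently.

```python
# Python: functions from media_sequence.py as ms; NumPy imported as np
ground_truth = np.array([[0.1, 0.2, 0.3, 0.4]])
predicted = np.array([[0.12, 0.21, 0.33, 0.38]])
ms.add_bbox(ground_truth, sequence)                   # region/bbox/*
ms.add_bbox_timestamp(3000000, sequence)
ms.add_bbox(predicted, sequence, prefix="PREDICTED")  # PREDICTED/region/bbox/*
ms.add_bbox_timestamp(3000000, sequence, prefix="PREDICTED")
```
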
### Keys related to images

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`image/encoded`|feature list bytes|`add_image_encoded` / `AddImageEncoded`|The encoded image at each timestep.|
|`image/timestamp`|feature list int|`add_image_timestamp` / `AddImageTimestamp`|The timestamp in microseconds for the image.|
|`image/multi_encoded`|feature list bytes list|`add_image_multi_encoded` / `AddImageMultiEncoded`|Stores multiple images at each timestep (e.g. from multiple camera views).|
|`image/label/index`|feature list int list|`add_image_label_index` / `AddImageLabelIndex`|If an image at a specific timestamp should have a label, use this. For a range of time, prefer Segments instead.|
|`image/label/string`|feature list bytes list|`add_image_label_string` / `AddImageLabelString`|If an image at a specific timestamp should have a label, use this. For a range of time, prefer Segments instead.|
|`image/label/confidence`|feature list float list|`add_image_label_confidence` / `AddImageLabelConfidence`|If an image at a specific timestamp should have a label, use this. For a range of time, prefer Segments instead.|
|`image/format`|context bytes|`set_image_format` / `SetImageFormat`|The encoding format of the images.|
|`image/channels`|context int|`set_image_channels` / `SetImageChannels`|The number of channels in the image.|
|`image/height`|context int|`set_image_height` / `SetImageHeight`|The height of the image in pixels.|
|`image/width`|context int|`set_image_width` / `SetImageWidth`|The width of the image in pixels.|
|`image/frame_rate`|context float|`set_image_frame_rate` / `SetImageFrameRate`|The rate of images in frames per second.|
|`image/data_path`|context bytes|`set_image_data_path` / `SetImageDataPath`|The path of the image file if it did not come from a media clip.|

### Keys related to image class segmentation

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`CLASS_SEGMENTATION/image/encoded`|feature list bytes|`add_class_segmentation_encoded` / `AddClassSegmentationEncoded`|The encoded image of class labels at each timestep.|
|`CLASS_SEGMENTATION/image/timestamp`|feature list int|`add_class_segmentation_timestamp` / `AddClassSegmentationTimestamp`|The timestamp in microseconds for the class labels.|
|`CLASS_SEGMENTATION/image/multi_encoded`|feature list bytes list|`add_class_segmentation_multi_encoded` / `AddClassSegmentationMultiEncoded`|Stores multiple segmentation masks in case they overlap.|
|`CLASS_SEGMENTATION/image/format`|context bytes|`set_class_segmentation_format` / `SetClassSegmentationFormat`|The encoding format of the class label images.|
|`CLASS_SEGMENTATION/image/height`|context int|`set_class_segmentation_height` / `SetClassSegmentationHeight`|The height of the image in pixels.|
|`CLASS_SEGMENTATION/image/width`|context int|`set_class_segmentation_width` / `SetClassSegmentationWidth`|The width of the image in pixels.|
|`CLASS_SEGMENTATION/image/class/label/index`|context int list|`set_class_segmentation_class_label_index` / `SetClassSegmentationClassLabelIndex`|If necessary, a mapping from values in the image to class labels.|
|`CLASS_SEGMENTATION/image/class/label/string`|context bytes list|`set_class_segmentation_class_label_string` / `SetClassSegmentationClassLabelString`|A mapping from values in the image to class labels.|

### Keys related to image instance segmentation

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`INSTANCE_SEGMENTATION/image/encoded`|feature list bytes|`add_instance_segmentation_encoded` / `AddInstanceSegmentationEncoded`|The encoded image of object instance labels at each timestep.|
|`INSTANCE_SEGMENTATION/image/timestamp`|feature list int|`add_instance_segmentation_timestamp` / `AddInstanceSegmentationTimestamp`|The timestamp in microseconds for the object instance labels.|
|`INSTANCE_SEGMENTATION/image/multi_encoded`|feature list bytes list|`add_instance_segmentation_multi_encoded` / `AddInstanceSegmentationMultiEncoded`|Stores multiple segmentation masks in case they overlap.|
|`INSTANCE_SEGMENTATION/image/format`|context bytes|`set_instance_segmentation_format` / `SetInstanceSegmentationFormat`|The encoding format of the object instance labels.|
|`INSTANCE_SEGMENTATION/image/height`|context int|`set_instance_segmentation_height` / `SetInstanceSegmentationHeight`|The height of the image in pixels.|
|`INSTANCE_SEGMENTATION/image/width`|context int|`set_instance_segmentation_width` / `SetInstanceSegmentationWidth`|The width of the image in pixels.|
|`INSTANCE_SEGMENTATION/image/class/label/index`|context int list|`set_instance_segmentation_class_label_index` / `SetInstanceSegmentationClassLabelIndex`|If necessary, a mapping from values in the image to class labels.|
|`INSTANCE_SEGMENTATION/image/class/label/string`|context bytes list|`set_instance_segmentation_class_label_string` / `SetInstanceSegmentationClassLabelString`|A mapping from values in the image to class labels.|
|`INSTANCE_SEGMENTATION/image/object/class/index`|context int|`set_instance_segmentation_object_class_index` / `SetInstanceSegmentationObjectClassIndex`|If necessary, a mapping from values in the image to class indices.|

### Keys related to optical flow

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`FORWARD_FLOW/image/encoded`|feature list bytes|`add_forward_flow_encoded` / `AddForwardFlowEncoded`|The encoded forward optical flow field at each timestep.|
|`FORWARD_FLOW/image/timestamp`|feature list int|`add_forward_flow_timestamp` / `AddForwardFlowTimestamp`|The timestamp in microseconds for the optical flow field.|
|`FORWARD_FLOW/image/multi_encoded`|feature list bytes list|`add_forward_flow_multi_encoded` / `AddForwardFlowMultiEncoded`|Stores multiple optical flow fields at each timestep (e.g. from multiple camera views).|
|`FORWARD_FLOW/image/format`|context bytes|`set_forward_flow_format` / `SetForwardFlowFormat`|The encoding format of the optical flow field.|
|`FORWARD_FLOW/image/channels`|context int|`set_forward_flow_channels` / `SetForwardFlowChannels`|The number of channels in the optical flow field.|
|`FORWARD_FLOW/image/height`|context int|`set_forward_flow_height` / `SetForwardFlowHeight`|The height of the optical flow field in pixels.|
|`FORWARD_FLOW/image/width`|context int|`set_forward_flow_width` / `SetForwardFlowWidth`|The width of the optical flow field in pixels.|
|`FORWARD_FLOW/image/frame_rate`|context float|`set_forward_flow_frame_rate` / `SetForwardFlowFrameRate`|The rate of optical flow fields in frames per second.|
|`FORWARD_FLOW/image/saturation`|context float|`set_forward_flow_saturation` / `SetForwardFlowSaturation`|The saturation value used before encoding the flow field to an image.|

### Keys related to generic features

Storing generic features is powerful, but potentially confusing, so the
recommendation is to use more specific methods when possible. When using
these generic features, always supply a prefix. (The recommended prefix
format, enforced by some MediaPipe functions, is all caps with underscores,
e.g. MY_FAVORITE_FEATURE.) Following this recommendation, the keys are listed
with a generic PREFIX.

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`PREFIX/feature/floats`|feature list float list|`add_feature_floats` / `AddFeatureFloats`|A list of floats at a timestep.|
|`PREFIX/feature/bytes`|feature list bytes list|`add_feature_bytes` / `AddFeatureBytes`|A list of bytes at a timestep. May be encoded.|
|`PREFIX/feature/ints`|feature list int list|`add_feature_ints` / `AddFeatureInts`|A list of ints at a timestep.|
|`PREFIX/feature/timestamp`|feature list int|`add_feature_timestamp` / `AddFeatureTimestamp`|A timestamp for a set of features.|
|`PREFIX/feature/duration`|feature list int list|`add_feature_duration` / `AddFeatureDuration`|It is occasionally useful to indicate that a feature applies to a time range. This should only be used for features; annotations should be provided as Segments.|
|`PREFIX/feature/confidence`|feature list float list|`add_feature_confidence` / `AddFeatureConfidence`|The confidence for each generated feature.|
|`PREFIX/feature/dimensions`|context int list|`set_feature_dimensions` / `SetFeatureDimensions`|A list of integer dimensions for each feature.|
|`PREFIX/feature/rate`|context float|`set_feature_rate` / `SetFeatureRate`|The rate that features are calculated, in features per second.|
|`PREFIX/feature/bytes/format`|context bytes|`set_feature_bytes_format` / `SetFeatureBytesFormat`|The encoding format, if any, for features stored as bytes.|

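To make the pattern concrete, the sketch below stores one float vector per
frame under an illustrative `RESNET_50` prefix, along with the feature
dimensions and rate. `timestamps` and `embeddings` are assumed inputs, and
the `prefix=` keyword form follows the prototypes above.

```python
# Python: functions from media_sequence.py as ms
# timestamps: microsecond ints; embeddings: iterable of float vectors.
for timestamp, embedding in zip(timestamps, embeddings):
  ms.add_feature_floats(embedding, sequence, prefix="RESNET_50")
  ms.add_feature_timestamp(timestamp, sequence, prefix="RESNET_50")
ms.set_feature_dimensions((2048,), sequence, prefix="RESNET_50")
ms.set_feature_rate(24.0, sequence, prefix="RESNET_50")
```
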
### Keys related to audio

Audio is a special subtype of generic features with additional data about the
audio format. When using audio, always supply a prefix. (The recommended
prefix format, enforced by some MediaPipe functions, is all caps with
underscores, e.g. MY_FAVORITE_FEATURE.) Following this recommendation, the
keys are listed with a generic PREFIX.

To understand the terminology, it is helpful to conceptualize the audio as a
list of matrices. The columns of each matrix are called samples, and the rows
are called channels. Each matrix is called a packet. The packet rate is how
often packets appear per second, and the sample rate is the rate of columns
per second. The audio sample rate is the sample rate of the original audio,
which matters for derived features such as spectrograms, where the STFT is
computed over audio and the features appear at some other rate.

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`PREFIX/feature/floats`|feature list float list|`add_feature_floats` / `AddFeatureFloats`|A list of floats at a timestep.|
|`PREFIX/feature/timestamp`|feature list int|`add_feature_timestamp` / `AddFeatureTimestamp`|A timestamp for a set of features.|
|`PREFIX/feature/sample_rate`|context float|`set_feature_sample_rate` / `SetFeatureSampleRate`|The number of features per second. (e.g. for a spectrogram, this is the rate of STFT windows.)|
|`PREFIX/feature/num_channels`|context int|`set_feature_num_channels` / `SetFeatureNumChannels`|The number of channels of audio in each stored feature.|
|`PREFIX/feature/num_samples`|context int|`set_feature_num_samples` / `SetFeatureNumSamples`|The number of samples of audio in each stored feature.|
|`PREFIX/feature/packet_rate`|context float|`set_feature_packet_rate` / `SetFeaturePacketRate`|The number of packets per second.|
|`PREFIX/feature/audio_sample_rate`|context float|`set_feature_audio_sample_rate` / `SetFeatureAudioSampleRate`|The sample rate of the original audio for derived features.|

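As a concrete sketch of this terminology, the snippet below stores the
metadata for raw stereo PCM under an illustrative `AUDIO` prefix: 16 kHz
audio, packets of 0.96 seconds emitted once per second. The values and the
`prefix=` keyword form are assumptions for illustration.

```python
# Python: functions from media_sequence.py as ms
ms.set_feature_num_channels(2, sequence, prefix="AUDIO")       # rows
ms.set_feature_num_samples(15360, sequence, prefix="AUDIO")    # 0.96 s * 16 kHz
ms.set_feature_sample_rate(16000.0, sequence, prefix="AUDIO")  # columns/sec
ms.set_feature_packet_rate(1.0, sequence, prefix="AUDIO")      # packets/sec
ms.set_feature_audio_sample_rate(16000.0, sequence, prefix="AUDIO")
```
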
### Keys related to text, captions, and ASR

Text features may be timed with the media, such as captions or automatic
speech recognition results, or may be untimed descriptions. This collection
of keys should be used for many very short text features. For a few longer
segments, please use the Segment keys in the context as described above. As
always, prefixes can be used to store different types of text, such as
automated and ground truth transcripts.

| key | type | python call / c++ call | description |
|-----|------|------------------------|-------------|
|`text/language`|context bytes|`set_text_language` / `SetTextLanguage`|The language for the corresponding text.|
|`text/context/content`|context bytes|`set_text_context_content` / `SetTextContextContent`|Storage for large blocks of text in the context.|
|`text/content`|feature list bytes|`add_text_content` / `AddTextContent`|One (or a few) text tokens that occur at one timestamp.|
|`text/timestamp`|feature list int|`add_text_timestamp` / `AddTextTimestamp`|When a text token occurs, in microseconds.|
|`text/duration`|feature list int|`add_text_duration` / `AddTextDuration`|The duration in microseconds for the corresponding text tokens.|
|`text/confidence`|feature list float|`add_text_confidence` / `AddTextConfidence`|How likely the text is correct.|
|`text/embedding`|feature list float list|`add_text_embedding` / `AddTextEmbedding`|A floating point vector for the corresponding text token.|
|`text/token/id`|feature list int|`add_text_token_id` / `AddTextTokenId`|An integer id for the corresponding text token.|
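
As a short sketch of these calls, the snippet below stores two timed ASR
tokens with durations and confidences; the values are illustrative.

```python
# Python: functions from media_sequence.py as ms
ms.set_text_language(b"en", sequence)
for token, start, duration, confidence in [
    (b"hello", 1200000, 300000, 0.95),
    (b"world", 1600000, 250000, 0.90)]:
  ms.add_text_content(token, sequence)
  ms.add_text_timestamp(start, sequence)
  ms.add_text_duration(duration, sequence)
  ms.add_text_confidence(confidence, sequence)
```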