Add initial draft of captions extension to semantic labels proposal #67

Open · wants to merge 1 commit into main
98 changes: 98 additions & 0 deletions proposals/semantic_schema/captions.md
# Semantic Captions

Copyright © 2024, NVIDIA Corporation, version 1.0

## Goal
Extend the `usdSemantics` domain with a schema for natural language
semantic descriptions of subgraphs of an OpenUSD stage.

## Background
The proposed semantic labels API provides a mechanism for tagging
subgraphs with labels using different taxonomies.

```usda
def Mesh "orange" (apiSchemas = ["SemanticsLabelsAPI:tags",
"SemanticsLabelsAPI:state"]) {
token[] semantics:labels:tags = ["fruit", "food", "citrus"]
token[] semantics:labels:state = ["peeled"]
}
```

Representing labels as discrete tokens makes it easy to construct merged
ancestor label sets (to aid `IsLabeled` queries) or to generate
segmentation images. Labels also suggest certain user interfaces and
query considerations (e.g. dragging and dropping labels onto prims).
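
As a rough illustration of that workflow (not the proposed API itself), here is a minimal C++ sketch of merging a prim's `tags` labels with those of its ancestors to answer an `IsLabeled`-style membership query. It uses only existing `UsdPrim`/`UsdAttribute` calls; the attribute name and instance follow the example above.

```cpp
#include <pxr/usd/usd/attribute.h>
#include <pxr/usd/usd/prim.h>
#include <pxr/base/tf/token.h>
#include <pxr/base/vt/array.h>

#include <set>
#include <string>

PXR_NAMESPACE_USING_DIRECTIVE

// Merge the "semantics:labels:<instance>" tokens authored on `prim` and on
// each of its ancestors into a single set.
std::set<TfToken>
MergedAncestorLabels(const UsdPrim& prim, const std::string& instance)
{
    const TfToken attrName("semantics:labels:" + instance);
    std::set<TfToken> merged;
    for (UsdPrim p = prim; p; p = p.GetParent()) {
        VtArray<TfToken> labels;
        if (UsdAttribute attr = p.GetAttribute(attrName)) {
            if (attr.Get(&labels)) {
                merged.insert(labels.begin(), labels.end());
            }
        }
    }
    return merged;
}

// IsLabeled-style membership query: is `prim` (or any ancestor) labeled
// with `label` in the given instance?
bool
IsLabeled(const UsdPrim& prim, const std::string& instance, const TfToken& label)
{
    return MergedAncestorLabels(prim, instance).count(label) > 0;
}
```

With the `orange` example above, `IsLabeled(orangePrim, "tags", TfToken("fruit"))` would return true, as would a query for a label inherited from an ancestor.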

## Proposal
Some workflows need natural language descriptions of
subgraphs that can capture a variety of features.

```usda
def Xform "dancing_robot" (
apiSchemas = ["SemanticsCaptionsAPI:summary",
"SemanticsCaptionsAPI:skills"]
) {
string semantics:captions:summary = "This is a bipedal robot made up of metal that is painted red. The robot was manufactured in 2024."
string semantics:captions:skills = "This robot can perform a box step, a grapevine, and the waltz."
}
```

Captions are expected to be longer phrases, sentences, or paragraphs.
A single prim may have multiple named caption instances, with the
instance name suggesting each caption's intent. Multiple instances also
allow for multi-language support.
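
Since the `SemanticsCaptionsAPI` schema is only proposed here, authoring such instances can be sketched with the generic `UsdPrim` API. The `summary_fr` instance name and the French text below are purely hypothetical, illustrating one way per-language captions could be modeled.

```cpp
#include <pxr/usd/usd/prim.h>
#include <pxr/usd/sdf/types.h>
#include <pxr/base/tf/token.h>

#include <string>

PXR_NAMESPACE_USING_DIRECTIVE

// Hypothetical sketch: apply two caption instances to `prim` and author one
// string attribute per instance, e.g. to carry captions in two languages.
void
AuthorSummaryCaptions(const UsdPrim& prim)
{
    // Record the applied multi-apply schema instances on the prim.
    prim.AddAppliedSchema(TfToken("SemanticsCaptionsAPI:summary"));
    prim.AddAppliedSchema(TfToken("SemanticsCaptionsAPI:summary_fr"));

    // One caption attribute per instance.
    prim.CreateAttribute(TfToken("semantics:captions:summary"),
                         SdfValueTypeNames->String)
        .Set(std::string("This is a bipedal robot made of red-painted metal."));
    prim.CreateAttribute(TfToken("semantics:captions:summary_fr"),
                         SdfValueTypeNames->String)
        .Set(std::string("Ceci est un robot bipède en métal peint en rouge."));
}
```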

### Time Varying Captions
Just as labels benefit from being able to describe state changes
over time, captions may also be time varying.

```usda
def Xform "learning_robot" (
apiSchemas = ["SemanticsCaptionsAPI:skills"]
) {
string semantics:captions:skills.timeSamples = {
0 : "The robot does not know how to dance",
100 : "The robot is learning the box step",
150 : "The robot knows the box step"
}
}
```

Comment on lines +50 to +58

Contributor:

As mentioned in @dgovil's proposal, we also discussed future consideration for time-based descriptions. Sometimes a relevant time sequence needs an "announcement" for assistive technology, too... Either tied to a transition (for example, between slide builds where the change is more important than either end state) or a time code of the overall timeline, similar to closed captions or audio descriptions.

Potentially relevant: I'm working on a PR for VTT to add an ATTRIBUTES block, generally to disambiguate various types of metadata, but specifically because it's a prerequisite for using VTT to define time-based general flash data (seizure avoidance, etc.) in this follow-up VTT issue.

Contributor:

I'm curious if you see these timeSamples keys aligning with other timed-text formats like VTT.

Contributor Author:

If I understand VTT correctly, it specifies time code ranges, while OpenUSD holds an authored value until the next authored time sample and pulls from the first or last time sample when querying out of range. To describe that format in OpenUSD, you'd likely have to do something like this:

```
string semantics:captions:skills.timeSamples = {
    99.9999: "",                 # Suppress out of range queries
    100: "This is some state.",  # Valid between time codes 100 and 150
    150.00001: ""                # Suppress out of range queries
}
```

This is highly speculative, but I'm curious if there's a path to building something like VTT using the time series proposal as a starting point. It's currently designed for animation splines, but might provide a path for eventually describing more complicated time-based value resolution.

Member:

Time series have actually been removed from the design for animation splines, in favor of more simply leveraging timeSamples for all non-scalar, non-floating-point varying data.
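
Returning to the proposal's time-varying example above, here is a minimal sketch (again with existing `UsdAttribute` calls rather than the proposed schema class) of resolving the `skills` caption at a particular time code; the stage file name is assumed for illustration.

```cpp
#include <pxr/usd/usd/stage.h>
#include <pxr/usd/usd/prim.h>
#include <pxr/usd/usd/attribute.h>
#include <pxr/usd/usd/timeCode.h>
#include <pxr/usd/sdf/path.h>
#include <pxr/base/tf/token.h>

#include <iostream>
#include <string>

PXR_NAMESPACE_USING_DIRECTIVE

int main()
{
    // Assumes a layer containing the "learning_robot" example above.
    UsdStageRefPtr stage = UsdStage::Open("learning_robot.usda");
    if (!stage) {
        return 1;
    }
    UsdPrim robot = stage->GetPrimAtPath(SdfPath("/learning_robot"));
    UsdAttribute skills =
        robot.GetAttribute(TfToken("semantics:captions:skills"));

    // Time code 120 falls between the samples authored at 100 and 150, so
    // the value held from time code 100 is returned.
    std::string caption;
    if (skills && skills.Get(&caption, UsdTimeCode(120))) {
        std::cout << caption << "\n";  // "The robot is learning the box step"
    }
    return 0;
}
```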

### Ancestral Captions
Captions are expected to describe the prim's subgraph. A prim may
look to its ancestors for additional context, and API will likely be
provided to aid with this.

```cpp
// Find the instance of the captions API applied directly to the prim
// or to the nearest hierarchical ancestor of the prim.
UsdSemanticsCaptionsAPI
FindNearest(const UsdPrim& prim, const TfToken& instance);

// Find all caption instances applied directly to the prim or
// to ancestors of the prim.
std::vector<UsdSemanticsCaptionsAPI>
FindHierarchical(const UsdPrim& prim, const TfToken& instance);
```
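
As a sketch of how `FindNearest` might behave: since `UsdSemanticsCaptionsAPI` does not exist yet, this hypothetical stand-in walks the prim's ancestor chain with generic `UsdPrim` calls and returns the nearest prim whose applied schemas include the requested caption instance.

```cpp
#include <pxr/usd/usd/prim.h>
#include <pxr/base/tf/token.h>

#include <algorithm>

PXR_NAMESPACE_USING_DIRECTIVE

// Hypothetical stand-in for FindNearest: walk from `prim` toward the root and
// return the first prim whose apiSchemas list contains
// "SemanticsCaptionsAPI:<instance>". Returns an invalid prim if none is found.
UsdPrim
FindNearestCaptioned(const UsdPrim& prim, const TfToken& instance)
{
    const TfToken applied("SemanticsCaptionsAPI:" + instance.GetString());
    for (UsdPrim p = prim; p; p = p.GetParent()) {
        const TfTokenVector schemas = p.GetAppliedSchemas();
        if (std::find(schemas.begin(), schemas.end(), applied) != schemas.end()) {
            return p;
        }
    }
    return UsdPrim();
}
```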

## Alternatives
### Use `documentation` metadata
The `documentation` metadata would be an alternative. However, it
does not allow for multiple instances and may conflict with user-facing
documentation such as `doc = "This asset is designed for background usage only"`.
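
For comparison, a minimal sketch of that alternative with the existing `UsdObject::SetDocumentation` call; because `doc` holds a single string per prim, a semantic caption and user-facing documentation would compete for the same slot.

```cpp
#include <pxr/usd/usd/prim.h>

PXR_NAMESPACE_USING_DIRECTIVE

void AuthorDoc(const UsdPrim& prim)
{
    // User-facing documentation authored first...
    prim.SetDocumentation("This asset is designed for background usage only");
    // ...is clobbered by a later caption-style write, since there is only
    // one doc string per prim.
    prim.SetDocumentation("A bipedal robot made of metal painted red.");
}
```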

### Use `assetInfo` metadata
Unlike the `documentation` field, an `assetInfo` dictionary could
capture multiple named strings. However, metadata values cannot
vary over time.
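
A corresponding sketch with the existing `UsdObject::SetAssetInfoByKey` call: the dictionary happily holds multiple named caption strings, but since `assetInfo` is metadata, none of them can carry time samples. The key names are hypothetical.

```cpp
#include <pxr/usd/usd/prim.h>
#include <pxr/base/tf/token.h>
#include <pxr/base/vt/value.h>

#include <string>

PXR_NAMESPACE_USING_DIRECTIVE

void AuthorAssetInfoCaptions(const UsdPrim& prim)
{
    // Multiple captions fit naturally as dictionary entries...
    prim.SetAssetInfoByKey(TfToken("summary"),
        VtValue(std::string("A bipedal robot made of metal painted red.")));
    prim.SetAssetInfoByKey(TfToken("skills"),
        VtValue(std::string("Can perform a box step, a grapevine, and the waltz.")));
    // ...but, being metadata, they cannot be animated with timeSamples.
}
```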

### Use `SemanticsLabelsAPI`
The proposed labeling API could be used to capture captions, but
there are two reasons to keep the APIs distinct.
* An authoring user interface for adding labels should be
different from the user interface for adding descriptions (e.g. a
drop-down box vs. a text box).
* Merging ancestor labels into sets to perform membership
queries has many potential applications (e.g. semantically
partitioned image masks). There is no equivalent workflow for
captions.