# Add initial draft of captions extension to semantic labels proposal #67
**Open** · nvmkuruc wants to merge 1 commit into `PixarAnimationStudios:main` from `NVIDIA-Omniverse:captions`
# Semantic Captions

Copyright © 2024, NVIDIA Corporation, version 1.0

## Goal
Extend the `usdSemantics` domain with a schema for natural language
semantic descriptions of subgraphs of an OpenUSD stage.

## Background
The proposed semantic labels API provides a mechanism for tagging
subgraphs with labels using different taxonomies.

```
def Mesh "orange" (apiSchemas = ["SemanticsLabelsAPI:tags",
                                 "SemanticsLabelsAPI:state"]) {
    token[] semantics:labels:tags = ["fruit", "food", "citrus"]
    token[] semantics:labels:state = ["peeled"]
}
```

Labels as discrete tokens make it easy to construct merged sets of
ancestor labels (to aid `IsLabeled` queries) or to generate
segmentation images. Labels also suggest certain user interfaces and
query considerations (e.g. dragging and dropping labels onto prims).
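For example, a merged ancestor label set can be computed with core `Usd` APIs alone. The helper below is a minimal sketch, not part of the proposal; it assumes labels are read directly off the `semantics:labels:<instance>` attribute named in the example above.

```cpp
#include <pxr/usd/usd/attribute.h>
#include <pxr/usd/usd/prim.h>
#include <pxr/base/tf/token.h>
#include <pxr/base/vt/array.h>
#include <set>

PXR_NAMESPACE_USING_DIRECTIVE

// Merge the labels for one instance name (e.g. "tags") authored on a
// prim and all of its ancestors into a single set, the kind of
// structure an IsLabeled-style query would consult.
std::set<TfToken> ComputeMergedLabels(const UsdPrim& prim,
                                      const TfToken& instance)
{
    // Attribute name follows the proposal: semantics:labels:<instance>.
    const TfToken attrName("semantics:labels:" + instance.GetString());
    std::set<TfToken> merged;
    for (UsdPrim p = prim; p; p = p.GetParent()) {
        VtArray<TfToken> labels;
        UsdAttribute attr = p.GetAttribute(attrName);
        if (attr && attr.Get(&labels)) {
            merged.insert(labels.begin(), labels.end());
        }
    }
    return merged;
}
```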
## Proposal
Some workflows call for natural language descriptions of subgraphs
that can capture a variety of features.

```
def Xform "dancing_robot" (
    apiSchemas = ["SemanticsCaptionsAPI:summary",
                  "SemanticsCaptionsAPI:skills"]
) {
    string semantics:captions:summary = "This is a bipedal robot made up of metal that is painted red. The robot was manufactured in 2024."
    string semantics:captions:skills = "This robot can perform a box step, a grapevine, and the waltz."
}
```

Captions are expected to be longer phrases, sentences, or paragraphs.
A single prim may have multiple named instances of a caption, with
the instance name suggesting each caption's intent. Multiple
instances can also support captions in multiple languages.
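For instance, localized captions could be authored as separate instances. The snippet below is a hedged sketch using core `Usd` authoring calls; the instance names `summary_en` and `summary_fr` are illustrative (the proposal does not prescribe a language-tagging convention), and it assumes a generated `SemanticsCaptionsAPI` schema has been registered so that `ApplyAPI` succeeds.

```cpp
#include <pxr/usd/usd/attribute.h>
#include <pxr/usd/usd/prim.h>
#include <pxr/usd/sdf/valueTypeNames.h>
#include <string>

PXR_NAMESPACE_USING_DIRECTIVE

// Author an English and a French caption as two instances of the
// proposed multiple-apply schema.
void AuthorLocalizedCaptions(const UsdPrim& prim)
{
    prim.ApplyAPI(TfToken("SemanticsCaptionsAPI"), TfToken("summary_en"));
    prim.ApplyAPI(TfToken("SemanticsCaptionsAPI"), TfToken("summary_fr"));

    prim.CreateAttribute(TfToken("semantics:captions:summary_en"),
                         SdfValueTypeNames->String)
        .Set(std::string("A bipedal robot painted red."));
    prim.CreateAttribute(TfToken("semantics:captions:summary_fr"),
                         SdfValueTypeNames->String)
        .Set(std::string("Un robot bipède peint en rouge."));
}
```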
### Time Varying Captions
Just as labels benefit from being able to describe state changes
over time, captions may also be time varying.

```
def Xform "learning_robot" (
    apiSchemas = ["SemanticsCaptionsAPI:skills"]
) {
    string semantics:captions:skills.timeSamples = {
        0: "The robot does not know how to dance",
        100: "The robot is learning the box step",
        150: "The robot knows the box step"
    }
}
```
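Standard value resolution applies: a sample holds until the next authored sample. Below is a minimal sketch of reading the caption at an intermediate time, assuming a layer named `learning_robot.usda` containing the example above.

```cpp
#include <pxr/usd/usd/attribute.h>
#include <pxr/usd/usd/stage.h>
#include <pxr/usd/sdf/path.h>
#include <iostream>
#include <string>

PXR_NAMESPACE_USING_DIRECTIVE

int main()
{
    // Hypothetical layer containing the learning_robot example above.
    UsdStageRefPtr stage = UsdStage::Open("learning_robot.usda");
    UsdAttribute skills = stage->GetPrimAtPath(SdfPath("/learning_robot"))
        .GetAttribute(TfToken("semantics:captions:skills"));

    // A sample holds until the next authored sample, so time 120
    // resolves to the value authored at time 100.
    std::string caption;
    skills.Get(&caption, UsdTimeCode(120.0));
    std::cout << caption << "\n";  // "The robot is learning the box step"
}
```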
### Ancestral Captions
Captions are expected to describe the prim's subgraph. A prim may
look to its ancestors for additional context, and API will likely be
provided to aid in this.

```cpp
// Find the instance of the captions API applied directly to the
// prim or to the nearest hierarchical ancestor of the prim.
UsdSemanticsCaptionsAPI
FindNearest(const UsdPrim& prim, const TfToken& instance);

// Find all caption instances applied directly to the prim or
// to ancestors of the prim.
std::vector<UsdSemanticsCaptionsAPI>
FindHierarchical(const UsdPrim& prim, const TfToken& instance);
```
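A hedged sketch of how these helpers might be called follows; the free-function spelling, the prim path, and the instance name are illustrative only, since the proposal does not say whether these live in a namespace or on a query class, and `UsdSemanticsCaptionsAPI` is proposed rather than existing API.

```cpp
// Illustrative only: FindNearest/FindHierarchical are proposed above,
// not yet part of usdSemantics.
UsdPrim arm = stage->GetPrimAtPath(SdfPath("/dancing_robot/arm"));

// One caption, from the prim itself or its closest captioned ancestor.
UsdSemanticsCaptionsAPI summary = FindNearest(arm, TfToken("summary"));

// Every "summary" caption from the prim up to the stage root, for
// workflows that also want ancestor context.
std::vector<UsdSemanticsCaptionsAPI> summaries =
    FindHierarchical(arm, TfToken("summary"));
```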
## Alternatives
### Use `documentation` metadata
The `documentation` metadata would be an alternative. However, it
does not allow for multiple instances and may conflict with user
documentation such as `doc = "This asset is designed for background usage only"`.
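To make the conflict concrete: `documentation` is a single string per object, so a caption would compete with any existing docs. The sketch below uses the real `UsdObject::SetDocumentation` call; the strings are illustrative.

```cpp
#include <pxr/usd/usd/prim.h>

PXR_NAMESPACE_USING_DIRECTIVE

void ShowDocConflict(const UsdPrim& prim)
{
    // documentation is one string per object, so a semantic caption
    // and user documentation cannot coexist in the field.
    prim.SetDocumentation("This asset is designed for background usage only");
    prim.SetDocumentation("A bipedal robot painted red.");  // clobbers the above
}
```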
### Use `assetInfo` metadata
Unlike the `documentation` field, an `assetInfo` dictionary could
capture multiple named strings. However, metadata values cannot vary
over time.
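A sketch of what that might look like, using the real `UsdObject::SetAssetInfoByKey` call; the key names are hypothetical.

```cpp
#include <pxr/usd/usd/prim.h>
#include <pxr/base/vt/value.h>
#include <string>

PXR_NAMESPACE_USING_DIRECTIVE

void AuthorAssetInfoCaptions(const UsdPrim& prim)
{
    // Multiple named captions fit naturally into the dictionary...
    prim.SetAssetInfoByKey(TfToken("captions:summary"),
        VtValue(std::string("A bipedal robot painted red.")));
    prim.SetAssetInfoByKey(TfToken("captions:skills"),
        VtValue(std::string("Can perform the box step.")));
    // ...but, as metadata, these values cannot carry timeSamples.
}
```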
### Use `SemanticsLabelsAPI`
The proposed labeling API could be used to capture captions, but
there are two reasons to keep the APIs distinct.
* An authoring user interface for adding labels should be different
  from the user interface for adding descriptions (e.g. a drop-down
  box vs. a free-form text box).
* Merging ancestor labels into sets to perform membership queries
  has many potential applications (e.g. semantically partitioned
  image masks). There is no equivalent workflow for captions.
## Conversation

As mentioned in @dgovil's proposal, we also discussed future consideration for time-based descriptions. Sometimes a relevant time sequence needs an "announcement" for assistive technology, too, either tied to a transition (for example, between slide builds, where the change is more important than either end state) or to a time code on the overall timeline, similar to closed captions or audio descriptions.

Potentially relevant: I'm working on a PR for VTT to add an ATTRIBUTES block, generally to disambiguate various types of metadata, but specifically because it's a prerequisite for using VTT to define time-based general flash data (seizure avoidance, etc.) in this follow-up VTT issue.
I'm curious if you see these `timeSamples` keys aligning with other timed-text formats like VTT.
If I understand VTT correctly, it specifies time code ranges, while OpenUSD holds an authored value until the next authored time sample and pulls from the first or last sample when querying out of range. To describe that format in OpenUSD, you'd likely have to do something like the sketch below.

This is highly speculative, but I'm curious if there's a path to building something like VTT using the time series proposal as a starting point. It's currently designed for animation splines, but it might provide a path for eventually describing more complicated time-based value resolution.
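The snippet this comment refers to was not captured here; the following is a speculative reconstruction of the idea, authoring an explicit end-of-range sample so that a held value behaves like a bounded VTT cue (the attribute name is borrowed from the proposal's example).

```cpp
#include <pxr/usd/usd/attribute.h>
#include <pxr/usd/usd/prim.h>
#include <pxr/usd/sdf/valueTypeNames.h>
#include <string>

PXR_NAMESPACE_USING_DIRECTIVE

// Emulate a VTT-style cue over the range [100, 150) on a prim. Because
// USD holds a sample until the next authored sample, the cue needs an
// explicit empty sample at its end time to "close" the range.
void AuthorCueLikeCaption(const UsdPrim& prim)
{
    UsdAttribute caption = prim.CreateAttribute(
        TfToken("semantics:captions:skills"), SdfValueTypeNames->String);
    caption.Set(std::string(""), UsdTimeCode(0.0));    // nothing before the cue
    caption.Set(std::string("The robot is learning the box step"),
                UsdTimeCode(100.0));                   // cue start
    caption.Set(std::string(""), UsdTimeCode(150.0));  // cue end closes the range
}
```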
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Time series have actually been removed from the design for animation splines, in favor of more simply leveraging timeSamples for all non-scalar, non-floating-point varying data.