Add transcript + metadata processing #15
base: main
Conversation
Works fine on todi using
Can you share the error? Likely it's that the video is now private or no longer exists and so can't be downloaded.
How often does this happen?
We intentionally decided to ignore non-English videos for now to keep the scope smaller.
The timestamp range should be left inclusive, right exclusive. So if a clip is from timestamp 1m30 to 1m39 and frame A is at 1m29.9, frame B is at 1m30.0, ..., frame Y is at 1m39.9, and frame Z is at 1m40.0, then we would want frames B through Y (with A and Z excluded).
What is a bit more common is that there is no
Any idea when/why this occurs @kdu4108?
This not only contains transcripts but also other info.
Seems sensible. Let's hold off on the implementation until we have metadata containing transcripts with proper timestamps. --> TODO: adapt frame calculation
v2d_to_metadata.py
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process tarfiles contati JSONs and convert to structured JSONL format."
contati?
v2d_to_transcript.py
hrs, mins, secs = map(float, timestamp.split(":"))
total_seconds = timedelta(hours=hrs, minutes=mins, seconds=secs).total_seconds()
# TODO: is round the right way of doing this? Most transcripts are assigned to only 1-2 frames...
return round(total_seconds * fps)
We should be left inclusive and right exclusive.
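One way to get left-inclusive, right-exclusive behaviour is to floor the timestamp instead of rounding it. A minimal sketch, assuming the same colon-separated timestamp format as the snippet above (the function name is illustrative, not what the PR actually uses):

```python
import math
from datetime import timedelta

def timestamp_to_frame(timestamp: str, fps: float) -> int:
    """Map an HH:MM:SS(.fff) timestamp to a frame index via floor.

    A clip [start, end) then covers frames
    timestamp_to_frame(start, fps) .. timestamp_to_frame(end, fps) - 1,
    i.e. the start frame is included and the end frame is excluded.
    """
    hrs, mins, secs = map(float, timestamp.split(":"))
    total_seconds = timedelta(hours=hrs, minutes=mins, seconds=secs).total_seconds()
    return math.floor(total_seconds * fps)
```

For example, at 10 fps a clip from 00:01:30 to 00:01:40 would cover frames 900 through 999, matching the left-inclusive, right-exclusive convention.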
Looking very clean overall! It might be helpful to include here an example of the JSONs that are being converted into our format, for both metadata and transcripts.
Examples
Metadata:
Transcripts:
Thanks Markus, looks great! Two nits: (1) is there a reason it's called merge_data.py instead of train_val_test_split.py or something like that? And (2) can you add an example command showing how we would run that script in a comment? In particular, I want to clarify: does that splitting script take as input a modality folder like video_rgb or video_det, as opposed to the raw video folder?
Sure, renamed the script. I get this as output:
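As an aside on the split question above: one common way to implement a train/val/test split that stays consistent across modality folders (e.g. video_rgb vs. video_det) is to hash the video id rather than shuffle. This is only a rough illustration; the function name, ratios, and hashing scheme are assumptions, not what the renamed script actually does:

```python
import hashlib

def assign_split(video_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically assign a video id to train/val/test.

    Hashing the id (rather than shuffling) keeps the assignment stable
    across runs and modality folders, so the same video never leaks
    across splits.
    """
    h = int(hashlib.md5(video_id.encode("utf-8")).hexdigest(), 16)
    u = (h % 10_000) / 10_000  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"
```

Because the assignment depends only on the id, running the split independently on each modality folder yields the same partition.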
…ed because we don't have the container working yet to use whisper (cry)
6421dd5 to 1168fa0
Implements #16 and #12.
Transcript:
Output format (as jsonl for each video, with multiple videos in a tarfile):
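The layout described above (one JSONL per video, many videos per tarfile) could be written like this. A sketch only: the file name, field names, and helper are illustrative assumptions, not the PR's actual schema:

```python
import io
import json
import tarfile

def write_transcripts_tar(tar_path: str, videos: dict) -> None:
    """Write one JSONL member per video into a single tarfile.

    `videos` maps a video id to its list of transcript records;
    each record becomes one JSON line in `<video_id>.jsonl`.
    """
    with tarfile.open(tar_path, "w") as tar:
        for video_id, records in videos.items():
            payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
            info = tarfile.TarInfo(name=f"{video_id}.jsonl")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# Illustrative records; real field names may differ.
write_transcripts_tar(
    "transcripts.tar",
    {"vid001": [{"start": "00:00:01", "end": "00:00:04", "text": "hello"}]},
)
```

Reading a member back is the mirror image: open the tarfile, extract `<video_id>.jsonl`, and parse each line with `json.loads`.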
Open issues (also see TODO/FIXME in the code):
(Outdated now; see discussion below)
Metadata:
Very similar to the transcript structure, but saved to JSON instead of JSONL.
TODOs:
metadata/ directory. (#16 (comment))