Are images compressed in tsv files? #517

Open
ywwynm opened this issue Oct 12, 2024 · 3 comments

Comments

ywwynm commented Oct 12, 2024

Thanks for providing this collection of VLM datasets and models. I'm wondering why the tsv versions of the datasets in this repository are smaller than the official versions. For example, the RealWorldQA dataset downloaded from the official website (RealWorldQA) is 677MB, while the tsv version in this repo (RealWorldQA_tsv) is only 175MB. You use base64 to encode the images as text and store them directly in tsv columns, which should be lossless. So why is the data so much smaller? Other datasets seem to show a similar difference.
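To make the premise of the question concrete, here is a minimal sketch using only the Python standard library. It shows that base64 round-trips bytes exactly and produces 4 output characters for every 3 input bytes, so base64 alone would make a file larger, not smaller. The random `raw` bytes are a stand-in for an image file and are an assumption for illustration, not data from the dataset.

```python
import base64
import os

# Stand-in for the raw bytes of an image file (assumption: any bytes will do).
raw = os.urandom(3000)

# Encode to text, as one would before writing into a tsv column.
encoded = base64.b64encode(raw).decode("ascii")

# The round trip is lossless: decoding returns exactly the original bytes.
restored = base64.b64decode(encoded)
is_lossless = restored == raw

# base64 emits 4 characters for every 3 input bytes, so the text is about
# one third larger than the original.
growth = len(encoded) / len(raw)
print(is_lossless, round(growth, 3))
```

By this arithmetic, 677MB of image bytes stored unchanged as base64 would come to roughly 900MB, so a 175MB tsv implies the image bytes themselves were changed before encoding.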

Contributor

SYuan03 commented Oct 14, 2024

Hello, @ywwynm
The images in the dataset on the official RealWorldQA website are in webp format. When we converted the original dataset to tsv format, we uniformly re-encoded the images as JPEG during encoding; you can refer to the code here in our repo.
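The conversion described above can be sketched roughly as follows. This is an illustration written with Pillow, not the repository's actual code: the function names `encode_image_to_base64_jpeg` and `decode_base64_to_image` and the `quality` default are assumptions made for this example.

```python
import base64
import io

from PIL import Image


def encode_image_to_base64_jpeg(img: Image.Image, quality: int = 75) -> str:
    """Re-encode a PIL image as JPEG and return the bytes as base64 text.

    Hypothetical helper for illustration; the name and the quality default
    are assumptions, not taken from the repository.
    """
    # JPEG has no alpha channel or palette mode, so normalise to RGB first.
    if img.mode != "RGB":
        img = img.convert("RGB")
    buffer = io.BytesIO()
    # This save is the lossy step: pixels are re-compressed as JPEG
    # regardless of the format the image was loaded from.
    img.save(buffer, format="JPEG", quality=quality)
    # base64 itself is lossless; it only turns the JPEG bytes into text.
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def decode_base64_to_image(text: str) -> Image.Image:
    """Inverse of the text encoding: recover a PIL image from a tsv cell."""
    return Image.open(io.BytesIO(base64.b64decode(text)))


# Usage: an RGBA image comes back as an RGB JPEG of the same dimensions.
source = Image.new("RGBA", (64, 48), (255, 0, 0, 128))
cell = encode_image_to_base64_jpeg(source)
restored = decode_base64_to_image(cell)
print(restored.format, restored.mode, restored.size)
```

In a sketch like this, any size difference comes from the JPEG save step, while the base64 step only adds the fixed text overhead.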

Author

ywwynm commented Oct 14, 2024

@SYuan03 Thanks for your explanation. Is the same processing also applied to other datasets such as SeedBench or MMTBench? If the original images are already in JPEG format, are they compressed again by the same code?

Contributor

SYuan03 commented Oct 14, 2024

Hello, @ywwynm
In fact, if the original image is already in JPEG format, the data size will not change this significantly even after our processing. We simply convert it to the tsv format we need so that all datasets can be processed in a unified way.
