How do I use T-Zero to train on my own dataset? I could not find an API for it. Should I add the dataset to the cache_data_dir? I do not want to reproduce the results; I want to use this tool to train on my own datasets #20

Open
flyingwaters opened this issue Mar 3, 2022 · 4 comments

Comments

@flyingwaters

No description provided.

@VictorSanh
Member

Hi @flyingwaters, did you have a look at https://github.com/bigscience-workshop/t-zero/tree/master/examples? I am not sure what specifically you are trying to do, but it might be what you are looking for. The example lets you fine-tune a model on a given task/dataset.

@flyingwaters
Author

flyingwaters commented Mar 8, 2022

Hi @VictorSanh, I ran into some problems when reproducing your results with t5==0.9.3.
I train the model on GPUs in an offline environment, so I downloaded sentencepiece.model locally
and passed these flags:
--gin_param="tsv_dataset_fn.vocabulary = SentencePieceVocabulary()"
--gin_param="get_sentencepiece_model_path = '/raid/yiptmp/huggingface-models/t5.1.1.lm100k.xxl'"
This fails with the following error:
SyntaxError: malformed node or string: <_ast.Name object at 0x7f42f90404d0>
Failed to parse token 'SentencePieceVocabulary'
I think T0 is great. Can you fix this bug? I also suspect the t5 version you used has since been updated; could you provide the requirements with pinned version numbers? I think that would help many researchers reproduce the results and develop this work further.
Thanks for your work!
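
Editor's note: the "malformed node or string" text in the error matches what Python's `ast.literal_eval` reports when it is handed a bare name instead of a literal. The sketch below assumes gin evaluates plain values roughly that way (an inference from the error text, not something confirmed in this thread). The helper `is_python_literal` is hypothetical and only illustrates why the quoted path is accepted while `SentencePieceVocabulary()` is not.

```python
import ast


def is_python_literal(text: str) -> bool:
    """Return True if `text` can be evaluated as a plain Python literal.

    Hypothetical helper for illustration only; it is not part of gin or t5.
    """
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        # Bare names and function calls are not literals, so they land here.
        return False


# The quoted path from the second flag is an ordinary string literal.
print(is_python_literal("'/raid/yiptmp/huggingface-models/t5.1.1.lm100k.xxl'"))

# The value from the first flag is a call on a bare name, not a literal.
print(is_python_literal("SentencePieceVocabulary()"))
```

In gin's own syntax, references to configurables are normally written with a leading `@` (for example `@some_fn()`); whether that form resolves this particular case in t5 is not established here.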

@VictorSanh
Member

@flyingwaters this seems related to an issue on the t5 codebase side: google-research/text-to-text-transfer-transformer#513

maybe @lintangsutawika you've got a suggestion on how to proceed?

Something hacky that worked for me is to modify the t5/data/utils.py file in the text-to-text-transfer-transformer codebase. Here is the diff:

-DEFAULT_SPM_PATH = "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"  # GCS
+DEFAULT_SPM_PATH = "LOCAL_PATH_TO_SENTENCEPIECE_MODEL"  # GCS
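
Editor's note: if the same edit has to be repeated on several offline machines, it could be scripted. The following is a minimal sketch, not part of t5 or T-Zero: `patch_default_spm_path` is a hypothetical helper, `/path/to/sentencepiece.model` is a placeholder, and the `OTHER_SETTING` line only stands in for the rest of the file. It operates on the file's text, so reading and writing the actual `t5/data/utils.py` is left to the caller.

```python
import re


def patch_default_spm_path(source: str, local_path: str) -> str:
    """Rewrite the DEFAULT_SPM_PATH assignment in `source` to `local_path`.

    Hypothetical helper for illustration; it mirrors the manual diff above.
    """
    pattern = re.compile(r"^DEFAULT_SPM_PATH\s*=.*$", flags=re.MULTILINE)
    replacement = f"DEFAULT_SPM_PATH = {local_path!r}  # local copy"
    # A function replacement avoids backslash-escape handling in the new text.
    patched, count = pattern.subn(lambda _match: replacement, source)
    if count != 1:
        raise ValueError(
            f"expected exactly one DEFAULT_SPM_PATH assignment, found {count}"
        )
    return patched


# Stand-in for the contents of t5/data/utils.py (second line is a placeholder).
original = (
    'DEFAULT_SPM_PATH = "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"  # GCS\n'
    "OTHER_SETTING = 1\n"
)
patched = patch_default_spm_path(original, "/path/to/sentencepiece.model")
print(patched)
```

The helper raises if the assignment is missing or duplicated, so a changed file layout in another t5 version fails loudly instead of being silently left unpatched.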

@lintangsutawika

lintangsutawika commented Mar 14, 2022

This is a long-standing issue; they haven't fixed it or given any timeline for a fix.

I recommend switching to T5X to retrain on your own dataset, or using HF's Trainer library for your use case.
