How do I use t-zero to train on my own dataset? I could not find the API. Should I add the dataset to the cache_data_dir? I do not want to reproduce the results; I want to use this tool to train on my own datasets #20

flyingwaters · 2022-03-03T06:18:01Z

No description provided.

VictorSanh · 2022-03-04T20:24:11Z

Hi @flyingwaters, did you have a look at https://github.com/bigscience-workshop/t-zero/tree/master/examples? I am not sure what specifically you are trying to do, but it might be what you are looking for. The example lets you fine-tune a model on a given task/dataset.

flyingwaters · 2022-03-08T15:08:07Z

Hi @VictorSanh, I ran into some problems when reproducing your results with t5==0.9.3.
I train the model on GPUs in an offline environment, so I downloaded sentencepiece.model manually
and used these flags:
--gin_param="tsv_dataset_fn.vocabulary = SentencePieceVocabulary()"
--gin_param="get_sentencepiece_model_path = '/raid/yiptmp/huggingface-models/t5.1.1.lm100k.xxl'"
but it fails with the following error:
SyntaxError: malformed node or string: <_ast.Name object at 0x7f42f90404d0>
Failed to parse token 'SentencePieceVocabulary'
I think T0 is nice. Can you fix this bug? I also think the t5 version you used may have been updated since; could you provide the requirements with version numbers? I think it will help many researchers reproduce the work and develop this technique further.
Thanks for your work!
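
The error text above matches what Python's `ast.literal_eval` reports for a bare identifier. That suggests (as an assumption, not something confirmed in this thread) that gin fell back to literal parsing for the token `SentencePieceVocabulary`. In gin syntax, a reference to a configurable is normally written with an `@` prefix, for example `@SentencePieceVocabulary()`; whether that alone resolves the problem here is not established. A minimal standard-library sketch of the parsing behaviour, where `parses_as_literal` is a hypothetical helper and the path is a placeholder:

```python
import ast


def parses_as_literal(token: str) -> bool:
    """Return True if `token` is a valid Python literal, False otherwise."""
    try:
        ast.literal_eval(token)
        return True
    except (ValueError, SyntaxError):
        # literal_eval raises ValueError ("malformed node or string ...")
        # for names and calls, and SyntaxError for unparsable text.
        return False


# A bare identifier is not a literal, so literal parsing rejects it.
print(parses_as_literal("SentencePieceVocabulary"))  # False
# A quoted path is a string literal and parses fine.
print(parses_as_literal("'/hypothetical/path/sentencepiece.model'"))  # True
```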

VictorSanh · 2022-03-11T16:15:33Z

@flyingwaters this seems related to an issue on the t5 codebase side: google-research/text-to-text-transfer-transformer#513

Maybe @lintangsutawika has a suggestion on how to proceed?

Something hacky that worked for me is to modify the t5/data/utils.py file in the text-to-text-transfer-transformer codebase. The diff:

-DEFAULT_SPM_PATH = "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"  # GCS
+DEFAULT_SPM_PATH = "LOCAL_PATH_TO_SENTENCEPIECE_MODEL"  # local file
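
The same one-line change can be scripted instead of edited by hand. The following is only a sketch under stated assumptions: `patch_default_spm_path` is a hypothetical helper, the local path is a placeholder, and it assumes the `DEFAULT_SPM_PATH` assignment sits on a single line as in the diff above. It operates on the file's text, so reading and writing `t5/data/utils.py` is left to the caller.

```python
import re

# Matches the whole single-line assignment, including any trailing comment.
ASSIGNMENT = re.compile(r"^DEFAULT_SPM_PATH\s*=.*$", re.MULTILINE)


def patch_default_spm_path(source: str, local_path: str) -> str:
    """Return `source` with DEFAULT_SPM_PATH pointing at `local_path`."""
    replacement = f"DEFAULT_SPM_PATH = {local_path!r}  # local file"
    # A callable replacement avoids backslash-escape handling in the path.
    patched, count = ASSIGNMENT.subn(lambda _m: replacement, source, count=1)
    if count == 0:
        raise ValueError("DEFAULT_SPM_PATH assignment not found")
    return patched


original = (
    'DEFAULT_SPM_PATH = "gs://t5-data/vocabs/cc_all.32000/sentencepiece.model"'
    "  # GCS\n"
)
print(patch_default_spm_path(original, "/hypothetical/local/sentencepiece.model"))
```

Raising when the assignment is missing keeps the script from silently leaving the GCS path in place if a different t5 version lays out the file differently.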

lintangsutawika · 2022-03-14T02:57:10Z

This is a long-standing issue that they haven't fixed, and they haven't given any timeline for a fix.

I recommend switching to T5X to retrain on your own dataset, or using Hugging Face's Trainer for your use case.

VictorSanh mentioned this issue Mar 11, 2022

how to reproduce the result in offline environment , I dowload the sentence.model and checkpoint but, the sentencepiece.model can not be recognised by the t5~!! #21

Closed
