is there any plan for flan-ul2? #9
Thanks, that is nice to hear :)
@chiayewken How did you process the ShareGPT data? Do you take one utterance and its answer as a pair, or do you also account for context, so that we can use the system properly with Langchain? 🤔 I am currently training a ShareGPT + Alpaca mix to see whether that can improve things, and in particular I am building a context, so the data looks somewhat like this:
Here's the code to process the ShareGPT 90K taken from here: https://huggingface.co/datasets/RyokoAI/ShareGPT52K/tree/main

```python
import json


def clean_data(text: str) -> str:
    if text is None:
        return ""
    text = text.replace("<p>", "")
    text = text.replace("</p>", "")
    text = text.replace("<pre><code>", "```")
    text = text.replace("</code></pre>", "```")
    text = text.replace("<code>", "`")
    text = text.replace("</code>", "`")
    return text


def conversation_to_jsonl(input_file, output_file):
    with open(input_file, 'r') as f:
        data = json.load(f)
    with open(output_file, 'w') as f:
        for item in data:
            conversations = item['conversations']
            conversation_history = []
            # Iterate through the conversation messages with a step size of 2.
            for i in range(0, len(conversations) - 1, 2):
                conversation_history.append(clean_data(conversations[i]['value']))
                # Check if the current message is from a human and the next message is from GPT.
                if conversations[i]['from'] == 'human' and conversations[i + 1]['from'] == 'gpt':
                    conversation_history.append('<assistant>' + clean_data(conversations[i + 1]['value']))
                    # Keep at most the last 6 turns as context.
                    source_elements = conversation_history[-6:]
                    if source_elements[0].startswith('<assistant>'):
                        source_elements = source_elements[1:]
                    # Remove the last <assistant> message from the source.
                    source_elements = source_elements[:-1]
                    if len(source_elements) > 1:
                        source_elements = ['<user>' + x if j % 2 == 0 else x for j, x in enumerate(source_elements)]
                    source = ' '.join(source_elements)
                    target = clean_data(conversations[i + 1]['value'])
                    f.write(json.dumps({"source": source, "target": target}) + '\n')


input_file = 'data/sg_90k_part2.json'  # "debug.json"
output_file = 'sg_90k_part2.json'  # "debug_out.json"
conversation_to_jsonl(input_file, output_file)
```
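To illustrate what this produces, here is a minimal, self-contained trace of the same pairing logic on a tiny invented conversation (the messages and the `pairs` list are made up for illustration; they are not part of the original script):

```python
import json

# Hypothetical ShareGPT-style conversation, invented for illustration.
conversations = [
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi there!"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "4"},
]

pairs = []
history = []
for i in range(0, len(conversations) - 1, 2):
    history.append(conversations[i]["value"])
    if conversations[i]["from"] == "human" and conversations[i + 1]["from"] == "gpt":
        history.append("<assistant>" + conversations[i + 1]["value"])
        # Keep at most the last 6 turns, then drop the target itself from the source.
        source_elements = history[-6:]
        if source_elements[0].startswith("<assistant>"):
            source_elements = source_elements[1:]
        source_elements = source_elements[:-1]
        if len(source_elements) > 1:
            source_elements = ["<user>" + x if j % 2 == 0 else x
                               for j, x in enumerate(source_elements)]
        pairs.append({"source": " ".join(source_elements),
                      "target": conversations[i + 1]["value"]})

for p in pairs:
    print(json.dumps(p))
# First pair:  {"source": "Hello", "target": "Hi there!"}
# Second pair: {"source": "<user>Hello <assistant>Hi there! <user>What is 2+2?", "target": "4"}
```

One quirk worth knowing if you train on this format: the very first pair carries no `<user>` marker, because the prefixing only kicks in when the source has more than one element.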
Hi, I processed the ShareGPT data similarly to you: I take each utterance from GPT as the target sequence and the dialog history as context (see line 322 in commit c90aad7).
Very nice! I think that will make the ShareGPT data way more useful than the Alpaca data! (Did you use the cleaned version?) Especially since we can put this stuff into long contexts now. I hope my training runs through smoothly, and if so I'll be happy to share the weights...
Great! Currently we are using the cleaned version from vicuna, and it would be awesome if you are able to train and release the bigger models like flan-t5-xxl and flan-ul2 :)
@Logophoman Feel free to do a PR on this repo once you are training the large models. Thanks!
I'm just wondering: is there going to be a new checkpoint released based on flan-ul2?
Anyway, I really love the work on this repo 😃 you guys did a really great job!