
Is there any plan for flan-ul2? #9

nonkung51 opened this issue Apr 4, 2023 · 6 comments

Comments

@nonkung51

I just wonder, is there going to be a new checkpoint released based on flan-ul2?

Anyway, I really love the work on this repo 😃 you guys did a really great job!

@chiayewken
Collaborator

Thanks, that is nice to hear :)
Unfortunately, we do not have immediate plans for flan-ul2 due to limited GPUs :(
However, we do have a new checkpoint trained on ShareGPT data; it seems to be more coherent and reasonable:

https://huggingface.co/declare-lab/flan-sharegpt-xl
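
In case it helps, here is a minimal sketch of loading that checkpoint with Hugging Face transformers (the model is T5-based, so the seq2seq classes apply; the prompt and generation settings below are illustrative assumptions, not the repo's defaults):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("declare-lab/flan-sharegpt-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("declare-lab/flan-sharegpt-xl")

# Encode an instruction and generate a reply.
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))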

@Logophoman

Logophoman commented Apr 4, 2023

@chiayewken How did you process the ShareGPT data? Do you take one utterance and its answer as a pair, or do you also account for context, so that we can use the model properly with LangChain? 🤔 I am currently training on a ShareGPT + Alpaca mix to see whether that can improve things, and in particular I am building up a context, so the data looks somewhat like this:

{"source": "content1", "target": "content2"}

{"source": "<user>content1 <assistant>content2 <user>content3", "target": "content4"}

{"source": "<user>content1 <assistant>content2 <user>content3 <assistant>content4 <user>content5", "target": "content6"}

{"source": "content1", "target": "content2"}

Here's the code to process the ShareGPT 90K data, taken from here:

https://huggingface.co/datasets/RyokoAI/ShareGPT52K/tree/main

import json


def clean_data(text: str) -> str:
    # Strip the HTML markup that the ShareGPT export wraps around messages,
    # converting code tags back into markdown.
    if text is None:
        return ""
    text = text.replace("<p>", "")
    text = text.replace("</p>", "")
    text = text.replace("<pre><code>", "```")
    text = text.replace("</code></pre>", "```")
    text = text.replace("<code>", "`")
    text = text.replace("</code>", "`")
    return text


def conversation_to_jsonl(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    with open(output_file, 'w', encoding='utf-8') as f:
        for item in data:
            conversations = item['conversations']

            conversation_history = []

            # Iterate through the conversation messages with a step size of 2,
            # assuming the dialog alternates human -> gpt.
            for i in range(0, len(conversations) - 1, 2):
                # Only process well-formed (human, gpt) pairs, so that stray
                # messages do not pollute the conversation history.
                if conversations[i]['from'] != 'human' or conversations[i + 1]['from'] != 'gpt':
                    continue

                conversation_history.append(clean_data(conversations[i]['value']))
                conversation_history.append('<assistant>' + clean_data(conversations[i + 1]['value']))

                # Keep at most the last 6 utterances as context.
                source_elements = conversation_history[-6:]

                # Defensive: never start the context with an assistant turn.
                if source_elements[0].startswith('<assistant>'):
                    source_elements = source_elements[1:]

                # Remove the last <assistant> message from the source; it becomes the target.
                source_elements = source_elements[:-1]

                if len(source_elements) > 1:
                    # After trimming, human turns sit at even positions; tag them with <user>.
                    source_elements = ['<user>' + x if j % 2 == 0 else x for j, x in enumerate(source_elements)]

                source = ' '.join(source_elements)
                target = clean_data(conversations[i + 1]['value'])
                f.write(json.dumps({"source": source, "target": target}) + '\n')


input_file = 'data/sg_90k_part2.json'  # "debug.json"
output_file = 'sg_90k_part2.json'  # "debug_out.json"

conversation_to_jsonl(input_file, output_file)
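
To sanity-check the output, something like this quick sketch (reusing the variables above) prints the first few pairs:

from itertools import islice

with open(output_file, encoding='utf-8') as f:
    for line in islice(f, 3):
        pair = json.loads(line)
        print(pair["source"])
        print("->", pair["target"])
        print()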

@chiayewken
Collaborator

Hi, I processed the ShareGPT data similarly to you: I take each utterance from GPT as the target sequence and the dialog history as the context:

def as_data(self) -> TextToTextData:
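
In case the linked snippet is hard to follow out of context, here is a rough sketch of that idea (not the repo's actual implementation; the function name and types below are hypothetical stand-ins):

from typing import Dict, List, Tuple

def dialog_to_pairs(conversations: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    # Every GPT turn becomes a target, with the full preceding dialog as the source.
    pairs = []
    history: List[str] = []
    for turn in conversations:
        if turn["from"] == "gpt":
            # The dialog so far is the context; this GPT utterance is the target.
            pairs.append((" ".join(history), turn["value"]))
            history.append("<assistant>" + turn["value"])
        else:
            history.append("<user>" + turn["value"])
    return pairs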

@Logophoman

Very nice! I think that will make the ShareGPT data way more useful than the Alpaca data! (Did you use the cleaned version?) Especially since we can put this stuff into long contexts now. I hope my training runs through smoothly, and if so, I'll be happy to share the weights...

@chiayewken
Collaborator

Great! Currently we are using the cleaned version from Vicuna, and it would be awesome if you are able to train and release the bigger models like flan-t5-xxl and flan-ul2 :)

@soujanyaporia
Contributor

@Logophoman Feel free to open a PR on this repo once you have trained the large models. Thanks!
