
Is there any plan for flan-ul2? #9

nonkung51 opened this issue Apr 4, 2023 · 6 comments

Comments

@nonkung51

I just wonder, is there going to be a new checkpoint released based on flan-ul2?

Anyway, I really love the work on this repo 😃 you guys did a really great job!

@chiayewken
Collaborator

Thanks, that is nice to hear :)
Unfortunately, we do not have immediate plans for flan-ul2 due to limited GPUs :(
However, we do have a new checkpoint trained on ShareGPT data; it seems to be more coherent and reasonable:

https://huggingface.co/declare-lab/flan-sharegpt-xl
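
In case it helps, here is a minimal sketch of loading that checkpoint with Hugging Face transformers (the model is T5-based, so the seq2seq classes apply; the prompt and generation settings below are illustrative assumptions, not the repo's defaults):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("declare-lab/flan-sharegpt-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("declare-lab/flan-sharegpt-xl")

# Encode an instruction and generate a reply.
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))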

@Logophoman

Logophoman commented Apr 4, 2023

@chiayewken How did you process the ShareGPT data? Do you take one utterance and its answer as a pair, or do you also account for context, so that we can use the model properly with LangChain? 🤔 I am currently training on a ShareGPT + Alpaca mix to see whether that can improve things, and in particular I am building up a context, so the data looks somewhat like this:

{"source": "content1", "target": "content2"}

{"source": "<user>content1 <assistant>content2 <user>content3", "target": "content4"}

{"source": "<user>content1 <assistant>content2 <user>content3 <assistant>content4 <user>content5", "target": "content6"}

{"source": "content1", "target": "content2"}

Here's the code to process the ShareGPT 90K data, taken from here:

https://huggingface.co/datasets/RyokoAI/ShareGPT52K/tree/main

import json


def clean_data(text: str) -> str:
    # Strip the HTML markup that the ShareGPT export wraps around messages,
    # converting code tags back into markdown.
    if text is None:
        return ""
    text = text.replace("<p>", "")
    text = text.replace("</p>", "")
    text = text.replace("<pre><code>", "```")
    text = text.replace("</code></pre>", "```")
    text = text.replace("<code>", "`")
    text = text.replace("</code>", "`")
    return text


def conversation_to_jsonl(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    with open(output_file, 'w', encoding='utf-8') as f:
        for item in data:
            conversations = item['conversations']

            conversation_history = []

            # Iterate through the conversation messages with a step size of 2,
            # assuming the dialog alternates human -> gpt.
            for i in range(0, len(conversations) - 1, 2):
                # Only process well-formed (human, gpt) pairs, so that stray
                # messages do not pollute the conversation history.
                if conversations[i]['from'] != 'human' or conversations[i + 1]['from'] != 'gpt':
                    continue

                conversation_history.append(clean_data(conversations[i]['value']))
                conversation_history.append('<assistant>' + clean_data(conversations[i + 1]['value']))

                # Keep at most the last 6 utterances as context.
                source_elements = conversation_history[-6:]

                # Defensive: never start the context with an assistant turn.
                if source_elements[0].startswith('<assistant>'):
                    source_elements = source_elements[1:]

                # Remove the last <assistant> message from the source; it becomes the target.
                source_elements = source_elements[:-1]

                if len(source_elements) > 1:
                    # After trimming, human turns sit at even positions; tag them with <user>.
                    source_elements = ['<user>' + x if j % 2 == 0 else x for j, x in enumerate(source_elements)]

                source = ' '.join(source_elements)
                target = clean_data(conversations[i + 1]['value'])
                f.write(json.dumps({"source": source, "target": target}) + '\n')


input_file = 'data/sg_90k_part2.json'  # "debug.json"
output_file = 'sg_90k_part2.json'  # "debug_out.json"

conversation_to_jsonl(input_file, output_file)
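
To sanity-check the output, something like this quick sketch (reusing the variables above) prints the first few pairs:

from itertools import islice

with open(output_file, encoding='utf-8') as f:
    for line in islice(f, 3):
        pair = json.loads(line)
        print(pair["source"])
        print("->", pair["target"])
        print()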

@chiayewken
Collaborator

Hi, I processed the ShareGPT data similarly to you: I take each utterance from GPT as the target sequence and the dialog history as the context:

def as_data(self) -> TextToTextData:
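
In case the linked snippet is hard to follow out of context, here is a rough sketch of that idea (not the repo's actual implementation; the function name and types below are hypothetical stand-ins):

from typing import Dict, List, Tuple

def dialog_to_pairs(conversations: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    # Every GPT turn becomes a target, with the full preceding dialog as the source.
    pairs = []
    history: List[str] = []
    for turn in conversations:
        if turn["from"] == "gpt":
            # The dialog so far is the context; this GPT utterance is the target.
            pairs.append((" ".join(history), turn["value"]))
            history.append("<assistant>" + turn["value"])
        else:
            history.append("<user>" + turn["value"])
    return pairs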

@Logophoman

Very nice! I think that will make the ShareGPT data way more useful than the Alpaca data! (Did you use the cleaned version?) Especially since we can put this stuff into long contexts now. I hope my training runs through smoothly, and if so, I'll be happy to share the weights...

@chiayewken
Collaborator

Great! Currently we are using the cleaned version from Vicuna, and it would be awesome if you are able to train and release the bigger models like flan-t5-xxl and flan-ul2 :)

@soujanyaporia
Contributor

@Logophoman Feel free to open a PR on this repo once you have trained the large models. Thanks!
