Showing 4 changed files with 377 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
# Dataset formats | ||
|
||
We support the following dataset formats: | ||
|
||
- **Prompt Completion format** | ||
- **ChatGPT format** | ||
- **Plain Text format** | ||
|
||
|
||
Before uploading your dataset to Hugging Face, please make sure that it is in one of the above formats. | ||
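As a quick pre-upload check, the sketch below guesses which of the three formats a dictionary follows by looking only at its top-level keys. The helper name `detect_format` is hypothetical and not part of higgsfield; the key names are taken from the examples in this document.

```python
from typing import Any, Dict, Optional


def detect_format(dataset: Dict[str, Any]) -> Optional[str]:
    """Guess the dataset format from its top-level keys (hypothetical helper)."""
    keys = set(dataset)
    if keys == {"prompt", "completion"}:
        return "prompt_completion"
    if keys == {"chatgpt"}:
        return "chatgpt"
    if keys == {"text"}:
        return "text"
    # Unknown layout: the caller should convert the data first.
    return None


print(detect_format({"text": ["some text"]}))
```

A `None` result only says the keys do not match any documented layout; it does not validate the values inside each column.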
|
||
The following tutorial notebooks show how to convert a dataset to each format: | ||
|
||
- [**Prompt Completion format**](https://github.com/higgsfield-ai/higgsfield/tutorials/prompt_completion.ipynb) | ||
- [**ChatGPT format**](https://github.com/higgsfield-ai/higgsfield/tutorials/chatgpt.ipynb) | ||
- [**Plain Text format**](https://github.com/higgsfield-ai/higgsfield/tutorials/text_format.ipynb) | ||
|
||
### Format: Prompt Completion | ||
```python | ||
prompt_completion = { | ||
    "prompt": [ | ||
        "prompt1", | ||
        "prompt2", | ||
    ], | ||
    "completion": [ | ||
        "completion1", | ||
        "completion2", | ||
    ] | ||
} | ||
``` | ||
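In this layout the two lists are parallel columns, so the i-th prompt pairs with the i-th completion. A minimal sketch of a consistency check is shown below; `check_prompt_completion` is a hypothetical helper written for illustration and is not part of higgsfield or `datasets`.

```python
from typing import Dict, List


def check_prompt_completion(data: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the dict looks valid."""
    problems: List[str] = []
    prompts = data.get("prompt", [])
    completions = data.get("completion", [])
    # The columns are paired row by row, so their lengths must match.
    if len(prompts) != len(completions):
        problems.append(f"{len(prompts)} prompts but {len(completions)} completions")
    for i, (prompt, completion) in enumerate(zip(prompts, completions)):
        if not isinstance(prompt, str) or not isinstance(completion, str):
            problems.append(f"row {i}: prompt and completion must both be strings")
    return problems


example = {"prompt": ["prompt1", "prompt2"], "completion": ["completion1"]}
print(check_prompt_completion(example))
```

Returning a list of messages rather than raising on the first error makes it easier to report every issue in one pass.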
|
||
### Format: ChatGPT | ||
```python | ||
chatgpt_format = { | ||
    "chatgpt": [ | ||
        [ | ||
            {"role": "system", "content": "You are a human."}, | ||
            {"role": "user", "content": "No I am not."}, | ||
            {"role": "assistant", "content": "I am a robot."}, | ||
        ], | ||
    ] | ||
} | ||
``` | ||
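Each entry under `"chatgpt"` is one conversation: a list of messages, each with a `role` and a `content`. The sketch below checks a single conversation against that shape. The helper name `check_conversation` is hypothetical, and the allowed role set is an assumption based only on the three roles that appear in the example above.

```python
from typing import Any, Dict, List

# Assumption: only the roles shown in the example are accepted.
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_conversation(messages: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in one conversation (empty if none)."""
    problems: List[str] = []
    for i, message in enumerate(messages):
        if set(message) != {"role", "content"}:
            problems.append(f"message {i}: expected exactly the keys 'role' and 'content'")
            continue
        if message["role"] not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {message['role']!r}")
        if not isinstance(message["content"], str):
            problems.append(f"message {i}: content must be a string")
    return problems


conversation = [
    {"role": "system", "content": "You are a human."},
    {"role": "user", "content": "No I am not."},
    {"role": "assistant", "content": "I am a robot."},
]
print(check_conversation(conversation))
```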
|
||
### Format: Plain Text | ||
```python | ||
text_format = { | ||
    "text": ["text"] | ||
} | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,120 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "56b08e7a-f016-4783-a4ff-1bc51bf5534b", | ||
"metadata": {}, | ||
"source": [ | ||
"## The ChatGPT format\n", | ||
"```python\n", | ||
"chatgpt_format = {\n", | ||
"    \"chatgpt\": [\n", | ||
"        [\n", | ||
"            {\"role\": \"system\", \"content\": \"You are a human.\"},\n", | ||
"            {\"role\": \"user\", \"content\": \"No I am not.\"},\n", | ||
"            {\"role\": \"assistant\", \"content\": \"I am a robot.\"},\n", | ||
"        ],\n", | ||
"    ]\n", | ||
"}\n", | ||
"```\n", | ||
"### Example: converting a Hugging Face dataset to the ChatGPT format and uploading it to a Hugging Face repo" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "c27e49b3-9ce2-481f-9e55-f008abeadc3d", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'text': '<HUMAN>: What is a panic attack?\\n<ASSISTANT>: Panic attacks come on suddenly and involve intense and often overwhelming fear. They’re accompanied by very challenging physical symptoms, like a racing heartbeat, shortness of breath, or nausea. Unexpected panic attacks occur without an obvious cause. Expected panic attacks are cued by external stressors, like phobias. Panic attacks can happen to anyone, but having more than one may be a sign of panic disorder, a mental health condition characterized by sudden and repeated panic attacks.'} 172\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from datasets import Dataset, load_dataset\n", | ||
"data = load_dataset(\"heliosbrahma/mental_health_chatbot_dataset\")[\"train\"]\n", | ||
"print(data[0], len(data))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "351d1a81-c1c8-4bb0-9b8c-e14b54ff73f6", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"[{'role': 'system', 'content': 'You are a mental health assistant.'}, {'role': 'user', 'content': 'What is a panic attack?\\n'}, {'role': 'assistant', 'content': 'Panic attacks come on suddenly and involve intense and often overwhelming fear. They’re accompanied by very challenging physical symptoms, like a racing heartbeat, shortness of breath, or nausea. Unexpected panic attacks occur without an obvious cause. Expected panic attacks are cued by external stressors, like phobias. Panic attacks can happen to anyone, but having more than one may be a sign of panic disorder, a mental health condition characterized by sudden and repeated panic attacks.'}]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"chatgpt_format = {\n", | ||
" \"chatgpt\": []\n", | ||
"}\n", | ||
"\n", | ||
"SYSTEM_PROMPT = \"You are a mental health assistant.\"\n", | ||
"HUMAN_TAG = \"<HUMAN>: \"\n", | ||
"ASSISTANT_TAG = \"<ASSISTANT>: \"\n", | ||
"\n", | ||
"for d in data:\n", | ||
"    text = d[\"text\"]\n", | ||
"    # Each record has the form \"<HUMAN>: question\\n<ASSISTANT>: answer\".\n", | ||
"    assistant_word_i = text.find(ASSISTANT_TAG)\n", | ||
"    human_text = text[len(HUMAN_TAG):assistant_word_i]\n", | ||
"    assistant_text = text[assistant_word_i + len(ASSISTANT_TAG):]\n", | ||
"\n", | ||
"    chatgpt_format[\"chatgpt\"].append(\n", | ||
"        [\n", | ||
"            {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", | ||
"            {\"role\": \"user\", \"content\": human_text},\n", | ||
"            {\"role\": \"assistant\", \"content\": assistant_text}\n", | ||
"        ])\n", | ||
"\n", | ||
"print(chatgpt_format[\"chatgpt\"][0])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ab541655-c234-41ca-9f7a-2f20f1ad3c01", | ||
"metadata": {}, | ||
"source": [ | ||
"### Publish to Hugging Face repo" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "88fd3ef0-1bc4-42d6-af93-b670b64a44c9", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset = Dataset.from_dict(chatgpt_format)\n", | ||
"dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.18" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
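The conversion cell in this notebook slices each record at fixed offsets, which keeps the trailing newline in the user turn and silently misparses a record that lacks the assistant tag. A slightly more defensive variant is sketched below; `split_turns` is a hypothetical helper, and it assumes every record follows the `<HUMAN>: ...<ASSISTANT>: ...` layout seen in the notebook output. Unlike the notebook code, it strips surrounding whitespace from both turns.

```python
from typing import Tuple

HUMAN_TAG = "<HUMAN>: "
ASSISTANT_TAG = "<ASSISTANT>: "


def split_turns(text: str) -> Tuple[str, str]:
    """Split one record into (human_text, assistant_text)."""
    # partition() splits at the first occurrence of the assistant tag.
    human_part, separator, assistant_part = text.partition(ASSISTANT_TAG)
    if not separator or not human_part.startswith(HUMAN_TAG):
        raise ValueError("record does not match the expected layout")
    return human_part[len(HUMAN_TAG):].strip(), assistant_part.strip()


record = "<HUMAN>: What is a panic attack?\n<ASSISTANT>: Panic attacks come on suddenly."
print(split_turns(record))
```

Raising on an unexpected layout surfaces malformed rows early instead of writing truncated text into the dataset.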
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "68c0040c-7c2f-44a3-934e-4969b14d1ebf", | ||
"metadata": {}, | ||
"source": [ | ||
"## The Prompt Completion Format\n", | ||
"```python\n", | ||
"prompt_completion = {\n", | ||
"    \"prompt\": [\n", | ||
"        \"prompt1\",\n", | ||
"        \"prompt2\",\n", | ||
"    ],\n", | ||
"    \"completion\": [\n", | ||
"        \"completion1\",\n", | ||
"        \"completion2\",\n", | ||
"    ]\n", | ||
"}\n", | ||
"```\n", | ||
"### Example: converting a Hugging Face dataset to the Prompt Completion format and uploading it to a Hugging Face repo" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "7c19feb1-28e6-4ecb-9545-af1f61fef7b9", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from datasets import load_dataset\n", | ||
"from datasets import Dataset \n", | ||
"\n", | ||
"data = load_dataset(\"tatsu-lab/alpaca\")[\"train\"]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "960dce00-cc00-479e-9b1b-6d8e2fd089d6", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nGive three tips for staying healthy.\\n\\n### Response:\\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.'}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"print(data[0])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "893c8c1e-ce97-4d13-a718-ecc24fe5a97f", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"###Instruction: Give three tips for staying healthy.\n", | ||
"###Input: \n", | ||
"###Assistant:\n", | ||
"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n", | ||
"2. Exercise regularly to keep your body active and strong. \n", | ||
"3. Get enough sleep and maintain a consistent sleep schedule.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"prompt_completion = {\n", | ||
" \"prompt\": [],\n", | ||
" \"completion\": []\n", | ||
"}\n", | ||
"\n", | ||
"for d in data:\n", | ||
"    # Rows in tatsu-lab/alpaca always carry an \"input\" key (often an empty string),\n", | ||
"    # so this check only matters for datasets that lack the column entirely.\n", | ||
"    if \"input\" in d:\n", | ||
"        prompt = f\"\"\"###Instruction: {d['instruction']}\n", | ||
"###Input: {d['input']}\n", | ||
"###Assistant:\"\"\"\n", | ||
"    else:\n", | ||
"        prompt = f\"\"\"###Instruction: {d['instruction']}\n", | ||
"###Assistant:\"\"\"\n", | ||
"    prompt_completion[\"prompt\"].append(prompt)\n", | ||
"    prompt_completion[\"completion\"].append(d[\"output\"])\n", | ||
"\n", | ||
"print(prompt_completion[\"prompt\"][0])\n", | ||
"print(prompt_completion[\"completion\"][0])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "dd5c509d-5fd5-43b5-b119-6029904ec266", | ||
"metadata": {}, | ||
"source": [ | ||
"### Publish to Hugging Face repo" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "6db95a3f-9bb7-403f-882b-b7e428be7a2c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset = Dataset.from_dict(prompt_completion)\n", | ||
"dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.18" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
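In the recorded output of this notebook, a row with an empty `input` still produces a bare `###Input:` line in the prompt. If that is not wanted, one option is to test the value rather than the presence of the key. The sketch below is an illustrative alternative, not the notebook's behavior; `build_prompt` is a hypothetical helper that reuses the same `###Instruction:` / `###Input:` / `###Assistant:` template.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Build a prompt, leaving out the input section when it is empty."""
    lines = [f"###Instruction: {instruction}"]
    # Skip the input line for empty or whitespace-only inputs.
    if input_text.strip():
        lines.append(f"###Input: {input_text}")
    lines.append("###Assistant:")
    return "\n".join(lines)


print(build_prompt("Give three tips for staying healthy."))
print(build_prompt("Translate to French.", "Good morning"))
```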
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "595308c8-9feb-4b6c-bfea-922ef8b031e8", | ||
"metadata": {}, | ||
"source": [ | ||
"## The Text format\n", | ||
"### Example: converting the TINY SHAKESPEARE dataset into the \"Text\" format" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "546f924a-df45-48ba-849f-51cf919aca9d", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Text len: 1115394\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import urllib.request\n", | ||
"\n", | ||
"URL = \"https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\"\n", | ||
"# Read the whole file and close the connection when done.\n", | ||
"with urllib.request.urlopen(URL) as response:\n", | ||
"    shakespeare_data = response.read().decode(\"utf-8\")\n", | ||
"print(\"Text len: \", len(shakespeare_data))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "3b84ca7b-d344-4ebe-9751-22d3e1482a3b", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"text_format = {\n", | ||
" \"text\": [shakespeare_data]\n", | ||
"}\n", | ||
"\n", | ||
"from datasets import Dataset\n", | ||
"dataset = Dataset.from_dict(text_format)\n", | ||
"dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.18" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |