Data format tutorial
zxcjhg committed Nov 12, 2023
1 parent 7c49b17 commit 5b8421d
Showing 4 changed files with 377 additions and 0 deletions.
50 changes: 50 additions & 0 deletions tutorials/README.md
@@ -0,0 +1,50 @@
# Dataset formats

We support the following dataset formats:

- **Prompt Completion format**
- **ChatGPT format**
- **Plain Text format**


Before uploading your dataset to Hugging Face, please make sure that it is in one of the above formats.

We provide a tutorial notebook that shows how to convert a dataset to each format:

- **Prompt Completion format**: https://github.com/higgsfield-ai/higgsfield/tutorials/prompt_completion.ipynb
- **ChatGPT format**: https://github.com/higgsfield-ai/higgsfield/tutorials/chatgpt.ipynb
- **Plain Text format**: https://github.com/higgsfield-ai/higgsfield/tutorials/text_format.ipynb

### Format: Prompt Completion
```python
prompt_completion = {
    "prompt": [
        "prompt1",
        "prompt2",
    ],
    "completion": [
        "completion1",
        "completion2",
    ]
}
```
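
The two lists are paired by index, so it can help to check that they line up before uploading. The following is a minimal sketch, assuming a hypothetical helper `validate_prompt_completion` that is not part of this repository or of any library:

```python
def validate_prompt_completion(data: dict) -> int:
    """Return the number of prompt/completion pairs, or raise ValueError."""
    # Hypothetical helper for illustration only.
    for key in ("prompt", "completion"):
        if key not in data:
            raise ValueError(f"missing key: {key!r}")
        if not all(isinstance(item, str) for item in data[key]):
            raise ValueError(f"every entry in {key!r} must be a string")
    # prompt[i] is paired with completion[i], so the lengths must match.
    if len(data["prompt"]) != len(data["completion"]):
        raise ValueError("'prompt' and 'completion' must have the same length")
    return len(data["prompt"])


example = {
    "prompt": ["prompt1", "prompt2"],
    "completion": ["completion1", "completion2"],
}
print(validate_prompt_completion(example))
```

Running a check like this before `Dataset.from_dict` gives a clearer error message than a failure later in the upload.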

### Format: ChatGPT
```python
chatgpt_format = {
    "chatgpt": [
        [
            {"role": "system", "content": "You are a human."},
            {"role": "user", "content": "No I am not."},
            {"role": "assistant", "content": "I am a robot."},
        ],
    ]
}
```
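
Each entry under `chatgpt` is one conversation, given as a list of `role`/`content` messages. The sketch below checks that shape. It assumes a hypothetical helper `validate_chatgpt` and assumes that the three roles shown above (`system`, `user`, `assistant`) are the only ones accepted, which this README does not state explicitly:

```python
# Assumption: only the roles that appear in the example above are valid.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chatgpt(data: dict) -> int:
    """Return the number of conversations, or raise ValueError."""
    # Hypothetical helper for illustration only.
    conversations = data["chatgpt"]
    for i, conversation in enumerate(conversations):
        for j, message in enumerate(conversation):
            where = f"conversation {i}, message {j}"
            if set(message) != {"role", "content"}:
                raise ValueError(f"{where}: expected keys 'role' and 'content'")
            if message["role"] not in ALLOWED_ROLES:
                raise ValueError(f"{where}: unknown role {message['role']!r}")
            if not isinstance(message["content"], str):
                raise ValueError(f"{where}: 'content' must be a string")
    return len(conversations)


chatgpt_format = {
    "chatgpt": [
        [
            {"role": "system", "content": "You are a human."},
            {"role": "user", "content": "No I am not."},
            {"role": "assistant", "content": "I am a robot."},
        ],
    ]
}
print(validate_chatgpt(chatgpt_format))
```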

### Format: Plain Text
```python
text_format = {
    "text": ["text"]
}
```
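
The `text` column is a list of strings, so a corpus can be stored as one long entry or as several. The sketch below assumes a hypothetical helper `to_text_format` (not part of this repository) that drops blank documents. Whether training treats a single entry differently from many is not stated here.

```python
from typing import Dict, Iterable, List


def to_text_format(documents: Iterable[str]) -> Dict[str, List[str]]:
    """Wrap raw strings in the Plain Text layout, skipping blank documents."""
    # Hypothetical helper for illustration only.
    return {"text": [doc for doc in documents if doc.strip()]}


text_format = to_text_format(["First document.", "   ", "Second document."])
print(text_format)
```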
120 changes: 120 additions & 0 deletions tutorials/chatgpt.ipynb
@@ -0,0 +1,120 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "56b08e7a-f016-4783-a4ff-1bc51bf5534b",
"metadata": {},
"source": [
"## The ChatGPT format\n",
"```json\n",
"chatgpt_format = {\n",
" \"chatgpt\": [\n",
" [\n",
" {\"role\": \"system\", \"content\": \"You are a human.\"},\n",
" {\"role\": \"user\", \"content\": \"No I am not.\"},\n",
" {\"role\": \"assistant\", \"content\": \"I am a robot.\"},\n",
" ],\n",
" ]\n",
"}\n",
"```\n",
"### Example: converting a dataset from Hugging Face to the ChatGPT format and uploading to Hugging Face repo"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c27e49b3-9ce2-481f-9e55-f008abeadc3d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'text': '<HUMAN>: What is a panic attack?\\n<ASSISTANT>: Panic attacks come on suddenly and involve intense and often overwhelming fear. They’re accompanied by very challenging physical symptoms, like a racing heartbeat, shortness of breath, or nausea. Unexpected panic attacks occur without an obvious cause. Expected panic attacks are cued by external stressors, like phobias. Panic attacks can happen to anyone, but having more than one may be a sign of panic disorder, a mental health condition characterized by sudden and repeated panic attacks.'} 172\n"
]
}
],
"source": [
"from datasets import load_dataset\n",
"data = load_dataset(\"heliosbrahma/mental_health_chatbot_dataset\")[\"train\"]\n",
"print(data[0], len(data))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "351d1a81-c1c8-4bb0-9b8c-e14b54ff73f6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'role': 'system', 'content': 'You are a mental health assistant.'}, {'role': 'user', 'content': 'What is a panic attack?\\n'}, {'role': 'assistant', 'content': 'Panic attacks come on suddenly and involve intense and often overwhelming fear. They’re accompanied by very challenging physical symptoms, like a racing heartbeat, shortness of breath, or nausea. Unexpected panic attacks occur without an obvious cause. Expected panic attacks are cued by external stressors, like phobias. Panic attacks can happen to anyone, but having more than one may be a sign of panic disorder, a mental health condition characterized by sudden and repeated panic attacks.'}]\n"
]
}
],
"source": [
"chatgpt_format = {\n",
" \"chatgpt\": []\n",
"}\n",
"\n",
"SYSTEM_PROMPT = \"You are a mental health assistant.\"\n",
"for d in data:\n",
" text = d[\"text\"]\n",
" assistant_word_i = text.find(\"<A\")\n",
" human_text = text[9:assistant_word_i]\n",
" assistant_text = text[assistant_word_i + len(\"<ASSISTANT>: \"):]\n",
"\n",
" chatgpt_format[\"chatgpt\"].append(\n",
" [\n",
" {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
" {\"role\": \"user\", \"content\": human_text},\n",
" {\"role\": \"assistant\", \"content\": assistant_text}\n",
" ])\n",
"\n",
"print(chatgpt_format[\"chatgpt\"][0])"
]
},
{
"cell_type": "markdown",
"id": "ab541655-c234-41ca-9f7a-2f20f1ad3c01",
"metadata": {},
"source": [
"### Publish to Hugging Face repo"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88fd3ef0-1bc4-42d6-af93-b670b64a44c9",
"metadata": {},
"outputs": [],
"source": [
"dataset = Dataset.from_dict(chatgpt_format)\n",
"dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
136 changes: 136 additions & 0 deletions tutorials/prompt_completion.ipynb
@@ -0,0 +1,136 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "68c0040c-7c2f-44a3-934e-4969b14d1ebf",
"metadata": {},
"source": [
"## The Prompt Completion Format\n",
"```json\n",
" prompt_completion = {\n",
" \"prompt\": [\n",
" \"prompt1\",\n",
" \"prompt2\",\n",
" ],\n",
" \"completion\": [\n",
" \"completion1\",\n",
" \"completion2\",\n",
" ]\n",
" }\n",
"```\n",
"### Example: converting a dataset from Hugging Face to the Prompt Completion format and uploading to Hugging Face repo"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7c19feb1-28e6-4ecb-9545-af1f61fef7b9",
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"from datasets import Dataset \n",
"\n",
"data = load_dataset(\"tatsu-lab/alpaca\")[\"train\"]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "960dce00-cc00-479e-9b1b-6d8e2fd089d6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nGive three tips for staying healthy.\\n\\n### Response:\\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.'}\n"
]
}
],
"source": [
"print(data[0])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "893c8c1e-ce97-4d13-a718-ecc24fe5a97f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"###Instruction: Give three tips for staying healthy.\n",
"###Input: \n",
"###Assistant:\n",
"1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n",
"2. Exercise regularly to keep your body active and strong. \n",
"3. Get enough sleep and maintain a consistent sleep schedule.\n"
]
}
],
"source": [
"prompt_completion = {\n",
" \"prompt\": [],\n",
" \"completion\": []\n",
"}\n",
"\n",
"for d in data:\n",
" if \"input\" in d.keys():\n",
" prompt = f\"\"\"###Instruction: {d['instruction']}\n",
"###Input: {d['input']}\n",
"###Assistant:\"\"\"\n",
" else:\n",
" prompt = f\"\"\"###Instruction: {d['instruction']}\n",
"###Assistant:\"\"\"\n",
" prompt_completion[\"prompt\"].append(prompt)\n",
" prompt_completion[\"completion\"].append(d[\"output\"])\n",
"\n",
"print(prompt_completion[\"prompt\"][0])\n",
"print(prompt_completion[\"completion\"][0])"
]
},
{
"cell_type": "markdown",
"id": "dd5c509d-5fd5-43b5-b119-6029904ec266",
"metadata": {},
"source": [
"### Publish to Hugging Face repo"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6db95a3f-9bb7-403f-882b-b7e428be7a2c",
"metadata": {},
"outputs": [],
"source": [
"dataset = Dataset.from_dict(prompt_completion)\n",
"dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
71 changes: 71 additions & 0 deletions tutorials/text_format.ipynb
@@ -0,0 +1,71 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "595308c8-9feb-4b6c-bfea-922ef8b031e8",
"metadata": {},
"source": [
"## The Text format\n",
"### Example: converting the TINY SHAKESPEARE dataset into the \"Text\" format"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "546f924a-df45-48ba-849f-51cf919aca9d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Text len: 1115394\n"
]
}
],
"source": [
"import urllib.request\n",
"URL = \"https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\"\n",
"shakespeare_data = \"\".join([line.decode('utf-8') for line in urllib.request.urlopen(URL)])\n",
"print(\"Text len: \", len(shakespeare_data))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b84ca7b-d344-4ebe-9751-22d3e1482a3b",
"metadata": {},
"outputs": [],
"source": [
"text_format = {\n",
" \"text\": [shakespeare_data]\n",
"}\n",
"\n",
"from datasets import Dataset\n",
"dataset = Dataset.from_dict(text_format)\n",
"dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
