Data format tutorial

higgsfield-ai · Nov 12, 2023 · 5b8421d
1 parent 7c49b17
commit 5b8421d

Showing 4 changed files with 377 additions and 0 deletions.
diff --git a/tutorials/README.md b/tutorials/README.md
@@ -0,0 +1,50 @@
+# Dataset formats
+
+We support the following dataset formats:
+
+- **Prompt Completion format**
+- **The ChatGPT format**
+- **The Plain Text format**
+
+
+Before uploading your dataset to Hugging Face, please make sure that it is in one of the above formats.
+
+We provide a tutorial on how to convert your dataset to each format.
+
+- **Prompt Completion format** (https://github.com/higgsfield-ai/higgsfield/tutorials/prompt_completion.ipynb)
+- **ChatGPT format** (https://github.com/higgsfield-ai/higgsfield/tutorials/chatgpt.ipynb)
+- **Plain Text format** (https://github.com/higgsfield-ai/higgsfield/tutorials/text_format.ipynb)
+
+### Format: Prompt Completion
+```python
+prompt_completion = {
+    "prompt": [
+        "prompt1",
+        "prompt2",
+    ],
+    "completion": [
+        "completion1",
+        "completion2",
+    ],
+}
+```
+
+### Format: ChatGPT
+```python
+chatgpt_format = {
+    "chatgpt": [
+        [
+            {"role": "system", "content": "You are a human."},
+            {"role": "user", "content": "No I am not."},
+            {"role": "assistant", "content": "I am a robot."},
+        ],
+    ]
+}
+```
+
+### Format: Text
+```python
+text_format = {
+    "text": ["text"]
+}
+```
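Review note on the three layouts in `tutorials/README.md`: the sketch below shows one way to check a dictionary against them before uploading. It is not part of this commit. The helper name `detect_format` and the specific checks (equal-length `prompt`/`completion` lists, messages with exactly `role` and `content` keys) are assumptions inferred from the examples above, not rules stated by the library.

```python
def detect_format(data: dict) -> str:
    """Return the name of the README layout that `data` matches.

    Hypothetical helper: the checks are inferred from the README examples.
    """
    keys = set(data)

    if keys == {"prompt", "completion"}:
        prompts, completions = data["prompt"], data["completion"]
        # Each prompt is paired with the completion at the same index.
        if len(prompts) != len(completions):
            raise ValueError("'prompt' and 'completion' must have the same length")
        if not all(isinstance(s, str) for s in list(prompts) + list(completions)):
            raise ValueError("prompts and completions must be strings")
        return "prompt_completion"

    if keys == {"chatgpt"}:
        # A list of conversations; each conversation is a list of messages.
        for conversation in data["chatgpt"]:
            for message in conversation:
                if set(message) != {"role", "content"}:
                    raise ValueError("each message needs exactly 'role' and 'content'")
        return "chatgpt"

    if keys == {"text"}:
        if not all(isinstance(s, str) for s in data["text"]):
            raise ValueError("'text' must be a list of strings")
        return "text"

    raise ValueError(f"unrecognized layout with keys {sorted(keys)}")


print(detect_format({"text": ["text"]}))
```

Raising `ValueError` instead of returning `None` keeps a malformed dataset from reaching the upload step silently.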
diff --git a/tutorials/chatgpt.ipynb b/tutorials/chatgpt.ipynb
@@ -0,0 +1,120 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "56b08e7a-f016-4783-a4ff-1bc51bf5534b",
+   "metadata": {},
+   "source": [
+    "## The ChatGPT format\n",
+    "```python\n",
+    "chatgpt_format = {\n",
+    "    \"chatgpt\": [\n",
+    "        [\n",
+    "            {\"role\": \"system\", \"content\": \"You are a human.\"},\n",
+    "            {\"role\": \"user\", \"content\": \"No I am not.\"},\n",
+    "            {\"role\": \"assistant\", \"content\": \"I am a robot.\"},\n",
+    "        ],\n",
+    "    ]\n",
+    "}\n",
+    "```\n",
+    "### Example: converting a dataset from Hugging Face to the ChatGPT format and uploading it to a Hugging Face repo"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c27e49b3-9ce2-481f-9e55-f008abeadc3d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'text': '<HUMAN>: What is a panic attack?\\n<ASSISTANT>: Panic attacks come on suddenly and involve intense and often overwhelming fear. They’re accompanied by very challenging physical symptoms, like a racing heartbeat, shortness of breath, or nausea. Unexpected panic attacks occur without an obvious cause. Expected panic attacks are cued by external stressors, like phobias. Panic attacks can happen to anyone, but having more than one may be a sign of panic disorder, a mental health condition characterized by sudden and repeated panic attacks.'} 172\n"
+     ]
+    }
+   ],
+   "source": [
+    "from datasets import Dataset, load_dataset\n",
+    "data = load_dataset(\"heliosbrahma/mental_health_chatbot_dataset\")[\"train\"]\n",
+    "print(data[0], len(data))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "351d1a81-c1c8-4bb0-9b8c-e14b54ff73f6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[{'role': 'system', 'content': 'You are a mental health assistant.'}, {'role': 'user', 'content': 'What is a panic attack?\\n'}, {'role': 'assistant', 'content': 'Panic attacks come on suddenly and involve intense and often overwhelming fear. They’re accompanied by very challenging physical symptoms, like a racing heartbeat, shortness of breath, or nausea. Unexpected panic attacks occur without an obvious cause. Expected panic attacks are cued by external stressors, like phobias. Panic attacks can happen to anyone, but having more than one may be a sign of panic disorder, a mental health condition characterized by sudden and repeated panic attacks.'}]\n"
+     ]
+    }
+   ],
+   "source": [
+    "chatgpt_format = {\n",
+    "    \"chatgpt\": []\n",
+    "}\n",
+    "\n",
+    "SYSTEM_PROMPT = \"You are a mental health assistant.\"\n",
+    "for d in data:\n",
+    "    text = d[\"text\"]\n",
+    "    assistant_word_i = text.find(\"<ASSISTANT>\")\n",
+    "    human_text = text[len(\"<HUMAN>: \"):assistant_word_i]\n",
+    "    assistant_text = text[assistant_word_i + len(\"<ASSISTANT>: \"):]\n",
+    "\n",
+    "    chatgpt_format[\"chatgpt\"].append(\n",
+    "        [\n",
+    "            {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "            {\"role\": \"user\", \"content\": human_text},\n",
+    "            {\"role\": \"assistant\", \"content\": assistant_text}\n",
+    "        ])\n",
+    "\n",
+    "print(chatgpt_format[\"chatgpt\"][0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ab541655-c234-41ca-9f7a-2f20f1ad3c01",
+   "metadata": {},
+   "source": [
+    "### Publish to Hugging Face repo"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88fd3ef0-1bc4-42d6-af93-b670b64a44c9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = Dataset.from_dict(chatgpt_format)\n",
+    "dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
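Review note on the parsing cell in `tutorials/chatgpt.ipynb`: a minimal sketch of the same split written as a standalone function, assuming every record has the shape `<HUMAN>: ...` followed by `<ASSISTANT>: ...` as in the printed sample. The function name `to_chat` is hypothetical. Stripping surrounding whitespace is a choice made here; the notebook cell keeps the trailing newline on the user message.

```python
HUMAN_TAG = "<HUMAN>: "
ASSISTANT_TAG = "<ASSISTANT>: "


def to_chat(text: str, system_prompt: str) -> list:
    """Split one '<HUMAN>: ... <ASSISTANT>: ...' record into chat messages."""
    # partition() splits on the first occurrence of the assistant tag.
    human_part, separator, assistant_part = text.partition(ASSISTANT_TAG)
    if not separator or not human_part.startswith(HUMAN_TAG):
        raise ValueError("record does not match the expected tag layout")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": human_part[len(HUMAN_TAG):].strip()},
        {"role": "assistant", "content": assistant_part.strip()},
    ]


sample = "<HUMAN>: What is a panic attack?\n<ASSISTANT>: Panic attacks come on suddenly."
messages = to_chat(sample, "You are a mental health assistant.")
print(messages[1]["content"])
```

Failing loudly on a record that lacks either tag avoids the silent mis-slicing that fixed offsets would produce.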
diff --git a/tutorials/prompt_completion.ipynb b/tutorials/prompt_completion.ipynb
@@ -0,0 +1,136 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "68c0040c-7c2f-44a3-934e-4969b14d1ebf",
+   "metadata": {},
+   "source": [
+    "## The Prompt Completion Format\n",
+    "```python\n",
+    "prompt_completion = {\n",
+    "    \"prompt\": [\n",
+    "        \"prompt1\",\n",
+    "        \"prompt2\",\n",
+    "    ],\n",
+    "    \"completion\": [\n",
+    "        \"completion1\",\n",
+    "        \"completion2\",\n",
+    "    ],\n",
+    "}\n",
+    "```\n",
+    "### Example: converting a dataset from Hugging Face to the Prompt Completion format and uploading it to a Hugging Face repo"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "7c19feb1-28e6-4ecb-9545-af1f61fef7b9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset\n",
+    "from datasets import Dataset\n",
+    "\n",
+    "data = load_dataset(\"tatsu-lab/alpaca\")[\"train\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "960dce00-cc00-479e-9b1b-6d8e2fd089d6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nGive three tips for staying healthy.\\n\\n### Response:\\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(data[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "893c8c1e-ce97-4d13-a718-ecc24fe5a97f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "###Instruction: Give three tips for staying healthy.\n",
+      "###Assistant:\n",
+      "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n",
+      "2. Exercise regularly to keep your body active and strong. \n",
+      "3. Get enough sleep and maintain a consistent sleep schedule.\n"
+     ]
+    }
+   ],
+   "source": [
+    "prompt_completion = {\n",
+    "    \"prompt\": [],\n",
+    "    \"completion\": []\n",
+    "}\n",
+    "\n",
+    "for d in data:\n",
+    "    # Every row has an \"input\" key; it is an empty string when unused.\n",
+    "    if d[\"input\"]:\n",
+    "        prompt = f\"\"\"###Instruction: {d['instruction']}\n",
+    "###Input: {d['input']}\n",
+    "###Assistant:\"\"\"\n",
+    "    else:\n",
+    "        prompt = f\"\"\"###Instruction: {d['instruction']}\n",
+    "###Assistant:\"\"\"\n",
+    "    prompt_completion[\"prompt\"].append(prompt)\n",
+    "    prompt_completion[\"completion\"].append(d[\"output\"])\n",
+    "\n",
+    "print(prompt_completion[\"prompt\"][0])\n",
+    "print(prompt_completion[\"completion\"][0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd5c509d-5fd5-43b5-b119-6029904ec266",
+   "metadata": {},
+   "source": [
+    "### Publish to Hugging Face repo"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6db95a3f-9bb7-403f-882b-b7e428be7a2c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = Dataset.from_dict(prompt_completion)\n",
+    "dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
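Review note on the conversion loop in `tutorials/prompt_completion.ipynb`: a minimal sketch of the prompt template as a standalone function, assuming rows with an empty `input` should omit the `###Input:` line. The name `build_prompt` is hypothetical, and the second row below is made up for illustration; only the first instruction comes from the notebook output.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Render one prompt in the '###Instruction / ###Input / ###Assistant' template."""
    lines = [f"###Instruction: {instruction}"]
    # Only emit the input line when the row actually carries an input.
    if input_text.strip():
        lines.append(f"###Input: {input_text}")
    lines.append("###Assistant:")
    return "\n".join(lines)


# Illustrative rows with the same keys as the dataset sample printed above.
rows = [
    {"instruction": "Give three tips for staying healthy.", "input": "", "output": "1. Eat well."},
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
]

prompt_completion = {"prompt": [], "completion": []}
for row in rows:
    prompt_completion["prompt"].append(build_prompt(row["instruction"], row["input"]))
    prompt_completion["completion"].append(row["output"])

print(prompt_completion["prompt"][1])
```

Appending to both lists inside the same loop iteration keeps each prompt aligned with its completion by index.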
diff --git a/tutorials/text_format.ipynb b/tutorials/text_format.ipynb
@@ -0,0 +1,71 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "595308c8-9feb-4b6c-bfea-922ef8b031e8",
+   "metadata": {},
+   "source": [
+    "## The Text format\n",
+    "### Example: converting the TINY SHAKESPEARE dataset into the \"Text\" format"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "546f924a-df45-48ba-849f-51cf919aca9d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Text len:  1115394\n"
+     ]
+    }
+   ],
+   "source": [
+    "import urllib.request\n",
+    "URL = \"https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\"\n",
+    "shakespeare_data = urllib.request.urlopen(URL).read().decode(\"utf-8\")\n",
+    "print(\"Text len: \", len(shakespeare_data))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3b84ca7b-d344-4ebe-9751-22d3e1482a3b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_format = {\n",
+    "    \"text\": [shakespeare_data]\n",
+    "}\n",
+    "\n",
+    "from datasets import Dataset\n",
+    "dataset = Dataset.from_dict(text_format)\n",
+    "dataset.push_to_hub(\"<HUGGING_FACE_DATASET_REPO>\", token=\"<HUGGING_FACE_TOKEN>\") # Example of a dataset repo 'test/test_dataset'"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}