diff --git a/notebooks/demo/nx_cugraph_demo.ipynb b/notebooks/demo/nx_cugraph_demo.ipynb new file mode 100644 index 00000000000..6e50370ed80 --- /dev/null +++ b/notebooks/demo/nx_cugraph_demo.ipynb @@ -0,0 +1,672 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# `nx-cugraph`: a NetworkX backend that provides GPU acceleration with RAPIDS cuGraph" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook will demonstrate the `nx-cugraph` NetworkX backend using the NetworkX betweenness_centrality algorithm.\n", + "\n", + "## Background\n", + "Networkx version 3.0 introduced a dispatching mechanism that allows users to configure NetworkX to dispatch various algorithms to third-party backends. Backends can provide different implementations of graph algorithms, allowing users to take advantage of capabilities not available in NetworkX. `nx-cugraph` is a NetworkX backend provided by the [RAPIDS](https://rapids.ai) cuGraph project that adds GPU acceleration to greatly improve performance.\n", + "\n", + "## System Requirements\n", + "Using `nx-cugraph` with this notebook requires the following: \n", + "- NVIDIA GPU, Pascal architecture or later\n", + "- CUDA 11.2, 11.4, 11.5, 11.8, or 12.0\n", + "- Python versions 3.9, 3.10, or 3.11\n", + "- NetworkX >= version 3.2\n", + " - _NetworkX 3.0 supports dispatching and is compatible with `nx-cugraph`, but this notebook will demonstrate features added in 3.2_\n", + " - At the time of this writing, NetworkX 3.2 is only available from source and can be installed by following the [development version install instructions](https://github.com/networkx/networkx/blob/main/INSTALL.rst#install-the-development-version).\n", + "- Pandas\n", + "\n", + "More details about system requirements can be found in the [RAPIDS System Requirements documentation](https://docs.rapids.ai/install#system-req)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Installation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Assuming NetworkX >= 3.2 has been installed using the [development version install instructions](https://github.com/networkx/networkx/blob/main/INSTALL.rst#install-the-development-version), `nx-cugraph` can be installed using either `conda` or `pip`. \n", + "\n", + "#### conda\n", + "```\n", + "conda install -c rapidsai-nightly -c conda-forge -c nvidia nx-cugraph\n", + "```\n", + "#### pip\n", + "```\n", + "python -m pip install nx-cugraph-cu11 --extra-index-url https://pypi.nvidia.com\n", + "```\n", + "#### _Notes:_\n", + " * nightly wheel builds will not be available until the 23.12 release, therefore the index URL for the stable release version is being used in the pip install command above.\n", + " * Additional information relevant to installing any RAPIDS package can be found [here](https://rapids.ai/#quick-start).\n", + " * If you installed any of the packages described here since running this notebook, you may need to restart the kernel to have them visible to this notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Notebook Helper Functions\n", + "\n", + "A few helper functions will be defined here that will be used in order to help keep this notebook easy to read." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "def reimport_networkx():\n", + " \"\"\"\n", + " Re-imports networkx for demonstrating different backend configuration\n", + " options applied at import-time. This is only needed for demonstration\n", + " purposes since other mechanisms are available for runtime configuration.\n", + " \"\"\"\n", + " # Using importlib.reload(networkx) has several caveats (described here:\n", + " # https://docs.python.org/3/library/imp.html?highlight=reload#imp.reload)\n", + " # which result in backend configuration not being re-applied correctly.\n", + " # Instead, manually remove all modules and re-import\n", + " nx_mods = [m for m in sys.modules.keys()\n", + " if (m.startswith(\"networkx\") or m.startswith(\"nx_cugraph\"))]\n", + " for m in nx_mods:\n", + " sys.modules.pop(m)\n", + " import networkx\n", + " return networkx\n", + "\n", + "\n", + "from pathlib import Path\n", + "import requests\n", + "import gzip\n", + "import pandas as pd\n", + "def create_cit_patents_graph(verbose=True):\n", + " \"\"\"\n", + " Downloads the cit-Patents dataset (if not previously downloaded), reads\n", + " it, and creates a nx.DiGraph from it and returns it.\n", + " cit-Patents is described here:\n", + " https://snap.stanford.edu/data/cit-Patents.html\n", + " \"\"\"\n", + " url = \"https://snap.stanford.edu/data/cit-Patents.txt.gz\"\n", + " gz_file_name = Path(url.split(\"/\")[-1])\n", + " csv_file_name = Path(gz_file_name.stem)\n", + " if csv_file_name.exists():\n", + " if verbose: print(f\"{csv_file_name} already exists, not downloading.\")\n", + " else:\n", + " if verbose: print(f\"downloading {url}...\", end=\"\", flush=True)\n", + " req = requests.get(url)\n", + " open(gz_file_name, \"wb\").write(req.content)\n", + " if verbose: print(\"done\")\n", + " if verbose: print(f\"unzipping {gz_file_name}...\", end=\"\", flush=True)\n", + " with gzip.open(gz_file_name, \"rb\") as gz_in:\n", + " with open(csv_file_name, \"wb\") as txt_out:\n", + " txt_out.write(gz_in.read())\n", + " if verbose: print(\"done\")\n", + "\n", + " if verbose: print(\"reading csv to dataframe...\", end=\"\", flush=True)\n", + " pandas_edgelist = pd.read_csv(\n", + " csv_file_name.name,\n", + " skiprows=4,\n", + " delimiter=\"\\t\",\n", + " names=[\"src\", \"dst\"],\n", + " dtype={\"src\":\"int32\", \"dst\":\"int32\"},\n", + " )\n", + " if verbose: print(\"done\")\n", + " if verbose: print(\"creating NX graph from dataframe...\", end=\"\", flush=True)\n", + " G = nx.from_pandas_edgelist(\n", + " pandas_edgelist, source=\"src\", target=\"dst\", create_using=nx.DiGraph\n", + " )\n", + " if verbose: print(\"done\")\n", + " return G" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Running `betweenness_centrality`\n", + "Let's start by running `betweenness_centrality` on the Karate Club graph using the default NetworkX implementation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Zachary's Karate Club\n", + "\n", + "Zachary's Karate Club is a small dataset consisting of 34 nodes and 78 edges which represent the friendships between members of a karate club. This dataset is small enough to make comparing results between NetworkX and `nx-cugraph` easy." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import networkx as nx\n", + "karate_club_graph = nx.karate_club_graph()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "Having NetworkX compute the `betweenness_centrality` values for each node on this graph is quick and easy." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2.51 ms ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" + ] + } + ], + "source": [ + "%%timeit global karate_nx_bc_results\n", + "karate_nx_bc_results = nx.betweenness_centrality(karate_club_graph)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Automatic GPU acceleration\n", + "When `nx-cugraph` is installed, NetworkX will detect it on import and make it available as a backend for APIs supported by that backend. However, NetworkX does not assume the user always wants to use a particular backend, and instead looks at various configuration mechanisms in place for users to specify how NetworkX should use installed backends. Since NetworkX was not configured to use a backend for the above `betweenness_centrality` call, it used the default implementation provided by NetworkX.\n", + "\n", + "The first configuration mechanism to be demonstrated below is the `NETWORKX_AUTOMATIC_BACKENDS` environment variable. This environment variable directs NetworkX to use the backend specified everywhere it's supported and does not require the user to modify any of their existing NetworkX code.\n", + "\n", + "To use it, a user sets `NETWORKX_AUTOMATIC_BACKENDS` in their shell to the backend they'd like to use. If a user has more than one backend installed, the environment variable can also accept a comma-separated list of backends, ordered by priority in which NetworkX should use them, where the first backend that supports a particular API call will be used. For example:\n", + "```\n", + "bash> export NETWORKX_AUTOMATIC_BACKENDS=cugraph\n", + "bash> python my_nx_app.py # uses nx-cugraph wherever possible, then falls back to default implementation where it's not.\n", + "```\n", + "or in the case of multiple backends installed\n", + "```\n", + "bash> export NETWORKX_AUTOMATIC_BACKENDS=cugraph,graphblas\n", + "bash> python my_nx_app.py # uses nx-cugraph if possible, then nx-graphblas if possible, then default implementation.\n", + "```\n", + "\n", + "NetworkX looks at the environment variable and the installed backends at import time, and will not re-examine the environment after that. Because `networkx` was already imported in this notebook, the `reimport_nx()` utility will be called after the `os.environ` dictionary is updated to simulate an environment variable being set in the shell.\n", + "\n", + "**Please note, this is only needed for demonstration purposes to compare runs both with and without fully-automatic backend use enabled.**" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"NETWORKX_AUTOMATIC_BACKENDS\"] = \"cugraph\"\n", + "nx = reimport_networkx()\n", + "# reimporting nx requires reinstantiating Graphs since python considers\n", + "# types from the prior nx import != types from the reimported nx\n", + "karate_club_graph = nx.karate_club_graph()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "Once the environment is updated, re-running the same `betweenness_centrality` call on the same graph requires no code changes." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "43.9 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit global karate_cg_bc_results\n", + "karate_cg_bc_results = nx.betweenness_centrality(karate_club_graph)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "We may see that the same computation actually took *longer* using `nx-cugraph`. This is not too surprising given how small the graph is, since there's a small amount of overhead to copy data to and from the GPU which becomes more obvious on very small graphs. We'll see with a larger graph how this overhead becomes negligible." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### Results Comparison\n", + "\n", + "Let's examine the results of each run to see how they compare. \n", + "The `betweenness_centrality` results are a dictionary mapping vertex IDs to betweenness_centrality scores. The score itself is usually not as important as the relative rank of each vertex ID (e.g. vertex A is ranked higher than vertex B in both sets of results)." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NX: (0, 0.437635), CG: (0, 0.437635)\n", + "NX: (33, 0.304075), CG: (33, 0.304075)\n", + "NX: (32, 0.145247), CG: (32, 0.145247)\n", + "NX: (2, 0.143657), CG: (2, 0.143657)\n", + "NX: (31, 0.138276), CG: (31, 0.138276)\n", + "NX: (8, 0.055927), CG: (8, 0.055927)\n", + "NX: (1, 0.053937), CG: (1, 0.053937)\n", + "NX: (13, 0.045863), CG: (13, 0.045863)\n", + "NX: (19, 0.032475), CG: (19, 0.032475)\n", + "NX: (5, 0.029987), CG: (5, 0.029987)\n", + "NX: (6, 0.029987), CG: (6, 0.029987)\n", + "NX: (27, 0.022333), CG: (27, 0.022333)\n", + "NX: (23, 0.017614), CG: (23, 0.017614)\n", + "NX: (30, 0.014412), CG: (30, 0.014412)\n", + "NX: (3, 0.011909), CG: (3, 0.011909)\n", + "NX: (25, 0.003840), CG: (25, 0.003840)\n", + "NX: (29, 0.002922), CG: (29, 0.002922)\n", + "NX: (24, 0.002210), CG: (24, 0.002210)\n", + "NX: (28, 0.001795), CG: (28, 0.001795)\n", + "NX: (9, 0.000848), CG: (9, 0.000848)\n", + "NX: (4, 0.000631), CG: (4, 0.000631)\n", + "NX: (10, 0.000631), CG: (10, 0.000631)\n", + "NX: (7, 0.000000), CG: (7, 0.000000)\n", + "NX: (11, 0.000000), CG: (11, 0.000000)\n", + "NX: (12, 0.000000), CG: (12, 0.000000)\n", + "NX: (14, 0.000000), CG: (14, 0.000000)\n", + "NX: (15, 0.000000), CG: (15, 0.000000)\n", + "NX: (16, 0.000000), CG: (16, 0.000000)\n", + "NX: (17, 0.000000), CG: (17, 0.000000)\n", + "NX: (18, 0.000000), CG: (18, 0.000000)\n", + "NX: (20, 0.000000), CG: (20, 0.000000)\n", + "NX: (21, 0.000000), CG: (21, 0.000000)\n", + "NX: (22, 0.000000), CG: (22, 0.000000)\n", + "NX: (26, 0.000000), CG: (26, 0.000000)\n" + ] + } + ], + "source": [ + "# The lists contain tuples of (vertex ID, betweenness_centrality score),\n", + "# sorted based on the score.\n", + "nx_sorted = sorted(karate_nx_bc_results.items(), key=lambda t:t[1], reverse=True)\n", + "cg_sorted = sorted(karate_cg_bc_results.items(), key=lambda t:t[1], reverse=True)\n", + "\n", + "for i in range(len(nx_sorted)):\n", + " print(\"NX: (%d, %.6f), CG: (%d, %.6f)\" % (nx_sorted[i] + cg_sorted[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "Here we can see that the results match exactly as expected. \n", + "\n", + "For larger graphs, results are harder to compare given that `betweenness_centrality` is an approximation algorithm influenced by the random selection of paths used to compute the betweenness_centrality score of each vertex. The argument `k` is used for limiting the number of paths used in the computation, since using every path for every vertex would be prohibitively expensive for large graphs. For small graphs, `k` need not be specified, which allows `betweenness_centrality` to use all paths for all vertices and makes for an easier comparison." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "### `betweenness_centrality` on larger graphs - The U.S. Patent Citation Network1\n", + "\n", + "The U.S. Patent Citation Network dataset is much larger with over 3.7M nodes and over 16.5M edges and demonstrates how `nx-cugraph` enables NetworkX to run `betweenness_centrality` on graphs this large (and larger) in seconds instead of minutes.\n", + "\n", + "#### NetworkX default implementation" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "downloading https://snap.stanford.edu/data/cit-Patents.txt.gz...done\n", + "unzipping cit-Patents.txt.gz...done\n", + "reading csv to dataframe...done\n", + "creating NX graph from dataframe...done\n" + ] + } + ], + "source": [ + "import os\n", + "# Unset NETWORKX_AUTOMATIC_BACKENDS so the default NetworkX implementation is used\n", + "os.environ.pop(\"NETWORKX_AUTOMATIC_BACKENDS\", None)\n", + "nx = reimport_networkx()\n", + "# Create the cit-Patents graph - this will also download the dataset if not previously downloaded\n", + "cit_patents_graph = create_cit_patents_graph()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# Since this is a large graph, a k value must be set so the computation returns in a reasonable time\n", + "k = 40" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "Because this run will take time, `%%timeit` is restricted to a single pass.\n", + "\n", + "*NOTE: this run may take approximately 1 minute*" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1min 4s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit -r 1\n", + "results = nx.betweenness_centrality(cit_patents_graph, k=k)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "Something to note is that `%%timeit` disables garbage collection by default, which may not be something a user is able to do. To see a more realistic real-world run time, `gc` can be enabled." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# import and run the garbage collector upfront prior to using it in the benchmark\n", + "import gc\n", + "gc.collect()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "*NOTE: this run may take approximately 7 minutes!*" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "6min 50s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit -r 1 gc.enable()\n", + "nx.betweenness_centrality(cit_patents_graph, k=k)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### `nx-cugraph`\n", + "\n", + "Running on a GPU using `nx-cugraph` can result in a tremendous speedup, especially when graphs reach sizes larger than a few thousand nodes or `k` values become larger to increase accuracy.\n", + "\n", + "Rather than setting the `NETWORKX_AUTOMATIC_BACKENDS` environment variable and re-importing again, this example will demonstrate the `backend=` keyword argument to explicitly direct the NetworkX dispatcher to use the `cugraph` backend." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit -r 1 gc.enable()\n", + "nx.betweenness_centrality(cit_patents_graph, k=k, backend=\"cugraph\")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "k = 150" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "11.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit -r 1 gc.enable()\n", + "nx.betweenness_centrality(cit_patents_graph, k=k, backend=\"cugraph\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "For the same graph and the same `k` value, the `\"cugraph\"` backend returns results in seconds instead of minutes. Increasing the `k` value has very little relative impact to runtime due to the high parallel processing ability of the GPU, allowing the user to get improved accuracy for virtually no additional cost." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Type-based dispatching\n", + "\n", + "NetworkX also supports automatically dispatching to backends associated with specific graph types. This requires the user to write code for a specific backend, and therefore requires the backend to be installed, but has the advantage of ensuring a particular behavior without the potential for runtime conversions.\n", + "\n", + "To use type-based dispatching with `nx-cugraph`, the user must import the backend directly in their code to access the utilities provided to create a Graph instance specifically for the `nx-cugraph` backend." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "import nx_cugraph as nxcg" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "The `from_networkx()` API will copy the data from the NetworkX graph instance to the GPU and return a new `nx-cugraph` graph instance. By passing an explicit `nx-cugraph` graph, the NetworkX dispatcher will automatically call the `\"cugraph\"` backend (and only the `\"cugraph\"` backend) without requiring future conversions to copy data to the GPU." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "7.92 s ± 2.85 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit -r 2 global nxcg_cit_patents_graph\n", + "nxcg_cit_patents_graph = nxcg.from_networkx(cit_patents_graph)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" + ] + } + ], + "source": [ + "%%timeit -r 1 gc.enable()\n", + "nx.betweenness_centrality(nxcg_cit_patents_graph, k=k)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "This notebook demonstrated `nx-cugraph`'s support for `betweenness_centrality`. At the time of this writing, `nx-cugraph` also provides support for `edge_netweenness_centrality` and `louvain_communities`. Other algorithms are scheduled to be supported based on their availability in the cuGraph [pylibcugraph](https://github.com/rapidsai/cugraph/tree/branch-23.10/python/pylibcugraph/pylibcugraph) package and demand by the NetworkX community.\n", + "\n", + "#### Benchmark Results\n", + "The results included in this notebook were generated on a workstation with the following hardware:\n", + "\n", + "\n", + " \n", + " \n", + "
CPU:Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz, 45GB
GPU:Quatro RTX 8000, 50GB
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1 Information on the U.S. Patent Citation Network dataset used in this notebook is as follows:\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
Authors:Jure Leskovec and Andrej Krevl
Title:SNAP Datasets, Stanford Large Network Dataset Collection
URL:http://snap.stanford.edu/data
Date:June 2014
\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}