earthlab-education · github-classroom · Oct 22, 2024
diff --git a/notebooks/redlining-41-zonal-stats.ipynb b/notebooks/redlining-41-zonal-stats.ipynb
@@ -0,0 +1,178 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# STEP 6: Calculate zonal statistics\n",
+        "\n",
+        "In order to evaluate the connection between vegetation health and\n",
+        "redlining, we need to summarize NDVI across the same geographic areas as\n",
+        "we have redlining information.\n",
+        "\n",
+        "First, import variables from previous notebooks:"
+      ],
+      "id": "c6d4416c-848e-45b9-b5f0-b22939521c17"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "store -r denver_redlining_gdf denver_ndvi_da"
+      ],
+      "id": "8f1f6fdf"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-task\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Try It: Import packages</div></div><div class=\"callout-body-container callout-body\"><p>Some packages are included that will help you calculate statistics\n",
+        "for areas imported below. Add packages for:</p>\n",
+        "<ol type=\"1\">\n",
+        "<li>Interactive plotting of tabular and vector data</li>\n",
+        "<li>Working with categorical data in <code>DataFrame</code>s</li>\n",
+        "</ol></div></div>"
+      ],
+      "id": "0892b3e0-9c80-4c96-bfd4-be5f96b47fb4"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "highlight": true
+      },
+      "outputs": [],
+      "source": [
+        "# Interactive plots with pandas\n",
+        "# Ordered categorical data\n",
+        "import regionmask # Convert shapefile to mask\n",
+        "from xrspatial import zonal_stats # Calculate zonal statistics"
+      ],
+      "id": "1962e9a6"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-task\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Try It: Convert vector to raster</div></div><div class=\"callout-body-container callout-body\"><p>You can convert your vector data to a raster mask using the\n",
+        "<code>regionmask</code> package. You will need to give\n",
+        "<code>regionmask</code> the geographic coordinates of the grid you are\n",
+        "using for this to work:</p>\n",
+        "<ol type=\"1\">\n",
+        "<li>Replace <code>gdf</code> with your redlining\n",
+        "<code>GeoDataFrame</code>.</li>\n",
+        "<li>Add code to put your <code>GeoDataFrame</code> in the same CRS as\n",
+        "your raster data.</li>\n",
+        "<li>Replace <code>x_coord</code> and <code>y_coord</code> with the x and\n",
+        "y coordinates from your raster data.</li>\n",
+        "</ol></div></div>"
+      ],
+      "id": "3a9e01b4-ccea-4b3f-9e7f-785db5177aed"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "highlight": true
+      },
+      "outputs": [],
+      "source": [
+        "denver_redlining_mask = regionmask.mask_geopandas(\n",
+        "    gdf,\n",
+        "    x_coord, y_coord,\n",
+        "    # The regions do not overlap\n",
+        "    overlap=False,\n",
+        "    # We're not using geographic coordinates\n",
+        "    wrap_lon=False\n",
+        ")"
+      ],
+      "id": "d71e0470"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-task\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Try It: Calculate zonal statistics</div></div><div class=\"callout-body-container callout-body\"><p>Calculate zonal status using the <code>zonal_stats()</code> function.\n",
+        "To figure out which arguments it needs, use either the\n",
+        "<code>help()</code> function in Python, or search the internet.</p></div></div>"
+      ],
+      "id": "e6f5fd47-953e-4a0b-8a06-79b9bc7203b2"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "metadata": {
+        "highlight": true
+      },
+      "outputs": [],
+      "source": [
+        "# Calculate NDVI stats for each redlining zone"
+      ],
+      "id": "e3b82bef"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-task\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Try It: Plot regional statistics</div></div><div class=\"callout-body-container callout-body\"><p>Plot the regional statistics:</p>\n",
+        "<ol type=\"1\">\n",
+        "<li>Merge the NDVI values into the redlining\n",
+        "<code>GeoDataFrame</code>.</li>\n",
+        "<li>Use the code template below to convert the <code>grade</code> column\n",
+        "(<code>str</code> or <code>object</code> type) to an ordered\n",
+        "<code>pd.Categorical</code> type. This will let you use ordered color\n",
+        "maps with the grade data!</li>\n",
+        "<li>Drop all <code>NA</code> grade values.</li>\n",
+        "<li>Plot the NDVI and the redlining grade next to each other in linked\n",
+        "subplots.</li>\n",
+        "</ol></div></div>"
+      ],
+      "id": "d14c6bd6-b3d4-4bd3-b880-190929833e3e"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "highlight": true
+      },
+      "outputs": [],
+      "source": [
+        "# Merge the NDVI stats with redlining geometry into one `GeoDataFrame`\n",
+        "\n",
+        "# Change grade to ordered Categorical for plotting\n",
+        "gdf.grade = pd.Categorical(\n",
+        "    gdf.grade,\n",
+        "    ordered=True,\n",
+        "    categories=['A', 'B', 'C', 'D']\n",
+        ")\n",
+        "\n",
+        "# Drop rows with NA grades\n",
+        "denver_ndvi_gdf = denver_ndvi_gdf.dropna()\n",
+        "\n",
+        "# Plot NDVI and redlining grade in linked subplots"
+      ],
+      "id": "7a0a1bc9"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 10,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "store denver_ndvi_gdf"
+      ],
+      "id": "ed98342c"
+    }
+  ],
+  "nbformat": 4,
+  "nbformat_minor": 5,
+  "metadata": {
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python"
+    }
+  }
+}
diff --git a/notebooks/redlining-42-tree-model.ipynb b/notebooks/redlining-42-tree-model.ipynb
@@ -0,0 +1,209 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# STEP 7: Fit a model\n",
+        "\n",
+        "One way to determine if redlining is related to NDVI is to see if we can\n",
+        "correctly predict the redlining grade from the mean NDVI value. With 4\n",
+        "categories, we’d expect to be right only about 25% of the time if we\n",
+        "guessed the redlining grade at random. Any accuracy greater than 25%\n",
+        "indicates that there is a relationship between vegetation health and\n",
+        "redlining.\n",
+        "\n",
+        "To start out, we’ll fit a Decision Tree Classifier to the data. A\n",
+        "decision tree is good at splitting data up into squares by setting\n",
+        "thresholds. That makes sense for us here, because we’re looking for\n",
+        "thresholds in the mean NDVI that indicate a particular redlining grade.\n",
+        "\n",
+        "First, import variables from previous notebooks:"
+      ],
+      "id": "0eb4dc23-022c-4b3c-97eb-8c13c90e196c"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "store -r denver_ndvi_gdf"
+      ],
+      "id": "bfd58b22"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-task\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Try It: Import packages</div></div><div class=\"callout-body-container callout-body\"><p>The cell below imports some functions and classes from the\n",
+        "<code>scikit-learn</code> package to help you fit and evaluate a\n",
+        "decision tree model on your data. You may need some additional packages\n",
+        "later one. Make sure to import them here.</p></div></div>"
+      ],
+      "id": "c5a6e603-b9b9-4bac-9626-9b402f240124"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "highlight": true
+      },
+      "outputs": [],
+      "source": [
+        "from sklearn.tree import DecisionTreeClassifier, plot_tree\n",
+        "from sklearn.model_selection import train_test_split, cross_val_score"
+      ],
+      "id": "8128d1af"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As with all models, it is possible to **overfit** our Decision Tree\n",
+        "Classifier by splitting the data into too many disconnected rectangles.\n",
+        "We could theoretically get 100% accuracy this way, but drawing a\n",
+        "rectangle for each individual data point. There are many ways to try to\n",
+        "avoid overfitting. In this case, we can limit the **depth** of the\n",
+        "decision tree to 2. This means we’ll be drawing 4 rectangles, the same\n",
+        "as the number of categories we have.\n",
+        "\n",
+        "Alternative methods of limiting overfitting include:\n",
+        "\n",
+        "-   Splitting the data into test and train groups – the overfitted model\n",
+        "    is unlikely to fit data it hasn’t seen. In this case, we have\n",
+        "    relatively little data compared to the number of categories, and so\n",
+        "    it is hard to evaluate a test/train split.\n",
+        "-   Pruning the decision tree to maximize accuracy while minimizing\n",
+        "    complexity. `scikit-learn` will do this for you automatically. You\n",
+        "    can also fit the model at a variety of depths, and look for\n",
+        "    diminishing accuracy returns.\n",
+        "\n",
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-task\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Try It: Fit a tree model</div></div><div class=\"callout-body-container callout-body\"><p>Replace <code>predictor_variables</code> and\n",
+        "<code>observed_values</code> with the values you want to use in your\n",
+        "model.</p></div></div>"
+      ],
+      "id": "878dcf08-b82b-4ede-9403-f588d37cf047"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "highlight": true
+      },
+      "outputs": [],
+      "source": [
+        "# Convert categories to numbers\n",
+        "denver_ndvi_gdf['grade_codes'] = denver_ndvi_gdf.grade.cat.codes\n",
+        "\n",
+        "# Fit model\n",
+        "tree_classifier = DecisionTreeClassifier(max_depth=2).fit(\n",
+        "    predictor_variables,\n",
+        "    observed_values,\n",
+        ")\n",
+        "\n",
+        "# Visualize tree\n",
+        "plot_tree(tree_classifier)\n",
+        "plt.show()"
+      ],
+      "id": "10c162b6"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-task\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Try It: Plot model results</div></div><div class=\"callout-body-container callout-body\"><p>Create a plot of the results by:</p>\n",
+        "<ol type=\"1\">\n",
+        "<li>Predict grades for each region using the <code>.predict()</code>\n",
+        "method of your <code>DecisionTreeClassifier</code>.</li>\n",
+        "<li>Subtract the actual grades from the predicted grades</li>\n",
+        "<li>Plot the calculated prediction errors as a chloropleth.</li>\n",
+        "</ol></div></div>\n",
+        "\n",
+        "One method of evaluating your model’s accuracy is by cross-validation.\n",
+        "This involves selecting some of your data at random, fitting the model,\n",
+        "and then testing the model on a different group. Cross-validation gives\n",
+        "you a range of potential accuracies using a subset of your data. It also\n",
+        "has a couple of advantages, including:\n",
+        "\n",
+        "1.  It’s good at identifying overfitting, because it tests on a\n",
+        "    different set of data than it trains on.\n",
+        "2.  You can use cross-validation on any model, unlike statistics like\n",
+        "    $p$-values and $R^2$ that you may have used in the past.\n",
+        "\n",
+        "A disadvantage of cross-validation is that with smaller datasets like\n",
+        "this one, it is easy to end up with splits that are too small to be\n",
+        "meaningful, or don’t have all the categories.\n",
+        "\n",
+        "Remember – anything above 25% is better than random!\n",
+        "\n",
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-task\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Try It: Evaluate the model</div></div><div class=\"callout-body-container callout-body\"><p>Use cross-validation with the <code>cross_val_score</code> to\n",
+        "evaluate your model. Start out with the <code>'balanced_accuracy'</code>\n",
+        "scoring method, and 4 cross-validation groups.</p></div></div>"
+      ],
+      "id": "b29427a3-8a7f-4634-ba02-cf58fe3021ab"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "highlight": true
+      },
+      "outputs": [],
+      "source": [
+        "# Evaluate the model with cross-validation"
+      ],
+      "id": "03274512"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-extra\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Looking for an Extra Challenge?: Fit and evaluate an alternative model</div></div><div class=\"callout-body-container callout-body\"><p>Try out some other models and/or hyperparameters (e.g. changing the\n",
+        "max_depth). What do you notice?</p></div></div>"
+      ],
+      "id": "322e1b7b-7205-4776-aae3-dcc3b068977f"
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 10,
+      "metadata": {
+        "highlight": true
+      },
+      "outputs": [],
+      "source": [
+        "# Try another model"
+      ],
+      "id": "b2387dd1"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "<link rel=\"stylesheet\" type=\"text/css\" href=\"./assets/styles.css\"><div class=\"callout callout-style-default callout-titled callout-respond\"><div class=\"callout-header\"><div class=\"callout-icon-container\"><i class=\"callout-icon\"></i></div><div class=\"callout-title-container flex-fill\">Reflect and Respond</div></div><div class=\"callout-body-container callout-body\"><p>Practice writing about your model. In a few sentences, explain your\n",
+        "methods, including some advantages and disadvantages of the choice.\n",
+        "Then, report back on your results. Does your model indicate that\n",
+        "vegetation health in a neighborhood is related to its redlining\n",
+        "grade?</p></div></div>"
+      ],
+      "id": "89b80a8f-a63f-4ed8-9a20-51bb771a6bab"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# YOUR MODEL DESCRIPTION AND EVALUATION HERE"
+      ],
+      "id": "5286e179-8b3f-488d-889a-70829ae9cbaf"
+    }
+  ],
+  "nbformat": 4,
+  "nbformat_minor": 5,
+  "metadata": {
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python"
+    }
+  }
+}