From 50c234bc9526b19bba7eba8b4fc18403fc77b5c2 Mon Sep 17 00:00:00 2001 From: Elsa Culler Date: Tue, 22 Oct 2024 14:59:46 -0600 Subject: [PATCH] Add additional music --- notebooks/redlining-41-zonal-stats.ipynb | 178 +++++++++++++++++++ notebooks/redlining-42-tree-model.ipynb | 209 +++++++++++++++++++++++ 2 files changed, 387 insertions(+) create mode 100644 notebooks/redlining-41-zonal-stats.ipynb create mode 100644 notebooks/redlining-42-tree-model.ipynb diff --git a/notebooks/redlining-41-zonal-stats.ipynb b/notebooks/redlining-41-zonal-stats.ipynb new file mode 100644 index 0000000..2840aa1 --- /dev/null +++ b/notebooks/redlining-41-zonal-stats.ipynb @@ -0,0 +1,178 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# STEP 6: Calculate zonal statistics\n", + "\n", + "In order to evaluate the connection between vegetation health and\n", + "redlining, we need to summarize NDVI across the same geographic areas as\n", + "we have redlining information.\n", + "\n", + "First, import variables from previous notebooks:" + ], + "id": "c6d4416c-848e-45b9-b5f0-b22939521c17" + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "store -r denver_redlining_gdf denver_ndvi_da" + ], + "id": "8f1f6fdf" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
Try It: Import packages

Some packages that will help you calculate statistics for your areas\n", + "are already imported below. Add packages for:

\n", + "
    \n", + "
  1. Interactive plotting of tabular and vector data
  2. \n", + "
  3. Working with categorical data in DataFrames
  4. \n", + "
" + ], + "id": "0892b3e0-9c80-4c96-bfd4-be5f96b47fb4" + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "highlight": true + }, + "outputs": [], + "source": [ + "# Interactive plots with pandas\n", + "# Ordered categorical data\n", + "import regionmask # Convert shapefile to mask\n", + "from xrspatial import zonal_stats # Calculate zonal statistics" + ], + "id": "1962e9a6" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
Try It: Convert vector to raster

You can convert your vector data to a raster mask using the\n", + "regionmask package. For this to work, you will need to\n", + "give regionmask the geographic coordinates of the grid\n", + "you are using:

\n", + "
    \n", + "
  1. Replace gdf with your redlining\n", + "GeoDataFrame.
  2. \n", + "
  3. Add code to put your GeoDataFrame in the same CRS as\n", + "your raster data.
  4. \n", + "
  5. Replace x_coord and y_coord with the x and\n", + "y coordinates from your raster data.
  6. \n", + "
" + ], + "id": "3a9e01b4-ccea-4b3f-9e7f-785db5177aed" + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "highlight": true + }, + "outputs": [], + "source": [ + "denver_redlining_mask = regionmask.mask_geopandas(\n", + " gdf,\n", + " x_coord, y_coord,\n", + " # The regions do not overlap\n", + " overlap=False,\n", + " # We're not using geographic coordinates\n", + " wrap_lon=False\n", + ")" + ], + "id": "d71e0470" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
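Filled in, the call follows the pattern sketched below. The GeoDataFrame and grid coordinates here are tiny made-up stand-ins so you can see what `regionmask` returns; in your notebook you would pass your reprojected redlining GeoDataFrame and your raster's x and y coordinates instead:

```python
import geopandas as gpd
import numpy as np
import regionmask
from shapely.geometry import box

# Hypothetical stand-in for the redlining data: two non-overlapping squares
gdf = gpd.GeoDataFrame(
    {'grade': ['A', 'B']},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs='EPSG:32613',  # a projected CRS, like the NDVI raster
)

# Hypothetical grid coordinates, standing in for the raster's x and y
x_coord = np.array([0.5, 1.5])
y_coord = np.array([0.5])

mask = regionmask.mask_geopandas(
    gdf,
    x_coord, y_coord,
    # The regions do not overlap
    overlap=False,
    # Projected, not geographic, coordinates
    wrap_lon=False,
)

# Each grid cell now holds the index of the region it falls inside
print(mask.values)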
Try It: Calculate zonal statistics

Calculate zonal statistics using the zonal_stats() function.\n", + "To figure out which arguments it needs, use either the\n", + "help() function in Python or search the internet.

" + ], + "id": "e6f5fd47-953e-4a0b-8a06-79b9bc7203b2" + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "highlight": true + }, + "outputs": [], + "source": [ + "# Calculate NDVI stats for each redlining zone" + ], + "id": "e3b82bef" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
Try It: Plot regional statistics

Plot the regional statistics:

\n", + "
    \n", + "
  1. Merge the NDVI values into the redlining\n", + "GeoDataFrame.
  2. \n", + "
  3. Use the code template below to convert the grade column\n", + "(str or object type) to an ordered\n", + "pd.Categorical type. This will let you use ordered color\n", + "maps with the grade data!
  4. \n", + "
  5. Drop all NA grade values.
  6. \n", + "
  7. Plot the NDVI and the redlining grade next to each other in linked\n", + "subplots.
  8. \n", + "
" + ], + "id": "d14c6bd6-b3d4-4bd3-b880-190929833e3e" + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "highlight": true + }, + "outputs": [], + "source": [ + "# Merge the NDVI stats with redlining geometry into one `GeoDataFrame`\n", + "\n", + "# Change grade to ordered Categorical for plotting\n", + "gdf.grade = pd.Categorical(\n", + " gdf.grade,\n", + " ordered=True,\n", + " categories=['A', 'B', 'C', 'D']\n", + ")\n", + "\n", + "# Drop rows with NA grades\n", + "denver_ndvi_gdf = denver_ndvi_gdf.dropna()\n", + "\n", + "# Plot NDVI and redlining grade in linked subplots" + ], + "id": "7a0a1bc9" + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "store denver_ndvi_gdf" + ], + "id": "ed98342c" + } + ], + "nbformat": 4, + "nbformat_minor": 5, + "metadata": { + "kernelspec": { + "name": "python3", + "display_name": "Python 3 (ipykernel)", + "language": "python" + } + } +} \ No newline at end of file diff --git a/notebooks/redlining-42-tree-model.ipynb b/notebooks/redlining-42-tree-model.ipynb new file mode 100644 index 0000000..f10e7fc --- /dev/null +++ b/notebooks/redlining-42-tree-model.ipynb @@ -0,0 +1,209 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# STEP 7: Fit a model\n", + "\n", + "One way to determine if redlining is related to NDVI is to see if we can\n", + "correctly predict the redlining grade from the mean NDVI value. With 4\n", + "categories, we’d expect to be right only about 25% of the time if we\n", + "guessed the redlining grade at random. Any accuracy greater than 25%\n", + "indicates that there is a relationship between vegetation health and\n", + "redlining.\n", + "\n", + "To start out, we’ll fit a Decision Tree Classifier to the data. A\n", + "decision tree is good at splitting data up into squares by setting\n", + "thresholds. 
That makes sense for us here, because we’re looking for\n", + "thresholds in the mean NDVI that indicate a particular redlining grade.\n", + "\n", + "First, import variables from previous notebooks:" + ], + "id": "0eb4dc23-022c-4b3c-97eb-8c13c90e196c" + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "store -r denver_ndvi_gdf" + ], + "id": "bfd58b22" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
Try It: Import packages

The cell below imports some functions and classes from the\n", + "scikit-learn package to help you fit and evaluate a\n", + "decision tree model on your data. You may need some additional packages\n", + "later on. Make sure to import them here.

" + ], + "id": "c5a6e603-b9b9-4bac-9626-9b402f240124" + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "highlight": true + }, + "outputs": [], + "source": [ + "from sklearn.tree import DecisionTreeClassifier, plot_tree\n", + "from sklearn.model_selection import train_test_split, cross_val_score" + ], + "id": "8128d1af" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As with all models, it is possible to **overfit** our Decision Tree\n", + "Classifier by splitting the data into too many disconnected rectangles.\n", + "We could theoretically get 100% accuracy this way, by drawing a\n", + "rectangle for each individual data point. There are many ways to try to\n", + "avoid overfitting. In this case, we can limit the **depth** of the\n", + "decision tree to 2. This means we’ll be drawing 4 rectangles, the same\n", + "as the number of categories we have.\n", + "\n", + "Alternative methods of limiting overfitting include:\n", + "\n", + "- Splitting the data into test and train groups – the overfitted model\n", + " is unlikely to fit data it hasn’t seen. In this case, we have\n", + " relatively little data compared to the number of categories, and so\n", + " it is hard to evaluate a test/train split.\n", + "- Pruning the decision tree to maximize accuracy while minimizing\n", + " complexity. `scikit-learn` can do this for you through\n", + " cost-complexity pruning (the `ccp_alpha` parameter). You can also\n", + " fit the model at a variety of depths, and look for diminishing\n", + " accuracy returns.\n", + "\n", + "
Try It: Fit a tree model

Replace predictor_variables and\n", + "observed_values with the values you want to use in your\n", + "model.

" + ], + "id": "878dcf08-b82b-4ede-9403-f588d37cf047" + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "highlight": true + }, + "outputs": [], + "source": [ + "# Convert categories to numbers\n", + "denver_ndvi_gdf['grade_codes'] = denver_ndvi_gdf.grade.cat.codes\n", + "\n", + "# Fit model\n", + "tree_classifier = DecisionTreeClassifier(max_depth=2).fit(\n", + " predictor_variables,\n", + " observed_values,\n", + ")\n", + "\n", + "# Visualize tree\n", + "plot_tree(tree_classifier)\n", + "plt.show()" + ], + "id": "10c162b6" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
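If you want to check your understanding of the template above, here is the same fit run end-to-end on made-up numbers (the NDVI values and grade codes below are hypothetical, not the Denver data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical predictors (mean NDVI) and observed grade codes (A=0 .. D=3)
predictor_variables = np.array([[0.8], [0.6], [0.4], [0.2]])
observed_values = np.array([0, 1, 2, 3])

# Limit depth to 2, giving at most 4 rectangles -- one per grade
tree_classifier = DecisionTreeClassifier(max_depth=2).fit(
    predictor_variables,
    observed_values,
)

# A high-NDVI region should come back as grade A (code 0)
print(tree_classifier.predict([[0.75]]))
```

In your notebook, the predictors would come from a column of your merged GeoDataFrame (as a 2-D array), and the observed values from the `grade_codes` column.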
Try It: Plot model results

Create a plot of the results by:

\n", + "
    \n", + "
  1. Predict grades for each region using the .predict()\n", + "method of your DecisionTreeClassifier.
  2. \n", + "
  3. Subtract the actual grades from the predicted grades.
  4. \n", + "
  5. Plot the calculated prediction errors as a choropleth.
  6. \n", + "
\n", + "\n", + "One method of evaluating your model’s accuracy is by cross-validation.\n", + "This involves selecting some of your data at random, fitting the model,\n", + "and then testing the model on a different group. Cross-validation gives\n", + "you a range of potential accuracies using a subset of your data. It also\n", + "has a couple of advantages, including:\n", + "\n", + "1. It’s good at identifying overfitting, because it tests on a\n", + " different set of data than it trains on.\n", + "2. You can use cross-validation on any model, unlike statistics like\n", + " $p$-values and $R^2$ that you may have used in the past.\n", + "\n", + "A disadvantage of cross-validation is that with smaller datasets like\n", + "this one, it is easy to end up with splits that are too small to be\n", + "meaningful, or don’t have all the categories.\n", + "\n", + "Remember – anything above 25% is better than random!\n", + "\n", + "
Try It: Evaluate the model

Use cross-validation with the cross_val_score function\n", + "to evaluate your model. Start out with the 'balanced_accuracy'\n", + "scoring method and 4 cross-validation groups.

" + ], + "id": "b29427a3-8a7f-4634-ba02-cf58fe3021ab" + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "highlight": true + }, + "outputs": [], + "source": [ + "# Evaluate the model with cross-validation" + ], + "id": "03274512" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
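A self-contained sketch of the cross-validation call, on hypothetical data (in your notebook you would pass your own model, predictors, and observed grade codes instead):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 16 regions, with mean NDVI rising as grade improves
X = np.linspace(0.1, 0.8, 16).reshape(-1, 1)
y = np.repeat([3, 2, 1, 0], 4)  # grade codes D, C, B, A

# 4 cross-validation groups, scored by balanced accuracy
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=2), X, y,
    scoring='balanced_accuracy', cv=4,
)
print(scores.mean())
```

Because the data are split into 4 groups, you get 4 scores back; their mean and spread together give a sense of how stable the model's accuracy is.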
Looking for an Extra Challenge?: Fit and evaluate an alternative model

Try out some other models and/or hyperparameters (e.g. changing the\n", + "max_depth). What do you notice?

" + ], + "id": "322e1b7b-7205-4776-aae3-dcc3b068977f" + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "highlight": true + }, + "outputs": [], + "source": [ + "# Try another model" + ], + "id": "b2387dd1" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
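One possibility among many is a random forest, which averages many decision trees and often overfits less than a single deep tree. A sketch on hypothetical data (made-up values, not the Denver dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical NDVI/grade data, as in the examples above
X = np.linspace(0.1, 0.8, 16).reshape(-1, 1)
y = np.repeat([3, 2, 1, 0], 4)  # grade codes D, C, B, A

# Average 100 bootstrapped trees instead of fitting one tree
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A high-NDVI region should map to a good grade
print(forest.predict([[0.78]]))
```

You can evaluate it with the same `cross_val_score` call as before, which makes it easy to compare models side by side.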
Reflect and Respond

Practice writing about your model. In a few sentences, explain your\n", + "methods, including some advantages and disadvantages of your choice.\n", + "Then, report back on your results. Does your model indicate that\n", + "vegetation health in a neighborhood is related to its redlining\n", + "grade?

" + ], + "id": "89b80a8f-a63f-4ed8-9a20-51bb771a6bab" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# YOUR MODEL DESCRIPTION AND EVALUATION HERE" + ], + "id": "5286e179-8b3f-488d-889a-70829ae9cbaf" + } + ], + "nbformat": 4, + "nbformat_minor": 5, + "metadata": { + "kernelspec": { + "name": "python3", + "display_name": "Python 3 (ipykernel)", + "language": "python" + } + } +} \ No newline at end of file