---
knitr:
  opts_chunk:
    code-fold: show
    results: hold
---
# Vector Geospatial Data {#sec-chap05}
```{r}
#| label: setup
#| results: hold
#| include: false
base::source(file = "R/helper.R")
## create data folder (only once, e.g., only in this chapter)
baseURL <- here::here()
pb_create_folder(base::paste0(baseURL, "/data"))
## create chapter folder (for each)
pb_create_folder(base::paste0(baseURL, "/data/Chapter5"))
## set theme for ggplot2 graphics
ggplot2::theme_set(ggplot2::theme_bw())
```
::::: {#obj-chap05}
:::: {.my-objectives}
::: {.my-objectives-header}
Chapter section list
:::
::: {.my-objectives-container}
1. Import geospatial data: @sec-05-import-geodata
2. Creating simple maps: @sec-05-create-maps
3. Overlaying vector datasets: @sec-05-overlaying-vector-datasets
4. Save spatial geodata files: @sec-05-save-geodata
5. Choropleth maps: @sec-05-choropleth-maps
6. Modifying map appearance: @sec-05-modify-map-appearance
7. Exporting graphics output: @sec-05-export-graphics
8. Resources: @sec-05-resources
9. Practice
:::
::::
:::::
## Import Geospatial Data {#sec-05-import-geodata}
### ESRI shapefile format
The data for import in chapter 5 are provided in `r glossary("ESRI")` shapefile format. This format was developed several decades ago but remains one of the most widely used file formats for vector geospatial data. It is a multiple-file format, where separate files contain the feature geometries, attribute table, spatial indices, and coordinate reference system.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-import-geospatial-data}
: Import Geospatial Data
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: import-geospatial-data
glue::glue("############### import esri data #############")
okcounty <- sf::st_read("data/Chapter5/ok_counties.shp", quiet = TRUE)
tpoint <- sf::st_read("data/Chapter5/ok_tornado_point.shp", quiet = TRUE)
tpath <- sf::st_read("data/Chapter5/ok_tornado_path.shp", quiet = TRUE)
glue::glue("")
glue::glue("############### show data class #############")
class(okcounty)
glue::glue("")
glue::glue("############### show data with dplyr #############")
dplyr::glimpse(okcounty)
```
***
The {**sf**} objects contain a column called geometry. This is a special column that contains the geospatial information about the location of each feature. This column should not be modified directly. It is used by the functions in the {**sf**} package for geospatial data processing.
::::
:::::
::: {.callout-note #nte-05-skimr}
###### Using {skimr} with {sf}
Normally I use the `skimr::skim()` function for data summaries. But for the {**sf**} data classes in the `geometry` column there are no skimmers available. (Possible data types are: sfc_POINT, sfc_LINESTRING, sfc_POLYGON, sfc_MULTIPOINT, sfc_MULTILINESTRING, sfc_MULTIPOLYGON, and sfc_GEOMETRY.) In the above case `class(okcounty$geometry)` = "*`r class(okcounty$geometry)`*", a class that is not defined for {**skimr**}. The fallback to the "character" class is not useful. (`sfc` stands for "simple feature list column".)
It is possible to adapt {**skimr**} to work with user-defined data types using `skimr::skim_with()`. Resources that explain how to do this are:
- [Defining sfl’s for a package](https://docs.ropensci.org/skimr/articles/extending_skimr.html#defining-sfls-for-a-package): General article that explains how to define and use skimmers for user-defined data types. `sfl` stands for "skimr function list". It is a list-like data structure used to define custom summary statistics for specific data types.
- [skim of {**sf**} objects](https://github.com/ropensci/skimr/issues/88): Discussion specific to the {**sf**} package.
At the moment I do not understand enough about the {**sf**} package to get into more details for writing an appropriate function. I wonder if there is not already a solution available as spatial data processing with R and the {**sf**} package is not an extremely rare use case.
:::
In the R package {**sf**} (Simple Features), many functions are prefixed with `st_`. The prefix is inspired by [PostGIS](https://postgis.net/), where the abbreviation refers to “spatial type”. It is used consistently throughout {**sf**} to indicate that a function operates on spatial data, and it effectively serves as a namespace that makes spatial functions easy to identify and discover.
Looking at the file names I noticed that each dataset consists of four files that share the same base name and differ only in their extensions: `.dbf`, `.prj`, `.shp`, and `.shx`.
The shapefiles are imported to {**sf**} objects using the `sf::st_read()` function. The `quiet = TRUE` argument suppresses output to the console when importing spatial datasets.
An example of the output when using `quiet = FALSE` (the default option) is:
> Reading layer `ok_counties' from data source
`/Users/petzi/Documents/Meine-Repos/GDSWR/data/Chapter5/ok_counties.shp' using driver `ESRI Shapefile'
Simple feature collection with 77 features and 7 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -103.0025 ymin: 33.62184 xmax: -94.43151 ymax: 37.00163
Geodetic CRS: NAD83
To read in a shapefile, it is only necessary to specify the filename with a `.shp` extension. However, all the files, including the `.shp` file as well as the `.dbf`, `.shx`, and `.prj` files, need to be present in the directory from which the data are read.
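
As a hedged illustration of this requirement, the following base R sketch lists the companion files that belong to a `.shp` file and reports whether each one is present. `shapefile_parts()` is a hypothetical helper written for this note; it is not part of {**sf**}.

```r
# Hypothetical helper (not part of {sf}): list the companion files of a
# shapefile and check whether each of them exists on disk.
shapefile_parts <- function(shp_path, exts = c("shp", "shx", "dbf", "prj")) {
  stem  <- tools::file_path_sans_ext(shp_path)  # path without the extension
  parts <- paste0(stem, ".", exts)              # one path per extension
  data.frame(file = parts,
             exists = file.exists(parts),
             stringsAsFactors = FALSE)
}

shapefile_parts("data/Chapter5/ok_counties.shp")
```

If one of the rows shows `exists = FALSE`, that would be the first thing to check before calling `sf::st_read()`.
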
- The `ok_counties.shp` dataset contains county boundaries for the state of Oklahoma.
- The `ok_tornado_point.shp` dataset and the `ok_tornado_path.shp` dataset contain historical information about tornadoes in Oklahoma.
- The points are the initial locations of tornado touchdown.
- The paths are lines that identify the path of each tornado after touchdown.
- These data were derived from larger, national-level datasets generated by the National Oceanic and Atmospheric Administration (NOAA) [National Weather Service Storm Prediction Center](https://www.spc.noaa.gov/gis/svrgis/).
### Conversion data sf <-> sp
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-data-conversion-sf-sp}
: {**sf**} data to {**sp**} data and reverse
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: data-conversion-sf-sp
glue::glue("############### convert from sf to sp data #############")
okcounty_sp <- sf::as_Spatial(okcounty) # as() belongs to {methods}, not {sf}: sf::as(okcounty, 'Spatial') does not work!
class(okcounty_sp)
glue::glue("")
glue::glue("############### convert from sp to sf data #############")
okcounty_sf <- sf::st_as_sf(okcounty_sp)
class(okcounty_sf)
```
::::
:::::
## Creating simple maps {#sec-05-create-maps}
A good strategy to get an overview of the data is to plot them as a map. There are two options: using `ggplot2::geom_sf()` or `base::plot()`.
### Draw Oklahoma county boundaries
To generate a map of counties using `ggplot2::ggplot()` with an {**sf**} object, the `ggplot2::geom_sf()` function is used.
From the point of view of the {**ggplot2**} package, `ggplot2::geom_sf()` is an unusual geom because it draws different geometric objects depending on which simple features are present in the data: you can get points, lines, or polygons. For text and labels, you can use `ggplot2::geom_sf_text()` and `ggplot2::geom_sf_label()`.
::: {.my-code-collection}
:::: {.my-code-collection-header}
::::: {.my-code-collection-icon}
:::::
:::::: {#exm-05-ploting-oklahoma-county-boundaries}
: Plotting Oklahoma county boundaries
::::::
::::
::::{.my-code-collection-container}
::: {.panel-tabset}
###### `ggplot2`
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-oklahoma-county-boundaries-ggplot2}
: Oklahoma county boundaries with {**ggplot2**}
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-05-oklahoma-county-boundaries-ggplot2
#| fig-cap: "Oklahoma county boundaries plotted with {ggplot2}"
#| fig-height: 3
ggplot2::ggplot(data = okcounty) +
  ggplot2::geom_sf(fill = NA) +
  ggplot2::theme_void()
```
***
`fill = NA` makes the counties transparent.
(To get the same result as in the base R approach I used `ggplot2::theme_void()` to hide the coordinates, which are shown in the original book example.)
::::
:::::
###### `base::plot()`
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-oklahoma-county-boundaries-base-plot}
: Oklahoma county boundaries with `base::plot()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-05-oklahoma-county-boundaries-base-plot
#| fig-cap: "Oklahoma county boundaries plotted with base::plot()"
graphics::par(mar = c(0, 0, 0, 0))
base::plot(okcounty$geometry)
```
***
From the R Graph Gallery I learned that I could also use base R to plot spatial geodata. But {**ggplot2**} seems to be the generally preferred approach.
::::
:::::
:::
::::
:::::
::: {.callout-note #nte-05-too-much-white-space}
###### Too much space around choropleth map
As you can see from both graphics there is ample space around the map. I do not know how to remove it. Therefore I wrote a [question on StackOverflow](https://stackoverflow.com/questions/79295599/remove-white-space-around-sf-ggplot2-choropleth-map). I used a simple example provided by the {**sf**} package.
:::
### Inspect `tpoint` and `tpath`
Because {**sf**} objects are a type of data frame, they can be modified using the normal {**tidyverse**} functions. Let's look at the two other R objects we've generated in @cnj-05-import-geospatial-data.
::: {.my-code-collection}
:::: {.my-code-collection-header}
::::: {.my-code-collection-icon}
:::::
:::::: {#exm-05-show-tornado-file-structure}
: Display structure of the tornado files
::::::
::::
::::{.my-code-collection-container}
::: {.panel-tabset}
###### `tpoint`
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-glimpse-tpoint}
: Glimpse at `tpoint`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: glimpse-tpoint
dplyr::glimpse(tpoint)
```
***
The points are the initial locations of tornado touchdowns.
::::
:::::
###### `tpath`
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-glimpse-tpath}
: Glimpse at `tpath`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: glimpse-tpath
dplyr::glimpse(tpath)
```
***
The paths are lines that identify the path of each tornado after touchdown.
::::
:::::
:::
::::
:::::
From `dplyr::glimpse()` we get an idea about the data structure. But we do not know the range of values covered by each variable. This is especially important for our next task, which focuses on the data of the most recent years. We know from @exm-05-show-tornado-file-structure that the dataset starts with the year 1950 but we have no clue about the middle or end of the dataset.
For this reason I have written a helper function, `pb_glance_data()`, that returns the first and last rows of a dataset together with several random rows. By default eight rows are shown, but this can be changed with a second parameter.
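
The helper itself lives in `R/helper.R` and is not shown in this chapter. The idea behind it can be sketched in base R as follows. Note that `glance_rows()` and its `seed` argument are hypothetical names chosen for this sketch; it is not the actual implementation of `pb_glance_data()`.

```r
# Hypothetical sketch, NOT the author's pb_glance_data():
# return the first row, the last row and n - 2 random rows in between.
glance_rows <- function(df, n = 8, seed = 42) {
  stopifnot(n >= 2)
  total <- nrow(df)
  if (total <= n) {
    return(df)                     # nothing to sample, show everything
  }
  set.seed(seed)                   # fixed seed: same rows on every call
  middle <- sort(sample(seq.int(2, total - 1), size = n - 2))
  df[c(1, middle, total), , drop = FALSE]
}

# Toy data standing in for the tornado table
toy <- data.frame(yr = 1950:2021)
glance_rows(toy, n = 5)
```

Because the first and last rows are always included, the output shows at once which years the data start and end with. Be aware that `set.seed()` changes the state of the global random number generator.
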
::: {.my-code-collection}
:::: {.my-code-collection-header}
::::: {.my-code-collection-icon}
:::::
:::::: {#exm-05-show-random-tornado-data}
: Show some random tornado data, including first and last record
::::::
::::
::::{.my-code-collection-container}
::: {.panel-tabset}
###### `tpoint`
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-show-random-tpoint-data}
: Show random `tpoint` data
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: show-random-tpoint-data
pb_glance_data(tpoint)
```
::::
:::::
###### `tpath`
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-show-random-tpath-data}
: Show random `tpath` data
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: show-random-tpath-data
pb_glance_data(tpath)
```
::::
:::::
:::
::::
:::::
### Visualization of the Oklahoma tornado data (2016-2021)
As noted above, {**sf**} objects are a type of data frame and can be modified using the normal {**tidyverse**} functions. The tabs below make use of this:
- A reduced dataset for the years 2016-2021 with only the columns ID (`om`), year (`yr`), and date (`date`) is prepared in the first tab `reduce data`.
- Initiation points of tornadoes in Oklahoma from 2016–2021 are shown in tab `initiation points`.
- Tab `tornado paths` shows the paths of tornadoes in Oklahoma from 2016–2021.
- Initiation points of tornadoes in Oklahoma from 2016–2021 with years represented by the color aesthetic are in tab `color aesthetic`.
- In the final tab `facets` you will see the initiation points of tornadoes in Oklahoma from 2016–2021 with years mapped as separate facets.
::: {.my-code-collection}
:::: {.my-code-collection-header}
::::: {.my-code-collection-icon}
:::::
:::::: {#exm-05-different-tornado-visualizations}
: Show different visualizations of the Oklahoma tornado data (2016-2021)
::::::
::::
::::{.my-code-collection-container}
::: {.panel-tabset}
###### reduce data
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-reduce-tornado-data}
: Filter data from 2016 to 2021 and select only three columns (ID, year and date)
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: reduce-tornado-data
#| code-fold: show
tpoint_16_21 <- tpoint |>
  dplyr::filter(yr >= 2016 & yr <= 2021) |>
  dplyr::select(om, yr, date)
tpath_16_21 <- tpath |>
  dplyr::filter(yr >= 2016 & yr <= 2021) |>
  dplyr::select(om, yr, date)
```
<center>(*For this R code chunk is no output available*)</center>
::::
:::::
###### initiation points
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-show-tornado-initation-points}
: Show initiation points of tornadoes in Oklahoma from 2016–2021
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-show-tornado-initation-points
#| fig-cap: "Initiation points of tornadoes in Oklahoma from 2016–2021."
ggplot2::ggplot() +
  ggplot2::geom_sf(data = okcounty, fill = NA) +
  ggplot2::geom_sf(data = tpoint_16_21)
```
***
- Because each function maps a different dataset, the data argument must be provided in each `ggplot2::geom_sf()` function instead of in the `ggplot2::ggplot()` function.
- I use `ggplot2::theme_bw()` as the default theme (see the setup chunk of this chapter) to display the map over a white background while retaining the graticules.
::::
:::::
###### tornado paths
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-show-tornado-paths}
: Show tornado paths in Oklahoma from 2016–2021
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-show-tornado-paths
#| fig-cap: "Paths of tornadoes in Oklahoma from 2016-2021."
ggplot2::ggplot() +
  ggplot2::geom_sf(data = okcounty, fill = NA) +
  ggplot2::geom_sf(data = tpath_16_21,
                   color = "red",
                   linewidth = 1)
```
***
To make the tornado path lines easier to see in relation to the county boundaries, they are displayed in red and their width is increased (`linewidth = 1`) beyond the default line width of 0.5. (Since {**ggplot2**} 3.4.0 the width of lines is set with `linewidth`; `size` is reserved for points and text.)
::::
:::::
###### color aesthetic
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-tornado-with-color-aesthetic}
: Initiation points of tornadoes in Oklahoma from 2016-2021 with years represented by the color aesthetic
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-tornado-with-color-aesthetic
#| fig-cap: "Initiation points of tornadoes in Oklahoma from 2016-2021 with years represented by the color aesthetic."
ggplot2::ggplot() +
  ggplot2::geom_sf(data = tpoint_16_21,
                   ggplot2::aes(color = forcats::as_factor(yr))) + # (1)
  ggplot2::geom_sf(data = okcounty, fill = NA) +
  # ggplot2::scale_color_discrete(name = "Year") +                 # (2)
  ggokabeito::scale_color_okabe_ito(name = "Year") +               # (2)
  ggplot2::coord_sf(datum = NA) +                                  # (3)
  ggplot2::theme_void()                                            # (3)
```
***
To view the years of the tornadoes on the map, an aesthetic can be specified.
**Line Comment 1**: In the book the color argument is specified as `base::as.factor(yr)` so that the year is displayed as a discrete variable instead of a continuous variable. Instead of the base function I have used `forcats::as_factor(yr)`. For a numeric variable such as `yr` both functions should produce the same levels; the difference described in the help file concerns character vectors:
> Compared to base R, when x is a character, this function creates levels in the order in which they appear, which will be the same on every platform. (Base R sorts in the current locale which can vary from place to place.) (from the {**forcats**} help file).
**Line Comment 2**: The `ggplot2::scale_color_discrete()` function is used to set the legend name. But the default color scale is not appropriate for people with color-vision deficiency (`r glossary("CVD")`). I have therefore used `ggokabeito::scale_color_okabe_ito()`.
**Line Comment 3**: The book says that the `ggplot2::theme_void()` function removes the plot axes and labels and shows only the map. `ggplot2::coord_sf(datum = NA)` has a narrower effect: it suppresses the graticule and the coordinate labels, whereas `ggplot2::theme_void()` additionally removes the background and the border frame around the graphic. Strictly speaking, `ggplot2::theme_void()` alone would be enough here.
::::
:::::
###### facets
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-tornado-initiation-points-facets}
: Initiation points of tornadoes in Oklahoma from 2016-2021 as facet visualization
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-tornado-initiation-points-facets
#| fig-cap: "Initiation points of tornadoes in Oklahoma from 2016-2021 with years mapped as separate facets."
ggplot2::ggplot() +
  ggplot2::geom_sf(data = okcounty,
                   fill = NA,
                   color = "gray") +
  ggplot2::geom_sf(data = tpoint_16_21, size = 0.75) +
  ggplot2::facet_wrap(ggplot2::vars(yr), ncol = 2) +
  ggplot2::coord_sf(datum = NA) +
  ggplot2::theme_void()
```
***
Alternatively, `ggplot2::facet_wrap()` can be used to display the tornadoes for each year on a separate map. In comparison to the previous tab the sizes of the points are reduced from the default `size = 1.5` to `size = 0.75`, so that they are better suited for the smaller maps.
::::
:::::
:::
::::
:::::
::: {.callout-note #nte-too-much-horizontal-space}
With the exception of the facet graphics there is too much vertical space above and below the {**sf**} graphic. Is this a known problem? How can this space be reduced for {**sf**} graphics plotted with {**ggplot2**}?
:::
:::::{.my-solution}
:::{.my-solution-header}
Solution: Remove empty space in maps
:::
::::{.my-solution-container}
I found a solution after [posting the question in StackOverflow](https://stackoverflow.com/questions/79295599/remove-white-space-around-sf-ggplot2-choropleth-map): I need to set the figure size in the Quarto chunk options so that the figure has the right aspect ratio in the document. As far as I can see there are two options:
- Reducing the height of the figure from its standard height of 5 inches, for instance to three inches with `fig-height: 3` in the Quarto chunk options. See an example in @cnj-05-choropleth-filled-colors or @cnj-05-choropleth-mappying-symbols. (To see the chunk options together with the code I have used `echo: fenced` for these two chunks.)
- Changing the aspect ratio from 1 to a smaller value, for instance to 3/4 with `ggplot2::theme(aspect.ratio = 3/4)`. See an example in @cnj-annex-b-zoom-europe-map.
::::
:::::
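
A rough estimate of a suitable figure height can be derived from the bounding box that `sf::st_read()` reported for `okcounty`. The sketch below assumes a figure width of 7 inches and assumes that `ggplot2::coord_sf()` scales longitude/latitude data by the cosine of the mid-latitude; it ignores margins, legends and the small expansion around the data, so the result is only an approximation.

```r
# Bounding box of `okcounty` as printed by sf::st_read() in this chapter
xmin <- -103.0025; xmax <- -94.43151
ymin <- 33.62184;  ymax <- 37.00163

# One degree of longitude is shorter than one degree of latitude by a
# factor of cos(latitude); the mid-latitude of the map is used here.
mid_lat   <- (ymin + ymax) / 2
map_ratio <- (ymax - ymin) / (xmax - xmin) / cos(mid_lat * pi / 180)

fig_width  <- 7                       # assumed figure width in inches
fig_height <- fig_width * map_ratio   # height needed by the map panel itself

round(c(ratio = map_ratio, height = fig_height), 2)
```

The estimated panel height is well below the standard height of 5 inches, which is consistent with the empty space seen above and below the maps and with `fig-height: 3` being a reasonable choice.
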
## Overlaying Vector Datasets {#sec-05-overlaying-vector-datasets}
### A first spatial join
The number of tornado points in each county can be calculated using the `sf::st_join()` function to carry out a spatial join. A spatial join with {**sf**} is different from a join with {**dplyr**}: `sf::st_join()` links rows from the two tables based on their spatial locations instead of their attributes.
In this case the function compares the point coordinates in the `geometry` column of the `tpoint_16_21` dataset with the polygon coordinates in the `geometry` column of the `okcounty` dataset. Each row of `tpoint_16_21` is joined with the attributes of the `okcounty` polygon that contains the point.
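
To see the mechanics in isolation, here is a minimal sketch with toy data: two adjacent unit squares stand in for counties and three points stand in for tornado touchdowns. All object names and coordinates are made up for illustration, and no coordinate reference system is set.

```r
# Toy "counties": two unit squares side by side (hypothetical data)
unit_square <- function(x0) {
  sf::st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                            c(x0, 1), c(x0, 0))))
}
counties <- sf::st_sf(
  NAME = c("A", "B"),
  geometry = sf::st_sfc(unit_square(0), unit_square(1))
)

# Toy "tornado points": one inside square A, two inside square B
pts <- sf::st_sf(
  om = 1:3,
  geometry = sf::st_sfc(sf::st_point(c(0.5, 0.5)),
                        sf::st_point(c(1.5, 0.5)),
                        sf::st_point(c(1.2, 0.2)))
)

# Each point row receives the attributes of the polygon it falls into
joined <- sf::st_join(pts, counties)
joined$NAME

# Number of points per polygon
table(joined$NAME)
```

The result keeps the point geometries and has one row per point, just as `countypnt` has one row per tornado.
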
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-spatial-join}
: Overlaying vector datasets with a spatial join
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: spatial-join
countypnt <- sf::st_join(tpoint_16_21, okcounty)
dplyr::glimpse(countypnt)
```
::::
:::::
### Count tornados per county
Afterward, each row in `countypnt` contains additional columns from the `okcounty` dataset that correspond to the county within which the point coordinates of the tornado lie. The dataset contains one record for each tornado.
To compute the total number of tornadoes per county, `countypnt` must be grouped by the `GEOID` county code or by the county `NAME` (here by `GEOID` county code).
Before grouping and summarizing, `countypnt` is converted from an {**sf**} object to a normal data frame using `sf::st_drop_geometry()`. Otherwise `dplyr::summarize()` would also combine the point geometries of each group, which is not needed for a simple count.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-count-tornados-per-county}
: Count tornados per county
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: count-tornados
#| results: hold
glue::glue("#### show class before `sf::st_drop_geometry()` #####")
base::class(countypnt)
countypnt <- sf::st_drop_geometry(countypnt)
glue::glue("")
glue::glue("##### show class after `sf::st_drop_geometry()` ######")
base::class(countypnt)
countysum <- countypnt |>
  dplyr::group_by(GEOID) |>
  dplyr::summarize(tcnt = dplyr::n())
glue::glue("")
glue::glue("##### glimpse at the new summarized data frame` ######")
dplyr::glimpse(countysum)
```
::::
:::::
### Associate polygons with tornado counts
In the next step we join `okcounty` to `countysum` so that each polygon is associated with the appropriate tornado summary.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-join-polygons-with-tornado-counts}
: Associate each polygon with the appropriate tornado summary
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: join-polygons-with-tornado-counts
countymap <- okcounty |>
  dplyr::left_join(countysum, by = "GEOID") |>           # (1)
  dplyr::mutate(tcnt =
    base::ifelse(base::is.na(tcnt), 0, tcnt)) |>         # (2)
  dplyr::mutate(area = sf::st_area(okcounty),
                tdens = 10^6 * 10^3 * tcnt / area) |>    # (3)
  units::drop_units()                                    # (4)
dplyr::glimpse(countymap)
```
***
**Line comment 1**: Using `dplyr::left_join()` instead of `dplyr::inner_join()` ensures that all of the county polygons are retained in the output of the join. (`dplyr::inner_join()` only keeps observations from x that have a matching key in y, whereas `dplyr::left_join()` keeps all observations in x.)
**Line comment 2**: `countysum` contains one row for each county that had at least one tornado in 2016-2021. A county without tornadoes in these years is missing from `countysum` and therefore gets `NA` as its number of tornadoes after the join.
As a matter of fact a few counties had no tornadoes during 2016–2021, resulting in `NA` values in the joined table. In this case we know that `NA` means zero tornadoes, so we must replace the `NA` values with zeroes. I did this with the `dplyr::mutate()` function instead of `base::replace()`. This approach also does not need the `.` placeholder of the {**magrittr**} pipe (re-exported by {**dplyr**}), or its equivalent `_` for the native R pipe, to refer to the piped data. For details see @nte-chap03.
**Line comment 3**: The second `dplyr::mutate()` function computes the area of each county using `sf::st_area()` and then calculates the density of tornadoes per county. Density is initially in tornadoes per square meter but is converted to tornadoes per 1000 km^2 by multiplying by `10^6` (square meters per square kilometer) and by `10^3`.
**Line comment 4**: The `sf::st_area()` function returns a column with explicit measurement units, but these are removed using the `units::drop_units()` function for simplicity. For more information see the vignettes and help pages of the {**units**} package.
::::
:::::
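
The factor that converts tornadoes per square meter into tornadoes per 1000 km^2 can be checked by hand. The sketch below uses the area value that appears in the GDAL warning quoted later in this chapter and a hypothetical count of one tornado; the variable names are chosen for this illustration only.

```r
# Worked check of the unit conversion (tornado count is illustrative)
area_m2 <- 1890663260.747   # county area in square meters (from sf::st_area())
tcnt    <- 1                # hypothetical number of tornadoes in that county

m2_per_km2 <- 10^3 * 10^3   # 1 km = 10^3 m, hence 1 km^2 = 10^6 m^2

dens_per_m2       <- tcnt / area_m2
dens_per_km2      <- dens_per_m2 * m2_per_km2   # tornadoes per km^2
dens_per_1000_km2 <- dens_per_km2 * 10^3        # tornadoes per 1000 km^2

dens_per_1000_km2
```

A county of roughly 1900 km^2 with a single tornado thus has a density of about half a tornado per 1000 km^2, which is a plausible order of magnitude for checking the `tdens` column.
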
## Save spatial geodata files {#sec-05-save-geodata}
### ESRI format
The `sf::st_write()` function can be used to save sf objects to a variety of file formats. In many cases, the function can determine the output format from the output filename extension. The following code saves the county-level tornado summaries in ESRI shapefile format. The `append = FALSE` option overwrites the shapefile if it already exists.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-save-spatial-ESRI-format}
: Save spatial data files into ESRI format
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: save-spatial-ESRI-format
#| eval: false
sf::st_write(countymap,
             dsn = "data/Chapter5/oktornadosum.shp",
             append = FALSE)
```
::::
:::::
After a message describing what the function does
> Writing layer `oktornadosum' to data source
`data/Chapter5/oktornadosum.shp' using driver `ESRI Shapefile'
Writing 77 features with 10 fields and geometry type Polygon.
I got a warning message emitted by the GDAL library for every feature (77 rows):
> Warning: GDAL Message 1: Value 1890663260.74707699 of field area of feature 0 not successfully written. Possibly due to too larger number with respect to field width
It turned out that this is a [misleading warning](https://github.com/r-spatial/sf/issues/306) and that one should not use the old ESRI format but the newer and better Open Geospatial Consortium (OGC) GeoPackage format. See also [StackOverflow](https://stackoverflow.com/a/73242539/7322615) and the [answer from the {**sf**} developer](https://github.com/r-spatial/sf/issues/2368):
> The general recommendation is to not use shapefiles: the format is not an open standard, it has many limitations and modern formats are available. A good alternative is GeoPackage.
### GeoPackage format
GeoPackage is also mentioned as an alternative in the book. The data are stored in an SQLite database that may contain one or more layers. In this example, the `delete_dsn = TRUE` argument overwrites the entire GeoPackage. Leaving this argument at its default value of `FALSE` would add a new layer to an existing database.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-save-spatial-geodata-in-GeoPackage-format}
: Save spatial geodata in GeoPackage format
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: save-spatial-GeoPackage-format
sf::st_write(countymap,
             dsn = "data/Chapter5/oktornado.gpkg",
             layer = "countysum",
             delete_dsn = TRUE)
```
::::
:::::
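
The difference between adding a layer and overwriting the whole GeoPackage can be tried out on a throwaway file. The following sketch uses a temporary file and two made-up points instead of the chapter's data; the layer names are arbitrary.

```r
# Throwaway GeoPackage and toy data (not the chapter's files)
gpkg <- tempfile(fileext = ".gpkg")
pts <- sf::st_sf(
  id = 1:2,
  geometry = sf::st_sfc(sf::st_point(c(0, 0)),
                        sf::st_point(c(1, 1)),
                        crs = 4326)
)

# The first write creates the file with one layer
sf::st_write(pts, dsn = gpkg, layer = "first", quiet = TRUE)

# With delete_dsn left at its default FALSE a second layer is added
sf::st_write(pts, dsn = gpkg, layer = "second", quiet = TRUE)
layers_added <- sf::st_layers(gpkg)$name

# With delete_dsn = TRUE the whole file is replaced before writing
sf::st_write(pts, dsn = gpkg, layer = "third",
             delete_dsn = TRUE, quiet = TRUE)
layers_reset <- sf::st_layers(gpkg)$name

layers_added
layers_reset
```

`sf::st_layers()` lists the layers stored in a data source, and a specific one can be read back with the `layer` argument of `sf::st_read()`.
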
### GeoJSON format
Another commonly used open geospatial data format is GeoJSON. It is based on JavaScript Object Notation (`r glossary("JSON")`), a human-readable text format that stores data in ASCII files. The `layer_options` argument must be set to `"RFC7946 = YES"` to save the data in the newest GeoJSON specification.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-save-spatial-geodata-in-GeoJSON-format}
: Save spatial geodata in GeoJSON format
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: save-spatial-GeoJSON-format
sf::st_write(obj = countymap,
             dsn = "data/Chapter5/oktornado.geojson",
             layer_options = "RFC7946 = YES",
             delete_dsn = TRUE)
```
::::
:::::
Here again I had to add `delete_dsn = TRUE` (`append = FALSE` did not work for this format!). Otherwise I would get an error message that the dataset already exists.
## Choropleth Maps {#sec-05-choropleth-maps}
### Filling with colors (standard)
Another way to display the tornadoes is as a choropleth map, where summary statistics for each county are displayed as different colors. The county-level tornado density can be mapped as a choropleth using the `fill` aesthetic with `ggplot2::geom_sf()`. By default, the fill colors are based on a dark-to-light blue color ramp. The `ggplot2::theme_void()` function eliminates the axes and graticules and displays only the map on a white background.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-choropleth-filled-colors}
: Densities of tornadoes mapped as a choropleth.
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-choropleth-filled-colors
#| fig-cap: "Densities of tornadoes in Oklahoma counties from 2016-2021 mapped as a choropleth."
#| fig-height: 3
#| echo: fenced
ggplot2::ggplot(data = countymap) +
  ggplot2::geom_sf(ggplot2::aes(fill = tdens)) +
  ggplot2::theme_void() +
  ggplot2::coord_sf()
```
::::
:::::
### Mapping symbols
To map symbols, the county polygons must first be converted to points. The `sf::st_centroid()` function generates a point feature located at the centroid of each county. The `sf::st_geometry_type()` function returns the feature geometry type. Setting `by_geometry = FALSE` returns one geometry type for the entire dataset instead of one for every feature.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-convert-county-polygons-to-points}
: Convert county polygons to points
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: convert-county-polygons-to-points
#| results: hold
glue::glue("##### Return geometry type before converted to points #####")
sf::st_geometry_type(okcounty, by_geometry = FALSE)
############# Return the centroid of a geometry
okcntrd = sf::st_centroid(countymap)
glue::glue("")
glue::glue("##### Return geometry type after converted to points #####")
sf::st_geometry_type(okcntrd, by_geometry = FALSE)
```
::::
:::::
::: {.callout-note style="color: blue;" #nte-05-attributes-constant-warning}
###### How to get rid of the warning?
At the moment I do not know how to suppress the warning. Possible pointers to solve this problem are:
- **sf GitHub**: [suppress specific warning message](https://github.com/r-spatial/sf/issues/406)
- **Spatial Data Science**: [Chapter 5](https://r-spatial.org/book/05-Attributes.html)
> When, while manipulating geometries, attribute values are retained unmodified, support problems may arise. If we look into a simple case of replacing a county polygon with the centroid of that polygon on a dataset that has attributes, we see that R package sf issues a warning:
>
> *Warning: st_centroid assumes attributes are constant over geometries*
:::
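
A few approaches that are commonly suggested for this warning can be sketched as follows. This is a minimal example under stated assumptions, not a recommendation from the book: it uses the `nc` demo data bundled with {**sf**}, projected to EPSG:32119 as a stand-in for the county data, and the object names are hypothetical.

```r
# Stand-in polygons: North Carolina counties, transformed to a projected CRS
nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc <- sf::st_transform(nc, 32119)

# Option 1: declare that all attributes are constant over each geometry,
# which is the assumption the warning refers to
nc_const <- sf::st_set_agr(nc, "constant")
cntrd_agr <- sf::st_centroid(nc_const)

# Option 2: compute centroids of the geometry column only (no attributes
# are involved; the result is an sfc geometry column, not an sf data frame)
cntrd_geom <- sf::st_centroid(sf::st_geometry(nc))

# Option 3: silence the warning for this one call
cntrd_quiet <- suppressWarnings(sf::st_centroid(nc))

sf::st_geometry_type(cntrd_agr, by_geometry = FALSE)
```

Option 1 keeps the attributes and states the assumption explicitly, so it is only appropriate when the attributes really apply to the centroid as well as to the polygon. In a Quarto chunk, the option `#| warning: false` hides the message in the rendered output without changing the code.
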
The tornado counts can be mapped using the `okcntrd` dataset with the `size` aesthetic. One point is plotted for each county centroid, and the size of the point is proportional to the number of tornadoes in the county.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-choropleth-mappying-symbols}
: Choropleth map using graduated symbols
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-choropleth-mappying-symbols
#| fig-cap: "Numbers of tornadoes in Oklahoma counties from 2016-2021 mapped as graduated symbols."
#| fig-height: 3
#| echo: fenced
ggplot2::ggplot() +
ggplot2::geom_sf(data = okcntrd, ggplot2::aes(size = tcnt)) +
ggplot2::geom_sf(data = okcounty, fill = NA) +
ggplot2::theme_void()
```
::::
:::::
## Modifying Map Appearance {#sec-05-modify-map-appearance}
### {**RColorBrewer**}: Color palettes for choropleth mapping
The {**RColorBrewer**} package provides a selection of palettes designed for choropleth mapping (Harrower and Brewer 2003). The `RColorBrewer::display.brewer.all()` function generates a chart with examples of all the available palettes (@fig-rcolorbrewer-palettes).
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-rcolorbrewer-palettes}
: Show color palettes of the {**RColorBrewer**} package
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-rcolorbrewer-palettes
#| fig-cap: "Color palettes available in the RColorBrewer package."
#| fig-height: 9
RColorBrewer::display.brewer.all()
```
::::
:::::
There are three types of ColorBrewer palettes.
1. The top group in @fig-rcolorbrewer-palettes contains sequential palettes that are suitable for mapping ordered data along numerical scales (e.g., temperatures ranging from 0 to 30 degrees C) or ordinal categories (e.g., temperatures classified as cold, warm, and hot). These sequential palettes may include a single color or multiple colors, but have no clear breaks in the scale.
2. The middle group in @fig-rcolorbrewer-palettes contains qualitative palettes, which use color to distinguish between different categories without implying order.
3. The lower group in @fig-rcolorbrewer-palettes contains divergent palettes that emphasize a specific breakpoint in the data. Divergent palettes are often used to indicate values that are above or below the mean or to highlight values that are higher or lower than zero.
More details about these palettes, including recommendations for color schemes that are most effective for different types of computer displays and for accommodating color-blind viewers, are available at [http://colorbrewer2.org](http://colorbrewer2.org).
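
The three palette types can also be inspected programmatically. The sketch below uses the `brewer.pal.info` data frame that ships with {**RColorBrewer**}; the object names `pal_info` and `seq_cb` are only illustrative.

```r
# Metadata for all palettes: maximum number of colors, category
# ("seq", "qual" or "div") and a color-blind-friendly flag
pal_info <- RColorBrewer::brewer.pal.info
utils::head(pal_info)

# Number of palettes in each of the three categories
base::table(pal_info$category)

# Names of the sequential palettes flagged as color-blind friendly
seq_cb <- base::rownames(pal_info)[pal_info$category == "seq" &
                                     pal_info$colorblind]
seq_cb
```
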
### Specifying a color palette for continuous data
Additional {**ggplot2**} functions can be added to improve the appearance of the map.
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-specify-color-palette}
: Specifying a color palette for continuous data with {**RColorBrewer**}
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-specify-continuous-color-palette
#| fig-cap: "Densities of tornadoes in Oklahoma counties from 2016-2021 mapped as a choropleth with a custom palette."
ggplot2::ggplot(data = countymap) +
ggplot2::geom_sf(ggplot2::aes(fill = tdens * 10^3)) + # (1)
ggplot2::scale_fill_distiller( # (2)
name = base::expression("Tornadoes/1000 km"^2), # (3)
palette = "YlOrRd", # (4)
breaks = scales::extended_breaks(n = 6), # (5)
direction = 1) + # (6)
ggplot2::theme_void() + # (7)
ggplot2::theme(legend.position = "bottom") # (8)
```
***
- **Line comment 1**: In contrast to the code in the book, I had to multiply the column for the tornado densities (`tdens`) by 10^3. Otherwise I would get decimal numbers overlapping each other after moving the legend to the bottom.
- **Line comment 2**: The `ggplot2::scale_fill_distiller()` function allows the specification of a different color ramp.
- **Line comment 3**: The `base::expression()` function is used for specifying the name argument for `ggplot2::scale_fill_distiller()` and to add text with a superscript.
- **Line comment 4**: In this example we have used the “YlOrRd” palette from the {**RColorBrewer**} package. As the name suggests, it runs from yellow through orange to red.
- **Line comment 5**: The book uses the superseded `scales::pretty_breaks()` function instead of the newer `scales::breaks_pretty()` function. This standard R break algorithm is primarily useful for dates and times; for numeric scales the `scales::extended_breaks()` function does a slightly better job. `n = 6` is the desired number of breaks. You may get slightly more or fewer breaks than requested. (After trying it out, I learned that in this case the `n` parameter isn't necessary to get the same result.)
- **Line comment 6**: The default value is `direction = -1`, which produces scales running from dark to light colors. We want the reverse: light colors for few tornadoes and dark colors for many.
- **Line comment 7**: Note that “complete” themes like `ggplot2::theme_void()` will remove any settings made by a previous `ggplot2::theme()` function. Therefore, it is necessary to call `ggplot2::theme_void()` before `ggplot2::theme()` for specific theme settings to take effect.
- **Line comment 8**: We moved the legend to the bottom of the map to better accommodate the longer legend title.
::::
:::::
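
To see what the break functions mentioned in line comment 5 return, they can be called outside of a plot. This is a small sketch with a made-up value range (`dens_range` is hypothetical, not taken from the tornado data); `scales::breaks_extended()` is the current name of the function that is also available as `scales::extended_breaks()`.

```r
# Hypothetical range of densities, for illustration only
dens_range <- c(0, 23.7)

# Break functions are function factories: they return a function
# that computes the breaks for a given data range
brk_extended <- scales::breaks_extended(n = 6)
brk_pretty   <- scales::breaks_pretty(n = 6)

# Compare the breaks proposed by the two algorithms
brk_extended(dens_range)
brk_pretty(dens_range)
```

Because `n` is only the desired number of breaks, the length of either result may differ slightly from 6.
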
### Specifying a color palette for discrete data
The {**RColorBrewer**} palettes each contain a finite number of colors that are intended to be associated with categories in a choropleth map. Note that the `ggplot2::scale_fill_distiller()` function used to generate the color scale for the map in @fig-specify-continuous-color-palette operates a bit differently. This function takes a ColorBrewer palette and converts it to a continuous color ramp.
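
The difference between a finite palette and a continuous ramp can be illustrated with base R tools. This is a sketch of the idea only and does not reproduce the internals of `ggplot2::scale_fill_distiller()`; the object names are hypothetical.

```r
# A discrete ColorBrewer palette: exactly four hex colors
pal4 <- RColorBrewer::brewer.pal(n = 4, name = "YlOrRd")
pal4

# Interpolating between these colors yields a ramp function
# that can return any number of intermediate colors
ramp <- grDevices::colorRampPalette(pal4)
ramp(10)
```
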
The next map example will show how to define categories and map each one as a distinctive color. To view the colors for a given number of categories and a specific palette, the `RColorBrewer::display.brewer.pal()` function is used with the number of categories as the first argument and the palette name as the second argument (@fig-colorbrewer-discrete-palette).
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-annex-b-colorbrewer-discrete-palette}
: ColorBrewer discrete color palette
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: fig-colorbrewer-discrete-palette
#| fig-cap: "The ColorBrewer 'YlOrRd' (yellow to red) color palette with four categories."
RColorBrewer::display.brewer.pal(4, "YlOrRd")
```
::::
:::::
Rather than using continuous scales for color and size, it is often recommended to aggregate the data into a small number of classes (typically 3-6). Using discrete classes makes it easier to associate each color or symbol in the map with a specific range of values.
To accomplish this step, we need to add a couple of new classified variables using `dplyr::mutate()`. The `base::cut()` function is used to split the continuous variables based on user-specified breaks. The incidence variable is split based on quantiles (i.e., percentiles) defined in the `qbrks` object. The population breaks are manually specified.
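
The classification step can be sketched on toy data before applying it to the county dataset. The vector `dens` below is simulated and purely hypothetical; only `base::cut()`, `stats::quantile()` and the name `qbrks` come from the text above.

```r
# Simulated continuous variable standing in for a county-level measure
set.seed(42)
dens <- stats::runif(77, min = 0, max = 25)

# Quartile breaks: minimum, 25th, 50th and 75th percentiles, maximum
qbrks <- stats::quantile(dens, probs = seq(0, 1, by = 0.25))

# Split into four classes; include.lowest = TRUE keeps the minimum
# value in the first class instead of turning it into NA
dens_cls <- base::cut(dens, breaks = qbrks, include.lowest = TRUE)

# Each quartile class holds roughly the same number of observations
base::table(dens_cls)
```
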
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-05-generate-discrete-classes}