From 10091738ef460484634e895873c57811f3df8206 Mon Sep 17 00:00:00 2001 From: Sean Raleigh Date: Tue, 21 Jan 2025 06:41:32 +0000 Subject: [PATCH] Minor fixes --- 06-correlation-web.qmd | 26 +++++++++-------------- chapter_downloads/06-correlation.qmd | 26 +++++++++-------------- docs/06-correlation-web.html | 21 ++++++++---------- docs/chapter_downloads/06-correlation.qmd | 26 +++++++++-------------- docs/search.json | 10 ++++----- 5 files changed, 44 insertions(+), 65 deletions(-) diff --git a/06-correlation-web.qmd b/06-correlation-web.qmd index a3a103c..4ba798c 100644 --- a/06-correlation-web.qmd +++ b/06-correlation-web.qmd @@ -132,10 +132,9 @@ Please write up your answer here. ::: - ***** -We are interested in the association between `race` and `involact`. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the number of FAIR plan policies obtained in that ZIP code. +We are interested in the association between `race` and `involact`. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the rate of FAIR plan policies obtained in that ZIP code. ##### Exercise 5(a) {-} @@ -176,7 +175,7 @@ Create the same kind of graph as above, but for `involact`. (Again, go back and ::: {.answer} ```{r} -# Add code here to create a plot of race +# Add code here to create a plot of involact ``` ::: @@ -263,7 +262,7 @@ In between 0 and 1 (or -1), we often use words like weak, moderately weak, moder A correlation is positive when low values of one variable are associated with low values of the other value. Similarly, high values of one variable are associated with high values of the other. For example, exercise is positively correlated with burning calories. Low exercise levels will burn a few calories; high exercise levels burn more calories, on average. 
-A correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer calories, on average. +A correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer cavities, on average. ## Conditions for correlation @@ -320,12 +319,10 @@ Create a scatterplot of `income` against `race`. (Put `income` on the y-axis and ##### Exercise 8(b) {-} -Check the three conditions for the relationship between `income` and `race`. Which condition is pretty seriously violated here? +Check the three conditions for the relationship between `income` and `race`. Which condition(s) are seriously violated here? ::: {.answer} -Please write up your answer here. - 1. 2. 3. @@ -346,7 +343,7 @@ Create a scatterplot of `theft` against `fire`. (Put `theft` on the y-axis and ` ##### Exercise 9(b) {-} -Check the three conditions for the relationship between `theft` and `fire`. Which condition is pretty seriously violated here? +Check the three conditions for the relationship between `theft` and `fire`. Which condition(s) are seriously violated here? ::: {.answer} @@ -354,8 +351,6 @@ Check the three conditions for the relationship between `theft` and `fire`. Whic 2. 3. -Please write up your answer here. - ::: ##### Exercise 9(c) {-} @@ -387,13 +382,13 @@ The lesson learned here is that you should never try to interpret a correlation When two variables are correlated---indeed, associated in any way, not just in a linear relationship---that means that there is a relationship between them. However, that does not mean that one variable *causes* the other variable. 
-For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, though racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance. +For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, through racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance. In the Chicago example, there is still likely a causal connection between one variable (`race`) and the other (`involact`), but it is indirect. In other cases, there is no causal connection at all. Here are a few of my favorite examples. ##### Exercise 10 {-} -Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Does drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?) +Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Do drowning deaths cause ice cream sales to rise? 
(Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?) See if you can figure out the real reason why ice cream sales are positively correlated with drowning deaths. @@ -513,9 +508,7 @@ ggplot(bdims, aes(y = sho_gi, x = che_gi)) + Is there a possible lurking variable here, though? You may wonder about `sex`. (In this data set, the `sex` variable is presumed to be biological sex assigned at birth.) -Before we go any further, go back to the help file and the `glimpse` output above and note that `sex` is coded as an integer (a whole number). - -We'll use the `mutate` and `as_factor` commands---illustrated in Chapters 3 and 5---to make a new factor variable. +Before we go any further, go back to the help file and the `glimpse` output above and note that `sex` is coded as an integer (a whole number). We'll use the `mutate` and `as_factor` commands---illustrated in Chapters 3 and 5---to make a new factor variable. ```{r} bdims <- bdims |> @@ -568,7 +561,7 @@ Please write up your answer here. ***** -In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two exercises? +In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two sets of exercises? ##### Exercise 15(a) {-} @@ -701,6 +694,7 @@ There is not much correlation between bill depth and bill length, but if anythin cor(penguins$bill_depth_mm, penguins$bill_length_mm, use = "complete.obs") ``` + Now split by species: ```{r} diff --git a/chapter_downloads/06-correlation.qmd b/chapter_downloads/06-correlation.qmd index 02bf0a4..62497e4 100644 --- a/chapter_downloads/06-correlation.qmd +++ b/chapter_downloads/06-correlation.qmd @@ -142,10 +142,9 @@ Please write up your answer here. ::: - ***** -We are interested in the association between `race` and `involact`. 
If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the number of FAIR plan policies obtained in that ZIP code. +We are interested in the association between `race` and `involact`. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the rate of FAIR plan policies obtained in that ZIP code. ##### Exercise 5(a) @@ -186,7 +185,7 @@ Create the same kind of graph as above, but for `involact`. (Again, go back and ::: {.answer} ```{r} -# Add code here to create a plot of race +# Add code here to create a plot of involact ``` ::: @@ -273,7 +272,7 @@ In between 0 and 1 (or -1), we often use words like weak, moderately weak, moder A correlation is positive when low values of one variable are associated with low values of the other value. Similarly, high values of one variable are associated with high values of the other. For example, exercise is positively correlated with burning calories. Low exercise levels will burn a few calories; high exercise levels burn more calories, on average. -A correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer calories, on average. +A correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer cavities, on average. ## Conditions for correlation @@ -330,12 +329,10 @@ Create a scatterplot of `income` against `race`. 
(Put `income` on the y-axis and ##### Exercise 8(b) -Check the three conditions for the relationship between `income` and `race`. Which condition is pretty seriously violated here? +Check the three conditions for the relationship between `income` and `race`. Which condition(s) are seriously violated here? ::: {.answer} -Please write up your answer here. - 1. 2. 3. @@ -356,7 +353,7 @@ Create a scatterplot of `theft` against `fire`. (Put `theft` on the y-axis and ` ##### Exercise 9(b) -Check the three conditions for the relationship between `theft` and `fire`. Which condition is pretty seriously violated here? +Check the three conditions for the relationship between `theft` and `fire`. Which condition(s) are seriously violated here? ::: {.answer} @@ -364,8 +361,6 @@ Check the three conditions for the relationship between `theft` and `fire`. Whic 2. 3. -Please write up your answer here. - ::: ##### Exercise 9(c) @@ -397,13 +392,13 @@ The lesson learned here is that you should never try to interpret a correlation When two variables are correlated---indeed, associated in any way, not just in a linear relationship---that means that there is a relationship between them. However, that does not mean that one variable *causes* the other variable. -For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, though racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance. +For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. 
However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, through racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance. In the Chicago example, there is still likely a causal connection between one variable (`race`) and the other (`involact`), but it is indirect. In other cases, there is no causal connection at all. Here are a few of my favorite examples. ##### Exercise 10 -Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Does drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?) +Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Do drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?) See if you can figure out the real reason why ice cream sales are positively correlated with drowning deaths. @@ -523,9 +518,7 @@ ggplot(bdims, aes(y = sho_gi, x = che_gi)) + Is there a possible lurking variable here, though? You may wonder about `sex`. (In this data set, the `sex` variable is presumed to be biological sex assigned at birth.) -Before we go any further, go back to the help file and the `glimpse` output above and note that `sex` is coded as an integer (a whole number). - -We'll use the `mutate` and `as_factor` commands---illustrated in Chapters 3 and 5---to make a new factor variable. 
+Before we go any further, go back to the help file and the `glimpse` output above and note that `sex` is coded as an integer (a whole number). We'll use the `mutate` and `as_factor` commands---illustrated in Chapters 3 and 5---to make a new factor variable. ```{r} bdims <- bdims |> @@ -578,7 +571,7 @@ Please write up your answer here. ***** -In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two exercises? +In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two sets of exercises? ##### Exercise 15(a) @@ -711,6 +704,7 @@ There is not much correlation between bill depth and bill length, but if anythin cor(penguins$bill_depth_mm, penguins$bill_length_mm, use = "complete.obs") ``` + Now split by species: ```{r} diff --git a/docs/06-correlation-web.html b/docs/06-correlation-web.html index 24d40b6..f0c2ebd 100644 --- a/docs/06-correlation-web.html +++ b/docs/06-correlation-web.html @@ -411,7 +411,7 @@
Exercise 4(b)

Please write up your answer here.


-

We are interested in the association between race and involact. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the number of FAIR plan policies obtained in that ZIP code.

+

We are interested in the association between race and involact. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the rate of FAIR plan policies obtained in that ZIP code.

Exercise 5(a)
@@ -441,7 +441,7 @@
Exercise 5(d)

Create the same kind of graph as above, but for involact. (Again, go back and set the binwidth and boundary to sensible values.)

-
# Add code here to create a plot of race
+
# Add code here to create a plot of involact
@@ -507,7 +507,7 @@
Exercise 6

In between 0 and 1 (or -1), we often use words like weak, moderately weak, moderate, and moderately strong. There are no exact cutoffs for when such words apply. You must learn from experience how to judge scatterplots and r values to make such determinations.

A correlation is positive when low values of one variable are associated with low values of the other value. Similarly, high values of one variable are associated with high values of the other. For example, exercise is positively correlated with burning calories. Low exercise levels will burn a few calories; high exercise levels burn more calories, on average.

-

A correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer calories, on average.

+

A correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer cavities, on average.

@@ -558,9 +558,8 @@
Exercise 8(a)
Exercise 8(b)
-

Check the three conditions for the relationship between income and race. Which condition is pretty seriously violated here?

+

Check the three conditions for the relationship between income and race. Which condition(s) are seriously violated here?

-

Please write up your answer here.

  1. @@ -579,14 +578,13 @@
    Exercise 9(a)
Exercise 9(b)
-

Check the three conditions for the relationship between theft and fire. Which condition is pretty seriously violated here?

+

Check the three conditions for the relationship between theft and fire. Which condition(s) are seriously violated here?

-

Please write up your answer here.

@@ -610,11 +608,11 @@
Exercise 9(d)

6.6 Correlation is not causation

When two variables are correlated—indeed, associated in any way, not just in a linear relationship—that means that there is a relationship between them. However, that does not mean that one variable causes the other variable.

-

For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, though racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance.

+

For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, through racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance.

In the Chicago example, there is still likely a causal connection between one variable (race) and the other (involact), but it is indirect. In other cases, there is no causal connection at all. Here are a few of my favorite examples.

Exercise 10
-

Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Does drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?)

+

Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Do drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?)

See if you can figure out the real reason why ice cream sales are positively correlated with drowning deaths.

Please write up your answer here.

@@ -732,8 +730,7 @@

bdims <- bdims |>
   mutate(sex_fct = as_factor(sex))
@@ -847,7 +844,7 @@ 
Exercise 14

Please write up your answer here.


-

In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two exercises?

+

In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two sets of exercises?

Exercise 15(a)
diff --git a/docs/chapter_downloads/06-correlation.qmd index 02bf0a4..62497e4 100644 --- a/docs/chapter_downloads/06-correlation.qmd +++ b/docs/chapter_downloads/06-correlation.qmd @@ -142,10 +142,9 @@ Please write up your answer here. ::: - ***** -We are interested in the association between `race` and `involact`. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the number of FAIR plan policies obtained in that ZIP code. +We are interested in the association between `race` and `involact`. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the rate of FAIR plan policies obtained in that ZIP code. ##### Exercise 5(a) @@ -186,7 +185,7 @@ Create the same kind of graph as above, but for `involact`. (Again, go back and ::: {.answer} ```{r} -# Add code here to create a plot of race +# Add code here to create a plot of involact ``` ::: @@ -273,7 +272,7 @@ In between 0 and 1 (or -1), we often use words like weak, moderately weak, moder A correlation is positive when low values of one variable are associated with low values of the other value. Similarly, high values of one variable are associated with high values of the other. For example, exercise is positively correlated with burning calories. Low exercise levels will burn a few calories; high exercise levels burn more calories, on average. -A correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer calories, on average. 
+A correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer cavities, on average. ## Conditions for correlation @@ -330,12 +329,10 @@ Create a scatterplot of `income` against `race`. (Put `income` on the y-axis and ##### Exercise 8(b) -Check the three conditions for the relationship between `income` and `race`. Which condition is pretty seriously violated here? +Check the three conditions for the relationship between `income` and `race`. Which condition(s) are seriously violated here? ::: {.answer} -Please write up your answer here. - 1. 2. 3. @@ -356,7 +353,7 @@ Create a scatterplot of `theft` against `fire`. (Put `theft` on the y-axis and ` ##### Exercise 9(b) -Check the three conditions for the relationship between `theft` and `fire`. Which condition is pretty seriously violated here? +Check the three conditions for the relationship between `theft` and `fire`. Which condition(s) are seriously violated here? ::: {.answer} @@ -364,8 +361,6 @@ Check the three conditions for the relationship between `theft` and `fire`. Whic 2. 3. -Please write up your answer here. - ::: ##### Exercise 9(c) @@ -397,13 +392,13 @@ The lesson learned here is that you should never try to interpret a correlation When two variables are correlated---indeed, associated in any way, not just in a linear relationship---that means that there is a relationship between them. However, that does not mean that one variable *causes* the other variable. -For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. 
In this case, the racial composition of certain neighborhoods, though racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance. +For example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, through racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance. In the Chicago example, there is still likely a causal connection between one variable (`race`) and the other (`involact`), but it is indirect. In other cases, there is no causal connection at all. Here are a few of my favorite examples. ##### Exercise 10 -Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Does drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?) +Ice cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Do drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?) See if you can figure out the real reason why ice cream sales are positively correlated with drowning deaths. @@ -523,9 +518,7 @@ ggplot(bdims, aes(y = sho_gi, x = che_gi)) + Is there a possible lurking variable here, though? 
You may wonder about `sex`. (In this data set, the `sex` variable is presumed to be biological sex assigned at birth.) -Before we go any further, go back to the help file and the `glimpse` output above and note that `sex` is coded as an integer (a whole number). - -We'll use the `mutate` and `as_factor` commands---illustrated in Chapters 3 and 5---to make a new factor variable. +Before we go any further, go back to the help file and the `glimpse` output above and note that `sex` is coded as an integer (a whole number). We'll use the `mutate` and `as_factor` commands---illustrated in Chapters 3 and 5---to make a new factor variable. ```{r} bdims <- bdims |> @@ -578,7 +571,7 @@ Please write up your answer here. ***** -In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two exercises? +In the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. What about the examples in these next two sets of exercises? ##### Exercise 15(a) @@ -711,6 +704,7 @@ There is not much correlation between bill depth and bill length, but if anythin cor(penguins$bill_depth_mm, penguins$bill_length_mm, use = "complete.obs") ``` + Now split by species: ```{r} diff --git a/docs/search.json b/docs/search.json index 9c5b5da..046648e 100644 --- a/docs/search.json +++ b/docs/search.json @@ -724,7 +724,7 @@ "href": "06-correlation-web.html#redlining-in-chicago", "title": "6  Correlation", "section": "6.2 Redlining in Chicago", - "text": "6.2 Redlining in Chicago\nThe data set we will use throughout this chapter is from Chicago in the 1970s studying the practice of “redlining”.\n\nExercise 1\nDo an internet search for “redlining”.\nConsult at least two or three sources. 
Then, in your own words (not copied and pasted from any of the websites you consulted), explain what “redlining” means.\n\nPlease write up your answer here.\n\n\nThe chredlin data set appears in the faraway package accompanying a book by Julian Faraway (Practical Regression and Anova using R, 2002.) Faraway explains:\n\n“In a study of insurance availability in Chicago, the U.S. Commission on Civil Rights attempted to examine charges by several community organizations that insurance companies were redlining their neighborhoods, i.e. canceling policies or refusing to insure or renew. First the Illinois Department of Insurance provided the number of cancellations, non-renewals, new policies, and renewals of homeowners and residential fire insurance policies by ZIP code for the months of December 1977 through February 1978. The companies that provided this information account for more than 70% of the homeowners insurance policies written in the City of Chicago. The department also supplied the number of FAIR plan policies written and renewed in Chicago by zip code for the months of December 1977 through May 1978. Since most FAIR plan policyholders secure such coverage only after they have been rejected by the voluntary market, rather than as a result of a preference for that type of insurance, the distribution of FAIR plan policies is another measure of insurance availability in the voluntary market.”\n\nIn other words, the degree to which residents obtained FAIR policies can be seen as an indirect measure of redlining. This participation in an “involuntary” market is thought to be largely driven by rejection of coverage under more traditional insurance plans.\n\n\n6.2.1 Exploratory data analysis\nBefore we learn about correlation, let’s get to know our data a little better.\nType ?chredlin at the Console to read the help file. 
While it’s not very informative about how the data was collected, it does have crucial information about the way the data is structured.\nHere is the data set:\n\nchredlin\n\n\n \n\n\n\n\nExercise 2\nWhat do each of the rows of this data set represent? You’ll need to refer to the help file. (They are not individual people.)\n\nPlease write up your answer here.\n\n\n\nExercise 3\nThe race variable is numeric. Why? What do these numbers represent? (Again, refer to the help file.)\n\nPlease write up your answer here.\n\n\nThe glimpse command gives a concise overview of all the variables present.\n\nglimpse(chredlin)\n\nRows: 47\nColumns: 7\n$ race <dbl> 10.0, 22.2, 19.6, 17.3, 24.5, 54.0, 4.9, 7.1, 5.3, 21.5, 43.1…\n$ fire <dbl> 6.2, 9.5, 10.5, 7.7, 8.6, 34.1, 11.0, 6.9, 7.3, 15.1, 29.1, 2…\n$ theft <dbl> 29, 44, 36, 37, 53, 68, 75, 18, 31, 25, 34, 14, 11, 11, 22, 1…\n$ age <dbl> 60.4, 76.5, 73.5, 66.9, 81.4, 52.6, 42.6, 78.5, 90.1, 89.8, 8…\n$ involact <dbl> 0.0, 0.1, 1.2, 0.5, 0.7, 0.3, 0.0, 0.0, 0.4, 1.1, 1.9, 0.0, 0…\n$ income <dbl> 11.744, 9.323, 9.948, 10.656, 9.730, 8.231, 21.480, 11.104, 1…\n$ side <fct> n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n…\n\n\n\n\nExercise 4(a)\nWhich variable listed above represents participation in the FAIR plan? How is it measured? (Again, refer to the help file.)\n\nPlease write up your answer here.\n\n\n\nExercise 4(b)\nWhy is it important to analyze the number of plans per 100 housing units as opposed to the total number of plans across each ZIP code? (Hint: what happens if some ZIP codes are larger than others?)\n\nPlease write up your answer here.\n\n\nWe are interested in the association between race and involact. 
If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the number of FAIR plan policies obtained in that ZIP code.\n\n\nExercise 5(a)\nSince race is a numerical variable, what type of graph or chart is appropriate for visualizing it? (You may need to refer back to the “Numerical data” chapter.)\n\nPlease write up your answer here.\n\n\n\nExercise 5(b)\nUsing ggplot code, create the type of graph you identified above. After creating the initial plot, be sure to go back and set the binwidth and boundary to sensible values. (Refer back to the “Numerical data” chapter for sample code if you’ve forgotten how to make such a graph. If you were unsure about part (a), the instructions about binwidth and boundary should be a pretty big hint.)\n\n\n# Add code here to create a plot of race\n\n\n\n\nExercise 5(c)\nDescribe the shape of the race variable using the three key shape descriptors (modes, symmetry, and outliers).\n\nPlease write up your answer here.\n\n\n\nExercise 5(d)\nCreate the same kind of graph as above, but for involact. (Again, go back and set the binwidth and boundary to sensible values.)\n\n\n# Add code here to create a plot of race\n\n\n\n\nExercise 5(e)\nDescribe the shape of the involact variable using the three key shape descriptors (modes, symmetry, and outliers).\n\nPlease write up your answer here.\n\n\n\nExercise 5(f)\nSince both race and involact are numerical variables, what type of graph or chart is appropriate for visualizing the relationship between them?\n\nPlease write up your answer here.\n\n\n\nExercise 5(g)\nFor our research question, is race functioning as a predictor variable or as the response variable? What about involact? Why? 
Explain why it makes more sense to think of one of them as the predictor and the other as the response.\n\nPlease write up your answer here.\n\n\n\nExercise 5(h)\nUsing ggplot code, create the type of graph you identified above. Be sure to put involact on the y-axis and race` on the x-axis. (Again, that’s a hint in case you were confused in part (g).)\n\n\n# Add code here to create a plot of involact against race", + "text": "6.2 Redlining in Chicago\nThe data set we will use throughout this chapter is from Chicago in the 1970s studying the practice of “redlining”.\n\nExercise 1\nDo an internet search for “redlining”.\nConsult at least two or three sources. Then, in your own words (not copied and pasted from any of the websites you consulted), explain what “redlining” means.\n\nPlease write up your answer here.\n\n\nThe chredlin data set appears in the faraway package accompanying a book by Julian Faraway (Practical Regression and Anova using R, 2002.) Faraway explains:\n\n“In a study of insurance availability in Chicago, the U.S. Commission on Civil Rights attempted to examine charges by several community organizations that insurance companies were redlining their neighborhoods, i.e. canceling policies or refusing to insure or renew. First the Illinois Department of Insurance provided the number of cancellations, non-renewals, new policies, and renewals of homeowners and residential fire insurance policies by ZIP code for the months of December 1977 through February 1978. The companies that provided this information account for more than 70% of the homeowners insurance policies written in the City of Chicago. The department also supplied the number of FAIR plan policies written and renewed in Chicago by zip code for the months of December 1977 through May 1978. 
Since most FAIR plan policyholders secure such coverage only after they have been rejected by the voluntary market, rather than as a result of a preference for that type of insurance, the distribution of FAIR plan policies is another measure of insurance availability in the voluntary market.”\n\nIn other words, the degree to which residents obtained FAIR policies can be seen as an indirect measure of redlining. This participation in an “involuntary” market is thought to be largely driven by rejection of coverage under more traditional insurance plans.\n\n\n6.2.1 Exploratory data analysis\nBefore we learn about correlation, let’s get to know our data a little better.\nType ?chredlin at the Console to read the help file. While it’s not very informative about how the data was collected, it does have crucial information about the way the data is structured.\nHere is the data set:\n\nchredlin\n\n\n \n\n\n\n\nExercise 2\nWhat do each of the rows of this data set represent? You’ll need to refer to the help file. (They are not individual people.)\n\nPlease write up your answer here.\n\n\n\nExercise 3\nThe race variable is numeric. Why? What do these numbers represent? 
(Again, refer to the help file.)\n\nPlease write up your answer here.\n\n\nThe glimpse command gives a concise overview of all the variables present.\n\nglimpse(chredlin)\n\nRows: 47\nColumns: 7\n$ race <dbl> 10.0, 22.2, 19.6, 17.3, 24.5, 54.0, 4.9, 7.1, 5.3, 21.5, 43.1…\n$ fire <dbl> 6.2, 9.5, 10.5, 7.7, 8.6, 34.1, 11.0, 6.9, 7.3, 15.1, 29.1, 2…\n$ theft <dbl> 29, 44, 36, 37, 53, 68, 75, 18, 31, 25, 34, 14, 11, 11, 22, 1…\n$ age <dbl> 60.4, 76.5, 73.5, 66.9, 81.4, 52.6, 42.6, 78.5, 90.1, 89.8, 8…\n$ involact <dbl> 0.0, 0.1, 1.2, 0.5, 0.7, 0.3, 0.0, 0.0, 0.4, 1.1, 1.9, 0.0, 0…\n$ income <dbl> 11.744, 9.323, 9.948, 10.656, 9.730, 8.231, 21.480, 11.104, 1…\n$ side <fct> n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n…\n\n\n\n\nExercise 4(a)\nWhich variable listed above represents participation in the FAIR plan? How is it measured? (Again, refer to the help file.)\n\nPlease write up your answer here.\n\n\n\nExercise 4(b)\nWhy is it important to analyze the number of plans per 100 housing units as opposed to the total number of plans across each ZIP code? (Hint: what happens if some ZIP codes are larger than others?)\n\nPlease write up your answer here.\n\n\nWe are interested in the association between race and involact. If redlining plays a role in driving people toward FAIR plan policies, we would expect there to be a relationship between the racial composition of a ZIP code and the rate of FAIR plan policies obtained in that ZIP code.\n\n\nExercise 5(a)\nSince race is a numerical variable, what type of graph or chart is appropriate for visualizing it? (You may need to refer back to the “Numerical data” chapter.)\n\nPlease write up your answer here.\n\n\n\nExercise 5(b)\nUsing ggplot code, create the type of graph you identified above. After creating the initial plot, be sure to go back and set the binwidth and boundary to sensible values. (Refer back to the “Numerical data” chapter for sample code if you’ve forgotten how to make such a graph. 
If you were unsure about part (a), the instructions about binwidth and boundary should be a pretty big hint.)\n\n\n# Add code here to create a plot of race\n\n\n\n\nExercise 5(c)\nDescribe the shape of the race variable using the three key shape descriptors (modes, symmetry, and outliers).\n\nPlease write up your answer here.\n\n\n\nExercise 5(d)\nCreate the same kind of graph as above, but for involact. (Again, go back and set the binwidth and boundary to sensible values.)\n\n\n# Add code here to create a plot of involact\n\n\n\n\nExercise 5(e)\nDescribe the shape of the involact variable using the three key shape descriptors (modes, symmetry, and outliers).\n\nPlease write up your answer here.\n\n\n\nExercise 5(f)\nSince both race and involact are numerical variables, what type of graph or chart is appropriate for visualizing the relationship between them?\n\nPlease write up your answer here.\n\n\n\nExercise 5(g)\nFor our research question, is race functioning as a predictor variable or as the response variable? What about involact? Why? Explain why it makes more sense to think of one of them as the predictor and the other as the response.\n\nPlease write up your answer here.\n\n\n\nExercise 5(h)\nUsing ggplot code, create the type of graph you identified above. Be sure to put involact on the y-axis and race on the x-axis. (Again, that’s a hint in case you were confused in part (g).)\n\n\n# Add code here to create a plot of involact against race", "crumbs": [ "6  Correlation" ] @@ -734,7 +734,7 @@ "href": "06-correlation-web.html#correlation", "title": "6  Correlation", "section": "6.3 Correlation", -    "text": "6.3 Correlation\nThe word correlation describes a linear relationship between two numerical variables. As long as certain conditions are met, we can calculate a statistic called the correlation coefficient, often denoted with a lowercase r.\nThere are several different ways to compute a statistic that measures correlation. 
The most common way, and the way we will learn in this chapter, is often attributed to an English mathematician named Karl Pearson. According to his Wikipedia page,\n\n“Pearson was also a proponent of social Darwinism, eugenics and scientific racism.”\n\n\nExercise 6\nDo an internet search for each of the following terms:\n\nSocial Darwinism\nEugenics\nScientific racism\n\nConsult at least two or three sources for each term. Then, in your own words (not copied and pasted from any of the websites you consulted), explain each of these terms.\n\nPlease write up your answer here.\n\n\nWhile Pearson is often credited with its discovery, the so-called “Pearson correlation coefficient” was first developed by a French scientist, Auguste Bravais. Due to the misattribution of discovery, along with the desire to disassociate the useful tool of correlation from its problematic applications to racism and eugenics, we will just refer to it as the correlation coefficient (without a name attached).\nThe correlation coefficient, r, has some important properties.\n\nThe correlation coefficient is a number between -1 and 1.\nA value close to 0 indicates little or no correlation.\nA value close to 1 indicates strong positive correlation.\nA value close to -1 indicates strong negative correlation.\n\nIn between 0 and 1 (or -1), we often use words like weak, moderately weak, moderate, and moderately strong. There are no exact cutoffs for when such words apply. You must learn from experience how to judge scatterplots and r values to make such determinations.\nA correlation is positive when low values of one variable are associated with low values of the other value. Similarly, high values of one variable are associated with high values of the other. For example, exercise is positively correlated with burning calories. 
Low exercise levels will burn a few calories; high exercise levels burn more calories, on average.\nA correlation is negative when low values of one variable are associated with high values of the other value, and vice versa. For example, tooth brushing is negatively correlated with cavities. Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer calories, on average.", + "text": "6.3 Correlation\nThe word correlation describes a linear relationship between two numerical variables. As long as certain conditions are met, we can calculate a statistic called the correlation coefficient, often denoted with a lowercase r.\nThere are several different ways to compute a statistic that measures correlation. The most common way, and the way we will learn in this chapter, is often attributed to an English mathematician named Karl Pearson. According to his Wikipedia page,\n\n“Pearson was also a proponent of social Darwinism, eugenics and scientific racism.”\n\n\nExercise 6\nDo an internet search for each of the following terms:\n\nSocial Darwinism\nEugenics\nScientific racism\n\nConsult at least two or three sources for each term. Then, in your own words (not copied and pasted from any of the websites you consulted), explain each of these terms.\n\nPlease write up your answer here.\n\n\nWhile Pearson is often credited with its discovery, the so-called “Pearson correlation coefficient” was first developed by a French scientist, Auguste Bravais. 
Due to the misattribution of discovery, along with the desire to disassociate the useful tool of correlation from its problematic applications to racism and eugenics, we will just refer to it as the correlation coefficient (without a name attached).\nThe correlation coefficient, r, has some important properties.\n\nThe correlation coefficient is a number between -1 and 1.\nA value close to 0 indicates little or no correlation.\nA value close to 1 indicates strong positive correlation.\nA value close to -1 indicates strong negative correlation.\n\nIn between 0 and 1 (or -1), we often use words like weak, moderately weak, moderate, and moderately strong. There are no exact cutoffs for when such words apply. You must learn from experience how to judge scatterplots and r values to make such determinations.\nA correlation is positive when low values of one variable are associated with low values of the other variable. Similarly, high values of one variable are associated with high values of the other. For example, exercise is positively correlated with burning calories. Low exercise levels will burn a few calories; high exercise levels burn more calories, on average.\nA correlation is negative when low values of one variable are associated with high values of the other variable, and vice versa. For example, tooth brushing is negatively correlated with cavities. 
Less tooth brushing may result in more cavities; more tooth brushing is associated with fewer cavities, on average.", "crumbs": [ "6  Correlation" ] @@ -754,7 +754,7 @@ "href": "06-correlation-web.html#calculating-correlation", "title": "6  Correlation", "section": "6.5 Calculating correlation", - "text": "6.5 Calculating correlation\nSince the conditions are met, We calculate the correlation coefficient using the cor command.\n\ncor(chredlin$race, chredlin$involact)\n\n[1] 0.713754\n\n\nThe order of the variables doesn’t matter; correlation is symmetric, so the r value is the same independent of the choice of response and predictor variables.\nSince the correlation between involact and race is a positive number and slightly closer to 1 than 0, we might call this a “moderate” positive correlation. You can tell from the scatterplot above that the relationship is not a strong relationship. The words you choose should match the graphs you create and the statistics you calculate.\n\nExercise 8(a)\nCreate a scatterplot of income against race. (Put income on the y-axis and race on the x-axis.)\n\n\n# Add code here to create a scatterplot of income against race\n\n\n\n\nExercise 8(b)\nCheck the three conditions for the relationship between income and race. Which condition is pretty seriously violated here?\n\nPlease write up your answer here.\n\n\n\n\n\n\n\n\nExercise 9(a)\nCreate a scatterplot of theft against fire. (Put theft on the y-axis and fire on the x-axis.)\n\n\n# Add code here to create a scatterplot of theft against fire\n\n\n\n\nExercise 9(b)\nCheck the three conditions for the relationship between theft and fire. Which condition is pretty seriously violated here?\n\n\n\n\n\n\nPlease write up your answer here.\n\n\n\nExercise 9(c)\nEven though the conditions are not met, what if you calculated the correlation coefficient anyway? 
Try it.\n\n\n# Add code here to calculate the correlation coefficient between theft and fire\n\n\n\n\nExercise 9(d)\nSuppose you hadn’t looked at the scatterplot and you only saw the correlation coefficient you calculated in the previous part. What would your conclusion be about the relationship between theft and fire. Why would that conclusion be misleading?\n\nPlease write up your answer here.\n\nThe lesson learned here is that you should never try to interpret a correlation coefficient without looking at a plot of the data to assure that the conditions are met and that the result is a sensible thing to interpret.", +    "text": "6.5 Calculating correlation\nSince the conditions are met, we calculate the correlation coefficient using the cor command.\n\ncor(chredlin$race, chredlin$involact)\n\n[1] 0.713754\n\n\nThe order of the variables doesn’t matter; correlation is symmetric, so the r value is the same independent of the choice of response and predictor variables.\nSince the correlation between involact and race is a positive number and slightly closer to 1 than 0, we might call this a “moderate” positive correlation. You can tell from the scatterplot above that the relationship is not a strong relationship. The words you choose should match the graphs you create and the statistics you calculate.\n\nExercise 8(a)\nCreate a scatterplot of income against race. (Put income on the y-axis and race on the x-axis.)\n\n\n# Add code here to create a scatterplot of income against race\n\n\n\n\nExercise 8(b)\nCheck the three conditions for the relationship between income and race. Which condition(s) are seriously violated here?\n\n\n\n\n\n\n\n\n\nExercise 9(a)\nCreate a scatterplot of theft against fire. (Put theft on the y-axis and fire on the x-axis.)\n\n\n# Add code here to create a scatterplot of theft against fire\n\n\n\n\nExercise 9(b)\nCheck the three conditions for the relationship between theft and fire. 
Which condition(s) are seriously violated here?\n\n\n\n\n\n\n\n\n\nExercise 9(c)\nEven though the conditions are not met, what if you calculated the correlation coefficient anyway? Try it.\n\n\n# Add code here to calculate the correlation coefficient between theft and fire\n\n\n\n\nExercise 9(d)\nSuppose you hadn’t looked at the scatterplot and you only saw the correlation coefficient you calculated in the previous part. What would your conclusion be about the relationship between theft and fire? Why would that conclusion be misleading?\n\nPlease write up your answer here.\n\nThe lesson learned here is that you should never try to interpret a correlation coefficient without looking at a plot of the data to assure that the conditions are met and that the result is a sensible thing to interpret.", "crumbs": [ "6  Correlation" ] @@ -764,7 +764,7 @@ "href": "06-correlation-web.html#correlation-is-not-causation", "title": "6  Correlation", "section": "6.6 Correlation is not causation", -    "text": "6.6 Correlation is not causation\nWhen two variables are correlated—indeed, associated in any way, not just in a linear relationship—that means that there is a relationship between them. However, that does not mean that one variable causes the other variable.\nFor example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, though racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance.\nIn the Chicago example, there is still likely a causal connection between one variable (race) and the other (involact), but it is indirect. 
In other cases, there is no causal connection at all. Here are a few of my favorite examples.\n\nExercise 10\nIce cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Does drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?)\nSee if you can figure out the real reason why ice cream sales are positively correlated with drowning deaths.\n\nPlease write up your answer here.\n\n\nIn the Chicago example, the causal effect was indirect. In the example from the exercise above, there is no causation whatsoever between the two variables. Instead, the causal effect was generated by a third factor that caused both ice cream sales to go up, and also happened to cause drowning deaths to go up. (Or, equivalently stated, it caused ice cream sales to be low during certain times of the year and also caused the drowning deaths to be low as well.) Such a factor is called a lurking variable. When a correlation between two variables exists due solely to the intervention of a lurking variable, that correlation is called a spurious correlation. The correlation is real; a scatterplot of ice cream sales and drowning deaths would show a positive relationship. But the reasons for that correlation to exist have nothing to do with any kind of direct causal link between the two.\nHere’s another one:\n\n\nExercise 11\nMost studies involving children create a number of weird correlations. For example, the height of children is very strongly correlated to pretty much everything you can measure about scholastic aptitude. For example, vocabulary count (the number of words children can use fluently in a sentence) is strongly correlated to height. Are tall people just smarter than short people?\nThe answer is, of course, no. The correlation is spurious. 
So what’s the lurking variable?\n\nPlease write up your answer here.", + "text": "6.6 Correlation is not causation\nWhen two variables are correlated—indeed, associated in any way, not just in a linear relationship—that means that there is a relationship between them. However, that does not mean that one variable causes the other variable.\nFor example, we discovered above that there was a moderate correlation between the racial composition of a ZIP code and the new FAIR policies created in those ZIP codes. However, being part of a racial minority does not cause someone to seek out alternative forms of insurance, at least not directly. In this case, the racial composition of certain neighborhoods, through racist policies, affected the availability of certain forms of insurance for residents in those neighborhoods. And that, in turn, caused residents to seek other forms of insurance.\nIn the Chicago example, there is still likely a causal connection between one variable (race) and the other (involact), but it is indirect. In other cases, there is no causal connection at all. Here are a few of my favorite examples.\n\nExercise 10\nIce cream sales are positively correlated with drowning deaths. Does eating ice cream cause you to drown? (Perhaps the myth about swimming within one hour of eating is really true!) Do drowning deaths cause ice cream sales to rise? (Perhaps people are so sad about all the drownings that they have to go out for ice cream to cheer themselves up?)\nSee if you can figure out the real reason why ice cream sales are positively correlated with drowning deaths.\n\nPlease write up your answer here.\n\n\nIn the Chicago example, the causal effect was indirect. In the example from the exercise above, there is no causation whatsoever between the two variables. Instead, the causal effect was generated by a third factor that caused both ice cream sales to go up, and also happened to cause drowning deaths to go up. 
(Or, equivalently stated, it caused ice cream sales to be low during certain times of the year and also caused the drowning deaths to be low as well.) Such a factor is called a lurking variable. When a correlation between two variables exists due solely to the intervention of a lurking variable, that correlation is called a spurious correlation. The correlation is real; a scatterplot of ice cream sales and drowning deaths would show a positive relationship. But the reasons for that correlation to exist have nothing to do with any kind of direct causal link between the two.\nHere’s another one:\n\n\nExercise 11\nMost studies involving children create a number of weird correlations. For example, the height of children is very strongly correlated to pretty much everything you can measure about scholastic aptitude. For example, vocabulary count (the number of words children can use fluently in a sentence) is strongly correlated to height. Are tall people just smarter than short people?\nThe answer is, of course, no. The correlation is spurious. So what’s the lurking variable?\n\nPlease write up your answer here.", "crumbs": [ "6  Correlation" ] @@ -794,7 +794,7 @@ "href": "06-correlation-web.html#visualizing-lurking-variables", "title": "6  Correlation", "section": "6.9 Visualizing lurking variables", - "text": "6.9 Visualizing lurking variables\nWhen we create a scatterplot, we can visualize associations between the two numerical variables. Is there a way to see lurking variables in the scatterplot as well?\nOne simple case is when the lurking variable is a categorical variable. We saw several examples of that in Chapters 3 and 4 in the penguins data. The association (or lack thereof) between variables was often misleading when we failed to take into account the fact that there were three different species of penguin.\nHere are a few more interesting examples. 
The bdims data (hosted in the openintro package) consists of many body measurements taken from 507 physically active individuals. Type ?bdims at the Console to read the help file.\nHere is the data:\n\nbdims\n\n\n \n\n\n\n\nglimpse(bdims)\n\nRows: 507\nColumns: 25\n$ bia_di <dbl> 42.9, 43.7, 40.1, 44.3, 42.5, 43.3, 43.5, 44.4, 43.5, 42.0, 40.…\n$ bii_di <dbl> 26.0, 28.5, 28.2, 29.9, 29.9, 27.0, 30.0, 29.8, 26.5, 28.0, 29.…\n$ bit_di <dbl> 31.5, 33.5, 33.3, 34.0, 34.0, 31.5, 34.0, 33.2, 32.1, 34.0, 33.…\n$ che_de <dbl> 17.7, 16.9, 20.9, 18.4, 21.5, 19.6, 21.9, 21.8, 15.5, 22.5, 20.…\n$ che_di <dbl> 28.0, 30.8, 31.7, 28.2, 29.4, 31.3, 31.7, 28.8, 27.5, 28.0, 30.…\n$ elb_di <dbl> 13.1, 14.0, 13.9, 13.9, 15.2, 14.0, 16.1, 15.1, 14.1, 15.6, 13.…\n$ wri_di <dbl> 10.4, 11.8, 10.9, 11.2, 11.6, 11.5, 12.5, 11.9, 11.2, 12.0, 10.…\n$ kne_di <dbl> 18.8, 20.6, 19.7, 20.9, 20.7, 18.8, 20.8, 21.0, 18.9, 21.1, 19.…\n$ ank_di <dbl> 14.1, 15.1, 14.1, 15.0, 14.9, 13.9, 15.6, 14.6, 13.2, 15.0, 14.…\n$ sho_gi <dbl> 106.2, 110.5, 115.1, 104.5, 107.5, 119.8, 123.5, 120.4, 111.0, …\n$ che_gi <dbl> 89.5, 97.0, 97.5, 97.0, 97.5, 99.9, 106.9, 102.5, 91.0, 93.5, 9…\n$ wai_gi <dbl> 71.5, 79.0, 83.2, 77.8, 80.0, 82.5, 82.0, 76.8, 68.5, 77.5, 81.…\n$ nav_gi <dbl> 74.5, 86.5, 82.9, 78.8, 82.5, 80.1, 84.0, 80.5, 69.0, 81.5, 81.…\n$ hip_gi <dbl> 93.5, 94.8, 95.0, 94.0, 98.5, 95.3, 101.0, 98.0, 89.5, 99.8, 98…\n$ thi_gi <dbl> 51.5, 51.5, 57.3, 53.0, 55.4, 57.5, 60.9, 56.0, 50.0, 59.8, 60.…\n$ bic_gi <dbl> 32.5, 34.4, 33.4, 31.0, 32.0, 33.0, 42.4, 34.1, 33.0, 36.5, 34.…\n$ for_gi <dbl> 26.0, 28.0, 28.8, 26.2, 28.4, 28.0, 32.3, 28.0, 26.0, 29.2, 27.…\n$ kne_gi <dbl> 34.5, 36.5, 37.0, 37.0, 37.7, 36.6, 40.1, 39.2, 35.5, 38.3, 38.…\n$ cal_gi <dbl> 36.5, 37.5, 37.3, 34.8, 38.6, 36.1, 40.3, 36.7, 35.0, 38.6, 40.…\n$ ank_gi <dbl> 23.5, 24.5, 21.9, 23.0, 24.4, 23.5, 23.6, 22.5, 22.0, 22.2, 23.…\n$ wri_gi <dbl> 16.5, 17.0, 16.9, 16.6, 18.0, 16.9, 18.8, 18.0, 16.5, 16.9, 16.…\n$ age <int> 21, 23, 28, 23, 22, 
21, 26, 27, 23, 21, 23, 22, 20, 26, 23, 22,…\n$ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6, 76.…\n$ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 175.0, …\n$ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …\n\n\nMost physical body measurements are known to be correlated; this makes sense because when one part of the body is larger, we expect lots of other body parts to be larger as well (and similarly for smaller individuals).\nFor example, it’s no surprise that shoulder girth (sho_gi) and chest girth (che_gi) are strongly correlated:\n\nggplot(bdims, aes(y = sho_gi, x = che_gi)) +\n geom_point()\n\n\n\n\n\n\n\n\nIs there a possible lurking variable here, though? You may wonder about sex. (In this data set, the sex variable is presumed to be biological sex assigned at birth.)\nBefore we go any further, go back to the help file and the glimpse output above and note that sex is coded as an integer (a whole number).\nWe’ll use the mutate and as_factor commands—illustrated in Chapters 3 and 5—to make a new factor variable.\n\nbdims <- bdims |>\n mutate(sex_fct = as_factor(sex))\nglimpse(bdims)\n\nRows: 507\nColumns: 26\n$ bia_di <dbl> 42.9, 43.7, 40.1, 44.3, 42.5, 43.3, 43.5, 44.4, 43.5, 42.0, 40…\n$ bii_di <dbl> 26.0, 28.5, 28.2, 29.9, 29.9, 27.0, 30.0, 29.8, 26.5, 28.0, 29…\n$ bit_di <dbl> 31.5, 33.5, 33.3, 34.0, 34.0, 31.5, 34.0, 33.2, 32.1, 34.0, 33…\n$ che_de <dbl> 17.7, 16.9, 20.9, 18.4, 21.5, 19.6, 21.9, 21.8, 15.5, 22.5, 20…\n$ che_di <dbl> 28.0, 30.8, 31.7, 28.2, 29.4, 31.3, 31.7, 28.8, 27.5, 28.0, 30…\n$ elb_di <dbl> 13.1, 14.0, 13.9, 13.9, 15.2, 14.0, 16.1, 15.1, 14.1, 15.6, 13…\n$ wri_di <dbl> 10.4, 11.8, 10.9, 11.2, 11.6, 11.5, 12.5, 11.9, 11.2, 12.0, 10…\n$ kne_di <dbl> 18.8, 20.6, 19.7, 20.9, 20.7, 18.8, 20.8, 21.0, 18.9, 21.1, 19…\n$ ank_di <dbl> 14.1, 15.1, 14.1, 15.0, 14.9, 13.9, 15.6, 14.6, 13.2, 15.0, 14…\n$ sho_gi <dbl> 106.2, 110.5, 115.1, 104.5, 107.5, 119.8, 123.5, 120.4, 
111.0,…\n$ che_gi <dbl> 89.5, 97.0, 97.5, 97.0, 97.5, 99.9, 106.9, 102.5, 91.0, 93.5, …\n$ wai_gi <dbl> 71.5, 79.0, 83.2, 77.8, 80.0, 82.5, 82.0, 76.8, 68.5, 77.5, 81…\n$ nav_gi <dbl> 74.5, 86.5, 82.9, 78.8, 82.5, 80.1, 84.0, 80.5, 69.0, 81.5, 81…\n$ hip_gi <dbl> 93.5, 94.8, 95.0, 94.0, 98.5, 95.3, 101.0, 98.0, 89.5, 99.8, 9…\n$ thi_gi <dbl> 51.5, 51.5, 57.3, 53.0, 55.4, 57.5, 60.9, 56.0, 50.0, 59.8, 60…\n$ bic_gi <dbl> 32.5, 34.4, 33.4, 31.0, 32.0, 33.0, 42.4, 34.1, 33.0, 36.5, 34…\n$ for_gi <dbl> 26.0, 28.0, 28.8, 26.2, 28.4, 28.0, 32.3, 28.0, 26.0, 29.2, 27…\n$ kne_gi <dbl> 34.5, 36.5, 37.0, 37.0, 37.7, 36.6, 40.1, 39.2, 35.5, 38.3, 38…\n$ cal_gi <dbl> 36.5, 37.5, 37.3, 34.8, 38.6, 36.1, 40.3, 36.7, 35.0, 38.6, 40…\n$ ank_gi <dbl> 23.5, 24.5, 21.9, 23.0, 24.4, 23.5, 23.6, 22.5, 22.0, 22.2, 23…\n$ wri_gi <dbl> 16.5, 17.0, 16.9, 16.6, 18.0, 16.9, 18.8, 18.0, 16.5, 16.9, 16…\n$ age <int> 21, 23, 28, 23, 22, 21, 26, 27, 23, 21, 23, 22, 20, 26, 23, 22…\n$ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6, 76…\n$ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 175.0,…\n$ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ sex_fct <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n\n\nIf you look at the glimpse output above, you see that we do have a new variable called sex_fct and it is properly coded as a factor variable. However, the labels 0 and 1 (for females and males, respectively) are not very helpful. Can we change them? Yes, the forcats package has a fct_recode function that does just that. 
Here is what it looks like:\n\nbdims <- bdims |>\n mutate(sex_fct = fct_recode(sex_fct, \"female\" = \"0\", \"male\" = \"1\"))\nglimpse(bdims)\n\nRows: 507\nColumns: 26\n$ bia_di <dbl> 42.9, 43.7, 40.1, 44.3, 42.5, 43.3, 43.5, 44.4, 43.5, 42.0, 40…\n$ bii_di <dbl> 26.0, 28.5, 28.2, 29.9, 29.9, 27.0, 30.0, 29.8, 26.5, 28.0, 29…\n$ bit_di <dbl> 31.5, 33.5, 33.3, 34.0, 34.0, 31.5, 34.0, 33.2, 32.1, 34.0, 33…\n$ che_de <dbl> 17.7, 16.9, 20.9, 18.4, 21.5, 19.6, 21.9, 21.8, 15.5, 22.5, 20…\n$ che_di <dbl> 28.0, 30.8, 31.7, 28.2, 29.4, 31.3, 31.7, 28.8, 27.5, 28.0, 30…\n$ elb_di <dbl> 13.1, 14.0, 13.9, 13.9, 15.2, 14.0, 16.1, 15.1, 14.1, 15.6, 13…\n$ wri_di <dbl> 10.4, 11.8, 10.9, 11.2, 11.6, 11.5, 12.5, 11.9, 11.2, 12.0, 10…\n$ kne_di <dbl> 18.8, 20.6, 19.7, 20.9, 20.7, 18.8, 20.8, 21.0, 18.9, 21.1, 19…\n$ ank_di <dbl> 14.1, 15.1, 14.1, 15.0, 14.9, 13.9, 15.6, 14.6, 13.2, 15.0, 14…\n$ sho_gi <dbl> 106.2, 110.5, 115.1, 104.5, 107.5, 119.8, 123.5, 120.4, 111.0,…\n$ che_gi <dbl> 89.5, 97.0, 97.5, 97.0, 97.5, 99.9, 106.9, 102.5, 91.0, 93.5, …\n$ wai_gi <dbl> 71.5, 79.0, 83.2, 77.8, 80.0, 82.5, 82.0, 76.8, 68.5, 77.5, 81…\n$ nav_gi <dbl> 74.5, 86.5, 82.9, 78.8, 82.5, 80.1, 84.0, 80.5, 69.0, 81.5, 81…\n$ hip_gi <dbl> 93.5, 94.8, 95.0, 94.0, 98.5, 95.3, 101.0, 98.0, 89.5, 99.8, 9…\n$ thi_gi <dbl> 51.5, 51.5, 57.3, 53.0, 55.4, 57.5, 60.9, 56.0, 50.0, 59.8, 60…\n$ bic_gi <dbl> 32.5, 34.4, 33.4, 31.0, 32.0, 33.0, 42.4, 34.1, 33.0, 36.5, 34…\n$ for_gi <dbl> 26.0, 28.0, 28.8, 26.2, 28.4, 28.0, 32.3, 28.0, 26.0, 29.2, 27…\n$ kne_gi <dbl> 34.5, 36.5, 37.0, 37.0, 37.7, 36.6, 40.1, 39.2, 35.5, 38.3, 38…\n$ cal_gi <dbl> 36.5, 37.5, 37.3, 34.8, 38.6, 36.1, 40.3, 36.7, 35.0, 38.6, 40…\n$ ank_gi <dbl> 23.5, 24.5, 21.9, 23.0, 24.4, 23.5, 23.6, 22.5, 22.0, 22.2, 23…\n$ wri_gi <dbl> 16.5, 17.0, 16.9, 16.6, 18.0, 16.9, 18.8, 18.0, 16.5, 16.9, 16…\n$ age <int> 21, 23, 28, 23, 22, 21, 26, 27, 23, 21, 23, 22, 20, 26, 23, 22…\n$ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6, 
76…\n$ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 175.0,…\n$ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ sex_fct <fct> male, male, male, male, male, male, male, male, male, male, ma…\n\n\nThis will be a lot more helpful!\nNow, back to the scatterplots.\nOne way we learned (in Chapters 3 and 4) to incorporate a third variable into the analysis is through the use of color as an additional aesthetic element. We’ll use our new sex_fct variable. Also, don’t forget to use the Viridis color palette and the black-and-white theme.\n\nggplot(bdims, aes(y = sho_gi, x = che_gi, color = sex_fct)) +\n geom_point() +\n scale_color_viridis_d() +\n theme_bw()\n\n\n\n\n\n\n\n\nIn this example, there is a strong correlation between shoulder girth and chest girth, but females and males lie in completely different parts of the graph. Having said that, if you focus on the females separately, you can still see a strong positive correlation, and if you focus on males separately, there is also a strong positive correlation there. So the inclusion of sex didn’t really change much about the nature of the correlation in this example. Even still, the correlation coefficients do change a little depending on whether we look at the whole data set versus females/males separately:\n\ncor(bdims$sho_gi, bdims$che_gi)\n\n[1] 0.9271923\n\n\n\nbdims |>\n group_by(sex_fct) |>\n summarise(corr = cor(sho_gi, che_gi))\n\n\n \n\n\n\n\nExercise 14\nWhy would the correlation coefficient be stronger for the whole data set and slightly less strong for the sexes separately? (Hint: think about sample size.)\n\nPlease write up your answer here.\n\n\nIn the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. 
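As a point of comparison, here is a small synthetic sketch (toy data invented for illustration; it is not part of bdims) of a lurking categorical variable that does radically alter the picture: within each group, x and y are unrelated, yet pooling the two groups produces a strong positive correlation because one cloud of points sits up and to the right of the other.

```r
# Toy illustration (synthetic data, not from bdims): two groups with
# no within-group relationship, but the group means line up, so the
# pooled correlation is strong.
library(dplyr)

set.seed(1)
toy <- bind_rows(
  tibble(x = rnorm(100, mean = 0), y = rnorm(100, mean = 0), grp = "A"),
  tibble(x = rnorm(100, mean = 5), y = rnorm(100, mean = 5), grp = "B")
)

cor(toy$x, toy$y)             # strong positive: the two clouds line up

toy |>
  group_by(grp) |>
  summarise(corr = cor(x, y)) # near zero within each group
```

Splitting first and then correlating, as with group_by above, reveals that the apparent relationship lives between the groups, not within them.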
What about the examples in these next two sets of exercises?\n\n\nExercise 15(a)\nCreate a scatterplot of thigh girth against weight (put thi_gi on the y-axis and wgt on the x-axis).\n\n\n# Add code here to create a scatterplot of thigh girth against weight.\n\n\n\n\nExercise 15(b)\nChange the scatterplot above to include sex_fct as a color aesthetic. (Use the Viridis color palette and theme_bw.)\n\n\n# Add code here to add color for sex_fct.\n\n\n\n\nExercise 15(c)\nCalculate the correlation coefficients for thigh girth and weight, once for the whole data set, and again for the data split by sex_fct (as above).\n\n\n# Add code here to calculate the correlation coefficient\n# between thigh girth and weight.\n\n\n# Add code here to calculate the correlation coefficient\n# between thigh girth and weight split by sex.\n\n\n\n\nExercise 15(d)\nExplain how sex is a lurking variable here. In other words, how did ignoring/considering sex alter the way we perceived the correlation between thigh girth and weight? What changed about the nature of the correlation within each sex category?\n\nPlease write up your answer here.\n\n\n\nExercise 16(a)\nThe help file for the bia_di variable describes it as the “respondent’s biacromial diameter in centimeters.” What is “biacromial diameter”?\n\nPlease write up your answer here.\n\n\n\nExercise 16(b)\nCreate a scatterplot of biacromial diameter against weight (put bia_di on the y-axis and wgt on the x-axis).\n\n\n# Add code here to create a scatterplot of biacromial diameter against weight.\n\n\n\n\nExercise 16(c)\nChange the scatterplot above to include sex_fct as a color aesthetic. 
(Use the Viridis color palette and theme_bw.)\n\n\n# Add code here to add color for sex_fct.\n\n\n\n\nExercise 16(d)\nCalculate the correlation coefficients for biacromial diameter and weight, once for the whole data set, and again for the data split by sex_fct (as above).\n\n\n# Add code here to calculate the correlation coefficient\n# between biacromial diameter and weight\n\n\n# Add code here to calculate the correlation coefficient\n# between biacromial diameter and weight split by sex\n\n\n\n\nExercise 16(e)\nExplain how sex is a lurking variable here. In other words, how did ignoring/considering sex alter the way we perceived the correlation between biacromial diameter and weight? What changed about the nature of the correlation within each sex category?\n\nPlease write up your answer here.\n\n\nThe take-home message here is that lurking variables can change the strength of the correlation between two variables, making it appear stronger or weaker. In more extreme cases, it’s even possible to change the direction of the correlation altogether! There isn’t an example of this phenomenon in the bdims data, but we do find one in the penguins data.\nHere is a scatterplot of bill depth against bill length.\n\nggplot(penguins, aes(y = bill_depth_mm, x = bill_length_mm)) +\n geom_point()\n\nWarning: Removed 2 rows containing missing values or values outside the scale range\n(`geom_point()`).\n\n\n\n\n\n\n\n\n\nThere is not much correlation between bill depth and bill length, but if anything, it looks like there might be a slightly negative association. 
(In the following code chunk, the cor command uses a different method for dealing with missing data.)\n\ncor(penguins$bill_depth_mm, penguins$bill_length_mm,\n use = \"complete.obs\")\n\n[1] -0.2350529\n\n\nNow split by species:\n\nggplot(penguins, aes(y = bill_depth_mm, x = bill_length_mm,\n color = species)) +\n geom_point() +\n scale_color_viridis_d() +\n theme_bw()\n\nWarning: Removed 2 rows containing missing values or values outside the scale range\n(`geom_point()`).\n\n\n\n\n\n\n\n\n\n\npenguins |>\n group_by(species) |>\n summarise(corr = cor(bill_depth_mm, bill_length_mm,\n use = \"complete.obs\"))\n\n\n \n\n\n\nThere was a very weak negative correlation in the full data set, but, behold, bill depth and bill length are positively correlated within each species!\nThe phenomenon of an association between two variables “reversing” direction when considering a third variable is often called “Simpson’s Paradox”. We’ll revisit Simpson’s Paradox in a future chapter.", "text": "6.9 Visualizing lurking variables\nWhen we create a scatterplot, we can visualize associations between the two numerical variables. Is there a way to see lurking variables in the scatterplot as well?\nOne simple case is when the lurking variable is a categorical variable. We saw several examples of that in Chapters 3 and 4 in the penguins data. The association (or lack thereof) between variables was often misleading when we failed to take into account the fact that there were three different species of penguin.\nHere are a few more interesting examples. The bdims data (hosted in the openintro package) consists of many body measurements taken from 507 physically active individuals. 
Type ?bdims at the Console to read the help file.\nHere is the data:\n\nbdims\n\n\n \n\n\n\n\nglimpse(bdims)\n\nRows: 507\nColumns: 25\n$ bia_di <dbl> 42.9, 43.7, 40.1, 44.3, 42.5, 43.3, 43.5, 44.4, 43.5, 42.0, 40.…\n$ bii_di <dbl> 26.0, 28.5, 28.2, 29.9, 29.9, 27.0, 30.0, 29.8, 26.5, 28.0, 29.…\n$ bit_di <dbl> 31.5, 33.5, 33.3, 34.0, 34.0, 31.5, 34.0, 33.2, 32.1, 34.0, 33.…\n$ che_de <dbl> 17.7, 16.9, 20.9, 18.4, 21.5, 19.6, 21.9, 21.8, 15.5, 22.5, 20.…\n$ che_di <dbl> 28.0, 30.8, 31.7, 28.2, 29.4, 31.3, 31.7, 28.8, 27.5, 28.0, 30.…\n$ elb_di <dbl> 13.1, 14.0, 13.9, 13.9, 15.2, 14.0, 16.1, 15.1, 14.1, 15.6, 13.…\n$ wri_di <dbl> 10.4, 11.8, 10.9, 11.2, 11.6, 11.5, 12.5, 11.9, 11.2, 12.0, 10.…\n$ kne_di <dbl> 18.8, 20.6, 19.7, 20.9, 20.7, 18.8, 20.8, 21.0, 18.9, 21.1, 19.…\n$ ank_di <dbl> 14.1, 15.1, 14.1, 15.0, 14.9, 13.9, 15.6, 14.6, 13.2, 15.0, 14.…\n$ sho_gi <dbl> 106.2, 110.5, 115.1, 104.5, 107.5, 119.8, 123.5, 120.4, 111.0, …\n$ che_gi <dbl> 89.5, 97.0, 97.5, 97.0, 97.5, 99.9, 106.9, 102.5, 91.0, 93.5, 9…\n$ wai_gi <dbl> 71.5, 79.0, 83.2, 77.8, 80.0, 82.5, 82.0, 76.8, 68.5, 77.5, 81.…\n$ nav_gi <dbl> 74.5, 86.5, 82.9, 78.8, 82.5, 80.1, 84.0, 80.5, 69.0, 81.5, 81.…\n$ hip_gi <dbl> 93.5, 94.8, 95.0, 94.0, 98.5, 95.3, 101.0, 98.0, 89.5, 99.8, 98…\n$ thi_gi <dbl> 51.5, 51.5, 57.3, 53.0, 55.4, 57.5, 60.9, 56.0, 50.0, 59.8, 60.…\n$ bic_gi <dbl> 32.5, 34.4, 33.4, 31.0, 32.0, 33.0, 42.4, 34.1, 33.0, 36.5, 34.…\n$ for_gi <dbl> 26.0, 28.0, 28.8, 26.2, 28.4, 28.0, 32.3, 28.0, 26.0, 29.2, 27.…\n$ kne_gi <dbl> 34.5, 36.5, 37.0, 37.0, 37.7, 36.6, 40.1, 39.2, 35.5, 38.3, 38.…\n$ cal_gi <dbl> 36.5, 37.5, 37.3, 34.8, 38.6, 36.1, 40.3, 36.7, 35.0, 38.6, 40.…\n$ ank_gi <dbl> 23.5, 24.5, 21.9, 23.0, 24.4, 23.5, 23.6, 22.5, 22.0, 22.2, 23.…\n$ wri_gi <dbl> 16.5, 17.0, 16.9, 16.6, 18.0, 16.9, 18.8, 18.0, 16.5, 16.9, 16.…\n$ age <int> 21, 23, 28, 23, 22, 21, 26, 27, 23, 21, 23, 22, 20, 26, 23, 22,…\n$ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6, 76.…\n$ hgt 
<dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 175.0, …\n$ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …\n\n\nMost physical body measurements are known to be correlated; this makes sense because when one part of the body is larger, we expect lots of other body parts to be larger as well (and similarly for smaller individuals).\nFor example, it’s no surprise that shoulder girth (sho_gi) and chest girth (che_gi) are strongly correlated:\n\nggplot(bdims, aes(y = sho_gi, x = che_gi)) +\n geom_point()\n\n\n\n\n\n\n\n\nIs there a possible lurking variable here, though? You may wonder about sex. (In this data set, the sex variable is presumed to be biological sex assigned at birth.)\nBefore we go any further, go back to the help file and the glimpse output above and note that sex is coded as an integer (a whole number). We’ll use the mutate and as_factor commands—illustrated in Chapters 3 and 5—to make a new factor variable.\n\nbdims <- bdims |>\n mutate(sex_fct = as_factor(sex))\nglimpse(bdims)\n\nRows: 507\nColumns: 26\n$ bia_di <dbl> 42.9, 43.7, 40.1, 44.3, 42.5, 43.3, 43.5, 44.4, 43.5, 42.0, 40…\n$ bii_di <dbl> 26.0, 28.5, 28.2, 29.9, 29.9, 27.0, 30.0, 29.8, 26.5, 28.0, 29…\n$ bit_di <dbl> 31.5, 33.5, 33.3, 34.0, 34.0, 31.5, 34.0, 33.2, 32.1, 34.0, 33…\n$ che_de <dbl> 17.7, 16.9, 20.9, 18.4, 21.5, 19.6, 21.9, 21.8, 15.5, 22.5, 20…\n$ che_di <dbl> 28.0, 30.8, 31.7, 28.2, 29.4, 31.3, 31.7, 28.8, 27.5, 28.0, 30…\n$ elb_di <dbl> 13.1, 14.0, 13.9, 13.9, 15.2, 14.0, 16.1, 15.1, 14.1, 15.6, 13…\n$ wri_di <dbl> 10.4, 11.8, 10.9, 11.2, 11.6, 11.5, 12.5, 11.9, 11.2, 12.0, 10…\n$ kne_di <dbl> 18.8, 20.6, 19.7, 20.9, 20.7, 18.8, 20.8, 21.0, 18.9, 21.1, 19…\n$ ank_di <dbl> 14.1, 15.1, 14.1, 15.0, 14.9, 13.9, 15.6, 14.6, 13.2, 15.0, 14…\n$ sho_gi <dbl> 106.2, 110.5, 115.1, 104.5, 107.5, 119.8, 123.5, 120.4, 111.0,…\n$ che_gi <dbl> 89.5, 97.0, 97.5, 97.0, 97.5, 99.9, 106.9, 102.5, 91.0, 93.5, …\n$ wai_gi <dbl> 71.5, 79.0, 83.2, 77.8, 80.0, 
82.5, 82.0, 76.8, 68.5, 77.5, 81…\n$ nav_gi <dbl> 74.5, 86.5, 82.9, 78.8, 82.5, 80.1, 84.0, 80.5, 69.0, 81.5, 81…\n$ hip_gi <dbl> 93.5, 94.8, 95.0, 94.0, 98.5, 95.3, 101.0, 98.0, 89.5, 99.8, 9…\n$ thi_gi <dbl> 51.5, 51.5, 57.3, 53.0, 55.4, 57.5, 60.9, 56.0, 50.0, 59.8, 60…\n$ bic_gi <dbl> 32.5, 34.4, 33.4, 31.0, 32.0, 33.0, 42.4, 34.1, 33.0, 36.5, 34…\n$ for_gi <dbl> 26.0, 28.0, 28.8, 26.2, 28.4, 28.0, 32.3, 28.0, 26.0, 29.2, 27…\n$ kne_gi <dbl> 34.5, 36.5, 37.0, 37.0, 37.7, 36.6, 40.1, 39.2, 35.5, 38.3, 38…\n$ cal_gi <dbl> 36.5, 37.5, 37.3, 34.8, 38.6, 36.1, 40.3, 36.7, 35.0, 38.6, 40…\n$ ank_gi <dbl> 23.5, 24.5, 21.9, 23.0, 24.4, 23.5, 23.6, 22.5, 22.0, 22.2, 23…\n$ wri_gi <dbl> 16.5, 17.0, 16.9, 16.6, 18.0, 16.9, 18.8, 18.0, 16.5, 16.9, 16…\n$ age <int> 21, 23, 28, 23, 22, 21, 26, 27, 23, 21, 23, 22, 20, 26, 23, 22…\n$ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6, 76…\n$ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 175.0,…\n$ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ sex_fct <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n\n\nIf you look at the glimpse output above, you see that we do have a new variable called sex_fct and it is properly coded as a factor variable. However, the labels 0 and 1 (for females and males, respectively) are not very helpful. Can we change them? Yes, the forcats package has a fct_recode function that does just that. 
Here is what it looks like:\n\nbdims <- bdims |>\n mutate(sex_fct = fct_recode(sex_fct, \"female\" = \"0\", \"male\" = \"1\"))\nglimpse(bdims)\n\nRows: 507\nColumns: 26\n$ bia_di <dbl> 42.9, 43.7, 40.1, 44.3, 42.5, 43.3, 43.5, 44.4, 43.5, 42.0, 40…\n$ bii_di <dbl> 26.0, 28.5, 28.2, 29.9, 29.9, 27.0, 30.0, 29.8, 26.5, 28.0, 29…\n$ bit_di <dbl> 31.5, 33.5, 33.3, 34.0, 34.0, 31.5, 34.0, 33.2, 32.1, 34.0, 33…\n$ che_de <dbl> 17.7, 16.9, 20.9, 18.4, 21.5, 19.6, 21.9, 21.8, 15.5, 22.5, 20…\n$ che_di <dbl> 28.0, 30.8, 31.7, 28.2, 29.4, 31.3, 31.7, 28.8, 27.5, 28.0, 30…\n$ elb_di <dbl> 13.1, 14.0, 13.9, 13.9, 15.2, 14.0, 16.1, 15.1, 14.1, 15.6, 13…\n$ wri_di <dbl> 10.4, 11.8, 10.9, 11.2, 11.6, 11.5, 12.5, 11.9, 11.2, 12.0, 10…\n$ kne_di <dbl> 18.8, 20.6, 19.7, 20.9, 20.7, 18.8, 20.8, 21.0, 18.9, 21.1, 19…\n$ ank_di <dbl> 14.1, 15.1, 14.1, 15.0, 14.9, 13.9, 15.6, 14.6, 13.2, 15.0, 14…\n$ sho_gi <dbl> 106.2, 110.5, 115.1, 104.5, 107.5, 119.8, 123.5, 120.4, 111.0,…\n$ che_gi <dbl> 89.5, 97.0, 97.5, 97.0, 97.5, 99.9, 106.9, 102.5, 91.0, 93.5, …\n$ wai_gi <dbl> 71.5, 79.0, 83.2, 77.8, 80.0, 82.5, 82.0, 76.8, 68.5, 77.5, 81…\n$ nav_gi <dbl> 74.5, 86.5, 82.9, 78.8, 82.5, 80.1, 84.0, 80.5, 69.0, 81.5, 81…\n$ hip_gi <dbl> 93.5, 94.8, 95.0, 94.0, 98.5, 95.3, 101.0, 98.0, 89.5, 99.8, 9…\n$ thi_gi <dbl> 51.5, 51.5, 57.3, 53.0, 55.4, 57.5, 60.9, 56.0, 50.0, 59.8, 60…\n$ bic_gi <dbl> 32.5, 34.4, 33.4, 31.0, 32.0, 33.0, 42.4, 34.1, 33.0, 36.5, 34…\n$ for_gi <dbl> 26.0, 28.0, 28.8, 26.2, 28.4, 28.0, 32.3, 28.0, 26.0, 29.2, 27…\n$ kne_gi <dbl> 34.5, 36.5, 37.0, 37.0, 37.7, 36.6, 40.1, 39.2, 35.5, 38.3, 38…\n$ cal_gi <dbl> 36.5, 37.5, 37.3, 34.8, 38.6, 36.1, 40.3, 36.7, 35.0, 38.6, 40…\n$ ank_gi <dbl> 23.5, 24.5, 21.9, 23.0, 24.4, 23.5, 23.6, 22.5, 22.0, 22.2, 23…\n$ wri_gi <dbl> 16.5, 17.0, 16.9, 16.6, 18.0, 16.9, 18.8, 18.0, 16.5, 16.9, 16…\n$ age <int> 21, 23, 28, 23, 22, 21, 26, 27, 23, 21, 23, 22, 20, 26, 23, 22…\n$ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62.0, 81.6, 
76…\n$ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 184.5, 175.0,…\n$ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…\n$ sex_fct <fct> male, male, male, male, male, male, male, male, male, male, ma…\n\n\nThis will be a lot more helpful!\nNow, back to the scatterplots.\nOne way we learned (in Chapters 3 and 4) to incorporate a third variable into the analysis is through the use of color as an additional aesthetic element. We’ll use our new sex_fct variable. Also, don’t forget to use the Viridis color palette and the black-and-white theme.\n\nggplot(bdims, aes(y = sho_gi, x = che_gi, color = sex_fct)) +\n geom_point() +\n scale_color_viridis_d() +\n theme_bw()\n\n\n\n\n\n\n\n\nIn this example, there is a strong correlation between shoulder girth and chest girth, but females and males lie in completely different parts of the graph. Having said that, if you focus on the females separately, you can still see a strong positive correlation, and if you focus on males separately, there is also a strong positive correlation there. So the inclusion of sex didn’t really change much about the nature of the correlation in this example. Even still, the correlation coefficients do change a little depending on whether we look at the whole data set versus females/males separately:\n\ncor(bdims$sho_gi, bdims$che_gi)\n\n[1] 0.9271923\n\n\n\nbdims |>\n group_by(sex_fct) |>\n summarise(corr = cor(sho_gi, che_gi))\n\n\n \n\n\n\n\nExercise 14\nWhy would the correlation coefficient be stronger for the whole data set and slightly less strong for the sexes separately? (Hint: think about sample size.)\n\nPlease write up your answer here.\n\n\nIn the previous example, sex was a lurking variable, but it did not radically alter the nature of the association. 
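As a point of comparison, here is a small synthetic sketch (toy data invented for illustration; it is not part of bdims) of a lurking categorical variable that does radically alter the picture: within each group, x and y are unrelated, yet pooling the two groups produces a strong positive correlation because one cloud of points sits up and to the right of the other.

```r
# Toy illustration (synthetic data, not from bdims): two groups with
# no within-group relationship, but the group means line up, so the
# pooled correlation is strong.
library(dplyr)

set.seed(1)
toy <- bind_rows(
  tibble(x = rnorm(100, mean = 0), y = rnorm(100, mean = 0), grp = "A"),
  tibble(x = rnorm(100, mean = 5), y = rnorm(100, mean = 5), grp = "B")
)

cor(toy$x, toy$y)             # strong positive: the two clouds line up

toy |>
  group_by(grp) |>
  summarise(corr = cor(x, y)) # near zero within each group
```

Splitting first and then correlating, as with group_by above, reveals that the apparent relationship lives between the groups, not within them.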
What about the examples in these next two sets of exercises?\n\n\nExercise 15(a)\nCreate a scatterplot of thigh girth against weight (put thi_gi on the y-axis and wgt on the x-axis).\n\n\n# Add code here to create a scatterplot of thigh girth against weight.\n\n\n\n\nExercise 15(b)\nChange the scatterplot above to include sex_fct as a color aesthetic. (Use the Viridis color palette and theme_bw.)\n\n\n# Add code here to add color for sex_fct.\n\n\n\n\nExercise 15(c)\nCalculate the correlation coefficients for thigh girth and weight, once for the whole data set, and again for the data split by sex_fct (as above).\n\n\n# Add code here to calculate the correlation coefficient\n# between thigh girth and weight.\n\n\n# Add code here to calculate the correlation coefficient\n# between thigh girth and weight split by sex.\n\n\n\n\nExercise 15(d)\nExplain how sex is a lurking variable here. In other words, how did ignoring/considering sex alter the way we perceived the correlation between thigh girth and weight? What changed about the nature of the correlation within each sex category?\n\nPlease write up your answer here.\n\n\n\nExercise 16(a)\nThe help file for the bia_di variable describes it as the “respondent’s biacromial diameter in centimeters.” What is “biacromial diameter”?\n\nPlease write up your answer here.\n\n\n\nExercise 16(b)\nCreate a scatterplot of biacromial diameter against weight (put bia_di on the y-axis and wgt on the x-axis).\n\n\n# Add code here to create a scatterplot of biacromial diameter against weight.\n\n\n\n\nExercise 16(c)\nChange the scatterplot above to include sex_fct as a color aesthetic. 
(Use the Viridis color palette and theme_bw.)\n\n\n# Add code here to add color for sex_fct.\n\n\n\n\nExercise 16(d)\nCalculate the correlation coefficients for biacromial diameter and weight, once for the whole data set, and again for the data split by sex_fct (as above).\n\n\n# Add code here to calculate the correlation coefficient\n# between biacromial diameter and weight\n\n\n# Add code here to calculate the correlation coefficient\n# between biacromial diameter and weight split by sex\n\n\n\n\nExercise 16(e)\nExplain how sex is a lurking variable here. In other words, how did ignoring/considering sex alter the way we perceived the correlation between biacromial diameter and weight? What changed about the nature of the correlation within each sex category?\n\nPlease write up your answer here.\n\n\nThe take-home message here is that lurking variables can change the strength of the correlation between two variables, making it appear stronger or weaker. In more extreme cases, it’s even possible to change the direction of the correlation altogether! There isn’t an example of this phenomenon in the bdims data, but we do find one in the penguins data.\nHere is a scatterplot of bill depth against bill length.\n\nggplot(penguins, aes(y = bill_depth_mm, x = bill_length_mm)) +\n geom_point()\n\nWarning: Removed 2 rows containing missing values or values outside the scale range\n(`geom_point()`).\n\n\n\n\n\n\n\n\n\nThere is not much correlation between bill depth and bill length, but if anything, it looks like there might be a slightly negative association. 
(In the following code chunk, the cor command uses a different method for dealing with missing data.)\n\ncor(penguins$bill_depth_mm, penguins$bill_length_mm,\n use = \"complete.obs\")\n\n[1] -0.2350529\n\n\nNow split by species:\n\nggplot(penguins, aes(y = bill_depth_mm, x = bill_length_mm,\n color = species)) +\n geom_point() +\n scale_color_viridis_d() +\n theme_bw()\n\nWarning: Removed 2 rows containing missing values or values outside the scale range\n(`geom_point()`).\n\n\n\n\n\n\n\n\n\n\npenguins |>\n group_by(species) |>\n summarise(corr = cor(bill_depth_mm, bill_length_mm,\n use = \"complete.obs\"))\n\n\n \n\n\n\nThere was a very weak negative correlation in the full data set, but, behold, bill depth and bill length are positively correlated within each species!\nThe phenomenon of an association between two variables “reversing” direction when considering a third variable is often called “Simpson’s Paradox”. We’ll revisit Simpson’s Paradox in a future chapter.", "crumbs": [ "6  Correlation" ]