-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path02_getting_started_with_r.qmd
903 lines (689 loc) · 33.4 KB
/
02_getting_started_with_r.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
# Getting Started with R
## The RStudio interface
The first screen we see in **R** should look like this:
[![The first screen we see when we open **R**Studio](./images/rstudio_first_screen.jpg)](./images/rstudio_first_screen.jpg)
Here, there are three main window panes. The "Console" window is where we type
the code to tell the computer to do stuff.
The "Environment-History-Connections" window has three tabs. "Environment"
shows the objects that we save during our session (I will explain this
soon); "History" shows a record of all the code we ask the computer to run; and
"Connections" (which we will not use here) shows us the connections we have to
remote databases.
The "Files-Plots-Packages" window has several tabs. "Files" shows us all the
files in our current working directory, which is the default location where
**R** will look for files we want to load and where it will put any files we
save. "Plots" will display the plots we make in **R**. "Packages" shows us the
packages installed in our computer and whether they are loaded in our current
session. "Help" allows us to search and read the documentation for packages and
functions. "Viewer" can display content that usually belongs to a web browser
(i.e., html files).
We can tell **R** to do stuff by typing *commands* in the line that begins with
`>` in the console window. This line is called "command prompt". Let's begin
with something simple:
```{r simplest operation}
1 + 1
```
The `[1]` that appears next to the result informs us that the line begins with
the first value in the result. This information is helpful when we run commands
that produce multiple values, like in this case:
```{r quick sequence of numbers}
80:110
```
If we type an incomplete command, **R** will assume that we are still writing
and will show a `+`. This `+` is *not* an arithmetical operator; it is a prompt
for us to continue writing in the next line. The prompt will not go away unless
we complete the command or we press the "Escape" key. For example:
```{r}
#| eval: false
> 42 -
+
+
+ 1
[1] 41
```
If **R** does not understand a command, it will display an error message.
```{r}
#| eval: false
> 7 % 2
Error: unexpected input in "7 % 2"
>
```
In this case, what I meant was
```{r}
7 %% 2
```
## Using R as a calculator
**R** can perform basic arithmetic operations. Type the
following expressions at the command prompt (the line that begins with `>`) in
the **R** console window.
```{r simple arithmetic}
5 + 3
5 - 3
5*3
5/3
```
**R** can also do exponentiation with `^`^[`**` also works for exponentiation,
but it's better to stick to `^`.]; modulus, also known as remainder from
division, with `%%`; and integer division with `%/%`.
If we try to do shoddy math, **R** will inform us (but not necessarily with an
error message):
```{r shoddy math}
#| error: true
0/0
-1/0
```
**R** also has logical operators that will return "TRUE" or
"FALSE".
```{r logical operators}
5 == 6
5 != 6
5 < 6
```
The logical operators "and" (`&`) and "or" (`|`) allow us to combine multiple logical statements:
```{r conjunction and disjunction}
(5 < 6) & (7 == 8)
(5 < 6) | (7 == 8)
```
::: {.callout-tip}
## Practice
Think of an integer, double it, add six, divide it in half, subtract the number
you started with, and then square it. If your code is correct, the answer should
be nine.
:::
**Be careful of comparisons and floating point arithmetic**:
```{r floating point arithmetic}
(.1 + .2) == .3
(.1 + .2) - .3
```
<!--
The `all.equal` function will test if two values are "close enough":
`> all.equal(5, sqrt(5)^2)`
`> ?all.equal`
-->
This is not **R**'s fault. All computers have limited space to
store numbers and decimal positions. So, computers often use numbers with
"hidden" decimals even when it looks like they use round numbers. For example,
what looks like a `3` may actually be `3.000005`. This is common enough to
justify the **R** [FAQ (7.31)](https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f).
Being able to do these computations is nice, but what happens if we want to use
a result later without redoing the computation? **R** can save numbers (and
more) using something called "objects".
## Objects
In simple terms, an object is a named piece of information that we can access
whenever we want. A single number, a word, a plot, and a set of instructions can
all be objects if we name them. To create an object, we must first choose a
name, then type an arrow `<-` (a "lower than" symbol `<` followed by a "minus"
sign `-`), and then write the value that we want to save.
```{r create first object}
my_object <- 42
```
Another way of creating the object is to use the "equal to" sign `=` instead of
the arrow.
```{r create first object method 2}
my_object = 42
```
In this context, both `<-` and `=` are acting as "assignment operators" because
they assign the value `42` to the name `my_object`. In **R**, `=` has several
uses, so if you want to avoid confusion, I suggest you use `<-`. But you can
pick whichever symbol you like the most as long as you are consistent.
Now the environment window shows that there is an object called `my_object`,
which we can access at anytime by typing its name.
[![The environment pane with an object](./images/environment_with_object.jpg)](./images/environment_with_object.jpg)
```{r call my_object}
my_object
```
`my_object` has a number as its value, so we can do mathematical operations with
it.
```{r}
my_object + 7
my_object^my_object
```
We can also create objects based on other objects.
```{r}
a <- 10
b <- a
a
b
```
Be careful: if we try to reuse a name, **R** will overwrite the previous object
without notifying us in any way.
```{r overwriting object}
c <- 2
c
c <- 711
c
```
However, if an object is the copy of another object, overwriting the original
does not affect the copy; and overwriting the copy does not affect the
original.
```{r}
a <- 10
b <- a
b
a <- 5
b
```
We can name our objects in almost any way we want. The only rules are:
+ Names *can* be a combination of letters, digits, periods `.` and underscores
`_`.
+ Names *cannot* include white spaces.
+ If a name starts with a period `.`, it *cannot* be followed by a digit.
+ Names *cannot* start with a number or an underscore `_`.
+ Names are case sensitive (for example, `age`, `Age` and `AGE` are three
different objects).
+ Reserved words (like `TRUE`, `FALSE`, `NULL`, `if`, ...) *cannot* be used as
names.
Also, here are a few suggestions that can save you many hours of frustration:
+ Avoid giving your object the same name as a built-in function.
+ If you need to create objects with multiple words in their name, separate them
with an underscore (`my_object`) or a dot (`my.object`), or capitalize the
different words (`MyObject`). Choose whichever format you like most and be
consistent with it.
+ Use informative names. It is quick and easy to use names like `x` or `value1`.
But your code will be easier and faster to understand if your objects have names
that illustrate their purposes. Your colleagues and your future self will really
appreciate it.
## Using functions
**R** has many built-in functions to do anything from basic math to text
manipulation to advanced statistical models. **R** thinks of functions as
objects that store commands that we can apply to an input. For example, we can
input a number to round it, or to calculate its square root, its factorial, or
its natural logarithm. To use a function, we must write the name of a function
followed by the input we want it to use inside a pair of round brackets.
```{r simple mathematical functions}
#| results: hide
round(2.1415)
sqrt(9)
factorial(7)
log(100)
```
The pieces of information inside the round brackets are called "arguments". Some
functions, like `sqrt()` and `factorial()`, only use one argument (the number
that we want to use). But other functions, like `round()` and `log()`, can take
more arguments (separated by commas) to modify their behavior. For example,
`round()` accepts a second argument that corresponds to the number of decimal
places that we want to keep.
```{r rounding to different decimal places}
round(2.1415, 0)
round(2.1415, 2)
```
The order of these arguments is important (what happens if we try to run
`round(2, 2.1415)`?) Fortunately, arguments have names that we can use to
specify which data to use in each case. So, we can write
```{r naming arguments}
round(x = 2.1415, digits = 3)
```
Even better, we can pass arguments in any order as long as we name them all.
```{r}
round(digits = 3, x = 2.1415)
```
Note that using the wrong name in a function will likely yield an error message.
```{r}
#| error: true
round(x = 2.1415, basket = 3)
```
A quick way to check what arguments we can use is to invoke `args()`, which
takes the name of a function as its argument:
```{r}
args(round)
```
The output shows that `round()` has two arguments: `x`, which is the number we
want to round; and `digits`, which determines how many decimal places to keep.
The output also shows that `digits` is already set to zero. This means that
`round` will use 0 as the default value for `digits`, i.e., it will not keep any
decimal places. We can omit arguments with default values and **R** will
automatically fill them for us in the background.
I encourage you to name all the arguments in the function, or at least the ones
after the first argument. It is hard to remember all the arguments that can go
into more complicated functions. And it is even harder to remember them in the
correct order. Accidentally switching unnamed arguments can produce wrong
results without us knowing. Naming the arguments will prevent errors and clarify
the function's behavior.
We can also run functions inside other functions. This works because **R** first
runs the innermost function and then uses the result in the outermost function.
```{r nested function}
sqrt(round(16.483, digits = 0))
```
The functions we have used are simple and intuitive to use. But sometimes we
need to know more about a function before we can use it. Where can we find
explanations to help us?
## Getting help
The `help()` function allows us to access **R**'s built-in help information on
any function. For example, to open the help page for `round()`, we do
```{r help function}
#| eval: false
help(round)
```
A shorter way of writing this is to use `?` before the name of the function.
```{r help shortcut}
#| eval: false
?round
```
After you run the code, the help page is displayed in the "Help" tab in the
"Files-Plots-Packages" pane (usually in the bottom right of **R**Studio).
As a novice user, help pages may seem arcane---perhaps because they aim for
shortness and use jargon. But this short jargon makes (most of) the explanations
precise, so we can use the information we need without having to read the entire
document. Also, all help pages are organized similarly, so we don't have to
relearn how to navigate them. With a bit of practice, you will be able to find
exactly what you need in mere seconds.
The first line of the help document displays the name of the function and the
package that contains the function. Other sections are:
+ **Description**: summarizes the function.
+ **Usage**: names the arguments associated with the function and possible
default values.
+ **Arguments**: describes the purpose and format of each input.
+ **Details**: explicates what the function does and how it does it.
+ **Value**: if applicable, describes the type and structure of the object that
the function returns.
+ **See Also**: suggests other help pages with similar or related content.
+ **Examples**: shows code to illustrate how to use the function. We can copy
and paste examples in the console, and we can access them at any time by using
the `example()` function (e.g., `example("round")`).
`help()` is useful if we know the full name of the function. But if all we remember is a key word in the name, we can search through **R**'s help system using `help.search()`
```{r}
#| eval: false
help.search("round")
```
Note that in this case we have to use quotes `" "` around the name of the function. We can also use the shortcut `??`
```{r}
#| eval: false
??round
```
As before, the 'Help' tab in **R**Studio will display the results of the search.
`help.search()` looks for the pattern in the help documentation, code
demonstrations, and package vignettes and displays the results as clickable
links that we can follow.
Another useful function is `apropos()`, which lists all functions containing a
specified character string. Note that `apropos()` returns names even when the
word we used is only part of a larger word, like in "surround". For example, to
find all functions with `round` in their name, we use^[This output should show 6
names instead of 3. I can not fix the output in the notes, but the code works
fine inside **R**Studio.]
```{r}
apropos("round")
```
Another useful function is `RSiteSearch()`, which allows us to search for
keywords and phrases in function help pages and vignettes for *all* CRAN
packages, and in CRAN task views. This way we can access the
[online search engine](https://www.r-project.org/search.html)
directly from the Console and display the results in our web browser.
```{r}
#| eval: false
RSiteSearch("regression")
```
:::callout-tip
## How to get started with help pages?
Start by reading the help pages of functions that you already understand. This
will teach you how to understand the structure of the pages and will familiarize
you with the jargon. As you use **R** you will likely need other, more
complicated functions, so reading more help pages will happen almost naturally.
Keep in mind that help pages explain the code, not the underlying concepts. If
you don't know what it means to round a number, reading the documentation for
`round()` will not help you.
:::
<!--
Now that we know how to get help, we can move on to more advanced stuff.
-->
Until now, the console satisfied all our coding needs. But if we wanted to reuse
a command from before, we would have to either rewrite it from scratch, or look
tediously through our command history in the "Environment-History-Connections"
window. Besides, once we close this session in **R**Studio, all of our work will
be gone, like tears in the rain. It would be much better if we wrote our code in
a place where we can easily review and modify it.
## Working with scripts
A script is a plain text file where we write our code. With a script we can
edit, proofread, and reuse code; we can easily write multi-line code (which is
cumbersome to do in the console); and we can save our code so we can come back
to it later or share it with others. I think the best way to code in **R** is to
use a script, and I strongly suggest you always use one.
To open a script, click on `File > New File > R script` in the menu bar on the upper left-hand side of the screen.
[![A screen with a new script](./images/new_script_screen.jpg)](./images/new_script_screen.jpg)
We can run a line of code in the script using the `Run` button in the upper
right-hand side of the script window (see picture below). **R** will run
whichever line of code your cursor is on. We can also run all of the code in the
script by clicking the `Source` button that's next to `Run`. I find that
clicking buttons is slow, so I use a keyboard shortcut for running a line of
code: `Control + Enter` in Windows and Linux; `Command + Enter` in Mac.
[![Buttons for running code](./images/run_code_buttons_edited.jpg)](./images/run_code_buttons_edited.jpg)
To save your script, you can click on the blue square^[Young reader, this square
is a [floppy disk](https://en.wikipedia.org/wiki/Floppy_disk); not-so-young
reader, I feel your pain.] right below the tab with the script name. The first
time you save your script, **R** will ask you to choose a name and a location
to save your file in. I strongly suggest that you pick a name and a location
that you will remember easily in the future.
Moving forward, I will write all my code in a script and will assume you are
doing the same. Trust me, it's for your own benefit.
Now that we know how to preserve our hard-gained code, we can do more
elaborate work. Previously we worked with one number at the time. But we can
also work with groups of numbers (and of other stuff) by using something called
"vectors".
## Working with vectors
In **R**, an ordered group of numbers is called a vector. To create a vector of
numbers, we need to use the function `c()` (short for "combine"). The arguments
of `c()` are the numbers you want to use in the vector, in the order you want to
use them.
```{r first vector}
my_vec <- c(5, 3, 7, 1, 1, 8) # This is a vector of numbers
my_vec
```
::: {.callout-note}
## Comments
Did you see the text I added next to the code? **R** will ignore everything in a
line that comes after a hashtag `#`. This `#` is known as *commenting symbol*
because it allows us to comment our code. Comments can explain confusing chunks
of code, or warn about potential problems.
:::
Vectors can also contain other types of data, like words:
```{r vector of names}
vector_of_words <- c("monday", "lemon")
vector_of_words
```
Note that we must enclose words in quotation marks to let **R** know that we
want to use "monday" and "lemon" as values and not as names for objects. Look at
what happens if we forget the quotation marks:
```{r}
#| error: true
vector_of_words <- c(monday, lemon)
```
Later on we will work more with vectors of words, but for now let's focus on
vectors of numbers.
### Named vectors
We can name the elements of a vector if we add a name followed by an equal size
before the value that we want to save:
```{r named vector}
my_named_vec <- c(tigers = 2, lions = 5)
my_named_vec
```
Named vectors help us store more information in the same object. They also give
us another way of accessing the elements inside the vector.
### Operations with numerical vectors
**R** is a "vectorized" language, which means that it can often operate on an
entire vector of numbers as easily as on a single number. All the logical and
mathematical functions we used before work with vectors:
```{r simple math with vectors}
my_vec / 11
my_vec <= 7
log(my_vec)
my_vec*my_vec
```
In the last example, **R** did not obey the rules of linear algebra to multiply
two vectors. Instead, **R** used "element-wise execution", which means that
**R** applied the same operation to each element of the vector independently of
the other elements.
To decide how to apply element-wise execution, **R** considers the *length* of
the vectors, which refers to the number of elements inside them. When we use two
vectors with the same length, **R** will line up the vectors and perform a
sequence of individual operations. For instance, in `my_vec*my_vec`, **R**
multiplies the first element of vector 1 by the first element of vector 2, then
the second element of vector 1 by the second element of vector 2, and so on,
until all elements are multiplied. The result will be a new vector with the same
length as the first two.
When we use two vectors with *different* lengths, **R** will repeat the shorter
vector until it has as many elements as the longer vector, and then it will do
the same as before. For example:
```{r recycling vectors 1}
my_vec * c(1, 2)
```
If the length of the short vector does not divide evenly into the length of the
long vector, **R** will do an incomplete repeat of the shorter vector and return
a warning.
```{r recycling vectors 2}
my_vec * c(1, 2, 3, 4)
```
Repeating the numbers of the vector is known as "vector recycling", and it helps **R** do element-wise operations.
Element-wise execution allows us to apply the same instruction to all elements
in an object rather than one element at a time. When working with a data set,
element-wise execution will ensure that values from one observation or case are
only paired with values from the same observation or case. Element-wise
execution also facilitates writing our functions in **R**.
**R** can do vector and matrix multiplications, but we have to ask for them
explicitly. For example, to get the inner product, we need the operator `%*%`.
And to get the outer product, we need `%o%`. Don't worry if you are not familiar
with matrix operations; you won't need them in these notes.
### Extracting elements
We can access specific elements of vectors using the square bracket `[ ]`
notation. To use it, we first write the name of the vector we want to extract
from, followed by the square brackets with an index of the element we wish to
extract. This index can be a position or a logical statement. To extract
elements based on their position we simply write the position number inside the
`[ ]`. For example, to extract the 3rd value of `my_vec`, we use
```{r extract value from my_vec}
my_vec[3]
```
We can store this value in another object.
```{r}
value_3 = my_vec[3]
```
We can also extract multiple elements at the time by using a vector of indices
inside the square brackets:
```{r}
my_vec[c(1, 3, 5)]
```
Or we can use `:` notation. Remember that `:` helps us create a sequence of
values. For example,
```{r}
3:6
```
So,
```{r}
my_vec[3:6]
```
::: {.callout-note}
## Note
In **R**, the positional index starts at 1, so to call the first element of a
vector we need to use `[1]`. In many other programming languages (like Python
and C++), the positional index starts at 0.
:::
If the elements of the vector have names, we can extract a value using its name
(surrounded by quotes) instead of a positional index.
```{r extracting named element}
my_named_vec
my_named_vec["lions"]
```
Another convenient way to extract elements from a vector is to use a logical
expression as an index. For example, to extract all elements greater than 4 from
`my_vec`, we do
```{r}
my_vec[my_vec > 4]
```
This works because **R** uses element-wise execution even for logical
statements. So, `my_vec > 4` asks if each item of `my_vec` meets the condition
"greater than four" and returns the corresponding vector of `TRUE` and `FALSE`.
Then, when we add this result to the square brackets, **R** examines each
element of `my_vec` asking "should I extract this element?". If the answer is
`TRUE`, the value is extracted; if it's `FALSE`, the value is ignored. Under the
hood, using `my_vec > 4` is equivalent to
```{r}
my_vec[c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)]
```
Two useful functions to extract values of a vector are `head()`, which extracts
the first few values of a vector; and `tail()`, which gives us the last few
values of a vector. Try them yourself!
### Replacing elements
We can replace the elements of a vector by combining the square bracket notation with the assignment operator. For example, to replace the second element of `my_vec`, we do
```{r replace element}
my_vec[2] <- 99
my_vec
```
To replace multiple elements with the same value, say, elements 5 and 6, we do
```{r}
my_vec[c(5, 6)] <- 55
my_vec
```
**R** can also do element-wise replacement:
```{r}
my_vec[c(5, 6)] <- c(100, 200)
my_vec
```
What happens if you try to replace two values with a vector that has three (or more) values?
Logical expressions help us replace values that meet specific conditions without having to find them ourselves.
```{r}
my_vec[my_vec > 44] <- -1
my_vec
```
### Reordering elements of vectors
To sort the elements of a vector from lowest to highest, we can use `sort()`
```{r}
sorted_vec <- sort(my_vec)
sorted_vec
```
If we want to sort from highest to lowest, we need to set the optional argument `decreasing` to `TRUE`.
```{r}
sorted_vec_decreasing <- sort(my_vec, decreasing = TRUE)
sorted_vec_decreasing
```
Another option is to first use `sort()` and then reverse the sorted vector using `rev()`.
```{r}
sorted_vec_decreasing <- rev(sort(my_vec))
sorted_vec_decreasing
```
A more useful feature of vectors is that we can reorder their elements based on the values of other vectors. To show this, let's first create a vector of city names and a vector with (my guess of) their typical daily temps in degrees Fahrenheit.
```{r}
cities <- c("Tokyo", "Cairo", "Mexico City", "Helsinki")
temps_fahr <- c(50, 90, 65, -10)
```
Now imagine that we want to order the vector of cities, going from coldest to hottest. The first step is to use `order()` to create a new variable called "temps_ordered".
```{r}
temps_ordered <- order(temps_fahr)
temps_ordered
```
This output says that the lowest value in `temps_fahr` is in the fourth position, the second lowest value is in the first position, and so on. We can think of `temps_ordered` as a vector of positional indices of temps in ascending order. Now we can use these indices to reorder the vector of cities.
```{r}
cities_ordered <- cities[temps_ordered]
cities_ordered
```
Ta-da!
These vector manipulations can do more than dazzle your friends. Imagine that we
have a data set with two columns of data and that we want to sort it based on
the values of the first column. If we use `sort()` on each column separately,
the values of each column will become uncoupled from each other. Instead, we can
use `order()` on one column to make a vector of positional indices. Then we can
use these indices on both columns to keep the values of each coupled in the
original way.
So far we have relied on **R**'s built-in capabilities to do everything we need. But sometimes we need to do something for which there isn't a built-in function. In these cases, we can get inventive and write a function ourselves.
## Writing our own functions
Functions, as you may remember, are objects that store commands. The three basic parts of a function are name, code to implement, and arguments. To assemble these parts, we can use the `function()` function (yes, really) followed by a pair of curly brackets `{}`:
```{r}
my_function <- function() {}
```
`function()` will run the code that we write inside the curly brackets. This code is called the body of the function. Let's try something simple, like adding 1 + 1:
```{r make simple_function}
simple_function <- function() {
1 + 1 # This function is very simple
}
```
:::callout-tip
## Indents make code more readable
Indenting the body of the function helps the reader notice that the code is only supposed to run inside the function. Indentation doesn't affect our functions, but it is helpful and predominant among **R** coders.
:::
To run our function, we have to write its name followed by round brackets, just like with any other function:
```{r run simple function}
simple_function()
```
Remember to write the round brackets even if they are empty. The round brackets make **R** *run* the code inside the function. If we don't write these brackets, **R** will *show us* the code inside the function (try it!).
Let's write a function to convert Fahrenheit degrees to Celsius. We want to use this function with different temperatures, so we need to include an argument that will tell **R** which temperature to convert each time.
```{r fahr to celsius without function}
fahr_to_cels <- function(temperature) {
(temperature - 32) / 1.8
}
fahr_to_cels(27)
```
Now let's try something a bit more complicated: solving a quadratic equation. If we have an equation of the form $ax^2 + bx + c = 0$, then the solutions are given by $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$. We can write a function that will apply this formula for us.
```{r solve_quadratic incomplete}
solve_quadratic = function(a, b, c) {
# Quadratic equations can have two solutions
solution_1 = (-b + sqrt(b**2 - 4*a*c)) / 2*a # First solution
solution_2 = (-b - sqrt(b**2 - 4*a*c)) / 2*a # Second solution
}
```
Now all we need is to identify values of $a, b,$ and $c$ to pass as arguments.
```{r}
solve_quadratic(a = 1, b = -1, c = -3)
```
Why didn't our function show a result? When we run a function, **R** runs the
code in the body and returns the result of the last line of code. In
`solve_quadratic()`, the last line saves the second solution but doesn't show
it. We need more code to ensure that the last line displays both solutions.
Since we are using a script (right?), it is easy add one more line to our
function:
```{r solve_quadratic}
solve_quadratic = function(a, b, c) {
# Quadratic equations can have two solutions
solution_1 = (-b + sqrt(b**2 - 4*a*c)) / 2*a # First solution
solution_2 = (-b - sqrt(b**2 - 4*a*c)) / 2*a # Second solution
c(solution_1, solution_2) # Show vector of solutions
}
solve_quadratic(a = 1, b = -1, c = -3)
```
A more explicit way of ensuring our function will display its result is to use the `return()` statement inside the function:
```{r solve_quadratic with return()}
solve_quadratic = function(a, b, c) {
# Quadratic equations can have two solutions
solution_1 = (-b + sqrt(b**2 - 4*a*c)) / 2*a # First solution
solution_2 = (-b - sqrt(b**2 - 4*a*c)) / 2*a # Second solution
return(c(solution_1, solution_2)) # Return vector of solutions
}
solve_quadratic(a = 1, b = -1, c = -3)
```
`return()` can appear anywhere in the body and the function output will still be
whatever `return()` contains. Note that running `return(solution_1, solution_2)`
will produce an error because the result of a function must always be
a single object. This is why we combined both solutions from `solve_quadratic()`
into a single vector.
Our function can have as many arguments as we like. It is enough to add their
names, separated by commas, in the parentheses that follow the function. When
the function runs, **R** will replace each argument name in the function body
with the corresponding value that we supply. If we don't supply a value, **R**
will replace the argument name with the argument's default value (if we defined
one).
We can also use our functions to create other functions. Imagine that we want to multiply the solutions to our equation by an arbitrary value (maybe because we want to convert the units of $x$ to something else). And let's pretend that, by default, we want to double the solutions. We can write another function that first calls `solve_quadratic()` and then multiplies the result by our arbitrary number.
```{r}
multiply_solutions = function(a, b, c, multiplier = 2) {
solve_quadratic(a, b, c) * multiplier
}
multiply_solutions(a = 1, b = -1, c = -3)
multiply_solutions(a = 1, b = -1, c = -3, multiplier = 10)
```
::: {.callout-note}
## Objects created in functions disappear
All of the objects that we create inside a function will disappear after it finishes running. Only the output will remain, and to save it we need to assign it to an object.
:::
Being able to write our own functions is great, but we don't need to reinvent the wheel every time we need to do something that is not available in **R**'s default version. We can easily download packages from CRAN's online repositories to get many useful functions.
## Acquiring external packages
To install a package from CRAN, we can use the `install.packages()` function. For example, if we wanted to install the package `readxl` (for loading .xslx files), we would need:
```{r installing readxl}
#| eval: false
install.packages("readxl", dependencies = TRUE)
```
The argument `dependencies` tells **R** whether it should also download other packages that `readxl` needs to work.
**R** may ask you to select a CRAN mirror which, simply put, refers to the location of the servers you want to download from. Choose a mirror close to where you are.
After installing a package, we need to load it into **R** before we can use its functions. To load the package `readxl`, we need to use the function `library()`, which will also load any other packages required to load `readxl` and may print additional information.
```{r loading readxl}
#| eval: false
library("readxl")
```
Whenever we start a new **R** session we need to load the packages we need. If we try to run a function without loading its package first, **R** will not be able to find it and will return an error message.
Writing all our `library()` statements at the top of our **R** scripts is almost always good because it helps us know that we need to load the libraries at the start our sessions. It also helps others know quickly that they will need to have those libraries installed to be able to use our code.
A library can contain many objects, but sometimes we only need one or two of its functions. To avoid loading the entire library, we can access the specific function directly by specifying the package name followed by two colons and then the function name. For example:
```{r using specific function from library}
#| eval: false
readxl::read_xlsx("fake_data_file.xlsx")
```
## Exercise
Let's try to practice all of the basic features of **R** that you just learned. Write a function that can simulate the roll of a pair of six-sided dice (let's call them red and blue) an arbitrary number of times. This function should return a vector with the values of the red die that were strictly larger than the corresponding values of the blue die. Hint: to simulate rolling a die, you can use the function `sample()`.
::: {.callout-tip collapse="true"}
## Exercise step by step
+ Step 1: define a function that takes one argument, `num_rolls`, representing
the number of times to roll the dice.
+ Step 2: create two objects called `red` and `blue` to store the results from
the dice rolls.
+ Step 3: simulate the dice rolls using function `sample()` (read its help page
if you need to).
+ Step 4: create a vector of indices that identifies the values in the red die
that were larger than the values in the blue die.
+ Step 5: use this vector of indices to extract the values from the red die.
+ Step 6: make sure that your function returns the values you extracted in step 5.
:::
## References
Most of this section is based on ["Hands-On Programming with R"](https://rstudio-education.github.io/hopr/), by Garret Grolemund; and on ["An Introduction to R"](https://intro2r.com/), by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau.