From 78dceab9ef7443c19624ca5b690e22780e6603b9 Mon Sep 17 00:00:00 2001
From: Upendra Raj Bhattarai
Date: Mon, 12 Aug 2024 11:10:05 -0400
Subject: [PATCH] Deployed 240510a with MkDocs version: 1.6.0

---
 Workshop_Schedule/index.html | 28 +++++++++---------
 search/search_index.json     |  2 +-
 sitemap.xml                  | 54 +++++++++++++++++------------------
 sitemap.xml.gz               | Bin 624 -> 623 bytes
 stylesheets/extra.css        | 16 +++++------
 5 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/Workshop_Schedule/index.html b/Workshop_Schedule/index.html
index d645707..e646854 100644
--- a/Workshop_Schedule/index.html
+++ b/Workshop_Schedule/index.html
@@ -434,19 +434,19 @@

Day 1
-1. Workshop Introduction
+Workshop Introduction
 Welcome and housekeeping
 Will
 10:00-10:30
-2. Intro to R and RStudio
+Intro to R and RStudio
 Introduction to R and RStudio
 Noor
 10:30-11:45
-3. Self learning materials
+Self learning materials
 Overview of self-learning materials
 Will
 11:45-12:00

Day 2
-4. Review self-learning
+Review self-learning
 Questions about self-learning
 All
 10:00-10:50
-5. In-class exercises
+In-class exercises
 Use and customize function and arguments
 Noor
 10:50-11:15
-6. Data Wrangling
+Data Wrangling
 Subsetting Vectors and Factors
 Will
 11:15-12:00

Day 3
-7. Review self-learning
+Review self-learning
 Questions about self-learning
 All
 10:00-10:35
-8. In-class exercises
+In-class exercises
 Customizing functions and arguments
 Will
 10:50-11:15
-9. Plotting with ggplot2
+Plotting with ggplot2
 ggplot2 for data visualization
 Noor
 11:15-12:00

Day 4
-10. Review self-learning
+Review self-learning
 Questions about self-learning
 All
 10:00-10:35
-11. In-class exercises
+In-class exercises
 In class exercises
 Will
 10:50-11:15
-12. Discussion
+Discussion
 Q&A
 Noor
 11:15 - 11:45
-13. Wrap Up
+Wrap Up
 Wrap up and checking out
 Noor
 11:45 - 12:00

Additional resources
-2024-08-09
+2024-08-12

diff --git a/search/search_index.json b/search/search_index.json
index bc98425..b901194 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"Introduction to R

Audience | Computational skills required | Duration
Biologists | None | 4-session online workshop (~ 8 hours of trainer-led time)
"},{"location":"#description","title":"Description","text":"

This repository contains teaching materials for a hands-on Introduction to R workshop taught online. The workshop introduces participants to the basics of R and RStudio. R is a simple programming environment that enables the effective handling of data, while providing excellent graphical support. RStudio is a tool that provides a user-friendly environment for working with R. These materials are intended to provide both basic R programming knowledge and show how to apply it for more efficient data analysis.

Note for Trainers

The schedule linked below assumes that learners will spend 2-3 hours between classes reading through, and completing exercises from, selected lessons. The online component of the workshop focuses on additional exercises and discussion.

"},{"location":"#learning-objectives","title":"Learning Objectives","text":"
  1. R syntax:

    Become familiar with basic R syntax and the use of RStudio.

  2. Data types and data structures:

    Describe frequently-used data types and data structures in R.

  3. Data inspection and wrangling:

    Demonstrate the use of functions and indices to inspect and subset data from various data structures.

  4. Data visualization:

    Apply the ggplot2 package to create plots for data visualization.

"},{"location":"#setup-requirements","title":"Setup Requirements","text":"

Download the most recent version of R and RStudio for the appropriate OS following the links below.

R software download

RStudio download

All the data files used in the lessons are linked within, but can also be accessed through the link below.

Dataset download

"},{"location":"#lessons","title":"Lessons","text":"
  • Trainer-led workshop: Click here

  • Self-learning materials: Click here

Attribution & Citation

  • These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived from work that is Copyright © Data Carpentry. All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

  • To cite material from this course in your publications, please use:

    Meeta Mistry, Mary Piper, Jihe Liu, & Radhika Khetani. (2021, May 5). hbctraining/Intro-to-R-flipped: R workshop first release. Zenodo. https://doi.org/10.5281/zenodo.4739342

  • A lot of time and effort went into the preparation of these materials. Citations help us understand the needs of the community, gain recognition for our work, and attract further funding to support our teaching activities. Thank you for citing this material if it helped you in your data analysis.

"},{"location":"Workshop_Schedule/","title":"Workshop Schedule","text":"Workshop Schedule"},{"location":"Workshop_Schedule/#day-1","title":"Day 1","text":"Lesson Overview Instructor Time 1. Workshop Introduction Welcome and housekeeping Will 10:00-10:30 2. Intro to R and RStudio Introduction to R and RStudio Noor 10:30-11:45 3. Self learning materials Overview of self-learning materials Will 11:45-12:00"},{"location":"Workshop_Schedule/#before-the-next-class","title":"Before the next class","text":"

A. Please study the contents and work through all the code within the following lessons.

B. Complete the exercises:

  • Each lesson above contains exercises; please go through each of them.

  • Copy your solutions into the Google Form using the submit link below, by the day before the next class

Questions?

If you get stuck due to an error while running code in the lesson, email us.

  • 1. R Syntax and Data Structure

    About data types and data structure

    In order to utilize R effectively, you will need to understand what types of data you can use in R and also how you can store data in \"objects\" or \"variables\".

    This lesson will cover:

    • Assigning a value to an object

    • What types of information can you store in R

    • What are the different objects that you can use to store data in R

  • 2. Functions and Arguments

    Functions and Arguments in R

    Functions are the basic \"commands\" used in R to get something done. To use functions (denoted by function_name followed by \"()\"), one has to enter some information within the parenthesis and optionally some arguments to change the default behavior of a function.

    You can also create your own functions! When you want to perform a task or a series of tasks more than once, creating a custom function is the best way to go.

    In this lesson you will explore:

    • Using built-in functions

    • Creating your own custom functions

  • 3. Reading in and inspecting data

    Read and inspect data structures in R

    When using R, it is almost a certainty that you will have to bring data into the R environment.

    In this lesson you will learn:

    • Reading different types (formats) of data

    • Inspecting the contents and structure of the dataset once you have read it in
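
    As a preview, reading in and inspecting a file typically looks like the sketch below (the file name here is illustrative, not the lesson's actual file):

    metadata <- read.csv(\"data/samples.csv\")   # read a CSV file into a data frame\nhead(metadata)                             # peek at the first six rows\nstr(metadata)                              # inspect the structure of the object\n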

  • Submit here:

    Submit a day before the next class.

"},{"location":"Workshop_Schedule/#day-2","title":"Day 2","text":"Lesson Overview Instructor Time 4. Review self-learning Questions about self-learning All 10:00-10:50 5. In-class exercises Use and customize function and arguments Noor 10:50-11:15 6. Data Wrangling Subsetting Vectors and Factors Will 11:15-12:00"},{"location":"Workshop_Schedule/#before-the-next-class_1","title":"Before the next class","text":"

A. Please study the contents and work through all the code within the following lessons.

B. Complete the exercises:

  • Each lesson above contains exercises; please go through each of them.

  • Copy your solutions into the Google Form using the submit link below, by the day before the next class

Questions?

If you get stuck due to an error while running code in the lesson, email us.

  • 1. Packages and libraries

    Installing and loading packages in R

    Base R is incredibly powerful, but it cannot do everything. R has been built to encourage community involvement in expanding functionality. Thousands of supplemental add-ons, also called \"packages\", have been contributed by the community. Each package comprises several functions that enable users to perform their desired analysis.

    This lesson will cover:

    • Descriptions of package repositories

    • Installing a package

    • Loading a package

    • Accessing the documentation for your installed packages and getting help
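
    For orientation, installing and loading a CRAN package follows this pattern (ggplot2 is just an example):

    install.packages(\"ggplot2\")   # install once per machine\nlibrary(ggplot2)              # load in every R session that uses it\n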

  • 2. Data wrangling: data frames, matrices and lists

    Subset, merge, and create new datasets

    In class we covered data wrangling (extracting/subsetting information) from single-dimensional objects (vectors, factors). The next step is to learn how to wrangle data in two-dimensional objects.

    This lesson will cover:

    • Examining and extracting values from two-dimensional data structures using indices, row names, or column names

    • Retrieving information from lists
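
    As a preview, extracting values from a two-dimensional object uses indices or names; a minimal sketch with the df data frame created in class:

    df[1, 2]           # value in row 1, column 2\ndf[ , \"species\"]   # an entire column, by name\ndf$glengths        # another column, via the $ shortcut\n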

  • 3. The %in% operator

    %in% operator, any and all functions

    Very often you will have to compare two vectors to figure out if, and which, values are common between them. The %in% operator can be used for this purpose.

    This lesson will cover:

    • Implementing the %in% operator to evaluate two vectors

    • Distinguishing %in% from == and other logical operators

    • Using any() and all() functions
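
    A minimal sketch of these three tools, with made-up vectors:

    A <- c(\"gene1\", \"gene2\", \"gene3\")\nB <- c(\"gene2\", \"gene3\", \"gene4\")\n\nA %in% B         # logical vector: is each element of A present in B?\nany(A %in% B)    # TRUE if at least one element of A is in B\nall(A %in% B)    # TRUE only if every element of A is in B\n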

  • 4. Reordering and matching

    Ordering of vectors and data frames

    Sometimes you will want to rearrange values within a vector (row names or column names). The match() function can be very powerful for this task.

    This lesson will cover:

    • Manually rearranging values within a vector

    • Implementing the match() function to automatically rearrange the values within a vector
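
    A minimal sketch of match(), with made-up vectors:

    x <- c(\"A\", \"B\", \"C\", \"D\")\ny <- c(\"C\", \"D\", \"A\", \"B\")\n\nidx <- match(x, y)   # position of each element of x within y\ny[idx]               # y reordered to match the order of x\n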

  • 5. Data frame for plotting

    Learn about map() function for iterative tasks

    We will be starting with visualization in the next class. To set up for this, you need to create a new metadata data frame with information from the counts data frame. You will need to use a function over every column within the counts data frame iteratively. You could do that manually, but it is error-prone; the map() family of functions makes this more efficient.

    This lesson will cover:

    • Utilizing map_dbl() to take the average of every column in a data frame

    • Briefly discussing other functions within the map() family

    • Creating a new data frame for plotting
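
    A minimal sketch of map_dbl() (this counts data frame is a made-up stand-in for the lesson's data):

    library(purrr)\n\ncounts <- data.frame(sample1 = c(10, 20, 30),\n                     sample2 = c(5, 15, 25))\n\nmap_dbl(counts, mean)   # a named numeric vector of column means\n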

  • Submit here

    Submit a day before the next class.

Prepare for in-class exercise:

  • Download the data and place the file into the data directory.
Data | Download link
Animal data | Right click & Save link as...
  • Read the .csv file into your environment and assign it to a variable called animals; a sketch follows after this list. Be sure to check that your row names are the different animals.

  • Save the R project when you close RStudio.
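
  One way to approach the read-in step (the file name is an assumption; match it to the downloaded file):

  animals <- read.csv(\"data/animals.csv\", row.names = 1)   # use the first column as row names\nhead(animals)   # check that the row names are the different animals\n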

"},{"location":"Workshop_Schedule/#day-3","title":"Day 3","text":"Lesson Overview Instructor Time 7. Review self-learning Questions about self-learning All 10:00-10:35 8. In-class exercises Customizing functions and arguments Will 10:50-11:15 9. Plotting with ggplot2 ggplot2 for data visualization Noor 11:15-12:00"},{"location":"Workshop_Schedule/#before-the-next-class_2","title":"Before the next class","text":"
  A. Please study the contents and work through all the code within the following lessons.

  B. Complete the exercises:

    • Each lesson above contains exercises; please go through each of them.

    • Copy your solutions into the Google Form using the submit link below, by the day before the next class

Questions?

If you get stuck due to an error while running code in the lesson, email us.

  • 1. Custom functions for plots

    Consistent formats for plotting

    When creating plots in ggplot2, you may want consistent formatting across your plots (using theme() functions), e.g. if you are generating plots for a manuscript.

    This lesson will cover:

    • Developing a custom function for creating consistently formatted plots
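
    A minimal sketch of such a function (the specific theme settings are arbitrary examples, not the lesson's exact choices):

    library(ggplot2)\n\npersonal_theme <- function() {\n  theme_bw() +\n    theme(axis.title = element_text(size = rel(1.5)),\n          plot.title = element_text(size = rel(1.5), hjust = 0.5))\n}\n

    Adding personal_theme() as a layer to each plot then keeps the formatting consistent.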
  • 2. Barplot with ggplot2

    Customizing barplots with ggplot2

    Previously, you created a scatterplot using ggplot2. However, ggplot2 can be used to create a very wide variety of plots. One of the other frequently used plots you can create with ggplot2 is a barplot.

    This lesson will cover:

    • Creating and customizing a barplot using ggplot2
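
    A minimal barplot sketch (the data frame is made up):

    library(ggplot2)\n\nplot_df <- data.frame(group = c(\"A\", \"B\", \"C\"),\n                      count = c(10, 25, 15))\n\nggplot(plot_df, aes(x = group, y = count)) +\n  geom_col()   # bar heights taken directly from the data\n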
  • 3. Exporting files and plots

    Writing files and plots in different formats

    Now that you have completed some analysis in R, you will need to eventually export that work out of R/RStudio. R provides lots of flexibility in what and how you export your data and plots.

    This lesson will cover:

    • Exporting your figures from R using a variety of file formats

    • Writing your data from R to a file
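
    Two common patterns, sketched with arbitrary file names (the target directories must already exist):

    ggsave(\"figures/my_plot.pdf\", width = 6, height = 4)    # save the last ggplot2 plot\nwrite.csv(df, file = \"results/df.csv\", quote = FALSE)   # write a data frame to a CSV file\n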

  • 4. Finding help

    How to best look for help

    Hopefully, this course has given you the basic tools you need to be successful when using R. However, it would be impossible to cover every aspect of R, and you will need to troubleshoot future issues as they arise.

    This lesson will cover:

    • Suggestions for how to best ask for help

    • Where to look for help

  • 5. Tidyverse

    Data wrangling within Tidyverse

    The Tidyverse suite of integrated packages is designed to work together to make common data science operations more user-friendly. Tidyverse is becoming increasingly prevalent, and it is important that R users be conversant in its basics. We have already used two Tidyverse packages in this workshop (ggplot2 and purrr), and in this lesson we will learn some key features from a few additional packages that make up the Tidyverse.

    This lesson will cover:

    • Usage of pipes for connecting together multiple commands

    • Tibbles for two-dimensional data storage

    • Data wrangling within Tidyverse
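
    As a taste of the pipe, a minimal sketch chaining two steps (the values are made up):

    library(dplyr)\n\nc(1.2, 3.4, 5.6) %>%\n  mean() %>%\n  round(digits = 1)\n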

  • Submit here

    Submit a day before the next class.

"},{"location":"Workshop_Schedule/#day-4","title":"Day 4","text":"Lesson Overview Instructor Time 10. Review self-learning Questions about self-learning All 10:00-10:35 11. In-class exercises In class exercises Will 10:50-11:15 12. Discussion Q&A Noor 11:15 - 11:45 13. Wrap Up Wrap up and checking out Noor 11:45 - 12:00"},{"location":"Workshop_Schedule/#additional-exercises-and-answer-keys","title":"Additional exercises and answer keys","text":"
  • Final Exercises

Answer Keys

  • Answer Keys Day 1
  • Answer Keys Day 2
  • Answer Keys Day 3
  • Answer Keys Final exercise
"},{"location":"Workshop_Schedule/#additional-resources","title":"Additional resources","text":"
  • Building on the basic R knowledge

    • DGE workshop
    • Single-cell RNA-seq workshop
    • RMarkdown
    • Functional analysis
    • More ggplot2
    • ggplot2 cookbook
    • Running R and RStudio on O2
  • Resources

    • Online learning resources
    • All hbctraining materials

    Cheatsheets

    • base R cheatsheet
    • RStudio cheatsheet
    • ggplot2 cheatsheet

Attribution & Citation

  • These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived from work that is Copyright © Data Carpentry. All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

  • To cite material from this course in your publications, please use:

    Meeta Mistry, Mary Piper, Jihe Liu, & Radhika Khetani. (2021, May 5). hbctraining/Intro-to-R-flipped: R workshop first release. Zenodo. https://doi.org/10.5281/zenodo.4739342

  • A lot of time and effort went into the preparation of these materials. Citations help us understand the needs of the community, gain recognition for our work, and attract further funding to support our teaching activities. Thank you for citing this material if it helped you in your data analysis.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/","title":"Introduction to R and RStudio","text":"

Approximate time: 45 minutes

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#learning-objectives","title":"Learning Objectives","text":"
  • Describe what R and RStudio are.
  • Interact with R using RStudio.
  • Become familiar with the various components of RStudio.
  • Employ variables in R.
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#what-is-r","title":"What is R?","text":"

A common misconception is that R is just a programming language; in fact, it is much more than that. Think of R as an environment for statistical computing and graphics, which brings together a number of features to provide powerful functionality.

The R environment combines:

  • effective handling of big data
  • collection of integrated tools
  • graphical facilities
  • simple and effective programming language
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#why-use-r","title":"Why use R?","text":"

R is a powerful, extensible environment. It has a wide range of statistics and general data analysis and visualization capabilities.

  • Data handling, wrangling, and storage
  • Wide array of statistical methods and graphical techniques available
  • Easy to install on any platform and use (and it\u2019s free!)
  • Open source with a large and growing community of peers

Examples of R used in the media and science:

  • \"At the BBC data team, we have developed an R package and an R cookbook to make the process of creating publication-ready graphics in our in-house style...\" - BBC Visual and Data Journalism cookbook for R graphics

  • \"R package of data and code behind the stories and interactives at FiveThirtyEight.com, a data-driven journalism website founded by Nate Silver (initially began as a polling aggregation site, but now covers politics, sports, science and pop culture) and owned by ESPN...\" - fivethirtyeight Package

  • Single Cell RNA-seq Data analysis with Seurat

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#what-is-rstudio","title":"What is RStudio?","text":"

RStudio is a freely available, open-source integrated development environment (IDE). RStudio provides an environment with many features that make using R easier, and it is a great alternative to working with R in the terminal.

  • Graphical user interface, not just a command prompt
  • Great learning tool
  • Free for academic use
  • Platform agnostic
  • Open source
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#creating-a-new-project-directory-in-rstudio","title":"Creating a new project directory in RStudio","text":"

Let's create a new project directory for our Introduction to R lesson today.

  1. Open RStudio.
  2. Go to the File menu and select New Project.
  3. In the New Project window, choose New Directory. Then, choose New Project. Name your new directory Intro-to-R and then \"Create the project as subdirectory of:\" the Desktop (or location of your choice).
  4. Click on Create Project.

  5. After your project is created, if it does not automatically open in RStudio, go to the File menu, select Open Project, and choose Intro-to-R.Rproj.
  6. When RStudio opens, you will see three panels in the window.
  7. Go to the File menu, select New File, and select R Script.
  8. Go to the File menu, select Save As..., type Intro-to-R.R, and select Save.

The RStudio interface should now look like the screenshot below.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#what-is-a-project-in-rstudio","title":"What is a project in RStudio?","text":"

It is simply a directory that contains everything related to your analyses for a specific project. RStudio projects are useful when you are working on context-specific analyses and you wish to keep them separate. When creating a project in RStudio, you associate it with a working directory of your choice (either an existing one or a new one). A .Rproj file is created within that directory, and it keeps track of your command history and the variables in your environment. The .Rproj file can be used to reopen the project in its current state at a later date.

When a project is (re)opened within RStudio the following actions are taken:

  • A new R session (process) is started
  • The .RData file in the project's main directory is loaded, populating the environment with any objects that were present when the project was closed.
  • The .Rhistory file in the project's main directory is loaded into the RStudio History pane (and used for Console Up/Down arrow command history).
  • The current working directory is set to the project directory.
  • Previously edited source documents are restored into editor tabs.
  • Other RStudio settings (e.g. active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed.

Information adapted from RStudio Support Site

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#rstudio-interface","title":"RStudio Interface","text":"

The RStudio interface has four main panels:

  1. Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio.
  2. Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console.
  3. Environment/History: the Environment tab shows all active objects, and the History tab keeps track of all commands run in the console
  4. Files/Plots/Packages/Help
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#organizing-and-setting-up-rstudio","title":"Organizing and Setting up RStudio","text":""},{"location":"day_1/D1.2_introR-R-and-RStudio/#viewing-your-working-directory","title":"Viewing your working directory","text":"

Before we organize our working directory, let's check to see where our current working directory is located by typing into the console:

getwd()\n

Your working directory should be the Intro-to-R folder constructed when you created the project. The working directory is where RStudio will automatically look for any files you bring in and where it will automatically save any files you create, unless otherwise specified.

You can visualize your working directory by selecting the Files tab from the Files/Plots/Packages/Help window.

If you wanted to choose a different directory to be your working directory, you could navigate to a different folder in the Files tab, then click on the More dropdown menu (which appears as a cog) and select Set As Working Directory.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#structuring-your-working-directory","title":"Structuring your working directory","text":"

To organize your working directory for a particular analysis, you should separate the original data (raw data) from intermediate datasets. For instance, you may want to create a data/ directory within your working directory that stores the raw data, and have a results/ directory for intermediate datasets and a figures/ directory for the plots you will generate.

Let's create these three directories within your working directory by clicking on New Folder within the Files tab.
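
Equivalently, you could create the same three directories from the console:

dir.create(\"data\")\ndir.create(\"results\")\ndir.create(\"figures\")\n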

When finished, your working directory should look like:

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#setting-up","title":"Setting up","text":"

This is more of a housekeeping task. We will be writing long lines of code in our script editor and want to make sure that the lines wrap and you don't have to scroll back and forth to look at your long line of code.

Click on Edit at the top of your RStudio screen and click on Preferences... in the pull down menu.

On the left, select Code and put a check against Soft-wrap R source files. Make sure you click the Apply button at the bottom of the Window before saying OK.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#interacting-with-r","title":"Interacting with R","text":"

Now that we have our interface and directory structure set up, let's start playing with R! There are two main ways of interacting with R in RStudio: using the console, or using the script editor (plain text files that contain your code).

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#console-window","title":"Console window","text":"

The console window (in RStudio, the bottom left panel) is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. You can type commands directly into the console, but they will be forgotten when you close the session.

Let's test it out:

3 + 5\n

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#script-editor","title":"Script editor","text":"

Best practice is to enter the commands in the script editor, and save the script. You are encouraged to comment liberally to describe the commands you are running using #. This way, you have a complete record of what you did, you can easily show others how you did it and you can do it again later on if needed.

Now let's try entering commands in the script editor, using the comment character # to add descriptions, and running the code chunk.

# Intro to R Lesson\n# Feb 16th, 2016\n# Interacting with R\n\n## I am adding 3 and 5. R is fun!\n3+5\n

The RStudio script editor allows you to 'send' the current line or the currently highlighted text to the R console by clicking on the Run button in the upper-right hand corner of the script editor.

Alternatively, you can run it by pressing the Ctrl and Return/Enter keys at the same time as a shortcut.

You should see the command run in the console and output the result.


What happens if we do that same command without the comment symbol #? Re-run the command after removing the # sign in the front:

I am adding 3 and 5. R is fun!\n3+5\n

Error

Error: unexpected symbol in \"I am\"\n

Now R is trying to run that sentence as a command, and it doesn't work. We get an error message in the console. It means the R interpreter did not know what to do with that command.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#console-command-prompt","title":"Console command prompt","text":"

Interpreting the command prompt can help you understand when R is ready to accept commands. The table below lists the different states of the command prompt and how you can exit a command:

Prompt/command | Meaning | Remarks
> | Console is ready to accept commands | When the console receives a command (typed directly into the console or run from the script editor with Ctrl+Enter), R will try to execute it.
+ | Console is waiting for you to enter more data | You haven't finished entering a complete command; often this is because you haven't 'closed' a parenthesis or quotation.
ESC | Escapes the command and brings back a new prompt > | If you are in RStudio and can't figure out why your command isn't running, click inside the console window and press ESC.
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#keyboard-shortcuts-in-rstudio","title":"Keyboard shortcuts in RStudio","text":"

In addition to some of the shortcuts described earlier in this lesson, we have listed a few more that can be helpful as you work in RStudio.

Key | Action
Ctrl+Enter | Run command from script editor in console
ESC | Escape the current command to return to the command prompt
Ctrl+1 | Move cursor from console to script editor
Ctrl+2 | Move cursor from script editor to console
Tab | Complete a file path
Ctrl+Shift+C | Comment the block of highlighted text

Exercise

Try highlighting only 3 + from your script editor and running it. Find a way to bring back the command prompt > in the console.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#the-r-syntax","title":"The R syntax","text":"

Now that we know how to talk with R via the script editor or the console, we want to use R for something more than adding numbers. To do this, we need to know more about the R syntax.

The main parts of speech in R (syntax) include:

  • The comment character # and how it is used to document code
  • Variables and functions
  • The assignment operator <-
  • The = for arguments in functions

We will go through each of these parts of speech in more detail, starting with the assignment operator.

Note

Indentation and consistency in spacing are used to improve clarity and legibility.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#assignment-operator","title":"Assignment operator","text":"

To do useful and interesting things in R, we need to assign values to variables using the assignment operator, <-. For example, we can use the assignment operator to assign the value of 3 to x by executing:

x <- 3\n

The assignment operator (<-) assigns values on the right to variables on the left.

Note

In RStudio, typing Alt + - (push Alt at the same time as the - key) on Windows/Linux, or Option + - on a Mac, will write <- in a single keystroke.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#variables","title":"Variables","text":"

A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to \"buckets\", where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.

In the example above, we created a variable or a 'bucket' called x. Inside we put a value, 3.

Let's create another variable called y and give it a value of 5.

y <- 5\n

When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses, or by typing the variable name.

y\n
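
For example, wrapping the assignment in parentheses assigns and prints in a single step:

(y <- 5)\n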

You can also view information on the variable by looking in your Environment window in the upper right-hand corner of the RStudio interface.

Now we can reference these buckets by name to perform mathematical operations on the values contained within. What do you get in the console for the following operation:

x + y\n

Try assigning the results of this operation to another variable called number.

number <- x + y\n

Exercise

  1. Try changing the value of the variable x to 5. What happens to number?
  2. Now try changing the value of variable y to contain the value 10. What do you need to do to update the variable number?

Tips on variable names

Variables can be given almost any name, such as x, current_temperature, or subject_id. However, there are some rules and suggestions you should keep in mind:

  • Make your names explicit and not too long.
  • Avoid names starting with a number (2x is not valid but x2 is)
  • Avoid names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it's allowed, it's best to not use other function names (e.g., c, T, mean, data) as variable names. When in doubt check the help to see if the name is already in use.
  • Avoid dots (.) within a variable name as in my.dataset. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it's best to avoid them.
  • Use nouns for object names and verbs for function names
  • Keep in mind that R is case sensitive (e.g., genome_length is different from Genome_length)
  • Be consistent with the styling of your code (where you put spaces, how you name variables, etc.). In R, two popular style guides are Hadley Wickham's style guide and Google's.
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#interacting-with-data-in-r","title":"Interacting with data in R","text":"

R is commonly used for handling big data, and so it only makes sense that we learn about R in the context of some kind of relevant data. Let's take a few minutes to add files to the folders we created and familiarize ourselves with the data.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#adding-files-to-your-working-directory","title":"Adding files to your working directory","text":"

You can access the files we need for this workshop using the links provided below. Right click on each link and select \"Save link as...\", then choose ~/Desktop/Intro-to-R/data as the destination of the file. You should now see the file appear in your working directory. We will discuss these files a bit later in the lesson.

Data | Download links
Normalized count data | Right click & Save link as...
Metadata file | Right click & Save link as...
Functional analysis output | Right click & Save link as...

NOTE

If the files download automatically to some other location on your laptop, you can move them to your working directory using your file explorer or finder (outside RStudio), or by navigating to the files in the Files tab of the bottom right panel of RStudio.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#about-the-dataset","title":"About the dataset","text":"

The count data

In this example dataset, we have collected whole brain samples from 12 mice and want to evaluate expression differences between them. The expression data represents normalized count data obtained from RNA-sequencing of the 12 brain samples. This data is stored in a comma separated values (CSV) file as a 2-dimensional matrix, with each row corresponding to a gene and each column corresponding to a sample.

The metadata

We have another file in which we identify information about the data or metadata. Our metadata is also stored in a CSV file. In this file, each row corresponds to a sample and each column contains some information about each sample.

The first column contains the row names, and note that these are identical to the column names in our expression data file above (albeit, in a slightly different order). The next few columns contain information about our samples that allow us to categorize them. For example, the second column contains genotype information for each sample. Each sample is classified in one of two categories: Wt (wild type) or KO (knockout). What types of categories do you observe in the remaining columns?

R is particularly good at handling this type of categorical data. Rather than simply storing this information as text, the data is represented in a specific data structure which allows the user to sort and manipulate the data in a quick and efficient manner. We will discuss this in more detail as we go through the different lessons in R!

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#the-functional-analysis-results","title":"The functional analysis results","text":"

We will be using the results of the functional analysis to learn about packages/functions from the Tidyverse suite of integrated packages. These packages are designed to work together to make common data science operations like data wrangling, tidying, reading/writing, parsing, and visualizing, more user-friendly.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#best-practices","title":"Best practices","text":"

Before we move on to more complex concepts and getting familiar with the language, we want to point out a few things about best practices when working with R, which will help you stay organized in the long run:

  • Code and workflow are more reproducible if we can document everything that we do. Our end goal is not just to \"do stuff\", but to do it in a way that anyone can easily and exactly replicate our workflow and results. All code should be written in the script editor and saved to file, rather than working in the console.

  • The R console should be mainly used to inspect objects, test a function or get help.

  • Use # signs to comment. Comment liberally in your R scripts. This will help future you and other collaborators know what each line of code (or code block) was meant to do. Anything to the right of a # is ignored by R. A shortcut for this is Ctrl+Shift+C if you want to comment an entire chunk of text.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/","title":"R Syntax and Data Structures","text":"

Approximate time: 70 min

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#learning-objectives","title":"Learning Objectives","text":"
  • Describe frequently-used data types in R.
  • Construct data structures to store data.
"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#data-types","title":"Data Types","text":"

Variables can contain values of specific types within R. The six data types that R uses include:

  • \"numeric\" for any numerical value, including whole numbers and decimals. This is the most common data type for performing mathematical operations.
  • \"character\" for text values, denoted by using quotes (\"\") around value. For instance, while 5 is a numeric value, if you were to put quotation marks around it, it would turn into a character value, and you could no longer use it for mathematical operations. Single or double quotes both work, as long as the same type is used at the beginning and end of the character value.
  • \"integer\" for whole numbers (e.g., 2L, the L indicates to R that it's an integer). It behaves similar to the numeric data type for most tasks or functions; however, it takes up less storage space than numeric data, so often tools will output integers if the data is known to be comprised of whole numbers. Just know that integers behave similarly to numeric values. If you wanted to create your own, you could do so by providing the whole number, followed by an upper-case L.
  • \"logical\" for TRUE and FALSE (the Boolean data type). The logical data type can be specified using four values, TRUE in all capital letters, FALSE in all capital letters, a single capital T or a single capital F.
  • \"complex\" to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that's all we're going to say about them
  • \"raw\" that we won't discuss further

The table below provides examples of each of the commonly used data types:

Data Type | Examples
Numeric | 1, 1.5, 20, pi
Character | \"anytext\", \"5\", \"TRUE\"
Integer | 2L, 500L, -17L
Logical | TRUE, FALSE, T, F

The type of data will determine what you can do with it. For example, if you want to perform mathematical operations, then your data type cannot be character or logical. Whereas if you want to search for a word or pattern in your data, then your data should be of the character data type. The task or function being performed on the data will determine what type of data can be used.

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#data-structures","title":"Data Structures","text":"

We know that variables are like buckets, and so far we have seen that bucket filled with a single value. Even when number was created, the result of the mathematical operation was a single value. Variables can store more than just a single value; they can store a multitude of different data structures. These include, but are not limited to, vectors (c), factors (factor), matrices (matrix), data frames (data.frame) and lists (list).

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#vectors","title":"Vectors","text":"

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It's basically just a collection of values: mainly numbers, characters, or logical values.
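
For instance (made-up examples, since the lesson's original snippets are not reproduced in this index):

c(2, 8, 11)             # a numeric vector\nc(\"ecoli\", \"human\")     # a character vector\nc(TRUE, FALSE, TRUE)    # a logical vector\n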

Note

All values in a vector must be of the same data type.

If you try to create a vector with more than a single data type, R will try to coerce it into a single data type.

For example, if you were to try to create the following vector:
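
(a made-up stand-in, since the original example is not preserved in this index)

c(1, 20, \"human\", TRUE)\n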

R will coerce it into:
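
a character vector, since character is the only type that can represent every value above:

c(\"1\", \"20\", \"human\", \"TRUE\")\n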

The analogy for a vector is that your bucket now has different compartments; these compartments in a vector are called elements.

Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a single entity (bucket).

Let's create a vector of genome lengths and assign it to a variable called glengths.

Each element of this vector contains a single numeric value, and three values will be combined together into a vector using c() (the combine function). All of the values are put within the parentheses and separated with a comma.

# Create a numeric vector and store the vector as a variable called 'glengths'\nglengths <- c(4.6, 3000, 50000)\nglengths\n

Note

Your environment shows the glengths variable is numeric (num) and tells you the glengths vector starts at element 1 and ends at element 3 (i.e. your vector contains 3 values), as denoted by the [1:3].

A vector can also contain characters. Create another vector called species with three elements, where each element corresponds with the genome sizes vector (in Mb).

# Create a character vector and store the vector as a variable called 'species'\nspecies <- c(\"ecoli\", \"human\", \"corn\")\nspecies\n
What do you think would happen if we forgot to put quotations around one of the values? Let's test it out with corn.

# Forget to put quotes around corn\nspecies <- c(\"ecoli\", \"human\", corn)\n
Note that RStudio is quite helpful in color-coding the various data types. We can see that our numeric values are blue, the character values are green, and if we forget to surround corn with quotes, it's black. What does this mean? Let's try to run this code.

When we try to run this code we get an error specifying that object 'corn' is not found. What this means is that R is looking for an object or variable in the Environment called 'corn', and when it doesn't find it, it returns an error. If we had a character vector called 'corn' in our Environment, then it would combine the contents of the 'corn' vector with the values \"ecoli\" and \"human\".

Since we only want to add the value \"corn\" to our vector, we need to re-run the code with the quotation marks surrounding corn. A quick way to add quotes to both ends of a word in RStudio is to highlight the word, then press the quote key.

# Create a character vector and store the vector as a variable called 'species'\nspecies <- c(\"ecoli\", \"human\", \"corn\")\n

Exercise

Try to create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine c() function to do this.

Print the combined vector in the console, what looks different compared to the original vectors?

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#factors","title":"Factors","text":"

A factor is a special type of vector that is used to store categorical data. Each unique category is referred to as a factor level (i.e. category = level). Factors are built on top of integer vectors such that each factor level is assigned an integer value, creating value-label pairs.

For instance, if we have four animals and the first animal is female, the second and third are male, and the fourth is female, we could create a factor that appears like a vector, but has integer values stored under-the-hood. The integer value assigned is a one for females and a two for males. The numbers are assigned in alphabetical order, so because the f- in females comes before the m- in males in the alphabet, females get assigned a one and males a two. In later lessons we will show you how you could change these assignments.

Let's create a factor vector and explore a bit more. We'll start by creating a character vector describing three different levels of expression. Perhaps the first value represents expression in mouse1, the second value represents expression in mouse2, and so on and so forth:

# Create a character vector and store the vector as a variable called 'expression'\nexpression <- c(\"low\", \"high\", \"medium\", \"high\", \"low\", \"medium\", \"high\")\n

Now we can convert this character vector into a factor using the factor() function:

# Turn 'expression' vector into a factor\nexpression <- factor(expression)\n

So, what exactly happened when we applied the factor() function?

The expression vector is categorical, in that all the values in the vector belong to a set of categories; in this case, the categories are low, medium, and high. By turning the expression vector into a factor, the categories are assigned integers alphabetically, with high=1, low=2, medium=3. This in effect assigns the different factor levels. You can view the newly created factor variable and the levels in the Environment window.

So now that we have an idea of what factors are, when would you ever want to use them?

Factors are extremely valuable for many operations often performed in R. For instance, factors can give order to values with no intrinsic order. In the previous 'expression' vector, if I wanted the low category to be less than the medium category, then we could do this using factors. Also, factors are necessary for many statistical methods. For example, descriptive statistics can be obtained for character vectors if you have the categorical information stored as a factor. Also, if you want to denote which category is your base level for a statistical comparison, then you would need to have your category variable stored as a factor with the base level assigned to 1. Anytime that it is helpful to have the categories thought of as groups in an analysis, the factor function makes this possible. For instance, if you want to color your plots by treatment type, then you would need the treatment variable to be a factor.

Exercises

Let's say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.

  1. Create a vector named samplegroup with nine elements: 3 control (\"CTL\") values, 3 knock-out (\"KO\") values, and 3 over-expressing (\"OE\") values.

  2. Turn samplegroup into a factor data structure.

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#matrix","title":"Matrix","text":"

A matrix in R is a collection of vectors of the same length and identical data type. Vectors can be combined as columns or as rows to create a 2-dimensional structure.

Matrices are commonly used as part of the mathematical machinery of statistics. They are usually of numeric data type and are used in computational algorithms as a checkpoint. For example, if the input data are not all of an identical data type (numeric, character, etc.), functions that require a numeric matrix will throw an error and stop any downstream code execution.
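
A minimal sketch of building a matrix (the values are made up):

m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)   # fills column-wise by default\nm\n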

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#data-frame","title":"Data Frame","text":"

A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it's a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors). In the data frame pictured below, the first column is character, the second column is numeric, the third is character, and the fourth is logical.

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.

We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function, and giving the function the different vectors we would like to bind together. This function will only work for vectors of the same length.

# Create a data frame and store it as a variable called 'df'\ndf <- data.frame(species, glengths)\n

We can see that a new variable called df has been created in our Environment within a new section called Data. In the Environment, it specifies that df has 3 observations of 2 variables. What does that mean? In R, rows always come first, so it means that df has 3 rows and 2 columns. We can get additional information if we click on the blue circle with the white triangle in the middle next to df. It will display information about each of the columns in the data frame, giving information about what the data type is of each of the columns and the first few values of those columns.

Another handy feature in RStudio is that if we hover the cursor over the variable name in the Environment, df, it will turn into a pointing finger. If you click on df, it will open the data frame as its own tab next to the script editor. We can explore the table interactively within this window. To close, just click on the X on the tab.

As with any variable, we can print the values stored inside to the console if we type the variable's name and run.

df\n

Exercise

Create a data frame called favorite_books with the following vectors as columns:

titles <- c(\"Catch-22\", \"Pride and Prejudice\", \"Nineteen Eighty Four\")\npages <- c(453, 432, 328)\n
"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#lists","title":"Lists","text":"

Lists are a data structure in R that can be perhaps a bit daunting at first, but soon become amazingly useful. A list is a data structure that can hold any number of any types of other data structures.

If you have variables of different data structures you wish to combine, you can put all of those into one list object by using the list() function and placing all the items you wish to combine within parentheses:

list1 <- list(species, df, number)\n

We see list1 appear within the Data section of our environment as a list of 3 components or variables. If we click on the blue circle with a triangle in the middle, it's not quite as interpretable as it was for data frames.

Essentially, each component is preceded by a colon. The first colon gives the species vector, the second colon precedes the df data frame (with the dollar signs indicating the different columns), and the last colon gives the single value, number.

If I click on list1, it opens a tab where you can explore the contents a bit more, but it's still not super intuitive. The easiest way to view small lists is to print to the console.

Let's type list1 and print to the console by running it.

list1\n\n[[1]]\n[1] \"ecoli\" \"human\" \"corn\" \n\n[[2]]\n  species glengths\n1   ecoli      4.6\n2   human   3000.0\n3    corn  50000.0\n\n[[3]]\n[1] 5\n

There are three components corresponding to the three different variables we passed in, and you can see that the structure of each is retained. Each component of a list is referenced based on its number position. We will talk more about how to inspect and manipulate components of lists in later lessons.

Exercise

Create a list called list2 containing species, glengths, and number.

Now that we know what lists are, why would we ever want to use them? When getting started with R, you will most likely encounter lists with different tools or functions that you use. Oftentimes a tool will need a list as input, so that all the information needed to run the tool is present in a single variable. Sometimes a tool will output a list when working through an analysis. Knowing how to work with them and extract necessary information will be critically important.

As you become more comfortable with R, you will find yourself using lists more often. One common use of lists is to make iterative processes more efficient. For example, let's say you had multiple data frames containing the same weather information from different cities throughout North America. You wanted to perform the same task on each of the data frames, but that would take a long time to do individually. Instead you could create a list where each data frame is a component of the list. Then, you could perform the task on the list instead, which would be applied to each of the components.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/","title":"Functions in R","text":"

Approximate time: 30 min

"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#learning-objectives","title":"Learning Objectives","text":"
  • Describe and utilize functions in R.
  • Modify default behavior of a function using arguments.
  • Identify R-specific sources of obtaining more information about functions.
  • Demonstrate how to create user-defined functions in R
"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#functions-and-their-arguments","title":"Functions and their arguments","text":""},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#what-are-functions","title":"What are functions?","text":"

A key feature of R is functions. Functions are \"self contained\" modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.

The general usage for a function is the name of the function followed by parentheses:

function_name(input)\n
The input(s) are called arguments, which can include:

  1. the physical object (any data structure) on which the function carries out a task
  2. specifications that alter the way the function operates (e.g. options)

Not all functions take arguments, for example:

getwd()\n

However, most functions can take several arguments. If you don't specify a required argument when calling the function, you will either receive an error or the function will fall back on using a default.

The defaults represent standard values that the author of the function specified as being \"good enough in standard cases\". An example would be what symbol to use in a plot. However, if you want something specific, simply change the argument yourself with a value of your choice.

"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#basic-functions","title":"Basic functions","text":"

We have already used a few examples of basic functions in the previous lessons, i.e. getwd(), c(), and factor(). These functions are available as part of R's built-in capabilities, and we will explore a few more of these base functions below.

Let's revisit a function that we have used previously to combine data into vectors: c(). The arguments it takes are a collection of numbers, characters, or strings, separated by commas. The c() function performs the task of combining them into a single vector. You can also use the function to add elements to an existing vector:

glengths <- c(glengths, 90) # adding at the end \nglengths <- c(30, glengths) # adding at the beginning\n

What happens here is that we take the original vector glengths (containing three elements), and we are adding another item to either end. We can do this over and over again to build a vector or a dataset.

Since R is used for statistical computing, many of the base functions involve mathematical operations. One example would be the function sqrt(). The input/argument must be a number, and the output is the square root of that number. Let's try finding the square root of 81:

sqrt(81)\n

Now what would happen if we called (i.e. ran) the function on a vector of values instead of a single value?

sqrt(glengths)\n

In this case the task was performed on each individual value of the vector glengths and the respective results were displayed.

Let's try another function, this time one for which we can change some of the options (arguments that change the behavior of the function), for example round():

round(3.14159)\n

We can see that we get 3. That's because the default is to round to the nearest whole number. What if we want to keep a certain number of decimal places? Let's first learn how to find the available arguments for a function.

"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#seeking-help-on-arguments-for-functions","title":"Seeking help on arguments for functions","text":"

The best way of finding out this information is to use the ? followed by the name of the function. Doing this will open up the help manual in the bottom right panel of RStudio that will provide a description of the function, usage, arguments, details, and examples:

?round\n

Alternatively, if you are familiar with the function but just need to remind yourself of the names of the arguments, you can use:

args(round)\n

Even more useful is the example() function. This will run the examples from the help page, so you can see exactly how a function works when executing the commands. Let's try that for round():

example(\"round\")\n

In our example, we can change the number of digits returned by adding an argument. We can type digits=2 or however many we may want:

round(3.14159, digits=2)\n

Note

If you provide the arguments in the exact same order as they are defined (in the help manual) you don't have to name them:

round(3.14159, 2)\n
However, this is usually not recommended practice because it involves a lot of memorization. In addition, it makes your code difficult to read for your future self and others, especially if your code includes functions that are not commonly used. (It is, however, OK to omit argument names for basic functions like mean(), min(), etc.) Another advantage of naming arguments is that the order doesn't matter. This is useful when a function has many arguments.
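
For example, both of the calls below return the same result, because named arguments can be supplied in any order:

round(3.14159, digits = 2)\nround(digits = 2, x = 3.14159)   # same result; order doesn't matter when arguments are named\n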

Exercise

  1. Let's use a base R function to calculate the mean value of the glengths vector. You might need to search online to find out which function can perform this task.

  2. Create a new vector test <- c(1, NA, 2, 3, NA, 4). Use the same base R function from exercise 1 (with the addition of the proper argument) to calculate the mean value of the test vector. The output should be 2.5.

    NOTE: In R, missing values are represented by the symbol NA (not available). It\u2019s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. There are ways to ignore NA during statistical calculation, or to remove NA from the vector. If you want more information related to missing data or NA you can go to this page (please note that there are many advanced concepts on that page that have not been covered in class).

  3. Another commonly used base function is sort(). Use this function to sort the glengths vector in descending order.
"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#user-defined-functions","title":"User-defined Functions","text":"

One of the great strengths of R is the user's ability to add functions. Sometimes there is a small task (or series of tasks) you need done and you find yourself having to repeat it multiple times. In these types of situations, it can be helpful to create your own custom function. The structure of a function is given below:

name_of_function <- function(argument1, argument2) {\n    statements or code that does something\n    return(something)\n}\n
  • First you give your function a name.
  • Then you assign a value to it, where the value is the function definition.

When defining the function you will want to provide the list of arguments required (inputs and/or options to modify the behavior of the function) and, wrapped between curly brackets, the tasks that are to be executed on/using those arguments. The argument(s) can be any type of object (like a scalar, a matrix, a dataframe, a vector, a logical, etc.), and it\u2019s not necessary to define what it is in any way.

Finally, you can \u201creturn\u201d the value of the object from the function, meaning pass the value of it into the global environment. The important idea behind functions is that objects that are created within the function are local to the environment of the function \u2013 they don\u2019t exist outside of the function.
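
Here is a minimal sketch of that idea (add_one is a hypothetical function created purely for this illustration):

add_one <- function(x) {\n    result <- x + 1   # 'result' exists only inside the function\n    return(result)\n}\n\nadd_one(5)   # returns 6\n# result     # uncommenting this would error: object 'result' not found\n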

Let's try creating a simple example function. This function will take in a numeric value as input, and return the squared value.

square_it <- function(x) {\n    square <- x * x\n    return(square)\n}\n

Once you run the code, you should see a function named square_it in the Environment panel (located at the top right of the RStudio interface). Now, we can use this function like any other base R function. We type out the name of the function, and inside the parentheses we provide a numeric value x:

square_it(5)\n

Pretty simple, right? In this case, we only had one line of code that was run, but in theory you could have many lines of code to obtain the final result that you want to \"return\" to the user.

Do I always have to return() something at the end of the function?

In the example above, we created a new variable called square inside the function, and then returned its value. If you don't use return(), by default R will return the value of the last line of code inside that function. That is to say, the following function will also work.

square_it <- function(x) {\n    x * x\n}\n
However, as a best practice, we recommend always using return() at the end of a function.

We have only scratched the surface here when it comes to creating functions! We will revisit this in later lessons, but if you are interested, you can also find more detailed information on this R-bloggers site, from which we adapted this example.

Exercise

  1. Write a function called multiply_it, which takes two inputs: a numeric value x, and a numeric value y. The function will return the product of these two numeric values, which is x * y. For example, multiply_it(x=4, y=6) will return output 24.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).
"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/","title":"Reading in and inspecting data","text":""},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#learning-objectives","title":"Learning Objectives","text":"
  • Demonstrate how to read existing data into R
  • Utilize base R functions to inspect data structures
"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#reading-data-into-r","title":"Reading data into R","text":""},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#the-basics","title":"The basics","text":"

Regardless of the specific analysis we are performing, we usually need to bring data into R, so learning how to read in data is a crucial component of learning to use R.

Many functions exist to read data in, and the function in R you use will depend on the file format being read in. Below we have a table with some examples of functions that can be used for importing some common text data types (plain text).

Data type | Extension | Function | Package
Comma separated values | csv | read.csv() | utils (default)
Comma separated values | csv | read_csv() | readr (tidyverse)
Tab separated values | tsv | read_tsv() | readr
Other delimited formats | txt | read.table() | utils
Other delimited formats | txt | read_table() | readr
Other delimited formats | txt | read_delim() | readr

For example, if we have a text file where the columns are separated by commas (comma-separated values or comma-delimited), you could use the function read.csv. However, if the data are separated by a different delimiter in a text file (e.g. \":\", \";\", \" \"), you could use the generic read.table function and specify the delimiter (sep = \" \") as an argument in the function.
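
For instance, a sketch assuming a hypothetical semicolon-delimited file at data/example_file.txt:

example_data <- read.table(\"data/example_file.txt\", sep = \";\", header = TRUE)\n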

In the above table we refer to base R functions as being contained in the \"utils\" package. In addition to base R functions, we have also listed functions from some other packages that can be used to import data, specifically from the \"readr\" package, which is installed when you install the \"tidyverse\" suite of packages.

In addition to plain text files, you can also import data from other statistical analysis packages and Excel using functions from different packages.

Data type | Extension | Function | Package
Stata version 13-14 | dta | read_dta() | haven
Stata version 7-12 | dta | read.dta() | foreign
SPSS | sav | read.spss() | foreign
SAS | sas7bdat | read.sas7bdat() | sas7bdat
Excel | xlsx, xls | read_excel() | readxl (tidyverse)
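
As an example, here is a sketch for reading an Excel file (data/experiment.xlsx is an assumed file name; the sheet argument selects which worksheet to read):

library(readxl)\n\nexcel_data <- read_excel(\"data/experiment.xlsx\", sheet = 1)\n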

Note

These lists are not comprehensive, and many other functions exist for importing data. Once you have been using R for a while, you will likely develop a preference for which functions to use for each data type.

"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#metadata","title":"Metadata","text":"

When working with large datasets, you will very likely be working with a \"metadata\" file, which contains information about each sample in your dataset.

The metadata is very important information, and we encourage you to think about creating a document with as much metadata as you can record before you bring the data into R. Here is some additional reading on metadata from the HMS Data Management Working Group.

"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#the-readcsv-function","title":"The read.csv() function","text":"

Let's bring in the metadata file we downloaded earlier (mouse_exp_design.csv or mouse_exp_design.txt) using the read.csv function.

First, check the arguments for the function using the ? to ensure that you are entering all the information appropriately:

?read.csv\n

The first thing you will notice is that you've pulled up the documentation for read.table(); this is because read.table() is the parent function, and all the other functions in the family share its documentation.

The next item on the documentation page is the function Description, which specifies that the output of this set of functions is going to be a data frame - \"Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.\"

In usage, all of the arguments listed for read.table() are the default values for all of the family members unless otherwise specified for a given function. Let's take a look at 2 examples:

  1. The separator

    • in the case of read.table() it is sep = \"\" (space or tab)
    • whereas for read.csv() it is sep = \",\" (a comma).
  2. The header

    This argument refers to the column headers that may (TRUE) or may not (FALSE) exist in the plain text file you are reading in.

    • in the case of read.table() it is header = FALSE (by default, it assumes you do not have column names)
    • whereas for read.csv() it is header = TRUE (by default, it assumes that all your columns have names listed).

The take-home from the \"Usage\" section for read.csv() is that it has one mandatory argument, the path to the file and filename in quotations; in our case that is data/mouse_exp_design.csv or data/mouse_exp_design.txt.

The stringsAsFactors argument

Note that the read.table {utils} family of functions has an argument called stringsAsFactors, which by default will take the value of default.stringsAsFactors().

Type out default.stringsAsFactors() in the console to check what the default value is for your current R session. Is it TRUE or FALSE?

If default.stringsAsFactors() is set to TRUE, then stringsAsFactors = TRUE. In that case any function in this family of functions will coerce character columns in the data you are reading in to factor columns (i.e. coerce from vector to factor) in the resulting data frame.

If you want to maintain the character vector data structure (e.g. for gene names), you will want to make sure that stringsAsFactors = FALSE (or that default.stringsAsFactors() is set to FALSE).
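
As a sketch of setting this explicitly (df is a throwaway variable name used only for this example):

# Keep character columns as character vectors instead of coercing them to factors\ndf <- read.csv(file = \"data/mouse_exp_design.csv\", stringsAsFactors = FALSE)\n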

"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#create-a-data-frame-by-reading-in-the-file","title":"Create a data frame by reading in the file","text":"

At this point, please check the extension for the mouse_exp_design file within your data folder. You will have to type it accordingly within the read.csv() function.

Note

read.csv is not fussy about extensions for plain text files, so even though the file we are reading in is a comma-separated value file, it will be read in properly even with a .txt extension.

Let's read in the mouse_exp_design file and create a new data frame called metadata.

metadata <- read.csv(file=\"data/mouse_exp_design.csv\")\n\n# OR \n# metadata <- read.csv(file=\"data/mouse_exp_design.txt\")\n

NOTE

RStudio supports the automatic completion of code using the Tab key. This is especially helpful when reading in files, to ensure the correct file path. The tab completion feature also provides a shortcut to listing objects, and inline help for functions. Tab completion is your friend! We encourage you to use it whenever possible.

Go to your Global environment and click on the name of the data frame you just created.

When you do this the metadata table will pop up on the top left hand corner of RStudio, right next to the R script.

You should see a subtle coloring (blue-gray) of the first row and first column; the rest of the table will have a white background. This is because your first row and first column have different properties than the rest of the table: they are the names of the rows and columns, respectively.

Earlier we noted that the file we just read in had column names (first row of values) and how read.csv() deals with \"headers\". In addition to column headers, read.csv() also assumes that the first column contains the row names. Not all functions in the read.table() family of functions will do this and depending on which one you use, you may have to specify an additional argument to properly assign the row names and column names.

Note

Row names and column names are really handy when subsetting data structures and they are also helpful to identify samples or genes. We almost always use them with data frames.

Exercise 1

  1. Download this tab-delimited .txt file and save it in your project's data folder.
  2. Read it in to R using read.table() with the appropriate arguments and store it as the variable proj_summary. To figure out the appropriate arguments to use with read.table(), keep the following in mind:
    • all the columns in the input text file have column name/headers
    • you want the first column of the text file to be used as row names (hint: look up the input for the row.names = argument in read.table())
  3. Display the contents of proj_summary in your console
"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#inspecting-data-structures","title":"Inspecting data structures","text":"

There are a wide selection of base functions in R that are useful for inspecting your data and summarizing it. Let's use the metadata file that we created to test out data inspection functions.

Take a look at the dataframe by typing out the variable name metadata and pressing return; the variable contains information describing the samples in our study. Each row holds information for a single sample, and the columns contain categorical information about the sample genotype (Wt or KO), celltype (typeA or typeB), and replicate number (1, 2, or 3).

metadata\n

Output

genotype celltype replicate\nsample1        Wt    typeA      1\nsample2        Wt    typeA      2\nsample3        Wt    typeA      3\nsample4        KO    typeA      1\nsample5        KO    typeA      2\nsample6        KO    typeA      3\nsample7        Wt    typeB      1\nsample8        Wt    typeB      2\nsample9        Wt    typeB      3\nsample10       KO    typeB      1\nsample11       KO    typeB      2\nsample12       KO    typeB      3\n

If we had a larger file, we might not want to display all of its contents in the console. Instead we could check the top (the first 6 lines) of this data.frame using the function head():

head(metadata)\n
"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#list-of-functions-for-data-inspection","title":"List of functions for data inspection","text":"

We already saw how the functions head() and str() (in the releveling section) can be useful to check the content and the structure of a data.frame. Below is a non-exhaustive list of functions to get a sense of the content/structure of data. The list has been divided into functions that work on all types of objects, some that work only on vectors/factors (1-dimensional objects), and others that work on data frames and matrices (2-dimensional objects).

We have some exercises below that will allow you to gain more familiarity with these. You will definitely be using some of them in the next few homework sections.

  • All data structures - content display:

    • str(): compact display of data contents (similar to what you see in the Global environment)
    • class(): displays the data type for vectors (e.g. character, numeric, etc.) and data structure for dataframes, matrices, lists
    • summary(): detailed display of the contents of a given object, including descriptive statistics, frequencies
    • head(): prints the first 6 entries (elements for 1-D objects, rows for 2-D objects)
    • tail(): prints the last 6 entries (elements for 1-D objects, rows for 2-D objects)
  • Vector and factor variables:

    • length(): returns the number of elements in a vector or factor
  • Dataframe and matrix variables:

    • dim(): returns dimensions of the dataset (number_of_rows, number_of_columns) [Note, row numbers will always be displayed before column numbers in R]
    • nrow(): returns the number of rows in the dataset
    • ncol(): returns the number of columns in the dataset
    • rownames(): returns the row names in the dataset
    • colnames(): returns the column names in the dataset
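
As a quick illustration on a small toy vector (v is hypothetical, created only for this demo; expected output is shown as comments):

v <- c(2, 5, 9)\nclass(v)    # \"numeric\"\nlength(v)   # 3\nhead(v)     # 2 5 9 (fewer than 6 elements, so all are printed)\n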

Exercise 2

  • Use the class() function on glengths and metadata, how does the output differ between the two?
  • Use the summary() function on the proj_summary dataframe, what is the median \"rRNA_rate\"?
  • How long is the samplegroup factor?
  • What are the dimensions of the proj_summary dataframe?
  • When you use the rownames() function on metadata, what is the data structure of the output?
  • [Optional] How many elements in (how long is) the output of colnames(proj_summary)? Don't count, but use another function to determine this.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2/D2.1_in_class_exercises/","title":"Day 2: In class activities","text":""},{"location":"day_2/D2.1_in_class_exercises/#1-custom-functions","title":"1. Custom Functions","text":"

Let's create a function temp_conv(), which converts the temperature in Fahrenheit (input) to the temperature in Kelvin (output).

  • We could perform a two-step calculation: first convert from Fahrenheit to Celsius, and then convert from Celsius to Kelvin.

  • The formula for these two calculations are as follows: temp_c = (temp_f - 32) * 5 / 9; temp_k = temp_c + 273.15.

  • If your input is 70, the result of temp_conv(70) should be 294.2611.

"},{"location":"day_2/D2.1_in_class_exercises/#2-nesting-functions","title":"2. Nesting Functions","text":"

Now we want to round the temperature in Kelvin (output of temp_conv()) to a single decimal place. Use the round() function with the newly-created temp_conv() function to achieve this in one line of code. If your input is 70, the output should now be 294.3.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2/D2.2_data_wrangling/","title":"Data subsetting with base R: vectors and factors","text":"

Approximate time: 60 min

"},{"location":"day_2/D2.2_data_wrangling/#learning-objectives","title":"Learning Objectives","text":"
  • Demonstrate how to subset vectors and factors
  • Explain the use of logical operators when subsetting vectors and factors
  • Demonstrate how to relevel factors in a desired order
"},{"location":"day_2/D2.2_data_wrangling/#selecting-data-using-indices-and-sequences","title":"Selecting data using indices and sequences","text":"

When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let's begin with vectors and how to access different elements, and then extend those concepts to dataframes.

"},{"location":"day_2/D2.2_data_wrangling/#vectors","title":"Vectors","text":""},{"location":"day_2/D2.2_data_wrangling/#selecting-using-indices","title":"Selecting using indices","text":"

If we want to extract one or several values from a vector, we must provide one or several indices using square brackets [ ] syntax. The index represents the element number within a vector (or the compartment number, if you think of the bucket analogy). R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that's what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do.

Let's start by creating a vector called age:

age <- c(15, 22, 45, 52, 73, 81)\n

If we only wanted the fifth value of this vector, we would use the following syntax:

age[5]\n

If we wanted all values except the fifth value of this vector, we would use the following:

age[-5]\n

If we wanted to select more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of several index values:

age[c(3,5,6)]   ## nested\n\n# OR\n\n## create a vector first then select\nidx <- c(3,5,6) # create vector of the elements of interest\nage[idx]\n

To select a sequence of continuous values from a vector, we would use :, which is a special function that creates numeric vectors of integers in increasing or decreasing order. Let's select the first four values from age:

age[1:4]\n

Alternatively, if you wanted the reverse, you could try 4:1 and see what is returned.
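
For instance (expected output shown as a comment):

age[4:1]   # returns 52 45 22 15\n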

Exercise

  1. Create a vector called alphabets with the following letters, C, D, X, L, F.
  2. Use the associated indices along with [ ] to do the following:
    • only display C, D and F
    • display all except X
    • display the letters in the opposite order (F, L, X, D, C)
"},{"location":"day_2/D2.2_data_wrangling/#selecting-using-indices-with-logical-operators","title":"Selecting using indices with logical operators","text":"

We can also use indices with logical operators. Logical operators include greater than (>), less than (<), and equal to (==). A full list of logical operators in R is displayed below:

  • > : greater than
  • >= : greater than or equal to
  • < : less than
  • <= : less than or equal to
  • == : equal to
  • != : not equal to
  • & : and
  • | : or

We can use logical expressions to determine whether a particular condition is true or false. For example, let's use our age vector:

age\n

If we wanted to know if each element in our age vector is greater than 50, we could write the following expression:

age > 50\n

Returned is a vector of logical values the same length as age with TRUE and FALSE values indicating whether each element in the vector is greater than 50.

[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE\n

We can use these logical vectors to select only the elements in a vector with TRUE values at the same position or index as in the logical vector.

Select all values in the age vector over 50 or age less than 18:

age > 50 | age < 18\n\nage\n\nage[age > 50 | age < 18]  ## nested\n\n# OR\n\n## create a vector first then select\nidx <- age > 50 | age < 18\nage[idx]\n
"},{"location":"day_2/D2.2_data_wrangling/#indexing-with-logical-operators-using-the-which-function","title":"Indexing with logical operators using the which() function","text":"

While logical expressions will return a vector of TRUE and FALSE values of the same length, we could use the which() function to output the indices where the values are TRUE. Indexing with either method generates the same results, and personal preference determines which method you choose to use. For example:

which(age > 50 | age < 18)\n\nage[which(age > 50 | age < 18)]  ## nested\n\n# OR\n\n## create a vector first then select\nidx_num <- which(age > 50 | age < 18)\nage[idx_num]\n

Notice that we get the same results regardless of whether or not we use which(). Also note that while which() works the same as logical expressions for indexing, it can also be used in multiple other operations, where it is not interchangeable with logical expressions.

"},{"location":"day_2/D2.2_data_wrangling/#factors","title":"Factors","text":"

Since factors are special vectors, the same rules for selecting values using indices apply. The elements of the expression factor created previously had the following categories or levels: low, medium, and high.

Let's extract the values of the factor with high expression, and let's use nesting here:

expression[expression == \"high\"]    ## This will only return those elements in the factor equal to \"high\"\n

Nesting note

The piece of code above was more efficient with nesting; we used a single step instead of two steps as shown below:

Step1 (no nesting): idx <- expression == \"high\"

Step2 (no nesting): expression[idx]

Exercise

Extract only those elements in samplegroup that are not KO (nesting the logical operation is optional).

"},{"location":"day_2/D2.2_data_wrangling/#releveling-factors","title":"Releveling factors","text":"

We have briefly talked about factors, but this data type only becomes more intuitive once you've had a chance to work with it. Let's take a slight detour and learn about how to relevel categories within a factor.

To view the integer assignments under the hood you can use str():

expression\n\nstr(expression)\nFactor w/ 3 levels \"high\",\"low\",\"medium\": 2 1 3 1 2 3 1\n
The categories are referred to as factor levels. As we learned earlier, the levels in the expression factor were assigned integers alphabetically, with high=1, low=2, medium=3. However, it makes more sense for us if low=1, medium=2 and high=3, i.e. it makes sense for us to relevel the categories in this factor.

To relevel the categories, you can add the levels argument to the factor() function, and give it a vector with the categories listed in the required order:

expression <- factor(expression, levels=c(\"low\", \"medium\", \"high\"))     # you can re-factor a factor \n\nstr(expression)\nFactor w/ 3 levels \"low\",\"medium\",..: 1 3 2 3 1 2 3\n

Now we have a releveled factor with low as the lowest or first category, medium as the second and high as the third. This is reflected in the way they are listed in the output of str(), as well as in the numbering of which category is where in the factor.

Note

Releveling becomes necessary when you need a specific category in a factor to be the \"base\" category, i.e. category that is equal to 1. One example would be if you need the \"control\" to be the \"base\" in a given RNA-seq experiment.

Exercise

Use the samplegroup factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/","title":"Packages and libraries","text":"

Approximate time: 25 min

"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#learning-objectives","title":"Learning Objectives","text":"
  • Explain different ways to install external R packages
  • Demonstrate how to load a library and how to find functions specific to a package
"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#packages-and-libraries","title":"Packages and Libraries","text":"

Packages are collections of R functions, data, and compiled code in a well-defined format, created to add specific functionality. There are 10,000+ user-contributed packages, and the number is growing.

There are a set of standard (or base) packages which are considered part of the R source code and automatically available as part of your R installation. Base packages contain the basic functions that allow R to work, and enable standard statistical and graphical functions on datasets; for example, all of the functions that we have been using so far in our examples.

The directories in R where the packages are stored are called the libraries. The terms package and library are sometimes used synonymously and there has been discussion amongst the community to resolve this. It is somewhat counter-intuitive to load a package using the library() function and so you can see how confusion can arise.

You can check what libraries are loaded in your current R session by typing into the console:

sessionInfo() #Print version information about R, the OS and attached or loaded packages\n\n# OR\n\nsearch() #Gives a list of attached packages\n

Previously we have introduced you to functions from the standard base packages. However, the more you work with R, the more you will come to realize that there is a cornucopia of R packages offering a wide variety of functionality. Using additional packages requires installation. Many packages can be installed from the CRAN or Bioconductor repositories.

Helpful tips for package installations

  • Package names are case sensitive!
  • At any point (especially if you've used R/Bioconductor in the past), R may ask in the console whether you want to update old packages with the prompt Update all/some/none? [a/s/n]:. If you see this, type \"a\" at the prompt and hit Enter to update all old packages. Updating packages can sometimes take a while to run. If you are short on time, you can choose \"n\" and proceed. Without updating, you run the risk of conflicts between your old packages and the ones from your updated R version later down the road.
  • If you see a message in your console along the lines of \u201cbinary version available but the source version is later\u201d, followed by a question, \u201cDo you want to install from sources the package which needs compilation? y/n\u201d, type n for no, and hit enter.
"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#package-installation-from-cran","title":"Package installation from CRAN","text":"

CRAN is a repository where the latest downloads of R (and legacy versions) are found in addition to source code for thousands of different user contributed R packages.

Packages for R can be installed from the CRAN package repository using the install.packages function. This function will download the source code from the CRAN mirrors and install the package (and any dependencies) locally on your computer.

An example is given below for the ggplot2 package that will be required for some plots we will create later on. Run this code to install ggplot2.

install.packages(\"ggplot2\")\n
"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#package-installation-from-bioconductor","title":"Package installation from Bioconductor","text":"

Alternatively, packages can also be installed from Bioconductor, another repository of packages which provides tools for the analysis and comprehension of high-throughput genomic data. These packages include (but are not limited to) tools for performing statistical analysis, annotation packages, and tools for accessing public datasets.

There are many packages that are available in CRAN and Bioconductor, but there are also packages that are specific to one repository. Generally, you can find out this information with a Google search or by trial and error.

To install from Bioconductor, you will first need to install BiocManager. This only needs to be done once ever for your R installation.

Do Not Run This!

install.packages(\"BiocManager\")\n

Now you can use the install() function from the BiocManager package to install a package by providing the name in quotations.

Here we have the code to install ggplot2, through Bioconductor:

Do Not Run This!

BiocManager::install(\"ggplot2\")\n

Note

The code above may not be familiar to you - it is essentially using a new operator, a double colon :: to execute a function from a particular package. This is the syntax: package::function_name().
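
For example, a quick sketch using the base stats package (chosen purely for illustration):

# Call median() from stats without attaching the package via library()\nstats::median(c(1, 3, 5, 7))   # returns 4\n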

"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#package-installation-from-source","title":"Package installation from source","text":"

Finally, R packages can also be installed from source. This is useful when you do not have an internet connection (and have the source files locally), since the other two methods are retrieving the source files from remote sites.

To install from source, we use the same install.packages function but we have additional arguments that provide specifications to change from defaults:

Do Not Run This!

install.packages(\"~/Downloads/ggplot2_1.0.1.tar.gz\", type=\"source\", repos=NULL)\n
"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#loading-libraries","title":"Loading libraries","text":"

Once you have the package installed, you can load the library into your R session for use. Any of the functions that are specific to that package will be available for you to use by simply calling the function as you would for any of the base functions. Note that quotations are not required here.

library(ggplot2)\n

You can also check what is loaded in your current environment by using sessionInfo() or search() and you should see your package listed as:

other attached packages:\n[1] ggplot2_2.0.0\n

In this case there are several other packages that were also loaded along with ggplot2.

We only need to install a package once on our computer. However, to use the package, we need to load the library every time we start a new R/RStudio environment. You can think of this as installing a bulb versus turning on the light.

Analogy and image credit to Dianne Cook of Monash University.

"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#finding-functions-specific-to-a-package","title":"Finding functions specific to a package","text":"

If this is your first time using ggplot2, how do you know where to start and what functions are available to you? One way to find out is by using the Packages tab in RStudio. If you click on the tab, you will see a list of all packages that you have installed. For those libraries that you have loaded, you will see a blue checkmark in the box next to them. Scroll down to ggplot2 in your list:

If your library is successfully loaded you will see the box checked, as in the screenshot above. Now, if you click on ggplot2 RStudio will open up the help pages and you can scroll through.

An alternative is to find the help manual online, which can be less technical and sometimes easier to follow. For example, this website is much more comprehensive for ggplot2 and is the result of a Google search. Many of the Bioconductor packages also have very helpful vignettes that include comprehensive tutorials with mock data that you can work with.

If you can't find what you are looking for, you can use the rdocumentation.org website, which searches through the help files across all available packages.

Exercise

The ggplot2 package is part of the tidyverse suite of integrated packages which was designed to work together to make common data science operations more user-friendly. We will be using the tidyverse suite in later lessons, and so let's install it.

NOTE:

This suite of packages is only available in CRAN.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/","title":"Data wrangling: dataframes, matrices, and lists","text":"

Approximate time: 60 min

"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#learning-objectives","title":"Learning Objectives","text":"
  • Demonstrate how to subset, merge, and create new datasets from existing data structures in R.
"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#dataframes","title":"Dataframes","text":"

Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select some specific data from them we need to specify the \"coordinates\" we want. We use the same square bracket notation, but rather than providing a single index, two indices are required. Within the square brackets, row numbers come first, followed by column numbers (with the two separated by a comma). Let's explore the metadata dataframe; shown below are the first six samples:

Let's say we wanted to extract the wild type (Wt) value that is present in the first row and the first column. To extract it, just like with vectors, we give the name of the data frame that we want to extract from, followed by the square brackets. Now inside the square brackets we give the coordinates or indices for the rows in which the value(s) are present, followed by a comma, then the coordinates or indices for the columns in which the value(s) are present. We know the wild type value is in the first row if we count from the top, so we put a one, then a comma. The wild type value is also in the first column, counting from left to right, so we put a one in the columns space too.

# Extract value 'Wt'\nmetadata[1, 1]\n

Now let's extract the value 1 from the first row and third column.

# Extract value '1'\nmetadata[1, 3] \n

Now if you only wanted to select based on rows, you would provide the index for the rows and leave the columns index blank. The key here is to include the comma, to let R know that you are accessing a 2-dimensional data structure:

# Extract third row\nmetadata[3, ] \n
What kind of data structure does the output appear to be? We see that it is two-dimensional with row names and column names, so we can surmise that it's likely a data frame.

If you were selecting specific columns from the data frame - the rows are left blank:

# Extract third column\nmetadata[ , 3]   \n

What kind of data structure does this output appear to be? It looks different from the data frame, and we really just see a series of values output, indicating a vector data structure. This happens by default when selecting a single column from a data frame: R will drop to the simplest data structure possible. Since a single column in a data frame is really just a vector, R will output a vector data structure as the simplest data structure. Oftentimes we would like to keep our single column as a data frame. To do this, there is an argument we can add when subsetting called drop, meaning do we want to drop down to the simplest data structure. By default it is TRUE, but we can change its value to FALSE in order to keep the output as a data frame.

# Extract third column as a data frame\nmetadata[ , 3, drop = FALSE] \n

Just like with vectors, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a vector of the desired values.

We can extract consecutive rows or columns using the colon (:) to create the vector of indices to extract.

# Dataframe containing first two columns\nmetadata[ , 1:2] \n

Alternatively, we can use the combine function (c()) to extract any number of rows or columns. Let's extract the first, third, and sixth rows.

# Data frame containing first, third and sixth rows\nmetadata[c(1,3,6), ] \n

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Is celltype in column 1 or 2? Oh, right... it is in column 2.) In some cases, the column/row number for values can change if the script you are using adds or removes columns/rows. It is therefore often better to use column/row names to refer to particular values, and it makes your code easier to read and your intentions clearer.

# Extract the celltype column for the first three samples\nmetadata[c(\"sample1\", \"sample2\", \"sample3\") , \"celltype\"] \n

It's important to type the names of the columns/rows in the exact way that they are typed in the data frame; for instance if I had spelled celltype with a capital C, it would not have worked.

If you need to remind yourself of the column/row names, the following functions are helpful:

# Check column names of metadata data frame\ncolnames(metadata)\n\n# Check row names of metadata data frame\nrownames(metadata)\n

If only a single column is to be extracted from a data frame, there is a useful shortcut available. If you type the name of the data frame, then the $, you have the option to choose which column to extract. For instance, let's extract the entire genotype column from our dataset:

# Extract the genotype column\nmetadata$genotype \n

The output will always be a vector, and if desired, you can continue to treat it as a vector. For example, if we wanted the genotype information for the first five samples in metadata, we can use the square brackets ([]) with the indices for the values from the vector to extract:

# Extract the first five values/elements of the genotype column\nmetadata$genotype[1:5]\n

Unfortunately, there is no equivalent $ syntax to select a row by name.
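
You can, however, still select a row by its name using the bracket notation shown earlier:

# Extract the row named 'sample1' (all columns)\nmetadata[\"sample1\", ]\n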

Exercise

  1. Return a data frame with only the genotype and replicate column values for sample2 and sample8.
  2. Return the fourth and ninth values of the replicate column.
  3. Extract the replicate column as a data frame.
"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#selecting-using-indices-with-logical-operators","title":"Selecting using indices with logical operators","text":"

With data frames, similar to vectors, we can use logical expressions to extract the rows or columns in the data frame with specific values. First, we need to determine the indices of the rows or columns where a logical expression is TRUE, then we can extract those rows or columns from the data frame.

For example, if we want to return only those rows of the data frame with the celltype column having a value of typeA, we would perform two steps:

  1. Identify which rows in the celltype column have a value of typeA.
  2. Use those TRUE values to extract those rows from the data frame.

To do this we would extract the column of interest as a vector, with the first value corresponding to the first row, the second value corresponding to the second row, and so on. We use that vector in the logical expression. Here we are looking for values to be equal to typeA, so our logical expression would be:

metadata$celltype == \"typeA\"\n

This will output TRUE and FALSE values for the values in the vector. The first six values are TRUE, while the last six are FALSE. This means the first six rows of our metadata have a value of typeA while the last six do not. We can save these values to a variable, which we can call whatever we would like; let's call it logical_idx.

logical_idx <- metadata$celltype == \"typeA\"\n

Now we can use those TRUE and FALSE values to extract the rows that correspond to the TRUE values from the metadata data frame. We will extract rows as we normally would from a data frame with metadata[ , ], making sure to put logical_idx in the rows position, since those TRUE and FALSE values correspond to the ROWS for which the expression is TRUE/FALSE. We will leave the columns position blank to return all columns.

metadata[logical_idx, ]\n
"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#selecting-indices-with-logical-operators-using-the-which-function","title":"Selecting indices with logical operators using the which() function","text":"

As you might have guessed, we can also use the which() function to return the indices for which the logical expression is TRUE. For example, we can find the indices where the celltype is typeA within the metadata dataframe:

which(metadata$celltype == \"typeA\")\n

This returns the values one through six, indicating that the first 6 rows are TRUE, i.e. equal to typeA. We can save the indices for which the logical expression is true to a variable we'll call idx, but, again, you could call it anything you want.

idx <- which(metadata$celltype == \"typeA\")\n

Then, we can use these indices to indicate the rows that we would like to return by extracting that data as we have previously, giving the idx as the rows that we would like to extract, while returning all columns:

metadata[idx, ]\n

Let's try another subsetting operation. Extract the rows of the metadata data frame for only replicates 2 and 3. First, let's create the logical expression for the column of interest (replicate):

which(metadata$replicate > 1)\n

This should return the indices for the rows in the replicate column within metadata that have a value of 2 or 3. Now, we can save those indices to a variable and use that variable to extract those corresponding rows from the metadata table.

idx <- which(metadata$replicate > 1)\n\nmetadata[idx, ]\n

Alternatively, instead of doing this in two steps, we could use nesting to perform in a single step:

metadata[which(metadata$replicate > 1), ]\n

Either way works, so use the method that is most intuitive for you.

So far we haven't stored as variables any of the extractions/subsettings that we have performed. Let's save this output to a variable called sub_meta:

sub_meta <- metadata[which(metadata$replicate > 1), ]\n

Exercises

Subset the metadata dataframe to return only the rows of data with a genotype of KO.

NOTE

There are easier methods for subsetting dataframes using logical expressions, including the filter() and the subset() functions. These functions will return the rows of the dataframe for which the logical expression is TRUE, allowing us to subset the data in a single step. We will explore the filter() function in more detail in a later lesson.

"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#lists","title":"Lists","text":"

Selecting components from a list requires a slightly different notation, even though in theory a list is a vector (that contains multiple data structures). To select a specific component of a list, you need to use double bracket notation [[]]. Let's use the list1 that we created previously, and index the second component:

list1[[2]]\n

What do you see printed to the console? Using the double bracket notation is useful for accessing the individual components whilst preserving the original data structure. When creating this list we know we had originally stored a dataframe in the second component. With the class function we can check if that is what we retrieve:

comp2 <- list1[[2]]\nclass(comp2)\n

You can also reference what is inside the component by adding an additional bracket. For example, in the first component we have a vector stored.

list1[[1]]\n\n[1] \"ecoli\" \"human\" \"corn\" \n

Now, if we wanted to reference the first element of that vector we would use:

list1[[1]][1]\n\n[1] \"ecoli\"\n

You can also do the same for dataframes and matrices, although with larger datasets it is not advisable. Instead, it is better to save the contents of a list component to a variable (as we did above) and further manipulate it. Also, it is important to note that when selecting components we can only access one at a time. To access multiple components of a list, see the note below.

Note

Using the single bracket notation also works with lists. The difference is the class of the information that is retrieved. Using single bracket notation, i.e. list1[1], will return the contents in a list form and not the original data structure. The benefit of this notation is that it allows indexing by vectors, so you can access multiple components of the list at once.
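
A short sketch using the list1 object from above (expected classes shown as comments):

list1[1]           # a list of length 1 containing the first component\nlist1[c(1, 3)]     # a list containing the first and third components\nclass(list1[[1]])  # the class of the component itself (here \"character\")\nclass(list1[1])    # \"list\"\n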

Exercise

  1. Create a list named random with the following components: metadata, age, list1, samplegroup, and number.
  2. Extract the samplegroup component.

Assigning names to the components in a list can help identify what each list component contains, as well as facilitate the extraction of values from list components.

Adding names to components of a list uses the names() function. Let's check and see if the list1 has names for the components:

names(list1) \n

When we created the list we had combined the species vector with a dataframe df and the number variable. Let's assign the original names to the components. To do this we can use the assignment operator in a new context. If we place names(list1) on the left side of the assignment arrow, then whatever is on the right side of the arrow will be assigned to it. Since we have three components in list1, we need three names to assign. We can create a vector of names using the combine (c()) function, and inside the combine function we give the names to assign to the components in the order we would like. So the first name is assigned to the first component of the list, and so on.

# Name components of the list\nnames(list1) <- c(\"species\", \"df\", \"number\")\n\nnames(list1)\n

Now that we have named our list components, we can extract components using the $ similar to extracting columns from a data frame. To obtain a component of a list using the component name, use list_name$component_name:

To extract the df dataframe from the list1 list:

# Extract 'df' component\nlist1$df\n

Exercise

Let's practice combining ways to extract data from the data structures we have covered so far:

  1. Set names for the random list you created in the last exercise.

  2. Extract the age component using the $ notation

An R package for data wrangling

The methods presented above are using base R functions for data wrangling. Later we will explore the Tidyverse suite of packages, specifically designed to make data wrangling easier.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.3e_identifying-matching-elements/","title":"Advanced R, logical operators for matching","text":"

Approximate time: 45 min

"},{"location":"day_2_exercise/D2.3e_identifying-matching-elements/#learning-objectives","title":"Learning Objectives","text":"
  • Describe the use of %in% operator.
  • Explain the use cases for the any() and all() functions.
"},{"location":"day_2_exercise/D2.3e_identifying-matching-elements/#logical-operators-for-identifying-matching-elements","title":"Logical operators for identifying matching elements","text":"

Oftentimes, we encounter different analysis tools that require multiple input datasets. It is not uncommon for these inputs to need to have the same row names, column names, or unique identifiers in the same order to perform the analysis. Therefore, knowing how to reorder datasets and determine whether the data matches is an important skill.

In our use case, we will be working with genomic data. We have gene expression data generated by RNA-seq, which we had downloaded previously; in addition, we have a metadata file corresponding to the RNA-seq samples. The metadata contains information about the samples present in the gene expression file, such as which sample group each sample belongs to and any batch or experimental variables present in the data.

Let's read in our gene expression data (RPKM matrix) that we downloaded previously:

rpkm_data <- read.csv(\"data/counts.rpkm.csv\")\n

Note

If the data file name ends with txt instead of csv, you can read in the data using the code: rpkm_data <- read.csv(\"data/counts.rpkm.txt\").

Take a look at the first few lines of the data matrix to see what's in there.

head(rpkm_data)\n

It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up.

ncol(rpkm_data)\nnrow(metadata)\n

What we want to know is: do we have expression data for every sample for which we have metadata?

"},{"location":"day_2_exercise/D2.3e_identifying-matching-elements/#the-in-operator","title":"The %in% operator","text":"

Although its documentation can be tricky to find (it is described on the help page for match()), this operator is widely used and convenient once you get the hang of it. The operator is used with the following syntax:

vector1 %in% vector2\n

It will take each element from vector1 as input, one at a time, and evaluate if the element is present in vector2. The two vectors do not have to be the same size. This operation will return a vector containing logical values to indicate whether or not there is a match. The new vector will be of the same length as vector1. Take a look at the example below:

A <- c(1,3,5,7,9,11)   # odd numbers\nB <- c(2,4,6,8,10,12)  # even numbers\n\n# test to see if each of the elements of A is in B  \nA %in% B\n
## [1] FALSE FALSE FALSE FALSE FALSE FALSE\n

Since vector A contains only odd numbers and vector B contains only even numbers, the operation returns a logical vector containing six FALSE values, indicating that no element in vector A is present in vector B. Let's change a couple of numbers inside vector B to match vector A:

A <- c(1,3,5,7,9,11)   # odd numbers\nB <- c(2,4,6,8,1,5)  # add some odd numbers in \n
# test to see if each of the elements of A is in B\nA %in% B\n
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE\n

The returned logical vector denotes which elements in A are also in B - the first and third elements, which are 1 and 5.

We saw previously that we could use the output from a logical expression to subset data by returning only the values corresponding to TRUE. Therefore, we can use the output logical vector to subset our data and return only those elements in A that are also in B:

intersection <- A %in% B\nintersection\n

A[intersection]\n

In these previous examples, the vectors were so small that it's easy to check every logical value by eye; but this is not practical when we work with large datasets (e.g. a vector with 1000 logical values). Instead, we can use the any() function. Given a logical vector, this function will tell you whether at least one value is TRUE. It provides a quick way to assess if any of the values contained in vector A are also in vector B:

any(A %in% B)\n

The all() function is also useful. Given a logical vector, it will tell you whether all values are TRUE. If there is at least one FALSE value, all() will return FALSE. We can use this function to assess whether all elements from vector A are contained in vector B.

all(A %in% B)\n
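
For reference, given the A and B vectors above (where only the values 1 and 5 are shared), the expected output would be:

any(A %in% B)\n## [1] TRUE\n\nall(A %in% B)\n## [1] FALSE\n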

Exercise

  1. Using the A and B vectors created above, evaluate each element in B to see if there is a match in A.

  2. Subset the B vector to only return those values that are also in A.

Suppose we had two vectors containing the same values. How can we check whether those values are in the same order in each vector? In this case, we can use the == operator to compare each element at the same position in the two vectors. The operator returns a logical vector indicating TRUE/FALSE at each position, and we can then use the all() function to check whether all values in that vector are TRUE. If they are, we know the two vectors are identical. Unlike %in%, the == operator expects the two vectors to be of equal length (if they are not, R will recycle the shorter vector, with a warning when the lengths are not multiples of each other).

A <- c(10,20,30,40,50)\nB <- c(50,40,30,20,10)  # same numbers but backwards \n\n# test to see if each element of A is in B\nA %in% B\n\n# test to see if each element of A is in the same position in B\nA == B\n\n# use all() to check if they are a perfect match\nall(A == B)\n
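
For reference, a sketch of the expected output with the A and B vectors just defined:

A %in% B\n## [1] TRUE TRUE TRUE TRUE TRUE\n\nA == B\n## [1] FALSE FALSE  TRUE FALSE FALSE\n\nall(A == B)\n## [1] FALSE\n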

Let's try this on our genomic data, and see whether we have metadata information for all samples in our expression data. We'll start by creating two vectors: one containing the row names of the metadata, and one containing the column names of the RPKM data. rownames() and colnames() are base R functions that extract the row and column names as a vector:

x <- rownames(metadata)\ny <- colnames(rpkm_data)\n

Now check to see that all of x are in y:

all(x %in% y)\n

Note that we can use nested functions in place of x and y and still get the same result:

all(rownames(metadata) %in% colnames(rpkm_data))\n

We know that all samples are present, but are they in the same order?

x == y\nall(x == y)\n

Looks like all of the samples are there, but they are not in the same order. In the next lesson, we will learn different ways to reorder data so that we can match up our genomic samples. But before that, let's work on exercise 2 to consolidate concepts from this lesson.

Exercise

We have a list of 6 marker genes that we are very interested in. Our goal is to extract the count data for these genes from the rpkm_data data frame using the %in% operator, instead of scrolling through rpkm_data and finding them manually.

First, let's create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:

    important_genes <- c(\"ENSMUSG00000083700\", \"ENSMUSG00000080990\", \"ENSMUSG00000065619\", \"ENSMUSG00000047945\", \"ENSMUSG00000081010\", \"ENSMUSG00000030970\")\n
  1. Use the %in% operator to determine if all of these genes are present in the row names of the rpkm_data data frame.

  2. Extract the rows from rpkm_data that correspond to these 6 genes using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.

  3. Bonus question: Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/","title":"Advanced R, reordering to match datasets","text":"

Approximate time: 45 min

"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#learning-objectives","title":"Learning Objectives","text":"
  • Implement manual reordering of vectors and data frames
  • Utilize the match() function to reorder vectors and data frames so that unique identifiers are in the same order
"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#reordering-data-to-match","title":"Reordering data to match","text":"

In the previous lesson, we learned how to determine whether the same data is present in two datasets, and whether it is in the same order. In this lesson, we will explore how to reorder the data such that the datasets match.

"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#manual-reordering-of-data-using-indices","title":"Manual reordering of data using indices","text":"

Indexing [ ] can be used to extract values from a dataset as we saw earlier, but we can also use it to rearrange our data values.

teaching_team <- c(\"Jihe\", \"Mary\", \"Meeta\", \"Radhika\", \"Will\", \"Emma\")\n

Remember that we can return values in a vector by specifying its position or index:

# Extracting values from a vector\nteaching_team[c(2, 4)] \n

Also, note that we haven't changed the teaching_team variable. The only way to change the teaching_team variable would be to re-assign/overwrite it.

teaching_team\n

We can also extract the values and reorder them:

# Extracting values and reordering them\nteaching_team[c(4, 2)] \n

Similarly, we can extract all of the values and reorder them:

# Extracting all values and reordering them\nteaching_team[c(5, 4, 6, 2, 1, 3)]\n

If we want to save our results, we need to assign to a variable:

# Saving the results to a variable\nreorder_teach <- teaching_team[c(5, 4, 6, 2, 1, 3)] \n

Exercise

Now that we know how to reorder using indices, let's try to use it to reorder the contents of one vector to match the contents of another. Let's create the vectors first and second as detailed below:

first <- c(\"A\",\"B\",\"C\",\"D\",\"E\")\nsecond <- c(\"B\",\"D\",\"E\",\"A\",\"C\")  # same letters but different order\n

How would you reorder the second vector to match first?

If we had large datasets, it would be difficult to reorder them by searching for the indices of the matching elements, and it would be quite easy to make a typo or mistake. To help with matching datasets, there is a function called match().

"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#the-match-function","title":"The match function","text":"

We can use the match() function to match the values in two vectors. We'll be using it to evaluate which values are present in both vectors, and how to reorder the elements to make the values match.

match() takes 2 arguments:

  1. a vector of values in the order you want
  2. a vector of values to be reordered such that it will match the first

The function returns the position of the matches (indices) with respect to the second vector, which can be used to re-order it so that it matches the order in the first vector. Let's use match() on the first and second vectors we created.

match(first,second)\n[1] 4 1 5 2 3\n

The output is the indices for how to reorder the second vector to match the first. These indices match the indices that we derived manually before.

Now, we can just use the indices to reorder the elements of the second vector to be in the same positions as the matching elements in the first vector:

# Saving indices for how to reorder `second` to match `first`\nreorder_idx <- match(first,second) \n

Then, we can use those indices to reorder the second vector similar to how we ordered with the manually derived indices.

# Reordering the second vector to match the order of the first vector\nsecond[reorder_idx]\n

If the output looks good, we can save the reordered vector to a new variable.

# Reordering and saving the output to a variable\nsecond_reordered <- second[reorder_idx]  \n

Now that we know how match() works, let's change vector second so that only a subset of the values is retained:

first <- c(\"A\",\"B\",\"C\",\"D\",\"E\")\nsecond <- c(\"D\",\"B\",\"A\")  # remove values\n
And try match() again:

match(first,second)\n\n[1]  3  2 NA  1 NA\n

We see that the match() function takes each element in the first vector and finds its position in the second vector; if an element is not present, it returns the missing value NA. The value NA represents missing data for any data type within R. In this case, the first element of the output is 3 because A is found at position 3 of the second vector; the next is 2 because B is at position 2; the value that comes next should correspond to C, but C is not present in the second vector, so NA is returned; and so on.

Note

By default, values that don't match return NA. You can specify a different value to be assigned to non-matches using the nomatch argument. Also, if there is more than one matching value, only the position of the first match is reported.
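
For example, a quick sketch using the first and second vectors above; the choice of 0 for nomatch is arbitrary:

match(first, second, nomatch = 0)\n\n[1] 3 2 0 1 0\n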

If we rearrange second using these indices, then we should see that all the values present in both vectors are in the same positions and NAs are present for any missing values.

second[match(first, second)]\n
"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#reordering-genomic-data-using-match-function","title":"Reordering genomic data using match() function","text":"

While the input to the match() function is always going to be two vectors, we often need to use these vectors to reorder the rows or columns of a data frame to match the rows or columns of another data frame. Let's explore how to do this with our use case featuring RNA-seq data. To perform differential gene expression analysis, we have a data frame with the expression data or counts for every sample and another data frame with information about which condition each sample belongs to. For the tools performing the analysis, the samples in the counts data, which are the column names, need to be the same and in the same order as the samples in the metadata data frame, which are the row names.

We can take a look at these samples in each dataset by using the rownames() and colnames() functions.

# Check row names of the metadata\nrownames(metadata)\n\n# Check the column names of the counts data\ncolnames(rpkm_data)\n

We see the row names of the metadata are in a nice order starting at sample1 and ending at sample12, while the column names of the counts data look to be the same samples, but are randomly ordered. Therefore, we want to reorder the columns of the counts data to match the order of the row names of the metadata. To do so, we will use the match() function to match the row names of our metadata with the column names of our counts data, so these will be the arguments for match.

Within the match() function, the row names of the metadata make up the vector in the order that we want, so this will be the first argument, while the column names of the count (RPKM) data make up the vector to be reordered. We will save these indices, which describe how to reorder the column names of the count data so that they match the row names of the metadata, to a variable called genomic_idx.

genomic_idx <- match(rownames(metadata), colnames(rpkm_data))\ngenomic_idx\n

The genomic_idx represents how to re-order the column names in our counts data to be identical to the row names in metadata.

Now we can create a new counts data frame in which the columns are re-ordered based on the match() indices. Remember that to reorder the rows or columns in a data frame we give the name of the data frame followed by square brackets, and then the indices for how to reorder the rows or columns.

Our genomic_idx represents how we would need to reorder the columns of our count data such that the column names would be in the same order as the row names of our metadata. Therefore, we need to add our genomic_idx to the columns position. We are going to save the output of the reordering to a new data frame called rpkm_ordered.

# Reorder the counts data frame to have the sample names in the same order as the metadata data frame\nrpkm_ordered  <- rpkm_data[ , genomic_idx]\n

Check and see what happened by clicking on the rpkm_ordered in the Environment window or using the View() function.

# View the reordered counts\nView(rpkm_ordered)\n

We can see the sample names are now in a nice order from sample1 to sample12, just like the metadata. One thing to note is that you would never want to rearrange just the column names without the rest of the column, because that would dissociate each sample name from its values.

You can also verify that the column names of this new data matrix match the metadata row names by using the all() function:

all(rownames(metadata) == colnames(rpkm_ordered))\n

Now that our samples are ordered the same in our metadata and counts data, if these were raw counts (not RPKM) we could proceed to perform differential expression analysis with this dataset.

Exercise

  1. After talking with your collaborator, it becomes clear that sample2 and sample9 were actually from a different mouse background than the other samples and should not be part of our analysis. Create a new variable called subset_rpkm that has these columns removed from the rpkm_ordered data frame.

  2. Use the match() function to subset the metadata data frame so that the row names of the metadata data frame match the column names of the subset_rpkm data frame.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/","title":"Plotting and data visualization in R","text":""},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#learning-objectives","title":"Learning Objectives","text":"
  • Describe the map() function for iterative tasks on data structures.
"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#setting-up-a-data-frame-for-visualization","title":"Setting up a data frame for visualization","text":"

In this lesson we want to make plots to evaluate the average expression in each sample and its relationship with the age of the mouse. To this end, we will add a couple of additional columns of information to the metadata data frame that we can utilize for plotting.

"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#calculating-average-expression","title":"Calculating average expression","text":"

Let's take a closer look at our counts data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. We eventually want to compute the average expression value for each sample. Taking this one step at a time, what would we do if we just wanted the average expression for Sample 1 (across all transcripts)? We can use the mean() function provided by base R:

mean(rpkm_ordered$sample1)\n

That is great if we only wanted the average from one of the samples (1 column in a data frame), but we need to get this information from all 12 samples, so all 12 columns. It would be ideal to get a vector of 12 values that we can add to the metadata data frame. What is the best way to do this?

Programming languages typically have a way to allow the execution of a single line of code or several lines of code multiple times, or in a \"loop\". While \"for loops\" are available in R, there are other easier-to-use functions that can achieve this - for example, the apply() family of functions and the map() family of functions.
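
For contrast, here is what a \"for loop\" version might look like; it computes the same 12 averages as the map_dbl() call below, just more verbosely (this is only a sketch for comparison):

samplemeans <- numeric(ncol(rpkm_ordered))\nfor (i in seq_along(rpkm_ordered)) {\n  samplemeans[i] <- mean(rpkm_ordered[[i]])   # mean of column i\n}\n# note: unlike map_dbl(), this loop does not retain the sample names\n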

The map() family is a bit more intuitive to use than apply() and we will be using it today. However, if you are interested in learning more about the apply() family of functions, we have materials available here.

To obtain mean values for all samples we can use the map_dbl() function which generates a numeric vector.

library(purrr)  # Load the purrr package\n\nsamplemeans <- map_dbl(rpkm_ordered, mean) \n
"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#the-map-family-of-functions","title":"The map family of functions","text":"

The map() family of functions is available from the purrr package, which is part of the tidyverse suite of packages. More detailed information is available in the R for Data Science book. This family includes several functions, each taking a vector as input and outputting a vector of a specified type. For example, we can use these functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on.

  • map() creates a list.
  • map_lgl() creates a logical vector.
  • map_int() creates an integer vector.
  • map_dbl() creates a \"double\" or numeric vector.
  • map_chr() creates a character vector.

The syntax for the map() family of functions is:

## DO NOT RUN\nmap(object, function_to_apply)\n

If you would like to practice with the map() family of functions, we have additional materials available.
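
As a quick illustration, here is a sketch using the built-in mtcars data frame (chosen only because it ships with R):

library(purrr)\n\nmap_dbl(mtcars, mean)    # named numeric vector: the mean of every column\nmap_chr(mtcars, class)   # character vector: the class of every column\n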

"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#creating-a-new-metadata-object-with-additional-information","title":"Creating a new metadata object with additional information","text":"

Because the input was 12 columns of information, the output of map_dbl() is a named vector of length 12.

# Named vectors have a name assigned to each element instead of just referring to them as indices ([1], [2] and so on)\nsamplemeans\n\n# Check length of the vector before adding it to the data frame\nlength(samplemeans)\n

Since we have 12 rows in the data frame, we can add the 12-element vector as a column to our metadata data frame using the data.frame() function.

Before we add the new column, let's create a vector with the ages of each of the mice in our data set.

# Create a numeric vector with ages. Note that there are 12 elements here\nage_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32)        \n

Now, we are ready to combine the metadata data frame with the 2 new vectors to create a new data frame with 5 columns:

# Add the new vector as the last column to the new_metadata dataframe\nnew_metadata <- data.frame(metadata, samplemeans, age_in_days) \n\n# Take a look at the new_metadata object\nView(new_metadata)\n

Note

We could also have combined the columns using the cbind() function, as shown in the code below:

## DO NOT RUN\nnew_metadata <- cbind(metadata, samplemeans, age_in_days)\n
The two functions work identically with the exception of assigning row names: if we were combining columns and wanted to add in a vector of row names, we could easily do so in data.frame() using the row.names argument, which is not available for the cbind() function.
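
For instance, a minimal sketch of that row.names argument, here just re-supplying the existing metadata row names for illustration:

## DO NOT RUN\nnew_metadata <- data.frame(metadata, samplemeans, age_in_days, row.names = rownames(metadata))\n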

We are now ready for plotting and data visualization!

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3/D3.1_in_class_exercises/","title":"Day 3: In class activities","text":""},{"location":"day_3/D3.1_in_class_exercises/#reading-in-and-inspecting-data","title":"Reading in and inspecting data","text":"
  • Download the data and place the file into the data directory.
Download link: Animal data (right click & \"Save link as...\")
  • Read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.

  • Check to make sure that animals is a dataframe.

  • How many rows are in the animals dataframe? How many columns?

"},{"location":"day_3/D3.1_in_class_exercises/#data-wrangling","title":"Data wrangling","text":"
  1. Extract the speed value of 40 km/h from the animals dataframe.

  2. Return the rows with animals that are the color Tan.

  3. Return the rows with animals that have speed greater than 50 km/h and output only the color column. Keep the output as a data frame.

  4. Change the color of \"Grey\" to \"Gray\".

  5. Create a list called animals_list in which the first element contains the speed column of the animals dataframe and the second element contains the color column of the animals dataframe.

  6. Give each element of your list the appropriate name (i.e. speed and color).

"},{"location":"day_3/D3.1_in_class_exercises/#the-in-operator-reordering-and-matching","title":"The %in% operator, reordering and matching","text":"

In your environment you should have a dataframe called proj_summary which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset.

  1. Copy and paste the code below to create a dataframe of control samples with the associated batch information:
    ctrl_samples <- data.frame(row.names = c(\"sample3\", \"sample10\", \"sample8\", \"sample4\", \"sample15\"), date = c(\"01/13/2018\", \"03/15/2018\", \"01/13/2018\", \"09/20/2018\",\"03/15/2018\"))\n
  2. How many of the ctrl_samples are also in the proj_summary dataframe? Use the %in% operator to compare sample names.

  3. Keep only the rows in proj_summary which correspond to those in ctrl_samples. Do this with the %in% operator. Save it to a variable called proj_summary_ctrl.

  4. We would like to add in the batch information for the samples in proj_summary_ctrl. Find the rows that match in ctrl_samples.

  5. Use cbind() to add a column called batch to the proj_summary_ctrl dataframe. Assign this new dataframe back to proj_summary_ctrl.

"},{"location":"day_3/D3.1_in_class_exercises/#bonus-using-map_lgl","title":"BONUS: Using map_lgl()","text":"
  1. Subset proj_summary to keep only the \u201chigh\u201d and \u201clow\u201d samples based on the treatment column. Save the new dataframe to a variable called proj_summary_noctl.

  2. Further, subset the dataframe to remove the non-numeric columns \u201cQuality_format\u201d and \u201ctreatment\u201d. Try to do this using the map_lgl() function in addition to is.numeric(). Save the new dataframe back to proj_summary_noctl.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3/D3.2_plotting_with_ggplot2/","title":"Plotting and data visualization in R","text":"

Approximate time: 60 minutes

"},{"location":"day_3/D3.2_plotting_with_ggplot2/#learning-objectives","title":"Learning Objectives","text":"
  • Explain the syntax of ggplot2
  • Apply ggplot2 package to visualize data.
"},{"location":"day_3/D3.2_plotting_with_ggplot2/#data-visualization-with-ggplot2","title":"Data Visualization with ggplot2","text":"

For this lesson, you will need the new_metadata data frame. Please download it from the link below: right click and choose \"Save link as...\" to save the file in the data directory.

Download link: Data (right click & \"Save link as...\")

Once you have downloaded it, load it into your environment as follows:

## load the new_metadata data frame into your environment from a .RData object\nload(\"data/new_metadata.RData\")\n

Next, let's check if it was successfully loaded into the environment:

# this data frame should have 12 rows and 5 columns\nView(new_metadata)\n

Great, we are now ready to move forward!

When we are working with large sets of numbers it can be useful to display that information graphically to gain more insight. In this lesson we will be plotting with the popular R package ggplot2.

Note

If you are interested in learning about plotting with base R functions, we have a short lesson.

The ggplot2 syntax takes some getting used to, but once you get it, you will find it's extremely powerful and flexible. We will start with drawing a simple x-y scatterplot of samplemeans versus age_in_days from the new_metadata data frame. Please note that ggplot2 expects a \"data frame\" or \"tibble\" as input.

Note

You can find out more about tibbles in the lesson on tidyverse.

Let's start by loading the ggplot2 library:

library(ggplot2)\n

The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot using additional functions one after the other and combine them into a \"code chunk\" using the + operator; the functions in the resulting code chunk are called layers.

Try the code below and see what happens.

ggplot(new_metadata) # what happens? \n

Metadata

If you don't have the new_metadata object, you can right-click to download and save an rds file from here into the project data folder, and load it in using the code below:

new_metadata <- readRDS(\"data/new_metadata.rds\")\n

You get a blank plot, because you need to specify additional layers using the + operator.

The geom (geometric) object is the layer that specifies what kind of plot we want to draw. A plot must have at least one geom; there is no upper limit. Examples include:

  • points (geom_point, geom_jitter for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)

Let's add a \"geom\" layer to our plot using the + operator; since we want a scatter plot, we will use geom_point().

ggplot(new_metadata) +\n  geom_point() # note what happens here\n

Why do we get an error? Is the error message easy to decipher?

We get an error because each type of geom usually has a required set of aesthetics. \"Aesthetics\" are specified with the aes() function and can be set either nested within geom_point() (applying only to that layer) or within ggplot() (applying to the whole plot).

The aes() function has many different arguments, and all of those arguments take columns from the original data frame as input. It can be used to specify many plot elements including the following:

  • position (i.e., on the x and y axes)
  • color (\"outside\" color)
  • fill (\"inside\" color)
  • shape (of points)
  • linetype
  • size

To start, we will specify the x- and y-axis positions, since geom_point requires the most basic information about a scatterplot, i.e. what you want to plot on the x and y axes. All of the other plot elements mentioned above are optional.

ggplot(new_metadata) +\n     geom_point(aes(x = age_in_days, y= samplemeans))\n

Now that we have the required aesthetics, let's add some extras like color to the plot. We can color the points on the plot based on the genotype column within aes(). You will notice that a default set of colors is used, so we do not have to specify any. Note that the legend has been conveniently plotted for us.

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype)) \n

Let's try to have both celltype and genotype represented on the plot. To do this, we can assign the celltype column to the shape argument in aes(), so that each celltype is plotted with a differently shaped data point.

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype)) \n

The data points are quite small. We can adjust their size within the geom_point() layer, but this should not be done within aes(), since we are not mapping size to a column in the input data frame; instead, we are just specifying a number.

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype), size=2.25) \n

The labels on the x- and y-axis are also quite small and hard to read. To change their size, we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:

  • Axis label aesthetics
  • Plot background
  • Facet label background
  • Legend appearance

There are built-in themes we can use (e.g. theme_bw()) that mostly change the background/foreground colours; we can apply one by adding it as an additional layer. Alternatively, we can adjust specific elements of the current default theme by adding the theme() layer and passing in arguments for the things we wish to change. Or we can use both.

Let's add a layer theme_bw().

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype), size=3.0) +\n  theme_bw() \n

Do the axis labels or the tick labels get any larger by changing themes?

No, they don't. But, we can add arguments using theme() to change the size of axis labels ourselves. Since we will be adding this layer \"on top\", or after theme_bw(), any features we change will override what is set by the theme_bw() layer.

Let's increase the size of both axis titles to 1.5 times the default size. When modifying the size of text, the rel() function is commonly used to specify a change relative to the default.

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype), size=2.25) +\n  theme_bw() +\n  theme(axis.title = element_text(size=rel(1.5)))           \n

Notes

  • You can run example(\"geom_point\") to explore a multitude of different aesthetics and layers that can be added to your plot. As you scroll through the different plots, take note of how the code is modified. You can use this with any of the different geometric object layers available in ggplot2 to learn how you can easily modify your plots!
  • RStudio provides this very useful cheatsheet for plotting using ggplot2. Different example plots are provided along with the associated code (i.e. which geom or theme to use in a given situation). We also encourage you to peruse this useful online reference for working with ggplot2.

Exercise

  1. The current axis label text defaults to what we gave as input to geom_point (i.e. the column headers). We can change this by adding additional layers called xlab() and ylab() for the x- and y-axis, respectively. Add these layers to the current plot such that the x-axis is labeled \"Age (days)\" and the y-axis is labeled \"Mean expression\".
  2. Use the ggtitle layer to add a plot title of your choice.
  3. Add the following new layer to the code chunk theme(plot.title=element_text(hjust=0.5)).
    • What does it change?
    • How many theme() layers can be added to a ggplot code chunk, in your estimation?

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3/basic_plots_in_r/","title":"Plotting and data visualization in R (basics)","text":"

Approximate time: 45 minutes

"},{"location":"day_3/basic_plots_in_r/#basic-plots-in-r","title":"Basic plots in R","text":"

R has a number of built-in tools for basic graph types such as histograms, scatter plots, bar charts, boxplots and much more. Rather than going through all of the different types, we will focus on plot(), a generic function for plotting x-y data.

To get a quick view of the different things you can do with plot, let's use the example() function:

example(\"plot\")\n

Here, you will find yourself having to press <Return> so you can scroll through the different types of graphs generated by plot. Take note of the different parameters used with each command and how that affects the aesthetics of the plot.

dev.off() \n# this means \"device off\"; we will go over what this does at the end of this section. \n# For now, it makes it so that when we draw plots they show up where they are supposed to.\n
"},{"location":"day_3/basic_plots_in_r/#scatterplot","title":"Scatterplot","text":"

For some hands-on practice we are going to use plot to draw a scatter plot and obtain a graphical view of the relationship between two sets of continuous numeric data. From our new_metadata file we will take the samplemeans column and plot it against age_in_days, to see how mean expression changes with age.

Now our metadata has all the information to draw a scatterplot. The base R function to do this is plot(y ~ x, data):

plot(samplemeans ~ age_in_days, data=new_metadata)\n

Each point represents a sample. The values on the y-axis correspond to the average expression for each sample, which is dependent on the x-axis variable age_in_days. This plot is in its simplest form; we can customize many features of the plot (fonts, colors, axes, titles) through graphic options.

For example, let's start by giving our plot a title and renaming the axes. We can do that by simply adding the options xlab, ylab and main as arguments to the plot() function:

plot(samplemeans ~ age_in_days, data=new_metadata, main=\"Expression changes with age\", xlab=\"Age (days)\", \n    ylab=\"Mean expression\")\n

We can also change the shape of the data point using the pch option and the size of the data points using cex (specifying the amount to magnify relative to the default).

plot(samplemeans ~ age_in_days, data=new_metadata, main=\"Expression changes with age\", xlab=\"Age (days)\", \n    ylab=\"Mean expression\", pch=\"*\", cex=2.0)\n

We can also add some color to the data points on the plot by adding col=\"blue\". Alternatively, you can sub in any of the default colors or you can experiment with other R packages to fiddle with better palettes.

We can also add color to separate the data points by information in our data frame. For example, suppose we wanted the data points colored by celltype. We would need to specify a vector of colours and provide the factor by which we are separating samples. The first level in our factor vector (which by default is assigned alphabetically) gets assigned the first color that we list. So in this case, blue corresponds to celltype A samples and green corresponds to celltype B.

plot(samplemeans ~ age_in_days, data=new_metadata, main=\"Expression changes with age\", xlab=\"Age (days)\", \n    ylab=\"Mean expression\", pch=\"*\", cex=2.0, col=c(\"blue\", \"green\")[celltype])\n

The last thing this plot needs is a figure legend describing the color scheme. It would be great if it created one for you by default, but with R base functions unfortunately it is not that easy. To draw a legend on the current plot, you need to run a new function called legend() and specify the appropriate arguments. The code to do so is provided below. Don't worry if it seems confusing, we plan on showing you a much more intuitive way of plotting your data.

legend(\"topleft\", pch=\"*\", col=c(\"blue\", \"green\"), c(\"A\", \"B\"), cex=0.8,\n    title=\"Celltype\")\n

Exercise

  1. Change the color scheme in the scatterplot, such that it reflects the genotype of samples rather than celltype.

  2. Use R help to find out how to increase the size of the text on the axis labels.

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

"},{"location":"day_3/basic_plots_in_r/#other-types-of-plots-in-base-r","title":"Other Types of Plots in Base R","text":"

NOTE: we will not run these in class, but the code is provided if you are interested in exploring more on your own.

"},{"location":"day_3/basic_plots_in_r/#barplot","title":"Barplot","text":"

Barplots are useful for comparing a quantitative (numeric) variable between groups or categories. A barplot is a good way to compare the samplemeans (numeric variable) across samples: we can use barplot to draw a single bar representing each sample, with the height indicating the average expression level.

?barplot\n# note that there is no \"data=\" argument for barplot()\n
Similar to the scatterplot, we can use additional arguments to specify the aesthetics that we want to change. For example, changing axis labeling and adding some color.
barplot(new_metadata$samplemeans, names.arg=c(1:12), horiz=TRUE, col=c(\"darkblue\", \"red\")[new_metadata$genotype]) \n

"},{"location":"day_3/basic_plots_in_r/#histogram","title":"Histogram","text":"

If we are interested in an overall distribution of numerical data, a histogram is what we'd want. To plot a histogram of the data use the hist command:

hist(new_metadata$samplemeans)\n
Again, there are many options that we can change by modifying the default parameters. Let's color in the bars, remove the borders, and tidy up the axis labels:
hist(new_metadata$samplemeans, xlab=\"Mean expression level\", main=\"\", col=\"darkgrey\", border=FALSE) \n

"},{"location":"day_3/basic_plots_in_r/#boxplot","title":"Boxplot","text":"

Using additional sample information from our metadata, we can use plots to compare values between different factor levels or categories. For example, we can compare the sample means across celltypes 'typeA' and 'typeB' using a boxplot.

# Boxplot\nboxplot(samplemeans~celltype, data=new_metadata)\n

"},{"location":"day_3_exercise/D3.1e_Custom_Functions_ggplot2/","title":"Custom functions for consistent plots","text":"

Approximate time: 20 minutes

"},{"location":"day_3_exercise/D3.1e_Custom_Functions_ggplot2/#learning-objectives","title":"Learning Objectives","text":"
  • Apply the custom function to generate consistent plots.
"},{"location":"day_3_exercise/D3.1e_Custom_Functions_ggplot2/#consistent-formatting-using-custom-functions","title":"Consistent formatting using custom functions","text":"

When publishing, it is helpful to ensure all plots have similar formatting. To do this we can create a custom function with our preferences for the theme. Remember the structure of a function is:

name_of_function <- function(arguments) {\n    statements or code that does something\n}\n

Now, let's suppose we always wanted our theme to include the following:

theme_bw() +\ntheme(axis.title=element_text(size=rel(1.5))) +\ntheme(plot.title=element_text(size=rel(1.5), hjust=0.5))\n

Note

You can also combine multiple arguments within the same theme() function:

theme_bw() +\ntheme(axis.title=element_text(size=rel(1.5)), plot.title=element_text(size=rel(1.5), hjust=0.5))\n

If there is nothing that we want to change when we run this, then we do not need to specify any arguments. Creating the function is simple; we can just put the code inside the {}:

personal_theme <- function(){\n  theme_bw() +\n  theme(axis.title=element_text(size=rel(1.5))) +\n  theme(plot.title=element_text(size=rel(1.5), hjust=0.5))\n}\n

Now to run our personal theme with any plot, we can use this function in place of the lines of theme() code:

ggplot(new_metadata) +\n  geom_point(aes(x=age_in_days, y=samplemeans, color=genotype, shape=celltype), size=rel(3.0)) +\n  xlab(\"Age (days)\") +\n  ylab(\"Mean expression\") +\n  ggtitle(\"Expression with Age\") +\n  personal_theme()\n
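
If you do want some call-time flexibility, the same function can take an argument with a default value. Here is a small sketch, where title_size is a made-up argument name:

personal_theme <- function(title_size = 1.5){\n  theme_bw() +\n  theme(axis.title = element_text(size = rel(title_size))) +\n  theme(plot.title = element_text(size = rel(title_size), hjust = 0.5))\n}\n

Calling personal_theme() with no arguments behaves exactly as before, while personal_theme(title_size = 2) would enlarge the titles further.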

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/","title":"Plotting and data visualization in R","text":"

Approximate time: 60 minutes

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#learning-objectives","title":"Learning Objectives","text":"
  • Generate the box plot using ggplot2
"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#generating-a-boxplot-with-ggplot2","title":"Generating a Boxplot with ggplot2","text":"

A boxplot provides a graphical view of the distribution of data based on a five number summary:

  • The top and bottom of the box represent the (1) first and (2) third quartiles (25th and 75th percentiles, respectively).

  • The line inside the box represents the (3) median (50th percentile).

  • The whiskers extending above and below the box represent the (4) maximum and (5) minimum values that are not outliers.

Note

In this case, outliers are determined using the interquartile range (IQR), which is defined as Q3 - Q1. Any value more than 1.5 x IQR below Q1 or above Q3 is considered an outlier and is represented as a point below or above the whiskers.

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#1-boxplot","title":"1. Boxplot!","text":"

Generate a boxplot using the data in the new_metadata dataframe. Create a ggplot2 code chunk with the following instructions:

  1. Use the geom_boxplot() layer to plot the differences in sample means between the Wt and KO genotypes.
  2. Use the fill aesthetic to look at differences in sample means between the celltypes within each genotype.
  3. Add a title to your plot.
  4. Add labels, 'Genotype' for the x-axis and 'Mean expression' for the y-axis.
  5. Make the following theme() changes:
    • Use the theme_bw() function to make the background white.
    • Change the size of your axes labels to 1.25x larger than the default.
    • Change the size of your plot title to 1.5x larger than default.
    • Center the plot title.

After running the above code the boxplot should look something like that provided below.

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#2-changing-the-order-of-genotype-on-the-boxplot","title":"2. Changing the order of genotype on the Boxplot","text":"

Let's say you wanted to have the \"Wt\" boxplots displayed first on the left side, and \"KO\" on the right. How might you go about doing this?

To do this, your first question should be:

How does ggplot2 determine what to place where on the X-axis?

  • By default, the genotype values are placed on the X axis in alphabetical order.

  • To change it, you need to make sure that the genotype column is a factor.

  • And, the factor levels for that column are in the order you want on the X-axis

  • Factor the new_metadata$genotype column without creating any extra variables/objects and change the levels to c(\"Wt\", \"KO\")

  • Re-run the boxplot code chunk you created for the \"Boxplot!\" exercise above.

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#3-changing-default-colors","title":"3. Changing default colors","text":"

You can color the boxplot differently by using some specific layers:

  1. Add a new layer scale_color_manual(values=c(\"purple\",\"orange\")).
    • Do you observe a change?
  2. Replace scale_color_manual(values=c(\"purple\",\"orange\")) with scale_fill_manual(values=c(\"purple\",\"orange\")).
    • Do you observe a change?
    • In the scatterplot we drew in class, add a new layer scale_color_manual(values=c(\"purple\",\"orange\")), do you observe a difference?
    • What do you think is the difference between scale_color_manual() and scale_fill_manual()?
  3. Back in your boxplot code, change the colors in the scale_fill_manual() layer to be your 2 favorite colors.
    • Are there any colors that you tried that did not work?

We have a separate lesson about using color palettes from the package RColorBrewer, if you are interested.

You are not restricted to specifying colors by writing them out as character vectors of color names. You have the choice of a lot of colors in R, and you can also specify them using their hexadecimal code. For example, \"#FF0000\" would be red and \"#00FF00\" would be green; similarly, \"#FFFFFF\" would be white and \"#000000\" would be black. Click here for more information about color palettes in R.

OPTIONAL Exercise:

  • Find the hexadecimal code for your 2 favourite colors (from exercise 3 above) and replace the color names with the hexadecimal codes within the ggplot2 code chunk.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3_exercise/D3.3e_exporting_data_and_plots/","title":"Saving data and plots to file","text":"

Approximate time: 30 minutes

"},{"location":"day_3_exercise/D3.3e_exporting_data_and_plots/#learning-objectives","title":"Learning Objectives","text":"
  • Describe how to export data tables and plots for use outside of the R environment.
"},{"location":"day_3_exercise/D3.3e_exporting_data_and_plots/#writing-data-to-file","title":"Writing data to file","text":"

Everything we have done so far has only modified the data in R; the files have remained unchanged. Whenever we want to save our datasets to file, we need to use a write function in R.

To write our matrix to file in comma separated format (.csv), we can use the write.csv function. There are two required arguments: the variable name of the data structure you are exporting, and the path and filename that you are exporting to. By default the delimiter (column separator) is set to a comma, so columns will be comma-separated:

# Save a data frame to file\nwrite.csv(sub_meta, file=\"data/subset_meta.csv\")\n

Oftentimes the output is not exactly what you might want. You can modify the output using the arguments for the function. We can explore the arguments using the ? help operator; this can help elucidate how each argument adjusts the output.

?write.csv\n

Similar to reading in data, there are a wide variety of functions available allowing you to export data in specific formats. Another commonly used function is write.table, which allows you to specify the delimiter or separator you wish to use. This function is commonly used to create tab-delimited files.

Note

Sometimes when writing a data frame that has row names to file with write.table(), the column names will be shifted so that they align starting with the row names column. To avoid this, you can include the argument col.names = NA when writing to file to ensure all of the column names line up with the correct column values.
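
Putting those two pieces together, a minimal sketch of a tab-delimited export (the file name here is illustrative):

# Write a tab-delimited file; col.names = NA keeps the header aligned with the row names column\nwrite.table(sub_meta, file = \"data/subset_meta.txt\", sep = \"\\t\", col.names = NA)\n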

Writing a vector of values to file requires a different function than the functions available for writing dataframes. You can use write() to save a vector of values to file. For example:

# Save a vector to file\nwrite(glengths, file=\"data/genome_lengths.txt\")\n

If we wanted the vector to be output to a single column instead of five, we could explore the arguments:

?write\n

Note that the ncolumns argument defaults to five columns unless otherwise specified, so to get a single column:

# Save a vector to file as a single column\nwrite(glengths, file=\"data/genome_lengths.txt\", ncolumns = 1)\n
"},{"location":"day_3_exercise/D3.3e_exporting_data_and_plots/#exporting-figures-to-file","title":"Exporting figures to file","text":"

There are two ways in which figures and plots can be output to a file (rather than simply displaying on screen).

  1. The first (and easiest) is to export directly from the RStudio 'Plots' panel, by clicking on Export when the image is plotted. This will give you the option of png or pdf and lets you select the directory to which you wish to save the image. It will also give you options to dictate the size and resolution of the output image.

  2. The second option is to use R functions and have the write-to-file step hard-coded into your script. This would allow you to run the script from start to finish and automate the process (not requiring human point-and-click actions to save). In R\u2019s terminology, output is directed to a particular output device, and that dictates the output format that will be produced. A device must be created or \u201copened\u201d in order to receive graphical output and, for devices that create a file on disk, the device must also be closed in order to complete the output.

If we wanted to print our scatterplot to a pdf file format, we would need to initialize a plot using a function which specifies the graphical format we intend on creating, i.e. pdf(), png(), tiff(), etc. Within the function you will need to specify a name for your image, and optionally the width and height. This will open up the device that you wish to write to:

## Open device for writing\npdf(\"figures/scatterplot.pdf\")\n

If you wish to modify the size and resolution of the image you will need to add in the appropriate parameters as arguments to the function when you initialize. Then we plot the image to the device, using the ggplot scatterplot that we just created.

## Make a plot which will be written to the open device, in this case the temp file created by pdf()/png()\nggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype), size=rel(3.0)) \n

Finally, close the \"device\", or file, using the dev.off() function. There are also bmp, tiff, and jpeg functions, though the jpeg function has proven less stable than the others.

## Closing the device is essential to save the temporary file created by pdf()/png()\ndev.off()\n
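
As a sketch of controlling size and resolution, here is the same workflow with png(); the dimensions and resolution below are arbitrary examples:

## Open a png device with an explicit size and resolution\npng(\"figures/scatterplot.png\", width = 6, height = 6, units = \"in\", res = 300)\n\nggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,\n            shape = celltype), size = rel(3.0))\n\n## Close the device to write the file to disk\ndev.off()\n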

Notes

  1. You will not be able to open and look at your file using standard methods (Adobe Acrobat or Preview etc.) until you execute the dev.off() function.
  2. In the case of pdf(), if you had made additional plots before closing the device, they will all be stored in the same file, with each plot usually getting its own page, unless otherwise specified.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3_exercise/D3.4e_finding_help/","title":"Troubleshooting and finding help","text":"

Approximate time: 30 min

"},{"location":"day_3_exercise/D3.4e_finding_help/#learning-objectives","title":"Learning Objectives","text":"
  • Identify different R-specific external sources to help with troubleshooting errors and obtaining more information about functions and packages.
"},{"location":"day_3_exercise/D3.4e_finding_help/#asking-for-help","title":"Asking for help","text":"

The key to getting help from someone is for them to grasp your problem rapidly. You should make it as easy as possible to pinpoint where the issue might be.

  1. Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.

  2. Always include the output of sessionInfo() as it provides critical information about your platform, the versions of R and the packages that you are using, and other information that can be very helpful to understand your problem.

    sessionInfo()  #This time it is not interchangeable with search()\n
  3. If possible, reproduce the problem using a very small data.frame instead of your 50,000 rows and 10,000 columns one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question.

    • To share an object with someone else, you can provide the raw file (i.e., your CSV file) along with your script up to the point of the error (after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data.frame, you can save any other R data structure that you have in your environment to a file:

      DO NOT RUN THIS

      # DO NOT RUN THIS!\n\nsave(iris, file=\"/tmp/iris.RData\")\n

      The content of this .RData file is not human readable and cannot be posted directly on stackoverflow. It can, however, be emailed to someone who can read it with this command:

      DO NOT RUN THIS

      # DO NOT RUN THIS!\n\nload(file=\"~/Downloads/iris.RData\")\n
"},{"location":"day_3_exercise/D3.4e_finding_help/#where-to-ask-for-help","title":"Where to ask for help?","text":"
  • Google is often your best friend for finding answers to specific questions regarding R.
    • Cryptic error messages are very common in R - it is very likely that someone else has encountered this problem already! Start by googling the error message. However, this doesn't always work because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. \"subscript out of bounds\").
  • Stackoverflow: Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: http://stackoverflow.com/questions/tagged/r. If your question hasn't been answered before and is well crafted, chances are you will get an answer in less than 5 min.
  • Your friendly colleagues: if you know someone with more experience than you, they might be able and willing to help you.
  • The R-help mailing list: it is read by a lot of people (including most of the R core team), and a lot of people post to it, but the tone can be pretty dry and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast, but don't expect it to come with smiley faces. Here, more than anywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to your misuse of words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
  • The Bioconductor support site. This is very useful and if you tag your post, there is a high likelihood of getting an answer from the developer.
  • If your question is about a specific package, see if there is a mailing list for it. Usually it's included in the DESCRIPTION file of the package that can be accessed using packageDescription(\"name-of-package\"). You may also want to try to email the author of the package directly.
  • There are also some topic-specific mailing lists (GIS, phylogenetics, etc.); the complete list is here.
"},{"location":"day_3_exercise/D3.4e_finding_help/#more-resources","title":"More resources","text":"
  • The Posting Guide for the R mailing lists.
  • How to ask for R help, a post with useful guidelines
  • The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
  • The R FAQ is dense and technical but it is full of useful information.

Exercise

  1. Run the following code chunks and fix all of the errors. (Note: The code chunks are independent from one another.)

    # Create vector of work days\nwork_days <- c(Monday, Tuesday, Wednesday, Thursday, Friday)\n
    # Create a function to round the output of the sum function\nround_the_sum <- function(x){\n        return(round(sum(x))\n}\n
    # Create a function to add together three numbers\nadd_numbers <- function(x,y,z){\n        sum(x,y,z)\n}\n\nadd_numbers(5,9)\n
  2. You try to install a package and you get the following error message:

Error message

Error: package or namespace load failed for 'Seurat' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]): there is no package called 'multtest'\n

What would you do to remedy the error?

  3. You would like to ask for help on an online forum. To do this you want the users of the forum to reproduce your problem, so you want to provide them with as much relevant information and data as possible.

    • You want to provide them with the list of packages that you currently have loaded, the version of R, your OS and package versions. Use the appropriate function(s) to obtain this information.
    • You want to also provide a small data frame that reproduces the error (if working with a large data frame, you'll need to subset it down to something small). For this exercise use the data frame df, and save it as an RData object called df.RData.
    • What code should the people looking at your help request use to read in df.RData?

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3_exercise/D3.5e_tidyverse/","title":"Tidyverse data wrangling","text":"

Approximate time: 75 minutes

"},{"location":"day_3_exercise/D3.5e_tidyverse/#learning-objectives","title":"Learning Objectives","text":"
  • Perform basic data wrangling with functions in the Tidyverse package.
"},{"location":"day_3_exercise/D3.5e_tidyverse/#data-wrangling-with-tidyverse","title":"Data Wrangling with Tidyverse","text":"

The Tidyverse suite of integrated packages are designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as specific functions for data wrangling with the dplyr package and data visualization with the ggplot2 package.

"},{"location":"day_3_exercise/D3.5e_tidyverse/#tidyverse-basics","title":"Tidyverse basics","text":"

The Tidyverse suite of packages introduces users to a set of data structures, functions and operators to make working with data more intuitive, but is slightly different from the way we do things in base R. Two important new concepts we will focus on are pipes and tibbles.

Before we get started with pipes or tibbles, let's load the library:

library(tidyverse)\n
"},{"location":"day_3_exercise/D3.5e_tidyverse/#pipes","title":"Pipes","text":"

Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.

To make R code more human readable, the Tidyverse tools use the pipe, %>%, which was acquired from the magrittr package and is now part of the dplyr package that is installed automatically with Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.

Note

The keyboard shortcut to write the pipe is Shift + Cmd + M (macOS) or Shift + Ctrl + M (Windows/Linux)

An example of using the pipe to run multiple commands:

## A single command\nsqrt(83)\n\n## Base R method of running more than one command\nround(sqrt(83), digits = 2)\n\n## Running more than one command with piping\nsqrt(83) %>% round(digits = 2)\n

The pipe represents a much easier way of writing and deciphering R code, and so we will be taking advantage of it, when possible, as we work through the remaining lesson.

Exercise

  1. Create a vector of random numbers using the code below:

    random_numbers <- c(81, 90, 65, 43, 71, 29)\n
  2. Use the pipe (%>%) to perform two steps in a single line:

    • Take the mean of random_numbers using the mean() function.
    • Round the output to three digits using the round() function.
"},{"location":"day_3_exercise/D3.5e_tidyverse/#tibbles","title":"Tibbles","text":"

A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have numbers/symbols for column names, which is not normally allowed in base R.

Important: the tidyverse is very opinionated about row names. These packages insist that all columns of a data.frame be treated equally, and that the special designation of a column as rownames should be deprecated. The tibble package provides simple utility functions to handle row names: rownames_to_column() and column_to_rownames().

Tibbles can be created directly using the tibble() function or data frames can be converted into tibbles using as_tibble(name_of_df).

Note

The function as_tibble() will ignore row names, so if a column representing the row names is needed, then the function rownames_to_column(name_of_df) should be run prior to turning the data.frame into a tibble. Also, as_tibble() will not coerce character vectors to factors by default.
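As a minimal sketch (assuming the tidyverse is loaded, and using the built-in mtcars data.frame, whose row names are car models):

# Preserve the row names as a column, then convert to a tibble\nmtcars_tb <- mtcars %>%\n  rownames_to_column(\"car\") %>%\n  as_tibble()\n\n# The reverse operation restores the row names\nmtcars_df <- mtcars_tb %>% column_to_rownames(\"car\")\n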

"},{"location":"day_3_exercise/D3.5e_tidyverse/#experimental-data","title":"Experimental data","text":"

We're going to explore the Tidyverse suite of tools to wrangle our data to prepare it for visualization. You should have downloaded the file called gprofiler_results_Mov10oe.tsv into your R project's data folder earlier.

Note

If you do not have the gprofiler_results_Mov10oe.tsv file in your data folder, you can right click and download it into the data folder using this link.

The dataset:

  • Represents the functional analysis results, including the biological processes, functions, pathways, or conditions that are over-represented in a given list of genes.
  • Our gene list was generated by differential gene expression analysis and the genes represent differences between control mice and mice over-expressing a gene involved in RNA splicing.

The functional analysis that we will focus on involves gene ontology (GO) terms, which:

  • describe the roles of genes and gene products
  • are organized into three controlled vocabularies/ontologies (domains):
    • biological processes (BP)
    • cellular components (CC)
    • molecular functions (MF)

"},{"location":"day_3_exercise/D3.5e_tidyverse/#analysis-goal-and-workflow","title":"Analysis goal and workflow","text":"

Goal: Visually compare the most significant biological processes (BP) based on the number of associated differentially expressed genes (gene ratios) and significance values by creating the following plot:

To wrangle our data in preparation for the plotting, we are going to use the Tidyverse suite of tools to wrangle and visualize our data through several steps:

  1. Read in the functional analysis results
  2. Extract only the GO biological processes (BP) of interest
  3. Select only the columns needed for visualization
  4. Order by significance (p-adjusted values)
  5. Rename columns to be more intuitive
  6. Create additional metrics for plotting (e.g. gene ratios)
  7. Plot results
"},{"location":"day_3_exercise/D3.5e_tidyverse/#tidyverse-tools","title":"Tidyverse tools","text":"

While all of the tools in the Tidyverse suite are deserving of being explored in more depth, we are going to investigate more deeply the reading (readr), wrangling (dplyr), and plotting (ggplot2) tools.

"},{"location":"day_3_exercise/D3.5e_tidyverse/#1-read-in-the-functional-analysis-results","title":"1. Read in the functional analysis results","text":"

While the base R packages have perfectly fine methods for reading in data, the readr and readxl Tidyverse packages offer additional methods for reading in data. Let's read in our tab-delimited functional analysis results using read_delim():

# Read in the functional analysis results\nfunctional_GO_results <- read_delim(file = \"data/gprofiler_results_Mov10oe.tsv\", delim = \"\\t\" )\n\n# Take a look at the results\nfunctional_GO_results\n
Click here to see how to do this in base R

Read in the functional analysis results

functional_GO_results <- read.delim(file = \"data/gprofiler_results_Mov10oe.tsv\", sep = \"\\t\" )\n
Take a look at the results
functional_GO_results\n

Notice that the results were automatically read in as a tibble and the output gives the number of rows, columns and the data type for each of the columns.

Note

A large number of tidyverse functions will work with both tibbles and dataframes, and the data structure of the output will match that of the input. However, some functions will return a tibble (without row names), regardless of whether a tibble or a dataframe is provided.
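For example, you can inspect the classes of the object returned by read_delim() (the exact classes shown may vary with your readr version):

# A tibble is also a data.frame, as reflected by its classes\nclass(functional_GO_results)\n# e.g. \"spec_tbl_df\" \"tbl_df\" \"tbl\" \"data.frame\"\n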

"},{"location":"day_3_exercise/D3.5e_tidyverse/#2-extract-only-the-go-biological-processes-bp-of-interest","title":"2. Extract only the GO biological processes (BP) of interest","text":"

Now that we have our data, we will need to wrangle it into a format ready for plotting. For all of our data wrangling steps we will be using tools from the dplyr package, a Swiss Army knife for wrangling data frames.

To extract the biological processes of interest, we only want those rows where the domain is equal to BP, which we can do using the filter() function.

To filter rows of a data frame/tibble based on values in different columns, we give a logical expression as input to the filter() function to return those rows for which the expression is TRUE.

Now let's return only those rows that have a domain of BP:

# Return only GO biological processes\nbp_oe <- functional_GO_results %>%\n  filter(domain == \"BP\")\n\nView(bp_oe)\n
Click here to see how to do this in base R

Return only GO biological processes

idx <- functional_GO_results$domain == \"BP\"\nbp_oe2 <- functional_GO_results[idx,]\n\nView(bp_oe2)\n

Now we have returned only those rows with a domain of BP. How have the dimensions of our results changed?
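One way to check is to compare the dimensions before and after filtering:

# Compare the dimensions of the original and filtered results\ndim(functional_GO_results)\ndim(bp_oe)\n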

Exercise

We would like to perform an additional round of filtering to only keep the most specific GO terms.

  1. For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4.
  2. Save output to overwrite our bp_oe variable.
"},{"location":"day_3_exercise/D3.5e_tidyverse/#3-select-only-the-columns-needed-for-visualization","title":"3. Select only the columns needed for visualization","text":"

For visualization purposes, we are only interested in the columns related to the GO terms, the significance of the terms, and information about the number of genes associated with the terms.

To extract columns from a data frame/tibble we can use the select() function. In contrast to base R, we do not need to put the column names in quotes for selection.

# Selecting columns to keep\nbp_oe <- bp_oe %>%\n  select(term.id, term.name, p.value, query.size, term.size, overlap.size, intersection)\n\nView(bp_oe)\n
Click here to see how to do this in base R

Selecting columns to keep

bp_oe <- bp_oe[, c(\"term.id\", \"term.name\", \"p.value\", \"query.size\", \"term.size\", \"overlap.size\", \"intersection\")]\n\nView(bp_oe)\n

The select() function also allows for negative selection, so we could alternatively have removed the unwanted columns. Note that we need to put the column names inside the combine (c()) function with a - preceding it for this functionality.

DO NOT RUN

# DO NOT RUN\n# Selecting columns to remove\nbp_oe <- bp_oe %>%\n    select(-c(query.number, significant, recall, precision, subgraph.number, relative.depth, domain))\n
Click here to see how to do this in base R

DO NOT RUN

#Selecting columns to remove\nidx <- !(colnames(bp_oe) %in% c(\"query.number\", \"significant\", \"recall\", \"precision\", \"subgraph.number\", \"relative.depth\", \"domain\"))\nbp_oe <- bp_oe[, idx]\n

"},{"location":"day_3_exercise/D3.5e_tidyverse/#4-order-go-processes-by-significance-adjusted-p-values","title":"4. Order GO processes by significance (adjusted p-values)","text":"

Now that we have only the rows and columns of interest, let's arrange these by significance, which is denoted by the adjusted p-value.

Let's sort the rows by adjusted p-value with the arrange() function.

# Order by adjusted p-value ascending\nbp_oe <- bp_oe %>%\n  arrange(p.value)\n
Click here to see how to do this in base R

Order by adjusted p-value ascending

idx <- order(bp_oe$p.value)\nbp_oe <- bp_oe[idx,]\n

Note

If you wanted to arrange in descending order, then you could have run the following instead:

DO NOT RUN

# DO NOT RUN\n# Order by adjusted p-value descending\nbp_oe <- bp_oe %>%\narrange(desc(p.value))\n
Click here to see how to do this in base R

DO NOT RUN

# Do not run\n# Order by adjusted p-value descending\nidx <- order(bp_oe$p.value, decreasing = TRUE)\nbp_oe <- bp_oe[idx,]\n

Note

Ordering variables in ggplot2 is a bit different. This post introduces a few ways of ordering variables in a plot.

"},{"location":"day_3_exercise/D3.5e_tidyverse/#5-rename-columns-to-be-more-intuitive","title":"5. Rename columns to be more intuitive","text":"

While not necessary for our visualization, renaming columns more intuitively can help with our understanding of the data using the rename() function. The syntax is new_name = old_name.

Let's rename the term.id and term.name columns.

# Provide better names for columns\nbp_oe <- bp_oe %>% \n  dplyr::rename(GO_id = term.id, \n                GO_term = term.name)\n
Click here to see how to do this in base R
# Provide better names for columns\ncolnames(bp_oe)[colnames(bp_oe) == \"term.id\"] <- \"GO_id\"\ncolnames(bp_oe)[colnames(bp_oe) == \"term.name\"] <- \"GO_term\"\n

Note

In the case of two packages with identical function names, you can use :: with the package name before and the function name after (e.g. stats::filter()) to ensure that the correct function is used. The :: syntax can also be used to call a function from a package without loading it first.

In the example above, we wanted to use the rename() function specifically from the dplyr package, and not any of the other packages (or base R) which may have the rename() function.
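As a short sketch of both uses (assuming the dplyr package is installed; presidents and iris are built-in datasets):

# Explicitly call the stats version of filter() (a moving average over a time series)\nstats::filter(presidents, rep(1/3, 3))\n\n# Call dplyr's rename() without loading the package first\ndplyr::rename(head(iris), sepal_length = Sepal.Length)\n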

Exercise

Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.

"},{"location":"day_3_exercise/D3.5e_tidyverse/#6-create-additional-metrics-for-plotting-eg-gene-ratios","title":"6. Create additional metrics for plotting (e.g. gene ratios)","text":"

Finally, before we plot our data, we need to create a couple of additional metrics. The mutate() function enables you to create a new column from an existing column.

Let's generate gene ratios to reflect the number of DE genes associated with each GO process relative to the total number of DE genes.

# Create gene ratio column based on other columns in dataset\nbp_oe <- bp_oe %>%\n  mutate(gene_ratio = overlap.size / query.size)\n
Click here to see how to do this in base R
# Create gene ratio column based on other columns in dataset\nbp_oe <- cbind(bp_oe, gene_ratio = bp_oe$overlap.size / bp_oe$query.size)\n

Exercise

Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size)

Our final data for plotting should look like the table below:

"},{"location":"day_3_exercise/D3.5e_tidyverse/#next-steps","title":"Next steps","text":"

Now that we have our results ready for plotting, we can use the ggplot2 package to plot our results. If you are interested, you can follow this lesson and dive into how to use ggplot2 to create the plots with this dataset.

"},{"location":"day_3_exercise/D3.5e_tidyverse/#additional-resources","title":"Additional resources","text":"
  • R for Data Science
  • teach the tidyverse
  • tidy style guide

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4/D4.1_in_class_exercises/","title":"Day 4 Activities","text":"
  1. Change the animals data frame to a tibble called animals_tb. Save the row names to a column called animal_names before turning it into a tibble.

  2. Use ggplot2 to plot the animal names (x-axis) versus the speed of the animal (y-axis) in animals_tb using a scatterplot. Customize the plot to display as shown below.

  3. We decide that our plot would look better with the animal names ordered from slowest to fastest. Using the animals_tb tibble, reorder the animals on the x-axis to start with the slowest animal on the left-hand side of the plot to the fastest animal on the right-hand side of the plot by completing the following steps:

    a. Use the arrange() function to order the rows by speed from slowest to fastest. Then use the pull() function to extract the animal_names column as a vector of character values. Save the new variable as names_ordered_by_speed.

    b. Turn the animal_names column of animals_tb into a factor and specify the levels as names_ordered_by_speed from slowest to fastest (the output of part a). Note: this step is crucial, because ggplot2 uses the factor levels as the plotting order, rather than the row order of the data frame (see the sketch after this list).

    c. Re-plot the scatterplot with the animal names in order from slowest to fastest.

    Note

    If you are interested in exploring other ways to reorder a variable in ggplot2, refer to this post.

  4. Save the plot as a PDF called animals_by_speed_scatterplot.pdf to the results folder.

  5. Use the functions from the dplyr package to perform the following tasks:

    a. Extract the rows of the animals_tb tibble with color of gray or tan, order the rows from slowest to fastest speed, and save to a variable called animals_gray_tan.

    b. Save animals_gray_tan as a comma-separated value file called animals_tb_ordered.csv to the results folder.
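A minimal sketch of the reordering idea from activity 3 (assuming animals_tb has the columns animal_names and speed):

# Get the animal names ordered from slowest to fastest\nnames_ordered_by_speed <- animals_tb %>%\n  arrange(speed) %>%\n  pull(animal_names)\n\n# Set the factor levels to control the plotting order in ggplot2\nanimals_tb$animal_names <- factor(animals_tb$animal_names, levels = names_ordered_by_speed)\n\n# Re-plot: the x-axis now follows the factor levels\nggplot(animals_tb) +\n  geom_point(aes(x = animal_names, y = speed))\n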

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/","title":"Introduction to R practice","text":""},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/#creating-vectorsfactors-and-dataframes","title":"Creating vectors/factors and dataframes","text":"
  1. We are performing RNA-Seq on cancer samples being treated with three different types of treatment (A, B, and P). You have 12 samples total, with 4 replicates per treatment. Write the R code you would use to construct your metadata table as described below.

    • Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process go faster try exploring the rep() function).
    • Put them together into a dataframe called meta.
    • Use the rownames() function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process go faster try exploring the paste() function).

    Your finished metadata table should have information for the variables sex, stage, treatment, and myc levels:

              sex  stage  treatment   myc
    sample1    M     I        A      2343
    sample2    F    II        A       457
    sample3    M    II        A      4593
    sample4    F     I        A      9035
    sample5    M    II        B      3450
    sample6    F    II        B      3524
    sample7    M     I        B       958
    sample8    F    II        B      1053
    sample9    M    II        P      8674
    sample10   F     I        P      3424
    sample11   M    II        P       463
    sample12   F    II        P      5105
"},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/#subsetting-vectorsfactors-and-dataframes","title":"Subsetting vectors/factors and dataframes","text":"
  1. Using the meta data frame from question #1, write out the R code you would use to perform the following operations (questions DO NOT build upon each other):

    • return only the treatment and sex columns using []:
    • return the treatment values for samples 5, 7, 9, and 10 using []:
    • use filter() to return all data for those samples receiving treatment P:
    • use filter()/select()to return only the stage and treatment columns for those samples with myc > 5000:
    • remove the treatment column from the dataset using []:
    • remove samples 7, 8 and 9 from the dataset using []:
    • keep only samples 1-6 using []:
    • add a column called pre_treatment to the beginning of the dataframe with the values T, F, F, F, T, T, F, T, F, F, T, T (Hint: use cbind()):
    • change the names of the columns to: \"A\", \"B\", \"C\", \"D\":
"},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/#extracting-components-from-lists","title":"Extracting components from lists","text":"
  1. Create a new list, list_hw, with three components: the glengths vector, the dataframe df, and the number value. Use this list to answer the questions below. list_hw has the following structure (NOTE: the components of this list are not currently named):
    [[1]]\n[1]   4.6  3000.0 50000.0 \n\n[[2]]\n     species  glengths \n1    ecoli    4.6\n2    human    3000.0\n3    corn     50000.0\n\n[[3]]\n[1] 8\n
    Write out the R code you would use to perform the following operations (questions DO NOT build upon each other):
  2. return the second component of the list:
  3. return 50000.0 from the first component of the list:
  4. return the value human from the second component:
  5. give the components of the list the following names: \"genome_lengths\", \"genomes\", \"record\":
"},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/#creating-figures-with-ggplot2","title":"Creating figures with ggplot2","text":"
  1. Create the same plot as above using ggplot2 using the provided metadata and counts datasets. The metadata table describes an experiment that you have setup for RNA-seq analysis, while the associated count matrix gives the normalized counts for each sample for every gene. Download the count matrix and metadata using the links provided.

    Follow the instructions below to build your plot. Write the code you used and provide the final image.

    • Read in the metadata file using: meta <- read.delim(\"Mov10_full_meta.txt\", sep=\"\\t\", row.names=1)

    • Read in the count matrix file using: data <- read.delim(\"normalized_counts.txt\", sep=\"\\t\", row.names=1)

    • Create a vector called expression that contains the normalized count values from the row in normalized_counts that corresponds to the MOV10 gene.

    • Check the class of this expression vector. Then, convert it to a numeric vector using as.numeric(expression)

    • Bind that vector to your metadata data frame (meta) and call the new data frame df.

    • Create a ggplot by constructing the plot line by line:

      • Initialize a ggplot with your df as input.

      • Add the geom_jitter() geometric object with the required aesthetics which are x and y.

      • Color the points based on sampletype

      • Add the theme_bw() layer

      • Add the title \"Expression of MOV10\" to the plot

      • Change the x-axis label to be blank

      • Change the y-axis label to \"Normalized counts\"

      • Using theme() change the following properties of the plot:

        • Remove the legend (Hint: use ?theme help and scroll down to legend.position)

        • Change the plot title size to 1.5x the default and center align

        • Change the axis title to 1.5x the default size

        • Change the size of the axis text only on the y-axis to 1.25x the default size

        • Rotate the x-axis text to 45 degrees using axis.text.x=element_text(angle=45, hjust=1)

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/","title":"Day1 Homework Answer Key","text":""},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/#day-1-homework-exercises","title":"Day 1 Homework Exercises","text":""},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/#r-syntax-and-data-structures","title":"R syntax and data structures","text":"
# 1. Try changing the value of the variable `x` to 5. What happens to `number`?\n\nx <- 5\n\n# 2. Now try changing the value of variable `y` to contain the value 10. What do you need to do, to update the variable `number`?\n\ny <- 10\n\nnumber <- x + y\n\n#3. Try to create a vector of numeric and character values by combining the two vectors that we just created (`glengths` and `species`). Assign this combined vector to a new variable called `combined`. \n\n## Hint: you will need to use the combine `c()` function to do this. Print the `combined` vector in the console, what looks different compared to the original vectors?\n\ncombined <- c(glengths, species)\n\n#4. Let's say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.\n\n## a. Create a vector named `samplegroup` with nine elements: 3 control (\"CTL\") values, 3 knock-out (\"KO\") values, and 3 over-expressing (\"OE\") values.\n\nsamplegroup <- c(\"CTL\", \"CTL\", \"CTL\", \"KO\", \"KO\", \"KO\", \"OE\", \"OE\", \"OE\")\n\n## b. Turn `samplegroup` into a factor data structure.\n\nsamplegroup <- factor(samplegroup)\n\n# 5. Create a data frame called `favorite_books` with the following vectors as columns:\n\ntitles <- c(\"Catch-22\", \"Pride and Prejudice\", \"Nineteen Eighty Four\")\npages <- c(453, 432, 328)\nfavorite_books <- data.frame(titles, pages)\n\n# 6. Create a list called `list2` containing `species`, `glengths`, and `number`.\nlist2 <- list(species, glengths, number)\n
"},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/#functions-and-arguments","title":"Functions and arguments","text":"
# 1. Let's use base R function to calculate **mean** value of the `glengths` vector. You might need to search online to find what function can perform this task.\nmean(glengths)\n\n# 2. Create a new vector `test <- c(1, NA, 2, 3, NA, 4)`. Use the same base R function from exercise 1 (with addition of proper argument), and calculate mean value of the `test` vector. The output should be `2.5`.\n#   *NOTE:* In R, missing values are represented by the symbol `NA` (not available). It\u2019s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. There are ways to ignore `NA` during statistical calculations, or to remove `NA` from the vector. More information related to missing data can be found at this link -> https://www.statmethods.net/input/missingdata.html.\ntest <- c(1, NA, 2, 3, NA, 4)\nmean(test, na.rm=TRUE)\n\n# 3. Another commonly used base function is `sort()`. Use this function to sort the `glengths` vector in **descending** order.\nsort(glengths, decreasing = TRUE)\n\n# 4. Write a function called `multiply_it`, which takes two inputs: a numeric value `x`, and a numeric value `y`. The function will return the product of these two numeric values, which is `x * y`. For example, `multiply_it(x=4, y=6)` will return output `24`.\nmultiply_it <- function(x,y) {\n  product <- x * y\n  return(product)\n}\n
"},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/#reading-in-and-inspecting-data","title":"Reading in and inspecting data","text":"
# 1. Download this tab-delimited .txt file and save it in your project\u2019s data folder.\n#       i. Read it in to R using read.table() and store it as the variable proj_summary, keeping in mind that: \n#               a. all the columns have column names \n#               b. you want the first column to be used as rownames (hint: look up the row.names = argument)\n#       ii. Display the contents of proj_summary in your console\nproj_summary <- read.table(file = \"data/project-summary.txt\", header = TRUE, row.names = 1)\n\n# 2. Use the class() function on glengths and metadata, how does the output differ between the two?\nclass(glengths)\nclass(metadata)\n\n# 3. Use the summary() function on the proj_summary dataframe\n#       i. What is the median rRNA_rate?\n#       ii. How many samples got the \u201clow\u201d level of treatment\nsummary(proj_summary)\n\n# 4. How long is the samplegroup factor?\nlength(samplegroup)\n\n# 5. What are the dimensions of the proj_summary dataframe?\ndim(proj_summary)\n\n# 6. When you use the rownames() function on metadata, what is the data structure of the output?\nstr(rownames(metadata))\n\n# 7. How many elements in (how long is) the output of colnames(proj_summary)? Don\u2019t count, but use another function to determine this.\nlength(colnames(proj_summary))\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/","title":"Day2 Homework Answer Key","text":""},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#day-2-homework-exercises","title":"Day 2 Homework Exercises","text":""},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#data-wrangling","title":"Data wrangling","text":"
# 1. Extract only those elements in `samplegroup` that are not KO (*nesting the logical operation is optional*).\nidx <- samplegroup != \"KO\"\nsamplegroup[idx]\n\n# 2. Use the `samplegroup` factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.\nfactor(samplegroup, levels = c(\"KO\", \"CTL\", \"OE\"))\n\n### Packages and Libraries\n\n# 1. Install the tidyverse package (it is actually a suite of packages). NOTE: This suite of packages is only available in CRAN.\ninstall.packages(\"tidyverse\")\n\n# 2. Load the tidyverse library. Do you see anything unusual when it loads?\nlibrary(tidyverse)\n# Some functions from dplyr (part of the tidyverse suite) mask functions of the same name from the stats package. But that is fine! If you need the filter function from stats, you can type 'stats::filter()'\n\n# 3. Run sessionInfo().\nsessionInfo()\n
"},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#data-wrangling-data-frames-matrices-and-lists","title":"Data wrangling: data frames, matrices, and lists","text":"
# 1. Return the genotype and replicate column values for Sample2 and Sample8.\nmetadata[c(\"sample2\", \"sample8\"), c(\"genotype\", \"replicate\")] # or\nmetadata[c(2,8), c(1,3)]\n\n# 2. Return the fourth and ninth values of the replicate column.\nmetadata$replicate[c(4,9)] # or\nmetadata[c(4, 9), \"replicate\"]\n\n# 3. Extract the replicate column as a data frame.\nmetadata[, \"replicate\", drop = FALSE]\n\n# 4. Subset the metadata dataframe to return only the rows of data with a genotype of KO.\nidx <- which(metadata$genotype==\"KO\")\nmetadata[idx, ]\n\n# 5. Create a list named random with the following components: metadata, age, list1, samplegroup, and number.\nrandom <- list(metadata, age, list1, samplegroup, number)\n\n# 6. Extract the samplegroup component.\nrandom[[4]]\n\n# 7. Set names for the random list you created in the last exercise.\nnames(random) <- c(\"metadata\", \"age\", \"list1\", \"samplegroup\", \"number\")\n\n# 8. Extract the age component using the $ notation\nrandom$age\n
"},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#the-in-operator","title":"The %in% operator","text":"
# 1. Using the A and B vectors created above, evaluate each element in B to see if there is a match in A\nB %in% A\n\n# 2. Subset the B vector to only return those values that are also in A.\nB[B %in% A]\n\n# 3. We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the %in% operator from the rpkm_data data frame, instead of scrolling through rpkm_data and finding them manually.\n\n#       i. First, let\u2019s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:\n\n        important_genes <- c(\"ENSMUSG00000083700\", \"ENSMUSG00000080990\", \"ENSMUSG00000065619\", \"ENSMUSG00000047945\", \"ENSMUSG00000081010\", \"ENSMUSG00000030970\")\n\n#       ii. Use the %in% operator to determine if all of these genes are present in the row names of the rpkm_data data frame.\nimportant_genes %in% rownames(rpkm_data)\n\n#       iii. Extract the rows from rpkm_data that correspond to these 6 genes using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.\nidx <- rownames(rpkm_data) %in% important_genes\nans <- rpkm_data[idx, ]\nidx2 <- which(rownames(rpkm_data) %in% important_genes)\nans2 <- rpkm_data[idx2, ]\n\n#       iv. Bonus question: Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.\nans3 <- rpkm_data[important_genes, ]\n
"},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#reordering-and-matching","title":"Reordering and matching","text":"
# 1. Now that we know how to reorder using indices, let\u2019s try to use it to reorder the contents of one vector to match the contents of another. Let\u2019s create the vectors first and second as detailed below:\nfirst <- c(\"A\",\"B\",\"C\",\"D\",\"E\")\nsecond <- c(\"B\",\"D\",\"E\",\"A\",\"C\")  # same letters but different order\n\n#        How would you reorder the second vector to match first?\nsecond[c(4, 1, 5, 2, 3)]\n\n# 2. After talking with your collaborator, it becomes clear that sample2 and sample9 were actually from a different mouse background than the other samples and should not be part of our analysis. Create a new variable called subset_rpkm that has these columns removed from the rpkm_ordered data frame.\nsubset_rpkm <- rpkm_ordered[ , c(1,3:8,10:12)]  #or\nsubset_rpkm <- rpkm_ordered[ , -c(2,9)]\n\n# 3. Use the match() function to subset the metadata data frame so that the row names of the metadata data frame match the column names of the subset_rpkm data frame.  \nidx <- match(colnames(subset_rpkm), rownames(metadata))\nmetadata[idx, ]\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/","title":"Day3 Homework Answer Key","text":""},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#ggplot2-exercise","title":"ggplot2 exercise","text":""},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#creating-a-boxplot","title":"Creating a boxplot","text":"
#1. boxplot\nggplot(new_metadata) +\n  geom_boxplot(aes(x = genotype, y = samplemeans, fill = celltype)) +\n  ggtitle(\"Genotype differences in average gene expression\") +\n  xlab(\"Genotype\") +\n  ylab(\"Mean expression\") +\n  theme_bw() +\n  theme(axis.title = element_text(size = rel(1.25))) +\n  theme(plot.title=element_text(hjust = 0.5, size = rel(1.5)))\n\n#2. Changing the order of genotype\nnew_metadata$genotype <- factor(new_metadata$genotype, levels = c(\"Wt\", \"KO\"))\n\n#3. Changing default colors\n\n#Add a new layer scale_color_manual(values=c(\"purple\",\"orange\")).\n#Do you observe a change?\n    ## No\n\n#Replace scale_color_manual(values=c(\"purple\",\"orange\")) with scale_fill_manual(values=c(\"purple\",\"orange\")).\n#Do you observe a change?\n    ## Yes\n\n#In the scatterplot we drew in class, add a new layer scale_color_manual(values=c(\"purple\",\"orange\")), do you observe a difference?\n    ## Yes\n\n#What do you think is the difference between scale_color_manual() and scale_fill_manual()?\n    ## At first glance, it appears that scale_color_manual() works with the scatter plot, and scale_fill_manual() with the box plot\n    ## \n    ## Actually, scale_color_manual() works if the \"color\" argument is used, whereas scale_fill_manual() works if the \"fill\" argument is used\n\n\n## Boxplot using \"color\" instead of \"fill\"\nggplot(new_metadata) +\n  geom_boxplot(aes(x = genotype, y = samplemeans, color = celltype)) +\n  ggtitle(\"Genotype differences in average gene expression\") +\n  xlab(\"Genotype\") +\n  ylab(\"Mean expression\") +\n  theme_bw() +\n  theme(axis.title = element_text(size = rel(1.25))) +\n  theme(plot.title=element_text(hjust = 0.5, size = rel(1.5))) +\n  scale_color_manual(values=c(\"purple\",\"orange\"))\n\n\n#Back in your boxplot code, change the colors in the scale_fill_manual() layer to be your 2 favorite colors.\n#Are there any colors that you tried that did not work?\n\n  ggplot(new_metadata) +\n  geom_boxplot(aes(x = genotype, y = samplemeans, fill = celltype)) +\n  ggtitle(\"Genotype differences in average gene expression\") +\n  xlab(\"Genotype\") +\n  ylab(\"Mean expression\") +\n  theme_bw() +\n  theme(axis.title = element_text(size = rel(1.25))) +\n  theme(plot.title=element_text(hjust = 0.5, size = rel(1.5))) +\n  scale_fill_manual(values=c(\"red\", \"blue\"))\n\n#OPTIONAL Exercise:\n#Find the hexadecimal code for your 2 favourite colors (from exercise 3 above) and replace the color names with the hexadecimal codes within the ggplot2 code chunk.\nscale_fill_manual(values=c(\"#FF3333\", \"#3333FF\"))\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#finding-help","title":"Finding help","text":"

Exercises Run the following code chunks and fix all of the errors. (Note: The code chunks are independent from one another.)

"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#create-vector-of-work-days","title":"Create vector of work days","text":"
#work_days <- c(Monday, Tuesday, Wednesday, Thursday, Friday)\nwork_days <- c(\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\")\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#create-a-function-to-round-the-output-of-the-sum-function","title":"Create a function to round the output of the sum function","text":"
#round_the_sum <- function(x){\n#  return(round(sum(x))\n#}\nround_the_sum <- function(x){\n  return(round(sum(x)))\n}\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#create-a-function-to-add-together-three-numbers","title":"Create a function to add together three numbers","text":"
#add_numbers(5,9)  # error: argument \"z\" is missing, with no default\nadd_numbers <- function(x,y,z){\n  sum(x,y,z)\n}\nadd_numbers(5,9,6)\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#you-try-to-install-a-package-and-you-get-the-following-error-message","title":"You try to install a package and you get the following error message:","text":"

Error

Error: package or namespace load failed for 'Seurat' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]): there is no package called 'multtest'\n

What would you do to remedy the error?

"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#install-multtest-first-and-then-install-seurat-package","title":"Install multtest first, and then install seurat package:","text":"

BiocManager::install('multtest')\ninstall.packages('Seurat')\n
You would like to ask for help on an online forum. To do this you want the users of the forum to reproduce your problem, so you want to provide them as much relevant information and data as possible.

You want to provide them with the list of packages that you currently have loaded, the version of R, your OS and package versions. Use the appropriate function(s) to obtain this information.

sessionInfo()\n

You want to also provide a small data frame that reproduces the error (if working with a large data frame, you'll need to subset it down to something small). For this exercise use the data frame df, and save it as an RData object called df.RData.

save(df, file = \"data/df.RData\")\n# What code should the people looking at your help request should use to read in df.RData?\nload(file=\"data/df.RData\")\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#tidyverse","title":"Tidyverse","text":"

Create a vector of random numbers using the code below:

random_numbers <- c(81, 90, 65, 43, 71, 29)\n

Use the pipe (%>%) to perform two steps in a single line. Take the mean of random_numbers using the mean() function.

random_numbers %>% mean()\n
Round the output to three digits using the round() function.
random_numbers %>% \n  mean() %>% \n  round(digits = 3)\n
We would like to perform an additional round of filtering to only keep the most specific GO terms. For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4. Save output to overwrite our bp_oe variable.
bp_oe <- bp_oe %>% \n  filter(relative.depth > 4)\n

Using Base R

# bp_oe <- subset(bp_oe, relative.depth > 4)\n

Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.

bp_oe <- bp_oe %>% \n  dplyr::rename(genes = intersection)\n
Using Base R
colnames(bp_oe)[colnames(bp_oe) == \"intersection\"] <- \"genes\"\n

Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size)

bp_oe <- bp_oe %>% \n  mutate(term_percent = overlap.size / term.size)\n
Using Base R

bp_oe <- cbind(bp_oe, term_percent = bp_oe$overlap.size / bp_oe$term.size)\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/","title":"Day4 Intro to R Answer Key","text":""},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#homework-answer-key-introduction-to-r-practice","title":"Homework answer key - Introduction to R practice","text":""},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#creating-vectorsfactors-and-dataframes","title":"Creating vectors/factors and dataframes","text":"
  1. We are performing RNA-Seq on cancer samples being treated with three different types of treatment (A, B, and P). You have 12 samples total, with 4 replicates per treatment. Write the R code you would use to construct your metadata table as described below.

    • Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process go faster try exploring the rep() function).
    sex <- c(\"M\", \"F\",...) # saved vectors/factors as variables and used c() or rep() function to create\n
    • Put them together into a dataframe called meta.

    meta <- data.frame(sex, stage, treatment, myc) # used data.frame() to create the table\n
    • Use the rownames() function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process go faster try exploring the paste() function).

    rownames(meta) <- c(\"sample1\", \"sample2\",... , \"sample12\") # or use:\n\nrownames(meta) <- paste(\"sample\", 1:12, sep=\"\")\n

    Your finished metadata table should have information for the variables sex, stage, treatment, and myc levels:

              sex  stage  treatment   myc
    sample1    M     I        A      2343
    sample2    F    II        A       457
    sample3    M    II        A      4593
    sample4    F     I        A      9035
    sample5    M    II        B      3450
    sample6    F    II        B      3524
    sample7    M     I        B       958
    sample8    F    II        B      1053
    sample9    M    II        P      8674
    sample10   F     I        P      3424
    sample11   M    II        P       463
    sample12   F    II        P      5105
"},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#subsetting-vectorsfactors-and-dataframes","title":"Subsetting vectors/factors and dataframes","text":"
  1. Using the meta data frame from question #1, write out the R code you would use to perform the following operations (questions DO NOT build upon each other):

    • return only the treatment and sex columns using []:
    meta[ , c(1,3)]\n
    • return the treatment values for samples 5, 7, 9, and 10 using []:
    meta[c(5,7,9,10), 3]\n
    • use filter() to return all data for those samples receiving treatment P:
    filter(meta, treatment == \"P\")\n
    • use filter()/select() to return only the stage and treatment data for those samples with myc > 5000:
    filter(meta, myc > 5000) %>% select(stage, treatment)\n
    • remove the treatment column from the dataset using []:
    meta[, -3]\n
    • remove samples 7, 8 and 9 from the dataset using []:
    meta[-7:-9, ]\n
    • keep only samples 1-6 using []:
    meta[1:6, ]\n
    • add a column called pre_treatment to the beginning of the dataframe with the values T, F, F, F, T, T, F, T, F, F, T, T (Hint: use cbind()):
    pre_treatment <- c(T, F, F, F, T, T, F, T, F, F, T, T)\n\ncbind(pre_treatment, meta)\n
    • change the names of the columns to: \"A\", \"B\", \"C\", \"D\":
    colnames(meta) <- c(\"A\", \"B\", \"C\", \"D\")\n
"},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#extracting-components-from-lists","title":"Extracting components from lists","text":"
  1. Create a new list, list_hw with three components, the glengths vector, the dataframe df, and number value. Use this list to answer the questions below . list_hw has the following structure (NOTE: the components of this list are not currently named):

    [[1]]\n[1]   4.6  3000.0 50000.0 \n\n[[2]]\n          species  glengths \n     1    ecoli    4.6\n     2    human    3000.0\n     3    corn     50000.0\n\n[[3]]\n[1] 8\n
    Write out the R code you would use to perform the following operations (questions DO NOT build upon each other):
    • return the second component of the list:

    list_hw[[2]]\n
    • return 50000.0 from the first component of the list:
    list_hw[[1]][3]\n
    • return the value human from the second component:
    list_hw[[2]][2, 1]\n
    • give the components of the list the following names: \"genome_lengths\", \"genomes\", \"record\":
    names(list_hw) <- c(\"genome_lengths\",\"genomes\",\"record\")\n\nlist_hw$record\n
"},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#creating-figures-with-ggplot2","title":"Creating figures with ggplot2","text":"
  1. Create the same plot as above using ggplot2 using the provided metadata and counts datasets. The metadata table describes an experiment that you have setup for RNA-seq analysis, while the associated count matrix gives the normalized counts for each sample for every gene. Download the count matrix and metadata using the links provided.

Follow the instructions below to build your plot. Write the code you used and provide the final image.

  • Read in the metadata file using: meta <- read.delim(\"Mov10_full_meta.txt\", sep=\"\\t\", row.names=1)

  • Read in the count matrix file using: data <- read.delim(\"normalized_counts.txt\", sep=\"\\t\", row.names=1)

  • Create a vector called expression that contains the normalized count values from the row in data that corresponds to the MOV10 gene.

expression <- data[\"MOV10\", ]\n
  • Check the class of this expression vector: it is a data.frame.

Then, you will need to convert it to a numeric vector using as.numeric(expression)

class(expression)\n\nexpression <- as.numeric(expression)\n\nclass(expression)\n
  • Bind that vector to your metadata data frame (meta) and call the new data frame df.
df <- cbind(meta, expression) #or\n\ndf <- data.frame(meta, expression)\n
  • Create a ggplot by constructing the plot line by line:

    • Initialize a ggplot with your df as input.

    • Add the geom_jitter() geometric object with the required aesthetics

    • Color the points based on sampletype

    • Add the theme_bw() layer

    • Add the title \"Expression of MOV10\" to the plot

    • Change the x-axis label to be blank

    • Change the y-axis label to \"Normalized counts\"

    • Using theme() change the following properties of the plot:

      • Remove the legend (Hint: use ?theme help and scroll down to legend.position)

      • Change the plot title size to 1.5x the default and center align

      • Change the axis title to 1.5x the default size

      • Change the size of the axis text only on the y-axis to 1.25x the default size

      • Rotate the x-axis text to 45 degrees using axis.text.x=element_text(angle=45, hjust=1)

    ggplot(df) +\n     geom_jitter(aes(x= sampletype, y= expression, color = sampletype)) +\n     theme_bw() +\n     ggtitle(\"Expression of MOV10\") +\n     xlab(NULL) +\n     ylab(\"Normalized counts\") +\n     theme(legend.position = \"none\",\n          plot.title=element_text(hjust=0.5, size=rel(1.5)),\n          axis.text.y=element_text(size=rel(1.25)),\n          axis.title=element_text(size=rel(1.5)),\n          axis.text.x=element_text(angle=45, hjust=1))\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"Introduction to R Audience Computational skills required Duration Biologists None 4-session online workshop (~ 8 hours of trainer-led time)"},{"location":"#description","title":"Description","text":"

This repository has teaching materials for a hands-on Introduction to R workshop taught online. The workshop will introduce participants to the basics of R and RStudio. R is a simple programming environment that enables the effective handling of data, while providing excellent graphical support. RStudio is a tool that provides a user-friendly environment for working with R. These materials are intended to provide both basic R programming knowledge and its application for increasing efficiency for data analysis.

Note for Trainers

The schedule linked below assumes that learners will spend between 2-3 hours on reading through, and completing exercises from selected lessons between classes. The online component of the workshop focuses on more exercises and discussion.

"},{"location":"#learning-objectives","title":"Learning Objectives","text":"
  1. R syntax:

    Familiarize yourself with the basic syntax and the use of RStudio.

  2. Data types and data structures:

    Describe frequently-used data types and data structures in R.

  3. Data inspection and wrangling:

    Demonstrate the utilization of functions and indices to inspect and subset data from various data structures.

  4. Data visualization:

    Apply the ggplot2 package to create plots for data visualization.

"},{"location":"#setup-requirements","title":"Setup Requirements","text":"

Download the most recent version of R and RStudio for the appropriate OS following the links below.

R software download

RStudio download

All the data files used in the lessons are linked within, but can also be accessed through the link below.

Dataset download

"},{"location":"#lessons","title":"Lessons","text":"
  • Trainer led workshop Click here

  • Self learning materials Click here

Attribution & Citation

  • These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived from work that is Copyright \u00a9 Data Carpentry. All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0)

  • To cite material from this course in your publications, please use:

    Meeta Mistry, Mary Piper, Jihe Liu, & Radhika Khetani. (2021, May 5). hbctraining/Intro-to-R-flipped: R workshop first release. Zenodo. https://doi.org/10.5281/zenodo.4739342

  • A lot of time and effort went into the preparation of these materials. Citations help us understand the needs of the community, gain recognition for our work, and attract further funding to support our teaching activities. Thank you for citing this material if it helped you in your data analysis.

"},{"location":"Workshop_Schedule/","title":"Workshop Schedule","text":"Workshop Schedule"},{"location":"Workshop_Schedule/#day-1","title":"Day 1","text":"Lesson Overview Instructor Time Workshop Introduction Welcome and housekeeping Will 10:00-10:30 Intro to R and RStudio Introduction to R and RStudio Noor 10:30-11:45 Self learning materials Overview of self-learning materials Will 11:45-12:00"},{"location":"Workshop_Schedule/#before-the-next-class","title":"Before the next class","text":"

A. Please study the contents and work through all the code within the following lessons.

B. Complete the exercises:

  • Each lesson above contains exercises; please go through each of them.

  • Copy over your solutions into the Google Form (using the submit link below) by the day before the next class.

Questions?

If you get stuck due to an error while running code in the lesson, email us

  • 1. R Syntax and Data Structure

    About data types and data structure

    In order to utilize R effectively, you will need to understand what types of data you can use in R and also how you can store data in \"objects\" or \"variables\".

    This lesson will cover:

    • Assigning a value to an object

    • The types of information you can store in R

    • The different objects that you can use to store data in R

  • 2. Functions and Arguments

    Functions and Arguments in R

    Functions are the basic \"commands\" used in R to get something done. To use a function (denoted by function_name followed by \"()\"), one has to enter some information within the parentheses, and optionally some arguments to change the default behavior of the function.

    You can also create your own functions! When you want to perform a task or a series of tasks more than once, creating a custom function is the best way to go.

    In this lesson you will explore:

    • Using built-in functions

    • Creating your own custom functions

  • 3. Reading in and inspecting data

    Read and inspect data structures in R

    When using R, it is almost a certainty that you will have to bring data into the R environment.

    In this lesson you will learn:

    • Reading different types (formats) of data

    • Inspecting the contents and structure of the dataset once you have read it in

  • Submit here:

    Submit a day before the next class.

"},{"location":"Workshop_Schedule/#day-2","title":"Day 2","text":"Lesson Overview Instructor Time Review self-learning Questions about self-learning All 10:00-10:50 In-class exercises Use and customize function and arguments Noor 10:50-11:15 Data Wrangling Subsetting Vectors and Factors Will 11:15-12:00"},{"location":"Workshop_Schedule/#before-the-next-class_1","title":"Before the next class","text":"

A. Please study the contents and work through all the code within the following lessons.

B. Complete the exercises:

  • Each lesson above contains exercises; please go through each of them.

  • Copy over your solutions into the Google Form (using the submit link below) by the day before the next class.

Questions?

If you get stuck due to an error while running code in the lesson, email us

  • 1. Packages and libraries

    Installing and loading packages in R

    Base R is incredibly powerful, but it cannot do everything. R has been built to encourage community involvement in expanding functionality. Thousands of supplemental add-ons, also called \"packages\", have been contributed by the community. Each package comprises several functions that enable users to perform their desired analysis.

    This lesson will cover:

    • Descriptions of package repositories

    • Installing a package

    • Loading a package

    • Accessing the documentation for your installed packages and getting help

  • 2. Data wrangling: data frames, matrices, and lists

    Subset, merge, and create new datasets

    In class we covered data wrangling (extracting/subsetting) information from single-dimensional objects (vectors, factors). The next step is to learn how to wrangle data in two-dimensional objects.

    This lesson will cover:

    • Examining and extracting values from two-dimensional data structures using indices, row names, or column names

    • Retrieving information from lists

  • 3. The %in% operator

    %in% operator, any and all functions

    Very often you will have to compare two vectors to figure out if, and which, values are common between them. The %in% operator can be used for this purpose.

    This lesson will cover:

    • Implementing the %in% operator to evaluate two vectors

    • Distinguishing %in% from == and other logical operators

    • Using any() and all() functions

  • 4. Reordering and matching

    Ordering of vectors and data frames

    Sometimes you will want to rearrange values within a vector (row names or column names). The match() function can be very powerful for this task.

    This lesson will cover:

    • Manually rearranging values within a vector

    • Implementing the match() function to automatically rearrange the values within a vector

  • 5. Data frame for plotting

    Learn about map() function for iterative tasks

    We will be starting with visualization in the next class. To set up for this, you need to create a new metadata data frame with information from the counts data frame. You will need to use a function over every column within the counts data frame iteratively. You could do that manually, but it is error-prone; the map() family of functions makes this more efficient.

    This lesson will cover:

    • Utilizing map_dbl() to take the average of every column in a data frame

    • Briefly discuss other functions within the map() family of functions

    • Create a new data frame for plotting

  • Submit here

    Submit a day before the next class.

Prepare for in-class exercise:

  • Download the data and place the file into the data directory.
Data Download link Animal data Right click & Save link as...
  • Read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.

  • Save the R project when you close RStudio.

"},{"location":"Workshop_Schedule/#day-3","title":"Day 3","text":"Lesson Overview Instructor Time Review self-learning Questions about self-learning All 10:00-10:35 In-class exercises Customizing functions and arguments Will 10:50-11:15 Plotting with ggplot2 ggplot2 for data visualization Noor 11:15-12:00"},{"location":"Workshop_Schedule/#before-the-next-class_2","title":"Before the next class","text":"
A. Please study the contents and work through all the code within the following lessons.

B. Complete the exercises:

  • Each lesson above contains exercises; please go through each of them.

  • Copy over your solutions into the Google Form (using the submit link below) by the day before the next class.

Questions?

If you get stuck due to an error while running code in the lesson, email us

  • 1. Custom functions for plots

    Consistent formats for plotting

    When creating your plots in ggplot2 you may want to have consistent formatting (using theme() functions) across your plots, e.g. if you are generating plots for a manuscript.

    This lesson will cover:

    • Developing a custom function for creating consistently formatted plots
  • 2. Boxplot with ggplot2

    Customizing boxplots with ggplot2

    Previously, you created a scatterplot using ggplot2. However, ggplot2 can be used to create a very wide variety of plots. One of the other frequently used plots you can create with ggplot2 is a boxplot.

    This lesson will cover:

    • Creating and customizing a boxplot using ggplot2
  • 3. Exporting files and plots

    Writing files and plots in different formats

    Now that you have completed some analysis in R, you will need to eventually export that work out of R/RStudio. R provides lots of flexibility in what and how you export your data and plots.

    This lesson will cover:

    • Exporting your figures from R using a variety of file formats

    • Writing your data from R to a file

  • 4. Finding help

    How to best look for help

    Hopefully, this course has given you the basic tools you need to be successful when using R. However, it would be impossible to cover every aspect of R and you will need to be able to troubleshoot future issues as they arise.

    This lesson will cover:

    • Suggestions for how to best ask for help

    • Where to look for help

  • 5. Tidyverse

    Data wrangling within Tidyverse

    The Tidyverse suite of integrated packages is designed to work together to make common data science operations more user friendly. Tidyverse is becoming increasingly prevalent, and it is necessary that R users are conversant in the basics of Tidyverse. We have already used two Tidyverse packages in this workshop (ggplot2 and purrr), and in this lesson we will learn some key features from a few additional packages that make up Tidyverse.

    This lesson will cover:

    • Usage of pipes for connecting together multiple commands

    • Tibbles for two-dimensional data storage

    • Data wrangling within Tidyverse

  • Submit here

    Submit a day before the next class.

"},{"location":"Workshop_Schedule/#day-4","title":"Day 4","text":"Lesson Overview Instructor Time Review self-learning Questions about self-learning All 10:00-10:35 In-class exercises In class exercises Will 10:50-11:15 Discussion Q&A Noor 11:15 - 11:45 Wrap Up Wrap up and checking out Noor 11:45 - 12:00"},{"location":"Workshop_Schedule/#additional-exercises-and-answer-keys","title":"Additional exercises and answer keys","text":"
  • Final Exercises
Answer Keys
  • Answer Keys Day 1
  • Answer Keys Day 2
  • Answer Keys Day 3
  • Answer Keys Final exercise
"},{"location":"Workshop_Schedule/#additional-resources","title":"Additional resources","text":"
  • Building on the basic R knowledge

    • DGE workshop
    • Single-cell RNA-seq workshop
    • RMarkdown
    • Functional analysis
    • More ggplot2
    • ggplot2 cookbook
    • Running R and Rstudio on O2
  • Resources

    • Online learning resources
    • All hbctraining materials

    Cheatsheets

    • base R cheatsheet
    • RStudio cheatsheet
    • ggplot2 cheatsheet

Attribution & Citation

  • These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Some materials used in these lessons were derived from work that is Copyright \u00a9 Data Carpentry. All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0)

  • To cite material from this course in your publications, please use:

    Meeta Mistry, Mary Piper, Jihe Liu, & Radhika Khetani. (2021, May 5). hbctraining/Intro-to-R-flipped: R workshop first release. Zenodo. https://doi.org/10.5281/zenodo.4739342

  • A lot of time and effort went into the preparation of these materials. Citations help us understand the needs of the community, gain recognition for our work, and attract further funding to support our teaching activities. Thank you for citing this material if it helped you in your data analysis.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/","title":"Introduction to R and RStudio","text":"

Approximate time: 45 minutes

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#learning-objectives","title":"Learning Objectives","text":"
  • Describe what R and RStudio are.
  • Interact with R using RStudio.
  • Become familiar with the various components of RStudio.
  • Employ variables in R.
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#what-is-r","title":"What is R?","text":"

A common misconception is that R is just a programming language, but in fact it is much more than that. Think of R as an environment for statistical computing and graphics, which brings together a number of features to provide powerful functionality.

The R environment combines:

  • effective handling of big data
  • collection of integrated tools
  • graphical facilities
  • simple and effective programming language
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#why-use-r","title":"Why use R?","text":"

R is a powerful, extensible environment. It has a wide range of statistics and general data analysis and visualization capabilities.

  • Data handling, wrangling, and storage
  • Wide array of statistical methods and graphical techniques available
  • Easy to install on any platform and use (and it\u2019s free!)
  • Open source with a large and growing community of peers

Examples of R used in the media and science:

  • \"At the BBC data team, we have developed an R package and an R cookbook to make the process of creating publication-ready graphics in our in-house style...\" - BBC Visual and Data Journalism cookbook for R graphics

  • \"R package of data and code behind the stories and interactives at FiveThirtyEight.com, a data-driven journalism website founded by Nate Silver (initially began as a polling aggregation site, but now covers politics, sports, science and pop culture) and owned by ESPN...\" - fivethirtyeight Package

  • Single Cell RNA-seq Data analysis with Seurat

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#what-is-rstudio","title":"What is RStudio?","text":"

RStudio is a freely available, open-source integrated development environment (IDE). RStudio provides an environment with many features to make using R easier and is a great alternative to working on R in the terminal.

  • Graphical user interface, not just a command prompt
  • Great learning tool
  • Free for academic use
  • Platform agnostic
  • Open source
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#creating-a-new-project-directory-in-rstudio","title":"Creating a new project directory in RStudio","text":"

Let's create a new project directory for our Introduction to R lesson today.

  1. Open RStudio.
  2. Go to the File menu and select New Project.
  3. In the New Project window, choose New Directory. Then, choose New Project. Name your new directory Intro-to-R and then \"Create the project as subdirectory of:\" the Desktop (or location of your choice).
  4. Click on Create Project.

  5. After your project is created, if the project does not automatically open in RStudio, then go to the File menu, select Open Project, and choose Intro-to-R.Rproj.
  6. When RStudio opens, you will see three panels in the window.
  7. Go to the File menu and select New File, and select R Script.
  8. Go to the File menu and select Save As..., type Intro-to-R.R and select Save.

The RStudio interface should now look like the screenshot below.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#what-is-a-project-in-rstudio","title":"What is a project in RStudio?","text":"

It is simply a directory that contains everything related to your analyses for a specific project. RStudio projects are useful when you are working on context-specific analyses and you wish to keep them separate. When creating a project in RStudio you associate it with a working directory of your choice (either an existing one, or a new one). A .Rproj file is created within that directory, and it keeps track of your command history and variables in the environment. The .Rproj file can be used to open the project in its current state at a later date.

When a project is (re)opened within RStudio the following actions are taken:

  • A new R session (process) is started
  • The .RData file in the project's main directory is loaded, populating the environment with any objects that were present when the project was closed.
  • The .Rhistory file in the project's main directory is loaded into the RStudio History pane (and used for Console Up/Down arrow command history).
  • The current working directory is set to the project directory.
  • Previously edited source documents are restored into editor tabs
  • Other RStudio settings (e.g. active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed.

Information adapted from RStudio Support Site

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#rstudio-interface","title":"RStudio Interface","text":"

The RStudio interface has four main panels:

  1. Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio.
  2. Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console.
  3. Environment/History: environment shows all active objects and history keeps track of all commands run in console
  4. Files/Plots/Packages/Help
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#organizing-and-setting-up-rstudio","title":"Organizing and Setting up RStudio","text":""},{"location":"day_1/D1.2_introR-R-and-RStudio/#viewing-your-working-directory","title":"Viewing your working directory","text":"

Before we organize our working directory, let's check to see where our current working directory is located by typing into the console:

getwd()\n

Your working directory should be the Intro-to-R folder constructed when you created the project. The working directory is where RStudio will automatically look for any files you bring in and where it will automatically save any files you create, unless otherwise specified.

You can visualize your working directory by selecting the Files tab from the Files/Plots/Packages/Help window.

If you wanted to choose a different directory to be your working directory, you could navigate to a different folder in the Files tab, then, click on the More dropdown menu which appears as a Cog and select Set As Working Directory.
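
The same can also be done from the console with setwd(); the path below is just an example and should be adjusted to your own setup:

# Set the working directory from the console (example path)\nsetwd(\"~/Desktop/Intro-to-R\")\n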

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#structuring-your-working-directory","title":"Structuring your working directory","text":"

To organize your working directory for a particular analysis, you should separate the original data (raw data) from intermediate datasets. For instance, you may want to create a data/ directory within your working directory that stores the raw data, and have a results/ directory for intermediate datasets and a figures/ directory for the plots you will generate.

Let's create these three directories within your working directory by clicking on New Folder within the Files tab.
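
If you prefer the console, dir.create() achieves the same result (a minimal sketch, assuming your working directory is the project folder):

# Create the three directories from the console\ndir.create(\"data\")\ndir.create(\"results\")\ndir.create(\"figures\")\n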

When finished, your working directory should look like:

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#setting-up","title":"Setting up","text":"

This is more of a housekeeping task. We will be writing long lines of code in our script editor, and we want to make sure that the lines wrap so that you don't have to scroll back and forth to read your code.

Click on Edit at the top of your RStudio screen and click on Preferences... in the pull down menu.

On the left, select Code and put a check against Soft-wrap R source files. Make sure you click the Apply button at the bottom of the window before clicking OK.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#interacting-with-r","title":"Interacting with R","text":"

Now that we have our interface and directory structure set up, let's start playing with R! There are two main ways of interacting with R in RStudio: using the console, or using the script editor (plain text files that contain your code).

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#console-window","title":"Console window","text":"

The console window (in RStudio, the bottom left panel) is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. You can type commands directly into the console, but they will be forgotten when you close the session.

Let's test it out:

3 + 5\n

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#script-editor","title":"Script editor","text":"

Best practice is to enter the commands in the script editor, and save the script. You are encouraged to comment liberally to describe the commands you are running using #. This way, you have a complete record of what you did, you can easily show others how you did it and you can do it again later on if needed.

Now let's try entering commands to the script editor and using the comments character # to add descriptions and run the code chunk.

# Intro to R Lesson\n# Feb 16th, 2016\n# Interacting with R\n\n## I am adding 3 and 5. R is fun!\n3+5\n

The RStudio script editor allows you to 'send' the current line or the currently highlighted text to the R console by clicking on the Run button in the upper right-hand corner of the script editor.

Alternatively, you can run by simply pressing the Ctrl and Return/Enter keys at the same time as a shortcut.

You should see the command run in the console and output the result.


What happens if we do that same command without the comment symbol #? Re-run the command after removing the # sign in the front:

I am adding 3 and 5. R is fun!\n3+5\n

Error

Error: unexpected symbol in \"I am\"\n

Now R is trying to run that sentence as a command, and it doesn't work. We get an error message in the console. It means the R interpreter did not know what to do with that command.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#console-command-prompt","title":"Console command prompt","text":"

Interpreting the command prompt can help you understand when R is ready to accept commands. Below are the different states of the command prompt and how you can escape a command:

  • > : the console is ready to accept commands. When the console receives a command (typed directly into the console, or run from the script editor with Ctrl+Enter), R will try to execute it.

  • + : the console is waiting for you to enter more data; you haven't finished entering a complete command. Often this is because you haven't 'closed' a parenthesis or quotation.

  • ESC : escapes the current command and brings back a new > prompt. If you are in RStudio and you can't figure out why your command isn't running, click inside the console window and press ESC.
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#keyboard-shortcuts-in-rstudio","title":"Keyboard shortcuts in RStudio","text":"

In addition to some of the shortcuts described earlier in this lesson, we have listed a few more that can be helpful as you work in RStudio.

  • Ctrl+Enter : run the command from the script editor in the console
  • ESC : escape the current command to return to the command prompt
  • Ctrl+1 : move cursor from console to script editor
  • Ctrl+2 : move cursor from script editor to console
  • Tab : complete a file path
  • Ctrl+Shift+C : comment the block of highlighted text

Exercise

Try highlighting only 3 + from your script editor and running it. Find a way to bring back the command prompt > in the console.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#the-r-syntax","title":"The R syntax","text":"

Now that we know how to talk with R via the script editor or the console, we want to use R for something more than adding numbers. To do this, we need to know more about the R syntax.

The main parts of speech in R (syntax) include:

  • The comments # and how they are used to document a function and its contents
  • variables and functions
  • The assignment operator <-
  • the = for arguments in functions

We will go through each of these parts of speech in more detail, starting with the assignment operator.

Note

Indentation and consistent spacing are used to improve clarity and legibility.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#assignment-operator","title":"Assignment operator","text":"

To do useful and interesting things in R, we need to assign values to variables using the assignment operator, <-. For example, we can use the assignment operator to assign the value of 3 to x by executing:

x <- 3\n

The assignment operator (<-) assigns values on the right to variables on the left.

Note

In RStudio, typing Alt + - (pressing Alt and the - key at the same time), or Option + - on a Mac, will write <- in a single keystroke.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#variables","title":"Variables","text":"

A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to \"buckets\", where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.

In the example above, we created a variable or a 'bucket' called x. Inside we put a value, 3.

Let's create another variable called y and give it a value of 5.

y <- 5\n

When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses, or by typing the variable name.

y\n
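
The parentheses approach looks like this (an equivalent, commonly used idiom):

(y <- 5)   # wrapping the assignment in parentheses assigns and prints in one step\n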

You can also view information on the variable by looking in your Environment window in the upper right-hand corner of the RStudio interface.

Now we can reference these buckets by name to perform mathematical operations on the values contained within. What do you get in the console for the following operation:

x + y\n

Try assigning the results of this operation to another variable called number.

number <- x + y\n

Exercise

  1. Try changing the value of the variable x to 5. What happens to number?
  2. Now try changing the value of variable y to contain the value 10. What do you need to do, to update the variable number?

Tips on variable names

Variables can be given almost any name, such as x, current_temperature, or subject_id. However, there are some rules / suggestions you should keep in mind:

  • Make your names explicit and not too long.
  • Avoid names starting with a number (2x is not valid but x2 is)
  • Avoid names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it's allowed, it's best to not use other function names (e.g., c, T, mean, data) as variable names. When in doubt check the help to see if the name is already in use.
  • Avoid dots (.) within a variable name as in my.dataset. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it's best to avoid them.
  • Use nouns for object names and verbs for function names
  • Keep in mind that R is case sensitive (e.g., genome_length is different from Genome_length)
  • Be consistent with the styling of your code (where you put spaces, how you name variables, etc.). In R, two popular style guides are Hadley Wickham's style guide and Google's.
"},{"location":"day_1/D1.2_introR-R-and-RStudio/#interacting-with-data-in-r","title":"Interacting with data in R","text":"

R is commonly used for handling big data, and so it only makes sense that we learn about R in the context of some kind of relevant data. Let's take a few minutes to add files to the folders we created and familiarize ourselves with the data.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#adding-files-to-your-working-directory","title":"Adding files to your working directory","text":"

You can access the files we need for this workshop using the links provided below. Right click on each link and choose \"Save link as...\", selecting ~/Desktop/Intro-to-R/data as the destination of the file. You should now see the file appear in your working directory. We will discuss these files a bit later in the lesson.

Data Download links Normalized count data Right click & Save link as... Metadata file Right click & Save link as... Functional analysis output Right click & Save link as...

NOTE

If the files download automatically to some other location on your laptop, you can move them to your working directory using your file explorer or finder (outside RStudio), or by navigating to the files in the Files tab of the bottom right panel of RStudio.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#about-the-dataset","title":"About the dataset","text":"

The count data

In this example dataset, we have collected whole brain samples from 12 mice and want to evaluate expression differences between them. The expression data represents normalized count data obtained from RNA-sequencing of the 12 brain samples. This data is stored in a comma separated values (CSV) file as a 2-dimensional matrix, with each row corresponding to a gene and each column corresponding to a sample.

The metadata

We have another file in which we identify information about the data or metadata. Our metadata is also stored in a CSV file. In this file, each row corresponds to a sample and each column contains some information about each sample.

The first column contains the row names, and note that these are identical to the column names in our expression data file above (albeit, in a slightly different order). The next few columns contain information about our samples that allow us to categorize them. For example, the second column contains genotype information for each sample. Each sample is classified in one of two categories: Wt (wild type) or KO (knockout). What types of categories do you observe in the remaining columns?

R is particularly good at handling this type of categorical data. Rather than simply storing this information as text, the data is represented in a specific data structure which allows the user to sort and manipulate the data in a quick and efficient manner. We will discuss this in more detail as we go through the different lessons in R!

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#the-functional-analysis-results","title":"The functional analysis results","text":"

We will be using the results of the functional analysis to learn about packages/functions from the Tidyverse suite of integrated packages. These packages are designed to work together to make common data science operations like data wrangling, tidying, reading/writing, parsing, and visualizing, more user-friendly.

"},{"location":"day_1/D1.2_introR-R-and-RStudio/#best-practices","title":"Best practices","text":"

Before we move on to more complex concepts and getting familiar with the language, we want to point out a few things about best practices when working with R, which will help you stay organized in the long run:

  • Code and workflow are more reproducible if we can document everything that we do. Our end goal is not just to \"do stuff\", but to do it in a way that anyone can easily and exactly replicate our workflow and results. All code should be written in the script editor and saved to file, rather than working in the console.

  • The R console should be mainly used to inspect objects, test a function or get help.

  • Use # signs to comment. Comment liberally in your R scripts. This will help future you and other collaborators know what each line of code (or code block) was meant to do. Anything to the right of a # is ignored by R. A shortcut for this is Ctrl+Shift+C if you want to comment an entire chunk of text.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/","title":"R Syntax and Data Structures","text":"

Approximate time: 70 min

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#learning-objectives","title":"Learning Objectives","text":"
  • Describe frequently-used data types in R.
  • Construct data structures to store data.
"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#data-types","title":"Data Types","text":"

Variables can contain values of specific types within R. The six data types that R uses include:

  • \"numeric\" for any numerical value, including whole numbers and decimals. This is the most common data type for performing mathematical operations.
  • \"character\" for text values, denoted by using quotes (\"\") around value. For instance, while 5 is a numeric value, if you were to put quotation marks around it, it would turn into a character value, and you could no longer use it for mathematical operations. Single or double quotes both work, as long as the same type is used at the beginning and end of the character value.
  • \"integer\" for whole numbers (e.g., 2L, the L indicates to R that it's an integer). It behaves similar to the numeric data type for most tasks or functions; however, it takes up less storage space than numeric data, so often tools will output integers if the data is known to be comprised of whole numbers. Just know that integers behave similarly to numeric values. If you wanted to create your own, you could do so by providing the whole number, followed by an upper-case L.
  • \"logical\" for TRUE and FALSE (the Boolean data type). The logical data type can be specified using four values, TRUE in all capital letters, FALSE in all capital letters, a single capital T or a single capital F.
  • \"complex\" to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that's all we're going to say about them
  • \"raw\" that we won't discuss further

The table below provides examples of each of the commonly used data types:

  • Numeric: 1, 1.5, 20, pi
  • Character: \"anytext\", \"5\", \"TRUE\"
  • Integer: 2L, 500L, -17L
  • Logical: TRUE, FALSE, T, F

The type of data will determine what you can do with it. For example, if you want to perform mathematical operations, then your data type cannot be character or logical. Whereas if you want to search for a word or pattern in your data, then your data should be of the character data type. The task or function being performed on the data will determine what type of data can be used.
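
As a quick illustration (this snippet is not part of the original lesson), attempting a mathematical operation on a character value fails:

\"5\" + 1\n# Error in \"5\" + 1 : non-numeric argument to binary operator\n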

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#data-structures","title":"Data Structures","text":"

We know that variables are like buckets, and so far we have seen that bucket filled with a single value. Even when number was created, the result of the mathematical operation was a single value. Variables can store more than just a single value; they can store a multitude of different data structures. These include, but are not limited to, vectors (c), factors (factor), matrices (matrix), data frames (data.frame) and lists (list).

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#vectors","title":"Vectors","text":"

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It's basically just a collection of values: numbers, characters, or logical values.
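
For instance (the values here are illustrative, not from the original lesson):

c(2, 8, 15)            # a numeric vector\nc(\"a\", \"b\", \"c\")       # a character vector\nc(TRUE, FALSE, TRUE)   # a logical vector\n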

Note

All values in a vector must be of the same data type.

If you try to create a vector with more than a single data type, R will try to coerce it into a single data type.

For example, if you were to try to create the following vector (the values shown are illustrative):
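
# A vector mixing numeric and character values (illustrative)\nc(1, 2, \"three\")\n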

R will coerce it into a character vector:
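
# Every element is coerced to character\nc(\"1\", \"2\", \"three\")\n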

The analogy for a vector is that your bucket now has different compartments; these compartments in a vector are called elements.

Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a single entity (bucket).

Let's create a vector of genome lengths and assign it to a variable called glengths.

Each element of this vector contains a single numeric value, and three values will be combined together into a vector using c() (the combine function). All of the values are put within the parentheses and separated with a comma.

# Create a numeric vector and store the vector as a variable called 'glengths'\nglengths <- c(4.6, 3000, 50000)\nglengths\n

Note

Your environment shows the glengths variable is numeric (num) and tells you the glengths vector starts at element 1 and ends at element 3 (i.e. your vector contains 3 values), as denoted by the [1:3].

A vector can also contain characters. Create another vector called species with three elements, where each element corresponds with the genome sizes vector (in Mb).

# Create a character vector and store the vector as a variable called 'species'\nspecies <- c(\"ecoli\", \"human\", \"corn\")\nspecies\n
What do you think would happen if we forgot to put quotations around one of the values? Let's test it out with corn.

# Forget to put quotes around corn\nspecies <- c(\"ecoli\", \"human\", corn)\n
Note that RStudio is quite helpful in color-coding the various data types. We can see that our numeric values are blue, the character values are green, and if we forget to surround corn with quotes, it's black. What does this mean? Let's try to run this code.

When we try to run this code we get an error specifying that object 'corn' is not found. What this means is that R is looking for an object or variable in our Environment called 'corn', and when it doesn't find it, it returns an error. If we had a character vector called 'corn' in our Environment, then it would combine the contents of the 'corn' vector with the values \"ecoli\" and \"human\".

Since we only want to add the value \"corn\" to our vector, we need to re-run the code with the quotation marks surrounding corn. A quick way to add quotes to both ends of a word in RStudio is to highlight the word, then press the quote key.

# Create a character vector and store the vector as a variable called 'species'\nspecies <- c(\"ecoli\", \"human\", \"corn\")\n

Exercise

Try to create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine c() function to do this.

Print the combined vector in the console, what looks different compared to the original vectors?

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#factors","title":"Factors","text":"

A factor is a special type of vector that is used to store categorical data. Each unique category is referred to as a factor level (i.e. category = level). Factors are built on top of integer vectors such that each factor level is assigned an integer value, creating value-label pairs.

For instance, if we have four animals and the first animal is female, the second and third are male, and the fourth is female, we could create a factor that appears like a vector, but has integer values stored under-the-hood. The integer value assigned is a one for females and a two for males. The numbers are assigned in alphabetical order, so because the f- in females comes before the m- in males in the alphabet, females get assigned a one and males a two. In later lessons we will show you how you could change these assignments.
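
A minimal sketch of that animal example (the variable name sex is hypothetical):

sex <- factor(c(\"female\", \"male\", \"male\", \"female\"))\nlevels(sex)   # \"female\" \"male\": females stored as 1, males as 2 under the hood\n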

Let's create a factor vector and explore a bit more. We'll start by creating a character vector describing three different levels of expression. Perhaps the first value represents expression in mouse1, the second value represents expression in mouse2, and so on and so forth:

# Create a character vector and store the vector as a variable called 'expression'\nexpression <- c(\"low\", \"high\", \"medium\", \"high\", \"low\", \"medium\", \"high\")\n

Now we can convert this character vector into a factor using the factor() function:

# Turn 'expression' vector into a factor\nexpression <- factor(expression)\n

So, what exactly happened when we applied the factor() function?

The expression vector is categorical, in that all the values in the vector belong to a set of categories; in this case, the categories are low, medium, and high. By turning the expression vector into a factor, the categories are assigned integers alphabetically, with high=1, low=2, medium=3. This in effect assigns the different factor levels. You can view the newly created factor variable and the levels in the Environment window.
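
You can confirm this assignment in the console (an assumed follow-up to the code above):

levels(expression)   # returns \"high\" \"low\" \"medium\", i.e. integer levels 1, 2 and 3\n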

So now that we have an idea of what factors are, when would you ever want to use them?

Factors are extremely valuable for many operations often performed in R. For instance, factors can give order to values with no intrinsic order. In the previous 'expression' vector, if I wanted the low category to be less than the medium category, then we could do this using factors. Also, factors are necessary for many statistical methods. For example, descriptive statistics can be obtained for character vectors if you have the categorical information stored as a factor. Also, if you want to denote which category is your base level for a statistical comparison, then you would need to have your category variable stored as a factor with the base level assigned to 1. Anytime that it is helpful to have the categories thought of as groups in an analysis, the factor function makes this possible. For instance, if you want to color your plots by treatment type, then you would need the treatment variable to be a factor.

Exercises

Let's say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.

  1. Create a vector named samplegroup with nine elements: 3 control (\"CTL\") values, 3 knock-out (\"KO\") values, and 3 over-expressing (\"OE\") values.

  2. Turn samplegroup into a factor data structure.

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#matrix","title":"Matrix","text":"

A matrix in R is a collection of vectors of the same length and identical datatype. Vectors can be combined as columns in the matrix or by row, to create a 2-dimensional structure.

Matrices are commonly used as part of the mathematical machinery of statistics. They are usually of numeric datatype and are used in computational algorithms as a checkpoint. For example, if input data are not of identical data type (numeric, character, etc.), the matrix() function will throw an error and stop any downstream code execution.
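
As a minimal sketch (the values are illustrative), a matrix can be created with the matrix() function by supplying values and dimensions:

# Create a 2 x 3 numeric matrix; values fill column-wise by default\nm <- matrix(1:6, nrow = 2, ncol = 3)\nm\n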

"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#data-frame","title":"Data Frame","text":"

A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it's a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors). In the data frame pictured below, the first column is character, the second column is numeric, the third is character, and the fourth is logical.

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.

We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function, and giving the function the different vectors we would like to bind together. This function will only work for vectors of the same length.

# Create a data frame and store it as a variable called 'df'\ndf <- data.frame(species, glengths)\n

We can see that a new variable called df has been created in our Environment within a new section called Data. In the Environment, it specifies that df has 3 observations of 2 variables. What does that mean? In R, rows always come first, so it means that df has 3 rows and 2 columns. We can get additional information if we click on the blue circle with the white triangle in the middle next to df. It will display information about each of the columns in the data frame, giving information about what the data type is of each of the columns and the first few values of those columns.

Another handy feature in RStudio is that if we hover the cursor over the variable name in the Environment, df, it will turn into a pointing finger. If you click on df, it will open the data frame as its own tab next to the script editor. We can explore the table interactively within this window. To close, just click on the X on the tab.

As with any variable, we can print the values stored inside to the console if we type the variable's name and run.

df\n

Exercise

Create a data frame called favorite_books with the following vectors as columns:

titles <- c(\"Catch-22\", \"Pride and Prejudice\", \"Nineteen Eighty Four\")\npages <- c(453, 432, 328)\n
"},{"location":"day_1_exercise/D1.1e_r_syntax_and_data_structures/#lists","title":"Lists","text":"

Lists are a data structure in R that can be perhaps a bit daunting at first, but soon become amazingly useful. A list is a data structure that can hold any number of any types of other data structures.

If you have variables of different data structures you wish to combine, you can put all of those into one list object by using the list() function and placing all the items you wish to combine within parentheses:

list1 <- list(species, df, number)\n

We see list1 appear within the Data section of our environment as a list of 3 components or variables. If we click on the blue circle with a triangle in the middle, it's not quite as interpretable as it was for data frames.

Essentially, each component is preceded by a colon. The first colon gives the species vector, the second colon precedes the df data frame (with the dollar signs indicating the different columns), and the last colon gives the single value, number.

If I click on list1, it opens a tab where you can explore the contents a bit more, but it's still not super intuitive. The easiest way to view small lists is to print to the console.

Let's type list1 and print to the console by running it.

list1\n\n[[1]]\n[1] \"ecoli\" \"human\" \"corn\" \n\n[[2]]\n  species glengths\n1   ecoli      4.6\n2   human   3000.0\n3    corn  50000.0\n\n[[3]]\n[1] 5\n

There are three components corresponding to the three different variables we passed in, and what you see is that structure of each is retained. Each component of a list is referenced based on the number position. We will talk more about how to inspect and manipulate components of lists in later lessons.
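
For instance, using the list1 object created above (a brief sketch; indexing syntax is covered properly in later lessons):

list1[[2]]   # double square brackets return the second component, here the df data frame\n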

Exercise

Create a list called list2 containing species, glengths, and number.

Now that we know what lists are, why would we ever want to use them? When getting started with R, you will most likely encounter lists with different tools or functions that you use. Oftentimes a tool will need a list as input, so that all the information needed to run the tool is present in a single variable. Sometimes a tool will output a list when working through an analysis. Knowing how to work with them and extract necessary information will be critically important.

As you become more comfortable with R, you will find yourself using lists more often. One common use of lists is to make iterative processes more efficient. For example, let's say you had multiple data frames containing the same weather information from different cities throughout North America. You wanted to perform the same task on each of the data frames, but that would take a long time to do individually. Instead you could create a list where each data frame is a component of the list. Then, you could perform the task on the list instead, which would be applied to each of the components.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/","title":"Functions in R","text":"

Approximate time: 30 min

"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#learning-objectives","title":"Learning Objectives","text":"
  • Describe and utilize functions in R.
  • Modify default behavior of a function using arguments.
  • Identify R-specific sources of obtaining more information about functions.
  • Demonstrate how to create user-defined functions in R.
"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#functions-and-their-arguments","title":"Functions and their arguments","text":""},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#what-are-functions","title":"What are functions?","text":"

A key feature of R is functions. Functions are \"self contained\" modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.

The general usage for a function is the name of the function followed by parentheses:

function_name(input)\n
The input(s) are called arguments, which can include:

  1. the physical object (any data structure) on which the function carries out a task
  2. specifications that alter the way the function operates (e.g. options)

Not all functions take arguments, for example:

getwd()\n

However, most functions can take several arguments. If you don't specify a required argument when calling the function, you will either receive an error or the function will fall back on using a default.

The defaults represent standard values that the author of the function specified as being \"good enough in standard cases\". An example would be what symbol to use in a plot. However, if you want something specific, simply change the argument yourself with a value of your choice.

"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#basic-functions","title":"Basic functions","text":"

We have already used a few examples of basic functions in the previous lessons, e.g. getwd(), c(), and factor(). These functions are available as part of R's built-in capabilities, and we will explore a few more of these base functions below.

Let's revisit a function that we have used previously to combine data into vectors, c(). The arguments it takes are a collection of numbers or character strings, separated by commas. The c() function performs the task of combining the numbers or characters into a single vector. You can also use the function to add elements to an existing vector:

glengths <- c(glengths, 90) # adding at the end \nglengths <- c(30, glengths) # adding at the beginning\n

What happens here is that we take the original vector glengths (containing three elements), and we are adding another item to either end. We can do this over and over again to build a vector or a dataset.

Since R is used for statistical computing, many of the base functions involve mathematical operations. One example would be the function sqrt(). The input/argument must be a number, and the output is the square root of that number. Let's try finding the square root of 81:

sqrt(81)\n

Now what would happen if we called the function (i.e., ran the function) on a vector of values instead of a single value?

sqrt(glengths)\n

In this case the task was performed on each individual value of the vector glengths and the respective results were displayed.

Let's try another function, this time using one where we can change some of the options (arguments that change the behavior of the function), for example round():

round(3.14159)\n

We can see that we get 3. That's because the default is to round to the nearest whole number. What if we want a different number of significant digits? Let's first learn how to find available arguments for a function.

"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#seeking-help-on-arguments-for-functions","title":"Seeking help on arguments for functions","text":"

The best way of finding out this information is to use the ? followed by the name of the function. Doing this will open up the help manual in the bottom right panel of RStudio that will provide a description of the function, usage, arguments, details, and examples:

?round\n

Alternatively, if you are familiar with the function but just need to remind yourself of the names of the arguments, you can use:

args(round)\n

Even more useful is the example() function. This will allow you to run the examples section from the Online Help to see exactly how it works when executing the commands. Let's try that for round():

example(\"round\")\n

In our example, we can change the number of digits returned by adding an argument. We can type digits=2 or however many we may want:

round(3.14159, digits=2)\n

Note

If you provide the arguments in the exact same order as they are defined (in the help manual) you don't have to name them:

round(3.14159, 2)\n
However, it's usually not recommended practice because it involves a lot of memorization. In addition, it makes your code difficult to read for your future self and others, especially if your code includes functions that are not commonly used. (It is, however, OK to not include the names of the arguments for basic functions like mean, min, etc.) Another advantage of naming arguments is that the order doesn't matter. This is useful when a function has many arguments.

Exercise

  1. Let's use a base R function to calculate the mean value of the glengths vector. You might need to search online to find what function can perform this task.

  2. Create a new vector test <- c(1, NA, 2, 3, NA, 4). Use the same base R function from exercise 1 (with the addition of the proper argument), and calculate the mean value of the test vector. The output should be 2.5.

    NOTE: In R, missing values are represented by the symbol NA (not available). It\u2019s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. There are ways to ignore NA during statistical calculation, or to remove NA from the vector. If you want more information related to missing data or NA you can go to this page (please note that there are many advanced concepts on that page that have not been covered in class).

  3. Another commonly used base function is sort(). Use this function to sort the glengths vector in descending order.
"},{"location":"day_1_exercise/D1.2e_functions_and_arguments/#user-defined-functions","title":"User-defined Functions","text":"

One of the great strengths of R is the user's ability to add functions. Sometimes there is a small task (or series of tasks) you need done and you find yourself having to repeat it multiple times. In these types of situations, it can be helpful to create your own custom function. The structure of a function is given below:

name_of_function <- function(argument1, argument2) {\n    statements or code that does something\n    return(something)\n}\n
  • First you give your function a name.
  • Then you assign a value to it, where the value is the function.

When defining the function you will want to provide the list of arguments required (inputs and/or options to modify the behavior of the function), and, wrapped between curly brackets, place the tasks that are to be executed on/using those arguments. The argument(s) can be any type of object (like a scalar, a matrix, a dataframe, a vector, a logical, etc.), and it's not necessary to define what it is in any way.

Finally, you can \"return\" the value of the object from the function, meaning pass the value of it into the global environment. The important idea behind functions is that objects that are created within the function are local to the environment of the function - they don't exist outside of the function.

Let's try creating a simple example function. This function will take in a numeric value as input, and return the squared value.

square_it <- function(x) {\n    square <- x * x\n    return(square)\n}\n

Once you run the code, you should see a function named square_it in the Environment panel (located at the top right of the RStudio interface). Now, we can use this function like any other base R function. We type out the name of the function, and inside the parentheses we provide a numeric value x:

square_it(5)\n

Pretty simple, right? In this case, we only had one line of code that was run, but in theory you could have many lines of code to obtain the final results that you want to \"return\" to the user.

Do I always have to return() something at the end of the function?

In the example above, we created a new variable called square inside the function, and then returned the value of square. If you don't use return(), by default R will return the value of the last line of code inside that function. That is to say, the following function will also work.

square_it <- function(x) {\n    x * x\n}\n
However, as a best practice, we recommend always using return() at the end of a function.

We have only scratched the surface here when it comes to creating functions! We will revisit this in later lessons, but if interested you can also find more detailed information on this R-bloggers site, which is where we adapted this example from.

Exercise

  1. Write a function called multiply_it, which takes two inputs: a numeric value x, and a numeric value y. The function will return the product of these two numeric values, which is x * y. For example, multiply_it(x=4, y=6) will return output 24.

Attribution notice

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).
"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/","title":"Reading in and inspecting data","text":""},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#learning-objectives","title":"Learning Objectives","text":"
  • Demonstrate how to read existing data into R
  • Utilize base R functions to inspect data structures
"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#reading-data-into-r","title":"Reading data into R","text":""},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#the-basics","title":"The basics","text":"

Regardless of the specific analysis we are performing, we usually need to bring data into R, so learning how to read in data is a crucial component of learning to use R.

Many functions exist to read data in, and the function you use will depend on the file format being read in. Below is a table with some examples of functions that can be used for importing common text data types (plain text).

| Data type | Extension | Function | Package |
| --- | --- | --- | --- |
| Comma separated values | csv | read.csv() | utils (default) |
| | | read_csv() | readr (tidyverse) |
| Tab separated values | tsv | read_tsv() | readr |
| Other delimited formats | txt | read.table() | utils |
| | | read_table() | readr |
| | | read_delim() | readr |

For example, if we have a text file where the columns are separated by commas (comma-separated values or comma-delimited), you could use the function read.csv. However, if the data are separated by a different delimiter in a text file (e.g. \":\", \";\", \" \"), you could use the generic read.table function and specify the delimiter (sep = \" \") as an argument in the function.
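As a brief sketch (the file name here is hypothetical, for illustration only), reading a semicolon-delimited text file might look like this:

# Hypothetical example: a text file with columns separated by semicolons\ndf <- read.table(\"data/my_file.txt\", sep = \";\", header = TRUE)\n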

In the above table we refer to base R functions as being contained in the \"utils\" package. In addition to base R functions, we have also listed functions from some other packages that can be used to import data, specifically the \"readr\" package that installs when you install the \"tidyverse\" suite of packages.

In addition to plain text files, you can also import data from other statistical analysis packages and Excel using functions from different packages.

| Data type | Extension | Function | Package |
| --- | --- | --- | --- |
| Stata version 13-14 | dta | read_dta() | haven |
| Stata version 7-12 | dta | read.dta() | foreign |
| SPSS | sav | read.spss() | foreign |
| SAS | sas7bdat | read.sas7bdat() | sas7bdat |
| Excel | xlsx, xls | read_excel() | readxl (tidyverse) |

Note

These lists are not comprehensive, and many other functions exist for importing data. Once you have been using R for a bit, you will likely develop a preference for which functions to use for each data type.

"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#metadata","title":"Metadata","text":"

When working with large datasets, you will very likely be working with a \"metadata\" file, which contains information about each sample in your dataset.

The metadata is very important information, and we encourage you to think about creating a document with as much metadata as you can record before you bring the data into R. Here is some additional reading on metadata from the HMS Data Management Working Group.

"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#the-readcsv-function","title":"The read.csv() function","text":"

Let's bring in the metadata file we downloaded earlier (mouse_exp_design.csv or mouse_exp_design.txt) using the read.csv function.

First, check the arguments for the function using the ? to ensure that you are entering all the information appropriately:

?read.csv\n

The first thing you will notice is that you've pulled up the documentation for read.table(); this is because read.table() is the parent function, and all the other functions in the family share its documentation.

The next item on the documentation page is the function Description, which specifies that the output of this set of functions is going to be a data frame - \"Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.\"

In the \"Usage\" section, all of the arguments listed for read.table() are the default values for all of the family members, unless otherwise specified for a given function. Let's take a look at 2 examples:

  1. The separator

    • in the case of read.table() it is sep = \"\" (space or tab)
    • whereas for read.csv() it is sep = \",\" (a comma).
  2. The header

    This argument refers to the column headers that may (TRUE) or may not (FALSE) exist in the plain text file you are reading in.

    • in the case of read.table() it is header = FALSE (by default, it assumes you do not have column names)
    • whereas for read.csv() it is header = TRUE (by default, it assumes that all your columns have names listed).

The take-home from the \"Usage\" section for read.csv() is that it has one mandatory argument: the path to the file and the filename, in quotations; in our case that is data/mouse_exp_design.csv or data/mouse_exp_design.txt.

The stringsAsFactors argument

Note that the read.table {utils} family of functions has an argument called stringsAsFactors, which by default will take the value of default.stringsAsFactors().

Type out default.stringsAsFactors() in the console to check what the default value is for your current R session. Is it TRUE or FALSE?

If default.stringsAsFactors() is set to TRUE, then stringsAsFactors = TRUE. In that case any function in this family of functions will coerce character columns in the data you are reading in to factor columns (i.e. coerce from vector to factor) in the resulting data frame.

If you want to maintain the character vector data structure (e.g. for gene names), you will want to make sure that stringsAsFactors = FALSE (or that default.stringsAsFactors() is set to FALSE).
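For example, to be explicit rather than relying on the session default, you can set the argument directly when reading in a file (a minimal sketch using a hypothetical file name):

# Keep character columns as character vectors instead of coercing to factors\ndf <- read.csv(\"data/my_file.csv\", stringsAsFactors = FALSE)\n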

"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#create-a-data-frame-by-reading-in-the-file","title":"Create a data frame by reading in the file","text":"

At this point, please check the extension for the mouse_exp_design file within your data folder. You will have to type it accordingly within the read.csv() function.

Note

read.csv is not fussy about extensions for plain text files, so even though the file we are reading in is a comma-separated value file, it will be read in properly even with a .txt extension.

Let's read in the mouse_exp_design file and create a new data frame called metadata.

metadata <- read.csv(file=\"data/mouse_exp_design.csv\")\n\n# OR \n# metadata <- read.csv(file=\"data/mouse_exp_design.txt\")\n

NOTE

RStudio supports the automatic completion of code using the Tab key. This is especially helpful for when reading in files to ensure the correct file path. The tab completion feature also provides a shortcut to listing objects, and inline help for functions. Tab completion is your friend! We encourage you to use it whenever possible.

Go to your Global environment and click on the name of the data frame you just created.

When you do this, the metadata table will pop up in the top left corner of RStudio, right next to the R script.

You should see a subtle coloring (blue-gray) of the first row and first column; the rest of the table will have a white background. This is because your first row and first column have different properties than the rest of the table: they are the names of the columns and rows, respectively.

Earlier we noted that the file we just read in had column names (first row of values) and how read.csv() deals with \"headers\". In addition to column headers, read.csv() also assumes that the first column contains the row names. Not all functions in the read.table() family of functions will do this and depending on which one you use, you may have to specify an additional argument to properly assign the row names and column names.
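For instance, one way to be explicit about this behavior is to point the row.names argument at the first column yourself (a minimal sketch; the file name mirrors the one used above):

# Explicitly use the first column of the file as row names\nmetadata <- read.csv(file=\"data/mouse_exp_design.csv\", row.names = 1)\n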

Note

Row names and column names are really handy when subsetting data structures and they are also helpful to identify samples or genes. We almost always use them with data frames.

Exercise 1

  1. Download this tab-delimited .txt file and save it in your project's data folder.
  2. Read it into R using read.table() with the appropriate arguments and store it as the variable proj_summary. To figure out the appropriate arguments to use with read.table(), keep the following in mind:
    • all the columns in the input text file have column name/headers
    • you want the first column of the text file to be used as row names (hint: look up the input for the row.names = argument in read.table())
  3. Display the contents of proj_summary in your console
"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#inspecting-data-structures","title":"Inspecting data structures","text":"

There are a wide selection of base functions in R that are useful for inspecting your data and summarizing it. Let's use the metadata file that we created to test out data inspection functions.

Take a look at the dataframe by typing out the variable name metadata and pressing return; the variable contains information describing the samples in our study. Each row holds information for a single sample, and the columns contain categorical information about the sample genotype (Wt or KO), celltype (typeA or typeB), and replicate number (1, 2, or 3).

metadata\n

Output

genotype celltype replicate\nsample1        Wt    typeA      1\nsample2        Wt    typeA      2\nsample3        Wt    typeA      3\nsample4        KO    typeA      1\nsample5        KO    typeA      2\nsample6        KO    typeA      3\nsample7        Wt    typeB      1\nsample8        Wt    typeB      2\nsample9        Wt    typeB      3\nsample10       KO    typeB      1\nsample11       KO    typeB      2\nsample12       KO    typeB      3\n

Suppose we had a larger file; we might not want to display all of its contents in the console. Instead, we could check the top (the first 6 lines) of this data.frame using the function head():

head(metadata)\n
"},{"location":"day_1_exercise/D1.3e_reading_in_and_data_inspection/#list-of-functions-for-data-inspection","title":"List of functions for data inspection","text":"

We already saw how the functions head() and str() (in the releveling section) can be useful to check the content and the structure of a data.frame. Below is a non-exhaustive list of functions to get a sense of the content/structure of data. The list has been divided into functions that work on all types of objects, some that work only on vectors/factors (1 dimensional objects), and others that work on data frames and matrices (2 dimensional objects).

We have some exercises below that will allow you to gain more familiarity with these. You will definitely be using some of them in the next few homework sections.

  • All data structures - content display:

    • str(): compact display of data contents (similar to what you see in the Global environment)
    • class(): displays the data type for vectors (e.g. character, numeric, etc.) and data structure for dataframes, matrices, lists
    • summary(): detailed display of the contents of a given object, including descriptive statistics, frequencies
    • head(): prints the first 6 entries (elements for 1-D objects, rows for 2-D objects)
    • tail(): prints the last 6 entries (elements for 1-D objects, rows for 2-D objects)
  • Vector and factor variables:

    • length(): returns the number of elements in a vector or factor
  • Dataframe and matrix variables:

    • dim(): returns dimensions of the dataset (number_of_rows, number_of_columns) [Note, row numbers will always be displayed before column numbers in R]
    • nrow(): returns the number of rows in the dataset
    • ncol(): returns the number of columns in the dataset
    • rownames(): returns the row names in the dataset
    • colnames(): returns the column names in the dataset

Exercise 2

  • Use the class() function on glengths and metadata, how does the output differ between the two?
  • Use the summary() function on the proj_summary dataframe, what is the median \"rRNA_rate\"?
  • How long is the samplegroup factor?
  • What are the dimensions of the proj_summary dataframe?
  • When you use the rownames() function on metadata, what is the data structure of the output?
  • [Optional] How many elements in (how long is) the output of colnames(proj_summary)? Don't count, but use another function to determine this.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2/D2.1_in_class_exercises/","title":"Day 2: In class activities","text":""},{"location":"day_2/D2.1_in_class_exercises/#1-custom-functions","title":"1. Custom Functions","text":"

Let's create a function temp_conv(), which converts the temperature in Fahrenheit (input) to the temperature in Kelvin (output).

  • We could perform a two-step calculation: first convert from Fahrenheit to Celsius, and then convert from Celsius to Kelvin.

  • The formulas for these two calculations are as follows: temp_c = (temp_f - 32) * 5 / 9; temp_k = temp_c + 273.15.

  • If your input is 70, the result of temp_conv(70) should be 294.2611 (see the sketch below).
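A minimal sketch of one possible solution, following the two-step calculation above (try writing your own version before peeking):

temp_conv <- function(temp_f) {\n    temp_c <- (temp_f - 32) * 5 / 9   # Fahrenheit to Celsius\n    temp_k <- temp_c + 273.15         # Celsius to Kelvin\n    return(temp_k)\n}\n\ntemp_conv(70)   # returns 294.2611\n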

"},{"location":"day_2/D2.1_in_class_exercises/#2-nesting-functions","title":"2. Nesting Functions","text":"

Now we want to round the temperature in Kelvin (output of temp_conv()) to a single decimal place. Use the round() function with the newly-created temp_conv() function to achieve this in one line of code. If your input is 70, the output should now be 294.3.
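One possible one-liner, nesting the temp_conv() sketch above inside round():

round(temp_conv(70), digits = 1)   # returns 294.3\n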

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2/D2.2_data_wrangling/","title":"Data subsetting with base R: vectors and factors","text":"

Approximate time: 60 min

"},{"location":"day_2/D2.2_data_wrangling/#learning-objectives","title":"Learning Objectives","text":"
  • Demonstrate how to subset vectors and factors
  • Explain the use of logical operators when subsetting vectors and factors
  • Demonstrate how to relevel factors in a desired order
"},{"location":"day_2/D2.2_data_wrangling/#selecting-data-using-indices-and-sequences","title":"Selecting data using indices and sequences","text":"

When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let's begin with vectors and how to access different elements, and then extend those concepts to dataframes.

"},{"location":"day_2/D2.2_data_wrangling/#vectors","title":"Vectors","text":""},{"location":"day_2/D2.2_data_wrangling/#selecting-using-indices","title":"Selecting using indices","text":"

If we want to extract one or several values from a vector, we must provide one or several indices using square brackets [ ] syntax. The index represents the element number within a vector (or the compartment number, if you think of the bucket analogy). R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that's what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do.

Let's start by creating a vector called age:

age <- c(15, 22, 45, 52, 73, 81)\n

Suppose we only wanted the fifth value of this vector, we would use the following syntax:

age[5]\n

If we wanted all values except the fifth value of this vector, we would use the following:

age[-5]\n

If we wanted to select more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of several index values:

age[c(3,5,6)]   ## nested\n\n# OR\n\n## create a vector first then select\nidx <- c(3,5,6) # create vector of the elements of interest\nage[idx]\n

To select a sequence of continuous values from a vector, we would use :, which is a special function that creates numeric vectors of integers in increasing or decreasing order. Let's select the first four values from age:

age[1:4]\n

Alternatively, if you wanted the reverse, you could try 4:1, for instance, and see what is returned.

Exercise

  1. Create a vector called alphabets with the following letters: C, D, X, L, F.
  2. Use the associated indices along with [ ] to do the following:
    • only display C, D and F
    • display all except X
    • display the letters in the opposite order (F, L, X, D, C)
"},{"location":"day_2/D2.2_data_wrangling/#selecting-using-indices-with-logical-operators","title":"Selecting using indices with logical operators","text":"

We can also use indices with logical operators. Logical operators include greater than (>), less than (<), and equal to (==). A full list of logical operators in R is displayed below:

| Operator | Description |
| --- | --- |
| > | greater than |
| >= | greater than or equal to |
| < | less than |
| <= | less than or equal to |
| == | equal to |
| != | not equal to |
| & | and |
| \| | or |

We can use logical expressions to determine whether a particular condition is true or false. For example, let's use our age vector:

age\n

If we wanted to know if each element in our age vector is greater than 50, we could write the following expression:

age > 50\n

Returned is a vector of logical values the same length as age with TRUE and FALSE values indicating whether each element in the vector is greater than 50.

[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE\n

We can use these logical vectors to select only the elements in a vector with TRUE values at the same position or index as in the logical vector.

Select all values in the age vector that are over 50 or less than 18:

age > 50 | age < 18\n\nage\n\nage[age > 50 | age < 18]  ## nested\n\n# OR\n\n## create a vector first then select\nidx <- age > 50 | age < 18\nage[idx]\n
"},{"location":"day_2/D2.2_data_wrangling/#indexing-with-logical-operators-using-the-which-function","title":"Indexing with logical operators using the which() function","text":"

While logical expressions will return a vector of TRUE and FALSE values of the same length, we could use the which() function to output the indices where the values are TRUE. Indexing with either method generates the same results, and personal preference determines which method you choose to use. For example:

which(age > 50 | age < 18)\n\nage[which(age > 50 | age < 18)]  ## nested\n\n# OR\n\n## create a vector first then select\nidx_num <- which(age > 50 | age < 18)\nage[idx_num]\n

Notice that we get the same results regardless of whether or not we use which(). Also note that while which() works the same as logical expressions for indexing, it can also be used for multiple other operations, where it is not interchangeable with logical expressions.

"},{"location":"day_2/D2.2_data_wrangling/#factors","title":"Factors","text":"

Since factors are special vectors, the same rules for selecting values using indices apply. The elements of the expression factor created previously had the following categories or levels: low, medium, and high.

Let's extract the values of the factor with high expression, and let's use nesting here:

expression[expression == \"high\"]    ## This will only return those elements in the factor equal to \"high\"\n

Nesting note

The piece of code above was more efficient with nesting; we used a single step instead of two steps as shown below:

Step1 (no nesting): idx <- expression == \"high\"

Step2 (no nesting): expression[idx]

Exercise

Extract only those elements in samplegroup that are not KO (nesting the logical operation is optional).

"},{"location":"day_2/D2.2_data_wrangling/#releveling-factors","title":"Releveling factors","text":"

We have briefly talked about factors, but this data type only becomes more intuitive once you've had a chance to work with it. Let's take a slight detour and learn about how to relevel categories within a factor.

To view the integer assignments under the hood you can use str():

expression\n\nstr(expression)\nFactor w/ 3 levels \"high\",\"low\",\"medium\": 2 1 3 1 2 3 1\n
The categories are referred to as factor levels. As we learned earlier, the levels in the expression factor were assigned integers alphabetically, with high=1, low=2, medium=3. However, it makes more sense for us if low=1, medium=2 and high=3; i.e., we want to relevel the categories in this factor.

To relevel the categories, you can add the levels argument to the factor() function, and give it a vector with the categories listed in the required order:

expression <- factor(expression, levels=c(\"low\", \"medium\", \"high\"))     # you can re-factor a factor \n\nstr(expression)\nFactor w/ 3 levels \"low\",\"medium\",..: 1 3 2 3 1 2 3\n

Now we have a releveled factor with low as the lowest or first category, medium as the second and high as the third. This is reflected in the way they are listed in the output of str(), as well as in the numbering of which category is where in the factor.

Note

Releveling becomes necessary when you need a specific category in a factor to be the \"base\" category, i.e. category that is equal to 1. One example would be if you need the \"control\" to be the \"base\" in a given RNA-seq experiment.

Exercise

Use the samplegroup factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/","title":"Packages and libraries","text":"

Approximate time: 25 min

"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#learning-objectives","title":"Learning Objectives","text":"
  • Explain different ways to install external R packages
  • Demonstrate how to load a library and how to find functions specific to a package
"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#packages-and-libraries","title":"Packages and Libraries","text":"

Packages are collections of R functions, data, and compiled code in a well-defined format, created to add specific functionality. There are 10,000+ user-contributed packages, and the number is growing.

There are a set of standard (or base) packages which are considered part of the R source code and automatically available as part of your R installation. Base packages contain the basic functions that allow R to work, and enable standard statistical and graphical functions on datasets; for example, all of the functions that we have been using so far in our examples.

The directories in R where the packages are stored are called the libraries. The terms package and library are sometimes used synonymously and there has been discussion amongst the community to resolve this. It is somewhat counter-intuitive to load a package using the library() function and so you can see how confusion can arise.

You can check what libraries are loaded in your current R session by typing into the console:

sessionInfo() #Print version information about R, the OS and attached or loaded packages\n\n# OR\n\nsearch() #Gives a list of attached packages\n

Previously, we introduced you to functions from the standard base packages. However, the more you work with R, the more you will come to realize that there is a cornucopia of R packages that offer a wide variety of functionality. Using additional packages will require installation. Many packages can be installed from the CRAN or Bioconductor repositories.

Helpful tips for package installations

  • Package names are case sensitive!
  • At any point (especially if you've used R/Bioconductor in the past), R may ask you in the console if you want to update any old packages with the prompt Update all/some/none? [a/s/n]:. If you see this, type \"a\" at the prompt and hit Enter to update any old packages. Updating packages can sometimes take a while to run. If you are short on time, you can choose \"n\" and proceed. Without updating, you run the risk of conflicts between your old packages and the ones from your updated R version later down the road.
  • If you see a message in your console along the lines of \"binary version available but the source version is later\", followed by the question, \"Do you want to install from sources the package which needs compilation? y/n\", type n for no, and hit Enter.
"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#package-installation-from-cran","title":"Package installation from CRAN","text":"

CRAN is a repository where the latest downloads of R (and legacy versions) are found in addition to source code for thousands of different user contributed R packages.

Packages for R can be installed from the CRAN package repository using the install.packages function. This function will download the source code from one of the CRAN mirrors and install the package (and any dependencies) locally on your computer.

An example is given below for the ggplot2 package that will be required for some plots we will create later on. Run this code to install ggplot2.

install.packages(\"ggplot2\")\n
"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#package-installation-from-bioconductor","title":"Package installation from Bioconductor","text":"

Alternatively, packages can also be installed from Bioconductor, another repository of packages which provides tools for the analysis and comprehension of high-throughput genomic data. These packages include (but are not limited to) tools for performing statistical analysis, annotation packages, and tools for accessing public datasets.

There are many packages that are available in both CRAN and Bioconductor, but there are also packages that are specific to one repository. Generally, you can find out this information with a Google search or by trial and error.

To install from Bioconductor, you will first need to install BiocManager. This only needs to be done once ever for your R installation.

Do Not Run This!

install.packages(\"BiocManager\")\n

Now you can use the install() function from the BiocManager package to install a package by providing the name in quotations.

Here we have the code to install ggplot2, through Bioconductor:

Do Not Run This!

BiocManager::install(\"ggplot2\")\n

Note

The code above may not be familiar to you - it is essentially using a new operator, the double colon (::), to execute a function from a particular package. This is the syntax: package::function_name().

"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#package-installation-from-source","title":"Package installation from source","text":"

Finally, R packages can also be installed from source. This is useful when you do not have an internet connection (and have the source files locally), since the other two methods are retrieving the source files from remote sites.

To install from source, we use the same install.packages function but we have additional arguments that provide specifications to change from defaults:

Do Not Run This!

install.packages(\"~/Downloads/ggplot2_1.0.1.tar.gz\", type=\"source\", repos=NULL)\n
"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#loading-libraries","title":"Loading libraries","text":"

Once you have the package installed, you can load the library into your R session for use. Any of the functions that are specific to that package will be available for you to use by simply calling the function as you would for any of the base functions. Note that quotations are not required here.

library(ggplot2)\n

You can also check what is loaded in your current environment by using sessionInfo() or search() and you should see your package listed as:

other attached packages:\n[1] ggplot2_2.0.0\n

In this case there are several other packages that were also loaded along with ggplot2.

We only need to install a package once on our computer. However, to use the package, we need to load the library every time we start a new R/RStudio environment. You can think of this as installing a bulb versus turning on the light.

Analogy and image credit to Dianne Cook of Monash University.

"},{"location":"day_2_exercise/D2.1e_packages_and_libraries/#finding-functions-specific-to-a-package","title":"Finding functions specific to a package","text":"

This is your first time using ggplot2; how do you know where to start and what functions are available to you? One way to find out is by using the Packages tab in RStudio. If you click on the tab, you will see a list of all the packages that you have installed. For those libraries that you have loaded, you will see a blue checkmark in the box next to them. Scroll down to ggplot2 in your list:

If your library is successfully loaded, you will see the box checked, as in the screenshot above. Now, if you click on ggplot2, RStudio will open up the help pages and you can scroll through them.

An alternative is to find the help manual online, which can be less technical and sometimes easier to follow. For example, this website is much more comprehensive for ggplot2 and is the result of a Google search. Many of the Bioconductor packages also have very helpful vignettes that include comprehensive tutorials with mock data that you can work with.

If you can't find what you are looking for, you can use the rdocumentation.org website, which searches through the help files across all available packages.

Exercise

The ggplot2 package is part of the tidyverse suite of integrated packages which was designed to work together to make common data science operations more user-friendly. We will be using the tidyverse suite in later lessons, and so let's install it.

NOTE:

This suite of packages is only available in CRAN.
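Installation uses the same CRAN mechanism shown earlier:

install.packages(\"tidyverse\")\n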

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/","title":"Data wrangling: dataframes, matrices, and lists","text":"

Approximate time: 60 min

"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#learning-objectives","title":"Learning Objectives","text":"
  • Demonstrate how to subset, merge, and create new datasets from existing data structures in R.
"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#dataframes","title":"Dataframes","text":"

Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select some specific data from them we need to specify the \"coordinates\" of the values we want. We use the same square bracket notation, but rather than providing a single index, there are two indices required. Within the square brackets, row numbers come first, followed by column numbers (and the two are separated by a comma). Let's explore the metadata dataframe; shown below are the first six samples:

Let's say we wanted to extract the wild type (Wt) value that is present in the first row and the first column. To extract it, just like with vectors, we give the name of the data frame that we want to extract from, followed by the square brackets. Now inside the square brackets we give the coordinates or indices for the rows in which the value(s) are present, followed by a comma, then the coordinates or indices for the columns in which the value(s) are present. We know the wild type value is in the first row if we count from the top, so we put a one, then a comma. The wild type value is also in the first column, counting from left to right, so we put a one in the columns space too.

# Extract value 'Wt'\nmetadata[1, 1]\n

Now let's extract the value 1 from the first row and third column.

# Extract value '1'\nmetadata[1, 3] \n

Now if you only wanted to select based on rows, you would provide the index for the rows and leave the columns index blank. The key here is to include the comma, to let R know that you are accessing a 2-dimensional data structure:

# Extract third row\nmetadata[3, ] \n
What kind of data structure does the output appear to be? We see that it is two-dimensional with row names and column names, so we can surmise that it's likely a data frame.

If you were selecting specific columns from the data frame - the rows are left blank:

# Extract third column\nmetadata[ , 3]   \n

What kind of data structure does this output appear to be? It looks different from the data frame; we really just see a series of values output, indicating a vector data structure. This happens by default when selecting a single column from a data frame: R will drop to the simplest data structure possible. Since a single column in a data frame is really just a vector, R will output a vector data structure as the simplest data structure. Oftentimes we would like to keep our single column as a data frame. To do this, there is an argument we can add when subsetting called drop, meaning do we want to drop down to the simplest data structure. By default it is TRUE, but we can change its value to FALSE in order to keep the output as a data frame.

# Extract third column as a data frame\nmetadata[ , 3, drop = FALSE] \n

Just like with vectors, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a vector of the desired values.

We can extract consecutive rows or columns using the colon (:) to create the vector of indices to extract.

# Dataframe containing first two columns\nmetadata[ , 1:2] \n

Alternatively, we can use the combine function (c()) to extract any number of rows or columns. Let's extract the first, third, and sixth rows.

# Data frame containing first, third and sixth rows\nmetadata[c(1,3,6), ] \n

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Is celltype in column 1 or 2? Oh, right... it is column 2.) In some cases, the column/row number for values can change if the script you are using adds or removes columns/rows. It's therefore often better to use column/row names to extract particular values, and it makes your code easier to read and your intentions clearer.

# Extract the celltype column for the first three samples\nmetadata[c(\"sample1\", \"sample2\", \"sample3\") , \"celltype\"] \n

It's important to type the names of the columns/rows in the exact way that they are typed in the data frame; for instance if I had spelled celltype with a capital C, it would not have worked.

If you need to remind yourself of the column/row names, the following functions are helpful:

# Check column names of metadata data frame\ncolnames(metadata)\n\n# Check row names of metadata data frame\nrownames(metadata)\n

If only a single column is to be extracted from a data frame, there is a useful shortcut available. If you type the name of the data frame followed by the $, you have the option to choose which column to extract. For instance, let's extract the entire genotype column from our dataset:

# Extract the genotype column\nmetadata$genotype \n

The output will always be a vector, and if desired, you can continue to treat it as a vector. For example, if we wanted the genotype information for the first five samples in metadata, we can use the square brackets ([]) with the indices for the values from the vector to extract:

# Extract the first five values/elements of the genotype column\nmetadata$genotype[1:5]\n

Unfortunately, there is no equivalent $ syntax to select a row by name.
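You can, however, still select a row by name using the square bracket notation. A quick sketch using the metadata rows shown earlier:

# Extract the row for sample1 (all columns)\nmetadata[\"sample1\", ]\n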

Exercise

  1. Return a data frame with only the genotype and replicate column values for sample2 and sample8.
  2. Return the fourth and ninth values of the replicate column.
  3. Extract the replicate column as a data frame.
"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#selecting-using-indices-with-logical-operators","title":"Selecting using indices with logical operators","text":"

With data frames, similar to vectors, we can use logical expressions to extract the rows or columns in the data frame with specific values. First, we need to determine the indices of the rows or columns where a logical expression is TRUE, then we can extract those rows or columns from the data frame.

For example, if we want to return only those rows of the data frame with the celltype column having a value of typeA, we would perform two steps:

  1. Identify which rows in the celltype column have a value of typeA.
  2. Use those TRUE values to extract those rows from the data frame.

To do this we would extract the column of interest as a vector, with the first value corresponding to the first row, the second value corresponding to the second row, so on and so forth. We use that vector in the logical expression. Here we are looking for values to be equal to typeA, so our logical expression would be:

metadata$celltype == \"typeA\"\n

This will output TRUE and FALSE values for the values in the vector. The first six values are TRUE, while the last six are FALSE. This means the first six rows of our metadata have a value of typeA while the last six do not. We can save these values to a variable, which we can call whatever we would like; let's call it logical_idx.

logical_idx <- metadata$celltype == \"typeA\"\n

Now we can use those TRUE and FALSE values to extract the rows that correspond to the TRUE values from the metadata data frame. We will extract as we normally would from a data frame with metadata[ , ], and we need to make sure we put logical_idx in the rows position, since those TRUE and FALSE values correspond to the ROWS for which the expression is TRUE/FALSE. We will leave the columns position blank to return all columns.

metadata[logical_idx, ]\n
"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#selecting-indices-with-logical-operators-using-the-which-function","title":"Selecting indices with logical operators using the which() function","text":"

As you might have guessed, we can also use the which() function to return the indices for which the logical expression is TRUE. For example, we can find the indices where the celltype is typeA within the metadata dataframe:

which(metadata$celltype == \"typeA\")\n

This returns the values one through six, indicating that the first 6 values or rows are true, or equal to typeA. We can save our indices for which rows the logical expression is true to a variable we'll call idx, but, again, you could call it anything you want.

idx <- which(metadata$celltype == \"typeA\")\n

Then, we can use these indices to extract the rows as we have previously, giving idx in the rows position, while returning all columns:

metadata[idx, ]\n

Let's try another subsetting operation: extract the rows of the metadata data frame for only replicates 2 and 3. First, let's create the logical expression for the column of interest (replicate):

which(metadata$replicate > 1)\n

This should return the indices for the rows in the replicate column within metadata that have a value of 2 or 3. Now, we can save those indices to a variable and use that variable to extract those corresponding rows from the metadata table.

idx <- which(metadata$replicate > 1)\n\nmetadata[idx, ]\n

Alternatively, instead of doing this in two steps, we could use nesting to perform in a single step:

metadata[which(metadata$replicate > 1), ]\n

Either way works, so use the method that is most intuitive for you.

So far we haven't stored as variables any of the extractions/subsettings that we have performed. Let's save this output to a variable called sub_meta:

sub_meta <- metadata[which(metadata$replicate > 1), ]\n

Exercises

Subset the metadata dataframe to return only the rows of data with a genotype of KO.

NOTE

There are easier methods for subsetting dataframes using logical expressions, including the filter() and the subset() functions. These functions will return the rows of the dataframe for which the logical expression is TRUE, allowing us to subset the data in a single step. We will explore the filter() function in more detail in a later lesson.
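As a brief sketch of what this looks like with base R's subset() function (repeating the replicate example from above in a single step):

# Return rows of metadata where the logical expression is TRUE\nsubset(metadata, replicate > 1)\n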

"},{"location":"day_2_exercise/D2.2e_introR-data-wrangling/#lists","title":"Lists","text":"

Selecting components from a list requires a slightly different notation, even though in theory a list is a vector (that contains multiple data structures). To select a specific component of a list, you need to use double bracket notation [[]]. Let's use the list1 that we created previously, and index the second component:

list1[[2]]\n

What do you see printed to the console? Using the double bracket notation is useful for accessing the individual components whilst preserving the original data structure. When creating this list, we know we had originally stored a dataframe in the second component. With the class() function we can check if that is what we retrieve:

comp2 <- list1[[2]]\nclass(comp2)\n

You can also reference what is inside the component by adding an additional bracket. For example, in the first component we have a vector stored.

list1[[1]]\n\n[1] \"ecoli\" \"human\" \"corn\" \n

Now, if we wanted to reference the first element of that vector we would use:

list1[[1]][1]\n\n[1] \"ecoli\"\n

You can also do the same for dataframes and matrices, although with larger datasets it is not advisable. Instead, it is better to save the contents of a list component to a variable (as we did above) and further manipulate it. Also, it is important to note that when selecting components we can only access one at a time. To access multiple components of a list, see the note below.

Note

Using the single bracket notation also works with lists. The difference is the class of the information that is retrieved. Using single bracket notation, i.e. list1[1], will return the contents in list form rather than the original data structure. The benefit of this notation is that it allows indexing by vectors, so you can access multiple components of the list at once (see the sketch below).
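A short sketch contrasting the two notations on list1:

list1[1]            # returns a list of length 1 containing the vector\nlist1[[1]]          # returns the vector itself\nlist1[c(1, 2)]      # single brackets accept a vector of indices\nclass(list1[1])     # \"list\"\nclass(list1[[1]])   # \"character\"\n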

Exercise

  1. Create a list named random with the following components: metadata, age, list1, samplegroup, and number.
  2. Extract the samplegroup component.

Assigning names to the components in a list can help identify what each list component contains, as well as facilitate the extraction of values from list components.

Adding names to components of a list uses the names() function. Let's check and see if the list1 has names for the components:

names(list1) \n

When we created the list we had combined the species vector with a dataframe df and the number variable. Let's assign the original names to the components. To do this we can use the assignment operator in a new context. If we add names(list1) to the left side of the assignment arrow to be assigned to, then anything on the right side of the arrow will be assigned. Since we have three components in list1, we need three names to assign. We can create a vector of names using the combine (c()) function, and inside the combine function we give the names to assign to the components in the order we would like. So the first name is assigned to the first component of the list, and so on.

# Name components of the list\nnames(list1) <- c(\"species\", \"df\", \"number\")\n\nnames(list1)\n

Now that we have named our list components, we can extract components using the $ similar to extracting columns from a data frame. To obtain a component of a list using the component name, use list_name$component_name:

To extract the df dataframe from the list1 list:

# Extract 'df' component\nlist1$df\n

Exercise

Let's practice combining ways to extract data from the data structures we have covered so far:

  1. Set names for the random list you created in the last exercise.

  2. Extract the age component using the $ notation

An R package for data wrangling

The methods presented above are using base R functions for data wrangling. Later we will explore the Tidyverse suite of packages, specifically designed to make data wrangling easier.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.3e_identifying-matching-elements/","title":"Advanced R, logical operators for matching","text":"

Approximate time: 45 min

"},{"location":"day_2_exercise/D2.3e_identifying-matching-elements/#learning-objectives","title":"Learning Objectives","text":"
  • Describe the use of %in% operator.
  • Explain the use case for the any() and all() functions.
"},{"location":"day_2_exercise/D2.3e_identifying-matching-elements/#logical-operators-for-identifying-matching-elements","title":"Logical operators for identifying matching elements","text":"

Oftentimes, we encounter different analysis tools that require multiple input datasets. It is not uncommon for these inputs to need to have the same row names, column names, or unique identifiers in the same order to perform the analysis. Therefore, knowing how to reorder datasets and determine whether the data matches is an important skill.

In our use case, we will be working with genomic data. We have gene expression data generated by RNA-seq, which we had downloaded previously; in addition, we have a metadata file corresponding to the RNA-seq samples. The metadata contains information about the samples present in the gene expression file, such as which sample group each sample belongs to and any batch or experimental variables present in the data.

Let's read in our gene expression data (RPKM matrix) that we downloaded previously:

rpkm_data <- read.csv(\"data/counts.rpkm.csv\")\n

Note

If the data file name ends with txt instead of csv, you can read in the data using the code: rpkm_data <- read.csv(\"data/counts.rpkm.txt\").

Take a look at the first few lines of the data matrix to see what's in there.

head(rpkm_data)\n

It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up.

ncol(rpkm_data)\nnrow(metadata)\n

What we want to know is: do we have data for every sample for which we have metadata?

"},{"location":"day_2_exercise/D2.3e_identifying-matching-elements/#the-in-operator","title":"The %in% operator","text":"

Although lacking in documentation, this operator is well-used and convenient once you get the hang of it. The operator is used with the following syntax:

vector1 %in% vector2\n

It will take each element from vector1 as input, one at a time, and evaluate if the element is present in vector2. The two vectors do not have to be the same size. This operation will return a vector containing logical values to indicate whether or not there is a match. The new vector will be of the same length as vector1. Take a look at the example below:

A <- c(1,3,5,7,9,11)   # odd numbers\nB <- c(2,4,6,8,10,12)  # even numbers\n\n# test to see if each of the elements of A is in B  \nA %in% B\n
## [1] FALSE FALSE FALSE FALSE FALSE FALSE\n

Since vector A contains only odd numbers and vector B contains only even numbers, the operation returns a logical vector containing six FALSE, suggesting that no element in vector A is present in vector B. Let's change a couple of numbers inside vector B to match vector A:

A <- c(1,3,5,7,9,11)   # odd numbers\nB <- c(2,4,6,8,1,5)  # add some odd numbers in \n
# test to see if each of the elements of A is in B\nA %in% B\n
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE\n

The returned logical vector denotes which elements in A are also in B - the first and third elements, which are 1 and 5.

We saw previously that we could use the output from a logical expression to subset data by returning only the values corresponding to TRUE. Therefore, we can use the output logical vector to subset our data, and return only those elements in A that are also in B by returning only the TRUE values:

intersection <- A %in% B\nintersection\n

A[intersection]\n

In these previous examples, the vectors were so small that it's easy to check every logical value by eye; but this is not practical when we work with large datasets (e.g. a vector with 1000 logical values). Instead, we can use the any() function. Given a logical vector, this function will tell you whether at least one value is TRUE. It provides us a quick way to assess if any of the values contained in vector A are also in vector B:

any(A %in% B)\n

The all() function is also useful. Given a logical vector, it will tell you whether all values are TRUE. If there is at least one FALSE value, the all() function will return FALSE. We can use this function to assess whether all elements from vector A are contained in vector B.

all(A %in% B)\n

Exercise

  1. Using the A and B vectors created above, evaluate each element in B to see if there is a match in A

  2. Subset the B vector to only return those values that are also in A.

Suppose we had two vectors containing the same values. How can we check whether those values are in the same order in each vector? In this case, we can use the == operator to compare each element at the same position in the two vectors. The operator returns a logical vector indicating TRUE/FALSE at each position. Then we can use the all() function to check whether all values in the returned vector are TRUE. If all values are TRUE, we know that these two vectors are the same. Unlike the %in% operator, the == operator requires that the two vectors be of equal length.

A <- c(10,20,30,40,50)\nB <- c(50,40,30,20,10)  # same numbers but backwards \n\n# test to see if each element of A is in B\nA %in% B\n\n# test to see if each element of A is in the same position in B\nA == B\n\n# use all() to check if they are a perfect match\nall(A == B)\n

Let's try this on our genomic data, and see whether we have metadata information for all samples in our expression data. We'll start by creating two vectors: one is the rownames of the metadata, and one is the colnames of the RPKM data. These are base functions in R which allow you to extract the row and column names as a vector:

x <- rownames(metadata)\ny <- colnames(rpkm_data)\n

Now check to see that all of x are in y:

all(x %in% y)\n

Note that we can use nested functions in place of x and y and still get the same result:

all(rownames(metadata) %in% colnames(rpkm_data))\n

We know that all samples are present, but are they in the same order?

x == y\nall(x == y)\n

Looks like all of the samples are there, but they are not in the same order and need to be reordered. We will learn different ways to reorder data in our next lesson. But before that, let's work on exercise 2 to consolidate concepts from this lesson.

Exercise

We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the %in% operator from the rpkm_data data frame, instead of scrolling through rpkm_data and finding them manually.

First, let's create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:

    important_genes <- c(\"ENSMUSG00000083700\", \"ENSMUSG00000080990\", \"ENSMUSG00000065619\", \"ENSMUSG00000047945\", \"ENSMUSG00000081010\", \"ENSMUSG00000030970\")\n
  1. Use the %in% operator to determine if all of these genes are present in the row names of the rpkm_data data frame.

  2. Extract the rows from rpkm_data that correspond to these 6 genes using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.

  3. Bonus question: Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/","title":"Advanced R, reordering to match datasets","text":"

Approximate time: 45 min

"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#learning-objectives","title":"Learning Objectives","text":"
  • Implement manual reordering of vectors and data frames
  • Utilize the match() function to reorder vectors and data frames so that unique identifiers are in the same order
"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#reordering-data-to-match","title":"Reordering data to match","text":"

In the previous lesson, we learned how to determine whether the same data is present in two datasets, in addition to, whether it is in the same order. In this lesson, we will explore how to reorder the data such that the datasets are matching.

"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#manual-reordering-of-data-using-indices","title":"Manual reordering of data using indices","text":"

Indexing [ ] can be used to extract values from a dataset as we saw earlier, but we can also use it to rearrange our data values.

teaching_team <- c(\"Jihe\", \"Mary\", \"Meeta\", \"Radhika\", \"Will\", \"Emma\")\n

Remember that we can return values in a vector by specifying its position or index:

# Extracting values from a vector\nteaching_team[c(2, 4)] \n

Also, note that we haven't changed the teaching_team variable. The only way to change the teaching_team variable would be to re-assign/overwrite it.

teaching_team\n

We can also extract the values and reorder them:

# Extracting values and reordering them\nteaching_team[c(4, 2)] \n

Similarly, we can extract all of the values and reorder them:

# Extracting all values and reordering them\nteaching_team[c(5, 4, 6, 2, 1, 3)]\n

If we want to save our results, we need to assign to a variable:

# Saving the results to a variable\nreorder_teach <- teaching_team[c(5, 4, 6, 2, 1, 3)] \n

Exercise

Now that we know how to reorder using indices, let's try to use it to reorder the contents of one vector to match the contents of another. Let's create the vectors first and second as detailed below:

first <- c(\"A\",\"B\",\"C\",\"D\",\"E\")\nsecond <- c(\"B\",\"D\",\"E\",\"A\",\"C\")  # same letters but different order\n

How would you reorder the second vector to match first?

If we had large datasets, it would be difficult to reorder them by searching for the indices of the matching elements, and it would be quite easy to make a typo or mistake. To help with matching datasets, there is a function called match().

"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#the-match-function","title":"The match function","text":"

We can use the match() function to match the values in two vectors. We'll be using it to evaluate which values are present in both vectors, and how to reorder the elements to make the values match.

match() takes two arguments:

  1. a vector of values in the order you want
  2. a vector of values to be reordered such that it will match the first

The function returns the position of the matches (indices) with respect to the second vector, which can be used to re-order it so that it matches the order in the first vector. Let's use match() on the first and second vectors we created.

match(first,second)\n[1] 4 1 5 2 3\n

The output is the indices for how to reorder the second vector to match the first. These indices match the indices that we derived manually before.

Now, we can just use the indices to reorder the elements of the second vector to be in the same positions as the matching elements in the first vector:

# Saving indices for how to reorder `second` to match `first`\nreorder_idx <- match(first,second) \n

Then, we can use those indices to reorder the second vector similar to how we ordered with the manually derived indices.

# Reordering the second vector to match the order of the first vector\nsecond[reorder_idx]\n

If the output looks good, we can save the reordered vector to a new variable.

# Reordering and saving the output to a variable\nsecond_reordered <- second[reorder_idx]  \n

Now that we know how match() works, let's change the second vector so that only a subset of the values is retained:

first <- c(\"A\",\"B\",\"C\",\"D\",\"E\")\nsecond <- c(\"D\",\"B\",\"A\")  # remove values\n
And try to match() again:

match(first,second)\n\n[1]  3  2 NA  1 NA\n

We see that the match() function takes each element in the first vector and finds its position in the second vector; if an element is not present, it returns the missing value NA. (NA represents missing data for any data type within R.) In this output, the first value, 3, tells us that "A" (the first element of first) is at position 3 of second; the next value, 2, tells us that "B" is at position 2; the next element should be "C", but it is not present in second, so NA is returned; and so on.

Note

By default, values that don't match return NA. You can assign a different value to non-matches using the nomatch argument. Also, if more than one match is found, only the position of the first match is reported.
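
For example, using the first and second vectors from above, we could have non-matches return 0 instead of NA (a quick sketch):

match(first, second, nomatch = 0)\n\n[1] 3 2 0 1 0\n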

If we rearrange second using these indices, then we should see that all the values present in both vectors are in the same positions and NAs are present for any missing values.

second[match(first, second)]\n
"},{"location":"day_2_exercise/D2.4e_reordering-to-match-datasets/#reordering-genomic-data-using-match-function","title":"Reordering genomic data using match() function","text":"

While the input to the match() function is always going to be two vectors, often we need to use these vectors to reorder the rows or columns of a data frame to match the rows or columns of another data frame. Let's explore how to do this with our use case featuring RNA-seq data. To perform differential gene expression analysis, we have a data frame with the expression data or counts for every sample, and another data frame with information about the condition to which each sample belongs. For the tools performing the analysis, the samples in the counts data, which are the column names, need to be the same and in the same order as the samples in the metadata data frame, which are the row names.

We can take a look at these samples in each dataset by using the rownames() and colnames() functions.

# Check row names of the metadata\nrownames(metadata)\n\n# Check the column names of the counts data\ncolnames(rpkm_data)\n

We see the row names of the metadata are in a nice order starting at sample1 and ending at sample12, while the column names of the counts data look to be the same samples, but are randomly ordered. Therefore, we want to reorder the columns of the counts data to match the order of the row names of the metadata.

To do so, we will use the match() function to match the row names of our metadata with the column names of our counts data, so these will be the arguments for match().

Within the match() function, the row names of the metadata form the vector in the order that we want, so this will be the first argument, while the column names of the count or rpkm data form the vector to be reordered. We will save the indices for how to reorder the column names of the count data, such that they match the row names of the metadata, to a variable called genomic_idx.

genomic_idx <- match(rownames(metadata), colnames(rpkm_data))\ngenomic_idx\n

The genomic_idx represents how to re-order the column names in our counts data to be identical to the row names in metadata.

Now we can create a new counts data frame in which the columns are re-ordered based on the match() indices. Remember that to reorder the rows or columns in a data frame we give the name of the data frame followed by square brackets, and then the indices for how to reorder the rows or columns.

Since genomic_idx tells us how to reorder the columns of our count data so that the column names are in the same order as the row names of our metadata, we need to place genomic_idx in the columns position of the square brackets. We are going to save the output of the reordering to a new data frame called rpkm_ordered.

# Reorder the counts data frame to have the sample names in the same order as the metadata data frame\nrpkm_ordered  <- rpkm_data[ , genomic_idx]\n

Check and see what happened by clicking on the rpkm_ordered in the Environment window or using the View() function.

# View the reordered counts\nView(rpkm_ordered)\n

We can see the sample names are now in a nice order from sample 1 to 12, just like the metadata. One thing to note is that you would never want to rearrange just the column names without the rest of the column, because that would dissociate each sample name from its values.

You can also verify that the column names of this new data frame match the metadata row names by using the all() function:

all(rownames(metadata) == colnames(rpkm_ordered))\n

Now that our samples are ordered the same in our metadata and counts data, if these were raw counts (not RPKM) we could proceed to perform differential expression analysis with this dataset.

Exercise

  1. After talking with your collaborator, it becomes clear that sample2 and sample9 were actually from a different mouse background than the other samples and should not be part of our analysis. Create a new variable called subset_rpkm that has these columns removed from the rpkm_ordered data frame.

  2. Use the match() function to subset the metadata data frame so that the row names of the metadata data frame match the column names of the subset_rpkm data frame.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/","title":"Plotting and data visualization in R","text":""},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#learning-objectives","title":"Learning Objectives","text":"
  • Describe the map() function for iterative tasks on data structures.
"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#setting-up-a-data-frame-for-visualization","title":"Setting up a data frame for visualization","text":"

In this lesson we want to make plots to evaluate the average expression in each sample and its relationship with the age of the mouse. To this end, we will add a couple of additional columns of information to the metadata data frame that we can utilize for plotting.

"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#calculating-average-expression","title":"Calculating average expression","text":"

Let's take a closer look at our counts data. Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. Eventually, we want to compute the average expression value for each sample. Taking this one step at a time, what would we do if we just wanted the average expression for Sample 1 (across all transcripts)? We can use the base R function mean():

mean(rpkm_ordered$sample1)\n

That is great, if we only wanted the average from one of the samples (1 column in a data frame), but we need to get this information from all 12 samples, so all 12 columns. It would be ideal to get a vector of 12 values that we can add to the metadata data frame. What is the best way to do this?

Programming languages typically have a way to allow the execution of a single line of code or several lines of code multiple times, or in a \"loop\". While \"for loops\" are available in R, there are other easier-to-use functions that can achieve this - for example, the apply() family of functions and the map() family of functions.
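
For comparison, here is a rough sketch of how the same per-column means could be computed with a for loop (for illustration only; samplemeans_loop is a hypothetical name, and we will use the simpler map_dbl() approach below):

# A for-loop sketch: compute the mean of every column in rpkm_ordered\nsamplemeans_loop <- numeric(ncol(rpkm_ordered))\nfor (i in 1:ncol(rpkm_ordered)) {\n  samplemeans_loop[i] <- mean(rpkm_ordered[, i])\n}\n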

The map() family is a bit more intuitive to use than apply() and we will be using it today. However, if you are interested in learning more about the apply() family of functions, we have materials available here.

To obtain mean values for all samples we can use the map_dbl() function which generates a numeric vector.

library(purrr)  # Load the purrr package\n\nsamplemeans <- map_dbl(rpkm_ordered, mean) \n
"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#the-map-family-of-functions","title":"The map family of functions","text":"

The map() family of functions is available from the purrr package, which is part of the tidyverse suite of packages. More detailed information is available in the R for Data Science book. This family includes several functions, each taking a vector as input and outputting a vector of a specified type. For example, we can use these functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on.

  • map() creates a list.
  • map_lgl() creates a logical vector.
  • map_int() creates an integer vector.
  • map_dbl() creates a \"double\" or numeric vector.
  • map_chr() creates a character vector.

The syntax for the map() family of functions is:

## DO NOT RUN\nmap(object, function_to_apply)\n
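
As a small, self-contained illustration of the type-specific variants (a sketch using a toy data frame rather than our workshop data):

# A toy data frame to demonstrate map variants\ndf <- data.frame(a = 1:3, b = 4:6)\n\nmap_dbl(df, mean)        # named numeric vector: a = 2, b = 5\nmap_lgl(df, is.numeric)  # named logical vector: a = TRUE, b = TRUE\n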

If you would like to practice with the map() family of functions, we have additional materials available.

"},{"location":"day_2_exercise/D2.5e_setting_up_to_plot/#creating-a-new-metadata-object-with-additional-information","title":"Creating a new metadata object with additional information","text":"

Because the input was 12 columns of information, the output of map_dbl() is a named vector of length 12.

# Named vectors have a name assigned to each element instead of just referring to them as indices ([1], [2] and so on)\nsamplemeans\n\n# Check length of the vector before adding it to the data frame\nlength(samplemeans)\n

Since we have 12 rows in the data frame, we can add the 12 element vector as a column to our metadata data frame using the data.frame() function.

Before we add the new column, let's create a vector with the ages of each of the mice in our data set.

# Create a numeric vector with ages. Note that there are 12 elements here\nage_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32)        \n

Now, we are ready to combine the metadata data frame with the 2 new vectors to create a new data frame with 5 columns.

# Add the new vector as the last column to the new_metadata dataframe\nnew_metadata <- data.frame(metadata, samplemeans, age_in_days) \n\n# Take a look at the new_metadata object\nView(new_metadata)\n

Note

We could have also combined the columns using the cbind() function, as shown in the code below:

## DO NOT RUN\nnew_metadata <- cbind(metadata, samplemeans, age_in_days)\n
The two functions work identically with the exception of assigning row names. For example, if we were combining columns and wanted to add a vector of row names, we could easily do so in data.frame() using the row.names argument. This argument is not available for the cbind() function.
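
As a sketch of what that could look like (not run here; sample_names stands in for a hypothetical character vector of row names):

## DO NOT RUN\n# sample_names is a hypothetical character vector supplying the row names\nnew_metadata <- data.frame(metadata, samplemeans, age_in_days, row.names = sample_names)\n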

We are now ready for plotting and data visualization!

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3/D3.1_in_class_exercises/","title":"Day 3: In class activities","text":""},{"location":"day_3/D3.1_in_class_exercises/#reading-in-and-inspecting-data","title":"Reading in and inspecting data","text":"
  • Download the data and place the file into the data directory.
Download link: Animal data (right click & Save link as...)
  • Read the .csv file into your environment and assign it to a variable called animals. Be sure to check that your row names are the different animals.

  • Check to make sure that animals is a dataframe.

  • How many rows are in the animals dataframe? How many columns?

"},{"location":"day_3/D3.1_in_class_exercises/#data-wrangling","title":"Data wrangling","text":"
  1. Extract the speed value of 40 km/h from the animals dataframe.

  2. Return the rows with animals that are the color Tan.

  3. Return the rows with animals that have speed greater than 50 km/h and output only the color column. Keep the output as a data frame.

  4. Change the color of \"Grey\" to \"Gray\".

  5. Create a list called animals_list in which the first element contains the speed column of the animals dataframe and the second element contains the color column of the animals dataframe.

  6. Give each element of your list the appropriate name (i.e. speed and color).

"},{"location":"day_3/D3.1_in_class_exercises/#the-in-operator-reordering-and-matching","title":"The %in% operator, reordering and matching","text":"

In your environment you should have a dataframe called proj_summary which contains quality metric information for an RNA-seq dataset. We have obtained batch information for the control samples in this dataset.

  1. Copy and paste the code below to create a dataframe of control samples with the associated batch information
    ctrl_samples <- data.frame(row.names = c(\"sample3\", \"sample10\", \"sample8\", \"sample4\", \"sample15\"), date = c(\"01/13/2018\", \"03/15/2018\", \"01/13/2018\", \"09/20/2018\",\"03/15/2018\"))\n
  2. How many of the ctrl_samples are also in the proj_summary dataframe? Use the %in% operator to compare sample names.

  3. Keep only the rows in proj_summary which correspond to those in ctrl_samples. Do this with the %in% operator. Save it to a variable called proj_summary_ctrl.

  4. We would like to add in the batch information for the samples in proj_summary_ctrl. Find the rows that match in ctrl_samples.

  5. Use cbind() to add a column called batch to the proj_summary_ctrl dataframe. Assign this new dataframe back to proj_summary_ctrl.

"},{"location":"day_3/D3.1_in_class_exercises/#bonus-using-map_lgl","title":"BONUS: Using map_lgl()","text":"
  1. Subset proj_summary to keep only the \"high\" and \"low\" samples based on the treatment column. Save the new dataframe to a variable called proj_summary_noctl.

  2. Further, subset the dataframe to remove the non-numeric columns \"Quality_format\" and \"treatment\". Try to do this using the map_lgl() function in addition to is.numeric(). Save the new dataframe back to proj_summary_noctl.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3/D3.2_plotting_with_ggplot2/","title":"Plotting and data visualization in R","text":"

Approximate time: 60 minutes

"},{"location":"day_3/D3.2_plotting_with_ggplot2/#learning-objectives","title":"Learning Objectives","text":"
  • Explain the syntax of ggplot2
  • Apply ggplot2 package to visualize data.
"},{"location":"day_3/D3.2_plotting_with_ggplot2/#data-visualization-with-ggplot2","title":"Data Visualization with ggplot2","text":"

For this lesson, you will need the new_metadata data frame. Please download it from the link below: right click and choose Save link as... (or Download linked file as...) to save the file into the data directory.

Download link: Data (right click & Save link as...)

Once you have downloaded it, load it into your environment as follows:

## load the new_metadata data frame into your environment from a .RData object\nload(\"data/new_metadata.RData\")\n

Next, let's check if it was successfully loaded into the environment:

# this data frame should have 12 rows and 5 columns\nView(new_metadata)\n

Great, we are now ready to move forward!

When we are working with large sets of numbers it can be useful to display that information graphically to gain more insight. In this lesson we will be plotting with the popular R package ggplot2.

Note

If you are interested in learning about plotting with base R functions, we have a short lesson.

The ggplot2 syntax takes some getting used to, but once you get it, you will find it's extremely powerful and flexible. We will start with drawing a simple x-y scatterplot of samplemeans versus age_in_days from the new_metadata data frame. Please note that ggplot2 expects a \"data frame\" or \"tibble\" as input.

Note

You can find out more about tibbles in the lesson on tidyverse

Let's start by loading the ggplot2 library:

library(ggplot2)\n

The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot using additional functions one after the other and combine them into a \"code chunk\" using the + operator; the functions in the resulting code chunk are called layers.

Try the code below and see what happens.

ggplot(new_metadata) # what happens? \n

Metadata

If you don't have the new_metadata object, you can right-click to download and save an rds file from here into the project data folder, and load it in using the code below:

new_metadata <- readRDS(\"data/new_metadata.rds\")\n

You get a blank plot, because you need to specify additional layers using the + operator.

The geom (geometric) object is the layer that specifies what kind of plot we want to draw. A plot must have at least one geom; there is no upper limit. Examples include:

  • points (geom_point, geom_jitter for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)

Let's add a \"geom\" layer to our plot using the + operator; since we want a scatter plot, we will use geom_point().

ggplot(new_metadata) +\n  geom_point() # note what happens here\n

Why do we get an error? Is the error message easy to decipher?

We get an error because each type of geom usually has a required set of aesthetics to be set. \"Aesthetics\" are set with the aes() function and can be set either nested within geom_point() (applies only to that layer) or within ggplot() (applies to the whole plot).

The aes() function has many different arguments, and all of those arguments take columns from the original data frame as input. It can be used to specify many plot elements including the following:

  • position (i.e., on the x and y axes)
  • color (\"outside\" color)
  • fill (\"inside\" color)
  • shape (of points)
  • linetype
  • size

To start, we will specify the x- and y-axes, since geom_point requires the most basic information about a scatterplot, i.e. what you want to plot on the x and y axes. All of the other plot elements mentioned above are optional.

ggplot(new_metadata) +\n     geom_point(aes(x = age_in_days, y= samplemeans))\n

Now that we have the required aesthetics, let's add some extras, like color. We can color the points on the plot based on the genotype column within aes(). You will notice that there is a default set of colors that will be used, so we do not have to specify them. Note that the legend has been conveniently plotted for us.

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype)) \n

Let's try to have both celltype and genotype represented on the plot. To do this we can assign the shape argument in aes() the celltype column, so that each celltype is plotted with a different shaped data point.

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype)) \n

The data points are quite small. We can adjust their size within the geom_point() layer, but this should not be done within aes(), since we are not mapping size to a column in the input data frame; we are just specifying a number.

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype), size=2.25) \n

The labels on the x- and y-axis are also quite small and hard to read. To change their size, we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:

  • Axis label aesthetics
  • Plot background
  • Facet label background
  • Legend appearance

There are built-in themes we can use (e.g. theme_bw()) that mostly change the background/foreground colors, by adding one as an additional layer. Or we can adjust specific elements of the current default theme by adding the theme() layer and passing in arguments for the things we wish to change. Or we can use both.

Let's add a layer theme_bw().

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype), size=3.0) +\n  theme_bw() \n

Do the axis labels or the tick labels get any larger by changing themes?

No, they don't. But, we can add arguments using theme() to change the size of axis labels ourselves. Since we will be adding this layer \"on top\", or after theme_bw(), any features we change will override what is set by the theme_bw() layer.

Let's increase the size of both the axes titles to be 1.5 times the default size. When modifying the size of text the rel() function is commonly used to specify a change relative to the default.

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype), size=2.25) +\n  theme_bw() +\n  theme(axis.title = element_text(size=rel(1.5)))           \n

Notes

  • You can use the example(\"geom_point\") function here to explore a multitude of different aesthetics and layers that can be added to your plot. As you scroll through the different plots, take note of how the code is modified. You can use this with any of the different geometric object layers available in ggplot2 to learn how you can easily modify your plots!
  • RStudio provides this very useful cheatsheet for plotting using ggplot2. Different example plots are provided along with the associated code (i.e. which geom or theme to use in a given situation). We also encourage you to peruse this useful online reference for working with ggplot2.

Exercise

  1. The current axis label text defaults to what we gave as input to geom_point (i.e. the column headers). We can change this by adding additional layers called xlab() and ylab() for the x- and y-axis, respectively. Add these layers to the current plot such that the x-axis is labeled \"Age (days)\" and the y-axis is labeled \"Mean expression\".
  2. Use the ggtitle layer to add a plot title of your choice.
  3. Add the following new layer to the code chunk theme(plot.title=element_text(hjust=0.5)).
    • What does it change?
    • How many theme() layers can be added to a ggplot code chunk, in your estimation?

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3/basic_plots_in_r/","title":"Plotting and data visualization in R (basics)","text":"

Approximate time: 45 minutes

"},{"location":"day_3/basic_plots_in_r/#basic-plots-in-r","title":"Basic plots in R","text":"

R has a number of built-in tools for basic graph types such as histograms, scatter plots, bar charts, boxplots and much more. Rather than going through all of the different types, we will focus on plot(), a generic function for plotting x-y data.

To get a quick view of the different things you can do with plot, let's use the example() function:

example(\"plot\")\n

Here, you will find yourself having to press <Return> so you can scroll through the different types of graphs generated by plot. Take note of the different parameters used with each command and how that affects the aesthetics of the plot.

dev.off() \n# this means \"device off\"; we will go over what it does at the end of this section. \n# For now, it ensures that when we draw plots they show up where they are supposed to.\n
"},{"location":"day_3/basic_plots_in_r/#scatterplot","title":"Scatterplot","text":"

For some hands-on practice we are going to use plot to draw a scatter plot and obtain a graphical view of the relationship between two sets of continuous numeric data. From our new_metadata file we will take the samplemeans column and plot it against age_in_days, to see how mean expression changes with age.

Now our metadata has all the information to draw a scatterplot. The base R function to do this is plot(y ~ x, data):

plot(samplemeans ~ age_in_days, data=new_metadata)\n

Each point represents a sample. The values on the y-axis correspond to the average expression for each sample, which is dependent on the x-axis variable age_in_days. This plot is in its simplest form; we can customize many features of the plot (fonts, colors, axes, titles) through graphic options.

For example, let's start by giving our plot a title and renaming the axes. We can do that by simply adding the options xlab, ylab and main as arguments to the plot() function:

plot(samplemeans ~ age_in_days, data=new_metadata, main=\"Expression changes with age\", xlab=\"Age (days)\", \n    ylab=\"Mean expression\")\n

We can also change the shape of the data point using the pch option and the size of the data points using cex (specifying the amount to magnify relative to the default).

plot(samplemeans ~ age_in_days, data=new_metadata, main=\"Expression changes with age\", xlab=\"Age (days)\", \n    ylab=\"Mean expression\", pch=\"*\", cex=2.0)\n

We can also add some color to the data points on the plot by adding col=\"blue\". Alternatively, you can sub in any of the default colors or you can experiment with other R packages to fiddle with better palettes.

We can also use color to separate the data points by information in our data frame. For example, suppose we wanted the data points colored by celltype. We would need to specify a vector of colors and provide the factor by which we are separating samples. The first level in our factor vector (which by default is assigned alphabetically) gets assigned the first color that we list. So in this case, blue corresponds to celltype A samples and green corresponds to celltype B.

plot(samplemeans ~ age_in_days, data=new_metadata, main=\"Expression changes with age\", xlab=\"Age (days)\", \n    ylab=\"Mean expression\", pch=\"*\", cex=2.0, col=c(\"blue\", \"green\")[celltype])\n

The last thing this plot needs is a figure legend describing the color scheme. It would be great if it created one for you by default, but with R base functions unfortunately it is not that easy. To draw a legend on the current plot, you need to run a new function called legend() and specify the appropriate arguments. The code to do so is provided below. Don't worry if it seems confusing, we plan on showing you a much more intuitive way of plotting your data.

legend(\"topleft\", pch=\"*\", col=c(\"blue\", \"green\"), c(\"A\", \"B\"), cex=0.8,\n    title=\"Celltype\")\n

Exercise

  1. Change the color scheme in the scatterplot, such that it reflects the genotype of samples rather than celltype.

  2. Use R help to find out how to increase the size of the text on the axis labels.

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

"},{"location":"day_3/basic_plots_in_r/#other-types-of-plots-in-base-r","title":"Other Types of Plots in Base R","text":"

NOTE: we will not run these in class, but the code is provided if you are interested in exploring more on your own.

"},{"location":"day_3/basic_plots_in_r/#barplot","title":"Barplot","text":"

Barplots are useful for comparing the values of a quantitative (numeric) variable between groups or categories. Here, a barplot is well suited to comparing the samplemeans (numeric variable) across samples: we can use barplot to draw a single bar for each sample, with the height indicating its average expression level.

?barplot\n# note that there is no \"data=\" argument for barplot()\n
Similar to the scatterplot, we can use additional arguments to specify the aesthetics that we want to change. For example, changing axis labeling and adding some color.
barplot(new_metadata$samplemeans, names.arg=c(1:12), horiz=TRUE, col=c(\"darkblue\", \"red\")[new_metadata$genotype]) \n

"},{"location":"day_3/basic_plots_in_r/#histogram","title":"Histogram","text":"

If we are interested in an overall distribution of numerical data, a histogram is what we'd want. To plot a histogram of the data use the hist command:

hist(new_metadata$samplemeans)\n
Again, there are many options that we can change by modifying the default parameters. Let's add an x-axis label, color in the bars, and remove the borders:
hist(new_metadata$samplemeans, xlab=\"Mean expression level\", main=\"\", col=\"darkgrey\", border=FALSE) \n

"},{"location":"day_3/basic_plots_in_r/#boxplot","title":"Boxplot","text":"

Using additional sample information from our metadata, we can use plots to compare values between different factor levels or categories. For example, we can compare the sample means across celltypes 'typeA' and 'typeB' using a boxplot.

# Boxplot\nboxplot(samplemeans~celltype, data=new_metadata)\n

"},{"location":"day_3_exercise/D3.1e_Custom_Functions_ggplot2/","title":"Custom functions for consistent plots","text":"

Approximate time: 20 minutes

"},{"location":"day_3_exercise/D3.1e_Custom_Functions_ggplot2/#learning-objectives","title":"Learning Objectives","text":"
  • Apply the custom function to generate consistent plots.
"},{"location":"day_3_exercise/D3.1e_Custom_Functions_ggplot2/#consistent-formatting-using-custom-functions","title":"Consistent formatting using custom functions","text":"

When publishing, it is helpful to ensure all plots have similar formatting. To do this we can create a custom function with our preferences for the theme. Remember the structure of a function is:

name_of_function <- function(arguments) {\n    statements or code that does something\n}\n

Now, let's suppose we always wanted our theme to include the following:

theme_bw() +\ntheme(axis.title=element_text(size=rel(1.5))) +\ntheme(plot.title=element_text(size=rel(1.5), hjust=0.5))\n

Note

You can also combine multiple arguments within the same theme() function:

theme_bw() +\ntheme(axis.title=element_text(size=rel(1.5)), plot.title=element_text(size=rel(1.5), hjust=0.5))\n

Since there is nothing that we want to change when we run this code, our function does not need any arguments. Creating the function is simple; we can just put the code inside the {}:

personal_theme <- function(){\n  theme_bw() +\n  theme(axis.title=element_text(size=rel(1.5))) +\n  theme(plot.title=element_text(size=rel(1.5), hjust=0.5))\n}\n

Now to run our personal theme with any plot, we can use this function in place of the lines of theme() code:

ggplot(new_metadata) +\n  geom_point(aes(x=age_in_days, y=samplemeans, color=genotype, shape=celltype), size=rel(3.0)) +\n  xlab(\"Age (days)\") +\n  ylab(\"Mean expression\") +\n  ggtitle(\"Expression with Age\") +\n  personal_theme()\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/","title":"Plotting and data visualization in R","text":"

Approximate time: 60 minutes

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#learning-objectives","title":"Learning Objectives","text":"
  • Generate the box plot using ggplot2
"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#generating-a-boxplot-with-ggplot2","title":"Generating a Boxplot with ggplot2","text":"

A boxplot provides a graphical view of the distribution of data based on a five number summary:

  • The top and bottom of the box represent the (1) first and (2) third quartiles (25th and 75th percentiles, respectively).

  • The line inside the box represents the (3) median (50th percentile).

  • The whiskers extending above and below the box represent (4) the maximum and (5) the minimum values that are not outliers.

Note

In this case, outliers are determined using the interquartile range (IQR), which is defined as Q3 - Q1. Any value that is more than 1.5 x IQR below Q1 or above Q3 is considered an outlier and is represented as a point above or below the whiskers.
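
As a small numeric illustration of this rule (a sketch with toy values, unrelated to our dataset):

# Toy vector: 100 is far outside the bulk of the data\nx <- c(1, 2, 3, 4, 100)\nq <- quantile(x, c(0.25, 0.75))  # Q1 = 2, Q3 = 4\niqr <- q[2] - q[1]               # IQR = 2\n\n# Values beyond 1.5 x IQR from the box edges are outliers\nx[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]  # returns 100\n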

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#1-boxplot","title":"1. Boxplot!","text":"

Generate a boxplot using the data in the new_metadata dataframe. Create a ggplot2 code chunk with the following instructions:

  1. Use the geom_boxplot() layer to plot the differences in sample means between the Wt and KO genotypes.
  2. Use the fill aesthetic to look at differences in sample means between the celltypes within each genotype.
  3. Add a title to your plot.
  4. Add labels, 'Genotype' for the x-axis and 'Mean expression' for the y-axis.
  5. Make the following theme() changes:
    • Use the theme_bw() function to make the background white.
    • Change the size of your axes labels to 1.25x larger than the default.
    • Change the size of your plot title to 1.5x larger than default.
    • Center the plot title.

After running the above code the boxplot should look something like that provided below.

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#2-changing-the-order-of-genotype-on-the-boxplot","title":"2. Changing the order of genotype on the Boxplot","text":"

Let's say you wanted to have the \"Wt\" boxplots displayed first on the left side, and \"KO\" on the right. How might you go about doing this?

To do this, your first question should be:

How does ggplot2 determine what to place where on the X-axis?

  • The order of the genotype on the X axis is in alphabetical order.

  • To change it, you need to make sure that the genotype column is a factor.

  • And, the factor levels for that column are in the order you want on the X-axis

  • Factor the new_metadata$genotype column without creating any extra variables/objects and change the levels to c(\"Wt\", \"KO\")

  • Re-run the boxplot code chunk you created for the \"Boxplot!\" exercise above.

"},{"location":"day_3_exercise/D3.2e_boxplot_exercise/#3-changing-default-colors","title":"3. Changing default colors","text":"

You can color the boxplot differently by using some specific layers:

  1. Add a new layer scale_color_manual(values=c(\"purple\",\"orange\")).
    • Do you observe a change?
  2. Replace scale_color_manual(values=c(\"purple\",\"orange\")) with scale_fill_manual(values=c(\"purple\",\"orange\")).
    • Do you observe a change?
    • In the scatterplot we drew in class, add a new layer scale_color_manual(values=c(\"purple\",\"orange\")), do you observe a difference?
    • What do you think is the difference between scale_color_manual() and scale_fill_manual()?
  3. Back in your boxplot code, change the colors in the scale_fill_manual() layer to be your 2 favorite colors.
    • Are there any colors that you tried that did not work?

We have a separate lesson about using color palettes from the package RColorBrewer, if you are interested.

You are not restricted to specifying colors by writing them out as character names. You have the choice of a lot of colors in R, and you can also use their hexadecimal codes. For example, \"#FF0000\" would be red and \"#00FF00\" would be green; similarly, \"#FFFFFF\" would be white and \"#000000\" would be black. Click here for more information about color palettes in R.
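
For instance, a hexadecimal code can be used anywhere a color name is accepted; a sketch with an arbitrary code:

ggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y = samplemeans), color = \"#1B9E77\", size = 2.25)\n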

OPTIONAL Exercise:

  • Find the hexadecimal code for your 2 favourite colors (from exercise 3 above) and replace the color names with the hexadecimal codes within the ggplot2 code chunk.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3_exercise/D3.3e_exporting_data_and_plots/","title":"Saving data and plots to file","text":"

Approximate time: 30 minutes

"},{"location":"day_3_exercise/D3.3e_exporting_data_and_plots/#learning-objectives","title":"Learning Objectives","text":"
  • Describe how to export data tables and plots for use outside of the R environment.
"},{"location":"day_3_exercise/D3.3e_exporting_data_and_plots/#writing-data-to-file","title":"Writing data to file","text":"

Everything we have done so far has only modified the data in R; the files have remained unchanged. Whenever we want to save our datasets to file, we need to use a write function in R.

To write our data frame to file in comma-separated format (.csv), we can use the write.csv function. There are two required arguments: the variable name of the data structure you are exporting, and the path and filename that you are exporting to. By default the delimiter (column separator) is set to a comma:

# Save a data frame to file\nwrite.csv(sub_meta, file=\"data/subset_meta.csv\")\n

Oftentimes the output is not exactly what you might want, and you can modify it using the arguments of the function. We can explore the arguments using ?; this can help elucidate how each argument adjusts the output.

?write.csv\n
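
For instance, two commonly adjusted arguments are quote and row.names (a quick sketch):

# Save the data frame without quoting values and without row names\nwrite.csv(sub_meta, file=\"data/subset_meta.csv\", quote=FALSE, row.names=FALSE)\n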

Similar to reading in data, there are a wide variety of functions available allowing you to export data in specific formats. Another commonly used function is write.table, which allows you to specify the delimiter or separator you wish to use. This function is commonly used to create tab-delimited files.

Note

Sometimes when writing a data frame with row names to file using write.table(), the column names will be shifted over, aligning above the row names column. To avoid this, you can include the argument col.names = NA when writing to file, which ensures all of the column names line up with the correct column values.
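
A sketch of writing the same data frame as a tab-delimited file with this fix applied:

# Tab-delimited output; col.names = NA keeps headers aligned with row names\nwrite.table(sub_meta, file=\"data/subset_meta.txt\", sep=\"\\t\", col.names=NA)\n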

Writing a vector of values to file requires a different function than the functions available for writing dataframes. You can use write() to save a vector of values to file. For example:

# Save a vector to file\nwrite(glengths, file=\"data/genome_lengths.txt\")\n

If we wanted the vector to be output to a single column instead of five, we could explore the arguments:

?write\n

Note the ncolumns argument: it defaults to five columns unless otherwise specified, so to get a single column:

# Save a vector to file as a single column\nwrite(glengths, file=\"data/genome_lengths.txt\", ncolumns = 1)\n
"},{"location":"day_3_exercise/D3.3e_exporting_data_and_plots/#exporting-figures-to-file","title":"Exporting figures to file","text":"

There are two ways in which figures and plots can be output to a file (rather than simply displaying on screen).

  1. The first (and easiest) is to export directly from the RStudio 'Plots' panel, by clicking on Export when the image is plotted. This will give you the option of png or pdf, and of selecting the directory to which you wish to save the file. It will also give you options to dictate the size and resolution of the output image.

  2. The second option is to use R functions and have the write to file hard-coded in to your script. This would allow you to run the script from start to finish and automate the process (not requiring human point-and-click actions to save). In R\u2019s terminology, output is directed to a particular output device and that dictates the output format that will be produced. A device must be created or \u201copened\u201d in order to receive graphical output and, for devices that create a file on disk, the device must also be closed in order to complete the output.

If we wanted to print our scatterplot to a pdf file, we would need to initialize a plot using a function which specifies the graphical format you intend on creating, i.e. pdf(), png(), tiff(), etc. Within the function you will need to specify a name for your image, and optionally the width and height. This will open up the device that you wish to write to:

## Open device for writing\npdf(\"figures/scatterplot.pdf\")\n

If you wish to modify the size and resolution of the image you will need to add in the appropriate parameters as arguments to the function when you initialize. Then we plot the image to the device, using the ggplot scatterplot that we just created.

## Make a plot which will be written to the open device, in this case the temp file created by pdf()/png()\nggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y= samplemeans, color = genotype,\n            shape=celltype), size=rel(3.0)) \n

Finally, close the \"device\", or file, using the dev.off() function. There are also bmp, tiff, and jpeg functions, though the jpeg function has proven less stable than the others.

## Closing the device is essential to save the temporary file created by pdf()/png()\ndev.off()\n
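
Putting the whole workflow together for a different format, here is a sketch using png() (the file name and dimensions are arbitrary):

## DO NOT RUN\npng(\"figures/scatterplot.png\", width = 600, height = 600)\n\nggplot(new_metadata) +\n  geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,\n            shape = celltype), size = rel(3.0))\n\ndev.off()\n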

Notes

  1. You will not be able to open and look at your file using standard methods (Adobe Acrobat or Preview etc.) until you execute the dev.off() function.
  2. In the case of pdf(), if you had made additional plots before closing the device, they will all be stored in the same file, with each plot usually getting its own page, unless otherwise specified.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3_exercise/D3.4e_finding_help/","title":"Troubleshooting and finding help","text":"

Approximate time: 30 min

"},{"location":"day_3_exercise/D3.4e_finding_help/#learning-objectives","title":"Learning Objectives","text":"
  • Identify different R-specific external sources to help with troubleshooting errors and obtaining more information about functions and packages.
"},{"location":"day_3_exercise/D3.4e_finding_help/#asking-for-help","title":"Asking for help","text":"

The key to getting help from someone is for them to grasp your problem rapidly. You should make it as easy as possible to pinpoint where the issue might be.

  1. Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.

  2. Always include the output of sessionInfo() as it provides critical information about your platform, the versions of R and the packages that you are using, and other information that can be very helpful to understand your problem.

    sessionInfo()  #This time it is not interchangeable with search()\n
  3. If possible, reproduce the problem using a very small data.frame instead of your one with 50,000 rows and 10,000 columns; provide the small one along with the description of your problem. When appropriate, try to generalize what you are doing so that even people who are not in your field can understand the question.

    • To share an object with someone else, you can provide the raw file (i.e., your CSV file) along with your script up to the point of the error (after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data.frame, you can save any other R data structure from your environment to a file:

      DO NOT RUN THIS

      # DO NOT RUN THIS!\n\nsave(iris, file=\"/tmp/iris.RData\")\n

      The content of this .RData file is not human readable and cannot be posted directly on stackoverflow. It can, however, be emailed to someone who can read it with this command:

      DO NOT RUN THIS

      # DO NOT RUN THIS!\n\nload(file=\"~/Downloads/iris.RData\")\n
"},{"location":"day_3_exercise/D3.4e_finding_help/#where-to-ask-for-help","title":"Where to ask for help?","text":"
  • Google is often your best friend for finding answers to specific questions regarding R.
    • Cryptic error messages are very common in R - it is very likely that someone else has encountered this problem already! Start by googling the error message. However, this doesn't always work because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. \"subscript out of bounds\").
  • Stackoverflow: Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: http://stackoverflow.com/questions/tagged/r. If your question hasn't been answered before and is well crafted, chances are you will get an answer in less than 5 min.
  • Your friendly colleagues: if you know someone with more experience than you, they might be able and willing to help you.
  • The R-help mailing list: it is read by a lot of people (including most of the R core team), and a lot of people post to it, but the tone can be pretty dry, and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast, but don't expect it to come with smiley faces. Also, here more than anywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to the misuse of your words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
  • The Bioconductor support site. This is very useful and if you tag your post, there is a high likelihood of getting an answer from the developer.
  • If your question is about a specific package, see if there is a mailing list for it. Usually it's included in the DESCRIPTION file of the package that can be accessed using packageDescription(\"name-of-package\"). You may also want to try to email the author of the package directly.
  • There are also some topic-specific mailing lists (GIS, phylogenetics, etc...), the complete list is here.
"},{"location":"day_3_exercise/D3.4e_finding_help/#more-resources","title":"More resources","text":"
  • The Posting Guide for the R mailing lists.
  • How to ask for R help useful guidelines
  • The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
  • The R FAQ is dense and technical but it is full of useful information.

Exercise

  1. Run the following code chunks and fix all of the errors. (Note: The code chunks are independent from one another.)

    # Create vector of work days\nwork_days <- c(Monday, Tuesday, Wednesday, Thursday, Friday)\n
    # Create a function to round the output of the sum function\nround_the_sum <- function(x){\n        return(round(sum(x))\n}\n
    # Create a function to add together three numbers\nadd_numbers <- function(x,y,z){\n        sum(x,y,z)\n}\n\nadd_numbers(5,9)\n
  2. You try to install a package and you get the following error message:

Error message

Error: package or namespace load failed for 'Seurat' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]): there is no package called 'multtest'\n

What would you do to remedy the error?

  3. You would like to ask for help on an online forum. To do this you want the users of the forum to reproduce your problem, so you want to provide them with as much relevant information and data as possible.

    • You want to provide them with the list of packages that you currently have loaded, the version of R, your OS and package versions. Use the appropriate function(s) to obtain this information.
    • You want to also provide a small data frame that reproduces the error (if working with a large data frame, you'll need to subset it down to something small). For this exercise use the data frame df, and save it as an RData object called df.RData.
    • What code should the people looking at your help request use to read in df.RData?

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_3_exercise/D3.5e_tidyverse/","title":"Tidyverse data wrangling","text":"

Approximate time: 75 minutes

"},{"location":"day_3_exercise/D3.5e_tidyverse/#learning-objectives","title":"Learning Objectives","text":"
  • Perform basic data wrangling with functions in the Tidyverse package.
"},{"location":"day_3_exercise/D3.5e_tidyverse/#data-wrangling-with-tidyverse","title":"Data Wrangling with Tidyverse","text":"

The Tidyverse suite of integrated packages is designed to work together to make common data science operations more user friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as specific functions for data wrangling with the 'dplyr' package and data visualization with the 'ggplot2' package.

"},{"location":"day_3_exercise/D3.5e_tidyverse/#tidyverse-basics","title":"Tidyverse basics","text":"

The Tidyverse suite of packages introduces users to a set of data structures, functions, and operators that make working with data more intuitive, but it is slightly different from the way we do things in base R. Two important new concepts we will focus on are pipes and tibbles.

Before we get started with pipes or tibbles, let's load the library:

library(tidyverse)\n
"},{"location":"day_3_exercise/D3.5e_tidyverse/#pipes","title":"Pipes","text":"

Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.

To make R code more human readable, the Tidyverse tools use the pipe, %>%, which was acquired from the magrittr package and is now part of the dplyr package that is installed automatically with Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.

Note

The RStudio keyboard shortcut for the pipe is Shift + Command + M (Mac) or Shift + Ctrl + M (Windows/Linux)

An example of using the pipe to run multiple commands:

## A single command\nsqrt(83)\n\n## Base R method of running more than one command\nround(sqrt(83), digits = 2)\n\n## Running more than one command with piping\nsqrt(83) %>% round(digits = 2)\n

The pipe represents a much easier way of writing and deciphering R code, so we will be taking advantage of it, when possible, as we work through the remainder of this lesson.

Exercise

  1. Create a vector of random numbers using the code below:

    random_numbers <- c(81, 90, 65, 43, 71, 29)\n
  2. Use the pipe (%>%) to perform two steps in a single line:

    • Take the mean of random_numbers using the mean() function.
    • Round the output to three digits using the round() function.
"},{"location":"day_3_exercise/D3.5e_tidyverse/#tibbles","title":"Tibbles","text":"

A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have numbers/symbols for column names, which is not normally allowed in base R.

Important: the tidyverse is very opinionated about row names. These packages insist that all columns of a data frame be treated equally, and that the special designation of a column as rownames should be deprecated. The tibble package provides simple utility functions to handle row names: rownames_to_column() and column_to_rownames().

Tibbles can be created directly using the tibble() function, or data frames can be converted into tibbles using as_tibble(name_of_df).

Note

The function as_tibble() will ignore row names, so if a column representing the row names is needed, then the function rownames_to_column(name_of_df) should be run prior to turning the data.frame into a tibble. Also, as_tibble() will not coerce character vectors to factors by default.
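As a quick sketch (using a made-up data frame, df_example, rather than one of our workshop objects), a tibble can be created directly, or an existing data frame can be converted while keeping its row names:

# Create a tibble directly\ntbl_direct <- tibble(id = 1:3, value = c(0.5, 1.2, 3.8))\n\n# Convert a data frame with row names into a tibble,\n# moving the row names into a regular column first\ndf_example <- data.frame(value = c(0.5, 1.2, 3.8), row.names = c(\"a\", \"b\", \"c\"))\ntbl_converted <- df_example %>%\n  rownames_to_column(\"id\") %>%\n  as_tibble()\n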

"},{"location":"day_3_exercise/D3.5e_tidyverse/#experimental-data","title":"Experimental data","text":"

We're going to explore the Tidyverse suite of tools to wrangle our data to prepare it for visualization. You should have downloaded the file called gprofiler_results_Mov10oe.tsv into your R project's data folder earlier.

Note

If you do not have the gprofiler_results_Mov10oe.tsv file in your data folder, you can right click and download it into the data folder using this link.

The dataset:

  • Represents the functional analysis results, including the biological processes, functions, pathways, or conditions that are over-represented in a given list of genes.
  • Our gene list was generated by differential gene expression analysis and the genes represent differences between control mice and mice over-expressing a gene involved in RNA splicing.

The functional analysis that we will focus on involves gene ontology (GO) terms, which:

  • describe the roles of genes and gene products
  • are organized into three controlled vocabularies/ontologies (domains):
    • biological processes (BP)
    • cellular components (CC)
    • molecular functions (MF)

"},{"location":"day_3_exercise/D3.5e_tidyverse/#analysis-goal-and-workflow","title":"Analysis goal and workflow","text":"

Goal: Visually compare the most significant biological processes (BP) based on the number of associated differentially expressed genes (gene ratios) and significance values by creating the following plot:

To prepare our data for plotting, we are going to use the Tidyverse suite of tools to work through the following steps:

  1. Read in the functional analysis results
  2. Extract only the GO biological processes (BP) of interest
  3. Select only the columns needed for visualization
  4. Order by significance (p-adjusted values)
  5. Rename columns to be more intuitive
  6. Create additional metrics for plotting (e.g. gene ratios)
  7. Plot results
"},{"location":"day_3_exercise/D3.5e_tidyverse/#tidyverse-tools","title":"Tidyverse tools","text":"

While all of the tools in the Tidyverse suite deserve to be explored in more depth, we are going to focus on the tools for reading (readr), wrangling (dplyr), and plotting (ggplot2).

"},{"location":"day_3_exercise/D3.5e_tidyverse/#1-read-in-the-functional-analysis-results","title":"1. Read in the functional analysis results","text":"

While the base R packages have perfectly fine methods for reading in data, the readr and readxl Tidyverse packages offer additional methods for reading in data. Let's read in our tab-delimited functional analysis results using read_delim():

# Read in the functional analysis results\nfunctional_GO_results <- read_delim(file = \"data/gprofiler_results_Mov10oe.tsv\", delim = \"\\t\" )\n\n# Take a look at the results\nfunctional_GO_results\n
Click here to see how to do this in base R

Read in the functional analysis results

functional_GO_results <- read.delim(file = \"data/gprofiler_results_Mov10oe.tsv\", sep = \"\\t\" )\n
Take a look at the results
functional_GO_results\n

Notice that the results were automatically read in as a tibble and the output gives the number of rows, columns and the data type for each of the columns.

Note

A large number of tidyverse functions will work with both tibbles and dataframes, and the data structure of the output will match that of the input. However, some functions will return a tibble (without row names), whether a tibble or a dataframe is provided.
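A small illustration (using the built-in mtcars data frame, just for demonstration):

class(dplyr::filter(mtcars, mpg > 30))   # input is a data.frame, so the output is a data.frame\nclass(as_tibble(mtcars))                 # as_tibble() always returns a tibble\n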

"},{"location":"day_3_exercise/D3.5e_tidyverse/#2-extract-only-the-go-biological-processes-bp-of-interest","title":"2. Extract only the GO biological processes (BP) of interest","text":"

Now that we have our data, we will need to wrangle it into a format ready for plotting. For all of our data wrangling steps we will be using tools from the dplyr package, which is a Swiss Army knife for wrangling data frames.

To extract the biological processes of interest, we only want those rows where the domain is equal to BP, which we can do using the filter() function.

To filter rows of a data frame/tibble based on values in different columns, we give a logical expression as input to the filter() function to return those rows for which the expression is TRUE.

Now let's return only those rows that have a domain of BP:

# Return only GO biological processes\nbp_oe <- functional_GO_results %>%\n  filter(domain == \"BP\")\n\nView(bp_oe)\n
Click here to see how to do this in base R

Return only GO biological processes

idx <- functional_GO_results$domain == \"BP\"\nbp_oe <- functional_GO_results[idx,]\n\nView(bp_oe)\n

Now we have returned only those rows with a domain of BP. How have the dimensions of our results changed?
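One quick way to check (a small sketch) is to compare the dimensions before and after filtering:

dim(functional_GO_results)   # rows and columns before filtering\ndim(bp_oe)                   # fewer rows, same number of columns\n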

Exercise

We would like to perform an additional round of filtering to only keep the most specific GO terms.

  1. For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4.
  2. Save the output to overwrite our bp_oe variable.
"},{"location":"day_3_exercise/D3.5e_tidyverse/#3-select-only-the-columns-needed-for-visualization","title":"3. Select only the columns needed for visualization","text":"

For visualization purposes, we are only interested in the columns related to the GO terms, the significance of the terms, and information about the number of genes associated with the terms.

To extract columns from a data frame/tibble we can use the select() function. In contrast to base R, we do not need to put the column names in quotes for selection.

# Selecting columns to keep\nbp_oe <- bp_oe %>%\n  select(term.id, term.name, p.value, query.size, term.size, overlap.size, intersection)\n\nView(bp_oe)\n
Click here to see how to do this in base R

Selecting columns to keep

bp_oe <- bp_oe[, c(\"term.id\", \"term.name\", \"p.value\", \"query.size\", \"term.size\", \"overlap.size\", \"intersection\")]\n\nView(bp_oe)\n

The select() function also allows for negative selection, so we could alternatively have removed columns with negative selection. Note that for this to work, we need to put the column names inside the combine (c()) function, preceded by a -.

DO NOT RUN

# DO NOT RUN\n# Selecting columns to remove\nbp_oe <- bp_oe %>%\n    select(-c(query.number, significant, recall, precision, subgraph.number, relative.depth, domain))\n
Click here to see how to do this in base R

DO NOT RUN

#Selecting columns to remove\nidx <- !(colnames(bp_oe) %in% c(\"query.number\", \"significant\", \"recall\", \"precision\", \"subgraph.number\", \"relative.depth\", \"domain\"))\nbp_oe <- bp_oe[, idx]\n

"},{"location":"day_3_exercise/D3.5e_tidyverse/#4-order-go-processes-by-significance-adjusted-p-values","title":"4. Order GO processes by significance (adjusted p-values)","text":"

Now that we have only the rows and columns of interest, let's arrange these by significance, which is denoted by the adjusted p-value.

Let's sort the rows by adjusted p-value with the arrange() function.

# Order by adjusted p-value ascending\nbp_oe <- bp_oe %>%\n  arrange(p.value)\n
Click here to see how to do this in base R

Order by adjusted p-value ascending

idx <- order(bp_oe$p.value)\nbp_oe <- bp_oe[idx,]\n

Note

If you wanted to arrange in descending order, then you could have run the following instead:

DO NOT RUN

# DO NOT RUN\n# Order by adjusted p-value descending\nbp_oe <- bp_oe %>%\n  arrange(desc(p.value))\n
Click here to see how to do this in base R

DO NOT RUN

# Do not run\n# Order by adjusted p-value descending\nidx <- order(bp_oe$p.value, decreasing = TRUE)\nbp_oe <- bp_oe[idx,]\n

Note

Ordering variables in ggplot2 is a bit different. This post introduces a few ways of ordering variables in a plot.
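As a minimal sketch (with hypothetical data, since we have not built our plot yet), ggplot2 follows the order of the factor levels, so a common approach is to set the levels explicitly before plotting:

# Hypothetical example: order the bars by increasing value\nplot_df <- data.frame(term = c(\"termA\", \"termB\", \"termC\"),\n                      ratio = c(0.10, 0.35, 0.22))\n\n# Set the factor levels to the desired plotting order\nplot_df$term <- factor(plot_df$term, levels = plot_df$term[order(plot_df$ratio)])\n\nggplot(plot_df, aes(x = term, y = ratio)) +\n  geom_col()\n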

"},{"location":"day_3_exercise/D3.5e_tidyverse/#5-rename-columns-to-be-more-intuitive","title":"5. Rename columns to be more intuitive","text":"

While not necessary for our visualization, giving columns more intuitive names can help with our understanding of the data. We can rename columns using the rename() function, with the syntax new_name = old_name.

Let's rename the term.id and term.name columns.

# Provide better names for columns\nbp_oe <- bp_oe %>% \n  dplyr::rename(GO_id = term.id, \n                GO_term = term.name)\n
Click here to see how to do this in base R
# Provide better names for columns\ncolnames(bp_oe)[colnames(bp_oe) == \"term.id\"] <- \"GO_id\"\ncolnames(bp_oe)[colnames(bp_oe) == \"term.name\"] <- \"GO_term\"\n

Note

In the case of two packages with identical function names, you can use :: with the package name before and the function name after (e.g. stats::filter()) to ensure that the correct function is implemented. The :: notation can also be used to call a function from a library without loading it first.

In the example above, we wanted to use the rename() function specifically from the dplyr package, and not any of the other packages (or base R) which may have the rename() function.
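For example (a small illustration, assuming both packages are installed), the two filter() functions can be called explicitly with the :: notation:

stats::filter(presidents, rep(1/3, 3))   # moving-average filter from the stats package\ndplyr::filter(mtcars, mpg > 30)          # row filtering from the dplyr package\n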

Exercise

Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.

"},{"location":"day_3_exercise/D3.5e_tidyverse/#6-create-additional-metrics-for-plotting-eg-gene-ratios","title":"6. Create additional metrics for plotting (e.g. gene ratios)","text":"

Finally, before we plot our data, we need to create a couple of additional metrics. The mutate() function enables you to create a new column from an existing column.

Let's generate gene ratios to reflect the number of DE genes associated with each GO process relative to the total number of DE genes.

# Create gene ratio column based on other columns in dataset\nbp_oe <- bp_oe %>%\n  mutate(gene_ratio = overlap.size / query.size)\n
Click here to see how to do this in base R
# Create gene ratio column based on other columns in dataset\nbp_oe <- cbind(bp_oe, gene_ratio = bp_oe$overlap.size / bp_oe$query.size)\n

Exercise

Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size).

Our final data for plotting should look like the table below:

"},{"location":"day_3_exercise/D3.5e_tidyverse/#next-steps","title":"Next steps","text":"

Now that we have our results ready for plotting, we can use the ggplot2 package to plot our results. If you are interested, you can follow this lesson and dive into how to use ggplot2 to create the plots with this dataset.

"},{"location":"day_3_exercise/D3.5e_tidyverse/#additional-resources","title":"Additional resources","text":"
  • R for Data Science
  • teach the tidyverse
  • tidy style guide

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4/D4.1_in_class_exercises/","title":"Day 4 Activities","text":"
  1. Change the animals data frame to a tibble called animals_tb. Save the row names to a column called animal_names before turning it into a tibble.

  2. Use ggplot2 to plot the animal names (x-axis) versus the speed of the animal (y-axis) in animals_tb using a scatterplot. Customize the plot to display as shown below.

  3. We decide that our plot would look better with the animal names ordered from slowest to fastest. Using the animals_tb tibble, reorder the animals on the x-axis to start with the slowest animal on the left-hand side of the plot to the fastest animal on the right-hand side of the plot by completing the following steps:

    a. Use the arrange() function to order the rows by speed from slowest to fastest. Then use the pull() function to extract the animal_names column as a vector of character values. Save the new variable as names_ordered_by_speed.

    b. Turn the animal_names column of animals_tb into a factor and specify the levels as names_ordered_by_speed from slowest to fastest (output in part a). Note: this step is crucial, because ggplot2 uses factor levels to determine plotting order, rather than the order of rows in the data frame.

    c. Re-plot the scatterplot with the animal names in order from slowest to fastest.

    Note

    If you are interested in exploring other ways to reorder a variable in ggplot2, refer to this post.

  4. Save the plot as a PDF called animals_by_speed_scatterplot.pdf to the results folder.

  5. Use the functions from the dplyr package to perform the following tasks:

    a. Extract the rows of the animals_tb tibble with a color of gray or tan, order the rows from slowest to fastest speed, and save to a variable called animals_gray_tan.

    b. Save animals_gray_tan as a comma-separated value file called animals_tb_ordered.csv in the results folder.

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/","title":"Introduction to R practice","text":""},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/#creating-vectorsfactors-and-dataframes","title":"Creating vectors/factors and dataframes","text":"
  1. We are performing RNA-Seq on cancer samples being treated with three different types of treatment (A, B, and P). We have 12 samples total, with 4 replicates per treatment. Write the R code you would use to construct your metadata table as described below.

    • Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process to go faster try exploring the rep() function).
    • Put them together into a dataframe called meta.
    • Use the rownames() function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process to go faster try exploring the paste() function).

    Your finished metadata table should have information for the variables sex, stage, treatment, and myc levels:

              sex  stage  treatment   myc
    sample1    M     I        A      2343
    sample2    F    II        A       457
    sample3    M    II        A      4593
    sample4    F     I        A      9035
    sample5    M    II        B      3450
    sample6    F    II        B      3524
    sample7    M     I        B       958
    sample8    F    II        B      1053
    sample9    M    II        P      8674
    sample10   F     I        P      3424
    sample11   M    II        P       463
    sample12   F    II        P      5105
"},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/#subsetting-vectorsfactors-and-dataframes","title":"Subsetting vectors/factors and dataframes","text":"
  1. Using the meta data frame from question #1, write out the R code you would use to perform the following operations (questions DO NOT build upon each other):

    • return only the treatment and sex columns using []:
    • return the treatment values for samples 5, 7, 9, and 10 using []:
    • use filter() to return all data for those samples receiving treatment P:
    • use filter()/select() to return only the stage and treatment columns for those samples with myc > 5000:
    • remove the treatment column from the dataset using []:
    • remove samples 7, 8 and 9 from the dataset using []:
    • keep only samples 1-6 using []:
    • add a column called pre_treatment to the beginning of the dataframe with the values T, F, F, F, T, T, F, T, F, F, T, T (Hint: use cbind()):
    • change the names of the columns to: \"A\", \"B\", \"C\", \"D\":
"},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/#extracting-components-from-lists","title":"Extracting components from lists","text":"
  1. Create a new list, list_hw, with three components: the glengths vector, the dataframe df, and the number value. Use this list to answer the questions below. list_hw has the following structure (NOTE: the components of this list are not currently named):
    [[1]]\n[1]   4.6  3000.0 50000.0 \n\n[[2]]\n     species  glengths \n1    ecoli    4.6\n2    human    3000.0\n3    corn     50000.0\n\n[[3]]\n[1] 8\n
    Write out the R code you would use to perform the following operations (questions DO NOT build upon each other):
    • return the second component of the list:
    • return 50000.0 from the first component of the list:
    • return the value human from the second component:
    • give the components of the list the following names: \"genome_lengths\", \"genomes\", \"record\":
"},{"location":"day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/#creating-figures-with-ggplot2","title":"Creating figures with ggplot2","text":"
  1. Create the same plot as above with ggplot2, using the provided metadata and counts datasets. The metadata table describes an experiment that you have set up for RNA-seq analysis, while the associated count matrix gives the normalized counts for each sample for every gene. Download the count matrix and metadata using the links provided.

    Follow the instructions below to build your plot. Write the code you used and provide the final image.

    • Read in the metadata file using: meta <- read.delim(\"Mov10_full_meta.txt\", sep=\"\\t\", row.names=1)

    • Read in the count matrix file using: data <- read.delim(\"normalized_counts.txt\", sep=\"\\t\", row.names=1)

    • Create a vector called expression that contains the normalized count values from the row in data (the normalized counts matrix) that corresponds to the MOV10 gene.

    • Check the class of this expression vector. Then, convert it to a numeric vector using as.numeric(expression)

    • Bind that vector to your metadata data frame (meta) and call the new data frame df.

    • Create a ggplot by constructing the plot line by line:

      • Initialize a ggplot with your df as input.

      • Add the geom_jitter() geometric object with the required aesthetics, which are x and y.

      • Color the points based on sampletype

      • Add the theme_bw() layer

      • Add the title \"Expression of MOV10\" to the plot

      • Change the x-axis label to be blank

      • Change the y-axis label to \"Normalized counts\"

      • Using theme() change the following properties of the plot:

        • Remove the legend (Hint: use ?theme help and scroll down to legend.position)

        • Change the plot title size to 1.5x the default and center align

        • Change the axis title to 1.5x the default size

        • Change the size of the axis text only on the y-axis to 1.25x the default size

        • Rotate the x-axis text to 45 degrees using axis.text.x=element_text(angle=45, hjust=1)

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/","title":"Day1 Homework Answer Key","text":""},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/#day-1-homework-exercises","title":"Day 1 Homework Exercises","text":""},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/#r-syntax-and-data-structures","title":"R syntax and data structures","text":"
# 1. Try changing the value of the variable `x` to 5. What happens to `number`?\n\nx <- 5\n\n# 2. Now try changing the value of variable `y` to contain the value 10. What do you need to do, to update the variable `number`?\n\ny <- 10\n\nnumber <- x + y\n\n#3. Try to create a vector of numeric and character values by combining the two vectors that we just created (`glengths` and `species`). Assign this combined vector to a new variable called `combined`. \n\n## Hint: you will need to use the combine `c()` function to do this. Print the `combined` vector in the console, what looks different compared to the original vectors?\n\ncombined <- c(glengths, species)\n\n#4. Let's say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.\n\n## a. Create a vector named `samplegroup` with nine elements: 3 control (\"CTL\") values, 3 knock-out (\"KO\") values, and 3 over-expressing (\"OE\") values.\n\nsamplegroup <- c(\"CTL\", \"CTL\", \"CTL\", \"KO\", \"KO\", \"KO\", \"OE\", \"OE\", \"OE\")\n\n## b. Turn `samplegroup` into a factor data structure.\n\nsamplegroup <- factor(samplegroup)\n\n# 5. Create a data frame called `favorite_books` with the following vectors as columns:\n\ntitles <- c(\"Catch-22\", \"Pride and Prejudice\", \"Nineteen Eighty Four\")\npages <- c(453, 432, 328)\nfavorite_books <- data.frame(titles, pages)\n\n# 6. Create a list called `list2` containing `species`, `glengths`, and `number`.\nlist2 <- list(species, glengths, number)\n
"},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/#functions-and-arguments","title":"Functions and arguments","text":"
# 1. Let's use base R function to calculate **mean** value of the `glengths` vector. You might need to search online to find what function can perform this task.\nmean(glengths)\n\n# 2. Create a new vector `test <- c(1, NA, 2, 3, NA, 4)`. Use the same base R function from exercise 1 (with addition of proper argument), and calculate mean value of the `test` vector. The output should be `2.5`.\n#   *NOTE:* In R, missing values are represented by the symbol `NA` (not available). It\u2019s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. There are ways to ignore `NA` during statistical calculations, or to remove `NA` from the vector. More information related to missing data can be found at this link -> https://www.statmethods.net/input/missingdata.html.\ntest <- c(1, NA, 2, 3, NA, 4)\nmean(test, na.rm=TRUE)\n\n# 3. Another commonly used base function is `sort()`. Use this function to sort the `glengths` vector in **descending** order.\nsort(glengths, decreasing = TRUE)\n\n# 4. Write a function called `multiply_it`, which takes two inputs: a numeric value `x`, and a numeric value `y`. The function will return the product of these two numeric values, which is `x * y`. For example, `multiply_it(x=4, y=6)` will return output `24`.\nmultiply_it <- function(x,y) {\n  product <- x * y\n  return(product)\n}\n
"},{"location":"day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/#reading-in-and-inspecting-data","title":"Reading in and inspecting data","text":"
# 1. Download this tab-delimited .txt file and save it in your project\u2019s data folder.\n#       i. Read it in to R using read.table() and store it as the variable proj_summary, keeping in mind that: \n#               a. all the columns have column names \n#               b. you want the first column to be used as rownames (hint: look up the row.names = argument)\n#       ii. Display the contents of proj_summary in your console\nproj_summary <- read.table(file = \"data/project-summary.txt\", header = TRUE, row.names = 1)\n\n# 2. Use the class() function on glengths and metadata, how does the output differ between the two?\nclass(glengths)\nclass(metadata)\n\n# 3. Use the summary() function on the proj_summary dataframe\n#       i. What is the median rRNA_rate?\n#       ii. How many samples got the \u201clow\u201d level of treatment\nsummary(proj_summary)\n\n# 4. How long is the samplegroup factor?\nlength(samplegroup)\n\n# 5. What are the dimensions of the proj_summary dataframe?\ndim(proj_summary)\n\n# 6. When you use the rownames() function on metadata, what is the data structure of the output?\nstr(rownames(metadata))\n\n# 7. How many elements in (how long is) the output of colnames(proj_summary)? Don\u2019t count, but use another function to determine this.\nlength(colnames(proj_summary))\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/","title":"Day2 Homework Answer Key","text":""},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#day-2-homework-exercises","title":"Day 2 Homework Exercises","text":""},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#data-wrangling","title":"Data wrangling","text":"
# 1. Extract only those elements in `samplegroup` that are not KO (*nesting the logical operation is optional*).\nidx <- samplegroup != \"KO\"\nsamplegroup[idx]\n\n# 2. Use the `samplegroup` factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.\nfactor(samplegroup, levels = c(\"KO\", \"CTL\", \"OE\"))\n\n### Packages and Libraries\n\n# 1. Install the tidyverse package (it is actually a suite of packages). NOTE: This suite of packages is only available on CRAN.\ninstall.packages(\"tidyverse\")\n\n# 2. Load the tidyverse library. Do you see anything unusual when it loads?\nlibrary(tidyverse)\n# Some functions from dplyr (part of the tidyverse package) mask the same functions from the stats package. But that is fine! If you need to use the filter() function from stats, you can type 'stats::filter()'\n\n# 3. Run sessionInfo().\nsessionInfo()\n
"},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#data-wrangling-data-frames-matrices-and-lists","title":"Data wrangling: data frames, matrices, and lists","text":"
# 1. Return the genotype and replicate column values for Sample2 and Sample8.\nmetadata[c(\"sample2\", \"sample8\"), c(\"genotype\", \"replicate\")] # or\nmetadata[c(2,8), c(1,3)]\n\n# 2. Return the fourth and ninth values of the replicate column.\nmetadata$replicate[c(4,9)] # or\nmetadata[c(4, 9), \"replicate\"]\n\n# 3. Extract the replicate column as a data frame.\nmetadata[, \"replicate\", drop = FALSE]\n\n# 4. Subset the metadata dataframe to return only the rows of data with a genotype of KO.\nidx <- which(metadata$genotype==\"KO\")\nmetadata[idx, ]\n\n# 5. Create a list named random with the following components: metadata, age, list1, samplegroup, and number.\nrandom <- list(metadata, age, list1, samplegroup, number)\n\n# 6. Extract the samplegroup component.\nrandom[[4]]\n\n# 7. Set names for the random list you created in the last exercise.\nnames(random) <- c(\"metadata\", \"age\", \"list1\", \"samplegroup\", \"number\")\n\n# 8. Extract the age component using the $ notation\nrandom$age\n
"},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#the-in-operator","title":"The %in% operator","text":"
# 1. Using the A and B vectors created above, evaluate each element in B to see if there is a match in A\nB %in% A\n\n# 2. Subset the B vector to only return those values that are also in A.\nB[B %in% A]\n\n# 3. We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the %in% operator from the rpkm_data data frame, instead of scrolling through rpkm_data and finding them manually.\n\n#       i. First, let\u2019s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:\n\n        important_genes <- c(\"ENSMUSG00000083700\", \"ENSMUSG00000080990\", \"ENSMUSG00000065619\", \"ENSMUSG00000047945\", \"ENSMUSG00000081010\", \"ENSMUSG00000030970\")\n\n#       ii. Use the %in% operator to determine if all of these genes are present in the row names of the rpkm_data data frame.\nimportant_genes %in% rownames(rpkm_data)\n\n#       iii. Extract the rows from rpkm_data that correspond to these 6 genes using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.\nidx <- rownames(rpkm_data) %in% important_genes\nans <- rpkm_data[idx, ]\nidx2 <- which(rownames(rpkm_data) %in% important_genes)\nans2 <- rpkm_data[idx2, ]\n\n#       iv. Bonus question: Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.\nans3 <- rpkm_data[important_genes, ]\n
"},{"location":"day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/#reordering-and-matching","title":"Reordering and matching","text":"
# 1. Now that we know how to reorder using indices, let\u2019s try to use it to reorder the contents of one vector to match the contents of another. Let\u2019s create the vectors first and second as detailed below:\nfirst <- c(\"A\",\"B\",\"C\",\"D\",\"E\")\nsecond <- c(\"B\",\"D\",\"E\",\"A\",\"C\")  # same letters but different order\n\n#        How would you reorder the second vector to match first?\nsecond[c(4, 1, 5, 2, 3)]\n\n# 2. After talking with your collaborator, it becomes clear that sample2 and sample9 were actually from a different mouse background than the other samples and should not be part of our analysis. Create a new variable called subset_rpkm that has these columns removed from the rpkm_ordered data frame.\nsubset_rpkm <- rpkm_ordered[ , c(1,3:8,10:12)]  #or\nsubset_rpkm <- rpkm_ordered[ , -c(2,9)]\n\n# 3. Use the match() function to subset the metadata data frame so that the row names of the metadata data frame match the column names of the subset_rpkm data frame.  \nidx <- match(colnames(subset_rpkm), rownames(metadata))\nmetadata[idx, ]\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/","title":"Day3 Homework Answer Key","text":""},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#ggplot2-exercise","title":"ggplot2 exercise","text":""},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#creating-a-boxplot","title":"Creating a boxplot","text":"
#1. boxplot\nggplot(new_metadata) +\n  geom_boxplot(aes(x = genotype, y = samplemeans, fill = celltype)) +\n  ggtitle(\"Genotype differences in average gene expression\") +\n  xlab(\"Genotype\") +\n  ylab(\"Mean expression\") +\n  theme_bw() +\n  theme(axis.title = element_text(size = rel(1.25))) +\n  theme(plot.title=element_text(hjust = 0.5, size = rel(1.5)))\n\n#2. Changing the order of genotype\nnew_metadata$genotype <- factor(new_metadata$genotype, levels = c(\"Wt\", \"KO\"))\n\n#3. Changing default colors\n\n#Add a new layer scale_color_manual(values=c(\"purple\",\"orange\")).\n#Do you observe a change?\n    ## No\n\n#Replace scale_color_manual(values=c(\"purple\",\"orange\")) with scale_fill_manual(values=c(\"purple\",\"orange\")).\n#Do you observe a change?\n    ## Yes\n\n#In the scatterplot we drew in class, add a new layer scale_color_manual(values=c(\"purple\",\"orange\")), do you observe a difference?\n    ## Yes\n\n#What do you think is the difference between scale_color_manual() and scale_fill_manual()?\n    ## At first glance, it appears that scale_color_manual() works with the scatter plot and scale_fill_manual() works with the box plot\n    ## \n    ## Actually, scale_color_manual() works if the \"color\" argument is used, whereas scale_fill_manual() works if the \"fill\" argument is used\n\n\n## Boxplot using \"color\" instead of \"fill\"\nggplot(new_metadata) +\n  geom_boxplot(aes(x = genotype, y = samplemeans, color = celltype)) +\n  ggtitle(\"Genotype differences in average gene expression\") +\n  xlab(\"Genotype\") +\n  ylab(\"Mean expression\") +\n  theme_bw() +\n  theme(axis.title = element_text(size = rel(1.25))) +\n  theme(plot.title=element_text(hjust = 0.5, size = rel(1.5))) +\n  scale_color_manual(values=c(\"purple\",\"orange\"))\n\n\n#Back in your boxplot code, change the colors in the scale_fill_manual() layer to be your 2 favorite colors.\n#Are there any colors that you tried that did not work?\n\nggplot(new_metadata) +\n  geom_boxplot(aes(x = genotype, y = samplemeans, fill = celltype)) +\n  ggtitle(\"Genotype differences in average gene expression\") +\n  xlab(\"Genotype\") +\n  ylab(\"Mean expression\") +\n  theme_bw() +\n  theme(axis.title = element_text(size = rel(1.25))) +\n  theme(plot.title=element_text(hjust = 0.5, size = rel(1.5))) +\n  scale_fill_manual(values=c(\"red\", \"blue\"))\n\n#OPTIONAL Exercise:\n#Find the hexadecimal code for your 2 favorite colors (from exercise 3 above) and replace the color names with the hexadecimal codes within the ggplot2 code chunk.\nscale_fill_manual(values=c(\"#FF3333\", \"#3333FF\"))\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#finding-help","title":"Finding help","text":"

Exercises: Run the following code chunks and fix all of the errors. (Note: The code chunks are independent from one another.)

"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#create-vector-of-work-days","title":"Create vector of work days","text":"
#work_days <- c(Monday, Tuesday, Wednesday, Thursday, Friday)\nwork_days <- c(\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\")\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#create-a-function-to-round-the-output-of-the-sum-function","title":"Create a function to round the output of the sum function","text":"
#round_the_sum <- function(x){\n#  return(round(sum(x))\n#}\nround_the_sum <- function(x){\n  return(round(sum(x)))\n}\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#create-a-function-to-add-together-three-numbers","title":"Create a function to add together three numbers","text":"
#add_numbers <- function(x,y,z){\n#  sum(x,y,z)\n#}\n#add_numbers(5,9)\nadd_numbers <- function(x,y,z){\n  sum(x,y,z)\n}\nadd_numbers(5,9,6)\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#you-try-to-install-a-package-and-you-get-the-following-error-message","title":"You try to install a package and you get the following error message:","text":"

Error

Error: package or namespace load failed for 'Seurat' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]): there is no package called 'multtest'\n

What would you do to remedy the error?

"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#install-multtest-first-and-then-install-seurat-package","title":"Install multtest first, and then install seurat package:","text":"

BiocManager::install('multtest')\ninstall.packages('Seurat')\n
You would like to ask for help on an online forum. To do this you want the users of the forum to reproduce your problem, so you want to provide them with as much relevant information and data as possible.

You want to provide them with the list of packages that you currently have loaded, the version of R, your OS and package versions. Use the appropriate function(s) to obtain this information.

sessionInfo()\n

You want to also provide a small data frame that reproduces the error (if working with a large data frame, you\u2019ll need to subset it down to something small). For this exercise use the data frame df, and save it as an RData object called df.RData.

save(df, file = \"data/df.RData\")\n# What code should the people looking at your help request use to read in df.RData?\nload(file=\"data/df.RData\")\n
"},{"location":"day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/#tidyverse","title":"Tidyverse","text":"

Create a vector of random numbers using the code below:

random_numbers <- c(81, 90, 65, 43, 71, 29)\n

Use the pipe (%>%) to perform two steps in a single line. Take the mean of random_numbers using the mean() function.

random_numbers %>% mean()\n
Round the output to three digits using the round() function.
random_numbers %>% \n  mean() %>% \n  round(digits = 3)\n
We would like to perform an additional round of filtering to only keep the most specific GO terms. For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4. Save the output to overwrite our bp_oe variable.
bp_oe <- bp_oe %>% \n  filter(relative.depth > 4)\n

Using Base R

# bp_oe <- subset(bp_oe, relative.depth > 4)\n

Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.

bp_oe <- bp_oe %>% \n  dplyr::rename(genes = intersection)\n
Using Base R
colnames(bp_oe)[colnames(bp_oe) == \"intersection\"] <- \"genes\"\n

Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size).

bp_oe <- bp_oe %>% \n  mutate(term_percent = overlap.size / term.size)\n
Using Base R

bp_oe <- cbind(bp_oe, term_percent = bp_oe$overlap.size / bp_oe$term.size)\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/","title":"Day4 Intro to R Answer Key","text":""},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#homework-answer-key-introduction-to-r-practice","title":"Homework answer key - Introduction to R practice","text":""},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#creating-vectorsfactors-and-dataframes","title":"Creating vectors/factors and dataframes","text":"
  1. We are performing RNA-Seq on cancer samples being treated with three different types of treatment (A, B, and P). We have 12 samples total, with 4 replicates per treatment. Write the R code you would use to construct your metadata table as described below.

    • Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process to go faster try exploring the rep() function).
    sex <- c(\"M\", \"F\",...) # saved vectors/factors as variables and used c() or rep() function to create\n
    • Put them together into a dataframe called meta.

    meta <- data.frame(sex, stage, treatment, myc) # used data.frame() to create the table\n
    • Use the rownames() function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process to go faster try exploring the paste() function).

    rownames(meta) <- c(\"sample1\", \"sample2\",... , \"sample12\") # or use:\n\nrownames(meta) <- paste(\"sample\", 1:12, sep=\"\")\n

    Your finished metadata table should have information for the variables sex, stage, treatment, and myc levels:

              sex  stage  treatment   myc
    sample1    M     I        A      2343
    sample2    F    II        A       457
    sample3    M    II        A      4593
    sample4    F     I        A      9035
    sample5    M    II        B      3450
    sample6    F    II        B      3524
    sample7    M     I        B       958
    sample8    F    II        B      1053
    sample9    M    II        P      8674
    sample10   F     I        P      3424
    sample11   M    II        P       463
    sample12   F    II        P      5105
"},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#subsetting-vectorsfactors-and-dataframes","title":"Subsetting vectors/factors and dataframes","text":"
  1. Using the meta data frame from question #1, write out the R code you would use to perform the following operations (questions DO NOT build upon each other):

    • return only the treatment and sex columns using []:
    meta[ , c(1,3)]\n
    • return the treatment values for samples 5, 7, 9, and 10 using []:
    meta[c(5,7,9,10), 3]\n
    • use filter() to return all data for those samples receiving treatment P:
    filter(meta, treatment == \"P\")\n
    • use filter()/select() to return only the stage and treatment data for those samples with myc > 5000:
    filter(meta, myc > 5000) %>% select(stage, treatment)\n
    • remove the treatment column from the dataset using []:
    meta[, -3]\n
    • remove samples 7, 8 and 9 from the dataset using []:
    meta[-7:-9, ]\n
    • keep only samples 1-6 using []:
    meta[1:6, ]\n
    • add a column called pre_treatment to the beginning of the dataframe with the values T, F, F, F, T, T, F, T, F, F, T, T (Hint: use cbind()):
    pre_treatment <- c(T, F, F, F, T, T, F, T, F, F, T, T)\n\ncbind(pre_treatment, meta)\n
    • change the names of the columns to: \"A\", \"B\", \"C\", \"D\":
    colnames(meta) <- c(\"A\", \"B\", \"C\", \"D\")\n
"},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#extracting-components-from-lists","title":"Extracting components from lists","text":"
  1. Create a new list, list_hw, with three components: the glengths vector, the dataframe df, and the number value. Use this list to answer the questions below. list_hw has the following structure (NOTE: the components of this list are not currently named):

    [[1]]\n[1]   4.6  3000.0 50000.0 \n\n[[2]]\n          species  glengths \n     1    ecoli    4.6\n     2    human    3000.0\n     3    corn     50000.0\n\n[[3]]\n[1] 8\n
    Write out the R code you would use to perform the following operations (questions DO NOT build upon each other):

    • return the second component of the list:

    list_hw[[2]]\n
    • return 50000.0 from the first component of the list:
    list_hw[[1]][3]\n
    • return the value human from the second component:
    list_hw[[2]][2, 1]\n
    • give the components of the list the following names: \"genome_lengths\", \"genomes\", \"record\":
    names(list_hw) <- c(\"genome_lengths\",\"genomes\",\"record\")\n\nlist_hw$record\n
"},{"location":"day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/#creating-figures-with-ggplot2","title":"Creating figures with ggplot2","text":"
  1. Create the same plot as above using ggplot2 using the provided metadata and counts datasets. The metadata table describes an experiment that you have setup for RNA-seq analysis, while the associated count matrix gives the normalized counts for each sample for every gene. Download the count matrix and metadata using the links provided.

Follow the instructions below to build your plot. Write the code you used and provide the final image.

  • Read in the metadata file using: meta <- read.delim(\"Mov10_full_meta.txt\", sep=\"\\t\", row.names=1)

  • Read in the count matrix file using: data <- read.delim(\"normalized_counts.txt\", sep=\"\\t\", row.names=1)

  • Create a vector called expression that contains the normalized count values from the row in data that corresponds to the MOV10 gene.

expression <- data[\"MOV10\", ]\n
  • Check the class of this expression vector. It is a data.frame.

Then, you will need to convert it to a numeric vector using as.numeric(expression).

class(expression)\n\nexpression <- as.numeric(expression)\n\nclass(expression)\n
  • Bind that vector to your metadata data frame (meta) and call the new data frame df.
df <- cbind(meta, expression) #or\n\ndf <- data.frame(meta, expression)\n
  • Create a ggplot by constructing the plot line by line:

    • Initialize a ggplot with your df as input.

    • Add the geom_jitter() geometric object with the required aesthetics

    • Color the points based on sampletype

    • Add the theme_bw() layer

    • Add the title \"Expression of MOV10\" to the plot

    • Change the x-axis label to be blank

    • Change the y-axis label to \"Normalized counts\"

    • Using theme() change the following properties of the plot:

      • Remove the legend (Hint: use ?theme help and scroll down to legend.position)

      • Change the plot title size to 1.5x the default and center align

      • Change the axis title to 1.5x the default size

      • Change the size of the axis text only on the y-axis to 1.25x the default size

      • Rotate the x-axis text to 45 degrees using axis.text.x=element_text(angle=45, hjust=1)

    ggplot(df) +\n     geom_jitter(aes(x= sampletype, y= expression, color = sampletype)) +\n     theme_bw() +\n     ggtitle(\"Expression of MOV10\") +\n     xlab(NULL) +\n     ylab(\"Normalized counts\") +\n     theme(legend.position = \"none\",\n          plot.title=element_text(hjust=0.5, size=rel(1.5)),\n          axis.text=element_text(size=rel(1.25)),\n          axis.title=element_text(size=rel(1.5)),\n          axis.text.x=element_text(angle=45, hjust=1))\n

Attribution notice

  • This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • The materials used in this lesson are adapted from work that is Copyright \u00a9 Data Carpentry (http://datacarpentry.org/).

  • All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).

"}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index aa50bb4..1bf22ee 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,137 +2,137 @@ https://github.com/hbctraining/Intro-to-R-mkdocs/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/Workshop_Schedule/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_1/D1.2_introR-R-and-RStudio/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_1_exercise/D1.1e_r_syntax_and_data_structures/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_1_exercise/D1.2e_functions_and_arguments/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_1_exercise/D1.3e_reading_in_and_data_inspection/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_2/D2.1_in_class_exercises/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_2/D2.2_data_wrangling/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_2_exercise/D2.1e_packages_and_libraries/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_2_exercise/D2.2e_introR-data-wrangling/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_2_exercise/D2.3e_identifying-matching-elements/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_2_exercise/D2.4e_reordering-to-match-datasets/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_2_exercise/D2.5e_setting_up_to_plot/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_3/D3.1_in_class_exercises/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_3/D3.2_plotting_with_ggplot2/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_3/basic_plots_in_r/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_3_exercise/D3.1e_Custom_Functions_ggplot2/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_3_exercise/D3.2e_boxplot_exercise/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_3_exercise/D3.3e_exporting_data_and_plots/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_3_exercise/D3.4e_finding_help/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_3_exercise/D3.5e_tidyverse/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_4/D4.1_in_class_exercises/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_4_exercise_n_answer_keys/D4.1e_intro_to_R_hw/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_4_exercise_n_answer_keys/Day1_Homework_Answer-Key/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_4_exercise_n_answer_keys/Day2_Homework_Answer-Key/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_4_exercise_n_answer_keys/Day3_Homework_Answer-Key/ - 2024-08-09 + 2024-08-12 daily https://github.com/hbctraining/Intro-to-R-mkdocs/day_4_exercise_n_answer_keys/Day4_Intro_to_R_Answer-Key/ - 2024-08-09 + 2024-08-12 daily \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 
8ae2c7489218954059474707e3fce7d57ddba60b..550ce1c3ac8d04c44f242d02401a97f7f51888eb 100644 GIT binary patch delta 604 zcmV-i0;B!#1n&d~ABzYGz=*k#2Oxh|38SjfsH5&Ot-7=v`vk0jO>I-c+xOUsa@mh6 z88(1#oR8n<0_&Gk*Dn5ou~Mr|l4r|g0g7uORlP}m{d`TIlb7mgJ($*lTi{@2H%a4M z|Amlo95X8&bgZ|THZ`%l!5d31jbxd9PO7H`9)5xIRXEnI=9OQ?;xe^#&9HyZxb8@^ z=gu&x@ag1DIipjj(_Pvfh31y5iJx}A3bU@$qAHd}nJ%Bxyucxsw!2}xVXB6Mfj^aC zvYpoCuH|eVx!U(3mj2e}XdB(rcYGsaXdxoo1)FG2ws}@i={tAW#gs^Q?`{xM$3{Zo z1cpls9w`TEsGXE!CyGo|FvowWb!OmhFkqu*D}Wvb#T`1b4RFlVLkG&mjkQ9_fcbcY z+wmrvRJMlz8yQ_ewndhE7LGRAE31u>qd1E(8hl>s$416*x6=wA9D2r&tcJ7yv~q8l zk?|?>K!su962`q$={%&updbZ?HhGwEO4>2U8-D>U#2E1&;YuIzv=M(`JQ9u#ctJj_ z!Qy871QZ#ZkLE*9ou+-OU1UTnvR%y~f(MQwC=di0aWbuIe_uq0vS(KEz+$~JCN_XO zFIo9a^lh+Ccl7mI(M2lE0~h9^y*_!4TXW1jn1oIbTWNZl3o)3Bw6L1FCk0ABzYGK$NwS2Oxi`5=K>{QAgcnT6JkT_6b-4o7$#?x9_nN<+86T z88(1#oR8n<0_)dP*Dn5mu~Mr|l4r|g0g7uORlP}m{&-8Dlh^8LJ($*lTi{@2H%a4M z|Cx|+95X8&bgZ|THZ`%l!5d31jbxd^gTLJViDDKdaZGdB@9y(AiZmbnb2F%AJ z+>STVq_RB(*vRM#vMsXQvv9P@URiC79K~6T(ctr1KQ=OkyPa0};LtOEWHp@qrA;yUJ2v_=ur;UFANM?J?II&uk?m>@5j=1dL4hF1h?8ku`}-m~ls&VO2Nvsx}$H`iY`)N9=I?U?e)oX+?r$N!88P?UYp=Pp%U_SL)e5-b72NLNEK@C z2HHMq7R)$Kipl@th;2=?m1JAaA<>jq%c=i@vm+RKgvkcHOL^*RfgNqe*`Tp0r{8o3 rqc%tSCG@1}7v%=t?Ce)GKlNf=J}8BqNNzN)(=eii@#a-}Jb diff --git a/stylesheets/extra.css b/stylesheets/extra.css index 279a52f..e3b7b3d 100644 --- a/stylesheets/extra.css +++ b/stylesheets/extra.css @@ -25,7 +25,7 @@ html { } .md-typeset h1 { color: var(--md-default-fg-color); - font-size: 2rem; + font-size: 26px; font-family: "Degular", sans-serif; line-height: 1.5rem; font-weight: 600; @@ -34,7 +34,7 @@ html { margin-top: 0rem; } .md-typeset h2 { - font-size: 1rem; + font-size: 24px; font-family: "Degular", sans-serif; line-height: 1rem; font-weight: 500; @@ -43,7 +43,7 @@ html { margin-bottom: 0rem; } .md-typeset h3 { - font-size: 1.2rem; + font-size: 22px; font-family: "Degular", sans-serif; line-height: 1.5rem; font-weight: 400; @@ -52,18 +52,18 @@ html { @media (min-width: 768px) { .md-typeset h1 { font-family: "Degular", sans-serif; - font-size: 3.2rem; + font-size: 48px; line-height: 4rem; } .md-typeset h2 { font-family: "Degular", sans-serif; - font-size: 2rem; - line-height: 2.5rem; + font-size: 30px; + line-height: 1.5rem; } .md-typeset h3 { font-family: "Degular", sans-serif; - font-size: 1.5rem; - line-height: 2rem; + font-size: 22px; + line-height: 1.5rem; } }