This package contains common R scripts I use in my day to day data analysis of biological data. The scripts are primarily for plotting and visualisation, with some data organisation thrown in as well.
dplyr
ggplot2
ggrepel
ComplexHeatmap
RColorBrewer
Rmisc
ggpubr
gtools
grid
pBrackets
biomaRt
First install devtools to allow installation from gitub and any other required packages.
install.packages("devtools")
library("devtools")
library(devtools)
library(knitr)
Now install the BioOutputs package.
install_github("KatrionaGoldmann/BioOutputs")
library("BioOutputs")
Create a correlation plot. Taken from kassambara/ggpubr just changed the default arguments.
So we can use the classic example with the mtcars data frames:
kable(head(mtcars))
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
bio_corr(mtcars, "qsec", "wt")
The bio_frequency() function generates a frequency table from factor or character vector columns in a data frame. This has the following arguments:
Argument | |
---|---|
data | A data frame containing columns to be counted |
columns | Column names or indices to be counted in data |
freq.percent | Whether the table should include frequency counts, percentages or both (options = c("freq", "percent", "both")). Default="both" |
include.na | Include NA values (options are TRUE/FALSE, default=TRUE) |
remove.vars | Character vector of variables not to be included in the counts (e.g. remove.vars = c("") remove blanks from the count) |
Then if we want to see the breakdown of, say, the gear column in mtcars we can apply:
kable(bio_frequency(mtcars, "gear"))
3 | 4 | 5 | Total | |
---|---|---|---|---|
gear | 15 (47%) | 12 (38%) | 5 (16%) | n = 32 |
And if wanted we can remove one variable from the table. This is useful if we have unknowns or the likes.
kable(bio_frequency(mtcars, "gear", remove.vars=c("5")))
3 | 4 | Total | |
---|---|---|---|
gear | 15 (56%) | 12 (44%) | n = 27 |
This function generates a volcano plot from a top table using ggplot.
The function contains many parameters, use ?bio_volcano
to interogate.
Argument | |
---|---|
toptable | A data frame containing p value and fold change columns for parameters compared across multiple groups. The p value column should be named "pvalue". |
fc.col | The column name which stores the fold change. Should be in the log2 format (default="log2FC") |
padj.col | The column which contains adjusted p-values. If NULL adjusted pvalues will be calculated |
padj.method | correction method. Options include: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default="fdr |
padj.cutoff | The cutoff for adjusted pvalues. This adds a horizontal line of significance (default=NULL) |
fc.cutoff | The log2(fold change) significance cut-off (default=1) |
marker.colour | Character vector of four colours to map to the volcano plot. In the order non-significanct, fold-change significant, pvalue significant, significant in fold-change and pvalues (default=c("grey60", "olivedrab", "salmon", "darkturquoise")) |
label.p.cutoff | The cutoff for adjusted pvalues for labelling (default=NULL). Not recommended if many significant rows. |
label.row.indices | Indices of rows to be labelled (default=NULL) |
label.colour | Colour of labels (default="black") |
legend.labs | A character vector for theThe legend label names (default=c("Not Significant", "FC>fc.cutoff", "Padj<padj.cutoff", "FC>fc.cutoff& Padj<padj.cutoff")) |
add.lines | Whether to add dashed lines at fc.cutoff and padj.cutoff (default=TRUE) |
line.colour | The color of dashed significance lines (default="grey14") |
main | Plot title |
xlims, ylims | The plot limits |
Lets look at the ALS patients carrying the C9ORF72 data set
toptable <- read.table("https://gist.githubusercontent.com/stephenturner/806e31fce55a8b7175af/raw/1a507c4c3f9f1baaa3a69187223ff3d3050628d4/results.txt", sep="", header=T)
rownames(toptable) = toptable$Gene
bio_volcano(toptable, fc.col="log2FoldChange", label.row.indices=1:10, main="ALS patients carrying the C9ORF72", add.lines=TRUE)
Lets look at the leukemia data set
BiocManager::install("leukemiasEset", version = "3.8")
library(leukemiasEset)
library(limma)
data(leukemiasEset)
ourData <- leukemiasEset[, leukemiasEset$LeukemiaType %in% c("ALL", "NoL")]
ourData$LeukemiaType <- factor(ourData$LeukemiaType)
design <- model.matrix(~ ourData$LeukemiaType)
fit <- lmFit(ourData, design)
fit <- eBayes(fit)
toptable <- topTable(fit)
toptable$pvalue = toptable$P.Value
bio_volcano(toptable, fc.col="logFC", label.row.indices=1:10, main="leukemia", add.lines=FALSE)
This function colours data according to whether it is below or above a defined plane. The plane is plotted as a line and data can be output as either a line or markers/points.
Argument | |
---|---|
x | x column name in data |
y | y column name in data |
df.data | Data frame containing x and y columns |
x.plane | column name for the x axis in df.plane |
y.plane | column name for the y axis in df.plane |
df.plane | Date frame modelling the plane |
stepwise | logical whether to plot the cutoff plane as stepwise or smoothed |
colours | colour vector for higher, lower and plane values (default=c("green", "red", "grey) respectively) |
inc.equal | logical whethere points on the line should be counted as above (dafault=TRUE) |
labels | label for the markers (default=c("above", "below")) |
type | type of plot for data (options include point (dafault), line, stepwise) |
Lets look at some beaver temperature data.
data(beavers)
df.plane = beaver1
df.data = beaver2
df.plane$temp <- df.plane$temp + 0.5
bio_bires(x="time", y="temp", df.data, x.plane="time", y.plane="temp", df.plane)
This function splits expression data into customisable modules and averages over catagories in a given variable.
Argument | |
---|---|
exp | data frame containing the expression data |
mod.list | A list of modules. Each element contains the list of genes for a modules. The gene names must match the rownames in the exp dataframe. |
meta | Dataframe where each column contains an annotation/tract for samples in the heatmap. The order of samples in meta must match that of exp. |
cluster.rows | The method to use for clustering of rows |
cols | Chacter vector, or named vector to fix the order, defining the colours of each mean.var group. |
main | Title of heatmap |
show.names | Show row names. Logical. |
mean.subjects | Logical to determine whether to add a row for the mean value for all subjects in a group |
split.var | Character defining the meta column to average (mean) over. |
In this example we will look at two Li modules and a custom one I made up.
exp = rld.syn
meta <- rld.metadata.syn
bio_mods(exp=exp, mod.list=mod_list, meta=meta, mean.var = "Pathotype", cluster.rows = FALSE)
This function plots a line graph comparing two treatment groups over time.
Argument | |
---|---|
df | Data frame containing the data for both groups |
group.col | Column name which corresponding to the group of values in df |
y.col | Column name corresponding to the y-axis values in df |
x.col | Column name corresponding to the x-axis values in df |
id.col | Column name which corresponds to subject ID |
main | Title of pot |
cols | Character vector for colours of lines |
p.col | Colour of p-value text |
Lets look at kidney disease survival in this example from the survival package.
library(survival)
data(kidney)
df <- kidney
df <- df[df$time < 150, ]
bio_treatmentGroups(df, group.col="status", y.col="frail", x.col="time")
This function switches gene (or snp) ids using biomart
Argument | |
---|---|
IDs | list of the Ids you want to convert |
IDFrom | What format these IDs are in (default Ensembl) |
IDTo | What format you want the IDs converted to (default gene names) |
mart | The biomart to use. Typically, for humans you will want ensembl (default). Alternatives can be found at listEnsembl() |
dataset | you want to use. To see the different datasets available within a biomaRt you can e.g. do: mart = useEnsembl('ENSEMBL_MART_ENSEMBL'), followed by listDatasets(mart). |
attributes | list of variables you want output |
For example genes TNF and A1BG:
IDs = c("TNF", "A1BG")
#kable(bio_geneid(IDs, IDFrom='hgnc_symbol', IDTo = 'ensembl_transcript_id'))
This function exports a complex heatmap looking at the expression of different modules which can be customised.
Argument | |
---|---|
exp | Expression Data |
var | Vector classing samples by variables |
prefix | Prefix to Heatmap titles |
stars | whether pvales should be written as numeric or start (default=FALSE) |
overlay | pvalues on fold change Heatmap or besidde (default = TRUE) |
logp | Whether or not to log the pvalues |
... | Other parameters to pass to Complex Heatmap |
exp = rld.syn
meta <- rld.metadata.syn
bio_fc_heatmap(exp=exp, var=meta$Pathotype)
Creates boxplots showing the significance between groups.
Argument | |
---|---|
data | Data frame containing columns x and y |
x, y | x and y variable names for drawing. |
p.cutoff | plot p-value if above p.cutoff threshold. To include all comparisons set as NULL. |
stars | Logical. Whether significance shown as numeric or stars |
method | a character string indicating which method to be used for comparing means.c("t.test", "wilcox.test") |
star.vals | a list of arguments to pass to the function symnum for symbolic number coding of p-values. For example, the dafault is symnum.args <- list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c('****', '', '', '', 'ns')). In other words, we use the following convention for symbols indicating statistical significance: ns: p > 0.05; : p <= 0.05; : p <= 0.01; : p <= 0.001; ****: p <= 0.0001 |
Lets look at the iris data:
bio_boxplots(iris, x="Species", y= "Sepal.Width", p.cutoff = 0.0001)
bio_boxplots(iris, x="Species", y= "Sepal.Width", NULL, stars=TRUE)
Creates a sankey plot with heatmaps showing how individuals progress over time.
Argument | |
---|---|
samp.orders | A list of named vectors for sample order at each timepoint. Vector names must correspond to matchable ids. |
library(lme4)
library(ComplexHeatmap)
data(sleepstudy)
data = sleepstudy
data = data[data$Days %in% c(0, 1, 2), ]
data = data[data$Days !=1 | data$Subject != "308", ]
time.col="Days"
exp.cols = "Reaction"
sub.col = "Subject"
row.order=list()
for(i in unique(data[, time.col])){
hm = Heatmap(data[data[, time.col] == i, exp.cols])
row.order[[paste('timepoint', as.character(i))]] = setNames(unlist(row_order(hm)), data[data[, time.col] == i, sub.col])
}
bio_sankey(samp.order=row.order)
Converts strings to appropriate format for titles according to Chicago style.
Argument | |
---|---|
titles | A character vector of phrases to be converted to titles |
exeption.words | A character vector of words with case to be forced (for example abbreviations and roman numerals) |
replace.chars | A named list of characters to replace in title. This works in order of appearance. E.g. c("\."=" ") replaces fullstops with spaces. |
shakespeare_plays =c("all's well that ends well", "as you like it","comedy of errors","love's labour's lost", "measure for measure", "merchant of venice","merry wives of windsor","midsummer night's dream","much ado about nothing","taming of the shrew", "tempest", "twelfth night","two gentlemen of verona", "winter's tale", "henry iv, part i","henry iv, part ii", "henry v", "henry vi, part i","henry vi, part ii", "henry vi, part iii", "henry viii","king john", "pericles","richard ii", "richard iii", "antony and cleopatra","coriolanus","cymbeline", "hamlet","julius caesar", "king lear", "macbeth (the scottish play)", "othello", "romeo and juliet","timon of athens", "titus andronicus", "troilus and cressida")
exception.words= c("I", "II", "III", "IV", "V", "VI") # Pass in exceptions
bio_capitalize(as.character(shakespeare_plays[grepl("henry", shakespeare_plays)]), exception.words)
Calculates the module expression from a list of genes.
Argument | |
---|---|
exp | data frame containing the expression data |
mod.list | A list of modules. Each element contains the list of genes for a modules. The gene names must match the rownames in the exp dataframe. |
bio_mods(exp=exp, mod.list=mod_list)