Code
library(tidyverse)
Ana Schaan (aschaan@ikmb.uni-kiel.de) and Malte Rühlemann (m.ruehlemann@ikmb.uni-kiel.de)
This is how you load packages into R. Packages contain a series of functions and data developed by R community. We will be working with tidyverse. Normally, you load all the packages needed at the beginning of your code.
Packages are collections of functions that are often written and collected for a specific purpose. Bioconductor is a place where many packages aimed at the analysis of biological data are archived and provided for installation. For this part of the tutorial which aims at introducing you to R using tidyverse, we will load named tidyverse
pacakge.
R is a software language for statistical analyses. There is tons of excellent quality info out there. So, we will focus on the practical aspects of it.
help.start()
and the HTML help button in the Windows GUI.help
and ?
: help("data.frame")
or ?help
.help.search()
, apropos()
. example, apropos("sum")
browseVignettes("package")
R is an object-oriented language, so every data item is an object in R. These are the following basic data types in R.
It is practical to store objects and values into named objects. You assign names to values using <-
as name_of_object <- value
or =
as name_of_object = value
.
Now check that the x, y and z appeared on the environment. Use the function ls()
to inspect the environment using the console.
Functions contain operations that will be applied to objects. Functions have arguments which normally are of two kinds: 1. the data to which the operations will be applied on. 2. the details of the operations (so-called “arguments”).
?
(e.g. ?print()
)to get information about functions arguments.TAB
keyboard to get suggestion of function and object names with AUTOCOMPLETE. This is very useful!You can do easy maths-tasks in R:
…and you can do the same in R using data stored in objects:
Vectors in R are collections of the same object types (numeric, character/string, logical) within an object
Exercise: Creating sequences
seq()
to create a vector from 1 to 9 skipping every second number. Store results in a object name “odds”.?seq()
.Similar to vectors, but are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:
You can apply functions to vectors. Two useful functions to keep in mind:
length
, which returns the length of a vector (i.e. the number of elements it contains)sum
which calculates the sum of the elements of a vectorother interesting ones
mean(a)
summary(a)
min(a,b)
, max(a,b)
Let’s suppose we’ve collected some data from an experiment and stored them in an object x
. Some simple summary statistics of these data can be produced:
You can pick one or more elements of a vector by using the []
and the index
of the element.
It is easy to get the first element by indicating its index (1).
You can substract one element of the vector as well using -
. For instance, get all elements except the 3rd.
And you can use a vector of indexes to select one of more elements.
Subscripting is very useful. For instance, using the experimental data about, let’s suppose that we subsequently learn that the first 6 data points correspond to measurements made in one experiment, and the second six on another experiment. This might suggest summarizing the two sets of data separately, so we would need to extract from x
the two relevant subvectors. This is achieved by subscripting.
The function head
provides a preview of the vector.
There are also useful functions to order and sort vectors:
sort
: sort in increasing orderorder
: orders the indexes is such a way that the elements of the vector are sorted, i.e sort(v) = v[order(v)]
order
: orders the indexes is such a way that the elements of the vector are sorted, i.e sort(v) = v[order(v)]
rank
: gives the ranks of the elements of a vector, different options for handling ties are available.Matrices are two–dimensional vectors and can be created in R in a variety of ways.
Create the columns and then glue them together with the command cbind
. For example:
We can also use the function matrix()
directly to create a matrix.
There is a similar command, rbind
, for building matrices by gluing rows together. The functions cbind
and rbind
can also be applied to matrices themselves (provided the dimensions match) to form larger matrices.
R will try to interpret operations on matrices in a natural way. For example, with z as above, and y defined below we get:
Notice that multiplication here is component–wise. As with vectors it is useful to be able to extract sub-components of matrices.
As before, the [ ]
notation is used to subscript. The following examples illustrate this:
So, in particular, it is necessary to specify which rows and columns are required, whilst omitting the index for either dimension implies that every element in that dimension is selected.
A data frame is a matrix where the columns can have different data types. As such, it is usually used to represent a whole data set, where the rows represent the samples and columns the variables. Essentially, you can think of a data frame as an excel table.
Tidyverse is a collection of R packages for data science. In tidyverse, the package r CRANpkg("tibble")
improves the conventional R data.frame
class. A tibble is a data.frame
which a lot of tweaks and more sensible defaults that make your life easier.
Let’s illustrate this by loading the GMbC metadata for each individual stored in a file in tab–separated-format (tsv). We load it using the function read_tsv
, which is used to import a data file in tab separated format — tsv into R. In a .tsv–file the data are stored row–wise, and the entries in each row are separated by commas.
The function read_tsv
is from the r CRANpkg("readr")
package and will give us a tibble as the result.
It has many information on the sampled individuals in the GMbC collection that were acquired using questionnaires, such as their place of residence, their age, their weight and height, and information about their life, e.g. what they eat, how they live, etc.
Now that we have imported the data set, you might be wondering how to actually access the data. For this the functions filter
and select
from the r CRANpkg("dplyr")
package of the r CRANpkg("tidyverse")
are useful. filter
will select certain rows (observations), while select
will subset the columns (variables of the data).
In the following command, we get all the patients that are taller than 190cm and select their height and the country they were sampled in, as well as their Id:
There are a couple of operators useful for comparisons:
Variable == value
: equalVariable != value
: un–equalVariable < value
: lessVariable > value
: greater&: and
|
or!
: negation%in%
: is element?The function filter
allows us to combine multiple conditions easily, if you specify multiple of them, they will automatically concatenated via a &
.
For example, we can easily get tall individuals from Ghana via:
We can also retrieve tall individuals OR individuals from Ghana via
Using available tidyverse functions, we can also create new columns in our data frames.
Let’s try to calculate the Body-Mass-Index (BMI), defined as BMI = Weight (kg) / Height (m).
As the metdata carries the height in cm, we first need to transform this to meters in a new column. Then we used the new column and the weight column to calculated the BMI:
Use %>%
(pipe) to emphasize a sequence of actions, rather than the object that the actions are being performed on. Pipe is a function from r CRANpkg("magrittr")
. New versions of R (> 4.1) have a native pipe: |>
. Pipes are useful to skip the need of naming intermediate files.
Pipes redirects the object placed before the pipe to the function placed after the pipe.
bmi_selected <- metadata %>% #object here is a data.frame
mutate(., height_m = height_cm / 100) %>% # The object data frame is placed where you indicate it with an ".".
mutate(BMI = weight_kg / height_m^2) %>% # Also, pipe places automatically in the first position of the function
select(SampleID, country, height_cm, weight_kg, height_m, BMI)
bmi_selected
You can see, we also selected the “country” column and stored everything in the new object bmi_selected
.
The package r CRANpkg("ggplot2")
allows very flexible plotting in R. It is very power but it has its own grammar, which is not very intuitive at first.
There are 3 basic components of ggplot2 grammar:
Ggplots are build up by combining the plot elements. For instance, first you build the plot are, then you select the geometrics and their aesthetics. See example below.
Build plot frames.
Add plot geometrics. In this case we are plotting a scatter plot, which is a done by plotting points (geometric geom_point
). Note that we are using an +
to add the geometrics to the plot area.
Now that we have a basic plot to outlined, we can work on geometric aesthetics.
As you can see, we can color the points using additional data. ggplot2 also automatically detects the type of data that is used: the first plot is colored in a gradient as BMI
has numerical values, the second plot uses discrete colors as country
is a vector of strings/characters values.
Another kind of plot: histogram. Similarly, first the plot frame is build, followed by the geometric.
Another kind of plot: Box plots.
Different geometric can be combined, for instance boxplot and points.
Different geometrics offer different visualization of the same data.
Plot appearance can also be done without being dependent on a specific variable. For instance, color all points blue or add transparancy to all of points. These modifications are done out of the aes()
functions, but withing the geometric functions (geom_*
)
Any aspects of the plot can be modified. For instance, we are making the plot more informative by adding labels to the axes.
You can do much much more with ggplot2. Some good ideas ideas are found on the tutorial by Zev Ross and Cédric Scherer and the Book Data Visualization
by Keiran Healy.
Here is an example of a plot that shows the distribution of samples along the first two lifestyle principal components, colored by the variable “subsistence”, which is a broad categorization of lifestyles. Large shapes show the centroids of group means of the samples in each “urbanism” category
metadata %>%
ggplot(aes(x=Dim1_lifestyle, y=Dim2_lifestyle)) +
theme_bw() + theme(panel.grid=element_blank(), legend.position = "right") +
xlab("Lifestyle (Dim1)") + ylab("Lifestyle (Dim2)") +
geom_point(aes(shape=lifestyle, fill=urbanism)) +
scale_shape_manual(values=c(21,22)) +
scale_fill_brewer(palette="Paired") +
geom_point(data = . %>% group_by(urbanism) %>%
summarise(
Dim1_lifestyle=mean(Dim1_lifestyle),
Dim2_lifestyle=mean(Dim2_lifestyle)),
aes(fill=as.factor(urbanism)), size=10, shape=23)
Most of this tutorial is an ipsis litteris copy of the tutorial given by Bernd Klaus at the EMBL in 2017.
We also got inspired and got some ideas from the slides from Olivier Gimenez workshop on reproducible science.
Lucas created the original document, Malte adapted parts for use in the KMC Workshop 2024.
For more extensive and view of R coding with Tidyverse see their style guide.