Introduction to R

Author

Ana Schaan (aschaan@ikmb.uni-kiel.de) and Malte Rühlemann (m.ruehlemann@ikmb.uni-kiel.de)

Packages and functions in R

This is how you load packages into R. Packages contain a series of functions and data developed by R community. We will be working with tidyverse. Normally, you load all the packages needed at the beginning of your code.

Packages are collections of functions that are often written and collected for a specific purpose. Bioconductor is a place where many packages aimed at the analysis of biological data are archived and provided for installation. For this part of the tutorial which aims at introducing you to R using tidyverse, we will load named tidyverse pacakge.

Code

library(tidyverse)

R

R is a software language for statistical analyses. There is tons of excellent quality info out there. So, we will focus on the practical aspects of it.

Getting help

Just ask!
help.start() and the HTML help button in the Windows GUI.
help and ?: help("data.frame") or ?help.
help.search(), apropos(). example, apropos("sum")
browseVignettes("package")
rseek.org
There is always the internet :)

Basic data types and functions

R is an object-oriented language, so every data item is an object in R. These are the following basic data types in R.

numeric: real number

Code

character: chain of characters, text

Code

"1"

logical: TRUE, FALSE

Code

TRUE
FALSE

special values: NA (missing value), NULL (“empty object”), Inf, -Inf (infinity), NaN (not a number)

Code

NA

Assignments

It is practical to store objects and values into named objects. You assign names to values using <- as name_of_object <- value or = as name_of_object = value.

Code

x <- 6
x
y <- "object y is a text"
y
z <- FALSE
z

Now check that the x, y and z appeared on the environment. Use the function ls() to inspect the environment using the console.

Functions

Functions contain operations that will be applied to objects. Functions have arguments which normally are of two kinds: 1. the data to which the operations will be applied on. 2. the details of the operations (so-called “arguments”).

Code

x <- 1
y <- "This is a text"

print(x)
print(y)
print(y, quote = FALSE)

Remember that you can use ? (e.g. ?print())to get information about functions arguments.
Use the TAB keyboard to get suggestion of function and object names with AUTOCOMPLETE. This is very useful!

Data types and arithmetic

You can do easy maths-tasks in R:

Code

4 + 6

…and you can do the same in R using data stored in objects:

Code

a <- 4
b <- 6

a + b

Vectors

Vectors in R are collections of the same object types (numeric, character/string, logical) within an object

Code

z <- c(5, 9, 1, 0)
#this is a comment within the code.
z

Exercise: Creating sequences

1. Try out using the function seq() to create a vector from 1 to 9 skipping every second number. Store results in a object name “odds”.
1. Use the function rep to create a vector with the number 1 repeated ten times. Hint: you can inspect what the function does by typing ?seq().

Factor: categorical data that takes a fixed set of values

Similar to vectors, but are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:

Code

factor(1)
x <- factor(c("wt", "wt", "mut", "mut"), levels = c("wt", "mut"))
x

Component-wise operations

Code

x <- c(6, 8 , 9 )
x + 2

You can apply functions to vectors. Two useful functions to keep in mind:

length, which returns the length of a vector (i.e. the number of elements it contains)
sum which calculates the sum of the elements of a vector

other interesting ones

mean(a)
summary(a)
min(a,b), max(a,b)

Summaries, subscripting and useful vector functions

Let’s suppose we’ve collected some data from an experiment and stored them in an object x. Some simple summary statistics of these data can be produced:

Code

x <- c(7.5, 8.2, 3.1, 5.6, 8.2, 9.3, 6.5, 7.0, 9.3, 1.2, 14.5, 6.2)

mean(x)

Code

var(x)

Code

summary(x)

You can pick one or more elements of a vector by using the [] and the index of the element.

It is easy to get the first element by indicating its index (1).

Code

x[1]

You can substract one element of the vector as well using -. For instance, get all elements except the 3rd.

Code

x[-3]

And you can use a vector of indexes to select one of more elements.

Code

x[c(1,5,8)]

Code

x[-(1:6)]

Subscripting is very useful. For instance, using the experimental data about, let’s suppose that we subsequently learn that the first 6 data points correspond to measurements made in one experiment, and the second six on another experiment. This might suggest summarizing the two sets of data separately, so we would need to extract from x the two relevant subvectors. This is achieved by subscripting.

Code

x[1:6]
summary(x[1:6])

Code

x[7:12]
summary(x[7:12])

The function head provides a preview of the vector.

Code

head(x)

There are also useful functions to order and sort vectors:

sort: sort in increasing order
order: orders the indexes is such a way that the elements of the vector are sorted, i.e sort(v) = v[order(v)]

Code

x <- c(1.3, 3.5, 2.7, 6.3, 6.3)
sort(x)

order: orders the indexes is such a way that the elements of the vector are sorted, i.e sort(v) = v[order(v)]

Code

order(x)

Code

x[order(x)]

rank: gives the ranks of the elements of a vector, different options for handling ties are available.

Code

rank(x)

Matrices, lists, data frames and basic data handling

Matrices

Matrices are two–dimensional vectors and can be created in R in a variety of ways.

Creating matrices

Create the columns and then glue them together with the command cbind. For example:

Code

x <- c(5, 7 , 9)
y <- c(6, 3 , 4)
z <- cbind(x, y)
z
dim(z)

We can also use the function matrix() directly to create a matrix.

Code

z <- matrix(c(5, 7, 9, 6, 3, 4), nrow = 3)

There is a similar command, rbind, for building matrices by gluing rows together. The functions cbind and rbind can also be applied to matrices themselves (provided the dimensions match) to form larger matrices.

Matrices operations

R will try to interpret operations on matrices in a natural way. For example, with z as above, and y defined below we get:

Code

y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)
y

Code

y + z

Code

y * z

Notice that multiplication here is component–wise. As with vectors it is useful to be able to extract sub-components of matrices.

Subsetting

As before, the [ ] notation is used to subscript. The following examples illustrate this:

Code

z[1, 1]
z[, 2]
z[1:2, ]
z[-1, ]
z[-c(1, 2), ]

So, in particular, it is necessary to specify which rows and columns are required, whilst omitting the index for either dimension implies that every element in that dimension is selected.

Data frames (tibbles) and lists

A data frame is a matrix where the columns can have different data types. As such, it is usually used to represent a whole data set, where the rows represent the samples and columns the variables. Essentially, you can think of a data frame as an excel table.

Tidyverse is a collection of R packages for data science. In tidyverse, the package r CRANpkg("tibble") improves the conventional R data.frame class. A tibble is a data.frame which a lot of tweaks and more sensible defaults that make your life easier.

Let’s illustrate this by loading the GMbC metadata for each individual stored in a file in tab–separated-format (tsv). We load it using the function read_tsv, which is used to import a data file in tab separated format — tsv into R. In a .tsv–file the data are stored row–wise, and the entries in each row are separated by commas.

The function read_tsv is from the r CRANpkg("readr") package and will give us a tibble as the result.

Code

metadata <- read_tsv("~/kmc_workshop/inputs/metadata_gmbc_bn10_complete_fixed.tsv")
#metadata <- read.table("metadata_gmbc_bn10_complete_fixed.tsv", stringsAsFactors = F)
#metadata = metadata %>% rownames_to_column("SampleID")
#write_tsv(metadata, "metadata_gmbc_bn10_complete_fixed.tsv")


metadata

It has many information on the sampled individuals in the GMbC collection that were acquired using questionnaires, such as their place of residence, their age, their weight and height, and information about their life, e.g. what they eat, how they live, etc.

Accessing data in data frames

Now that we have imported the data set, you might be wondering how to actually access the data. For this the functions filter and select from the r CRANpkg("dplyr") package of the r CRANpkg("tidyverse") are useful. filter will select certain rows (observations), while select will subset the columns (variables of the data).

In the following command, we get all the patients that are taller than 190cm and select their height and the country they were sampled in, as well as their Id:

Code

pat_tall <- filter(metadata, height_cm > 190)
select(pat_tall, SampleID,  height_cm, country)

There are a couple of operators useful for comparisons:

Variable == value: equal
Variable != value: un–equal
Variable < value: less
Variable > value: greater
&: and
| or
!: negation
%in%: is element?

The function filter allows us to combine multiple conditions easily, if you specify multiple of them, they will automatically concatenated via a &.

For example, we can easily get tall individuals from Ghana via:

Code

filter(metadata, height_cm > 190, country == "Ghana")

We can also retrieve tall individuals OR individuals from Ghana via

Code

filter(metadata, (height_cm > 190) | (country == "tall"))

Computing variables from existing ones

Using available tidyverse functions, we can also create new columns in our data frames.

Let’s try to calculate the Body-Mass-Index (BMI), defined as BMI = Weight (kg) / Height (m).

As the metdata carries the height in cm, we first need to transform this to meters in a new column. Then we used the new column and the weight column to calculated the BMI:

Code

metadata_new = mutate(metadata, height_m = height_cm / 100)
metadata_new = mutate(metadata_new, BMI = weight_kg / height_m^2)

head(select(metadata_new, SampleID, height_cm, weight_kg, height_m, BMI))

Combine sequences of actions with pipes

Use %>% (pipe) to emphasize a sequence of actions, rather than the object that the actions are being performed on. Pipe is a function from r CRANpkg("magrittr"). New versions of R (> 4.1) have a native pipe: |>. Pipes are useful to skip the need of naming intermediate files.

Pipes redirects the object placed before the pipe to the function placed after the pipe.

Code

bmi_selected <- metadata %>% #object here is a data.frame
  mutate(., height_m = height_cm / 100) %>% # The object data frame is placed where you indicate it with an ".".
  mutate(BMI = weight_kg / height_m^2) %>% # Also, pipe places automatically in the first position of the function
  select(SampleID, country, height_cm, weight_kg, height_m, BMI)

bmi_selected

You can see, we also selected the “country” column and stored everything in the new object bmi_selected.

Plots with ggplot2

The package r CRANpkg("ggplot2") allows very flexible plotting in R. It is very power but it has its own grammar, which is not very intuitive at first.

There are 3 basic components of ggplot2 grammar:

Data. The dataframe to be plotted
Geometrics. The geommetric shape to represent the data (e.g., point, boxplot)
Aesthetics. The aesthetics of the geometric object (e.g., color, size, shape)

Ggplots are build up by combining the plot elements. For instance, first you build the plot are, then you select the geometrics and their aesthetics. See example below.

Build plot frames.

Code

ggplot(data = bmi_selected, aes(x = height_m, y = weight_kg))

Scatterplot

Add plot geometrics. In this case we are plotting a scatter plot, which is a done by plotting points (geometric geom_point). Note that we are using an + to add the geometrics to the plot area.

Code

ggplot(data = bmi_selected, aes(x = height_m, y = weight_kg)) +
  geom_point()

Scatterplot with additional data

Now that we have a basic plot to outlined, we can work on geometric aesthetics.

Code

# make the size proportional to the age
ggplot(data = bmi_selected, aes(x = height_m, y = weight_kg)) +
  geom_point(aes(color=BMI))

#colour by binned weight
ggplot(data = bmi_selected, aes(x = height_m, y = weight_kg)) +
  geom_point(aes(color = country))

As you can see, we can color the points using additional data. ggplot2 also automatically detects the type of data that is used: the first plot is colored in a gradient as BMI has numerical values, the second plot uses discrete colors as country is a vector of strings/characters values.

Histograms

Another kind of plot: histogram. Similarly, first the plot frame is build, followed by the geometric.

Code

bmi_selected %>% # note that we are using pipe to redirect the data frame to the function ggplot
  ggplot(aes(x = weight_kg)) +
  geom_histogram()

Boxplots

Another kind of plot: Box plots.

Code

bmi_selected %>% 
  ggplot(aes(x = country, y = height_cm)) +
  geom_boxplot()

Combining geometrics

Different geometric can be combined, for instance boxplot and points.

Code

bmi_selected %>% 
  ggplot(aes(x = country, y = height_cm)) +
  geom_boxplot(outlier.color = NA) +
  geom_point()

Different geometrics offer different visualization of the same data.

Code

bmi_selected %>% 
  ggplot(aes(x = country, y = height_cm)) +
  geom_boxplot(outlier.color = NA) +
  geom_jitter()

Changing plot colors and other parameters

Plot appearance can also be done without being dependent on a specific variable. For instance, color all points blue or add transparancy to all of points. These modifications are done out of the aes() functions, but withing the geometric functions (geom_*)

Code

bmi_selected %>% 
  ggplot(aes(x = country, y = height_cm)) +
  geom_boxplot(outlier.color = NA) +
  geom_jitter(alpha = 0.5, # alpha is transparency
              color = "blue")

Any aspects of the plot can be modified. For instance, we are making the plot more informative by adding labels to the axes.

Code

bmi_selected %>% 
  ggplot(aes(x = country, y = height_cm)) +
  geom_boxplot(outlier.color = NA) +
  geom_jitter(alpha = 0.5, # alpha is transparency
              color = "blue") +
  labs(x = "Country of sampling", y = "Height (cm)") +
  theme_classic()

You can do much much more with ggplot2. Some good ideas ideas are found on the tutorial by Zev Ross and Cédric Scherer and the Book Data Visualization by Keiran Healy.

Here is an example of a plot that shows the distribution of samples along the first two lifestyle principal components, colored by the variable “subsistence”, which is a broad categorization of lifestyles. Large shapes show the centroids of group means of the samples in each “urbanism” category

Code

metadata %>% 
    ggplot(aes(x=Dim1_lifestyle, y=Dim2_lifestyle)) +
    theme_bw() + theme(panel.grid=element_blank(), legend.position = "right") +
    xlab("Lifestyle (Dim1)") + ylab("Lifestyle (Dim2)") +
    geom_point(aes(shape=lifestyle, fill=urbanism)) +
    scale_shape_manual(values=c(21,22)) + 
    scale_fill_brewer(palette="Paired") +
    geom_point(data = . %>% group_by(urbanism) %>% 
               summarise(
                Dim1_lifestyle=mean(Dim1_lifestyle),
                Dim2_lifestyle=mean(Dim2_lifestyle)),
               aes(fill=as.factor(urbanism)), size=10, shape=23)

Sources

Most of this tutorial is an ipsis litteris copy of the tutorial given by Bernd Klaus at the EMBL in 2017.
We also got inspired and got some ideas from the slides from Olivier Gimenez workshop on reproducible science.
Lucas created the original document, Malte adapted parts for use in the KMC Workshop 2024.
For more extensive and view of R coding with Tidyverse see their style guide.

Session information

Code

sessionInfo()