Manipulating data frames

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How do I access the data stored in data frames?

Objectives

Understand the fundamentals of addressing data frames.

Datasets

In R, there are several built-in datasets and many more available in various packages. To access any of these datasets, you can simply use:

iris

Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
...
149          6.2         3.4          5.4         2.3  virginica
150          5.9         3.0          5.1         1.8  virginica

Additionally, you can use the data() function to load any built-in dataset into your R session.

data(iris)

You will now be able to view the iris dataset in your environment pane (selecting it will open it as a file in your script pane).

We first want to explore what high-level information RStudio can give us about the iris dataset.

str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

From the output, you can see that the iris dataset has 150 observations and 5 variables. The variables include numeric variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and a factor variable (Species). The factor variable Species has three levels: “setosa”, “versicolor”, and “virginica”. Additionally, it shows the first few observations of the dataset.

summary(iris)

Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

The summary() function in this case, provides:

Minimum: The minimum value of the variable.
1st Quartile (Q1): The value below which 25% of the data fall.
Median (Q2): The middle value of the variable, where 50% of the data fall below and 50% fall above.
Mean: The average value of the variable.
3rd Quartile (Q3): The value below which 75% of the data fall.
Maximum: The maximum value of the variable.
The frequency count of each level for a factor.

dim(iris)

150 5

The dim() fuctions provides the dimensions of the dataset. 150 rows, 5 columns.

Next we can look at the data in more detail. Type but do not press enter:

iris$

You should see an autocomplete option panel for every column name in the dataset. Select one and press enter. This will show you all the content from this column.

There are several subtly different ways to call variables, observations, and elements from data frames. Try all of these:

iris [3]

Returns a dataframe with just column 3.

iris [[3]]

Returns the contents of column 3.

head(iris)

Returns the top 6 results.

tail(iris)

Returns the bottom 6 results.

iris$Sepal.Length

Returns the contents of column 3. ($ operator is used to extract variables)

iris [3, 3]

The syntax iris [row_index, column_index]. So this returns the item in the 3rd row and in the 3rd column.

iris [, c] #col number

Returns the contents of column 3.

iris [r, ] #row number

Returns the contents of row 3.

Be clear that calling columns will be return vectors and calling rows will return lists.

Adding and removing columns and rows

Add column
We know that columns are vectors so we can generate a new vector and add it to our dataset:

num_petals <- rep(4, 150) # we are assuming all the flowers have four petals for the sake of this example 

iris <- cbind(iris, num_petals)

Note:
If you try and inspect the data in output in the console you may get a warning that you have reached your max.print and rows have been omitted. You can change this if you wish using:

options(max.print = n) # n being the number of elements to print

Remove column

To remove the column, we want to provide the index of the column we want to remove (you can think of it as we are asking for all the columns, minus the specified column.)

iris <- iris[,-6] # remove the pointless column we just added

Add row
Remember, rows are lists, so we need to generate a new list that matches our dataset.

new_row <- list(5.4, 3.2, 4.9, 1.9, "setosa")

iris <- rbind(iris, new_row)

Remove row
Removing rows is very similar to removing columns:

iris <- iris[-151,]

Note: If you have added multiple new rows, you can also pass in a vector of rows to be removed (e.g., iris[-c(151:153),])

Subsetting data frames

Now we will consider more complicated selections of the data. We have seen that we can pass a vector into the index selection of our data frame (e.g., -c(151:153) ). We can use what we know about vectors to make some interesting selections:

iris[c(1,2,1,3,1,4)]

		Sepal.Length Sepal.Width Sepal.Length.1 Petal.Length Sepal.Length.2 Petal.Width
1            5.1         3.5            5.1          1.4            5.1         0.2
2            4.9         3.0            4.9          1.4            4.9         0.2
...

iris[2:4]

		Sepal.Width Petal.Length Petal.Width
1           3.5          1.4         0.2
2           3.0          1.4         0.2
...

iris[-(2:4)]

	Sepal.Length    Species
1            5.1     setosa
2            4.9     setosa
...

Note that we are not using ‘iris <-‘ before our commands so we are not changing the original structure. We could be creating subsets of our data if we were assigning it to a new variable (e.g., subset_iris <- iris[2:4]).

Subsetting by name

We have been selecting by index up until this point. However, indexes can change if we manipulate a data frame. Column names will not change. Try:

iris[c("Sepal.Length","Sepal.Width")]

Using names to index requires us to use a different syntax than indexing with numbers (think back to data types and their operations, Sepal.Length”:”Sepal.Width does not make sense). If we combine that with our understanding of how logical operators worked, we have the building blocks for how we are going to address a data frame using column names.

To provide some clarity, type:

iris[c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)]

    Sepal.Width Petal.Width    Species
1           3.5         0.2     setosa
2           3.0         0.2     setosa
...

This is indexing with logical operators. We can use a new operator that we have not looked at yet ‘%in%’. This operator is used to check if elements in one vector are present in another vector. It returns a logical vector indicating whether each element of the first vector is found in the second vector.

iris[, ! names(iris) %in% c(“Sepal.Length”, "Sepal.Width")] # notice the negation

    Petal.Length Petal.Width    Species
1            1.4         0.2     setosa
2            1.4         0.2     setosa
...

With this we will get all columns not in the list.

Many more options

The complexity of subsetting data is going to depend on the task at hand. The scope of this session does not allow us to explore all eventualities but should give you a base understanding of how to effectively subset your data. There are also packages designed specifically for manipulating data frames (e.g.,plyr, dplyr and tidyr). Although it is possible to achieve similar results using base R functions, these packages provide a more efficient and convenient way to work with data, especially for tasks involving data manipulation, summarization, and reshaping.

Key Points

The complexity of the process is related to the complexity of the conditions for retrieving the data you want.

previous episode

Introduction to R - theory sessions

next episode