R for reproducible scientific analysis

Data frames and reading in data

Learning Objectives

  • Read tabular data from a file into a program.
  • Select individual values and subsections from data.
  • Display simple graphs.

We are studying the changes in total population, life expectancy and gross national income per capita (US $ by exchange rate) for 142 countries over a period of 55 years, with measurements taken every 5 years from 1952 to 2007. The data are stored in comma-separated values (CSV) format. Each row holds the observations for one country at one time point. The columns hold the following information: country, year, total population, continent, life expectancy and gross national income per capita. For more information visit the Gapminder website.

The first few rows of our file look like this:

country,year,pop,continent,lifeExp,gdpPercap
Afghanistan,1952,8425333,Asia,28.801,779.4453145
Afghanistan,1957,9240934,Asia,30.332,820.8530296
Afghanistan,1962,10267083,Asia,31.997,853.10071
Afghanistan,1967,11537966,Asia,34.02,836.1971382
Afghanistan,1972,13079460,Asia,36.088,739.9811058
Afghanistan,1977,14880372,Asia,38.438,786.11336
Afghanistan,1982,12881816,Asia,39.854,978.0114388
Afghanistan,1987,13867957,Asia,40.822,852.3959448
Afghanistan,1992,16317921,Asia,41.674,649.3413952

We want to:

  • Load data into memory,
  • Find the country with the highest gross national income per capita, and
  • Plot the total population of Argentina over the years.

Loading Data

To load our gapminder data, we first need to tell R where to find the file that contains the values. We know its name is gapminder-FiveYearData.csv. This step is very important in R: if we forget it, we’ll get an error message when we try to read the file. We can change the current working directory using the function setwd. For this example, we set the working directory to the folder where our project is stored, which is named test_project:

setwd("~/test_project")

Alternatively, you can change the working directory from the RStudio GUI with the menu option Session -> Set Working Directory -> Choose Directory...
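
A quick way to confirm that the working directory is what we expect is sketched below. It only assumes the relative path used in this lesson; the output depends on your own machine, so none is shown.

```r
# Show the current working directory
getwd()

# Relative paths are resolved against the working directory.
# file.exists() returns TRUE only if R can find the file from here.
file.exists("data/gapminder-FiveYearData.csv")
```

If the second call returns FALSE, the working directory is probably not the project folder, and read.csv would fail with the same relative path.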

We also know that our data files are located in the directory data inside the working directory. Now we can load the data into R using read.csv:

read.csv(file = "data/gapminder-FiveYearData.csv", header = TRUE)

The expression read.csv(...) is a function call that asks R to run the function read.csv.

Here we pass read.csv two arguments: the name of the file we want to read, and whether the first line of the file contains names for the columns of data. The filename needs to be a character string (or string for short), so we put it in quotes. Setting the second argument, header, to TRUE indicates that the data file has column headers.

Since we didn’t assign the output of read.csv to any variable, R prints the full contents of the file gapminder-FiveYearData.csv to the interactive console.
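
To see what the header argument does without needing the file on disk, here is a minimal sketch that reads a three-line sample through read.csv's text argument. The variable names sample_csv, with_header and without_header are made up for this illustration.

```r
# A three-line sample copied from the top of the gapminder file
sample_csv <- "country,year,pop,continent,lifeExp,gdpPercap
Afghanistan,1952,8425333,Asia,28.801,779.4453145
Afghanistan,1957,9240934,Asia,30.332,820.8530296"

# header = TRUE: the first line supplies the column names
with_header <- read.csv(text = sample_csv, header = TRUE)
names(with_header)
nrow(with_header)      # 2 data rows

# header = FALSE: the first line is treated as data,
# and R names the columns V1, V2, ...
without_header <- read.csv(text = sample_csv, header = FALSE)
names(without_header)
nrow(without_header)   # 3 rows; the first one holds the column names as values
```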

Challenge 1

Go to File -> New File -> R Script, and write an R script to read the gapminder dataset and assign it to a variable named gapminder.

Save the R script as data_analysis under a new folder in your project named scripts/ and add it to version control.

Run the script with the source function, passing the file path as its argument (or by pressing the “source” button in RStudio).

What is the data structure of gapminder? (hint: use function class)

Data frames are similar to matrices, except that each column can be a different atomic type. Under the hood, data frames are really lists in which each element is an atomic vector, with the added restriction that all elements have the same length. Data frames are very useful for storing data, and you will encounter them often when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.
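
The list-like nature of a data frame can be checked directly. The sketch below builds a small data frame by hand (the values come from the first rows of the file; the name df is only for illustration) and inspects it.

```r
# A small hand-built data frame
df <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L, 1962L),
  lifeExp = c(28.801, 30.332, 31.997),
  stringsAsFactors = FALSE   # keep country as character in every R version
)

is.list(df)         # TRUE: a data frame is a list under the hood
sapply(df, class)   # each column has its own type
lengths(df)         # every column has the same length
```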

We can see the dimensions, or shape, of the data frame with the function dim:

dim(gapminder)
[1] 1704    6

This tells us that our data frame, gapminder, has 1704 rows and 6 columns. Alternatively, you can see the number of rows and columns using the functions nrow and ncol, respectively.
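
As a small illustration of how dim relates to nrow and ncol, here is a sketch on a hand-built data frame (df is a made-up name; the values are the first three Afghanistan rows).

```r
df <- data.frame(
  year = c(1952, 1957, 1962),
  pop  = c(8425333, 9240934, 10267083)
)

dim(df)    # 3 2: rows first, then columns
nrow(df)   # 3
ncol(df)   # 2
```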

Slicing data using indices

If we want to get a single value from the data frame, we can provide an index in square brackets, just as we do in math:

# first value in gapminder
gapminder[1, 1]
[1] Afghanistan
142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe

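R automatically stored this first column as a factor, not a character vector. (This is the default behaviour of read.csv in versions of R before 4.0.0; from 4.0.0 onwards, strings are read as character vectors by default.) We can change this by coercing the column to a character vector: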

gapminder$country <- as.character(gapminder$country)
class(gapminder$country)
[1] "character"
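
The difference between a factor and a character vector can be seen on a small example. This is a sketch with made-up variable names; it also shows read.csv's stringsAsFactors argument, which avoids the conversion at read time.

```r
# A factor stores each distinct value once (the levels) plus integer codes
countries <- factor(c("Afghanistan", "Albania", "Afghanistan"))
class(countries)        # "factor"
levels(countries)       # "Afghanistan" "Albania"
as.integer(countries)   # 1 2 1

# Coercing gives back the plain strings
countries_chr <- as.character(countries)
class(countries_chr)    # "character"

# Asking read.csv to keep strings as character from the start
sample_csv <- "country,year
Afghanistan,1952
Albania,1952"
df <- read.csv(text = sample_csv, stringsAsFactors = FALSE)
class(df$country)       # "character"
```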

An index like [1, 1] selects a single element of a data frame, but we can select whole sections as well. For example, we can select the first five measurements (rows) like this:

gapminder[1:5, ]
      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811

The slice 1:5 means, “Start at index 1 and go to index 5”. To select all the columns, you simply leave out the column slice and R returns them all. If we don’t provide a slice for either rows or columns, e.g. gapminder[, ], R returns the full data frame.

The slice does not need to start at 1, e.g. the line below selects rows 5 through 10:

gapminder[5:10, ]

We can use the function c to select non-contiguous values, using indices (for the rows) and names (for the columns):

gapminder[c(3, 8, 37, 56), c("country", "year", "pop")]
       country year      pop
3  Afghanistan 1962 10267083
8  Afghanistan 1987 13867957
37      Angola 1952  4232095
56   Argentina 1987 31620918
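
The indexing forms above can be tried side by side on a small data frame. The sketch below uses the first five Afghanistan rows from the file; df is a made-up name.

```r
df <- data.frame(
  country = rep("Afghanistan", 5),
  year    = c(1952, 1957, 1962, 1967, 1972),
  pop     = c(8425333, 9240934, 10267083, 11537966, 13079460),
  stringsAsFactors = FALSE
)

df[2, 3]                 # single value: row 2, column 3
df[2:4, ]                # rows 2 to 4, all columns
df[, c("year", "pop")]   # all rows, two columns selected by name
df[c(1, 5), "pop"]       # one column only: R returns a plain vector
```

Note the last line: when a single column is selected, R drops the data frame structure and returns a vector.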

Conditional slicing of data

To extract all the measurements for Argentina, we use conditional slicing:

gapminder[gapminder$country == "Argentina", ]
     country year      pop continent lifeExp gdpPercap
49 Argentina 1952 17876956  Americas  62.485  5911.315
50 Argentina 1957 19610538  Americas  64.399  6856.856
51 Argentina 1962 21283783  Americas  65.142  7133.166
52 Argentina 1967 22934225  Americas  65.634  8052.953
53 Argentina 1972 24779799  Americas  67.065  9443.039
54 Argentina 1977 26983828  Americas  68.481 10079.027
55 Argentina 1982 29341374  Americas  69.942  8997.897
56 Argentina 1987 31620918  Americas  70.774  9139.671
57 Argentina 1992 33958947  Americas  71.868  9308.419
58 Argentina 1997 36203463  Americas  73.275 10967.282
59 Argentina 2002 38331121  Americas  74.340  8797.641
60 Argentina 2007 40301927  Americas  75.320 12779.380

The comparison operator == is applied to every element of the country column of the gapminder data frame, and R returns only those rows of gapminder for which the country is “Argentina”.
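
It can help to look at the intermediate result of the comparison. The following sketch uses a tiny made-up data frame (df and is_argentina are illustrative names) to show that the comparison yields a logical vector, which then picks the rows.

```r
df <- data.frame(
  country = c("Angola", "Argentina", "Argentina", "Australia"),
  year    = c(2007, 2002, 2007, 2007),
  stringsAsFactors = FALSE
)

# The comparison yields one TRUE/FALSE per row
is_argentina <- df$country == "Argentina"
is_argentina          # FALSE TRUE TRUE FALSE

# Rows where the vector is TRUE are kept
df[is_argentina, ]

# which() turns the logical vector into row indices
which(is_argentina)   # 2 3
```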

Challenge 2

Find which country(-ies) had the highest gdpPercap throughout all years.

Discuss with your neighbors. Have you come up with the same answer? Have you used the same command(s)?

Plotting

The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualise data. Visualisation deserves an entire lecture (or course) of its own, but we can explore a few of R’s plotting features.

Let’s take a look at the population of Argentina over time. Plotting the values is done with the function plot.

dat <- gapminder[gapminder$country=="Argentina", ]
plot(dat$year, dat$pop)

plot of Argentina's population over the years

Above, we gave the function plot two vectors of numbers: the years and the corresponding population of Argentina. plot created a scatter plot where the x-axis is the year and the y-axis is the population.
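
The default plot has no title and uses the expressions as axis labels. As one possible refinement, the sketch below passes a few standard arguments of plot (type, xlab, ylab, main). The population values are copied from the Argentina table shown earlier so that the example runs without the CSV file; the label texts are our own choice.

```r
# Argentina's population from 1952 to 2007, in 5-year steps
year <- seq(1952, 2007, by = 5)
pop  <- c(17876956, 19610538, 21283783, 22934225, 24779799, 26983828,
          29341374, 31620918, 33958947, 36203463, 38331121, 40301927)

plot(year, pop,
     type = "b",                        # draw both points and connecting lines
     xlab = "Year",
     ylab = "Total population",
     main = "Population of Argentina")
```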