R for reproducible scientific analysis

Project management with RStudio

Learning objectives

To create self-contained projects in RStudio
To use git from within RStudio

Introduction

The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, until eventually everything is a bit mixed together.

A good project layout will make your life easier by:

ensuring the integrity of your data;
making it simpler to share the code with peers;
allowing you to easily upload your code with your manuscript submission;
enabling you to pick the project up again after a break hassle-free.

A possible solution

Fortunately, there are tools and packages which can help you manage your work effectively.

One of the most powerful and useful aspects of RStudio is the project management functionality. We’ll be using this today to create a self-contained, reproducible project.

Challenge: Creating a self-contained project

We’re going to create a new project in RStudio:

Click the “File” menu button, then “New Project”.
Click “New Directory”.
Click “Empty Project”.
Type in the name of the directory to store your project, e.g. “test_project”.
Make sure that the checkbox for “Create a git repository” is selected.
Click the “Create Project” button.

Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.
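One practical consequence of a self-contained project is that files can be referred to by paths relative to the project folder. The sketch below is only an illustration: it assumes R was started inside the project, and the file name is the one used later in this lesson.

```r
# Where is R working right now? In an RStudio project this is the project folder.
project_dir <- getwd()
print(project_dir)

# Build paths relative to the project folder instead of hard-coding
# an absolute path that only exists on one computer.
data_file <- file.path("data", "gapminder-FiveYearData.csv")
print(data_file)

# Check whether the file is there yet (FALSE until it has been downloaded).
file.exists(data_file)
```

Because the path is relative, the same code should keep working if the whole project folder is moved or shared.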

Version Control

We also set up our project to integrate with git, putting it under version control. RStudio has an interface to git, but it is very limited in what it can do. Let’s make an initial commit of our template files.

The top right panel in RStudio has a tab for “Git”. Files that are not yet tracked by git are marked with yellow question marks. Our project currently has two untracked items:

.gitignore (automatically generated by RStudio when the git repository is created)
test_project.Rproj (automatically generated by RStudio)
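The same information can be requested from git directly. The following is a minimal sketch, assuming git is installed and that R is running inside the project folder; in git's short status format, untracked files are prefixed with "??", which corresponds to the yellow question marks in the Git tab.

```r
# Start with an empty result so the variable exists even if git is unavailable.
status <- character(0)

# Only call git when it is installed and the current folder is a repository.
if (nzchar(Sys.which("git")) && dir.exists(".git")) {
  # system2() runs a program with a vector of arguments;
  # stdout = TRUE captures the output as a character vector.
  status <- system2("git", c("status", "--short"), stdout = TRUE)
  print(status)
} else {
  message("git not found, or this folder is not a git repository")
}
```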

Stage these two files by ticking their checkboxes in the “Staged” column of the Git tab, then press Commit. In the pop-up window, type the following in the “Commit message” box: “Adding .gitignore and test_project.Rproj files”. Press Commit, then Close on the progress window, and finally close the commit window too.

Best practices for project organisation

Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:

Treat data as read only

This is probably the most important goal of setting up a project. Data are typically time-consuming and/or expensive to collect. Working with them interactively (e.g., in Excel), where they can be modified, means you are never sure where the data came from or how they have been modified since collection. It is therefore a good idea to treat your data as “read-only”.
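Beyond treating data as read-only by convention, the file permissions can enforce it. The sketch below is one possible approach, not part of the original lesson: it uses a temporary stand-in file so it is safe to run, and on Windows Sys.chmod only affects the read-only attribute.

```r
# Stand-in for a raw data file; a temporary file keeps the sketch harmless.
raw_file <- tempfile(fileext = ".csv")
writeLines("placeholder,values", raw_file)

# Remove write permission: mode 0444 means read-only for owner, group and others.
# use_umask = FALSE applies the mode exactly as written.
Sys.chmod(raw_file, mode = "0444", use_umask = FALSE)

# Inspect the resulting permissions.
file.info(raw_file)$mode
```

In a real project the same call would be applied to the files in the raw data folder once they have been collected.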

Data preprocessing

In many cases your data will be “dirty”: they will need significant preprocessing to get into a format that R (or any other programming language) can use. It is often useful to store the preprocessing scripts in a separate folder, and to create a second “read-only” data folder to hold the “processed” data sets.
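A preprocessing script following this idea might look like the sketch below. Everything in it is illustrative: the folder names, the file names and the tiny made-up data set are assumptions, and the whole example runs inside a temporary folder.

```r
# Hypothetical layout inside a temporary folder so the sketch is safe to run.
project <- file.path(tempdir(), "preprocessing_sketch")
raw_dir <- file.path(project, "data")
processed_dir <- file.path(project, "processed_data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up "dirty" raw data: stray whitespace and a missing value.
writeLines(c("country,pop", " Chile ,100", "Peru,NA"),
           file.path(raw_dir, "raw.csv"))

# Preprocessing: read the raw file and tidy it, never writing back to it.
raw <- read.csv(file.path(raw_dir, "raw.csv"), strip.white = TRUE)
clean <- raw[!is.na(raw$pop), ]

# Save the cleaned copy in the separate processed-data folder.
write.csv(clean, file.path(processed_dir, "clean.csv"), row.names = FALSE)
```

The raw file is only ever read, so the cleaned version can be rebuilt at any time by rerunning the script.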

Treat generated output as disposable

Anything generated by your scripts should be treated as disposable: you should be able to regenerate all of it by rerunning those scripts.
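One way to test whether output really is disposable is to delete it and rebuild it from code. The sketch below assumes a hypothetical output folder (placed in a temporary directory here) and uses R's built-in mtcars data as a stand-in for real project data.

```r
# Hypothetical output folder inside a temporary directory.
output_dir <- file.path(tempdir(), "disposable_output")

# Throw away everything that was generated before...
unlink(output_dir, recursive = TRUE)
dir.create(output_dir)

# ...and regenerate it from code.
summary_table <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
write.csv(summary_table, file.path(output_dir, "mpg_by_cyl.csv"),
          row.names = FALSE)

# List what was regenerated.
list.files(output_dir)
```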

Create data directory and store the data

The first step for our project is to create a new folder, called data, and store the raw data files in it.

Challenge 1

To create a new folder:

On the bottom right panel click the New Folder button.
In the popup window type the name of the new folder.
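The folder can also be created from the R console instead of the Files panel. This is a small alternative sketch, assuming R is running in the project folder:

```r
# Create the data folder; showWarnings = FALSE keeps R quiet if it already exists.
dir.create("data", showWarnings = FALSE)

# Confirm that the folder is there.
dir.exists("data")
```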

To download the gapminder data, you would typically use the wget command in the shell. In that case the command would be: wget http://tinyurl.com/gapminder-FiveYearData-csv.

To run a shell command in R we can use the system function. Look at the arguments of the function in its help page (?system) and then:

Use the R function system to run the shell command wget with the link above. To save the file in the recently created data folder, add the flag --output-document=data/gapminder-FiveYearData.csv to the above wget command.
Click on the data folder (bottom right panel) to check if a new file exists.
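One possible way to approach the steps above is sketched here. It assumes wget is installed and the data folder already exists. The interactive() check is an arbitrary choice for this sketch so that the download only runs when the code is typed into a live R session.

```r
url <- "http://tinyurl.com/gapminder-FiveYearData-csv"
dest <- file.path("data", "gapminder-FiveYearData.csv")

# Assemble the shell command as a single string.
cmd <- paste0("wget ", url, " --output-document=", dest)
print(cmd)

# system() hands the string to the shell and returns the exit status.
if (interactive() && nzchar(Sys.which("wget"))) {
  system(cmd)
}
```

If wget is not available, base R's download.file function is a commonly used alternative.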

We will load, inspect and analyse these data later.

Check the Git tab for any changes. What do you see?

Challenge 2

Modify the .gitignore file to contain data/* so that the data folder isn’t versioned. To do so, click on the file in the bottom right panel and add a line at the end. Save the changes.
Stage and commit the .gitignore file using the RStudio git interface.
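The .gitignore edit can also be scripted. The sketch below is one possible approach: add_ignore_rule is a hypothetical helper written for this example (not an RStudio or git function), and it is demonstrated on a temporary stand-in file rather than the real .gitignore.

```r
# Hypothetical helper: add a rule to an ignore file unless it is already there.
add_ignore_rule <- function(ignore_file, rule) {
  existing <- if (file.exists(ignore_file)) {
    readLines(ignore_file, warn = FALSE)
  } else {
    character(0)
  }
  if (!(rule %in% existing)) {
    writeLines(c(existing, rule), ignore_file)
  }
  invisible(NULL)
}

# Demonstration on a stand-in file with two example entries.
demo_file <- tempfile()
writeLines(c(".Rproj.user", ".Rhistory"), demo_file)
add_ignore_rule(demo_file, "data/*")
add_ignore_rule(demo_file, "data/*")  # second call changes nothing
readLines(demo_file)
```

In the project itself the call would be add_ignore_rule(".gitignore", "data/*"), after which the change still needs to be staged and committed.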