R packages by Hadley Wickham


R code

The first principle of using a package is that all R code goes in R/. In this chapter, you’ll learn about the R/ directory, my recommendations for organising your functions into files, and some general tips on good style. You’ll also learn about some important differences between functions in scripts and functions in packages.

R code workflow

The first practical advantage to using a package is that it’s easy to re-load your code. You can either run devtools::load_all(), or in RStudio press Ctrl/Cmd + Shift + L, which also saves all open files, saving you a keystroke.

This keyboard shortcut leads to a fluid development workflow:

  1. Edit an R file.

  2. Press Ctrl/Cmd + Shift + L.

  3. Explore the code in the console.

  4. Rinse and repeat.

Congratulations! You’ve learned your first package development workflow. Even if you learn nothing else from this book, you’ll have gained a useful workflow for editing and reloading R code.

Organising your functions

While you’re free to arrange functions into files as you wish, the two extremes are bad: don’t put all functions into one file and don’t put each function into its own separate file. (It’s OK if some files only contain one function, particularly if the function is large or has a lot of documentation.). File names should be meaningful and end in .R.

# Good
fit_models.R
utility_functions.R

# Bad
foo.r
stuff.r

Pay attention to capitalization, since you, or some of your collaborators, might be using an operating system with a case-insensitive file system (e.g., Microsoft Windows or OS X). Avoid problems by never using filenames that differ only in capitalisation.

My rule of thumb is that if I can’t remember the name of the file where a function lives, I need to either separate the functions into more files or give the file a better name. (Unfortunately you can’t use subdirectories inside R/. The next best thing is to use a common prefix, e.g., abc-*.R.).

The arrangement of functions within files is less important if you master two important RStudio keyboard shortcuts that let you jump to the definition of a function:

  • Click a function name in code and press F2.

  • Press Ctrl + . then start typing the name:

After navigating to a function using one of these tools, you can go back to where you were by clicking the back arrow at the top-left of the editor (), or by pressing Ctrl/Cmd-F9.

Code style

Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. The following guide describes the style that I use (in this book and elsewhere). It is based on Google’s R style guide, with a few tweaks.

You don’t have to use my style, but I strongly recommend that you use a consistent style and you document it. If you’re working on someone else’s code, don’t impose your own style. Instead, read their style documentation and follow it as closely as possible.

Good style is important because while your code only has one author, it will usually have multiple readers. This is especially true when you’re writing code with others. In that case, it’s a good idea to agree on a common style up-front. Since no style is strictly better than another, working with others may mean that you’ll need to sacrifice some preferred aspects of your style.

The formatR package, by Yihui Xie, makes it easier to clean up poorly formatted code. It can’t do everything, but it can quickly get your code from terrible to pretty good. Make sure to read the notes on the website before using it. It’s as easy as:

install.packages("formatR")
formatR::tidy_dir("R")

A complementary approach is to use a code linter. Rather than automatically fixing problems, a linter just warns you about them. The lintr package by Jim Hester checks for compliance with this style guide and lets you know where you’ve missed something. Compared to formatR, it picks up more potential problems (because it doesn’t need to also fix them), but you will still see false positives.

install.packages("lintr")
lintr::lint_package()

Object names

Variable and function names should be lowercase. Use an underscore (_) to separate words within a name (reserve . for S3 methods). Camel case is a legitimate alternative, but be consistent! Generally, variable names should be nouns and function names should be verbs. Strive for names that are concise and meaningful (this is not easy!).

# Good
day_one
day_1

# Bad
first_day_of_the_month
DayOne
dayone
djm1

Where possible, avoid using names of existing functions and variables. This will cause confusion for the readers of your code.

# Bad
T <- FALSE
c <- 10
mean <- function(x) sum(x)

Spacing

Place spaces around all infix operators (=, +, -, <-, etc.). The same rule applies when using = in function calls. Always put a space after a comma, and never before (just like in regular English).

# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)

There’s a small exception to this rule: :, :: and ::: don’t need spaces around them. (If you haven’t seen :: or ::: before, don’t worry - you’ll learn all about them in namespaces.)

# Good
x <- 1:10
base::get

# Bad
x <- 1 : 10
base :: get

Place a space before left parentheses, except in a function call.

# Good
if (debug) do(x)
plot(x, y)

# Bad
if(debug)do(x)
plot (x, y)

Extra spacing (i.e., more than one space in a row) is ok if it improves alignment of equal signs or assignments (<-).

list(
  total = a + b + c, 
  mean  = (a + b + c) / n
)

Do not place spaces around code in parentheses or square brackets (unless there’s a comma, in which case see above).

# Good
if (debug) do(x)
diamonds[5, ]

# Bad
if ( debug ) do(x)  # No spaces around debug
x[1,]   # Needs a space after the comma
x[1 ,]  # Space goes after comma not before

Curly braces

An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else.

Always indent the code inside curly braces.

# Good

if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

# Bad

if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

It’s ok to leave very short statements on the same line:

if (y < 0 && debug) message("Y is negative")

Line length

Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.

Indentation

When indenting your code, use two spaces. Never use tabs or mix tabs and spaces. Change these options in the code preferences pane:

The only exception is if a function definition runs over multiple lines. In that case, indent the second line to where the definition starts:

long_function_name <- function(a = "a long argument", 
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}

Assignment

Use <-, not =, for assignment.

# Good
x <- 5
# Bad
x = 5

Commenting guidelines

Comment your code. Each line of a comment should begin with the comment symbol and a single space: #. Comments should explain the why, not the what.

Use commented lines of - and = to break up your file into easily readable chunks.

# Load data ---------------------------

# Plot data ---------------------------

Top-level code

Up until now, you’ve probably been writing scripts, R code saved in a file that you load with source(). There are two main differences between code in scripts and packages:

  • In a script, code is run when it is loaded. In a package, code is run when it is built. This means your package code should only create objects, the vast majority of which will be functions.

  • Functions in your package will be used in situations that you didn’t imagine. This means your functions need to be thoughtful in the way that they interact with the outside world.

The next two sections expand on these important differences.

Loading code

When you load a script with source(), every line of code is executed and the results are immediately made available. Things are different in a package, because it is loaded in two steps. When the package is built (e.g. by CRAN) all the code in R/ is executed and the results are saved. When you load a package, with library() or require(), the cached results are made available to you. If you loaded scripts in the same way as packages, your code would look like this:

# Load a script into a new environment and save it
env <- new.env(parent = emptyenv())
source("my-script.R", local = env)
save(envir = env, "my-script.Rdata")

# Later, in another R session
load("my-script.Rdata")

For example, take x <- Sys.time(). If you put this in a script, x would tell you when the script was source()d. But if you put that same code in a package, x would tell you when the package was built.

This means that you should never run code at the top-level of a package: package code should only create objects, mostly functions. For example, imagine your foo package contains this code:

library(ggplot2)

show_mtcars <- function() {
  qplot(mpg, wt, data = mtcars)
}

If someone tries to use it:

library(foo)
show_mtcars()

The code won’t work because ggplot2’s qplot() function won’t be available: library(foo) doesn’t re-execute library(ggplot2). The top-level R code in a package is only executed when the package is built, not when it’s loaded.

To get around this problem you might be tempted to do:

show_mtcars <- function() {
  library(ggplot2)
  qplot(mpg, wt, data = mtcars)
}

That’s also problematic, as you’ll see below. Instead, describe the packages your code needs in the DESCRIPTION file, as you’ll learn in package dependencies.

The R landscape

Another big difference between a script and a package is that other people are going to use your package, and they’re going to use it in situations that you never imagined. This means you need to pay attention to the R landscape, which includes not just the available functions and objects, but all the global settings. You have changed the R landscape if you’ve loaded a package with library(), or changed a global option with options(), or modified the working directory with setwd(). If the behaviour of other functions differs before and after running your function, you’ve modified the landscape. Changing the landscape is bad because it makes code much harder to understand.

There are some functions that modify global settings that you should never use because there are better alternatives:

  • Don’t use library() or require(). These modify the search path, affecting what functions are available from the global environment. It’s better to use the DESCRIPTION to specify your package’s requirements, as described in the next chapter. This also makes sure those packages are installed when your package is installed.

  • Never use source() to load code from a file. source() modifies the current environment, inserting the results of executing the code. Instead, rely on devtools::load_all() which automatically sources all files in R/. If you’re using source() to create a dataset, instead switch to data/ as described in datasets.

Other functions need to be used with caution. If you use them, make sure to clean up after yourself with on.exit():

  • If you modify global options() or graphics par(), save the old values and reset when you’re done:

    old <- options(stringsAsFactors = FALSE)
    on.exit(options(old), add = TRUE)
  • Avoid modifying the working directory. If you do have to change it, make sure to change it back when you’re done:

    old <- setwd(tempdir())
    on.exit(setwd(old), add = TRUE)
  • Creating plots and printing output to the console are two other ways of affecting the global R environment. Often you can’t avoid these (because they’re important!) but it’s good practice to isolate them in functions that only produce output. This also makes it easier for other people to repurpose your work for new uses. For example, if you separate data preparation and plotting into two functions, others can use your data prep work (which is often the hardest part!) to create new visualisations.

The flip side of the coin is that you should avoid relying on the user’s landscape, which might be different to yours. For example, functions like read.csv() are dangerous because the value of stringsAsFactors argument comes from the global option stringsAsFactors. If you expect it to be TRUE (the default), and the user has set it to be FALSE, your code might fail.

When you do need side-effects

Occasionally, packages do need side-effects. This is most common if your package talks to an external system — you might need to do some initial setup when the package loads. To do that, you can use two special functions: .onLoad() and .onAttach(). These are called when the package is loaded and attached. You’ll learn about the distinction between the two in Namespaces. For now, you should always use .onLoad() unless explicitly directed otherwise.

Some common uses of .onLoad() and .onAttach() are:

  • To display an informative message when the package loads. This might make usage conditions clear, or display useful tips. Startup messages is one place where you should use .onAttach() instead of .onLoad(). To display startup messages, always use packageStartupMessage(), and not message(). (This allows suppressPackageStartupMessages() to selectively suppress package startup messages).

    .onAttach <- function(libname, pkgname) {
      packageStartupMessage("Welcome to my package")
    }
  • To set custom options for your package with options(). To avoid conflicts with other packages, ensure that you prefix option names with the name of your package. Also be careful not to override options that the user has already set.

    I use the following code in devtools to set up useful options:

    .onLoad <- function(libname, pkgname) {
      op <- options()
      op.devtools <- list(
        devtools.path = "~/R-dev",
        devtools.install.args = "",
        devtools.name = "Your name goes here",
        devtools.desc.author = '"First Last <first.last@example.com> [aut, cre]"',
        devtools.desc.license = "What license is it under?",
        devtools.desc.suggests = NULL,
        devtools.desc = list()
      )
      toset <- !(names(op.devtools) %in% names(op))
      if(any(toset)) options(op.devtools[toset])
    
      invisible()
    }

    Then devtools functions can use e.g. getOption("devtools.name") to get the name of the package author, and know that a sensible default value has already been set.

  • To connect R to another programming language. For example, if you use rJava to talk to a .jar file, you need to call rJava::.jpackage(). To make C++ classes available as reference classes in R with Rcpp modules, you call Rcpp::loadRcppModules().

  • To register vignette engines with tools::vignetteEngine().

As you can see in the examples, .onLoad() and .onAttach() are called with two arguments: libname and pkgname. They’re rarely used (they’re a holdover from the days when you needed to use library.dynam() to load compiled code). They give the path where the package is installed (the “library”), and the name of the package.

If you use .onLoad(), consider using .onUnload() to clean up any side effects. By convention, .onLoad() and friends are usually saved in a file called zzz.R. (Note that .First.lib() and .Last.lib() are old versions of .onLoad() and .onUnload() and should no longer be used.)

S4 classes, generics and methods

Another type of side-effect is defining S4 classes, methods and generics. R packages capture these side-effects so they can be replayed when the package is loaded, but they need to be called in the right order. For example, before you can define a method, you must have defined both the generic and the class. This requires that the R files be sourced in a specific order. This order is controlled by the Collate field in the DESCRIPTION. This is described in more detail in documenting S4.

CRAN notes

(Each chapter will finish with some hints for submitting your package to CRAN. If you don’t plan on submitting your package to CRAN, feel free to ignore them!)

If you’re planning on submitting your package to CRAN, you must use only ASCII characters in your .R files. You can still include unicode characters in strings, but you need to use the special unicode escape "\u1234" format. The easiest way to do that is to use stringi::stri_escape_unicode():

x <- "This is a bullet •"
y <- "This is a bullet \u2022"
identical(x, y)
## [1] TRUE
cat(stringi::stri_escape_unicode(x))
## This is a bullet \u2022