3 Package structure and state

This chapter will start you on the road to package development by converting the implicit knowledge you’ve gained from using R packages into the explicit knowledge needed to create and modify them. You’ll learn about the various states a package can be in and the difference between a package and library (and why you should care).

3.1 Package states

When you create or modify a package, you work on its “source code” or “source files”. You interact with the in-development package in its source form. This is NOT the package form you are most familiar with from day-to-day usage. Package development workflows make much more sense if you understand the five states an R package can be in:

source
bundled
binary
installed
in-memory

You already know some of the functions that put packages into these states. For example, install.packages() can move a package from the source, bundled, or binary states into the installed state. devtools::install_github() takes a source package on GitHub and moves it into the installed state. The library() function loads an installed package into memory, making it available for immediate and direct use.

3.2 Source package

A source package is just a directory of files with a specific structure. It includes particular components, such as a DESCRIPTION file, an R/ directory containing .R files, and so on. Most of the remaining chapters in this book are dedicated to detailing these components.

If you are new to package development, you may have never seen a package in source form! You might not even have any source packages on your computer. The easiest way to see a package in source form right away is to browse around its code on the web.

Many R packages are developed in the open on GitHub (or GitLab or similar). The best case scenario is that you visit the package’s CRAN landing page, e.g.:

forcats: https://cran.r-project.org/package=forcats
readxl: https://cran.r-project.org/package=readxl

and one of its URLs links to a repository on a public hosting service, e.g.:

forcats: https://github.com/tidyverse/forcats
readxl: https://github.com/tidyverse/readxl

Some maintainers forget to list this URL, even though the package is developed in a public repository, but you still might be able to discover it via search.

Even if a package is not developed on a public platform, you can visit its source in the unofficial, read-only mirror maintained by R-hub. Examples:

MASS: https://github.com/cran/MASS
car: https://github.com/cran/car

Note that exploring a package’s source and history within the cran GitHub organisation is not the same as exploring the package’s true development venue, because this source and its evolution is just reverse-engineered from the package’s CRAN releases. This presents a redacted view of the package and its history, but, by definition, it includes everything that is essential.

3.3 Bundled package

A bundled package is a package that’s been compressed into a single file. By convention (from Linux), package bundles in R use the extension .tar.gz and are sometimes referred to as “source tarballs”. This means that multiple files have been reduced to a single file (.tar) and then compressed using gzip (.gz). While a bundle is not that useful on its own, it’s a platform-agnostic, transportation-friendly intermediary between a source package and an installed package.

In the rare case that you need to make a bundle from a package you’re developing locally, use devtools::build(). Under the hood, this calls pkgbuild::build() and, ultimately, R CMD build, which is described further in the Building package tarballs section of Writing R Extensions.

This should tip you off that a package bundle or “source tarball” is not simply the result of making a tar archive of the source files, then compressing with gzip. By convention, in the R world, a few more operations are carried out when making the .tar.gz file and this is why we’ve elected to refer to this form as a package bundle, in this book.

Every CRAN package is available in bundled form, via the “Package source” field of its landing page. Continuing our examples from above, you could download the bundles forcats_0.4.0.tar.gz and readxl_1.3.1.tar.gz (or whatever the current versions may be). You could unpack such a bundle in the shell (not the R console) like so:

tar xvf forcats_0.4.0.tar.gz

If you decompress a bundle, you’ll see it looks almost the same as a source package. Figure 3.1 shows the files present in the source, bundled, and binary forms of a fictional package named zzzpackage. We’ve deliberately crafted this example to include most of the package parts covered in this book. Not every package will include every file seen here, nor does this diagram include every possible file that might appear in a package.

Side-by-side comparison of source, bundled, and binary package. The flow of files (location and format) from package source to bundled to binary state is shown. This is described in detail in the appropriately named sections of this chapter. — Figure 3.1: Package forms: source vs. bundled vs. binary.

The main differences between a source package and an uncompressed bundle are:

Vignettes have been built, so rendered outputs, such as HTML, appear below inst/doc/ and a vignette index appears in the build/ directory.
A local source package might contain temporary files used to save time during development, like compilation artefacts in src/. These are never found in a bundle.
Any files listed in .Rbuildignore are not included in the bundle. These are typically files that facilitate your development process, but that should be excluded from the distributed product.

3.3.1 `.Rbuildignore`

You won’t need to contemplate the exact structure of package .tar.gz files very often, but you do need to understand the .Rbuildignore file. It controls which files from the source package make it into the downstream forms.

Each line of .Rbuildignore is a Perl-compatible regular expression that is matched, without regard to case, against the path to each file in the source package¹. If the regular expression matches, that file or directory is excluded. Note there are some default exclusions implemented by R itself, mostly relating to classic version control systems and editors, such as SVN, Git, and Emacs.

We usually modify .Rbuildignore with the usethis::use_build_ignore() function, which takes care of easy-to-forget details, such as regular expression anchoring and escaping. To exclude a specific file or directory (the most common use case), you MUST anchor the regular expression. For example, to exclude a directory called “notes”, the .Rbuildignore entry must be ^notes$, whereas the unanchored regular expression notes will match any file name containing “notes”, e.g. R/notes.R, man/important-notes.R, data/endnotes.Rdata, etc. We find that use_build_ignore() helps us get more of our .Rbuildignore entries right the first time.

.Rbuildignore is a way to resolve some of the tension between the practices that support your development process and CRAN’s requirements for submission and distribution (Chapter 22). Even if you aren’t planning to release on CRAN, following these conventions will allow you to make the best use of R’s built-in tooling for package checking and installation. The files you should .Rbuildignore fall into two broad, semi-overlapping classes:

Files that help you generate package contents programmatically. Examples:
- Using README.Rmd to generate an informative and current README.md (Section 18.1).
- Storing .R scripts to create and update internal or exported data (Section 7.1.1).
Files that drive package development, checking, and documentation, outside of CRAN’s purview. Examples:
- Files relating to the RStudio IDE (Section 4.2).
- Using the pkgdown package to generate a website (Chapter 19).
- Configuration files related to continuous integration/deployment (Section 20.2).

Here is a non-exhaustive list of typical entries in the .Rbuildignore file for a package in the tidyverse:

^.*\.Rproj$         # Designates the directory as an RStudio Project
^\.Rproj\.user$     # Used by RStudio for temporary files
^README\.Rmd$       # An Rmd file used to generate README.md
^LICENSE\.md$       # Full text of the license
^cran-comments\.md$ # Comments for CRAN submission
^data-raw$          # Code used to create data included in the package
^pkgdown$           # Resources used for the package website
^_pkgdown\.yml$     # Configuration info for the package website
^\.github$          # GitHub Actions workflows

Note that the comments above must not appear in an actual .Rbuildignore file; they are included here only for exposition.

We’ll mention when you need to add files to .Rbuildignore whenever it’s important. Remember that usethis::use_build_ignore() is an attractive way to manage this file. Furthermore, many usethis functions that add a file that should be listed in .Rbuildignore take care of this automatically. For example, use_readme_rmd() adds “^README\.Rmd$” to .Rbuildignore.

3.4 Binary package

If you want to distribute your package to an R user who doesn’t have package development tools, you’ll need to provide a binary package. The primary maker and distributor of binary packages is CRAN, not individual maintainers. But even if you delegate the responsibility of distributing your package to CRAN, it’s still important for a maintainer to understand the nature of a binary package.

Like a package bundle, a binary package is a single file. Unlike a bundled package, a binary package is platform specific and there are two basic flavors: Windows and macOS. (Linux users are generally required to have the tools necessary to install from .tar.gz files, although the emergence of resources like Posit Public Package Manager is giving Linux users the same access to binary packages as their colleagues on Windows and macOS.)

Binary packages for macOS are stored as .tgz, whereas Windows binary packages end in .zip. If you need to make a binary package, use devtools::build(binary = TRUE) on the relevant operating system. Under the hood, this calls pkgbuild::build(binary = TRUE) and, ultimately, R CMD INSTALL --build, which is described further in the Building binary packages section of Writing R Extensions. If you choose to release your package on CRAN (Chapter 22), you submit your package in bundled form, then CRAN creates and distributes the package binaries.

CRAN packages are usually available in binary form, for both macOS and Windows, for the current, previous, and (possibly) development versions of R. Continuing our examples from above, you could download binary packages such as:

forcats for macOS: forcats_0.4.0.tgz
readxl for Windows: readxl_1.3.1.zip

and this is, indeed, part of what’s usually going on behind the scenes when you call install.packages().

If you uncompress a binary package, you’ll see that the internal structure is rather different from a source or bundled package. Figure 3.1 includes this comparison, so this is a good time to revisit that diagram. Here are some of the most notable differences:

There are no .R files in the R/ directory - instead there are three files that store the parsed functions in an efficient file format. This is basically the result of loading all the R code and then saving the functions with save(). (In the process, this adds a little extra metadata to make things as fast as possible).
A Meta/ directory contains a number of .rds files. These files contain cached metadata about the package, like what topics the help files cover and a parsed version of the DESCRIPTION file. (You can use readRDS() to see exactly what’s in those files). These files make package loading faster by caching costly computations.
The actual help content appears in help/ and html/ (no longer in man/).
If you had any code in the src/ directory, there will now be a libs/ directory that contains the results of compiling the code.
If you had any objects in data/, they have now been converted into a more efficient form.
The contents of inst/ are moved to the top-level directory. For example, vignette files are now in doc/.
Some files and folders have been dropped, such as README.md, build/, tests/, and vignettes/.

3.5 Installed package

An installed package is a binary package that’s been decompressed into a package library (described in Section 3.7). Figure 3.2 illustrates the many ways a package can be installed, along with a few other functions for converting a package from one state to another. This diagram is complicated! In an ideal world, installing a package would involve stringing together a set of simple steps: source -> bundle, bundle -> binary, binary -> installed. In the real world, it’s not this simple because there are often (faster) shortcuts available.

A chart showing different ways to go from one package state to another: 1. library() puts an installed package into memory. 2. Functions such as install.packages(), devtools::install_github(), and devtools::install() can install a package starting variously in the source, bundle, or binary forms. 3. devtools::build() can create a bundle or a binary. 4. devtools::load_all() puts a source package into memory. — Figure 3.2: Many methods for converting between package states.

The built-in command line tool R CMD INSTALL powers all package installation. It can install a package from source files, a bundle (a.k.a. a source tarball), or a binary package. Details are available in the Installing packages section of R Installation and Administration. Just like with devtools::build(), devtools provides a wrapper function, devtools::install(), that makes this tool available from within an R session.

RStudio

RStudio can also help you install your in-development package via the Install and More drop-downs in the Build pane and with Install Package in the Build menu.

Most useRs understandably like to install packages from the comfort of an R session and directly from CRAN. The built-in function install.packages() meets this need. It can download the package, in various forms, install it, and optionally attend to the installation of dependencies.

There is a price, however, for the convenience of installing R packages from within an R session. As you might expect, it can be a bit tricky to re-install a package that is already in use in the current session. This actually works most of the time, but sometimes it does not, especially when installing an R package with compiled code on Windows. Due to how file handles are locked on Windows, an attempt to install a new version of a package that’s in use can result in a corrupt installation where the package’s R code has been updated, but its compiled code has not. When troubleshooting, Windows users should strive to install packages in a clean R session, with as few packages loaded as possible.

The pak package (https://pak.r-lib.org/) is a relative newcomer (at the time of writing) and provides a promising alternative to install.packages(), as well as other more specialized functions such as devtools::install_github(). It’s too early to make a blanket recommendation for using pak for all of your package installation needs, but we are certainly using it more and more in our personal workflows. One of pak’s flagship features is that it nicely solves the “locked DLL” problem described above, i.e. updating a package with compiled code on Windows. As you get deeper into package development, you will find yourself doing a whole new set of tasks, such as installing a dependency from an in-development branch or scrutinizing package dependency trees. pak provides a rich toolkit for this and many other related tasks. We predict that pak will soon become our official recommendation for how to install packages (and more).

However, in the meantime, we describe the status quo. devtools has long offered a family of install_*() functions to address some needs beyond the reach of install.packages() or to make existing capabilities easier to access. These functions are actually maintained in the remotes package and are re-exported by devtools. (Given what we said above, it is likely that remotes will essentially become superseded, in favor of pak, but we’re not quite there yet.)

library(remotes)

funs <- as.character(lsf.str("package:remotes"))
grep("^install_.+", funs, value = TRUE)
#>  [1] "install_bioc"      "install_bitbucket" "install_cran"     
#>  [4] "install_deps"      "install_dev"       "install_git"      
#>  [7] "install_github"    "install_gitlab"    "install_local"    
#> [10] "install_remote"    "install_svn"       "install_url"      
#> [13] "install_version"

install_github() is the most useful of these functions and is also featured in Figure 3.2. It is the flagship example of a family of functions that can download a package from a remote location that is not CRAN and do whatever is necessary to install it and its dependencies. The rest of the devtools/remotes install_*() functions are aimed at making things that are technically possible with base tooling a bit easier or more explicit, such as install_version() which installs a specific version of a CRAN package.

Analogous to .Rbuildignore, described in section Section 3.3.1, .Rinstignore lets you keep files present in a package bundle out of the installed package. However, in contrast to .Rbuildignore, this is rather obscure and rarely needed.

3.6 In-memory package

We finally arrive at a command familiar to everyone who uses R:

library(usethis)

Assuming usethis is installed, this call makes its functions available for use, i.e. now we can do:

create_package("/path/to/my/coolpackage")

The usethis package has been loaded into memory and, in fact, has also been attached to the search path. The distinction between loading and attaching packages is not important when you’re writing scripts, but it’s very important when you’re writing packages. You’ll learn more about the difference and why it’s important in Section 10.4.

library() is not a great way to iteratively tweak and test drive a package you’re developing, because it only works for an installed package. In Section 4.4, you’ll learn how devtools::load_all() accelerates development by allowing you to load a source package directly into memory.

3.7 Package libraries

We just discussed the library() function, whose name is inspired by what it does. When you call library(somepackage), R looks through the current libraries for an installed package named “somepackage” and, if successful, it makes somepackage available for use.

In R, a library is a directory containing installed packages, sort of like a library for books. Unfortunately, in the R world, you will frequently encounter confused usage of the words “library” and “package”. It’s common for someone to refer to dplyr, for example, as a library when it is actually a package. There are a few reasons for the confusion. First, R’s terminology arguably runs counter to broader programming conventions, where the usual meaning of “library” is closer to what we mean by “package”. The name of the library() function itself probably reinforces the wrong associations. Finally, this vocabulary error is often harmless, so it’s easy for R users to fall into the wrong habit and for people who point out this mistake to look like insufferable pedants. But here’s the bottom line:

We use the library() function to load ² a package.

The distinction between the two is important and useful as you get involved in package development.

You can have multiple libraries on your computer. In fact, many of you already do, especially if you’re on Windows. You can use .libPaths() to see which libraries are currently active. Here’s how this might look on Windows:

# on Windows
.libPaths()
#> [1] "C:/Users/jenny/AppData/Local/R/win-library/4.2"
#> [2] "C:/Program Files/R/R-4.2.2/library"

lapply(.libPaths(), list.dirs, recursive = FALSE, full.names = FALSE)
#> [[1]]
#>   [1] "abc"           "anytime"       "askpass"       "assertthat"   
#>  ...
#> [145] "zeallot"      
#> 
#> [[2]]
#>  [1] "base"         "boot"         "class"        "cluster"     
#>  [5] "codetools"    "compiler"     "datasets"     "foreign"     
#>  [9] "graphics"     "grDevices"    "grid"         "KernSmooth"  
#> [13] "lattice"      "MASS"         "Matrix"       "methods"     
#> [17] "mgcv"         "nlme"         "nnet"         "parallel"    
#> [21] "rpart"        "spatial"      "splines"      "stats"       
#> [25] "stats4"       "survival"     "tcltk"        "tools"       
#> [29] "translations" "utils"

Here’s a similar look on macOS (but your results may vary):

# on macOS
.libPaths()
#> [1] "/Users/jenny/Library/R/arm64/4.2/library"
#> [2] "/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library"

lapply(.libPaths(), list.dirs, recursive = FALSE, full.names = FALSE)
#> [[1]]
#>    [1] "abc"                  "abc.data"             "abind"                
#>  ...
#> [1033] "Zelig"                "zip"                  "zoo"                 
#> 
#> [[2]]
#>  [1] "base"         "boot"         "class"        "cluster"     
#>  [5] "codetools"    "compiler"     "datasets"     "foreign"     
#>  [9] "graphics"     "grDevices"    "grid"         "KernSmooth"  
#> [13] "lattice"      "MASS"         "Matrix"       "methods"     
#> [17] "mgcv"         "nlme"         "nnet"         "parallel"    
#> [21] "rpart"        "spatial"      "splines"      "stats"       
#> [25] "stats4"       "survival"     "tcltk"        "tools"       
#> [29] "translations" "utils"

In both cases we see two active libraries, consulted in this order:

A user library
A system-level or global library

This setup is typical on Windows, but is something you usually need to opt into on macOS and Linux³. With this setup, add-on packages installed from CRAN (or elsewhere) or under local development are kept in the user library. Above, the macOS system is used as a primary development machine and has many packages here (~1000), whereas the Windows system is only used occasionally and is much more spartan. The core set of base and recommended packages that ship with R live in the system-level library and are the same on all operating systems. This separation appeals to many developers and makes it easy to, for example, clean out your add-on packages without disturbing your base R installation.

If you’re on macOS or Linux and only see one library, there is no urgent need to change anything. But next time you upgrade R, consider creating a user-level library. By default, R looks for a user library found at the path stored in the environment variable R_LIBS_USER, which itself defaults to ~/Library/R/m/x.y/library, on macOS, and ~/R/m-library/x.y on Linux (where m is a concise description of your CPU architecture, and x.y is the R version). You can see this path with Sys.getenv("R_LIBS_USER"). These directories do not exist by default, and the use of them must be enabled by creating the directory. When you install a new version of R, and prior to installing any add-on packages, use dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE) to create a user library in the default location. Now you will have the library setup seen above. Alternatively, you could set up a user library elsewhere and tell R about that by setting the R_LIBS_USER environment variable in .Renviron. The simplest way to edit your .Renviron file is with usethis::edit_r_environ(), which will create the file if it doesn’t exist, and open it for editing.

The filepaths for these libraries also make it clear they are associated with a specific version of R (4.2.x at the time of writing), which is also typical. This reflects and enforces the fact that you need to reinstall your add-on packages when you update R from, say, 4.1 to 4.2, which is a change in the minor version. You generally do not need to re-install add-on packages for a patch release, e.g., going from R 4.2.1 to 4.2.2.

As your R usage grows more sophisticated, it’s common to start managing package libraries with more intention. For example, tools like renv (and its predecessor packrat) automate the process of managing project-specific libraries. This can be important for making data products reproducible, portable, and isolated from one another. A package developer might prepend the library search path with a temporary library, containing a set of packages at specific versions, in order to explore issues with backwards and forwards compatibility, without affecting other day-to-day work. Reverse dependency checks are another example where we explicitly manage the library search path.

Here are the main levers that control which libraries are active, in order of scope and persistence:

Environment variables, like R_LIBS and R_LIBS_USER, which are consulted at startup.
Calling .libPaths() with one or more filepaths.
Executing small snippets of code with a temporarily altered library search path via withr::with_libpaths().
Arguments to individual functions, like install.packages(lib =) and library(lib.loc =).

Finally, it’s important to note that library() should NEVER be used inside a package. Packages and scripts rely on different mechanisms for declaring their dependencies and this is one of the biggest adjustments you need to make in your mental model and habits. We explore this topic fully in Section 9.6 and Chapter 11.

To see the set of filepaths that should be on your radar, execute dir(full.names = TRUE, recursive = TRUE, include.dirs = TRUE, all.files = TRUE) in the package’s top-level directory.↩︎
Well, actually, library() loads and attaches a package, but that’s a topic for Section 10.4.↩︎
For more details, see the Maintaining R section in What They Forgot To Teach You About R, Managing Libraries in R Installation and Administration and the R help files for ?Startup and ?.libPaths.↩︎

3.1 Package states

3.2 Source package

3.3 Bundled package

3.3.1 .Rbuildignore

3.4 Binary package

3.5 Installed package

3.6 In-memory package

3.7 Package libraries

3.3.1 `.Rbuildignore`