R packages by Hadley Wickham


Package metadata

The job of the DESCRIPTION file is to store important metadata about your package. When you first start with packages, you’ll mostly use this metadata to record what packages are needed to run your package. As time goes by and you start sharing your package with others, the metadata becomes increasingly important because it also lays out who can use it (the license), and who to contact (you!) if there are any problems.

Every package must have a DESCRIPTION. In fact, it’s the distinguishing feature of a package: RStudio and devtools consider any directory containing DESCRIPTION to be a package. To get you started, devtools::create("mypackage") automatically adds a minimal description. This allows you to start a package without having to worry about the metadata until you need to. The minimal description will vary a bit depending on your settings, but should look something like this:

Package: mypackage
Title: What the package does (one line)
Version: 0.1
Authors@R: "First Last <first.last@example.com> [aut, cre]"
Description: What the package does (one paragraph)
Depends: R (>= 3.1.0)
License: What license is it under?
LazyData: true

(If you’re creating a lot of packages, you can set the global options devtools.desc.author, devtools.desc.license, devtools.desc.suggests, and devtools.desc to modify the defaults. See package?devtools for more details.)

The DESCRIPTION uses a simple file format called DCF, the Debian control format. You can see most of the structure in this simple example. Each line consists of a field name and a value, separated by a colon. When values span multiple lines, they need to be indented:

Description: The description of a package is usually long,
    spanning multiple lines. The second and subsequent lines
    should be indented, usually with four spaces.
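To see how R itself reads this format, here is a minimal sketch using base R's read.dcf(). The field values and the temporary file are made up for illustration; only read.dcf() itself is part of R.

```r
# Write a small DESCRIPTION-style file to a temporary location
desc_path <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: mypackage",
  "Version: 0.1",
  "Description: The description of a package is usually long,",
  "    spanning multiple lines. The second and subsequent lines",
  "    should be indented, usually with four spaces."
), desc_path)

# read.dcf() returns a character matrix with one column per field
desc <- read.dcf(desc_path)
colnames(desc)
desc[1, "Package"]

# The indented continuation lines are folded into the Description value
desc[1, "Description"]
```

If the continuation lines were not indented, read.dcf() could not tell them apart from new fields, which is why the indentation matters.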

The minimal DESCRIPTION provides the bare necessities, but doesn’t include the two most useful fields:

  • Imports and Suggests: tell R that you need (for imports) or want (for suggests) additional packages to be available.

The other fields are described in the remainder of the chapter:

  • Package: what your package is called.

  • Title and Description: what your package does.

  • License: who’s allowed to use and distribute it

  • Authors@R: who wrote it

What does the package need?

It’s the job of the DESCRIPTION to list the other packages that your package needs to work. R has a rich set of ways of describing potential dependencies. For example, the following lines indicate that your package needs both ggvis and dplyr to work:

Imports:
    dplyr,
    ggvis

Whereas these lines mean that your package can take advantage of ggvis and dplyr, but they’re not required to make it work:

Suggests:
    dplyr,
    ggvis

Both imports and suggests take a comma-separated list of package names. I recommend putting one package on each line, and keeping them in alphabetical order. That makes it easy to skim the list.

Imports and suggests differ in the strength of dependency:

  • Imports: these packages must be installed for your package to work. Any time your package is installed, these packages will also be installed if they’re not already present. (devtools::load_all() also checks that the packages are installed.)

    Note that adding a package dependency makes sure it’s installed, but does not automatically load it with library(x). Instead, it’s best practice to explicitly refer to external functions using the syntax package::function(). This makes it very easy to identify functions that live outside your package when reading your code in the future.

    If you use a lot of functions from other packages this is rather verbose, and there’s a minor performance penalty associated with :: (on the order of 5µs, so it will only matter if you’re calling the function millions of times). You’ll learn about alternatives in namespace imports.

  • Suggests: your package can take advantage of these packages if they’re installed. Maybe they provide datasets for examples, or they’re only used by one function in your package, or only used by tests or to build vignettes.

    Packages listed in Suggests are not automatically installed along with your package. This means that you need to check if the package is available (with requireNamespace(x, quietly = TRUE)) before using it. There are two basic scenarios:

    # You need the suggested package for this function    
    my_fun <- function(a, b) {
      if (!requireNamespace("pkg", quietly = TRUE)) {
        stop("Pkg needed for this function to work. Please install it.",
          call. = FALSE)
      }
    }
    
    # There's a fallback method if the package isn't available
    my_fun <- function(a, b) {
      if (requireNamespace("pkg", quietly = TRUE)) {
        pkg::f()
      } else {
        g()
      }
    }

When developing packages locally, you never need to use suggests. When releasing your package, using suggests is a courtesy to your users. It frees them from downloading rarely needed packages, and lets them get started with your package as quickly as possible.

The easiest way to add imports and suggests to your package is to use devtools::use_package(). This automatically adds them in the right place in your DESCRIPTION, and reminds you how to use them.

devtools::use_package("dplyr") # Defaults to imports
#> Adding dplyr to Imports
#> Refer to functions with dplyr::fun()
devtools::use_package("dplyr", "Suggests")
#> Adding dplyr to Suggests
#> Use requireNamespace("dplyr", quietly = TRUE) to test if package is 
#>  installed, then use dplyr::fun() to refer to functions.

Versioning

If you need a specific version of a package, you can specify it in parentheses after the package name:

Imports:
    ggvis (>= 0.2),
    dplyr (>= 0.3.0.1)
Suggests:
    MASS (>= 7.3.0)

You almost always want to specify a minimum version rather than an exact version (MASS (== 7.3.0)). Since R can’t have multiple versions of the same package loaded at the same time, specifying an exact dependency dramatically increases the chance of conflicting versions that can’t be resolved.

Versioning is mostly important when you release your package. Usually people don’t have exactly the same versions of packages installed that you do. If someone has an older package that doesn’t have a function you need, they’ll get an unhelpful error message. If you supply the version number, they’ll get an error message that tells them exactly what the problem is: a package is out of date.

Generally, it’s better to be conservative about version specifications, and always supply them. Unless you know otherwise, always require a version greater than or equal to the version you’re currently using.
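Version requirements rely on R comparing version numbers component by component rather than as text. The following sketch shows that comparison using base R; has_min_version() is a hypothetical helper written for this example, not a devtools function.

```r
# Versions compare numerically, component by component, not alphabetically
package_version("0.10") > package_version("0.9")
package_version("0.3.0.1") >= package_version("0.3")

# compareVersion() returns -1, 0 or 1; 0 means the versions are equal
utils::compareVersion("7.3.0", "7.3.0")

# Hypothetical helper: is `pkg` installed at `min_version` or later?
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= package_version(min_version)
}

# stats ships with R, so it is always installed
has_min_version("stats", "3.0.0")
```

A check like this is also handy for suggested packages, where you want to confirm both that the package is installed and that it is recent enough before calling it.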

Other dependencies

There are three other fields that allow you to express more specialised dependencies:

  • Depends: use this if your package requires a specific version of R to work. For example, Depends: R (>= 3.0.1). As with packages, it’s a good idea to play it safe and set to the version of R that you’re currently using. devtools::create() does this for you.

Prior to the rollout of namespaces in R 2.14.0, depends was the only way to “depend” on another package. Now, despite the name, you should almost always use imports, not depends. You’ll learn why, and when you should still use depends, in namespaces.

In R 3.1.1 and earlier you’ll also need to use Depends: methods if you use S4. This bug is fixed in R 3.2.0, so methods can go back to Imports once R 3.2.0 is your minimum R version.

  • LinkingTo: use this if your package needs to link to or compile against the C code included in another package. You’ll learn more about LinkingTo in compiled code.

  • Enhances: these packages are “enhanced” by your package, typically because you provide methods for classes defined in the package. It’s a sort of reverse suggests. But it’s hard to define what that means, so I don’t recommend using enhances.

You can also list things that your package needs outside of R in the SystemRequirements field. But this is just a plain text field and is not automatically checked. Think of it as a quick reference; you’ll also need to include detailed system requirements (and how to install them) in your README.
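The comparison expressed by Depends: R (>= 3.0.1) can be tried interactively with getRversion(). This is only a sketch of the idea; the error message is made up, and a real package would rely on the DESCRIPTION field rather than a runtime check.

```r
# getRversion() returns the running R version as a comparable object
getRversion()
getRversion() >= "3.0.1"

# A runtime guard equivalent in spirit to Depends: R (>= 3.0.1)
if (getRversion() < "3.0.1") {
  stop("This code needs R 3.0.1 or later.", call. = FALSE)
}
```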

Exercises

  • What are the dependencies of ggplot2?

  • What does devtools::revdep() do? Why might you use it?

Naming your package

The Package field gives the name of the package, which should be the same as the directory name (and the RStudio project file). For me, the hardest thing about creating a new package is often coming up with a good name. There’s only one formal requirement: the package name can only consist of letters, numbers and . (and it must start with a letter and cannot end with a period). Unfortunately this means you can’t use - or _ in your package name. I recommend against using . in package names because the other connotations (i.e., file extension or S3 method) are confusing.
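The formal requirement can be written as a regular expression. The function below is a hypothetical helper for illustration, and its pattern also requires at least two characters, which “Writing R Extensions” asks for but the rule above doesn't mention.

```r
# Hypothetical helper: does `x` satisfy the formal naming rule?
# Starts with a letter, contains only ASCII letters, digits and dots,
# and ends with a letter or digit (so never with a period).
is_valid_pkg_name <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", x)
}

is_valid_pkg_name(c("ggplot2", "data.table"))          # both valid
is_valid_pkg_name(c("my_pkg", "my-pkg", "2fast", "pkg."))  # all invalid
```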

If you’re planning on releasing your package, I think it’s worth spending a few minutes to come up with a good name. I have two recommendations:

  • Pick a unique name so you can easily google it. This makes it easy for potential users to find your package (and associated resources), and it makes it easier for you to see who’s using it.

  • Avoid using both upper and lower case letters: they make the package name hard to type and hard to remember. For example, I can never remember if it’s Rgtk2 or RGTK2 or RGtk2.

Some strategies I’ve used in the past to create package names:

  • Find a name evocative of the problem and modify it so that it’s unique: plyr (generalisation of apply tools), lubridate (makes dates and times easier), mutatr (mutable objects), classifly (high-dimensional views of classification).

  • Use abbreviations: lvplot (letter value plots), meifly (models explored interactively).

  • Add an extra R: stringr (string processing), tourr (grand tours), httr (HTTP requests).

Other package names I particularly like are:

  • knitr: “the package name knitr was coined with weave in mind, and it also aims to be neater.”

  • analogsea, an R package that talks to the DigitalOcean API.

What does the package do?

The title and description fields describe what the package does. They differ only in length:

  • Title is a one line description of the package, and is often shown in package listings. It should be plain text (no markup), be capitalised like a sentence, but not end in a period. Keep it short: listings will often truncate the title to 65 characters.

  • Description is more detailed: you can use multiple sentences, but still only one paragraph. If your description spans multiple lines (and it should!), keep each line at most 80 characters wide, and indent subsequent lines with 4 spaces.

The Title and Description for ggplot2 are:

Title: An implementation of the Grammar of Graphics
Description: An implementation of the grammar of graphics in R. It combines 
    the advantages of both base and lattice graphics: conditioning and shared 
    axes are handled automatically, and you can still build up a plot step 
    by step from multiple data sources. It also implements a sophisticated 
    multidimensional conditioning system and a consistent interface to map
    data to aesthetic attributes. See the ggplot2 website for more information, 
    documentation and examples.
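The length and indentation guidelines are easy to check and apply with base R. This sketch reuses part of the ggplot2 text above; strwrap() does the wrapping, and the specific checks are illustrative rather than what CRAN actually runs.

```r
title <- "An implementation of the Grammar of Graphics"

# Title guidelines: at most 65 characters, no trailing period
nchar(title) <= 65
!grepl("\\.$", title)

description <- paste(
  "An implementation of the grammar of graphics in R. It combines",
  "the advantages of both base and lattice graphics: conditioning and",
  "shared axes are handled automatically, and you can still build up",
  "a plot step by step from multiple data sources."
)

# Wrap to under 80 characters, indenting continuation lines by 4 spaces
lines <- strwrap(paste("Description:", description), width = 80, exdent = 4)
writeLines(lines)
```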

A good title and description are important if you plan to release your package to CRAN, because they’re shown on the CRAN download page.

Even the description only provides a small amount of space to describe what your package does, so I recommend also including a README.md file that goes into much more depth and shows a few examples. You’ll learn about that in README.md.

Exercises

  • Read the title and description of the packages that you use most commonly. What works well? What could be done better?

Who wrote the package?

To describe who wrote the package, and who to contact if something goes wrong, use the Authors@R field. This field is unusual because it contains executable R code rather than plain text. Here’s an example:

Authors@R: person("Hadley", "Wickham", email = "hadley@rstudio.com",
  role = c("aut", "cre"))

person("Hadley", "Wickham", email = "hadley@rstudio.com", 
  role = c("aut", "cre"))
#> [1] "Hadley Wickham <hadley@rstudio.com> [aut, cre]"

This command says that the author (aut) and maintainer (cre) of the package is Hadley Wickham, whose email address is hadley@rstudio.com. The person() function has four main arguments:

  • The name, specified by the first two arguments, given and family (these are normally supplied by position, not name). In English cultures given is the first name and family is the last name, but this convention differs between cultures.

  • The email address.

  • A three letter code specifying the role. There are four important roles:

    • cre: the package maintainer (creator), the person you should bother if you have problems.

    • aut: full authors who have contributed much to the package.

    • ctb: people who have made smaller contributions, like patches.

    • cph: copyright holder. This is used if copyright is held by someone other than the author, typically a company (their employer).

    (The full list of roles is extremely comprehensive. Should your package have a woodcutter (“wdc”), lyricist (“lyr”) or costume designer (“cst”), rest comfortably that you can correctly describe their role in creating your package.)

If you need additional clarification, you can also use the comment argument to supply additional arbitrary text.
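Because person() is an ordinary R function, you can experiment with it at the console before putting the call in DESCRIPTION. In this sketch the comment text is invented for illustration.

```r
hadley <- person(
  "Hadley", "Wickham",
  email = "hadley@rstudio.com",
  role = c("aut", "cre"),
  comment = "Original author"  # hypothetical comment text
)

# Individual pieces can be pulled out with $
hadley$given
hadley$family
hadley$role
hadley$comment

# format() produces the one-line text form
format(hadley)
```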

You can list multiple authors with c():

Authors@R: c(
    person("Hadley", "Wickham", email = "hadley@rstudio.com", role = "cre"),
    person("Winston", "Chang", email = "winston@rstudio.com", role = "aut"))

Alternatively, you can specify them a little more concisely by using as.person():

Authors@R: as.person(c(
    "Hadley Wickham <hadley@rstudio.com> [aut, cre]", 
    "Winston Chang <winston@rstudio.com> [aut]"
  ))

(This only works well for names with only one first and last name.)

Every package must have at least one author (aut) and one maintainer (cre) (they might be the same person). The creator must have an email address. These fields are used to generate the basic citation for a package (e.g. citation("pkgname")). Only people listed as authors will be included in the autogenerated citation. There are a few extra details if you’re including code that other people have written. Since this most commonly occurs when you’re wrapping a C library, it’s discussed in compiled code.
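These requirements can be checked directly on a person vector. The has_role() helper below is hypothetical, written for this sketch; it is not part of R or devtools.

```r
authors <- c(
  person("Hadley", "Wickham", email = "hadley@rstudio.com",
    role = c("aut", "cre")),
  person("Winston", "Chang", email = "winston@rstudio.com", role = "aut")
)

# Hypothetical helper: which people in `people` hold the given role?
has_role <- function(people, role) {
  vapply(seq_along(people), function(i) role %in% people[i]$role, logical(1))
}

is_aut <- has_role(authors, "aut")
is_cre <- has_role(authors, "cre")

# Every maintainer needs an email address
cre_has_email <- vapply(
  which(is_cre),
  function(i) !is.null(authors[i]$email),
  logical(1)
)

# Errors if there is no author, no maintainer, or a maintainer lacks an email
stopifnot(any(is_aut), any(is_cre), all(cre_has_email))

format(authors[is_cre])
```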

As well as your email address, it’s also a good idea to list other resources available for help. You can list URLs in the URL field; multiple URLs can be separated with a comma. BugReports takes a URL where bug reports should be submitted. For example, knitr has:

URL: http://yihui.name/knitr/
BugReports: https://github.com/yihui/knitr/issues

You can also use separate Maintainer and Author fields to describe authors and maintainers. I prefer not to use these fields because Authors@R offers richer metadata.

On CRAN

The most important thing to note is that your email address (i.e., the address of cre) is the address that CRAN will use to contact you about your package, so make sure you use an email address that’s likely to be around for a while. This address will be used for automated mailings, so the CRAN policies require that this be for a single person (not a mailing list), and it cannot require any confirmation or use any filtering.

Who can use it?

The License field can be either a standard abbreviation for an open source license, like GPL-2 or BSD, or a pointer to a file containing more information, file LICENSE. The license is only really important if you’re planning on releasing your package. If you’re not, you can ignore this section. If you want to make it clear that your package is not open source, use License: file LICENSE and then create a file called LICENSE, containing (e.g.):

Proprietary 

Do not distribute outside of Widgets Incorporated.

Open source software licensing is a rich and complex field. Fortunately, in my opinion, there are only three licenses that you need to consider for your R package:

  • MIT (very similar to the BSD 2-clause and 3-clause licenses): this is a simple and permissive license. It lets people use your code and freely distribute it subject to only one restriction: the license must always be distributed with the code.

    The MIT license is a “template”, so if you use it, you need License: MIT + file LICENSE, and a LICENSE file that looks like this:

    YEAR: <Year or years when changes have been made>
    COPYRIGHT HOLDER: <Name of the copyright holder>

  • GPL-2 or GPL-3: these are “copy-left” licenses, which means that anyone who distributes your code in a bundle must license the whole bundle in a GPL-compatible way. Additionally, anyone who distributes modified versions of your code (derivative works) must also make the source code available. GPL-3 is a little stricter than GPL-2, closing some older loopholes.

  • CC0: it relinquishes all your rights on the code and data so that it can be freely used by anyone for any purpose. This is sometimes called putting it in the public domain, although that term is not well-defined, and not meaningful in all countries.

    This license is most appropriate for data packages. Data, at least in the US, is not copyrightable anyway, so you’re not really giving up much. This license just makes it clear.

If you’d like to learn about other common licenses, GitHub’s choosealicense.com is a good place to start. Another good resource is https://tldrlegal.com/, which explains the most important parts of each license. If you use a license other than the three I suggest, also make sure to consult the “Writing R Extensions” section on licensing.

If your package includes code that you didn’t write, you need to make sure you’re in compliance with its license. Since this occurs most commonly when you’re including C source code, it’s discussed in more detail in compiled code.

On CRAN

If you want to release your package to CRAN, you must pick a standard license. Otherwise it’s difficult for CRAN to determine whether or not it’s legal for them to distribute your package! A complete list of valid licenses for CRAN can be found at https://svn.r-project.org/R/trunk/share/licenses/license.db.

Other components

A number of other fields are described elsewhere in the book:

  • Collate controls the order in which R files are sourced. This only matters if your code has side-effects, most commonly because you’re using S4. This is described in more depth in documenting S4.

  • The Version number is most important when releasing your package. See version numbers for more details.

  • LazyData makes it easier to access data in your package. It’s included in the minimal description because it’s so important and is described in external data.

There are even more fields that are rarely (if ever) used. A complete list can be found in “The DESCRIPTION file” section of the R extensions manual. You can also use your own fields to add additional arbitrary metadata. The only restriction is that you shouldn’t use existing names, and if you plan to submit to CRAN the names should be valid English words (so a spell-checking NOTE isn’t generated).