R packages by Hadley Wickham


Vignettes: long-form documentation

A vignette is a long-form guide to your package. Function documentation is great if you know the name of the function you need, but it’s useless otherwise. A vignette is like a book chapter or an academic paper: it can describe the problem that your package is designed to solve, and then show the reader how to solve it. A vignette should divide functions into useful categories, and demonstrate how to coordinate multiple functions to solve problems. Vignettes are also useful if you want to explain the details of your package. For example, if you have implemented a complex statistical algorithm, you might want to describe all the details in a vignette so that users of your package can understand what’s going on under the hood, and be confident that you’ve implemented the algorithm correctly.

Many existing packages have vignettes. You can see all the installed vignettes with browseVignettes(). To see the vignette for a specific package, use the argument, browseVignettes("packagename"). Each vignette provides three things: the original source file, a readable HTML page or PDF, and a file of R code. You can read a specific vignette with vignette(x), and see its code with edit(vignette(x)). To see vignettes for a package you haven’t installed, look at its CRAN page, e.g., http://cran.r-project.org/web/packages/dplyr.

Before R 3.0.0, the only way to create a vignette was with Sweave. This was challenging because Sweave only worked with LaTeX, and LaTeX is both hard to learn and slow to compile. Now, any package can provide a vignette engine, a standard interface for turning input files into HTML or PDF vignettes. In this chapter, we’re going to use the R markdown vignette engine provided by knitr. I recommend this engine because:

  • You write in Markdown, a plain text formatting system. Markdown is limited compared to LaTeX, but this limitation is good because it forces you to focus on the content.

  • It can intermingle text, code and results (both textual and visual).

  • Your life is further simplified by the rmarkdown package, which coordinates Markdown and knitr by using pandoc to convert Markdown to HTML and by providing many useful templates.

Switching from Sweave to R Markdown had a profound impact on my use of vignettes. Previously, making a vignette was painful and slow and I rarely did it. Now, vignettes are an essential part of my packages. I use them whenever I need to explain a complex topic, or to show how to solve a problem with multiple steps.

Currently, the easiest way to get R Markdown is to use RStudio. RStudio will automatically install all of the needed prerequisites. If you don’t use RStudio, you’ll need to:

  1. Install the rmarkdown package with install.packages("rmarkdown").

  2. Install pandoc.

Vignette workflow

To create your first vignette, run:

devtools::use_vignette("my-vignette")

This will:

  1. Create a vignettes/ directory.

  2. Add the necessary dependencies to DESCRIPTION (i.e. it adds knitr to the Suggests and VignetteBuilder fields).

  3. Draft a vignette, vignettes/my-vignette.Rmd.

The draft vignette has been designed to remind you of the important parts of an R Markdown file. It serves as a useful reference when you’re creating a new vignette.

Once you have this file, the workflow is straightforward:

  1. Modify the vignette.

  2. Press Ctrl/Cmd + Shift + K (or click ) to knit the vignette and preview the output.

There are three important components to an R Markdown vignette:

  • The initial metadata block.
  • Markdown for formatting text.
  • Knitr for intermingling text, code and results.

These are described in the following sections.

Metadata

The first few lines of the vignette contain important metadata. The default template contains the following information:

---
title: "Vignette Title"
author: "Vignette Author"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

This metadata is written in yaml, a format designed to be both human and computer readable. The basics of the syntax is much like the DESCRIPTION file, where each line consists of a field name, a colon, then the value of the field. The one special YAML feature we’re using here is >. It indicates the following lines of text are plain text and shouldn’t use any special YAML features.

The fields are:

  • Title, author and date: this is where you put the vignette’s title, author and date. You’ll want to fill these in yourself (you can delete them if you don’t want the title block at the top of the page). The date is filled in by default: it uses a special knitr syntax (explained below) to insert today’s date.

  • Output: this tells rmarkdown which output formatter to use. There are many options that are useful for regular reports (including html, pdf, slideshows, …) but rmarkdown::html_vignette has been specifically designed to work well inside packages. See ?rmarkdown::html_vignette for more details.

  • Vignette: this contains a special block of metadata needed by R. Here, you can see the legacy of LaTeX vignettes: the metadata looks like LaTeX commands. You’ll need to modifiy the \VignetteIndexEntry to provide the title of your vignette as you’d like it to appear in the vignette index. Leave the other two lines as is. They tell R to use knitr to process the file, and that the file is encoded in UTF-8 (the only encoding you should ever use to write vignettes).

Markdown

R Markdown vignettes are written in Markdown, a light weight markup language. John Gruber, the author of Markdown, summarises the goals and philosophy of Markdown:

Markdown is intended to be as easy-to-read and easy-to-write as is feasible.

Readability, however, is emphasized above all else. A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. While Markdown’s syntax has been influenced by several existing text-to-HTML filters — including Setext, atx, Textile, reStructuredText, Grutatext, and EtText — the single biggest source of inspiration for Markdown’s syntax is the format of plain text email.

To this end, Markdown’s syntax is comprised entirely of punctuation characters, which punctuation characters have been carefully chosen so as to look like what they mean. E.g., asterisks around a word actually look like emphasis. Markdown lists look like, well, lists. Even blockquotes look like quoted passages of text, assuming you’ve ever used email.

Markdown isn’t as powerful as LaTeX, reStructuredText or docbook, but it’s simple, easy to write, and easy to read even when it’s not rendered. I find Markdown’s constraints helpful for writing because it lets me focus on the content, and prevents me from messing around with the styling.

If you’ve never used Markdown before, a good place to start is John Gruber’s Markdown syntax documentation. Pandoc’s implementation of Markdown rounds off some of the rough edges and adds a number of new features, so I also recommend familiarising yourself with the pandoc readme. When editing a Markdown document, RStudio presents a drop-down menu via the question mark icon, which offers a Markdown reference card.

The sections below show you what I think are the most important features of pandoc’s Markdown dialect. You should be able to learn the basics in under 15 minutes.

Sections

Headings are identified by #:

# Heading 1
## Heading 2
### Heading 3

Create a horizontal rule with three or more hyphens (or asterisks):

--------
********

Lists

Basic unordered lists use *:

* Bulleted list
* Item 2
    * Nested bullets need a 4-space indent.
    * Item 2b

If you want multiparagraph lists, the second and subsequent paragraphs need additional indenting:

  * It's possible to put multiple paragraphs of text in a list item. 

    But to do that, the second and subsequent paragraphs must be
    indented by four or more spaces. It looks better if the first
    bullet is also indented.

Ordered lists use: 1.:

1. Item 1
1. Item 2
1. Items are numbered automatically, even though they all start with 1.

You can intermingle ordered and bulleted lists, as long as you adhere to the four space rule:

1.  Item 1.
    *  Item a
    *  Item b
1.  Item 2.

Definition lists use :

Definition
  : a statement of the exact meaning of a word, especially in a dictionary.
List 
  : a number of connected items or names written or printed consecutively, 
    typically one below the other. 
  : barriers enclosing an area for a jousting tournament.

Inline formatting

Inline format is similarly simple:

_italic_ or *italic*
__bold__ or **bold**    
[link text](destination)
<http://this-is-a-raw-url.com>

Tables

There are four types of tables. I recommend using the pipe table which looks like this:

| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |

Notice the use of the : in the spacer under the heading. This determines the alignment of the column.

If the data underlying your table exists in R, don’t lay it out by hand. Instead use knitr::kable(), or look at printr or pander.

Code

For inline code use `code`.

For bigger blocks of code, use ```. These are known as “fenced” code blocks:

```
# A comment
add <- function(a, b) a + b
```

To add syntax highlighting to the code, put the language name after the backtick:

```c
int add(int a, int b) {
  return a + b;
}
```

(At time of printing, languages supported by pandoc were: actionscript, ada, apache, asn1, asp, awk, bash, bibtex, boo, c, changelog, clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css, curry, d, diff, djangotemplate, doxygen, doxygenlua, dtd, eiffel, email, erlang, fortran, fsharp, gnuassembler, go, haskell, haxe, html, ini, java, javadoc, javascript, json, jsp, julia, latex, lex, literatecurry, literatehaskell, lua, makefile, mandoc, matlab, maxima, metafont, mips, modula2, modula3, monobasic, nasm, noweb, objectivec, objectivecpp, ocaml, octave, pascal, perl, php, pike, postscript, prolog, python, r, relaxngcompact, rhtml, ruby, rust, scala, scheme, sci, sed, sgml, sql, sqlmysql, sqlpostgresql, tcl, texinfo, verilog, vhdl, xml, xorg, xslt, xul, yacc, yaml. Syntax highlighting is done by the haskell package highlighting-kate; see the website for current list.)

When you include R code in your vignette, you usually won’t use ```r. Instead, you’ll use ```{r}, which is specially processed by knitr, as described next.

Knitr

Knitr allows you to intermingle code, results and text. Knitr takes R code, runs it, captures the output, and translates it into formatted Markdown. Knitr captures all printed output, messsages, warnings, errors (optionally) and plots (basic graphics, lattice & ggplot and more).

Consider the simple example below. Note that a knitr block looks similar to a fenced code block, but instead of using r, you using {r}.

```{r}
# Add two numbers together
add <- function(a, b) a + b
add(10, 20)
```

This generates the following Markdown:

```r
# Add two numbers together
add <- function(a, b) a + b
add(10, 20)
## [1] 30
```

Which, in turn, is rendered as:

# Add two numbers together
add <- function(a, b) a + b
add(10, 20)
## 30

Once you start using knitr, you’ll never look back. Because your code is always run when you build the vignette, you can rest assured knowing that all your code works. There’s no way for your input and output to be out of sync.

Options

You can specify additional options to control the rendering:

  • To affect a single block, add the block settings:

    ```{r, opt1 = val1, opt2 = val2}
    # code
    ```
  • To affect all blocks, call knitr::opts_chunk$set() in a knitr block:

    ```{r, echo = FALSE}
    knitr::opts_chunk$set(
      opt1 = val1,
      opt2 = val2
    )
    ```

The most important options are:

  • eval = FALSE prevents evaluation of the code. This is useful if you want to show some code that would take a long time to run. Be careful when you use this: since the code is not run, it’s easy to introduce bugs. (Also, your users will be puzzled when they copy & paste code and it doesn’t work.)

  • echo = FALSE turns off the printing of the code input (the output will still be printed). Generally, you shouldn’t use this in vignettes because understanding what the code is doing is important. It’s more useful when writing reports since the code is typically less important than the output.

  • results = "hide" turns off the printing of code output.

  • warning = FALSE and message = FALSE suppress the display of warnings and messages.

  • error = TRUE captures any errors in the block and shows them inline. This is useful if you want to demonstrate what happens if code throws an error. Whenever you use error = TRUE, you also need to use purl = FALSE. This is because every vignette is accompanied by a file code that contains all the code from the vignette. R must be able to source that file without errors, and purl = FALSE prevents the code from being inserted into that document.

  • collapse = TRUE and comment = "#>" are my preferred way of displaying code output. I usually set these globally by putting the following knitr block at the start of my document.

    ```{r, echo = FALSE}
    knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
    ```
  • results = "asis" treats the output of your R code as literal Markdown. This is useful if you want to generate text from your R code. For example, if you want to generate a table using the pander package, you’d do:

    ```{r, results = "asis"}
    pander::pandoc.table(iris[1:3, 1:4])
    ```

    That generates a Markdown table that looks like:

    --------------------------------------------------------
     Sepal.Length   Sepal.Width   Petal.Length   Petal.Width 
    -------------- ------------- -------------- -------------
         5.1            3.5           1.4            0.2     
    
         4.9             3            1.4            0.2     
    
         4.7            3.2           1.3            0.2     
    ---------------------------------------------------------

    Which makes a table that looks like:

    Sepal.Length Sepal.Width Petal.Length Petal.Width
    5.1 3.5 1.4 0.2
    4.9 3 1.4 0.2
    4.7 3.2 1.3 0.2
  • fig.show = "hold" holds all figures until the end of the code block.

  • fig.width = 5 and fig.height = 5 set the height and width of figures (in inches).

Many other options are described at http://yihui.name/knitr/options.

Development cycle

Run code a chunk at a time using Cmd + Alt + C. Re-run the entire document in a fresh R session using Knit (Ctrl/Cmd + Shift + K).

You can build all vignettes from the console with devtools::build_vignettes(), but this is rarely useful. Instead use devtools::build() to create a package bundle with the vignettes included. RStudio’s “Build & reload” does not build vignettes to save time. Similarly, devtools::install_github() (and friends) will not build vignettes by default because they’re time consuming and may require additional packages. You can force building with devtools::install_github(build_vignettes = TRUE). This will also install all suggested packages.

Advice for writing vignettes

If you’re thinking without writing, you only think you’re thinking. — Leslie Lamport

When writing a vignette, you’re teaching someone how to use your package. You need to put yourself in the readers’ shoes, and adopt a “beginner’s mind”. This can be difficult because it’s hard to forget all of the knowledge that you’ve already internalised. For this reason, I find teaching in-person a really useful way to get feedback on my vignettes. Not only do you get that feedback straight away but it’s also a much easier way to learn what people already know.

A useful side effect of this approach is that it helps you improve your code. It forces you to re-see the initial onboarding process and to appreciate the parts that are hard. Every time that I’ve written text that describes the initial experience, I’ve realised that I’ve missed some important functions. Adding those functions not only helps my users, but it often also helps me! (This is one of the reasons that I like writing books).

  • I strongly recommend literally anything written by Kathy Sierra. Her old blog, Creating passionate users is full of advice about programming, teaching, and how to create valuable tools. I thoroughly recommend reading through all the older content. Her new blog, Serious Pony, doesn’t have as much content, but it has some great articles.

  • If you’d like to learn how to write better, I highly recommend Style: Lessons in Clarity and Grace by Joseph M. Williams and Joseph Bizup. It helps you understand the structure of writing so that you’ll be better able to recognise and fix bad writing.

Writing a vignette also makes a nice break from coding. In my experience, writing uses a different part of the brain from programming, so if you’re sick of programming, try writing for a bit. (This is related to the idea of structured programming.).

Organisation

For simpler packages, one vignette is often sufficient. But for more complicated packages you may actually need more than one. In fact, you can have as many vignettes as you like. I tend to think of them like chapters of a book - they should be self-contained, but still link together into a cohesive whole.

Although it’s a slight hack, you can link various vignettes by taking advantage of how files are stored on disk: to link to vignette abc.Rmd, just make a link to abc.html.

CRAN notes

Note that since you build vignettes locally, CRAN only receives the html/pdf and the source code. However, CRAN does not re-build the vignette. It only checks that the code is runnable (by running it). This means that any packages used by the vignette must be declared in the DESCRIPTION. But this also means that you can use Rmarkdown (which uses pandoc) even though CRAN doesn’t have pandoc installed.

Common problems:

  • The vignette builds interactively, but when checking, it fails with an error about a missing package that you know is installed. This means that you’ve forgotten to declare that dependency in the DESCRIPTION (usually it should go in Suggests).

  • Everything works interactively, but the vignette doesn’t show up after you’ve installed the package. One of the following may have occurred. First, because RStudio’s “build and reload” doesn’t build vignettes, you may need to run devtools::install() instead. Next check:

    1. The directory is called vignettes/ and not vignette/.

    2. Check that you haven’t inadvertently excluded the vignettes with .Rbuildignore

    3. Ensure you have the necessary vignette metadata.

  • If you use error = TRUE, you must use purl = FALSE.

You’ll need to watch the file size. If you include a lot of graphics, it’s easy to create a very large file. There are no hard and fast rules, but if you have a very large vignette be prepared to either justify the file size, or to make it smaller.

Where next

If you’d like more control over the appearance of your vignette, you’ll need to learn more about Rmarkdown. The website, http://rmarkdown.rstudio.com, is a great place to start. There you can learn about alternative output formats (like LaTeX and pdf) and how you can incorporate raw HTML and LateX if you need additional control.

If you write a nice vignette, consider submitting it to the Journal of Statistical Software or The R Journal. Both journals are electronic only and peer-reviewed. Comments from reviewers can be very helpful for improving the quality of your vignette and the related software.