This is a short tutorial on building R packages in RStudio. A major part of this tutorial has been borrowed from Hadley Wickham’s book R Packages: Organize, Test, Document, and Share Your Code which is an excellent reference. This tutorial will not cover the entire book and you are encouraged to read the Introduction and the chapter on Package structure before our session on Monday (10/8).
We will be using the R script funlib.R
(inside the zipped folder that came in your email) as the set of codes that we wish to convert into an R-package and finally submit it to CRAN. In particular, you will notice that funlib.R
stores two R functions: ejs
and sureshrink
. We will briefly discuss what these functions do as we go along. For our forthcoming discussion, please ensure that you have (i) the latest versions of R and RStudio installed, and (ii) the following packages are installed:
install.packages(c("devtools", "roxygen2", "testthat", "knitr"))
We will also need the following two packages which are required by the scripts in funlib.R
:
install.packages(c("rwt", "wavethresh"))
To ensure that you have everything installed and working, please run the following code:
library(devtools)
has_devel()
## '/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore \
## --quiet CMD SHLIB foo.c
##
## [1] TRUE
If you see a TRUE
in your output, like you see above, then you are good to go. If not, then it means that you may have some components missing.
Through this section, you will (i) come up with a name for your package and, (ii) create the package folder using RStudio.
Naming - your package name may only consist of letters, numbers and periods. It must start with a letter and it cannot end with a period. If you are thinking about submitting your package to CRAN then make sure that your favorite name is not already used in CRAN.
For the purposes of this tutorial, I will name my package mypack
.
Package folder in RStudio - you will use devtools
to create the package folder as follows:
devtools::create("/home/trambak/Dropbox/R packages/mypack")
You should now see a bunch of files inside the package folder.
We are now ready to make changes to this folder, edit some of these files and add more files. As a first step,
open the project file mypack.Rproj
in a new RStudio session and,
please place funlib.R
inside the \R
folder.
The DESCRIPTION file
stores important metadata about our package.
You will need to edit this file with your package specific information as shown below (as an example)
After you have finished editing the DESCRIPTION file
, request devtools
to do a check for you:
devtools::check()
Once you do this devtools
will perform a comprehensive check on your package for internal consistency, documentation and coding errors. At this stage your package is incomplete and devtools
will throw you some warnings but you should not get any errors. Much of the package development work is iterating the following three steps:
devtools::check()
devtools::check()
roxygen2
In this section you will document your code inside the \R
folder using R package roxygen2
. This step will also automatically edit the NAMESPACE
file. Documentation is by far the most important and time consuming (depending upon the number of functions in your package) step of package development. If you expect good package documentation from other contributors then it must start with you!
There are two functions inside funlib.R
: ejs
and sureshrink
. We will discuss them briefly before documenting them.
ejs - Extended James-Stein estimator
ejs
computes the extended James-Stein estimate (Brown, L.D. (2008), Effron and Morris (1973)) of a parameter vector \(\boldsymbol{\theta}=(\theta_1,\cdots,\theta_n)\) given independent observations \(\boldsymbol{y}=(y_1,\cdots,y_n)\) under the following hierarchical model: \[
\begin{aligned}
Y_i &= \theta_i+\epsilon_i,~\epsilon_i\stackrel{i.i.d}{\sim} N(0,\sigma_0^2),\\
\theta_i &\stackrel{i.i.d}{\sim} N(\mu,\sigma^2)
\end{aligned}
\] Here \(\sigma_0^2\) is known while the hyperparameters \((\mu,\sigma^2)\) are unknown but fixed. Under this model, the Bayes estimate of \(\theta_i\) under squared error loss is \[\theta_i^{Bayes}=\mu(1-\lambda)+\lambda y_i~\text{where }\lambda=\frac{\sigma^2}{\sigma_0^2+\sigma^2}\] Since the hyperparameters \((\mu,\sigma^2)\) are unknown, we cannot really use \(\theta_i^{Bayes}\). However, we can estimate \((\mu,\sigma^2)\) from our data \(\boldsymbol y\) if we note that \[
\begin{align}
\text{Marginally, }& Y_i\sim N(\mu,\sigma^2+\sigma_0^2)~\text{and so }\mathbb{E}(\bar{Y})=\mu\\
\text{and if }& S^2=\sum_{i=1}^{n}(Y_i-\bar{Y})^2~\text{then }\mathbb{E}\Big[\dfrac{(n-3)\sigma_0^2}{S^2}\Big]=1-\lambda
\end{align}
\] Substituting these estimates in \(\theta_i^{Bayes}\) gives the James-Stein (James and Stein (1961)) estimate of \(\theta_i\) \[
\hat{\theta}_i^{JS}=\bar{y}+\Big(1-\dfrac{(n-3)\sigma_0^2}{S^2}\Big)(y_i-\bar{y})
\] and the extended James-Stein estimate of \(\theta_i\) is simply \[
\hat{\theta}_i^{EJS}=\bar{y}+\Big(1-\dfrac{(n-3)\sigma_0^2}{S^2}\Big)_+(y_i-\bar{y})
\] Although both \(\hat{\theta}_i^{JS}\) and \(\hat{\theta}_i^{EJS}\) are biased estimates of \(\theta_i\), they have a lower MSE than the maximum likelihood estimate which is \(y_i\) in this case.
surehsrink - SureShrink estimator
The sureshrink
function that you see computes the SureShrink (Donoho and Johnstone (1995)) estimate of a sparse parameter vector \(\boldsymbol{\theta}\) based on observations \(\boldsymbol y\) where
\[ Y_i=\theta_i+\epsilon_i \text{ with } \epsilon_i\stackrel{i.i.d}{\sim}N(0,\sigma_0^2) \]
We will set \(\sigma_0^2=1\) for simplicity and assume that \(\boldsymbol\theta\) is sparse. To estimate \(\boldsymbol \theta\), one can solve the following LASSO problem with regularization parameter \(\lambda\in\mathbb{R}^{+}\):
\[ \min_{\boldsymbol \theta\in\mathbb{R}^n}\frac{1}{2}||\boldsymbol y-\boldsymbol \theta||_2^2+\lambda||\boldsymbol \theta||_1 \] The above convex problem has a closed form solution (the soft thresholding rule) given by: \[ \hat{\theta}_i(\lambda)=\begin{cases} 0,\text{ if }|y_i|\le \lambda\\ y_i-\lambda,\text{ if }y_i>\lambda\\ y_i+\lambda,\text{ if }y_i<-\lambda \end{cases} \] The choice of \(\lambda\) is important here and ideally we would like it to be the one that minimizes the estimation risk \(r(\boldsymbol \theta,\lambda)=\mathbb{E}\big|\big|\boldsymbol \theta-\hat{\boldsymbol \theta}(\lambda)\big|\big|_2^2\). However this involves \(\boldsymbol \theta\) which we are trying to estimate in the first place. Fortunately, under this model, we can construct an estimate \(S(\boldsymbol y,\lambda)\) of the true risk such that
\[ \mathbb{E}S(\boldsymbol Y,\lambda) = r(\boldsymbol \theta,\lambda)~\text{for any }\lambda>0 \] \(S(\boldsymbol y,\lambda)\) is the well known Stein’s Unbiased Risk Estimate (Stein (1981)). Finally, we choose \(\lambda\) as \[ {\lambda^*}=\arg\min_{\lambda\in[0,t_n]}S(\boldsymbol y,\lambda) \] where \(t_n=\sqrt{2\log n}\). The SureShrink estimate of \(\boldsymbol \theta\) is simply \(\hat{\boldsymbol \theta}(\lambda^*)\). When \(\boldsymbol \theta\) is sparse, \(\hat{\boldsymbol \theta}(\lambda^*)\) may exhibit a smaller MSE than \(\hat{\boldsymbol \theta}^{EJS}\). Please see Donoho and Johnstone (1995) for more details around the SureShrink estimator.
Back to R-packages
We will use a very particular (and intuitive) syntax to document our scripts inside funlib.R
. Here is an example of how funlib.R
may look like after adding documentation (see the file funlib_documented.R
).
When you are happy with your documentation, do the following:
Run the following:
devtools::document()
This step calls roxygen2
to create .md
files inside the \man
folder (see them!). Moreover, it will also edit your NAMESPACE
file and it should look like this:
The NAMESPACE
file tells you what packages are being imported by mypack
. These are your package dependencies and must be included in your DESCRIPTION
file as follows:
Finally, run:
devtools::check(,manual = TRUE)
You should just get warnings at this stage and a sleek package manual!
A vignette is a long form documentation to your package where you can go into more details describing the statistical problem that you are trying to solve using your package mypack
. See for example the glmm vignette. We will create a simpler vignette in this section.
Run the following:
devtools::use_vignette("mypackintro")
This will create a \vignettes
folder and a bare bones mypackintro.Rmd
file for your editing. Look at the file mypackintro_edited.Rmd
that came in your email.
After you have edited the .Rmd file to your satisfaction, you can build your vignette as follows:
devtools::build()
This step will create a folder mypack_1.0.0.tar.gz
which will hold /mypack/inst/doc/mypackintro.html
(see it!). You will also notice that your DESCRIPTION
file has new information:
For submitting our package to CRAN we will need a few more steps. They are as follows:
Test environments
(a.) Use devtools::build_win()
to check if your package works in the current developmental version of R.
(b.) If you are on Linux or Mac, build_win()
will check if your package is working fine on Windows. In case you are on Windows, you will need to use Travis to check your package on Linux.
A Final Check
If the above test results look fine then you perform a final check on your package:
devtools::check()
and make sure you get no errors and warnings.
cran-comments.md
You will need to report the results from steps 1 and 2 in the file cran-comments.md
which may look like this:
README.Rmd
Use devtools::use_readme_rmd()
to create a readme file for new users:
Please remember to re-knit README.Rmd
each time you modify it.
NEWS.md
The NEWS.md is aimed at existing users and lists all the changes in each release.
Release
The final step that calls upon devtools to do some more checks and submits the package to CRAN:
devtools::release()
In the end, your package directory will look something like this:
Brown, L.D. (2008). In-Season Prediction of Batting Averages: A Field Test of Empirical Bayes and Bayes Methodologies. The Annals of Applied Statistics, 2, 113-152
David L Donoho and Iain M Johnstone. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the american statistical association, 90(432):1200-1224
Bradley Efron and Carl Morris (1973). Stein’s Estimation Rule and Its Competitors–An Empirical Bayes Approach. Journal of the American Statistical Association Vol. 68, No. 341 (Mar., 1973), pp. 117-130
Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 1135–1151.
W. James and Charles Stein (1961). Estimation with Quadratic Loss. Proc. Fourth Berkeley Symp. on Math. Statist. and Prob., Vol. 1 (Univ. of Calif. Press, 1961), 361-379