Chapter 13 Version Control

13.1 Using Version Control with git

“Friends do not let friends work on a coding project without version control.” You might have heard this before, without really considering what it means. Or maybe you are convinced by this saying, but have not had the opportunity to use git, GitHub or GitLab for versioning your applications. If so, now is the time for a workflow change!

13.1.1 Why Version Control?

Have you ever experienced a piece of code disappearing? Or the seemingly unsolvable problem of integrating changes when several people have been working on the same piece of code? Or been unable to find something you wrote a while back?

If so, you might have been missing Version Control (also shortened as VC). In this chapter, we’ll be focusing on git, but you should be aware that other VC systems exist. As they are less popular than git, we will not cover them here. Git was designed to handle collaboration on code projects[1] where potentially a lot of people have to interact and make changes to the codebase. Git might feel a little bit daunting at first, and even seasoned developers still misuse it, or do not understand it completely, but getting comfortable with the basics will significantly improve the way you build software, so do not give up: the benefits of learning it really outweigh the (apparent) complexity.

There are many advantages to VC, including:

  • You can go back in time. With a VC system like git, every change is recorded (well, every committed change), meaning that you can potentially go back in time to a previous version of a project, and see the complete history of a file. This feature is very important: if you accidentally made changes that break your application, or if you deleted a feature you thought you would never need, you can go back to where you were a few hours, a few days, a few months back.

  • Several people can work on the same file. Git relies on a system of branches. Within this branch pattern, there is one main branch, called “master”, which contains the stable, main version of the codebase. By “forking” this branch (or any other branch), that is, by creating a new branch from it, developers get a copy of the base branch, where they can safely work on changing (and breaking) things, without impacting the original branch. This allows you to try things in a safe environment, without touching what works.

  • You can safely track changes. Every time a developer records something to git, changes are listed. In other words, you can see what changes are made to a specific file in your codebase.

  • It centralizes the codebase. You can use git locally, but its strength also lies in the ability to synchronize your local project with a distant server. This also means that several people can synchronize with this server and collaborate on a project. That way, changes on a branch on a server can be downloaded (this is called a pull in git terminology) by all the members of the team, and synchronized locally, i.e. if someone makes changes to a branch and sends them to the main server, all the other developers can retrieve these changes on their machine.
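To make the “go back in time” and “track changes” points more concrete, here is a minimal sketch in a throwaway repository. The file name notes.txt, its content and the placeholder identity are assumptions made for illustration; the add and commit commands it relies on are described in the next section.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Record two successive versions of a hypothetical file.
echo 'version 1' > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo 'version 2' > notes.txt
git commit -q -am "Update notes"

# See the history of the file, one line per commit.
git log --oneline -- notes.txt
n_commits=$(git rev-list --count HEAD -- notes.txt)

# Read the previous version without changing anything...
previous=$(git show HEAD~1:notes.txt)

# ...or bring it back into the working directory.
git checkout -q HEAD~1 -- notes.txt
restored=$(cat notes.txt)
echo "commits: $n_commits, previous: $previous, restored: $restored"
```

Restoring the old version this way only changes the working copy and the staging area: the history itself is left untouched until a new commit is made.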

13.1.2 Git basics: add - commit - push - pull

These are the four main actions you will be performing in git: if you just need to learn the minimum to get started, they are the four essential ones.

13.1.2.1 add

When using add, you are choosing which elements of your project you want to track, be it new files or modifications of an already versioned file. This action does not save the file in the git repository, but flags the changes to be added to the next commit.
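As a rough illustration of what add does, the sketch below creates a file in a throwaway repository and compares the output of git status before and after staging it. The file name app.R and its content are hypothetical.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create a new (hypothetical) file: git sees it, but does not track it yet.
echo 'library(shiny)' > app.R
untracked=$(git status --porcelain)

# Stage the file: it is flagged for the next commit, but nothing is committed yet.
git add app.R
staged=$(git status --porcelain)

echo "before add: $untracked"
echo "after add:  $staged"
```

In the short status format, `??` marks an untracked file and `A` a file that has been added to the staging area.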

13.1.2.2 commit

A commit is a snapshot of a codebase at a given moment in time. Each commit is associated with two things: a SHA-1 hash (sha1), which is a unique reference in the history of the project and allows you to identify this precise state when you need to go back in time, and a message, which is a piece of text that describes the commit[2]. Note that messages are mandatory by default: you cannot commit without one, whereas the sha1 is automatically generated by git. Do not overlook these messages: they might seem like a constraint at first, but they are a lifesaver when you need to understand the history of a project.

There is no strict rule about what and when to commit. Keep in mind that commits are what allow you to go back in time, so a commit should be a complete state of your codebase to which it would make sense to go back. A good practice is to state in the commit message which choices you made and why (but not how you implemented these changes), so that other developers (and you in the future) will be able to understand the changes.
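The two pieces of information attached to a commit can be inspected from the command line. The following is a minimal sketch, assuming a throwaway repository, a placeholder identity and a hypothetical app.R file.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Stage a hypothetical file, then record it with a descriptive message.
echo 'library(shiny)' > app.R
git add app.R
git commit -q -m "Add the application entry point"

# Every commit has a generated identifier and the message you wrote.
sha=$(git rev-parse HEAD)
message=$(git log -1 --format=%s)
echo "$sha"
echo "$message"
```

The identifier differs on every machine and at every run, since it depends on the content, the author and the date of the commit.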

13.1.2.3 push

Once you have a set of commits ready, you can push them to the server. In other words, you will permanently record these commits (hence the series of changes) on the server.

Making a push implies three things:

  • Other people in the team will be able to retrieve the changes you have made

  • These changes will be recorded permanently in the project history

  • You cannot safely modify commits once they have been sent to the server[3]

13.1.2.4 pull

Once changes have been recorded on the main server, everybody synchronized with the project can pull the commits to their local project.
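The push and pull round trip can be tried entirely on one machine. In the sketch below, a local bare repository stands in for the server, and two directories named alice and bob stand in for two developers; all of these names, as well as the file content, are assumptions made for illustration.

```shell
work=$(mktemp -d)

# A bare repository plays the role of the central server.
git init -q --bare "$work/server.git"

# First developer: create the project, commit, and push it to the server.
git init -q "$work/alice"
cd "$work/alice"
git symbolic-ref HEAD refs/heads/master   # name the first branch master, whatever the local default
git config user.name "Alice"
git config user.email "alice@example.com"
echo 'library(shiny)' > app.R
git add app.R
git commit -q -m "Add the application entry point"
git remote add origin "$work/server.git"
git push -q -u origin master

# Second developer: get a copy of the project from the server.
git clone -q -b master "$work/server.git" "$work/bob"

# The first developer records a new change and pushes it...
echo '# TODO: add a plot' >> app.R
git commit -q -am "Note the upcoming plot feature"
git push -q origin master

# ...and the second developer retrieves it with a pull.
cd "$work/bob"
git pull -q origin master
pulled=$(git log -1 --format=%s)
echo "$pulled"
```

With a hosted service such as GitHub or GitLab, the only difference is that the remote address is a URL instead of a local path.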

13.1.3 About branches

Branches are git’s way of organizing work and ideas, notably when several people are collaborating on the same project (which might be the case when building large web applications with R).

How does it work? When you start a project, you are in the main branch, which is called the “master”. In a perfect world, you never work directly on this branch: it should always contain a working, deployable version of the application.

Other branches are to be thought of as work areas, where developers fix bugs or add features. The modifications made in these development branches will then be transferred (directly or indirectly) to the master branch.

FIGURE 13.1: Branches in git.

In practice, you might want to use a workflow where each branch is designed to fix a small issue or implement a feature, so that it is easier to separate each small part of the work, even when working alone.
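Here is a small sketch of that workflow: a feature is developed on its own branch, the master branch stays untouched in the meantime, and the work is transferred once it is ready. The branch name add-plot, the file content and the placeholder identity are hypothetical.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master, whatever the local default
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'library(shiny)' > app.R
git add app.R
git commit -q -m "Add the application entry point"

# Create a work area for one small feature and commit there.
git checkout -q -b add-plot
echo 'plot(cars)' >> app.R
git commit -q -am "Add a plot of the cars dataset"

# Back on master, the file does not contain the new line yet.
git checkout -q master
on_master=$(grep -c -F 'plot(cars)' app.R || true)

# Transfer the work from the feature branch to master.
git merge -q add-plot
after_merge=$(grep -c -F 'plot(cars)' app.R || true)
echo "before merge: $on_master, after merge: $after_merge"
```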

13.1.4 Issues

If you are working with a remote tool with a graphical interface like GitLab, GitHub or Bitbucket, there is a good chance you will be using issues. Issues are “notes” or “tickets” that can be used to track a bug or to suggest a feature. This tool is crucial when it comes to project management: issues are the perfect spot for organizing and discussing ideas, but also for getting an overview of what has been done, what is currently being done and what is left to be done. Issues can also be used as a discussion medium with beta testers, clients or sponsors.

One other valuable feature of issues is that they can be referenced inside commit messages using a hash sign followed by the issue number: #123. In other words, when you send code to the centralized server, you can link this code to one or more issues, and the corresponding commits appear in the issue discussions.
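As a short sketch of what such a reference looks like, the commit below mentions issue #123 in its message. The repository, the file export.R and the identity are placeholders; the link itself is only created by the hosting platform once the commit has been pushed there.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Commit a hypothetical fix and reference the issue it addresses.
echo '# docx export' > export.R
git add export.R
git commit -q -m "Fix the docx export bug (#123)"

message=$(git log -1 --format=%s)
echo "$message"
```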

13.2 Git integration

13.2.1 With RStudio

Git is very well integrated into the RStudio IDE, and using git can be as simple as clicking on a button from time to time. If you are using RStudio, you will find a pull/push button, a stage & commit interface, and a tool for visualizing differences in files. Everything you need to get started is there.

13.2.2 As part of a larger world

Git is not reserved for team work: even if you are working alone on a project, using git is definitely worth the effort. Using git, and particularly issues, helps you organize your train of thought, especially upfront when you need to plan what you will be doing.

And of course, remember that git is not limited to Shiny applications: it can be used for any other R-related project, and at the end of the day for any code-related project, making it a valuable skill to have in your toolbox, whatever language you will be working with in 10 years!

13.2.3 About git-flow

There are a lot of different ways and methodologies to organize your git workflow. One of the most popular ones is called git flow, and we will give you here a quick introduction to how you can manage your work using this approach. Please note that this is a quick introduction, not a complete guide: we will link to some further reading at the end of this section.

So, here are the key concepts of git flow:

  • The master branch only contains stable code: most of the time it matches a tagged, fixed version (v0.0.1, v0.1.0, v1.0.0, etc.). A very small subset of the developers involved in the project have write access to the master branch, and no developer should ever push code straight to this branch: new code in master only comes either from the dev branch, or from a hotfix branch. For an app in production, the last commit of this branch should be the version that is currently in production.

  • The dev branch, on the other hand, is the “Work in progress” branch: the one that contains the latest changes before they are merged into master. This is the common working branch for all developers. Most of the time, developers do not push code into this branch either: they make merge/pull requests (MR/PR) to dev from one of their feature branches.

  • A feature branch is a branch, forked from dev, that implements one of the features of the application. To keep a clean track of what each branch is doing, a good practice is to name it issue-XXX, where XXX is the number of the issue you plan to solve in this branch.

  • A hotfix branch is a branch that corrects a critical issue in master. It is forked from master, and is merged straight into master using an MR.
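The hotfix path can be sketched locally as follows. This is only an approximation under stated assumptions: the merge request is replaced by a plain local merge, and the branch name hotfix-startup, the tags v0.1.0 and v0.1.1, the file content and the identity are all hypothetical.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master, whatever the local default
git config user.name "Demo User"
git config user.email "demo@example.com"

# master holds a tagged, stable version; dev starts from it.
echo 'library(shiny)' > app.R
git add app.R
git commit -q -m "Add the application entry point"
git tag v0.1.0
git branch dev

# A critical bug is found in production: fork a hotfix branch from master.
git checkout -q -b hotfix-startup master
echo '# guard against a missing config file' >> app.R
git commit -q -am "Fix crash on startup"

# Merge it straight into master (a local stand-in for the MR) and tag the result.
git checkout -q master
git merge -q --no-ff -m "Merge hotfix-startup into master" hotfix-startup
git tag v0.1.1

master_version=$(git describe --tags --abbrev=0 master)
dev_version=$(git describe --tags --abbrev=0 dev)
echo "master: $master_version, dev: $dev_version"
```

At this point dev does not contain the fix yet, which is one of the subtleties a team has to agree on.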

Here is a summary of this process:

FIGURE 13.2: Presentation of a git flow (Vincent Driessen, http://nvie.com).

From a software engineer’s point of view, here is how daily work goes:

  • Identify an issue to work on

  • Fork dev into issue-XXX

  • Develop feature inside the branch

  • Regularly run git stash, git rebase dev, and git stash apply to include the latest changes from dev and stay synchronized with it[4]

  • Make a pull request to dev so that the feature is included

  • Once the PR is accepted by the project manager, notify the rest of the team that there have been changes to dev, so they can rebase the branch they are working on onto it

  • Start working on a new feature
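The first steps of this routine can be rehearsed in a throwaway repository. In the sketch below, the teammate’s change on dev is simulated locally, and the issue number 12, the file names and the identity are hypothetical; the pull request itself happens on the hosting platform and is not shown.

```shell
# Throwaway repository with a placeholder identity (both are assumptions).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master, whatever the local default
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'library(shiny)' > app.R
git add app.R
git commit -q -m "Add the application entry point"
git checkout -q -b dev

# Fork dev into a branch named after the issue, and develop the feature.
git checkout -q -b issue-12
echo 'plot(cars)' > plot.R
git add plot.R
git commit -q -m "Add a plot of the cars dataset (#12)"

# Meanwhile, a teammate's change lands on dev (simulated here).
git checkout -q dev
echo '# Demo app' > README.md
git add README.md
git commit -q -m "Add a README"

# Back on the feature branch, with some uncommitted work in progress.
git checkout -q issue-12
echo 'summary(cars)' >> plot.R

git stash -q          # set the uncommitted changes aside
git rebase -q dev     # replay the branch commits on top of the latest dev
git stash apply -q    # bring the work in progress back

git log --oneline
```

After the rebase, the feature commit sits on top of the latest commit of dev, and the uncommitted line is back in plot.R. Note that git stash apply keeps the entry in the stash list, whereas git stash pop would remove it.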

Of course, there are many more subtleties to this workflow, but this gives you a good starting point. Generally speaking, good communication between developers is essential for a successful collaborative development project.

13.2.4 Further readings on git

Git can be used in different ways, and different approaches exist. Covering all of them is beyond the scope of this book, and other resources exist as well. You can find more under these links:

13.3 Automated testing

We have seen in chapter 12 how to build a testing infrastructure for your app, notably using the {testthat} (Wickham 2020) package. What we have described is a way to build it locally, before running your tests on your own machine. But there is a big flaw in this approach: you have to remember to run the tests, be it regularly or before making a pull request/pushing to the server. To do this kind of job, you will be looking for a tool that does automated testing at the repository level: in other words, a piece of software that can test your application whenever a piece of code is pushed/moved on the repository.

To do this, various tools are available, each with their own features. Here is a non-exhaustive list of the ones you can choose from:

Travis CI is a service that can be synced with your git repositories (GitHub or Bitbucket): whenever something happens on the repo, the steps described in the Travis configuration file (.travis.yml) are executed. If they exit with code 0, the test passes. If they do not, the integrated tests have failed. This Travis CI integration can be used internally and externally: internally, in the sense that before merging any pull request, the project manager has access to the results of a series of tests that are automatically launched; externally, as a “health check” before installing a piece of software: if you visit a GitHub repository that has Travis badges included, you can check whether the current state of the package/software is stable, i.e. whether it passes the automated tests.
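The exit code convention mentioned above is not specific to Travis CI: it is how shell commands report success or failure. The following sketch mimics it with a hypothetical run_step helper, using the standard true and false commands as stand-ins for a real test command.

```shell
# Run one step and report it the way a CI service would:
# exit status 0 means success, anything else is a failure.
# (run_step is a hypothetical helper, not part of any CI tool.)
run_step() {
  if "$@"; then
    echo "passed"
  else
    echo "failed"
  fi
}

ok=$(run_step true)                  # `true` always exits with status 0
ko=$(run_step false)                 # `false` always exits with a non-zero status
custom=$(run_step sh -c 'exit 3')    # any non-zero status counts as a failure

echo "true: $ok, false: $ko, exit 3: $custom"
```

In a real pipeline, the command passed to such a step would be the one that runs your tests.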

Travis CI can do a lot more than just testing your app: it can be used to build documentation, deploy to production, or to run any other scripts you want to be run before/after the tests have passed. And the nice thing is that you can test for various versions of R, so that you are sure that you are supporting current, future and previous versions of R.

All of this is defined in the .travis.yml file, which is to be put at the root of your source directory, a file that is automatically generated when calling usethis::use_travis(). Here is an example of one of these files, for the {golem} (Guyader et al. 2020) package:

# R for travis: see documentation at https://docs.travis-ci.com/user/languages/r
language: R
sudo: false
cache: packages

r_github_packages:
  - ThinkR-open/golem  # pre-install to avoid vignette package errors

# build matrix; turn on vdiffr only on r release
matrix:
  include:
  - r: devel
    env: VDIFFR_RUN_TESTS=true
    before_cache:
    - Rscript -e 'remotes::install_cran("pkgdown")'
    - Rscript -e 'remotes::install_github("ThinkR-open/thinkrtemplate")'
    deploy:
      provider: pages
      skip-cleanup: true
      github-token: $GITHUB_PAT
      keep-history: true
      local-dir: docs/dev
      on:
        branch: dev
      skip_cleanup: true
  - r: release
    env: VDIFFR_RUN_TESTS=true
    before_cache:
    - Rscript -e 'remotes::install_cran("pkgdown")'
    - Rscript -e 'remotes::install_github("ThinkR-open/thinkrtemplate")'
    deploy:
      provider: pages
      skip-cleanup: true
      github-token: $GITHUB_PAT
      keep-history: true
      local-dir: docs
      on:
        branch: master
      skip_cleanup: true
  - r: oldrel
  - r: 3.3
  - r: 3.4

before_install:
  - Rscript -e 'update.packages(ask = FALSE)'
  
after_success:
  - Rscript -e 'covr::codecov()'
  - Rscript -e 'pkgdown::build_site()'
  

Note that Travis CI can run tests on GNU/Linux or macOS operating systems.

AppVeyor offers the same functionality as Travis CI. This service can integrate with GitHub, GitHub Enterprise, Bitbucket, GitLab, Azure Repos, Kiln and Gitea. It supports Windows, Linux and macOS.

GitHub Actions serve a related purpose: defining actions to be performed in response to events on the GitHub repository. Testing, building documentation, pushing to another repository, deploying to a server… all these actions can be automatically performed. As with Travis CI, these actions are defined in a YAML file. Examples of these configurations can be found at r-lib/actions, and some can be automatically linked to your project using functions from {usethis}: use_github_action_check_release(), use_github_action_check_standard(), use_github_action_check_full() and use_github_action_pr_commands(). The first three perform a standard R CMD check, under various conditions:

  • release runs the check on macOS, with the latest version of R, via the {rcmdcheck} (Csárdi 2019b) package
  • standard runs the check on three operating systems (Windows, macOS and Linux), for R and R-devel
  • full does the same as standard, but for the last five minor versions of R

Finally, use_github_action_pr_commands() sets up checks to be performed when a Pull Request is made to the repository.

If you are working with GitLab, you can use the integrated GitLab CI service: it serves the same purpose, with the small difference that it is completely Docker-based: you define a YAML file with a series of stages that are performed (concurrently or sequentially), and they are all launched inside a Docker container. To help you with this, the colinfay/r-ci-tidyverse Docker image comes with pre-installed packages for testing: {remotes} (Hester et al. 2020), {testthat} (Wickham 2020), {config} (Allaire 2018)… and is available for several R versions. This Docker image can be used as the source image for your GitLab CI YAML file.

Here is an example of one of these files:

image: colinfay/r-ci-tidyverse:3.6.0

cache:
  paths:
    - ci/

stages:
  - test
  - document

building:
  stage: test
  script:
    - R -e "remotes::install_deps(dependencies = TRUE)"
    - R -e 'devtools::check()'

documenting:
  stage: document
  allow_failure: true
  when: on_success
  only:
    - master
  script:
    - Rscript -e 'install.packages("DT")'
    - Rscript -e 'covr::gitlab(quiet = FALSE)'
  artifacts:
    paths:
      - public

Automated testing, continuous integration and continuous deployment are vast topics that cannot be covered in a few pages of this book, but spending some time learning about these methodologies is definitely worthwhile: the more you can automate these processes, and the more you test, the more resilient, easy to maintain and easy to enhance your application will be. The more you check, the quicker you will discover bugs.

And the quicker you detect bugs, the easier it is to correct them!

References

Allaire, JJ. 2018. Config: Manage Environment Specific Configuration Values. https://CRAN.R-project.org/package=config.

Csárdi, Gábor. 2019b. Rcmdcheck: Run ’R Cmd Check’ from ’R’ and Capture Results. https://CRAN.R-project.org/package=rcmdcheck.

Guyader, Vincent, Colin Fay, Sébastien Rochette, and Cervan Girard. 2020. Golem: A Framework for Robust Shiny Applications. https://github.com/ThinkR-open/golem.

Hester, Jim, Gábor Csárdi, Hadley Wickham, Winston Chang, Martin Morgan, and Dan Tenenbaum. 2020. Remotes: R Package Installation from Remote Repositories, Including ’Github’. https://CRAN.R-project.org/package=remotes.

Wickham, Hadley. 2020. Testthat: Unit Testing for R. https://CRAN.R-project.org/package=testthat.


  1. It was first developed by Linus Torvalds, the very same man behind Linux.↩︎

  2. For example: “Added a graph in the analysis tab”, or “Fixed the docx export bug”.↩︎

  3. If you want to modify some code or have to go back in time, the best way to do it is to create a new commit with these changes or use adequate git commands.↩︎

  4. There are two strategies for merging dev: either a “merge strategy” or a “rebase strategy”. Both strategies have pros and cons. We work with the “rebase strategy” to force ourselves to stay updated. We also noticed that this strategy lowers the risk of bad merges that can cause code loss. However, this requires a lot of communication between developers and a good knowledge of git.↩︎

