A git workflow for ecologists

reproducible research
version control
Author

Thierry Onkelinx

Published

August 23, 2017

Git

For those who don’t know git, it is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. I use git daily, including for this blog. Have a look at Wikipedia for more background.

Although it requires some overhead, it saves a lot of time once you get the hang of it. Why? Because you have the confidence that you can go back to any point in the history of a project. So you can explore new things without the risk of ruining everything. The new things don’t work out? Just go back to the last good point in the history and start over.

Each point in the history is called a commit. A commit contains all essential information on what needs to change to recreate the current state starting from the previous commit. It also contains useful metadata: who created the commit, when and why¹.

Git works great with plain text files like R scripts, RMarkdown files, data in txt or csv format, … You can add binary files (Word, Excel, pdf, jpg, …) to a git project, but git handles them less efficiently than plain text files and offers fewer options. In case of a plain text file, git notes which lines in the file are removed and where a line was inserted. A change in a line is a combination of removing the old line and inserting the new line. Have a look at this commit if you want a real life example. Such a granular approach is not available for binary files. Hence the entire old version is removed and the entire new version is added.
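For readers who like to see this on the command line, here is a minimal sketch of a line-based change and the metadata of a commit. It assumes git is installed. The throw-away directory, the file name `counts.csv`, its content and the user name are hypothetical.

```shell
set -eu
repo=$(mktemp -d)                      # throw-away project directory (assumption)
cd "$repo"
git init --quiet
git config user.name "Example Ecologist"
git config user.email "ecologist@example.org"

# First commit: a small plain text data file
printf 'species,count\nParus major,12\nSitta europaea,3\n' > counts.csv
git add counts.csv
git commit --quiet -m "Add first bird counts"

# The metadata of the commit: who and why
meta=$(git log -1 --format='%an: %s')
echo "$meta"

# Change a single value, then ask git what changed
printf 'species,count\nParus major,15\nSitta europaea,3\n' > counts.csv
changes=$(git diff --no-color --no-ext-diff --unified=0 counts.csv | grep '^[-+][^-+]')
echo "$changes"
```

The last command should report one removed line (prefixed with `-`) and one inserted line (prefixed with `+`), which is how git represents a changed line.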

Target audience for this workflow

The workflow is useful for anyone with basic computer skills. It does not use all the bells and whistles available in git, only the minimal functionality, all of which is accessible via either a graphical user interface (GUI) or a website. We target ecologists who often write R scripts and have no prior knowledge of version control systems.

This workflow seems to work for a team of scientists who work on the same project and all have write access to that project (repository in git terminology).

Basic workflow

Use case

  • First repositories of git novices.
  • Initial start of a repository.

This workflow is no longer suitable as soon as more than one user commits to the repository.

Principle

The basic workflow is just a simple linear history. The user makes a set of changes and commits those changes. This is repeated over and over until the project is finished. The resulting history will look like Figure 1.

One extra step is to push to another machine at least once a day. This creates (or updates) a copy of the entire project history on that other machine, which thus serves as a backup. The easiest way is to use an on-line service like GitHub, Bitbucket, GitLab, … GitHub is free for public repositories and is popular for open source projects. Bitbucket offers free private repositories but only for small teams (max. 5 users). Having the repository on an on-line platform has another benefit: it is easy to share your work and collaborate.

Figure 1: An example of the history of a basic workflow
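The basic workflow can be sketched on the command line as follows, assuming git is installed. A local bare repository stands in for the on-line service, which is an assumption made only to keep the example self-contained. The file name and commit messages are hypothetical.

```shell
set -eu
work=$(mktemp -d)
remote="$work/backup.git"              # stand-in for GitHub, Bitbucket or GitLab
git init --quiet --bare "$remote"

git init --quiet "$work/project"
cd "$work/project"
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Ecologist"
git config user.email "ecologist@example.org"
git remote add origin "$remote"

# Linear history: change, commit, repeat
for step in import clean model; do
  echo "# $step the data" >> analysis.R
  git add analysis.R
  git commit --quiet -m "Add $step step"
done

# The (at least) daily push copies the full history to the other machine
git push --quiet origin master

history=$(git log --format=%s)
backup_count=$(git --git-dir="$remote" rev-list --count master)
echo "$history"
echo "commits in the backup: $backup_count"
```

With a real on-line service, only the address given to `git remote add` would differ.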

Branching workflow with pull requests

Use case

  • Working with several people on the same repository
  • More experienced git users

Principle

  1. Commits are only created in feature branches, not in the master branch.
  2. Finalised branches are merged into the master branch by pull requests.

Branch

The basic workflow has a single branch which is called master. Git makes it easy to create new branches. A branch starts from a specific commit. Each user should create a new branch when they start working on a new feature in the repository. Because each user works in their own branch, they are the only one writing to this part of the history. This avoids a lot of conflicts. Figure 2 illustrates what the history looks like when a few branches are created.

Figure 2: An example of a history with a few feature branches
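A history like the one in Figure 2 can be reproduced with a few commands. This is only a sketch, assuming git is installed. The branch names `feature-a` and `feature-b` and the file names are hypothetical.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Ecologist"
git config user.email "ecologist@example.org"

echo "# shared starting point" > analysis.R
git add analysis.R
git commit --quiet -m "Initial commit on master"

# Each feature gets its own branch, starting from master
git checkout --quiet -b feature-a master
echo "# model" > model.R
git add model.R
git commit --quiet -m "Add model script"

git checkout --quiet -b feature-b master
echo "# figures" > figures.R
git add figures.R
git commit --quiet -m "Add figure script"

# Show the diverging history and list the branches
git log --oneline --graph --all
branches=$(git for-each-ref --format='%(refname:short)' refs/heads)
master_commits=$(git rev-list --count master)
echo "$branches"
```

Note that master still holds only the initial commit: the new commits live in the feature branches.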

Pull request

Creating branches is fine, but they make the history of the repository diverge. So we need a mechanism to merge branches together. In this workflow we will work on a feature branch until it is finished. Then we merge it into the master branch. Figure 3 illustrates the resulting history. This can be done locally using a merge, but it is safer to do it on-line via a pull request.

Figure 3: An example of a history after two pull requests

A pull request is a two-step procedure. First you create the pull request by indicating via the webapp which branches you would like to merge. The second step is to merge the pull request. Documentation on how to handle pull requests can be found on the websites of GitHub, Bitbucket and GitLab.

Pull requests have several advantages over local merges:

  1. It works only when the branches are pushed to the on-line copy of the repository. This ensures not only a backup but also gives access to the latest version to your collaborators.
  2. All pull requests are done against the common (on-line) master branch. Local merges would create diverging master branches which will create a lot of conflicts.
  3. Since the pull request is a two-step procedure, one user can create the pull request and another (e.g. the project leader) can do the actual merge.
  4. The pull request gives an overview of the aggregated changes of all the commits in the pull request. This makes it easier to get a feel for what has been changed within the range of the pull request.
  5. Most on-line tools allow you to add comments and reviews to a pull request. This is useful to discuss a feature prior to merging it. In case additional changes are required, the user should update their feature branch. The pull request is then updated automatically.
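The pull request itself lives on the website, but the aggregated overview mentioned in point 4 has a rough local counterpart. The sketch below assumes git is installed and uses hypothetical branch and file names. It lists the commits a pull request from `feature-a` into master would contain and the files they change together.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Ecologist"
git config user.email "ecologist@example.org"

printf 'x <- 1\n' > analysis.R
git add analysis.R
git commit --quiet -m "Start analysis"

# Two commits in the feature branch
git checkout --quiet -b feature-a
printf 'x <- 1\ny <- 2\n' > analysis.R
git commit --quiet -a -m "Add y"
printf '# notes\n' > notes.md
git add notes.md
git commit --quiet -m "Add notes"

# Commits that are in feature-a but not in master
commits=$(git rev-list --count master..feature-a)
# Aggregated change of the whole branch since it left master
files=$(git diff --name-only master...feature-a)
git diff --stat master...feature-a
echo "$commits commits touching: $files"
```

The three-dot form `master...feature-a` compares the feature branch with the commit where it branched off, which is close to what the on-line overview shows.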

Conflicts

Conflicts arise when a file is changed at the same location in two different branches and with different changes. Git cannot decide which version is correct and therefore blocks the merging of the pull request. It is up to the user to select the correct version and commit the required changes. See on-line tutorials on how to do this. Once the conflicts are resolved, you can go ahead and merge the pull request. This is illustrated in Figure 3. First master is merged back into feature B to handle the merge conflict and then feature B is merged into master.
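The sequence described above (merge master into feature B, resolve the conflict, then merge feature B into master) can be sketched locally. This assumes git is installed. The file `settings.R`, its content and the choice of which version to keep are hypothetical. In the workflow itself the final merge would happen through the pull request rather than a local merge.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Ecologist"
git config user.email "ecologist@example.org"

echo 'threshold <- 5' > settings.R
git add settings.R
git commit --quiet -m "Add settings"

# The same line is changed differently in feature-b and in master
git checkout --quiet -b feature-b
echo 'threshold <- 10' > settings.R
git commit --quiet -a -m "Raise threshold in feature B"
git checkout --quiet master
echo 'threshold <- 7' > settings.R
git commit --quiet -a -m "Change threshold on master"

# Merge master back into feature-b: git stops and reports a conflict
git checkout --quiet feature-b
if git merge --no-edit master > /dev/null 2>&1; then
  conflict=no
else
  conflict=yes
fi

# Resolve by writing the version we want to keep, then commit
echo 'threshold <- 10' > settings.R
git add settings.R
git commit --quiet -m "Merge master into feature B, keep threshold 10"

# Now feature-b merges cleanly into master
git checkout --quiet master
git merge --quiet --no-ff -m "Merge feature B into master" feature-b
result=$(cat settings.R)
echo "conflict: $conflict, final content: $result"
```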

What if I choose the wrong version? Don’t panic, both versions remain in the history so you don’t lose either of them. So you can create a new branch starting from the latest commit with the correct version and merge that branch.

Flowcharts

Here are a few flowcharts that illustrate several components of the branching workflow with pull requests. Figure 4 illustrates the steps you need when you want to start working on a project. Once you have a local clone of the repository you can check out the required feature branch (Figure 5). The last flowchart handles working in a feature branch and merging it when finished (Figure 6).

Figure 4: Flowchart for preparing a repository.

Figure 5: Flowchart for changing to a feature branch.

Figure 6: Flowchart for applying changes in a feature branch.

Rules for collaboration

  1. Always commit into a feature branch, never in the master branch.
  2. Always start feature branches from the master branch.
  3. Only work in your own branches.
  4. Never merge someone else’s pull request without their consent.
  5. Don’t wait too long before merging a branch. Keep the scope of a feature branch narrow.

Exceptions

Starting branches not from master

In case you want to apply a change to someone else’s branch, create a new branch starting from the other’s branch, add commits and create a pull request. Ask the branch owner to merge the pull request. Basically you use someone else’s branch as the master branch.
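A minimal sketch of this exception, assuming git is installed and using hypothetical branch names: `feature-a` plays the role of the colleague’s branch and `fix-feature-a` is your own branch that starts from it. The local merge at the end stands in for the pull request that the branch owner would merge on-line.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Ecologist"
git config user.email "ecologist@example.org"

echo "# project" > README.md
git add README.md
git commit --quiet -m "Initial commit on master"

# The colleague's feature branch
git checkout --quiet -b feature-a
echo 'fit <- lm(y ~ x)' > model.R
git add model.R
git commit --quiet -m "Add model"
start=$(git rev-parse feature-a)

# Your branch starts from feature-a instead of master
git checkout --quiet -b fix-feature-a feature-a
echo '# TODO: check residuals' >> model.R
git commit --quiet -a -m "Add reminder to check residuals"
base=$(git merge-base fix-feature-a feature-a)

# Stand-in for the pull request with feature-a as the target branch
git checkout --quiet feature-a
git merge --quiet --no-ff -m "Merge fix into feature A" fix-feature-a
git log --oneline --graph
```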

Working with multiple users in the same branch

This is OK as long as users don’t work simultaneously in the branch.

  • Person A creates the branch
  • Person A adds commits
  • Person A pushes and notifies person B
  • Person B adds commits
  • Person B pushes and notifies the next person
  • Person A creates a pull request
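The hand-over between two people can be sketched with two local clones of one shared repository. This assumes git is installed. The bare repository `shared.git` stands in for the on-line copy, and the branch name `shared-feature`, the file names and the user names are hypothetical.

```shell
set -eu
work=$(mktemp -d)
git init --quiet --bare "$work/shared.git"      # stand-in for the on-line repository
git --git-dir="$work/shared.git" symbolic-ref HEAD refs/heads/master

# Person A creates the branch, adds a commit and pushes
git clone --quiet "$work/shared.git" "$work/person_a" 2> /dev/null
cd "$work/person_a"
git symbolic-ref HEAD refs/heads/master
git config user.name "Person A"
git config user.email "a@example.org"
echo "# project" > README.md
git add README.md
git commit --quiet -m "Start project"
git push --quiet origin master
git checkout --quiet -b shared-feature
echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "First draft of the analysis"
git push --quiet origin shared-feature

# Person B takes over only after being notified
git clone --quiet --branch shared-feature "$work/shared.git" "$work/person_b"
cd "$work/person_b"
git config user.name "Person B"
git config user.email "b@example.org"
echo 'y <- 2' >> analysis.R
git commit --quiet -a -m "Extend the analysis"
git push --quiet origin shared-feature

# Person A fetches B's work before creating the pull request on-line
cd "$work/person_a"
git pull --quiet --ff-only origin shared-feature
authors=$(git log --format=%an master..shared-feature)
echo "$authors"
```

Because the two people never commit at the same time, every pull is a simple fast-forward and no conflicts arise.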

Session info

These R packages were used to create this post.

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       X11
 language nl_BE:nl
 collate  nl_BE.UTF-8
 ctype    nl_BE.UTF-8
 tz       Europe/Brussels
 date     2023-08-30
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date (UTC) lib source
 cli           3.6.1      2023-03-23 [1] CRAN (R 4.3.0)
 colorspace    2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
 diagram     * 1.6.5      2020-09-30 [1] CRAN (R 4.3.1)
 digest        0.6.32     2023-06-26 [1] CRAN (R 4.3.1)
 dplyr       * 1.1.2      2023-04-20 [1] CRAN (R 4.3.0)
 evaluate      0.21       2023-05-05 [1] CRAN (R 4.3.0)
 fansi         1.0.4      2023-01-22 [1] CRAN (R 4.3.0)
 farver        2.1.1      2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0      2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.2      2023-04-03 [1] CRAN (R 4.3.0)
 glue          1.6.2      2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3      2023-03-21 [1] CRAN (R 4.3.0)
 hms           1.1.3      2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.5      2023-03-23 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2      2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.7      2023-06-29 [1] CRAN (R 4.3.1)
 knitr         1.43       2023-05-25 [1] CRAN (R 4.3.0)
 labeling      0.4.2      2020-10-20 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2.9000 2023-05-15 [1] https://inbo.r-universe.dev (R 4.3.0)
 magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0      2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.1      2023-01-10 [1] CRAN (R 4.3.0)
 R6            2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.4      2023-02-10 [1] CRAN (R 4.3.0)
 rlang         1.1.1      2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.23       2023-07-01 [1] CRAN (R 4.3.1)
 rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.3.0)
 scales        1.2.1      2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
 shape       * 1.4.6      2021-05-19 [1] CRAN (R 4.3.1)
 stringi       1.7.12     2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0      2022-12-02 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0      2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0      2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0      2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0      2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.3      2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.3      2023-06-14 [1] CRAN (R 4.3.0)
 withr         2.5.0      2022-03-03 [1] CRAN (R 4.3.0)
 xfun          0.39       2023-04-20 [1] CRAN (R 4.3.0)
 yaml          2.3.7      2023-01-23 [1] CRAN (R 4.3.0)

 [1] /home/thierry/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

  1. Assuming that the user entered a sensible commit message.