A git workflow for ecologists
Git
For those who don’t know git: it is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. I use git daily, including for this blog. Have a look at Wikipedia for more background.
Although it requires some overhead, it saves a lot of time once you get the hang of it. Why? Because you have the confidence that you can go back to any point in the history of a project. So you can explore new things without the risk of ruining everything. The new things don’t work out? Just go back to the last good point in the history and start over.
Each point in the history is called a commit. A commit contains all essential information on what needs to change to recreate the current state starting from the previous commit. It also contains useful metadata: who created the commit, when and why [1].
Git works great with plain text files like R scripts, RMarkdown files, data in txt or csv format, … You can add binary files (Word, Excel, pdf, jpg, …) to a git project, but git handles them less efficiently than plain text files and offers fewer options. In case of a plain text file, git notes which lines in the file are removed and where a line was inserted. A change in a line is a combination of removing the old line and inserting the new line. Have a look at this commit if you want a real life example. Such a granular approach is not available for binary files. Hence the old version is removed as a whole and the new version is added.
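For readers who like to see this on the command line, here is a minimal sketch of how git records a changed line in a plain text file. The file counts.csv, its contents and the demo identity are made up for illustration, and everything happens in a temporary directory.

```shell
set -eu
# Work in a throw-away directory so no existing project is touched.
work_dir=$(mktemp -d)
cd "$work_dir"
git init -q project
cd project
git config user.name "Demo User"          # hypothetical identity, only for this demo
git config user.email "demo@example.org"

# First version of a small, made-up data file.
printf 'species,count\nParus major,12\nErithacus rubecula,7\n' > counts.csv
git add counts.csv
git commit -q -m "Add initial bird counts"

# Correct a single value and commit again.
printf 'species,count\nParus major,12\nErithacus rubecula,17\n' > counts.csv
git add counts.csv
git commit -q -m "Correct the robin count"

# Compare the last two commits: only the changed line is listed.
line_changes=$(git diff HEAD~1 HEAD -- counts.csv)
printf '%s\n' "$line_changes"
```

In the printed comparison, the line starting with - is the old version that was removed and the line starting with + is the new version that was inserted. The untouched lines are shown only as context.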
Target audience for this workflow
The workflow is useful for anyone with basic computer skills. The workflow does not use all the bells and whistles available in git. It uses only the minimal functionality, all of which is accessible via either a graphical user interface (GUI) or a website. We target ecologists who often write R scripts and have no prior knowledge of version control systems.
This workflow seems to work for a team of scientists who work on the same project and all have write access to that project (repository in git terminology).
Basic workflow
Use case
- First repositories of git novices.
- Initial start of a repository.
It is no longer valid as soon as more than one user commits to the repository.
Principle
The basic workflow is just a simple linear history. The user makes a set of changes and commits those changes. This is repeated over and over until the project is finished. The resulting history will look like Figure 1.
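The same cycle can also be followed on the command line. The sketch below is only an illustration under assumptions: the file analysis.R, its content and the demo identity are invented, and the repository lives in a temporary directory.

```shell
set -eu
work_dir=$(mktemp -d)                     # throw-away directory for the demo
cd "$work_dir"
git init -q project
cd project
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.org"

# Cycle 1: make a change, then commit it.
printf 'counts <- read.csv("counts.csv")\n' > analysis.R
git add analysis.R
git commit -q -m "Import the count data"

# Cycle 2: make the next change, then commit again.
printf 'summary(counts)\n' >> analysis.R
git add analysis.R
git commit -q -m "Summarise the counts"

# Show the linear history, newest first: short hash, who, when and why.
git log --format='%h | %an | %ad | %s' --date=short
commit_total=$(git rev-list --count HEAD)
```

Each line of the log is one commit, listed with the metadata mentioned earlier: the author, the date and the commit message.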
One extra step is to push to another machine at least daily. This creates (or updates) a copy of the entire project history on that other machine and thus serves as a backup copy. The easiest way is to use an on-line service like GitHub, Bitbucket, GitLab, … At the time of writing, GitHub is free for public repositories and is popular for free and open source projects. Bitbucket offers free private repositories but only for small teams (max. 5 users). Having the repository on an on-line platform has another benefit: it is easy to share your work and collaborate.
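A push can be tried out without an account on any on-line service. In the sketch below a bare repository in a local folder (backup.git) stands in for the on-line copy. This is purely an assumption for the demo; with a real service you would use the address that the service shows for your repository. The remote name origin is the conventional default.

```shell
set -eu
work_dir=$(mktemp -d)
cd "$work_dir"

# Stand-in for the on-line copy: a bare repository (history only, no working files).
git init -q --bare backup.git

git init -q project
cd project
git symbolic-ref HEAD refs/heads/master   # call the first branch "master", as in this post
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.org"
printf '# Field notes\n' > README.md
git add README.md
git commit -q -m "Start the project"

# Tell git where the copy lives, then push the history of master to it.
git remote add origin "$work_dir/backup.git"
git push -q -u origin master

# Both repositories now point at the same latest commit.
local_tip=$(git rev-parse master)
backup_tip=$(git --git-dir="$work_dir/backup.git" rev-parse master)
echo "local:  $local_tip"
echo "backup: $backup_tip"
```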
Branching workflow with pull requests
Use case
- Working with several people on the same repository
- More experienced git users
Principle
- Commits are only created in feature branches, not in the master branch.
- Finalised branches are merged into the master branch by pull requests.
Branch
The basic workflow has a single branch which is called master. Git makes it easy to create new branches. A branch starts from a specific commit. Each user should create a new branch when they start working on a new feature in the repository. Because each user works in their own branch, they are the only one writing to this part of the history. This avoids a lot of conflicts. Figure 2 illustrates what the history looks like when a few branches are created.
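On the command line, creating a branch and committing to it could look like the sketch below. The branch name feature-a, the file analysis.R and the demo identity are invented for illustration.

```shell
set -eu
work_dir=$(mktemp -d)
cd "$work_dir"
git init -q project
cd project
git symbolic-ref HEAD refs/heads/master   # call the first branch "master", as in this post
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.org"

# The very first commit of the project lands on master.
printf 'counts <- read.csv("counts.csv")\n' > analysis.R
git add analysis.R
git commit -q -m "Import the count data"

# Create a feature branch at the current commit and switch to it.
git checkout -q -b feature-a

# New commits are now added to feature-a only.
printf 'plot(counts)\n' >> analysis.R
git add analysis.R
git commit -q -m "Plot the counts"

master_commits=$(git rev-list --count master)
feature_commits=$(git rev-list --count feature-a)
echo "master: $master_commits commit(s), feature-a: $feature_commits commit(s)"
```

The master branch still has one commit while feature-a has two: the shared starting point plus the new work.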
Pull request
Creating branches is fine, but they make the history of the repository diverge. So we need a mechanism to merge branches together. In this workflow we will work on a feature branch until it is finished. Then we merge it into the master branch. Figure 3 illustrates the resulting history. This can be done locally using a merge, but it is safer to do it on-line via a pull request.
A pull request is a two step procedure. First you create the pull request by indicating via the webapp which branches you would like to merge. The second step is to merge the pull request. Documentation on how to handle pull requests can be found on the websites of GitHub, Bitbucket and GitLab.
Pull requests have several advantages over local merges:
- It works only when the branches are pushed to the on-line copy of the repository. This ensures not only a backup but also gives access to the latest version to your collaborators.
- All pull requests are done against the common (on-line) master branch. Local merges would create diverging master branches which will create a lot of conflicts.
- Since the pull request is a two step procedure, one user can create the pull request and another (e.g. the project leader) can do the actual merge.
- The pull request gives an overview of the aggregated changes of all the commits in the pull request. This makes it easier to get a feeling for what has been changed within the range of the pull request.
- Most on-line tools allow you to add comments and reviews to a pull request. This is useful to discuss a feature prior to merging it. In case additional changes are required, the user should update their feature branch. The pull request gets automatically updated.
Conflicts
Conflicts arise when a file is changed at the same location in two different branches and with different changes. Git cannot decide which version is correct and therefore blocks the merging of the pull request. It is up to the user to select the correct version and commit the required changes. See on-line tutorials on how to do this. Once the conflicts are resolved, you can go ahead and merge the pull request. This is illustrated in Figure 3. First master is merged back into feature B to handle the merge conflict and then feature B is merged into master.
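To make the conflict scenario concrete, here is a small command-line sketch of it. It is an illustration under assumptions, not a recipe: the file effort.csv, the values, the branch names and the demo identity are invented, and the two local merges into master stand in for pull requests that would normally be merged on-line.

```shell
set -eu
work_dir=$(mktemp -d)
cd "$work_dir"
git init -q project
cd project
git symbolic-ref HEAD refs/heads/master   # call the first branch "master", as in this post
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.org"

printf 'site,visits\nforest,3\n' > effort.csv
git add effort.csv
git commit -q -m "Add survey effort"
git branch feature-a                      # both branches start from the same commit
git branch feature-b

# Feature A changes the forest line and gets merged into master.
git checkout -q feature-a
printf 'site,visits\nforest,13\n' > effort.csv
git commit -q -a -m "Feature A: forest was visited 13 times"
git checkout -q master
git merge -q --no-ff -m "Merge feature A" feature-a

# Feature B changes the same line in a different way.
git checkout -q feature-b
printf 'site,visits\nforest,103\n' > effort.csv
git commit -q -a -m "Feature B: forest was visited 103 times"

# Merging master back into feature B stops with a conflict.
if git merge master; then conflict_found=no; else conflict_found=yes; fi
cat effort.csv                            # git marked both versions inside the file

# Resolve it: write the version we decide is correct, then commit.
printf 'site,visits\nforest,103\n' > effort.csv
git add effort.csv
git commit -q -m "Merge master into feature B and resolve the conflict"

# Now feature B merges into master without problems.
git checkout -q master
git merge -q --no-ff -m "Merge feature B" feature-b
final_content=$(cat effort.csv)
```

Keeping the value from feature B is just the choice made in this example. In a real project you decide which version, or which combination of both, is correct.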
What if I choose the wrong version? Don’t panic, both versions remain in the history so you don’t lose any. So you can create a new branch starting from the latest commit with the correct version and merge that branch.
Flowcharts
Here are a few flowcharts that illustrate several components of the branching workflow with pull requests. Figure 4 illustrates the steps you need when you want to start working on a project. Once you have a local clone of the repository you can check out the required feature branch (Figure 5). The last flowchart handles working in a feature branch and merging it when finished (Figure 6).
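For those who prefer the command line, cloning a repository and checking out a feature branch could look like the sketch below. The first half only prepares a stand-in for the on-line repository (a local bare repository called shared.git with a branch feature-a). All of these names are assumptions for the demo; with a real project you would clone the address shown by your on-line service.

```shell
set -eu
work_dir=$(mktemp -d)
cd "$work_dir"

# --- Preparation: a local stand-in for the on-line repository ---
git init -q --bare shared.git
git --git-dir=shared.git symbolic-ref HEAD refs/heads/master
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master   # call the first branch "master", as in this post
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.org"
printf '# Field notes\n' > README.md
git add README.md
git commit -q -m "Start the project"
git branch feature-a
git push -q "$work_dir/shared.git" master feature-a
cd "$work_dir"

# --- What a collaborator does ---
# Step 1: make a local clone of the repository.
git clone -q "$work_dir/shared.git" my-copy
cd my-copy

# Step 2: check out the required feature branch.
git checkout -q feature-a
current_branch=$(git rev-parse --abbrev-ref HEAD)
echo "Now working in: $current_branch"
```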
Rules for collaboration
- Always commit into a feature branch, never in the master branch.
- Always start feature branches from the master branch.
- Only work in your own branches.
- Never merge someone else’s pull request without their consent.
- Don’t wait too long for merging a branch. Keep the scope of a feature branch narrow.
Exceptions
Starting branches not from master
In case you want to apply a change to someone else’s branch, create a new branch starting from the other’s branch, add commits and create a pull request. Ask the branch owner to merge the pull request. Basically you use someone else’s branch as the master branch.
Working with multiple users in the same branch
This is OK as long as users don’t work simultaneously in the branch.
- Person A creates the branch
- Person A adds commits
- Person A pushes and notifies person B
- Person B adds commits
- Person B pushes and notifies the next person
- …
- Person A creates a pull request
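The hand-over between two people can be sketched on the command line as follows. Two folders on one machine play the roles of person A and person B, and a local bare repository (shared.git) stands in for the on-line copy. These, the branch name feature-x and the file notes.txt are all assumptions for the demo. The pull request itself is created on the website and is not shown.

```shell
set -eu
work_dir=$(mktemp -d)
cd "$work_dir"
git init -q --bare shared.git             # stand-in for the on-line repository
git --git-dir=shared.git symbolic-ref HEAD refs/heads/master

# Person A starts the project, creates the branch, commits and pushes.
git init -q person-a
cd person-a
git symbolic-ref HEAD refs/heads/master
git config user.name "Person A"
git config user.email "a@example.org"
git remote add origin "$work_dir/shared.git"
printf '# Field notes\n' > README.md
git add README.md
git commit -q -m "Start the project"
git push -q -u origin master
git checkout -q -b feature-x
printf 'step 1 by A\n' > notes.txt
git add notes.txt
git commit -q -m "A: first step"
git push -q -u origin feature-x           # ... and then notifies person B

# Person B gets the branch, adds commits and pushes.
cd "$work_dir"
git clone -q shared.git person-b
cd person-b
git config user.name "Person B"
git config user.email "b@example.org"
git checkout -q feature-x
printf 'step 2 by B\n' >> notes.txt
git add notes.txt
git commit -q -m "B: second step"
git push -q origin feature-x              # ... and then notifies the next person

# Person A first fetches B's work before doing anything else in the branch.
cd "$work_dir/person-a"
git pull -q --ff-only origin feature-x
shared_commits=$(git rev-list --count feature-x)
git log --format='%an: %s' feature-x
```

Because only one person works in the branch at a time, every pull is a simple fast-forward and no conflicts arise.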
Session info
These R packages were used to create this post.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os Ubuntu 22.04.3 LTS
system x86_64, linux-gnu
ui X11
language nl_BE:nl
collate nl_BE.UTF-8
ctype nl_BE.UTF-8
tz Europe/Brussels
date 2023-08-30
pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
diagram * 1.6.5 2020-09-30 [1] CRAN (R 4.3.1)
digest 0.6.32 2023-06-26 [1] CRAN (R 4.3.1)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.1)
knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
lubridate * 1.9.2.9000 2023-05-15 [1] https://inbo.r-universe.dev (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.23 2023-07-01 [1] CRAN (R 4.3.1)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shape * 1.4.6 2021-05-19 [1] CRAN (R 4.3.1)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
[1] /home/thierry/R/x86_64-pc-linux-gnu-library/4.3
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────
Footnotes
1. Assuming that the user entered a sensible commit message.