To intercept or not to intercept? Is that a question when calculating indices?

statistics

Author

Thierry Onkelinx

Published

November 1, 2018

Indices

Long-term trends in the abundance of species are often communicated as indices. The abundance at some reference point in time is set at 100%. Sometimes this point in time is a range of years, often it is a single year. The reference point in time is something meaningful for the data: some special date (e.g. new legislation) or the start of the monitoring campaign. Although, seldom used, one could use the last year in the data as well.

Once the reference is set, the abundance at the other points in time (mostly years) is expressed as a ratio of the abundance at the reference point in time. An index value of 120% at time point X implies that the abundance at time point X is 1.2 times higher than the abundance at the reference point in time.

For the sake of simplicity, we will assume that each point in time refers to a specific year. Hence when we mention “year”, think of the more general “point in time”.

Estimating abundance

In order to calculate the indices, we need to have an abundance estimate for each year. This is done by applying some statistical model to the data. Let us keep things somewhat simple and assume that the data consist of a number of independent Poisson count spanning several years. The data has multiple observations within each year. The observed counts \(Y_{ij}\) depend only on a latent variable \(\eta_i\) which only depends on the year \(i\).

\[Y_{ij} \sim \mbox{Pois}(\lambda_i) \\ \log(\lambda_i) = \eta_i\]

Let assume two basic models: a linear trend \(\eta_i = \beta_0 + \beta_1 x_i\) and a factor trend \(\eta_i = \alpha_0 + \alpha_2I_2 + \dots + \alpha_iI_i\). \(I_k\) is a dummy variable where \(I_k = 1\) if \(i = k\), otherwise \(I_k = 0\). We use \(\alpha\) for the parameters of the factor trend and \(\beta\) for those of the linear trend.

Calculating an index based on the model parameters

The intercept of a model is the estimated value at the reference state of the data. In case of a continuous covariate, the covariate \(x_i = 0\). In case of a categorical variable, all of the associated dummy variables are \(I_k = 0\). So if we choose the covariates carefully, we can directly use the intercept to estimate the abundance at the reference year. We obtain this for the linear trend by setting \(x_i\) to the number of years since the reference year. At the reference year we get \(x_i = 0\), two year after the reference year we get \(x_i = 2\), \(x_i = -5\) refers to 5 years before the reference year. In case of the factor trend, we make sure that the reference year is used as the reference level of the factor.

Note that since we use a model with a log-link, we need to exponentiate the intercept in order to get an estimate of the abundance. So the estimates at the reference year are \(e^{\beta_0}\) for the linear trend and \(e^{\alpha_0}\) for the factor trend.

The definition of \(\alpha_k\) in the factor model is the difference between the reference status and the state indicated by the dummy variable \(I_k\). Hence \(e^{\alpha_k}\) is a direct estimate of the value of the index. Likewise, \(e^{\beta_1x_i}\) is a direct estimate of the index for the linear trend. Since we are statistician, we like a confidence interval around the estimates. A straightforward option is to use the standard error of the parameter estimates. So the interval for the index with the factor trend becomes \(e^{alpha_k \pm 1.96 \sigma_k}\). For the linear trend we get \(e^{(\beta_1\pm1.96\sigma_1)x_i}\).

Example

library(tidyverse)
set.seed(20181101)
n_year <- 10
effort <- 10
intercept <- 20
trend <- 1.05
sigma_year <- 0.3

We create an example data with the parameters set above. The data consists of 10 of data. The number of observations in each year is 10 times the number of the year. So the first year has 10 observation while the last year has 100 observations. This reflects an increasing effort during the monitoring. A change in monitoring effort is often present in monitoring schemes. The trend in the latent variable \(\eta\) has both a linear component and a first order random walk component.

rw_year <- cumsum(rnorm(n_year, mean = 0, sd = sigma_year))
tibble(
  year = rep(seq_len(n_year), effort * seq_len(n_year))
) %>%
  mutate(
    eta = log(intercept) + log(trend) * year + rw_year[year],
    count = rpois(n(), lambda = exp(eta))
  ) -> ds

Figure 1: Simulated data. The points represent the observed counts, the line the true latent variable \(\lambda\).

Once we have the data, we can fit both models. We create two new variables: start is a continuous variable centred as the first year, fstart is a factor with the first year as reference level.

ds %>%
  mutate(
    start = year - min(year),
    fstart = factor(start)
  ) -> ds
m1_t <- glm(count ~ start, family = poisson, data = ds)
m1_f <- glm(count ~ fstart, family = poisson, data = ds)
summary(m1_f)


Call:
glm(formula = count ~ fstart, family = poisson, data = ds)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.37417    0.05852  57.658  < 2e-16 ***
fstart1     -0.08951    0.07278  -1.230    0.219    
fstart2      0.06732    0.06702   1.004    0.315    
fstart3      0.37592    0.06334   5.935 2.95e-09 ***
fstart4      0.54384    0.06182   8.796  < 2e-16 ***
fstart5      0.47917    0.06147   7.796 6.41e-15 ***
fstart6      0.59315    0.06079   9.758  < 2e-16 ***
fstart7      0.78803    0.06016  13.099  < 2e-16 ***
fstart8      0.59717    0.06028   9.906  < 2e-16 ***
fstart9      0.76164    0.05987  12.721  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1812.33  on 549  degrees of freedom
Residual deviance:  581.48  on 540  degrees of freedom
AIC: 3769.5

Number of Fisher Scoring iterations: 4

This leaves us with the task of calculating the indices based on the trend parameters. The first task is the extract the parameters estimates and their standard errors from the model object and keep only the relevant parameters. And then we can calculate the indices and their confidence intervals.

summary(m1_t) %>%
  coefficients() %>%
  as.data.frame() %>%
  rownames_to_column("parameter") %>%
  filter(parameter != "(Intercept)") %>%
  merge(
    distinct(ds, start, year)
  ) %>%
  mutate(
    index = exp(start * Estimate),
    lcl = exp(start * (Estimate - 1.96 * `Std. Error`)),
    ucl = exp(start * (Estimate + 1.96 * `Std. Error`)),
    model = "linear"
  ) -> m1_t_index
summary(m1_f) %>%
  coefficients() %>%
  as.data.frame() %>%
  rownames_to_column("parameter") %>%
  filter(parameter != "(Intercept)") %>%
  mutate(
    year = gsub("fstart", "", parameter) %>%
      as.integer() + 1,
    index = exp(Estimate),
    lcl = exp(Estimate - 1.96 * `Std. Error`),
    ucl = exp(Estimate + 1.96 * `Std. Error`),
    model = "factor"
  ) -> m1_f_index

Notice that the factor model has no index for the reference year 1. Why? Because we have no trend parameter for the reference year. We can argue that there is no change in the reference year and hence the index and its confidence interval should be set at 100%.

Figure 2: Indices based on the trend parameters when the first year is used as reference.

Changing the reference year

What happens if we choose another year as reference year? Suppose we want to use the tenth year of the data as reference. In case of the linear trend we define a new variable \(y_i = x_i + 9\), were \(x_i\) was the variable with the first year as reference. The form of the model doesn’t change and neither do the fitted values change. Hence we get

\[\eta_i = \beta_0 + \beta_1x_i = \gamma_0 + \gamma_1y_i\]

When \(y_i = 0\), \(x_i = 9\) and then \(\beta_0 + 9 \beta_1 = \gamma_0\). Next we set \(x_i = 0\), \(y_i = -9\) and then \(\beta_0 = \gamma_0 + -9 \gamma_1\). We can fill in \(\gamma_0\) from the previous result so \(\beta_0 = \beta_0 + 9 \beta_1 + -9 \gamma_1\) which leads to \(\beta_1 = \gamma_1\). Hence changing the reference year for the linear trend has no effect on the slope. Of course the intercept must change.

Something similar happens with the factor model. We now use another indicator variable \(J_k\) which is always zero at the new reference year. In the first equation \(k = 1\) is used as reference, in the second \(k = i\).

\[\eta_i = \alpha_0 + \alpha_2I_2 + \dots + \alpha_iI_i = \tau_0 + \tau_1J_1 + \tau_2J_2 + \dots + \tau_{i-1}J_{i-1}\]

When \(k = j\) and \(j \ne i\) and \(j \ne 1\), we get \(alpha_0 + \alpha_j = \tau_0 + \tau_j\). \(k = i\), yields \(\alpha_0 + \alpha_i = \tau_0\) which we can substitute in the first equation \(alpha_0 + \alpha_j = \alpha_0 + \alpha_i + \tau_j\) or \(\alpha_j - \alpha_i = \tau_j\). So basically the new intercept \(\tau_0\) is the old intercept \(\alpha_0\) plus the difference between the estimate for the old reference year and the new reference year \(\alpha_i\). All change parameter compensate this by adding \(-\alpha_i\).

Example

In the example we create two new reference variables end and fend. After fitted the new models we ensure that the model fits are identical to the models fit on start and fstart.

ds %>%
  mutate(
    end = year - max(year),
    fend = fct_rev(fstart)
  ) -> ds
m2_t <- glm(count ~ end, family = poisson, data = ds)
m2_f <- glm(count ~ fend, family = poisson, data = ds)
all.equal(fitted(m1_t), fitted(m2_t))

[1] TRUE

all.equal(fitted(m1_f), fitted(m2_f))

[1] TRUE

summary(m2_f)


Call:
glm(formula = count ~ fend, family = poisson, data = ds)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.13581    0.01265 327.069  < 2e-16 ***
fend8       -0.16447    0.01922  -8.558  < 2e-16 ***
fend7        0.02639    0.01883   1.402    0.161    
fend6       -0.16848    0.02074  -8.123 4.56e-16 ***
fend5       -0.28247    0.02266 -12.467  < 2e-16 ***
fend4       -0.21780    0.02361  -9.224  < 2e-16 ***
fend3       -0.38571    0.02735 -14.105  < 2e-16 ***
fend2       -0.69432    0.03503 -19.820  < 2e-16 ***
fend1       -0.85114    0.04508 -18.879  < 2e-16 ***
fend0       -0.76164    0.05987 -12.721  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1812.33  on 549  degrees of freedom
Residual deviance:  581.48  on 540  degrees of freedom
AIC: 3769.5

Number of Fisher Scoring iterations: 4

Next we use nearly the same code to calculate the indices.

summary(m2_t) %>%
  coefficients() %>%
  as.data.frame() %>%
  rownames_to_column("parameter") %>%
  filter(parameter != "(Intercept)") %>%
  merge(
    distinct(ds, end, year)
  ) %>%
  mutate(
    index = exp(end * Estimate),
    lcl = exp(end * (Estimate - 1.96 * `Std. Error`)),
    ucl = exp(end * (Estimate + 1.96 * `Std. Error`)),
    model = "linear"
  ) -> m2_t_index
summary(m2_f) %>%
  coefficients() %>%
  as.data.frame() %>%
  rownames_to_column("parameter") %>%
  filter(parameter != "(Intercept)") %>%
  mutate(
    year = gsub("fend", "", parameter) %>%
      as.integer() + 1,
    index = exp(Estimate),
    lcl = exp(Estimate - 1.96 * `Std. Error`),
    ucl = exp(Estimate + 1.96 * `Std. Error`),
    model = "factor"
  ) -> m2_f_index

The figure below illustrates the new trend. Of course, the indices are shifted so that the last year now equals 100%. Something more important is that change in the confidence intervals.

bind_rows(m2_t_index, m2_f_index) %>%
  ggplot(aes(x = year, y = index, ymin = lcl, ymax = ucl)) +
  geom_hline(yintercept = 1, linetype = 2) +
  geom_ribbon(aes(fill = model), alpha = 0.2) +
  geom_line(aes(colour = model)) +
  scale_x_continuous(breaks = seq_len(n_year)) +
  scale_y_continuous("Index", labels = percent)

Figure 3: Indices based on the trend parameters when the last year is used as reference.

In order to highlight the differences we scaled to indices with the last year as reference so that the first year equals 100%. The estimates for the indices are now identical. However, it is now very clear that the confidence intervals of the linear trend are always narrow near the reference year and become wider when we go further away from the reference year.

The factor trend has, in this case, wider intervals when the first year is used as a reference. Recall that the number of observations in the data was low in the earlier years and steadily increases to a maximum in the last year. As a result the standard error of the intercept in the first factor model is larger than the standard error of the second factor model. As a result, the trend parameters also have larger standard errors.

Figure 4: Indices using both reference years but rescaled so that the first year equals 100%.

IMHO, changing the reference year should not impact the uncertainty associated with the index. Furthermore, the index of the reference year should have some uncertainty attached to it since the abundance in the reference year is estimated from the data just like the abundance in any other year.

Solution: estimate the abundance in each year and then rescale

Estimating the abundance is straightforward, we just need to get a prediction (and its standard error) for each year.

nd <- distinct(ds, year, start, fstart, end, fend)
fit_1t <- predict(m1_t, newdata = nd, se.fit = TRUE)
fit_2t <- predict(m2_t, newdata = nd, se.fit = TRUE)
fit_1f <- predict(m1_f, newdata = nd, se.fit = TRUE)
fit_2f <- predict(m2_f, newdata = nd, se.fit = TRUE)

Once we have the predictions we can calculate the abundance and its confidence interval.

nd %>%
  mutate(
    model = "linear",
    reference = "start",
    abundance = exp(fit_1t$fit),
    abundance_lcl = exp(fit_1t$fit - 1.96 * fit_1t$se.fit),
    abundance_ucl = exp(fit_1t$fit + 1.96 * fit_1t$se.fit)
  ) -> abundance_1t
nd %>%
  mutate(
    model = "linear",
    reference = "end",
    abundance = exp(fit_2t$fit),
    abundance_lcl = exp(fit_2t$fit - 1.96 * fit_2t$se.fit),
    abundance_ucl = exp(fit_2t$fit + 1.96 * fit_2t$se.fit)
  ) -> abundance_2t
nd %>%
  mutate(
    model = "factor",
    reference = "start",
    abundance = exp(fit_1f$fit),
    abundance_lcl = exp(fit_1f$fit - 1.96 * fit_1f$se.fit),
    abundance_ucl = exp(fit_1f$fit + 1.96 * fit_1f$se.fit)
  ) -> abundance_1f
nd %>%
  mutate(
    model = "factor",
    reference = "end",
    abundance = exp(fit_2f$fit),
    abundance_lcl = exp(fit_2f$fit - 1.96 * fit_2f$se.fit),
    abundance_ucl = exp(fit_2f$fit + 1.96 * fit_2f$se.fit)
  ) -> abundance_2f

Notice in the figure below that the estimated abundance and its confidence interval does not depend on the reference year.

# label: fig-abundance
# echo: false
# fig-cap: "Estimated abundance from the different models"
bind_rows(abundance_1f, abundance_1t, abundance_2f, abundance_2t) %>%
  ggplot(
    aes(x = year, y = abundance, ymin = abundance_lcl, ymax = abundance_ucl)
  ) +
  geom_ribbon(aes(fill = reference), alpha = 0.2) +
  geom_line(aes(colour = reference)) +
  scale_x_continuous(breaks = seq_len(n_year)) +
  facet_wrap(~model)

Now the only thing left to do is to rescale the abundance so that the index at the reference year becomes 100%. This is very easy: find the estimated abundance at the reference year and divided all abundance estimated and their confidence intervals with this single value.

abundance_1t %>%
  filter(year == 1) %>%
  select(model, base = abundance) %>%
  inner_join(abundance_1t, by = "model") %>%
  mutate(
    index = abundance / base,
    lcl = abundance_lcl / base,
    ucl = abundance_ucl / base
  ) -> m1_t_index_good
abundance_1f %>%
  filter(year == 1) %>%
  select(model, base = abundance) %>%
  inner_join(abundance_1f, by = "model") %>%
  mutate(
    index = abundance / base,
    lcl = abundance_lcl / base,
    ucl = abundance_ucl / base
  ) -> m1_f_index_good
abundance_2t %>%
  filter(year == 10) %>%
  select(model, base = abundance) %>%
  inner_join(abundance_2t, by = "model") %>%
  mutate(
    index = abundance / base,
    lcl = abundance_lcl / base,
    ucl = abundance_ucl / base
  ) -> m2_t_index_good
abundance_2f %>%
  filter(year == 10) %>%
  select(model, base = abundance) %>%
  inner_join(abundance_2f, by = "model") %>%
  mutate(
    index = abundance / base,
    lcl = abundance_lcl / base,
    ucl = abundance_ucl / base
  ) -> m2_f_index_good

Figure 5: Indices based on rescaled abundance.

Conclusion

Using rescaled abundances as indices, yields correct confidence intervals for the indices and they don’t depend on the choice of the reference year. An added bonus is that this workflow yields a correct confidence interval for the index at the reference year.

Using the predicted abundances has a few more nice features:

You don’t need tweak the year covariate so that the trend parameters are relative to the reference year.
You can rescale the index afterwards to any other reference year within the data, without the need to refit the model with an appropriately tweaked year covariate.
You can use more complex models. E.g. a model with a separate intercept and trend for each stratum. The predicted abundance could be some weighted average over the strata.

Session info

These R packages were used to create this post.

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       X11
 language nl_BE:nl
 collate  nl_BE.UTF-8
 ctype    nl_BE.UTF-8
 tz       Europe/Brussels
 date     2023-09-02
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date (UTC) lib source
 cli           3.6.1      2023-03-23 [1] CRAN (R 4.3.0)
 colorspace    2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
 digest        0.6.32     2023-06-26 [1] CRAN (R 4.3.1)
 dplyr       * 1.1.2      2023-04-20 [1] CRAN (R 4.3.0)
 evaluate      0.21       2023-05-05 [1] CRAN (R 4.3.0)
 fansi         1.0.4      2023-01-22 [1] CRAN (R 4.3.0)
 farver        2.1.1      2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0      2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.2      2023-04-03 [1] CRAN (R 4.3.0)
 glue          1.6.2      2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3      2023-03-21 [1] CRAN (R 4.3.0)
 hms           1.1.3      2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.5      2023-03-23 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2      2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.7      2023-06-29 [1] CRAN (R 4.3.1)
 knitr         1.43       2023-05-25 [1] CRAN (R 4.3.0)
 labeling      0.4.2      2020-10-20 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2.9000 2023-05-15 [1] https://inbo.r-universe.dev (R 4.3.0)
 magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0      2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.1      2023-01-10 [1] CRAN (R 4.3.0)
 R6            2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.4      2023-02-10 [1] CRAN (R 4.3.0)
 rlang         1.1.1      2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.23       2023-07-01 [1] CRAN (R 4.3.1)
 rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.3.0)
 scales      * 1.2.1      2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
 stringi       1.7.12     2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0      2022-12-02 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0      2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0      2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0      2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0      2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.3      2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.3      2023-06-14 [1] CRAN (R 4.3.0)
 withr         2.5.0      2022-03-03 [1] CRAN (R 4.3.0)
 xfun          0.39       2023-04-20 [1] CRAN (R 4.3.0)
 yaml          2.3.7      2023-01-23 [1] CRAN (R 4.3.0)

 [1] /home/thierry/R/x86_64-pc-linux-gnu-library/4.3
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────