Modeling of indicator responses to single pressures with GAMs

model_gam applies Generalized Additive Models (GAMs) to each IND~pressure combination created in ind_init and returns a tibble with IND~pressure-specific GAM outputs.

model_gam(init_tbl, k = 5, family = stats::gaussian(), excl_outlier = NULL)

Arguments

init_tbl	The output tibble of the `ind_init` function.
k	Choice of knots (for the smoothing function `s`); the default is 5.
family	A description of the error distribution and link to be used in the GAM. This needs to be defined as a family function (see also `family`). All standard family functions can be used as well some of the distribution families in the mgcv package (see `family.mgcv`; e.g.`negbin` or `nb`).
excl_outlier	A list of values identified as outliers in specific IND~pressure GAMs, which should be excluded in this modeling step (the output tibble of this function includes the variable `pres_outlier`, which is a column-list containing all indices of values with cook`s distance > 1 (see below). The function can be re-run again, then excluding all these outliers provided in `$pres_outlier` from the the first run (see example)).

Value

The function returns a tibble, which is a trimmed down version of the data.frame(), including the following elements:

id: Numerical IDs for the IND~press combinations.
ind: Indicator names.
press: Pressure names.
model_type: Specification of the model type; at this stage containing only "gam" (Generalized Additive Model).
corrstruc: Specification of the correlation structure; at this stage containing only "none".
aic: AIC of the fitted models
edf: Estimated degrees of freedom for the model terms.
p_val: The p values for the smoothing term (the pressure).
signif_code: The significance codes for the p-values.
r_sq: The adjusted r-squared for the models. Defined as the proportion of variance explained, where original variance and residual variance are both estimated using unbiased estimators. This quantity can be negative if your model is worse than a one parameter constant model, and can be higher for the smaller of two nested models.
expl_dev: The proportion of the null deviance explained by the models.
nrmse: Absolute values of the root mean square error normalized by the standard deviation (NRMSE).
ks_test: The p-values from a Kolmogorov-Smirnov Test applied on the model residuals to test for normal distribution. P-values > 0.05 indicate normally distributed residuals.
tac: logical; indicates whether temporal autocorrelation (TAC) was detected in the residuals. TRUE if model residuals show TAC. NAs in the time series due to real missing values, test data extraction or exclusion of outliers are explicitly considered. The test is based on the following condition: if any of the acf and pacf values of lag 1 - 5 are greater than 0.4 or lower than -0.4, a TRUE is returned.
pres_outlier: A list-column with all indices of values identified as outliers in each model (i.e.cook`s distance > 1). The indices present the position in the training data, including NAs.
excl_outlier: A list-column listing all outliers per model that have been excluded in the GAM fitting
model: A list-column of IND~press-specific gam objects that contain additionally the logical vector indicating missing values ($train_na).

Details

To evaluate the IND`s sensitivity and robustness time series of the IND are modeled as a smoothing function of one single pressure variable (using a subset of the data as training dataset, e.g. excluding the years of the annual time series). The GAMs are build using the default settings in the gam function and the smooth term function s). However, the user can adjust the distribution and link by modifying the family argument as well as the maximum level of non-linearity by setting the number of knots:

gam(ind ~ s(press, k = k), family = family, data = training_data)

In the presence of significant temporal auto-correlation, GAMs should be extended to Generalized Additive Mixed Models (GAMMs) by including auto-regressive error structures to correct for the auto-correlation (Pinheiro and Bates, 2000). This is implemented in the function model_gamm.

The returned tibble contains various model outputs needed for scoring the sensitivity and robustness subcriteria:

p_val to identify whether an IND responds to a specific pressure
r_sq for the strength of the IND response
edf for the non-linearity of the IND response
nrmse for the robustness of the established IND~pressure relationship

The robustness of the modeled pressure relationship based on the training data is evaluated by measuring how well the model prediction matches the test dataset, e.g. the last years. This is quantified by computing the absolute value of the normalized root mean square error (NRMSE) on the test dataset. The normalization to the mean of the observed test data allows for comparisons and a general scoring of the model robustness across INDs with different scales or units.

References

Pinheiro, J.C., Bates, D.M. (2000) Mixed-Effects Models in S and S-Plus. Springer, New York, 548pp.

Examples

# Using the Baltic Sea demo data in this package
dat_init <- ind_init(
  ind_tbl = ind_ex[, c("Sprat", "Cod")],
  press_tbl = press_ex[, c("Tsum", "Swin", "Fcod", "Fher")],
  time = ind_ex[ ,1])
gam_tbl <- model_gam(dat_init)
# Any outlier?
gam_tbl$pres_outlier
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL
#> 
#> [[5]]
#> NULL
#> 
#> [[6]]
#> NULL
#> 
#> [[7]]
#> 1991 
#>   13 
#> 
#> [[8]]
#> 2000 
#>   22 
#> 
# Exclude outliers by passing this list as input:
gam_tbl_out <- model_gam(dat_init, excl_outlier = gam_tbl$pres_outlier)

# \donttest{
 # Using another error distribution
 ind_sub <- round(exp(ind_ex[ ,c(2,8,9)]),0) # to unlog data and convert to integers
 ind_tbl2 <- ind_init(ind_sub, press_ex, time = ind_ex$Year)
 model_gam(ind_tbl2, family = poisson(link="log"))
#> # A tibble: 24 × 16
#>       id ind   press  model_type corrstruc     aic   edf p_val signif_code    r_sq
#>    <int> <chr> <chr>  <chr>      <chr>       <dbl> <dbl> <dbl> <chr>         <dbl>
#>  1     1 TZA   Year   gam        none      111636.  4.00     0 ***          0.427 
#>  2     2 TZA   Tsum   gam        none      166179.  3.99     0 ***          0.256 
#>  3     3 TZA   Swin   gam        none      155118.  4.00     0 ***          0.237 
#>  4     4 TZA   Pwin   gam        none      228147.  3.96     0 ***         -0.113 
#>  5     5 TZA   Nwin   gam        none      189112.  4.00     0 ***          0.0710
#>  6     6 TZA   Fsprat gam        none       96200.  4.00     0 ***          0.547 
#>  7     7 TZA   Fher   gam        none      144579.  3.99     0 ***          0.283 
#>  8     8 TZA   Fcod   gam        none      179934.  3.98     0 ***          0.0747
#>  9     9 Sprat Year   gam        none       88118.  3.99     0 ***          0.516 
#> 10    10 Sprat Tsum   gam        none      174764.  3.99     0 ***         -0.0198
#> # … with 14 more rows, and 6 more variables: expl_dev <dbl>, nrmse <dbl>,
#> #   ks_test <dbl>, tac <lgl>, pres_outlier <list>, model <list>
# }