ind_init combines the time vector and the indicator (IND) and pressure data into one tibble with defined training and test observations. All INDs are combined with all pressures provided as input.

ind_init(ind_tbl, press_tbl, time, train = 0.9, random = FALSE)

Arguments

ind_tbl

A data frame, matrix or tibble containing only the (numeric) IND variables. A single indicator should be coerced to a data frame to keep its name; if it is passed as a vector, the default name `ind` is used.

press_tbl

A data frame, matrix or tibble containing only the (numeric) pressure variables. A single pressure should be coerced to a data frame to keep its name; if it is passed as a vector, the default name `press` is used.

time

A vector containing the actual time steps (e.g. years; should be the same as in the IND and pressure data).

train

The proportion of observations that should go into the training data on which the GAMs are later fitted. Has to be a numeric value between 0 and 1; the default is 0.9.

random

logical; should the observations for the training data be randomly chosen? Default is FALSE, so that the last time units (years) are chosen as test data.

Value

The function returns a tibble (a trimmed-down version of a data frame) with the following columns:

id

Numerical IDs for the IND~press combinations.

ind

Indicator names. These might be modified to exclude any character that cannot be used in the model formula (e.g. hyphens, brackets, etc. are replaced by an underscore; names starting with a number get an x before the number).

press

Pressure names. These might be modified to exclude any character that cannot be used in the model formula (e.g. hyphens, brackets, etc. are replaced by an underscore; names starting with a number get an x before the number).

ind_train

A list-column with indicator values of the training data.

press_train

A list-column with pressure values of the training data.

time_train

A list-column with the time steps of the training data.

ind_test

A list-column with indicator values of the test data.

press_test

A list-column with pressure values of the test data.

time_test

A list-column with the time steps of the test data.

train_na

logical; indicates the joint missing values in the training IND and pressure data. That includes the original NAs as well as randomly selected test observations that are within the training period. This vector is needed later for the determination of temporal autocorrelation.
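
The name modifications described for ind and press above can be approximated in base R. This is only a rough sketch: the helper `clean_names()` and its exact replacement rules are assumptions made for illustration and are not taken from the package source, so the names ind_init actually returns may differ.

```r
# Hypothetical helper (not part of the package): approximates the described
# renaming with two regular-expression substitutions.
clean_names <- function(x) {
  # Assumption: every character other than letters, digits, "_" and "."
  # is replaced by an underscore
  x <- gsub("[^A-Za-z0-9_.]", "_", x)
  # Names starting with a digit get an "x" in front of the number
  sub("^([0-9])", "x\\1", x)
}

# Made-up variable names, chosen only to trigger each rule
cleaned <- clean_names(c("Cod-SSB", "Temp(sum)", "3yr_mean"))
cleaned
```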

Details

ind_init will combine every column in ind_tbl with every column in press_tbl so that each row will represent one IND~press combination. The input data will be split into a training and a test data set. The returned tibble is the basis for all IND~pressure modeling functions.
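
The crossing and the default chronological split can be pictured with a small base R sketch. The toy values, the object name `combos` and the use of `round()` to size the training set are assumptions for illustration; the package's own implementation may differ in these details, and the sketch returns a plain data frame rather than a tibble.

```r
# Toy inputs (made-up values, not the package's demo data)
ind_tbl <- data.frame(
  TZA = c(1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0, 1.4, 0.7, 1.6),
  MS  = c(3.1, 2.9, 3.4, 3.0, 2.8, 3.3, 3.2, 2.7, 3.5, 3.6)
)
press_tbl <- data.frame(
  Tsum = c(15, 16, 14, 17, 15, 16, 18, 17, 16, 19),
  Swin = c(7.1, 7.3, 6.9, 7.0, 7.2, 7.4, 6.8, 7.5, 7.1, 7.0)
)
time  <- 2001:2010
train <- 0.8

# One row per IND~pressure combination; the pressure varies fastest
combos <- expand.grid(press = names(press_tbl), ind = names(ind_tbl),
                      stringsAsFactors = FALSE)
combos <- data.frame(id = seq_len(nrow(combos)),
                     ind = combos$ind, press = combos$press,
                     stringsAsFactors = FALSE)

# Chronological split (as with random = FALSE): the last time steps
# become the test data. Rounding with round() is an assumption.
n_train   <- round(length(time) * train)
train_idx <- seq_len(n_train)
test_idx  <- setdiff(seq_along(time), train_idx)

# List-columns holding the training and test values of each combination
combos$ind_train   <- lapply(combos$ind,   function(v) ind_tbl[[v]][train_idx])
combos$press_train <- lapply(combos$press, function(v) press_tbl[[v]][train_idx])
combos$ind_test    <- lapply(combos$ind,   function(v) ind_tbl[[v]][test_idx])
combos$press_test  <- lapply(combos$press, function(v) press_tbl[[v]][test_idx])

combos[, c("id", "ind", "press")]
```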

If not all IND~pressure combinations should be modeled, the respective rows can simply be removed from the output tibble. Alternatively, ind_init can be applied several times to subsets of the data and the output tibbles merged afterwards, e.g. with bind_rows.
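
Both routes for restricting the modeled combinations can be sketched as follows. `init_tbl` is a hand-built stand-in for an ind_init output that mocks only the identifying columns, so the snippet runs without the package; the chosen combinations are arbitrary.

```r
# Stand-in for an ind_init() output; only id, ind and press are mocked
init_tbl <- data.frame(
  id    = 1:4,
  ind   = c("TZA", "TZA", "MS", "MS"),
  press = c("Tsum", "Swin", "Tsum", "Swin"),
  stringsAsFactors = FALSE
)

# Route 1: drop the combinations that should not be modeled
keep       <- !(init_tbl$ind == "MS" & init_tbl$press == "Swin")
subset_tbl <- init_tbl[keep, ]

# Route 2: stack separately created tables. rbind() is used here for the
# mocked data frames; dplyr::bind_rows() plays the same role for tibbles.
part_a     <- init_tbl[init_tbl$ind == "TZA", ]
part_b     <- init_tbl[init_tbl$ind == "MS" & init_tbl$press == "Tsum", ]
merged_tbl <- rbind(part_a, part_b)

merged_tbl
```

When the pieces come from separate ind_init calls, it is worth checking whether the id values are still unique after merging.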

See also

tibble and vignette("tibble") for more information on tibbles

Other IND~pressure modeling functions: find_id(), model_gamm(), model_gam(), plot_diagnostics(), plot_model(), scoring(), select_model(), test_interaction()

Examples

# Using the Baltic Sea demo data in this package
press_tbl <- press_ex[ ,-1] # excl. Year
ind_tbl <- ind_ex[ ,-1] # excl. Year
time <- ind_ex[ ,1]
# Assign randomly 50% of the observations as training data and
# the other 50% as test data
ind_init(ind_tbl, press_tbl, time, train = 0.5, random = TRUE)
#> # A tibble: 84 × 10
#>       id ind   press  ind_train  press_train time_train ind_test   press_test
#>    <int> <chr> <chr>  <list>     <list>      <list>     <list>     <list>
#>  1     1 TZA   Tsum   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#>  2     2 TZA   Swin   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#>  3     3 TZA   Pwin   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#>  4     4 TZA   Nwin   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#>  5     5 TZA   Fsprat <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#>  6     6 TZA   Fher   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#>  7     7 TZA   Fcod   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#>  8     8 MS    Tsum   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#>  9     9 MS    Swin   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#> 10    10 MS    Pwin   <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#> # … with 74 more rows, and 2 more variables: time_test <list>, train_na <list>

# To keep the name when testing only one indicator and pressure, coerce both
# vectors to data frames
ind_init(ind_tbl = data.frame(MS = ind_tbl$MS),
  press_tbl = data.frame(Tsum = press_tbl$Tsum), time, train = .5,
  random = TRUE)
#> # A tibble: 1 × 10
#>      id ind   press ind_train  press_train time_train ind_test   press_test
#>   <int> <chr> <chr> <list>     <list>      <list>     <list>     <list>
#> 1     1 MS    Tsum  <dbl [15]> <dbl [15]>  <int [15]> <dbl [15]> <dbl [15]>
#> # … with 2 more variables: time_test <list>, train_na <list>