This example demonstrates how SEM forests can be grown. SEM forests are ensembles of typically hundreds to thousands of SEM trees. Using permutation-based variable importance estimates, we can quantify how much each predictor contributes to improving model fit, aggregated across all trees in the forest.
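To make the permutation idea concrete, here is a minimal conceptual sketch, not semtree's actual implementation; the function permutation_importance and its misfit argument are hypothetical names used only for illustration. A predictor's importance is the average increase in misfit observed when its values are randomly shuffled, which breaks the predictor's association with the modeled variables.
# Conceptual sketch only (hypothetical helper, not the semtree API):
# shuffle one predictor and record how much the model misfit increases.
permutation_importance <- function(misfit, data, predictor, n_perm = 10) {
  baseline <- misfit(data)  # misfit (e.g., -2lnL) on the original data
  increases <- replicate(n_perm, {
    shuffled <- data
    shuffled[[predictor]] <- sample(shuffled[[predictor]])  # permute the predictor
    misfit(shuffled) - baseline  # increase in misfit after shuffling
  })
  mean(increases)  # average over permutations
}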
Here, we use the affect dataset and a simple SEM with only a single observed variable and no latent variables.
We load the affect dataset from the psychTools package. These are data from two studies conducted in the Personality, Motivation and Cognition Laboratory at Northwestern University to study affect dimensionality and its relationship to various personality dimensions.
library(psychTools)
data(affect)
knitr::kable(head(affect))
Study | Film | ext | neur | imp | soc | lie | traitanx | state1 | EA1 | TA1 | PA1 | NA1 | EA2 | TA2 | PA2 | NA2 | state2 | MEQ | BDI |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
maps | 3 | 18 | 9 | 7 | 10 | 3 | 24 | 22 | 24 | 14 | 26 | 2 | 6 | 5 | 7 | 4 | NA | NA | 0.0476190 |
maps | 3 | 16 | 12 | 5 | 8 | 1 | 41 | 40 | 9 | 13 | 10 | 4 | 4 | 14 | 5 | 5 | NA | NA | 0.3333333 |
maps | 3 | 6 | 5 | 3 | 1 | 2 | 37 | 44 | 1 | 14 | 4 | 2 | 2 | 15 | 3 | 1 | NA | NA | 0.1904762 |
maps | 3 | 12 | 15 | 4 | 6 | 3 | 54 | 40 | 5 | 15 | 1 | 0 | 4 | 15 | 0 | 2 | NA | NA | 0.3846154 |
maps | 3 | 14 | 2 | 5 | 6 | 3 | 39 | 67 | 12 | 20 | 7 | 13 | 14 | 15 | 16 | 13 | NA | NA | 0.3809524 |
maps | 1 | 6 | 15 | 2 | 4 | 5 | 51 | 38 | 9 | 14 | 5 | 1 | 7 | 12 | 2 | 2 | NA | NA | 0.2380952 |
# Recode categorical predictors: Film as a nominal factor, lie and imp as ordered factors
affect$Film <- as.factor(affect$Film)
affect$lie <- as.ordered(affect$lie)
affect$imp <- as.ordered(affect$imp)
The following code implements a simple SEM with only a single manifest variable and two parameters: the mean of state anxiety after having watched a movie (state2), \(\mu\), and the variance of state anxiety, \(\sigma^2\).
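For reference, fitting this model to the observed (non-missing) state2 scores by maximum likelihood amounts to minimizing the familiar univariate normal misfit
\[
-2\ln L(\mu, \sigma^2) = \sum_{i=1}^{N} \left[ \ln\left(2\pi\sigma^2\right) + \frac{(x_i - \mu)^2}{\sigma^2} \right],
\]
where \(x_i\) is the state2 score of person \(i\) and \(N\) is the number of observed scores.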
library(OpenMx)
manifests <- c("state2")
latents <- c()
model <- mxModel("Univariate Normal Model",
  type = "RAM",
  manifestVars = manifests,
  latentVars = latents,
  # mean of state2 (freely estimated, starting value 50)
  mxPath(from = "one", to = manifests, free = TRUE,
         values = 50.0, arrows = 1, labels = "mu"),
  # variance of state2 (freely estimated, starting value 100)
  mxPath(from = manifests, to = manifests, free = TRUE,
         values = 100.0, arrows = 2, labels = "sigma2"),
  mxData(affect, type = "raw")
)
result <- mxRun(model)
#> Running Univariate Normal Model with 2 parameters
These are the estimates of the model when run on the entire sample:
summary(result)
#> Summary of Univariate Normal Model
#>
#> free parameters:
#> name matrix row col Estimate Std.Error A
#> 1 sigma2 S state2 state2 115.05414 12.4793862
#> 2 mu M 1 state2 42.45118 0.8226717
#>
#> Model Statistics:
#> | Parameters | Degrees of Freedom | Fit (-2lnL units)
#> Model: 2 168 1289.158
#> Saturated: 2 168 NA
#> Independence: 2 168 NA
#> Number of observations/statistics: 330/170
#>
#> Information Criteria:
#> | df Penalty | Parameters Penalty | Sample-Size Adjusted
#> AIC: 953.1576 1293.158 1293.194
#> BIC: 314.9100 1300.756 1294.412
#> CFI: NA
#> TLI: 1 (also known as NNFI)
#> RMSEA: 0 [95% CI (NA, NA)]
#> Prob(RMSEA <= 0.05): NA
#> To get additional fit indices, see help(mxRefModels)
#> timestamp: 2024-04-15 22:33:05
#> Wall clock time: 0.02428603 secs
#> optimizer: SLSQP
#> OpenMx version number: 2.21.1
#> Need help? See help(mxSummary)
Next, we create a forest control object that stores all tuning parameters of the forest. Note that we use only 5 trees for illustration; in real applications, increase this to several hundred. To speed up computation, consider using score-based tests for variable selection in the trees (a sketch follows the control printout below).
library(semtree)
control <- semforest_control(num.trees = 5)
print(control)
#> SEM-Forest control:
#> -----------------
#> Number of Trees: 5
#> Sampling: subsample
#> Comparisons per Node: 2
#>
#> SEM-Tree control:
#> ▔▔▔▔▔▔▔▔▔▔
#> ● Splitting Method: fair
#> ● Alpha Level: 1
#> ● Bonferroni Correction:FALSE
#> ● Minimum Number of Cases: 20
#> ● Maximum Tree Depth: NA
#> ● Number of CV Folds: 5
#> ● Exclude Heywood Cases: FALSE
#> ● Test Invariance Alpha Level: NA
#> ● Use all Cases: FALSE
#> ● Verbosity: FALSE
#> ● Progress Bar: TRUE
#> ● Seed: NA
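As noted above, score-based tests can speed up variable selection considerably. The following is a minimal sketch of how this might be requested, assuming that recent semtree versions accept method = "score" in semtree.control() and that the forest control stores the nested tree control in an element named semtree.control; check str(control) against your installed version before relying on this.
# Hypothetical sketch: ask every tree in the forest to use score-based tests.
# The element name semtree.control and the method = "score" option are
# assumptions about the installed semtree version; verify with str(control).
fast_control <- semforest_control(num.trees = 5)
fast_control$semtree.control <- semtree.control(method = "score")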
Now, run the forest using the control object:
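The forest-growing call itself is not shown in the rendered output above; a sketch consistent with the objects defined so far, assuming semtree's semforest() with its default behavior of treating all columns of affect that are not part of the model as candidate predictors, would be:
# Grow the SEM forest from the model template, the affect data, and the
# control object defined above (this can take a while as the number of trees grows).
forest <- semforest(model, affect, control = control)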
Next, we compute permutation-based variable importance. This may take some time.
vim <- varimp(forest)
print(vim, sort.values=TRUE)
#> Variable Importance
#> Study PA2 state1 TA2 Film
#> -9.659311e-08 9.418006e+00 1.218599e+01 2.066494e+01 2.537268e+01
#> NA2
#> 3.658740e+01
plot(vim)
From this, we can learn that variables such as NA2, representing negative affect (after the movie), TA2, representing tense arousal (after the movie), and state1, representing state anxiety before having watched the movie, are the best predictors of differences in the distribution of state anxiety after having watched the movie (in either mean, variance, or both).