Load the Package

We first load the semtree package and the OpenMx package for specifying our SEM.

library(semtree)
#> Loading required package: OpenMx
library(OpenMx)

Simulate data

Now, we simulate some data from a linear latent growth curve model of educational attainment over time (that is, a random intercept and random slope over time). The dataset will be called growth_data. The dataset contains five observations for each individual (X1 to X5) and two predictors representing children’s SES (low/high) and the preschool quality.

set.seed(23)
N <- 1000
M <- 5
icept <- rnorm(N, 10, sd = 4)
slope <- rnorm(N, 3, sd = 1.2)
ses <- factor(sample(c("low", "high"), size = N, replace = TRUE))
preschool_quality <- sample(c(1:5), N, TRUE)
loadings <- 0:4
x <-
  (slope + as.numeric(ses) * 5) %*% t(loadings) + 
  matrix(rep(icept+preschool_quality+as.numeric(ses), each = M), byrow = TRUE, ncol = M) + 
  rnorm(N * M, sd = .08)
growth_data <- data.frame(x, factor(ses), ordered(preschool_quality))
names(growth_data) <- c(paste0("X", 1:M), "SES","Preschool Quality")

Specify an OpenMx model

Now, we specify a linear latent growth curve model using OpenMx’s path specification. The model has five observed variables. Residual variances are assumed to be identical over time.

manifests <- names(growth_data)[1:5]
growthCurveModel <- mxModel("Linear Growth Curve Model Path Specification",
    type="RAM",
       manifestVars=manifests,
    latentVars=c("intercept","slope"),
    mxData(growth_data, type="raw"),
    # residual variances
    mxPath(
        from=manifests,
        arrows=2,
        free=TRUE,
        values = c(.1, .1, .1, .1, .1),
        labels=c("residual","residual","residual","residual","residual")
    ),
    # latent variances and covariance
    mxPath(
        from=c("intercept","slope"),
        arrows=2,
        connect="unique.pairs",
        free=TRUE,
        values=c(2, 0, 1),
        labels=c("vari", "cov", "vars")
    ),
    # intercept loadings
    mxPath(
        from="intercept",
        to=manifests,
        arrows=1,
        free=FALSE,
        values=c(1, 1, 1, 1, 1)
    ),
    # slope loadings
    mxPath(
        from="slope",
        to=manifests,
        arrows=1,
        free=FALSE,
        values=c(0, 1, 2, 3, 4)
    ),
    # manifest means
    mxPath(
        from="one",
        to=manifests,
        arrows=1,
        free=FALSE,
        values=c(0, 0, 0, 0, 0)
    ),
    # latent means
    mxPath(
        from="one",
        to=c("intercept", "slope"),
        arrows=1,
        free=TRUE,
        values=c(1, 1),
        labels=c("meani", "means")
    )
) # close model

# fit the model to the entire dataset
growthCurveModel <- mxRun(growthCurveModel)
#> Running Linear Growth Curve Model Path Specification with 6 parameters

Run a tree

Now, we grow a SEM tree using the semtree function, which takes the model and the dataset as input. If not specified otherwise, SEM tree will assume that all variables in the dataset, which are not observed variables in the dataset are potential predictors.

tree <- semtree(model = growthCurveModel, 
                data = growth_data)

Plotting

Once the tree is grown, we can plot it:

plot(tree)

Table

We can also obtain the parameter estimates for each leaf (terminal node) of the tree. Here we use knitr to pretty print the table:

knitr::kable(
  toTable(tree)
)
Preschool Quality SES residual vari cov vars meani means
<= 3 & <= 2 not ( high ) 0.006 14.792 0.478 1.538 13.715 12.967
<= 3 & > 2 not ( high ) 0.007 14.384 0.088 1.098 14.940 13.052
> 3 not ( high ) 0.006 17.054 -0.052 1.439 17.115 12.980
<= 4 & <= 3 high 0.006 17.533 -0.003 1.417 12.871 8.011
<= 4 & > 3 high 0.006 18.046 -0.743 1.388 14.718 7.831
> 4 high 0.007 13.541 0.541 1.438 16.438 7.923

Pruning

Large trees can be pruned back to display only the first few levels of splits. The argument max.depth determines the maximal depth of the resulting tree by counting the maximal number of decision nodes until a leaf is reached. For example, max.depth=1 results in trees that only show one or zero decision nodes.

plot( prune(tree, max.depth=1))

Hyper-parameters

There are various (hyper)parameters that govern the tree growing process. These hyperparameters can be adjusted by a semtree.control object that can be passed to the semtree() function. For example, the following control object specifies that:

  • the significance level for the likelihood ratio tests to determine splitting is set to 1% (instead of the default 5%),
  • the minimum number of cases in each leaf needs to be 25 if further splits should be investigated
  • the maximum depth of the tree is 5 decisions
  • Bonferroni-correction is used when determining significance of splits
tree_control = semtree.control(alpha = 0.01,
                          min.N = 25,
                          max.depth = 5,
                          bonferroni = TRUE)

Here is a tree that is grown using the adjusted hyper-parameters in tree_control.

tree <- semtree(model = growthCurveModel, 
                data = growth_data,
                control = tree_control)
#> Beginning initial fit attemptFit attempt 0, fit=7745.53852713508, new current best! (was 7745.53852713964)                                                                             Beginning initial fit attemptFit attempt 0, fit=2836.90806227974, new current best! (was 3687.59332338577)                                                                             Beginning initial fit attemptFit attempt 0, fit=1713.80837616561, new current best! (was 1739.87876535343)                                                                             Beginning initial fit attemptFit attempt 0, fit=1061.98006591109, new current best! (was 1097.02929672133)                                                                             Beginning initial fit attemptFit attempt 0, fit=3206.09542712537, new current best! (was 4057.94520393808)                                                                             Beginning initial fit attemptFit attempt 0, fit=2454.87187575858, new current best! (was 2466.44438222973)                                                                             Beginning initial fit attemptFit attempt 0, fit=691.794906946785, new current best! (was 739.651045009226)                                                                              Tree construction finished [took 3s].

plot(tree)