Title: | Fused Partitioned Regression for Clinical and Omics Data |
Version: | 1.0.1 |
Description: | Fit (generalized) linear regression models in each leaf node of a tree. The tree is constructed using clinical variables only. The linear regression models are constructed using (high-dimensional) omics variables only. The leaf-node-specific regression models are estimated using the penalized likelihood including a standard ridge (L2) penalty and a fusion penalty that links the leaf-node-specific regression models to one another. The intercepts of the leaf nodes reflect the effects of the clinical variables and are left unpenalized. The tree, fitted with the clinical variables only, should be constructed outside of the package with the 'rpart' 'R' package. See Goedhart and others (2024) <doi:10.48550/arXiv.2411.02396> for details on the method. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | Matrix, splitTools, survival, treeClust |
Suggests: | rpart |
NeedsCompilation: | no |
Packaged: | 2025-07-07 16:52:44 UTC; P072713 |
Author: | Jeroen M. Goedhart
|
Maintainer: | Jeroen M. Goedhart <jeroengoed@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-07-11 12:30:10 UTC |
Create balanced cross-validation folds for hyperparameter tuning
Description
Constructs repeated K-fold cross-validation folds, balanced with respect to the fitted tree structure and outcome (if applicable). The folds contain only the test sample indices. This function is useful for tuning penalty parameters in the fusedTree model.
Usage
CVfoldsTree(Y, Tree, Z, model = NULL, kfold = 5, nrepeat = 3)
Arguments
Y |
The response variable. Should be:
Only right-censored survival data is currently supported. |
Tree |
A fitted decision tree, typically created using |
Z |
A |
model |
Character. Specifies the type of outcome model. Must be one of:
|
kfold |
Integer. Number of folds K for cross-validation. Defaults to 5. |
nrepeat |
Integer. Number of times the K-fold cross-validation is repeated. Defaults to 3. |
Details
For binary and survival outcomes, the function ensures that the proportion of cases vs. controls (or events vs. censored observations) remains relatively constant across folds. In addition, samples are balanced across the leaf nodes of the fitted tree to ensure consistency in node composition between folds.
Value
A list of length kfold × nrepeat
, where each element contains
the test indices for a specific fold. These indices can be used to
systematically split the data during cross-validation.
Examples
p = 5 # number of omics variables (low for illustration)
p_Clin = 5 # number of clinical variables
N = 100 # sample size
# simulate from Friedman-like function
g <- function(z) {
15 * sin(pi * z[,1] * z[,2]) + 10 * (z[,3] - 0.5)^2 + 2 * exp(z[,4]) + 2 * z[,5]
}
Z <- as.data.frame(matrix(runif(N * p_Clin), nrow = N))
X <- matrix(rnorm(N * p), nrow = N) # omics data
betas <- c(1,-1,3,4,2) # omics effects
Y <- g(Z) + X %*% betas + rnorm(N) # continuous outcome
Y <- as.vector(Y)
dat = cbind.data.frame(Y, Z) #set-up data correctly for rpart
rp <- rpart::rpart(Y ~ ., data = dat,
control = rpart::rpart.control(xval = 5, minbucket = 10),
model = TRUE)
cp = rp$cptable[,1][which.min(rp$cptable[,4])] # best model according to pruning
Treefit <- rpart::prune(rp, cp = cp)
plot(Treefit)
folds <- CVfoldsTree(Y = Y, Tree = Treefit, Z = Z, model = "linear")
Construct design data used for fitting fusedTree models
Description
Prepares the full data design used to fit a fusedTree model, including dummy-encoded clinical leaf node indicators, optional continuous clinical variables, and a block-diagonal omics matrix structured per tree node.
Usage
Dat_Tree(Tree, X, Z, LinVars = TRUE)
Arguments
Tree |
A fitted tree object, created using |
X |
A numeric omics data matrix with dimensions
(sample size × number of omics variables). Must be a |
Z |
A |
LinVars |
Logical. Whether to include continuous clinical variables
linearly in the model (in addition to tree clustering). Recommended,
as trees may not capture linear effects well. Defaults to |
Details
This function allows users to inspect the exact data structure used in
fusedTree model fitting. The PenOpt()
and fusedTreeFit()
functions call this function internally so no need to call this function
to set-up the right data format. It is just meant for users to check what
is going on.
Value
A list with the following components:
- Clinical
A matrix encoding the clinical structure:
Dummy variables representing membership to leaf nodes of the tree,
Continuous clinical covariates (if
LinVars = TRUE
).
Each row corresponds to a sample.
- Omics
A matrix of omics data per leaf node. This matrix has dimensions: sample size × (number of leaf nodes × number of omics variables). For each observation, only the block of omics variables corresponding to its tree node is populated (other blocks are set to zero).
#'
Examples
p = 5 # number of omics variables (low for illustration)
p_Clin = 5 # number of clinical variables
N = 100 # sample size
# simulate from Friedman-like function
g <- function(z) {
15 * sin(pi * z[,1] * z[,2]) + 10 * (z[,3] - 0.5)^2 + 2 * exp(z[,4]) + 2 * z[,5]
}
Z <- as.data.frame(matrix(runif(N * p_Clin), nrow = N))
X <- matrix(rnorm(N * p), nrow = N) # omics data
betas <- c(1,-1,3,4,2) # omics effects
Y <- g(Z) + X %*% betas + rnorm(N) # continuous outcome
Y <- as.vector(Y)
dat = cbind.data.frame(Y, Z) #set-up data correctly for rpart
library(rpart)
rp <- rpart::rpart(Y ~ ., data = dat,
control = rpart::rpart.control(xval = 5, minbucket = 10),
model = TRUE)
cp = rp$cptable[,1][which.min(rp$cptable[,4])] # best model according to pruning
Treefit <- rpart::prune(rp, cp = cp)
plot(Treefit)
Dat_fusedTree <- Dat_Tree(Tree = Treefit, X = X, Z = Z, LinVars = FALSE)
Omics <- Dat_fusedTree$Omics
Clinical <- Dat_fusedTree$Clinical
Tuning of the penalty parameters of fusedTree using cross-validation
Description
Tuning is conducted by optimizing the cross-validated likelihood.
Users can either include the fusion penalty (by specifying alphaInit > 0
),
or omit the fusion penalty (by specifying alphaInit = 0
). If
alphaInit = 0
, only the standard ridge penalty lambda
is tuned.
Note that Dat_Tree()
is called internally so please provide the
original data as input arguments.
Usage
PenOpt(
Tree,
X,
Y,
Z,
model = NULL,
lambdaInit = 10,
alphaInit = 10,
folds = CVfoldsTree(Y = Y, Tree = Tree, Z = Z, model = model),
loss = "loglik",
multistart = FALSE,
maxIter = 30,
LinVars = FALSE
)
Arguments
Tree |
The fitted tree. Currently this should be a tree fitted using
|
X |
The original omics data matrix. Has dimensions (sample size × number of omics variables). Should be a matrix. |
Y |
The response; should be either numeric, binary (encoded by 0 and 1),
or a survival object created by |
Z |
The original clinical data matrix, which was used to fit the tree.
Should be a |
model |
Character. Specifies the outcome model. One of |
lambdaInit |
Numeric. Initial value for the standard ridge (L2) penalty
|
alphaInit |
Numeric. Initial value for the fusion penalty |
folds |
List. Each element contains the indices of the test samples for a fold. It is advisable to balance the samples with respect to the outcome (for binary and survival models) and the tree structure. If not provided, folds are generated internally. |
loss |
Character. The loss function to optimize in cross-validation.
For binary and survival outcomes, only |
multistart |
Logical. Whether to initialize with different starting values when
optimizing the cross-validated likelihood. Can help with stability when both
|
maxIter |
Integer. Maximum number of iterations for the IRLS (iterative reweighted least squares) algorithm. Used only for logistic and Cox models. Defaults to 30. |
LinVars |
Logical. Whether to include continuous clinical variables
linearly in the model (in addition to the tree structure). Can be helpful
since trees may not capture linear effects well. Defaults to |
Details
The cross-validated likelihood is optimized using the Nelder-Mead
method from stats::optim()
. When tuning both lambda
and
alpha
, the objective function can be noisy. Setting
multistart = TRUE
performs optimization from several starting values
to improve robustness. This is only applicable when alphaInit > 0
.
Value
A numeric vector with the tuned values of the penalties:
-
lambda
: standard ridge (L2) penalty. -
alpha
: fusion penalty (only ifalphaInit > 0
).
If alphaInit = 0
, only the tuned lambda
is returned.
References
Examples
p = 5 # number of omics variables (low for illustration)
p_Clin = 5 # number of clinical variables
N = 100 # sample size
# simulate from Friedman-like function
g <- function(z) {
15 * sin(pi * z[,1] * z[,2]) + 10 * (z[,3] - 0.5)^2 + 2 * exp(z[,4]) + 2 * z[,5]
}
set.seed(11)
Z <- as.data.frame(matrix(runif(N * p_Clin), nrow = N))
X <- matrix(rnorm(N * p), nrow = N) # omics data
betas <- c(1,-1,3,4,2) # omics effects
Y <- g(Z) + X %*% betas + rnorm(N) # continuous outcome
Y <- as.vector(Y)
dat = cbind.data.frame(Y, Z) #set-up data correctly for rpart
rp <- rpart::rpart(Y ~ ., data = dat,
control = rpart::rpart.control(xval = 5, minbucket = 10),
model = TRUE)
cp = rp$cptable[,1][which.min(rp$cptable[,4])] # best model according to pruning
Treefit <- rpart::prune(rp, cp = cp)
plot(Treefit)
folds <- CVfoldsTree(Y = Y, Tree = Treefit, Z = Z, model = "linear")
optPenalties <- PenOpt(Tree = Treefit, X = X, Y = Y, Z = Z,
model = "linear", lambdaInit = 10, alphaInit = 10,
loss = "loglik",
LinVars = FALSE,
folds = folds, multistart = FALSE)
optPenalties
Fit a fusedTree model with or without fusion penalty
Description
Fits a fusedTree model by solving a penalized regression problem using either a linear, logistic, or Cox model. The model includes both a standard ridge (L2) penalty and an optional fusion penalty to encourage similarity between leaf node-specific omics effects.
Usage
fusedTree(
Tree,
X,
Y,
Z,
LinVars = TRUE,
model,
lambda,
alpha,
maxIter = 50,
minSuccDiff = 10^(-10),
dat = FALSE,
verbose = TRUE
)
Arguments
Tree |
The fitted tree object. Should be created using |
X |
A matrix of omics data with dimensions (sample size × number of omics variables). |
Y |
The response variable. Can be:
|
Z |
A data frame of clinical covariates used to fit the tree. Must be a
|
LinVars |
Logical. Whether to include continuous clinical variables
linearly in the model (in addition to the tree structure). Defaults to |
model |
Character. Specifies the type of outcome model to fit.
One of: |
lambda |
Numeric. Value for the standard ridge (L2) penalty. |
alpha |
Numeric. Value for the fusion penalty. |
maxIter |
Integer. Maximum number of iterations for the IRLS (iterative
reweighted least squares) algorithm. Used only when |
minSuccDiff |
Numeric. The minimum difference in log-likelihood between
successive iterations of IRLS to declare convergence. Only used when
|
dat |
Logical. Whether to return the data used in model fitting
(i.e., omics, clinical, and response). Defaults to |
verbose |
Logical. Whether to print progress updates from the IRLS algorithm.
Only applies to |
Details
Linear model: Estimated using a closed-form analytic solution.
Logistic and Cox models: Estimated using IRLS (iterative reweighted least squares), equivalent to the Newton-Raphson algorithm.
Cox model: The full likelihood approach is used, following van Houwelingen et al. (2005). See also van de Wiel et al. (2021) for additional details on penalized regression for survival outcomes.
Value
A list with the following components:
- Tree
The fitted tree object from 'rpart'.
- Effects
A named numeric vector of estimated effect sizes, including: intercepts (tree leaf nodes), omics effects (per node), and linear clinical effects (if
LinVars = TRUE
).- Breslow
(Optional) The breslow estimates of the baseline hazard
ht
and the cumulative baseline hazardHt
for each time point. Only returned formodel = "cox"
.- Parameters
A list of model parameters used in fitting (e.g.,
lambda
,alpha
,model
, etc.).- Clinical
(Optional) The clinical design matrix used in fitting, if
dat = TRUE
.- Omics
(Optional) The omics design matrix used in fitting, if
dat = TRUE
.- Response
(Optional) The response vector used in fitting, if
dat = TRUE
.
The returned list object is of class S3 for which predict() is available
References
van Houwelingen, H. C., et al.. (2005). Cross-validated Cox regression on microarray gene expression data. Stad Med
van de Wiel, M. A., et al. (2021). Fast Cross-validation for Multi-penalty High-dimensional Ridge Regression. J Comput Graph Stat
Examples
p = 5 # number of omics variables (low for illustration)
p_Clin = 5 # number of clinical variables
N = 100 # sample size
# simulate from Friedman-like function
g <- function(z) {
15 * sin(pi * z[,1] * z[,2]) + 10 * (z[,3] - 0.5)^2 + 2 * exp(z[,4]) + 2 * z[,5]
}
set.seed(11)
Z <- as.data.frame(matrix(runif(N * p_Clin), nrow = N))
X <- matrix(rnorm(N * p), nrow = N) # omics data
betas <- c(1,-1,3,4,2) # omics effects
Y <- g(Z) + X %*% betas + rnorm(N) # continuous outcome
Y <- as.vector(Y)
dat = cbind.data.frame(Y, Z) #set-up data correctly for rpart
rp <- rpart::rpart(Y ~ ., data = dat,
control = rpart::rpart.control(xval = 5, minbucket = 10),
model = TRUE)
cp = rp$cptable[,1][which.min(rp$cptable[,4])] # best model according to pruning
Treefit <- rpart::prune(rp, cp = cp)
plot(Treefit)
folds <- CVfoldsTree(Y = Y, Tree = Treefit, Z = Z, model = "linear")
optPenalties <- PenOpt(Tree = Treefit, X = X, Y = Y, Z = Z,
model = "linear", lambdaInit = 10, alphaInit = 10,
loss = "loglik",
LinVars = FALSE,
folds = folds, multistart = FALSE)
optPenalties
# with fusion
fit <- fusedTree(Tree = Treefit, X = X, Y = Y, Z = Z,
LinVars = FALSE, model = "linear",
lambda = optPenalties[1],
alpha = optPenalties[2])
# without fusion
fit1 <- fusedTree(Tree = Treefit, X = X, Y = Y, Z = Z,
LinVars = FALSE, model = "linear",
lambda = optPenalties[1],
alpha = 0)
#compare effect estimates
fit$Effects
fit1$Effects
Predict Method for Fused Tree Models
Description
Generates predictions from a fitted fusedTree
object using new clinical and omics data.
Usage
## S3 method for class 'fusedTree'
predict(object, newX, newZ, newY, ...)
Arguments
object |
An object of class |
newX |
A matrix of new omics covariates for prediction. Must have the same number of columns (variables) as used in the model fitting. |
newZ |
A data frame of new clinical covariates for prediction. |
newY |
A numeric response vector or a |
... |
Currently not used. Included for S3 method consistency. |
Value
A model-specific prediction object:
For
"linear"
models: adata.frame
with columnsResp
(observed responses) andYpred
(predicted values).For
"logistic"
models: adata.frame
with columnsResp
(observed labels),Ypred
(predicted probabilities), andLinPred
(linear predictor).For
"cox"
models: alist
with two elements:- data
A
data.frame
with columnsResp
(aSurv
object) andLinPred
.- Survival
A matrix of predicted survival probabilities. Rows = test subjects, columns = unique times from
newY
.
See Also
fusedTree
for model fitting.
Examples
p = 5 # number of omics variables (low for illustration)
p_Clin = 5 # number of clinical variables
N = 100 # sample size
# simulate from Friedman-like function
g <- function(z) {
15 * sin(pi * z[,1] * z[,2]) + 10 * (z[,3] - 0.5)^2 + 2 * exp(z[,4]) + 2 * z[,5]
}
set.seed(11)
Z <- as.data.frame(matrix(runif(N * p_Clin), nrow = N))
X <- matrix(rnorm(N * p), nrow = N) # omics data
betas <- c(1,-1,3,4,2) # omics effects
Y <- g(Z) + X %*% betas + rnorm(N) # continuous outcome
Y <- as.vector(Y)
dat = cbind.data.frame(Y, Z) #set-up data correctly for rpart
rp <- rpart::rpart(Y ~ ., data = dat,
control = rpart::rpart.control(xval = 5, minbucket = 10),
model = TRUE)
cp = rp$cptable[,1][which.min(rp$cptable[,4])] # best model according to pruning
Treefit <- rpart::prune(rp, cp = cp)
fit <- fusedTree(Tree = Treefit, X = X, Y = Y, Z = Z,
LinVars = FALSE, model = "linear",
lambda = 10,
alpha = 1000)
Preds <- predict(fit, newX = X, newZ = Z, newY = Y)