Abstract

The MDCcure package provides statistical tools to test the effect of covariates on the cure rate in mixture cure models. It implements nonparametric hypothesis tests based on the martingale difference correlation (MDC) and its partial version (PMDC), which measure departures from conditional mean independence. Since the cure status is partially observed due to censoring, the package uses an estimator that adjusts for this by leveraging nonparametric methods for cure probability and latency functions. Main features include estimation of MDC and partial MDC, hypothesis tests via permutation and chi-square approximations, and a goodness-of-fit test for parametric cure rate models. The package incorporates efficient implementations in C++ with parallel computing to handle the computational intensity of the MDC methods. The package also offers visualization tools to compare parametric fits to flexible nonparametric estimators, aiding interpretation and model assessment.

Introduction

Standard statistical methods for time-to-event analysis—often referred to as survival models—typically assume that every subject in the study population is susceptible to the event of interest and will eventually experience the event if the follow-up is sufficiently long. However, in practice, a fraction of individuals may never experience the event. These subjects are considered as having infinite survival times and are referred to as cured (Peng and Yu 2021).

Over the last few decades, specific methods have been developed to deal with this type of data, collectively known as cure model analysis. Several reviews and methodological contributions summarize the theory and application of these models (Maller et al. 2024; Peng and Yu 2021; Patilea and Van Keilegom 2020; Mailis Amico and Van Keilegom 2018; Morbiducci, Nardi, and Rossi 2003).

Cure models explicitly account for a subpopulation that is not susceptible to the event, allowing direct modeling of the cure proportion and the effects of covariates on it. Additionally, these models provide inference on the survival distribution of the uncured (susceptible) subjects. Most existing cure models are modifications of standard survival models that incorporate the cure probability. Based on how this incorporation is done, cure models are typically categorized as either mixture or nonmixture cure models. A key advantage of mixture cure models is that they allow covariates to affect both the cured and uncured groups differently. This package focuses on mixture cure models.

Let \(Y\) denote the time to event. It is assumed that subjects are subject to right censoring, and the censoring time \(C\) and the event time \(Y\) are conditionally independent given covariates \(\boldsymbol{X}\). The conditional distribution and survival functions of \(Y\) are given by \[ F(t|\boldsymbol{x}) = \mathbb{P}(Y \leq t \mid \boldsymbol{X} = \boldsymbol{x}), \quad S(t|\boldsymbol{x}) = 1 - F(t|\boldsymbol{x}). \] Under right censoring, only the pair \((T, \delta)\) is observed, where \(T = \min(Y, C)\) and \(\delta = \mathbb{I}(Y \leq C)\) is the censoring indicator. The conditional distributions of \(C\) and \(T\) are denoted by \(G(t|\boldsymbol{x})\) and \(H(t|\boldsymbol{x})\), respectively.

Define the cure indicator \(\nu = \mathbb{I}(Y = \infty)\), where \(\nu = 0\) for susceptible individuals and \(\nu = 1\) for cured individuals. The probability of not being cured (incidence) is \(p(\boldsymbol{x}) = \mathbb{P}(\nu = 0 \mid \boldsymbol{X} = \boldsymbol{x})\), and the conditional survival function of the uncured group (latency) is \[ S_0(t|\boldsymbol{x}) = \mathbb{P}(Y > t \mid \nu = 0, \boldsymbol{X} = \boldsymbol{x}). \] Then, the mixture cure model can be written as: \[ S(t|\boldsymbol{x}) = 1 - p(\boldsymbol{x}) + p(\boldsymbol{x}) S_0(t|\boldsymbol{x}). \]

Motivation for Covariate Testing

Testing whether a covariate has an effect on the cure rate \(1 - p(\boldsymbol{x})\) can be viewed as a problem of statistical dependence between the cure status \(\nu\) and covariates \(\boldsymbol{X}\). Various methods have been proposed to detect such dependencies, including universally consistent measures such as distance correlation (Szekely, Rizzo, and Bakirov 2007), the Hilbert–Schmidt independence criterion (Gretton et al. 2005), and the Heller–Heller–Gorfine statistic (Heller, Heller, and Gorfine 2013).

More recently, Shao and Zhang (2014) proposed the martingale difference correlation (MDC) to measure the departure from conditional mean independence between a scalar response \(Y\) and a vector predictor \(\boldsymbol{X}\): \[ \mathbb{E}(Y \mid \boldsymbol{X}) = \mathbb{E}(Y) \quad \text{a.s.} \] The martingale difference correlation is defined as: \[ \text{MDC}(Y \mid \boldsymbol{X})^2 = \begin{cases} \dfrac{\text{MDD}(Y \mid \boldsymbol{X})^2}{\sqrt{\text{Var}(Y)^2 \mathcal{V}^2(\boldsymbol{X})}}, & \text{if } \text{Var}(Y)^2 \mathcal{V}^2(\boldsymbol{X}) > 0, \\ 0, & \text{otherwise}, \end{cases} \] where \(\mathcal{V}^2(\boldsymbol{X})\) is the distance variance (Szekely, Rizzo, and Bakirov 2007), and martingale difference divergence (MDD) is given by: \[ \text{MDD}(Y \mid \boldsymbol{X})^2 = \frac{1}{c_p} \int_{\mathbb{R}^p} \frac{|g_{Y,X}(s) - g_Y g_X(s)|^2}{|s|_p^{1+p}} ds, \] with \(g_{Y,X}(s) = \mathbb{E}(Y e^{i \langle s, \boldsymbol{X} \rangle})\), \(g_Y = \mathbb{E}(Y)\), and \(g_X(s) = \mathbb{E}(e^{i \langle s, \boldsymbol{X} \rangle})\).

Two estimators are commonly used: a biased version (MDCV) from Shao and Zhang (2014) and an unbiased version (MDCU) developed in Park, Shao, and Yao (2015).

Partial Martingale Difference Correlation

The notion of conditional mean independence has been extended to settings where one adjusts for an additional set of covariates \(\boldsymbol{Z}\). Specifically, \(Y\) is said to be conditionally mean independent of \(\boldsymbol{X}\) given \(\boldsymbol{Z}\) if: \[ \mathbb{E}(Y \mid \boldsymbol{X}, \boldsymbol{Z}) = \mathbb{E}(Y \mid \boldsymbol{Z}) \quad \text{a.s.} \]

Park, Shao, and Yao (2015) introduced a measure called the partial martingale difference correlation (pMDC) to quantify this conditional mean independence, allowing one to assess the dependence of \(Y\) on \(\boldsymbol{X}\) while adjusting for \(\boldsymbol{Z}\).

This framework motivates a nonparametric test for covariate effects in cure models, since the cure probability can be expressed as \(1 - p(\boldsymbol{x}) = \mathbb{E}(\nu \mid \boldsymbol{X} = \boldsymbol{x})\). That is, testing the effect of covariates on the cure rate reduces to testing whether the cure indicator \(\nu\) is conditionally mean independent of \(\boldsymbol{X}\).

To extend this idea to two covariates, we use the partial martingale difference correlation, enabling robust and flexible hypothesis testing in the context of mixture cure models.

MDCcure

Overview of the package

The MDCcure package implements the methodology proposed in Monroy-Castillo et al. (2025) to test whether a covariate or a set of covariates has an effect on the cure rate. This methodology is based on the martingale difference correlation (MDC). Since the MDC function is not available on CRAN, it can be obtained from older versions of the EDMeasure package. However, the original implementation is computationally intensive. To address this, MDCcure leverages C++ and parallel computing using RcppParallel to improve efficiency.

Main Features

The MDCcure package offers functionality in three main areas:

1. Dependence Measures

The package provides functions to estimate the martingale difference correlation (mdc()), martingale difference divergence (mdd()), and their partial counterparts—partial martingale difference correlation and partial martingale difference divergence (pmdc() and pmdd()). It also includes a permutation test for MDC and a chi-squared test based on the bias-corrected MDC (mdc_test()), which test for dependence between a response covariate, \(Y\), and a covariate vector \(\boldsymbol{X}\) in the conditional mean, that is, whether the conditional mean of \(Y\) given \(\boldsymbol{X}\) depends on \(\boldsymbol{X}\) is tested.

2. Covariate Hypothesis Testing for the Cure Rate in Mixture Cure Models

The main function, testcov(), enables formal testing of whether a covariate significantly influences the cure rate in a mixture cure model. This is extended to handle two covariates with the testcov2() function.

3. Goodness-of-Fit Testing for the Cure Rate in Mixture Cure Models

The package implements the goodness-of-fit (GOF) test which assesses whether the specified parametric model for the cure rate adequately fits the data.

In addition to the formal statistical test, MDCcure includes visualization tools that compare the parametric cure rate function with a flexible nonparametric estimate, which is obtained using the probcure() function from the npcure package (López-Cheda, Jácome, and López-de-Ullibarri 2021). This graphical comparison provides an intuitive way to assess model adequacy and explore deviations from the assumed model.

Using the package

The following code illustrates a typical call:

library(MDCcure)

Dependence measures

For the first part, and for illustration purposes, we will simulate data to demonstrate the main functionalities of the package. The functions mdc and mdd return the squared martingale difference correlation and the martingale difference divergence, respectively. Note that two options for the center argument are available: "U" for the unbiased estimator and "D" for the double-centered version, which results in a biased estimator.

set.seed(123)
X <- matrix(rnorm(10 * 2), 10, 2)
Y <- matrix(rnorm(10 * 2), 10, 2)

mdc(X, Y, center = "U")
## [1] 0.2724937
mdc(X, Y, center = "D")
## [1] 0.3998999

To formally assess the dependence between a response variable \(Y\) and a covariate vector \(\boldsymbol{X}\), the function mdc_test() provides hypothesis testing procedures based on the martingale difference correlation framework.

This function supports multiple testing approaches:

MDCU: a permutation test based on the bias-corrected version of the martingale difference correlation.
MDCV: a permutation test based on the biased (double-centered) version.
FMDCU: a fast approximation to the test using the asymptotic distribution of the bias-corrected martingale difference divergence, which is dominated in the upper tail for a chi-squared distribution.

The null hypothesis in all cases is: \[ \mathcal{H}_0: \mathbb{E}(Y \mid \boldsymbol{X}) = \mathbb{E}(Y) \quad \text{a.s.} \] i.e., that \(Y\) is mean independent of \(\boldsymbol{X}\).

The pmdc() and pmdd() functions provide scalar measures to quantify conditional mean dependence while adjusting for a set of conditioning variables.

Specifically:

pmdc() returns the squared partial martingale difference correlation, which measures the conditional mean dependence between \(Y\) and \(\boldsymbol{X}\) adjusting on \(\boldsymbol{Z}\).
pmdd() returns the squared partial martingale difference divergence.

set.seed(123)
X <- matrix(rnorm(10 * 2), 10, 2)
Y <- matrix(rnorm(10 * 2), 10, 2)
Z <- matrix(rnorm(10 * 2), 10, 2)

pmdd(X, Y, Z)
## [1] 0.1247584
pmdc(X, Y, Z)
## [1] 0.2555037

Covariate hypothesis test for the cure rate in mixture cure models

With one covariate

The testcov() function performs nonparametric hypothesis testing to assess whether a covariate is associated with the cure probability in a mixture cure model. This function supports several types of test statistics, including those based on the martingale difference correlation (MDC) and an alternative GOFT test, which compares the distance between two nonparametric estimators of the cure probability: one under the null hypothesis (constant cure rate) and the other under the alternative hypothesis (covariate-dependent cure rate).

The covariate effect is evaluated using observed survival times, censoring indicators, and kernel-based estimation of the latency. The test is flexible in terms of bandwidth selection and computational strategies.

Available Test Methods

The method argument allows the user to choose among the following options:

"MDCU": Martingale Difference Correlation using U-centering, evaluated via a permutation test.
"MDCV": Martingale Difference Correlation using double-centering, with a permutation test.
"FMDCU": A fast approximation of "MDCU" using an asymptotic chi-squared distribution.
"GOFT": An alternative goodness-of-fit test specifically designed for cure models.
"All": Executes all available tests and returns their results in a single output.

Bandwidth Selection

The test relies on the estimation of the cure status, which involves kernel smoothing and requires a bandwidth parameter h. You can:

Provide a numeric value for manual bandwidth selection.
Set h = NULL to let the function select an optimal bandwidth automatically.
Use h = "bootstrap" to apply the bootstrap bandwidth selection method proposed by López-Cheda et al. (2017). Note: This option can significantly increase computation time.

By default, testcov() uses parallel computing to improve efficiency (parallel = TRUE). It is possible to control the number of cores used via the n_cores argument. If n_cores = NULL, the function will automatically detect the number of available cores and use all but one of them.

With two covariates

The testcov2() function performs nonparametric hypothesis testing to assess whether a covariate is associated with the cure probability, while adjusting for a second covariate in a mixture cure model. In other words, it evaluates whether a covariate has an effect on the cure rate, given that another covariate is already known to influence it.

\[ \mathcal{H}_0 : \mathbb{E}(\nu \mid \boldsymbol{X}, \boldsymbol{Z}) \equiv 1 - p(\boldsymbol{Z}) \quad \text{a.s.} \quad \text{vs} \quad \mathcal{H}_1 : \mathbb{E}(\nu \mid \boldsymbol{X}, \boldsymbol{Z}) \not\equiv 1 - p(\boldsymbol{Z}) \quad \text{a.s.}, \]

where \(\nu\) denotes the (unobserved) cure status, and \(p(\cdot)\) is the cure probability function.

This test extends the theory developed for a single covariate to the two-covariate setting. Estimation of the cure status relies on kernel smoothing and requires a bandwidth matrix \(\boldsymbol{H}\). If H = NULL, the bandwidth matrix is automatically selected using the rule-of-thumb method.

Goodnnes-of-fit test for the cure rate in a mixture cure model

The goft() function performs a goodness-of-fit test to assess whether a given parametric model—depending on a covariate—is appropriate for modeling the cure rate in a mixture cure model.

Key features include:

Automatic bandwidth selection when h = NULL,
A default initial value for the parameter vector \(\theta\) if theta0 = NULL,
Support for three parametric cure rate models via the model argument: "logit", "probit", and "cloglog".

In addition to statistical testing, the function plotCure() provides a visual comparison between the nonparametric and parametric estimations of the cure rate, enabling users to better assess the model’s adequacy.

Example

In order to illustrate the use of the functions defined above, we simulate data following the setting described in Müller and Van Keilegom (2019). In this example, the cure rate is generated from a logistic model given by

\[ p(x) = \frac{1}{1 + \exp(1 + x)}, \]

so it follows a logistic form and depends on the covariate \(X\). Additionally, we simulate a second covariate \(Z\), which is independent of both \(X\) and the cure rate.

set.seed(1234)

theta0 = c(1,1)
gamma0 = 0.5
gamma1 = 0.5

cT = 0

n <- 200
maxT = 0.02*n

X = runif(n,-1,1)
Z = runif(n,-1,1)
phi = exp(theta0[1]+theta0[2]*X)
phi = phi/(1+phi)   
B = (runif(n) <= phi)

aT = (X+1)^cT
lambdaX = exp(gamma0+gamma1*X)
bT = lambdaX^(-1/aT)
tau = qweibull(0.9,shape=mean(aT),scale=mean(bT))
Y = rep(100000,n)
count = 0
for (j in 1:n)
{if (B[j]==1)
{stop = 0
while (stop==0)  
{Y[j] = rweibull(1,shape=aT[j],scale=bT[j])
if ((Y[j] > tau)*(count <= maxT)) 
{Y[j] = tau
count = count + 1}
if (Y[j] <= tau) stop = 1}}}

aC = 1
bC = 1.5
C = rweibull(n,shape=aC,scale=bC)
C = replace(C,C>tau,tau+0.001)  

T = apply(cbind(Y,C),1,min)
Delta = as.numeric(Y <= C)

#Covariate hypothesis test for the cure rate with one covariate
testcov(X, T, Delta, method = "All", P = 499)
## ============== Covariate hypothesis test for the cure rate ============
## Method: All 
## 
## MDCU: Martingale difference correlation unbiased with permutations
## Test statistic: 0.03986 
## p-value: 0.002 
## Number of permutations: 499 
## 
## MDCV: Martingale difference correlation biased with permutations
## Test statistic: 0.04746 
## p-value: 0.008 
## Number of permutations: 499 
## 
## FMDCU: Fast Chi-square test based on MDC unbiased
## Test statistic: 0.03986 
## p-value: 0.0027426 
## 
## GOFT 
## Test statistic: 0.769 
## p-value: 0.03006 
## Number of bootstrap replicates: 499 
## 
## ========================================================================

#Covariate hypothesis test for the cure rate with two covariates
testcov2(X, T, Z, Delta, P = 499)
## ---------------------------------------------------------------------
## Covariate hypothesis test for the cure rate with two covariates 
## ---------------------------------------------------------------------
## Hypotheses:
##   H0: The conditional mean of the cure status given X adjusting on Z does not depend on X,
##       i.e., E[nu|X,Z] = E[nu|Z].
##   H1: The conditional mean of the cure status depends on X adjusting on Z
## 
## Test Statistic: 0.02869 
## p-value: 0.004 
## Number of permutations: 499
testcov2(Z, T, X, Delta, P = 499)
## ---------------------------------------------------------------------
## Covariate hypothesis test for the cure rate with two covariates 
## ---------------------------------------------------------------------
## Hypotheses:
##   H0: The conditional mean of the cure status given Z adjusting on X does not depend on Z,
##       i.e., E[nu|Z,X] = E[nu|X].
##   H1: The conditional mean of the cure status depends on Z adjusting on X
## 
## Test Statistic: -0.002226 
## p-value: 0.744 
## Number of permutations: 499

#Goodness-of-fit test for the cure rate 
goft(X, T, Delta, model = "logit")
## ---------------------------------------------------------------------
## Goodness-of-fit test for the cure rate in a mixture cure model
## -------------------------------------------------------------------
## 
## Model: logit 
## 
## Hypotheses:
##   H0: The model fits the data, i.e., p(x) = p_theta(x)
##   H1: The model does not fit the data, i.e., p(x) != p_theta(x)
## 
## Test Statistic: 0.1192 
## p-value: 0.4168 
## Number of bootstrap replications: 499
# plotCure(X, T, Delta, density = FALSE)

In the first instance, we test whether the cure fraction depends on the covariate \(X\). All available methods are applied. By default, the function uses method = "FMDCU", which corresponds to the fast chi-square test for the martingale difference correlation. In all cases, the resulting p-values are below 0.05, which aligns with the data-generating mechanism, the cure probability was simulated to depend on \(X\) through a logistic function. This provides evidence against the null hypothesis of no covariate effect on the cure rate.

Secondly, we test whether the cure fraction depends on two covariates. The order of covariates is crucial in this test: testcov2(X, Y, Z) is not equivalent to testcov2(Z, Y, X).

In the first call, testcov2(X, Y, Z), we test whether the cure fraction depends on \(X\), given that it depends on \(Z\). The resulting p-value is less than 0.05, indicating that \(X\) provides additional explanatory power. This aligns with the simulation setup, where the cure fraction depends on \(X\).
In the second call, testcov2(Z, Y, X), we test whether the cure fraction depends on \(Z\), given that it depends on \(X\). The p-value is greater than 0.05, indicating that \(Z\) does not provide additional information once \(X\) is accounted for. Again, this agrees with the simulation design, where the cure probability is independent of \(Z\).

This example illustrates the asymmetry of the test and the importance of specifying the covariates in the correct order.

Finally, in the third part we test if the data fit to a logistic model. Note that:

The p-value is greater than 0.05, which aligns with the simulation.
The plot shows both estimations:
- Parametric (red)
- Nonparametric (blue)
For the parametric case, the parameter \(\theta\) is given by default when theta = NULL and is obtained with the smcure() function from the smcure package.
Additionally, it is possible to add the density of the covariate when density = TRUE.

Amico, Mailis, and Ingrid Van Keilegom. 2018. “Cure Models in Survival Analysis.” Annual Review of Statistics and Its Application 5 (1): 311–42.

Amico, M, I Van Keilegom, and B Han. 2021. “Assessing Cure Status Prediction from Survival Data Using Receiver Operating Characteristic Curves.” Biometrika 108 (3): 727–40.

Gretton, Arthur, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. 2008. “A Kernel Statistical Test of Independence.” Advances in Neural Information Processing Systems 20: 585–92.

Gretton, Arthur, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf, and Aapo Hyvärinen. 2005. “Kernel Methods for Measuring Independence.” Journal of Machine Learning Research 6 (12): 2075–2129.

Heller, Ruth, Yair Heller, and Malka Gorfine. 2013. “A Consistent Multivariate Test of Association Based on Ranks of Distances.” Biometrika 100 (2): 503–10.

Laska, Eugene M, and Morris J Meisner. 1992. “Nonparametric Estimation and Testing in a Cure Model.” Biometrics 48 (4): 1223–34.

López-Cheda, Ana, Ricardo Cao, M Amalia Jácome, and Ingrid Van Keilegom. 2017. “Nonparametric Incidence Estimation and Bootstrap Bandwidth Selection in Mixture Cure Models.” Computational Statistics & Data Analysis 105: 144–65.

López-Cheda, Ana, M Amalia Jácome, and Ricardo Cao. 2017. “Nonparametric Latency Estimation for Mixture Cure Models.” Test 26 (2): 353–76.

López-Cheda, Ana, M Amalia Jácome, and Ignacio López-de-Ullibarri. 2021. “Npcure: An R Package for Nonparametric Inference in Mixture Cure Models.” R Journal 13 (1): 21–41.

Maller, Ross, Sidney Resnick, Soudabeh Shemehsavar, and Muzhi Zhao. 2024. “Mixture Cure Model Methodology in Survival Analysis: Some Recent Results for the One-Sample Case.” Statistic Surveys 18: 82–138.

Monroy-Castillo, Blanca E., M. A. Jácome, Ricardo Cao, and Ingrid Van Keilegom. 2025. “Ovariate Hypothesis Tests for the Cure Rate in Mixture Cure Models Based on Martingale Difference Correlation.” Submitted.

Morbiducci, Marta, Alessandra Nardi, and Carla Rossi. 2003. “Classification of ‘Cured’ Individuals in Survival Analysis: The Mixture Approach to the Diagnostic–Prognostic Problem.” Computational Statistics & Data Analysis 41 (3-4): 515–29.

Müller, Ursula U, and Ingrid Van Keilegom. 2019. “Goodness-of-Fit Tests for the Cure Rate in a Mixture Cure Model.” Biometrika 106 (1): 211–27.

Park, Trevor, Xiaofeng Shao, and Shun Yao. 2015. “Partial Martingale Difference Correlation.” Electronic Journal of Statistics 9 (1): 1492–1517.

Patilea, Valentin, and Ingrid Van Keilegom. 2020. “A General Approach for Cure Models in Survival Analysis.” The Annals of Statistics 48 (4): 2323–46.

Peng, Yingwei, and Binbing Yu. 2021. Cure Models: Methods, Applications, and Implementation. Chapman; Hall/CRC.

Pfister, Niklas, Peter Bühlmann, Bernhard Schölkopf, and Jonas Peters. 2018. “Kernel-Based Tests for Joint Independence.” Journal of the Royal Statistical Society Series B: Statistical Methodology 80 (1): 5–31.

Shao, Xiaofeng, and Jingsi Zhang. 2014. “Martingale Difference Correlation and Its Use in High-Dimensional Variable Screening.” Journal of the American Statistical Association 109 (507): 1302–18. https://doi.org/10.1080/01621459.2014.887012.

Shen, Cencheng, Sambit Panda, and Joshua T Vogelstein. 2022. “The Chi-Square Test of Distance Correlation.” Journal of Computational and Graphical Statistics 31 (1): 254–62.

Szekely, Gabor J., Maria L. Rizzo, and Nail K. Bakirov. 2007. “Measuring and Testing Dependence by Correlation of Distances.” The Annals of Statistics 35 (6): 2769–94.

MDCcure

Blanca Monroy-Castillo

Introduction

Motivation for Covariate Testing

Partial Martingale Difference Correlation

Nonparametric hypothesis tests for the cure rate

Goodness-of-Fit Test

MDCcure

Overview of the package

Main Features

1. Dependence Measures

2. Covariate Hypothesis Testing for the Cure Rate in Mixture Cure Models

3. Goodness-of-Fit Testing for the Cure Rate in Mixture Cure Models

Using the package

Dependence measures

Covariate hypothesis test for the cure rate in mixture cure models

With one covariate

Available Test Methods

Bandwidth Selection

With two covariates

Goodnnes-of-fit test for the cure rate in a mixture cure model

Example