Memory-efficient Cox proportional hazards regression via streaming
Newton-Raphson. Peak RAM is O(p^2) in the number of covariates p and
independent of the number of rows n, so models can be fit to datasets
that do not fit in memory. Coefficients are identical to those from
survival::coxph() with Efron tie correction.
Development version from GitHub:
# install.packages("remotes")
remotes::install_github("tommycarstensen/coxstream-r")
In-memory fit, with the same formula interface as
coxph():
library(survival)
library(coxstream)
fit <- coxstream(Surv(time, status) ~ age + sex, data = lung)
coef(fit)
fit
Out-of-core fit, streaming a Parquet file sorted by descending time,
one row group at a time (requires the optional arrow package):
fit <- coxstream_arrow(
  "events_sorted.parquet",
  x_cols = c("age", "sex"),
  time_col = "duration",
  event_col = "event"
)
coef(fit)
The reader loads one row-group chunk at a time and frees it before loading the next, so the data held in memory stays at O(batch_size * p) regardless of n. Efron tie groups that span chunk boundaries are carried in running state, so the coefficients are bit-identical to those of the in-memory fit.
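The input file has to be sorted by descending time before it is streamed. A minimal sketch of preparing such a file for a small dataset follows. The toy values, the 0/1 coding of the event column, and the tiny chunk_size are assumptions made for illustration only; data that does not fit in memory would need to be sorted by some other tool.

```r
# Toy data for illustration; column names match the call above.
# Assumption: the event column is coded 1 = event, 0 = censored.
events <- data.frame(
  duration = c(5, 12, 12, 3, 8, 8),
  event    = c(1L, 0L, 1L, 1L, 1L, 0L),
  age      = c(61, 54, 70, 66, 58, 49),
  sex      = c(1L, 2L, 1L, 2L, 2L, 1L)
)

# Sort by descending time so that each streamed chunk extends the risk set.
events_sorted <- events[order(-events$duration), ]

# Written to a temporary directory here; any path would do.
path <- file.path(tempdir(), "events_sorted.parquet")

# arrow is optional, so guard the call. chunk_size sets the number of rows
# per row group; 2 is used only to force several row groups in this toy file.
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow::write_parquet(events_sorted, path, chunk_size = 2)
}
```

In practice a much larger chunk_size would be the natural choice, since the row-group size determines how much data is loaded per chunk.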
Each Newton-Raphson iteration makes a single descending-time pass to accumulate the Cox partial-likelihood score and Hessian. Only running sums of size O(p) and O(p^2) are held, never the full risk set, so memory does not grow with n. The accumulation kernel is implemented in C++ via Rcpp.
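The pass described above can be sketched in base R. This is an illustration of the technique, not the package's C++ kernel: efron_pass() and its arguments are hypothetical names, status is assumed to be coded 1 for events, and rows are assumed to be already sorted by descending time. Only the running sums for the risk set and for the current tie group are kept between rows.

```r
# Illustrative sketch: one descending-time pass that accumulates the Efron
# partial-likelihood score and Hessian. Not the package API.
efron_pass <- function(time, status, X, beta) {
  stopifnot(is.matrix(X), !is.unsorted(rev(time)))  # descending time required
  p <- ncol(X)
  zero <- function() list(s0 = 0, s1 = numeric(p), s2 = matrix(0, p, p))
  add <- function(acc, w, x) {
    list(s0 = acc$s0 + w, s1 = acc$s1 + w * x, s2 = acc$s2 + w * outer(x, x))
  }
  risk <- zero()   # sums over all rows seen so far (time >= current time)
  tied <- zero()   # sums over the events in the current tie group
  d <- 0           # number of events in the current tie group
  sum_x <- numeric(p)
  score <- numeric(p)
  hess <- matrix(0, p, p)

  # Close the current tie group and add its contribution.
  flush <- function() {
    # Efron: the k-th of d tied events sees the risk set minus a
    # fraction k/d of the tied events.
    for (k in seq_len(d) - 1) {
      a <- k / d
      phi0 <- risk$s0 - a * tied$s0
      m <- (risk$s1 - a * tied$s1) / phi0
      score <<- score - m
      hess <<- hess - ((risk$s2 - a * tied$s2) / phi0 - outer(m, m))
    }
    score <<- score + sum_x
    tied <<- zero(); d <<- 0; sum_x <<- numeric(p)
  }

  for (r in seq_along(time)) {
    if (r > 1 && time[r] != time[r - 1]) flush()  # time changed: group done
    x <- unname(X[r, ])
    w <- exp(sum(x * beta))
    risk <- add(risk, w, x)
    if (status[r] == 1) {
      tied <- add(tied, w, x)
      d <- d + 1
      sum_x <- sum_x + x
    }
  }
  flush()  # last tie group
  list(score = score, hessian = hess)
}
```

With res <- efron_pass(time, status, X, beta), a Newton-Raphson update would be beta - solve(res$hessian, res$score). Because the tie-group state lives outside the row loop, the same structure would carry over unchanged when rows arrive chunk by chunk.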
Optional packages: arrow (for coxstream_arrow()), testthat (tests).
License: MIT, except src/arrow_c_abi.h, which is vendored from
Apache Arrow under Apache-2.0; see inst/COPYRIGHTS.