Introduction to DEmixR

Farrokh Habibzadeh

2025-09-21

Introduction

The DEmixR package provides tools for fitting and evaluating two-component mixture models (normal and lognormal) using robust global optimization via Differential Evolution (DEoptim), with local refinement by quasi-Newton optimization.

It is designed to be more robust than standard EM-based methods, particularly in challenging situations where the mixture components overlap strongly or have complex likelihood surfaces.

Installation

You can install the released version of DEmixR from CRAN with:

install.packages("DEmixR")

Overview of Functions

prelim_plots() – Diagnostic plots (histogram, QQ, PP, log-QQ) select_best_mixture() – Compare lognormal vs normal mixtures using BIC fit_norm2() – Fit a two-component normal mixture fit_lognorm2() – Fit a two-component lognormal mixture bootstrap_mix2() – Bootstrap mixture parameters (parametric or nonparametric) evaluate_init() – Refine initial parameter values with local optimization

When to Use Which Function

Use prelim_plots() for initial data exploration and distribution assessment Use select_best_mixture() when unsure whether normal or lognormal distribution is appropriate Use fit_norm2()/fit_lognorm2() when you know the appropriate distribution family Use bootstrap_mix2() for uncertainty quantification and confidence intervals Use evaluate_init() for testing specific starting values or refining solutions

Two-Component Normal Mixture

We first generate synthetic data from a two-component normal mixture and draw the basic plots with DEmixR.

Diagnostic Plots

# Load the package
library(DEmixR)

# Simulation data
set.seed(123)
x <- c(rnorm(3000, mean = 0, sd = 1),
       rnorm(2000, mean = 3, sd = 1.2))

# Diagnostic plots
prelim_plots(x, which = c("hist", "qq"))

Model Selection

# Automatically choose the best mixture family
best_model <- select_best_mixture(x, n_runs = 3, NP = 50, itermax = 2000)
##   |                                                                              |                                                                      |   0%  |                                                                              |+++++++++++++++++++++++                                               |  33%  |                                                                              |+++++++++++++++++++++++++++++++++++++++++++++++                       |  67%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
best_model$best$family
## [1] "normal"
best_model$BICs
##   normal 
## 19547.24

Normal Mixture

# Fit a two-component normal mixture
fit <- fit_norm2(x, n_runs = 5, quiet = 2, parallelType = 0)
##   |                                                                              |                                                                      |   0%  |                                                                              |++++++++++++++                                                        |  20%  |                                                                              |++++++++++++++++++++++++++++                                          |  40%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++                            |  60%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++++++++++++++++              |  80%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
fit$par      # estimated parameters
##           p          m1          s1          m2          s2 
## 0.599738895 0.005086783 0.983210109 2.984878390 1.181022679
fit$logLik   # log-likelihood
## [1] -9752.326
fit$AIC      # Akaike Information Criterion
## [1] 19514.65
fit$BIC      # Bayesian Information Criterion
## [1] 19547.24

Lognormal Mixture

# Simulation data
set.seed(123)
y <- c(rlnorm(3000, meanlog = 0, sdlog = 0.5),
       rlnorm(2000, meanlog = 1, sdlog = 0.7))

# Fit a two-component lognormal mixture
fit_ln <- fit_lognorm2(y, n_runs = 5, parallelType = 0)
##   |                                                                              |                                                                      |   0%  |                                                                              |++++++++++++++                                                        |  20%  |                                                                              |++++++++++++++++++++++++++++                                          |  40%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++                            |  60%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++++++++++++++++              |  80%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
fit_ln$par
##           p          m1          s1          m2          s2 
##  0.57503793 -0.01180295  0.48857490  0.95267679  0.69951039

Bootstrap

# Bootstrap confidence intervals
boot_res <- bootstrap_mix2(fit_ln, B = 100, parametric = TRUE, parallelType = 0)
boot_res$central
##           p          m1          s1          m2          s2 
##  0.57428385 -0.01681622  0.48844615  0.95834689  0.69416098
boot_res$ci
##               p          m1        s1        m2        s2
## 2.5%  0.4115206 -0.06678695 0.4343919 0.7041057 0.6095682
## 97.5% 0.7041991  0.05896932 0.5272649 1.2061600 0.7729422

Evaluate Initialization Values

evaluate_init(par_init = c(0.5, 0, 0.6, 1.1, 0.8), y, family = "lognormal")
## $success
## [1] TRUE
## 
## $par
## [1]  0.57497232 -0.01185789  0.48853997  0.95258083  0.69950326
## 
## $logLik
## [1] -7547.12
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Practical Example: Biomarker Data Analysis

# Simulated biomarker data with two populations
set.seed(456)
biomarker <- c(rlnorm(4000, meanlog = 2.5, sdlog = 0.4),  # Healthy group
               rlnorm(1000, meanlog = 3.8, sdlog = 0.6))  # Disease group

# Initial exploration
prelim_plots(biomarker, which = c("hist", "logqq"))

# Fit lognormal mixture
fit_bio <- try(fit_lognorm2(biomarker, n_runs = 5, parallelType = 0, quiet = 2))
##   |                                                                              |                                                                      |   0%  |                                                                              |++++++++++++++                                                        |  20%  |                                                                              |++++++++++++++++++++++++++++                                          |  40%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++                            |  60%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++++++++++++++++              |  80%  |                                                                              |++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
if (!inherits(fit_bio, "try-error")) {
  cat("Biomarker analysis results:\n")
  print(fit_bio$par)
  
  # Estimate proportion in each group
  cat("\nEstimated proportion in group 1 (healthy):", 
      round(fit_bio$par[1], 3), "\n")
  cat("Estimated proportion in group 2 (disease):", 
      round(1 - fit_bio$par[1], 3), "\n")
  
  # Bootstrap for confidence intervals
  boot_bio <- try(bootstrap_mix2(fit_bio, B = 100, quiet = 2))
  if (!inherits(boot_bio, "try-error")) {
    cat("\n95% CI for proportion in group 1:\n")
    print(boot_bio$ci[, "p"])
  }
}
## Biomarker analysis results:
##         p        m1        s1        m2        s2 
## 0.7942352 2.5045852 0.3877220 3.7999468 0.6123135 
## 
## Estimated proportion in group 1 (healthy): 0.794 
## Estimated proportion in group 2 (disease): 0.206 
## 
## 95% CI for proportion in group 1:
##      2.5%     97.5% 
## 0.7543291 0.8244741

Advanced Usage

This section illustrates some of the optional parameters for fine-tuning performance and diagnostics.

Fitting with fit_*

# Fit lognormal mixture with custom optimization settings
fit_ln <- fit_lognorm2(
  x,
  NP = 150,           # population size for DEoptim
  n_runs = 20,        # independent runs to avoid local optima
  itermax = 200,      # maximum iterations per run
  parallelType = 1,   # use parallelization across platforms
  quiet = 1,          # print minimal progress info
  par_init = NULL     # optionally supply initial values
)

Key options: - NP: larger = better exploration, but slower. - n_runs: increase for more robustness; with 1000, it is very slow, but have ultra robustness - parallelType: 0 = serial, 1 = socket clusters (Windows/Mac/Linux), 2 = forking (Mac/Linux only). - quiet: 0 = verbose, 1 = minimal, 2 = silent.

Bootstrapping with bootstrap_mix2

boot_res <- bootstrap_mix2(
  fit_ln,
  B = 500,            # number of bootstrap replications
  parallelType = 1,   # parallel computation
  parametric = TRUE,  # parametric bootstrap
  ci_level = 0.90,    # confidence interval level
  quiet = 1           # show progress bar
)

boot_res$central   # means for normal and medians for lognormal
boot_res$ci        # bootstrap confidence intervals

Notes: - Supports both parametric and nonparametric bootstrap. - Uses an internal function to avoid label switching.

Preliminary Plots with prelim_plots

prelim_plots(
  x,
  which = c("hist", "qq", "pp", "logqq"),
  hist_bins = 40,
  col_hist = "grey85",
  col_density = "darkorange",
  col_qq = "grey60",
  col_line = "darkorange"
)

Options: - which: choose one or more of "hist", "qq", "pp", "logqq". - hist_bins: number of bins for histogram. - col_*: customize plot colors.

Model Selection with select_best_mixture

sel <- select_best_mixture(
  x,
  n_runs = 5,
  NP = 100,
  itermax = 150,
  quiet = 2
)

sel$best$family
sel$best$par

What it does: - Fits both Normal and Lognormal mixtures. - Compares them via BIC. - Returns the best-performing family.

Evaluating Initial Values with evaluate_init

init_eval <- evaluate_init(
  x,
  family = "normal",
  n_starts = 10,      # number of random initializations
  NP = 50,            # DEoptim population size
  itermax = 100,      # maximum iterations
  quiet = 2           # suppress output
)

print(init_eval)

Usage: - Helps check the stability of optimization. - Useful to diagnose poor convergence.

Summary

The DEmixR package provides robust and flexible tools for mixture model analysis, making it a strong alternative to existing EM-based approaches. Its use of global optimization helps avoid local minima and improve reliability, especially in complex or overlapping mixtures.

Note: Harmless socket connection warnings may appear when using parallelization.

Key Features

  1. Robust estimation: Differential Evolution optimization avoids local minima
  2. Comprehensive diagnostics: Visualization, model selection, and bootstrap uncertainty
  3. Flexible: Supports both normal and lognormal distributions
  4. User-friendly: Sensible defaults with options for customization
  5. Efficient: Optional parallel computation for faster results

Best Practices

  1. Always start with prelim_plots() to understand your data distribution
  2. Use select_best_mixture() when uncertain about distribution family
  3. Increase n_runs for challenging optimization problems
  4. Use bootstrap methods for uncertainty quantification
  5. Test different initial values with evaluate_init() if convergence is problematic For more information, see the help files for each function: ?fit_norm2, ?bootstrap_mix2, etc.

References

  1. Mullen, K.M., et al. (2011). DEoptim: An R Package for Global Optimization by Differential Evolution. Journal of Statistical Software, 40(6), 1-26.
  2. R Core Team (2024). R: A Language and Environment for Statistical Computing.