| Title: | Vis-NIR Spectral Analysis Wrapper | 
| Version: | 0.2.6 | 
| Maintainer: | Jenna Hershberger <jmh579@cornell.edu> | 
| Description: | Originally designed for application in the context of resource-limited plant research and breeding programs, 'waves' provides an open-source solution to spectral data processing and model development by bringing useful packages together into a streamlined pipeline. This package is a wrapper for functions related to the analysis of point visible and near-infrared reflectance measurements. It includes visualization, filtering, aggregation, preprocessing, cross-validation set formation, model training, and prediction functions to enable open-source association of spectral and reference data. This package is documented in a peer-reviewed manuscript in the Plant Phenome Journal <doi:10.1002/ppj2.20012>. Specialized cross-validation schemes are described in detail in Jarquín et al. (2017) <doi:10.3835/plantgenome2016.12.0130>. Example data is from Ikeogu et al. (2017) <doi:10.1371/journal.pone.0188918>. | 
| License: | MIT + file LICENSE | 
| URL: | https://GoreLab.github.io/waves/, https://github.com/GoreLab/waves | 
| BugReports: | https://github.com/GoreLab/waves/issues | 
| Depends: | R (≥ 3.5) | 
| Imports: | caret, dplyr, ggplot2, lifecycle, magrittr, pls, prospectr, randomForest, readr, rlang, scales, spectacles, stringr, tibble, tidyr (≥ 1.0), tidyselect | 
| Suggests: | testthat (≥ 2.1.0), knitr, rmarkdown | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| RoxygenNote: | 7.3.2 | 
| VignetteBuilder: | knitr, rmarkdown | 
| NeedsCompilation: | no | 
| Packaged: | 2025-10-30 22:46:12 UTC; Jenna | 
| Author: | Jenna Hershberger | 
| Repository: | CRAN | 
| Date/Publication: | 2025-10-30 23:50:02 UTC | 
Aggregate data based on grouping variables and a user-provided function
Description
Use grouping variables to collapse a spectral data.frame by
mean or median. Recommended for use after filter_spectra.
Usage
aggregate_spectra(df, grouping.colnames, reference.value.colname,
  agg.function)
Arguments
| df | 
 | 
| grouping.colnames | Names of columns to be used as grouping variables. Minimum 2 variables required. Default is c("trial", "plot"). | 
| reference.value.colname | Name of reference column to be aggregated along with spectra. Default is "reference" | 
| agg.function | Name of function (string format) to be used for sample aggregation. Must be either "mean" or "median". Default is "mean". | 
Value
data.frame object df aggregated by the grouping
columns using agg.function.
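The idea of collapsing repeated scans by group can be illustrated with a minimal base R sketch. It does not call aggregate_spectra and is not the package's implementation; the toy data frame and its values are assumptions made up for illustration.

```r
# Toy spectra: two scans per plot, two wavelengths (values are illustrative)
spectra <- data.frame(
  trial = c("A", "A", "A", "A"),
  plot = c(1, 1, 2, 2),
  reference = c(30, 32, 40, 44),
  X350 = c(0.10, 0.12, 0.20, 0.24),
  X351 = c(0.11, 0.13, 0.21, 0.25)
)

# Collapse scans to one row per trial/plot combination using the mean;
# the reference column is aggregated along with the spectra
aggregated <- stats::aggregate(
  cbind(reference, X350, X351) ~ trial + plot,
  data = spectra,
  FUN = mean
)
aggregated
```

Passing FUN = median instead would give the median-based equivalent.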
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
library(magrittr)
aggregated.test <- ikeogu.2017 %>%
  dplyr::select(-TCC) %>%
  na.omit() %>%
  aggregate_spectra(
    grouping.colnames = c("study.name"),
    reference.value.colname = "DMC.oven",
    agg.function = "mean"
  )
aggregated.test[1:5, 1:5]
Filter spectral data frame based on Mahalanobis distance
Description
Determine Mahalanobis distances of observations (rows) within a
given data.frame with spectral data. Option to filter out
observations based on these distances.
Usage
filter_spectra(df, filter, return.distances, num.col.before.spectra,
  window.size, verbose)
Arguments
| df | 
 | 
| filter | boolean that determines whether or not the input
 | 
| return.distances | boolean that determines whether a column of squared
Mahalanobis distances will be included in output  | 
| num.col.before.spectra | number of columns to the left of the spectral
matrix in  | 
| window.size | number defining the size of window to use when calculating the covariance of the spectra (required to calculate Mahalanobis distance). Default is 10. | 
| verbose | If  | 
Details
This function uses a chi-square distribution with 95% cutoff where
degrees of freedom = number of wavelengths (columns) in the input
data.frame.
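The cutoff described above can be sketched in base R as follows. This is an illustration on simulated data, not the code used inside filter_spectra; in particular it ignores the window.size argument, and the planted outlier is an assumption made for the example.

```r
set.seed(1)

# Simulated "spectra": 100 samples x 5 wavelengths (purely illustrative)
spectra <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
spectra[1, ] <- spectra[1, ] + 10 # plant one obvious outlier

# Squared Mahalanobis distance of each row from the column means
h.distances <- stats::mahalanobis(
  x = spectra,
  center = colMeans(spectra),
  cov = stats::cov(spectra)
)

# 95% cutoff of a chi-square distribution, df = number of wavelengths
cutoff <- stats::qchisq(0.95, df = ncol(spectra))

# Keep only the rows whose distance falls below the cutoff
filtered <- spectra[h.distances < cutoff, , drop = FALSE]
nrow(spectra) - nrow(filtered) # number of rows removed
```

Note that stats::mahalanobis returns squared distances, which is why they can be compared directly with a chi-square quantile.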
Value
If filter is TRUE, returns filtered data frame
df and reports the number of rows removed. The Mahalanobis distance
with a cutoff of 95% of chi-square distribution (degrees of freedom =
number of wavelengths) is used as the filtering criterion. If filter is
FALSE, returns full input df with column h.distances
containing the Mahalanobis distance for each row.
Author(s)
Jenna Hershberger jmh579@cornell.edu
References
Johnson, R.A., and D.W. Wichern. 2007. Applied Multivariate Statistical Analysis (6th Edition). p. 189.
Examples
library(magrittr)
filtered.test <- ikeogu.2017 %>%
  dplyr::select(-TCC) %>%
  na.omit() %>%
  filter_spectra(
    df = .,
    filter = TRUE,
    return.distances = TRUE,
    num.col.before.spectra = 5,
    window.size = 15
  )
filtered.test[1:5, c(1:5, (ncol(filtered.test) - 5):ncol(filtered.test))]
Format multiple trials with or without overlapping genotypes into training and test sets according to user-provided cross validation scheme
Description
Standalone function that is also used within
train_spectra to divide trials or studies into training and
test sets based on overlap in trial environments and genotype entries.
Usage
format_cv(
  trial1,
  trial2,
  trial3 = NULL,
  cv.scheme,
  stratified.sampling = TRUE,
  proportion.train = 0.7,
  seed = NULL,
  remove.genotype = FALSE
)
Arguments
| trial1 | 
 | 
| trial2 | 
 | 
| trial3 | 
 | 
| cv.scheme | A cross validation (CV) scheme from Jarquín et al., 2017.
Options for  
 | 
| stratified.sampling | If  | 
| proportion.train | Fraction of samples to include in the training set. Default is 0.7. | 
| seed | Number used in the function  | 
| remove.genotype | boolean that, if  | 
Details
Use of a cross-validation scheme requires a column in the input
data.frame named "genotype" to ensure proper sorting of training and
test sets. Variables trial1 and trial2 are required, while
trial3 is optional.
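The roles of proportion.train and seed can be illustrated with a simple random split in base R. This sketch does not implement the Jarquín et al. schemes or stratified sampling and is not the logic of format_cv; the toy trial and the seed value are assumptions.

```r
# Toy trial with a genotype column (values are illustrative)
trial <- data.frame(
  genotype = paste0("G", 1:10),
  reference = seq(30, 39)
)

proportion.train <- 0.7
set.seed(42) # fixing the seed makes the split reproducible

# Randomly choose 70% of the rows for training; the rest form the test set
train.rows <- sample(nrow(trial), size = round(proportion.train * nrow(trial)))
cv.list <- list(
  train.set = trial[train.rows, ],
  test.set = trial[-train.rows, ]
)

# No genotype should appear in both sets
intersect(cv.list$train.set$genotype, cv.list$test.set$genotype)
```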
Value
List of data.frames ($train.set, $test.set) compiled according to user-provided cross validation scheme.
Author(s)
Jenna Hershberger jmh579@cornell.edu
References
Jarquín, D., C. Lemes da Silva, R. C. Gaynor, J. Poland, A. Fritz, R. Howard, S. Battenfield, and J. Crossa. 2017. Increasing genomic-enabled prediction accuracy by modeling genotype × environment interactions in Kansas wheat. Plant Genome 10(2):1-15. <doi:10.3835/plantgenome2016.12.0130>
Examples
# Must have a column called "genotype", so we'll create a fake one for now
# We will use CV00, which does not require any overlap in genotypes
# In real scenarios, CV schemes that rely on genotypes should not be applied
# when genotypes are unknown, as in this case.
library(magrittr)
trials <- ikeogu.2017 %>%
  dplyr::mutate(genotype = 1:nrow(ikeogu.2017)) %>% # fake for this example
  dplyr::rename(reference = DMC.oven) %>%
  dplyr::select(
    study.name, sample.id, genotype, reference,
    tidyselect::starts_with("X")
  )
trial1 <- trials %>%
  dplyr::filter(study.name == "C16Mcal") %>%
  dplyr::select(-study.name)
trial2 <- trials %>%
  dplyr::filter(study.name == "C16Mval") %>%
  dplyr::select(-study.name)
cv.list <- format_cv(
  trial1 = trial1, trial2 = trial2, cv.scheme = "CV00",
  stratified.sampling = FALSE, remove.genotype = TRUE
)
cv.list$train.set[1:5, 1:5]
cv.list$test.set[1:5, 1:5]
Example vis-NIRS and reference dataset
Description
The 'ikeogu.2017' data set contains raw vis-NIRS scans, total carotenoid content, and cassava root dry matter content (using the oven method) from the 2017 PLOS One paper by Ikeogu et al. This dataset contains a subset of the original scans and reference values from the supplementary files of the paper. 'ikeogu.2017' is a 'data.frame' that contains the following columns:
- study.name = Name of the study as described in Ikeogu et al. (2017). 
- sample.id = Unique identifier for each individual root sample 
- DMC.oven = Cassava root dry matter content, the percentage of dry weight relative to fresh weight of a sample after oven drying. 
- TCC = Total carotenoid content (μg/g, unknown whether on a fresh or dry weight basis) as measured by high performance liquid chromatography. 
- X350:X2500 = spectral reflectance measured with the QualitySpec Trek: S-10016 vis-NIR spectrometer. Each cell represents the mean of 150 scans on a single root at a single wavelength. 
Usage
ikeogu.2017
Format
An object of class tbl_df (inherits from tbl, data.frame) with 175 rows and 2155 columns.
Author(s)
Original authors: Ikeogu, U.N., F. Davrieux, D. Dufour, H. Ceballos, C.N. Egesi, and J. Jannink. Reformatted by Jenna Hershberger.
References
Ikeogu, U.N., F. Davrieux, D. Dufour, H. Ceballos, C.N. Egesi, et al. 2017. Rapid analyses of dry matter content and carotenoids in fresh cassava roots using a portable visible and near infrared spectrometer (Vis/NIRS). PLOS One 12(12): 1–17. doi: 10.1371/journal.pone.0188918.
Examples
library(magrittr)
library(ggplot2)
data(ikeogu.2017)
ikeogu.2017[1:10, 1:10]
ikeogu.2017 %>%
  dplyr::select(-starts_with("X")) %>%
  dplyr::group_by(study.name) %>%
  tidyr::gather(trait, value, c(DMC.oven:TCC), na.rm = TRUE) %>%
  ggplot2::ggplot(aes(x = study.name, y = value, fill = study.name)) +
  facet_wrap(~trait, scales = "free_y", nrow = 2) +
  geom_boxplot()
Plot spectral data, highlighting outliers as identified using Mahalanobis distance
Description
Generates a ggplot object of given
spectra, with wavelength on the x axis and given spectral values on the y.
Mahalanobis distance is used to identify outliers, which are
highlighted on the plot. For each outlier that is identified, the
corresponding row from the original data frame is printed to the console.
Usage
plot_spectra(
  df,
  num.col.before.spectra = 1,
  window.size = 10,
  detect.outliers = TRUE,
  color = NULL,
  alternate.title = "",
  verbose = TRUE,
  wavelengths = deprecated()
)
Arguments
| df | 
 | 
| num.col.before.spectra | Number of columns to the left of the spectral matrix (including unique ID). Default is 1. | 
| window.size | number defining the size of window to use when calculating the covariance of the spectra (required to calculate Mahalanobis distance). Default is 10. | 
| detect.outliers | Boolean indicating whether spectra should be filtered
before plotting. If  | 
| color | String or vector of strings indicating colors to be passed to
 | 
| alternate.title | String to be used as plot title. If
 | 
| verbose | If  | 
| wavelengths | DEPRECATED  | 
Value
If verbose, prints unique ID and metadata for rows identified as outliers. Returns plot of spectral data with non-outliers in blue and outliers in red. X-axis is wavelengths and y-axis is spectral values.
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
library(magrittr)
ikeogu.2017 %>%
  dplyr::rename(unique.id = sample.id) %>%
  dplyr::select(unique.id, dplyr::everything(), -TCC) %>%
  na.omit() %>%
  plot_spectra(
    df = .,
    num.col.before.spectra = 5,
    window.size = 15,
    detect.outliers = TRUE,
    color = NULL,
    alternate.title = NULL,
    verbose = TRUE
  )
Use provided model object to predict trait values with input dataset
Description
Loads an existing model and cross-validation performance
statistics (created with save_model) and makes predictions
based on new spectra.
Usage
predict_spectra(
  input.data,
  model.stats.location,
  model.location,
  model.method = "pls",
  wavelengths = deprecated()
)
Arguments
| input.data | 
 | 
| model.stats.location | String containing file path (including file name)
to save location of "(model.name)_stats.csv" as output from the
 | 
| model.location | String containing file path (including file name) to
location where the trained model ("(model.name).Rds") was saved as output
by the  | 
| model.method | Model type to use for training. Valid options include: 
 | 
| wavelengths | DEPRECATED  | 
Value
data.frame object of predictions for each sample (row). First
column is unique identifier supplied by input.data and second is
predicted values
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
## Not run: 
ikeogu.2017 %>%
  dplyr::select(sample.id, dplyr::starts_with("X")) %>%
  predict_spectra(
    input.data = .,
    model.stats.location = paste0(
      getwd(),
      "/my_model_stats.csv"
    ),
    model.location = paste0(getwd(), "/my_model.Rds")
  )
## End(Not run)
Pretreat spectral data according to user-designated method
Description
Pretreatment, also known as preprocessing, is often used to
increase the signal to noise ratio in vis-NIR datasets. The waves
function pretreat_spectra applies common spectral pretreatment
methods such as standard normal variate and the Savitzky-Golay filter.
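As an illustration of one such method, standard normal variate (SNV) scaling can be sketched in base R. This is a minimal sketch of the general technique, not the code path used by pretreat_spectra; the helper name snv and the toy matrix are assumptions.

```r
# Toy spectra: 3 samples x 4 wavelengths (values are illustrative)
spectra <- matrix(
  c(0.10, 0.20, 0.30, 0.40,
    0.20, 0.40, 0.60, 0.80,
    0.15, 0.25, 0.35, 0.45),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, paste0("X", 350:353))
)

# SNV: center each spectrum (row) on its own mean and scale by its own sd
snv <- function(x) {
  (x - rowMeans(x)) / apply(x, 1, stats::sd)
}

spectra.snv <- snv(spectra)
round(rowMeans(spectra.snv), 10) # each row now has mean 0
apply(spectra.snv, 1, stats::sd) # and standard deviation 1
```

Because the second toy spectrum is the first multiplied by two, both rows become identical after SNV, which shows how the method removes multiplicative scatter effects.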
Usage
pretreat_spectra(
  df,
  test.data = NULL,
  pretreatment = 1,
  preprocessing.method = deprecated(),
  wavelengths = deprecated()
)
Arguments
| df | 
 | 
| test.data | 
 | 
| pretreatment | Number or list of numbers 1:13 corresponding to desired pretreatment method(s): 
 | 
| preprocessing.method | DEPRECATED  | 
| wavelengths | DEPRECATED  | 
Value
Pretreated df (or list of data.frames) with
reference column intact.
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
pretreat_spectra(df = ikeogu.2017, pretreatment = 3)[1:5, 1:5]
Functions renamed in waves 0.2.0
Description
[Deprecated]
waves 0.2.0 renamed a number of functions to ensure that every function name adheres to the tidyverse style guide.
- 'AggregateSpectra()' -> 'aggregate_spectra()'
- 'DoPreprocessing()' -> 'pretreat_spectra()'
- 'FilterSpectra()' -> 'filter_spectra()'
- 'FormatCV()' -> 'format_cv()'
- 'PlotSpectra()' -> 'plot_spectra()'
- 'PredictFromSavedModel()' -> 'predict_spectra()'
- 'SaveModel()' -> 'save_model()'
- 'TestModelPerformance()' -> 'test_spectra()'
- 'TrainSpectralModel()' -> 'train_spectra()'
Usage
AggregateSpectra(
  df,
  grouping.colnames = c("unique.id"),
  reference.value.colname = "reference",
  agg.function = "mean"
)
DoPreprocessing(df, test.data = NULL, pretreatment = 1)
FilterSpectra(
  df,
  filter = TRUE,
  return.distances = FALSE,
  num.col.before.spectra = 4,
  window.size = 10,
  verbose = TRUE
)
FormatCV(
  trial1,
  trial2,
  trial3 = NULL,
  cv.scheme,
  stratified.sampling = TRUE,
  proportion.train = 0.7,
  seed = NULL,
  remove.genotype = FALSE
)
PlotSpectra(
  df,
  num.col.before.spectra = 1,
  window.size = 10,
  detect.outliers = TRUE,
  color = NULL,
  alternate.title = NULL,
  verbose = TRUE
)
PredictFromSavedModel(
  input.data,
  model.stats.location,
  model.location,
  model.method = "pls"
)
SaveModel(
  df,
  save.model = TRUE,
  pretreatment = 1,
  model.save.folder = NULL,
  model.name = "PredictionModel",
  best.model.metric = "RMSE",
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  num.iterations = 10,
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  verbose = TRUE
)
TestModelPerformance(
  train.data,
  num.iterations,
  test.data = NULL,
  pretreatment = 1,
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  best.model.metric = "RMSE",
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  split.test = FALSE,
  verbose = TRUE
)
TrainSpectralModel(
  df,
  num.iterations,
  test.data = NULL,
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  best.model.metric = "RMSE",
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  split.test = FALSE,
  verbose = TRUE
)
Save spectral prediction model and model performance statistics
Description
Given a set of pretreatment methods, saves the best spectral
prediction model and model statistics to model.save.folder as
model.name.Rds and model.name_stats.csv respectively. If only
one pretreatment method is supplied, results from that method are stored.
Usage
save_model(
  df,
  write.model = TRUE,
  pretreatment = 1,
  model.save.folder = NULL,
  model.name = "PredictionModel",
  best.model.metric = "RMSE",
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  num.iterations = 10,
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  seed = 1,
  verbose = TRUE,
  save.model = deprecated(),
  wavelengths = deprecated(),
  autoselect.preprocessing = deprecated(),
  preprocessing.method = deprecated()
)
Arguments
| df | 
 | 
| write.model | If  | 
| pretreatment | Number or list of numbers 1:13 corresponding to desired pretreatment method(s): 
 | 
| model.save.folder | Path to folder where model will be saved. If not provided, will save to working directory. | 
| model.name | Name that model will be saved as in
 | 
| best.model.metric | Metric used to decide which model is best. Must be either "RMSE" or "Rsquared" | 
| k.folds | Number indicating the number of folds for k-fold cross-validation during model training. Default is 5. | 
| proportion.train | Fraction of samples to include in the training set. Default is 0.7. | 
| tune.length | Number delineating search space for tuning of the PLSR
hyperparameter  | 
| model.method | Model type to use for training. Valid options include: 
 | 
| num.iterations | Number of training iterations to perform | 
| stratified.sampling | If  | 
| cv.scheme | A cross validation (CV) scheme from Jarquín et al., 2017.
Options for  
 | 
| trial1 | 
 | 
| trial2 | 
 | 
| trial3 | 
 | 
| seed | Integer to be used internally as input for  | 
| verbose | If  | 
| save.model | DEPRECATED  | 
| wavelengths | DEPRECATED  | 
| autoselect.preprocessing | DEPRECATED
 | 
| preprocessing.method | DEPRECATED  | 
Details
Wrapper that uses pretreat_spectra,
format_cv, and train_spectra functions.
Value
List of model stats (in data.frame) and trained model object.
If the parameter write.model is TRUE, both objects are saved to
model.save.folder. To use the optimally trained model for
predictions, use tuned parameters from $bestTune.
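The file layout described above (an .Rds model file plus a _stats.csv file) can be mimicked with base R. In this sketch the linear model is only a stand-in for a trained spectral model, and the folder and model name are hypothetical; it does not reproduce what save_model writes.

```r
# A small stand-in model and stats table (not produced by waves)
model <- stats::lm(dist ~ speed, data = datasets::cars)
model.stats <- data.frame(RMSE = sqrt(mean(stats::residuals(model)^2)))

# Hypothetical folder and name, mirroring model.save.folder and model.name
model.save.folder <- tempdir()
model.name <- "my_prediction_model"
model.path <- file.path(model.save.folder, paste0(model.name, ".Rds"))
stats.path <- file.path(model.save.folder, paste0(model.name, "_stats.csv"))

saveRDS(model, model.path)
utils::write.csv(model.stats, stats.path, row.names = FALSE)

# Later, reload both files and predict from new data
reloaded.model <- readRDS(model.path)
reloaded.stats <- utils::read.csv(stats.path)
stats::predict(reloaded.model, newdata = data.frame(speed = c(10, 20)))
```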
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
library(magrittr)
test.model <- ikeogu.2017 %>%
  dplyr::filter(study.name == "C16Mcal") %>%
  dplyr::rename(reference = DMC.oven,
                unique.id = sample.id) %>%
  dplyr::select(unique.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  save_model(
    df = .,
    write.model = FALSE,
    pretreatment = 1:13,
    model.name = "my_prediction_model",
    tune.length = 3,
    num.iterations = 3
  )
summary(test.model$best.model)
test.model$best.model.stats
Test the performance of spectral models
Description
Wrapper that trains models based on spectral data to predict reference values and reports model performance statistics.
Usage
test_spectra(
  train.data,
  num.iterations,
  test.data = NULL,
  pretreatment = 1,
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  best.model.metric = "RMSE",
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  split.test = FALSE,
  seed = 1,
  verbose = TRUE,
  wavelengths = deprecated(),
  preprocessing = deprecated(),
  output.summary = deprecated(),
  rf.variable.importance = deprecated()
)
Arguments
| train.data | 
 | 
| num.iterations | Number of training iterations to perform | 
| test.data | 
 | 
| pretreatment | Number or list of numbers 1:13 corresponding to desired pretreatment method(s): 
 | 
| k.folds | Number indicating the number of folds for k-fold cross-validation during model training. Default is 5. | 
| proportion.train | Fraction of samples to include in the training set. Default is 0.7. | 
| tune.length | Number delineating search space for tuning of the PLSR
hyperparameter  | 
| model.method | Model type to use for training. Valid options include: 
 | 
| best.model.metric | Metric used to decide which model is best. Must be either "RMSE" or "Rsquared" | 
| stratified.sampling | If  | 
| cv.scheme | A cross validation (CV) scheme from Jarquín et al., 2017.
Options for  
 | 
| trial1 | 
 | 
| trial2 | 
 | 
| trial3 | 
 | 
| split.test | boolean that allows for a fixed training set and a split
test set. Example: train a model on data from two breeding programs and a
stratified subset (70%) of a third, and test on the remaining samples
(30%) of the third. If  | 
| seed | Integer to be used internally as input for  | 
| verbose | If  | 
| wavelengths | DEPRECATED  | 
| preprocessing | DEPRECATED please use
 | 
| output.summary | DEPRECATED  | 
| rf.variable.importance | DEPRECATED
 | 
Details
Calls pretreat_spectra, format_cv,
and train_spectra functions.
Value
List of 5 objects:
- 'model.list' is a list of trained model objects, one for each pretreatment method specified by the pretreatment argument. Each model is trained with all rows of df.
- 'summary.model.performance' is a data.frame containing summary statistics across all model training iterations and pretreatments. See below for a description of the summary statistics provided.
- 'model.performance' is a data.frame containing performance statistics for each iteration of model training separately (see below).
- 'predictions' is a data.frame containing both reference and predicted values for each test set entry in each iteration of model training.
- 'importance' is a data.frame containing variable importance results for each wavelength at each iteration of model training. If model.method is not "pls" or "rf", this list item is NULL.
Summary statistics in the 'summary.model.performance' and
'model.performance' data.frames include:
- Tuned parameters depending on the model algorithm:
  - Best.n.comp, the best number of components
  - Best.ntree, the best number of trees in an RF model
  - Best.mtry, the best number of variables to include at every decision point in an RF model
- RMSECV, the root mean squared error of cross-validation
- R2cv, the coefficient of multiple determination of cross-validation for PLSR models
- RMSEP, the root mean squared error of prediction
- R2p, the squared Pearson’s correlation between predicted and observed test set values
- RPD, the ratio of standard deviation of observed test set values to RMSEP
- RPIQ, the ratio of performance to interquartile difference
- CCC, the concordance correlation coefficient
- Bias, the average difference between the predicted and observed values
- SEP, the standard error of prediction
- R2sp, the squared Spearman’s rank correlation between predicted and observed test set values
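The test-set statistics listed above can be computed from a vector of observed and predicted values with base R. The formulas below follow the common textbook definitions and are an assumption; the package may compute some of them differently (for example, the variance denominator used in CCC or SEP). The toy values are made up.

```r
# Toy observed and predicted test-set values (illustrative only)
observed <- c(30, 32, 35, 38, 40, 43)
predicted <- c(31, 31, 36, 37, 42, 42)

residual <- predicted - observed
n <- length(observed)

rmsep <- sqrt(mean(residual^2))
bias <- mean(residual)
# SEP: bias-corrected standard deviation of the residuals
sep <- sqrt(sum((residual - bias)^2) / (n - 1))
r2p <- stats::cor(predicted, observed)^2
r2sp <- stats::cor(predicted, observed, method = "spearman")^2
rpd <- stats::sd(observed) / rmsep
rpiq <- stats::IQR(observed) / rmsep
# Lin's concordance correlation coefficient (population-variance form)
ccc <- 2 * mean((observed - mean(observed)) * (predicted - mean(predicted))) /
  (mean((observed - mean(observed))^2) + mean((predicted - mean(predicted))^2) +
     (mean(observed) - mean(predicted))^2)

round(c(RMSEP = rmsep, R2p = r2p, RPD = rpd, RPIQ = rpiq, CCC = ccc,
        Bias = bias, SEP = sep, R2sp = r2sp), 3)
```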
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
library(magrittr)
ikeogu.2017 %>%
  dplyr::rename(reference = DMC.oven,
                unique.id = sample.id) %>%
  dplyr::select(unique.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  test_spectra(
    train.data = .,
    tune.length = 3,
    num.iterations = 3,
    pretreatment = 1
  )
Train a model to predict reference values with spectral data
Description
Trains spectral prediction models using one of several algorithms and sampling procedures.
Usage
train_spectra(
  df,
  num.iterations,
  test.data = NULL,
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  best.model.metric = "RMSE",
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  split.test = FALSE,
  seed = 1,
  verbose = TRUE,
  save.model = deprecated(),
  rf.variable.importance = deprecated(),
  output.summary = deprecated(),
  return.model = deprecated()
)
Arguments
| df | 
 | 
| num.iterations | Number of training iterations to perform | 
| test.data | 
 | 
| k.folds | Number indicating the number of folds for k-fold cross-validation during model training. Default is 5. | 
| proportion.train | Fraction of samples to include in the training set. Default is 0.7. | 
| tune.length | Number delineating search space for tuning of the PLSR
hyperparameter  | 
| model.method | Model type to use for training. Valid options include: 
 | 
| best.model.metric | Metric used to decide which model is best. Must be either "RMSE" or "Rsquared" | 
| stratified.sampling | If  | 
| cv.scheme | A cross validation (CV) scheme from Jarquín et al., 2017.
Options for  
 | 
| trial1 | 
 | 
| trial2 | 
 | 
| trial3 | 
 | 
| split.test | boolean that allows for a fixed training set and a split
test set. Example: train a model on data from two breeding programs and a
stratified subset (70%) of a third, and test on the remaining samples
(30%) of the third. If  | 
| seed | Integer to be used internally as input for  | 
| verbose | If  | 
| save.model | DEPRECATED  | 
| rf.variable.importance | DEPRECATED
 | 
| output.summary | DEPRECATED  | 
| return.model | DEPRECATED  | 
Value
list of the following:
- model is a model object trained with all rows of df.
- summary.model.performance is a data.frame with model performance statistics in summary format (2 rows, one with mean and one with standard deviation of all training iterations).
- full.model.performance is a data.frame with model performance statistics in long format (number of rows = num.iterations).
- predictions is a data.frame containing predicted values for each test set entry at each iteration of model training.
- importance is a data.frame that contains variable importance for each wavelength. Only available for model.method options "rf" and "pls".
Included summary statistics:
- Tuned parameters depending on the model algorithm:
  - Best.n.comp, the best number of components
  - Best.ntree, the best number of trees in an RF model
  - Best.mtry, the best number of variables to include at every decision point in an RF model
- RMSECV, the root mean squared error of cross-validation
- R2cv, the coefficient of multiple determination of cross-validation for PLSR models
- RMSEP, the root mean squared error of prediction
- R2p, the squared Pearson’s correlation between predicted and observed test set values
- RPD, the ratio of standard deviation of observed test set values to RMSEP
- RPIQ, the ratio of performance to interquartile difference
- CCC, the concordance correlation coefficient
- Bias, the average difference between the predicted and observed values
- SEP, the standard error of prediction
- R2sp, the squared Spearman’s rank correlation between predicted and observed test set values
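The relationship between the long-format and the two-row summary performance tables can be sketched in base R. The per-iteration values and the SummaryType column name are hypothetical; the actual column layout returned by train_spectra may differ.

```r
# Toy per-iteration performance table (values are illustrative)
full.model.performance <- data.frame(
  Iteration = 1:3,
  RMSEP = c(1.2, 1.4, 1.0),
  R2p = c(0.80, 0.75, 0.85)
)

# Summarise all iterations into two rows: mean and standard deviation
stat.cols <- c("RMSEP", "R2p")
summary.model.performance <- data.frame(
  SummaryType = c("mean", "sd"), # hypothetical label column
  rbind(
    sapply(full.model.performance[stat.cols], mean),
    sapply(full.model.performance[stat.cols], stats::sd)
  )
)
summary.model.performance
```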
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
library(magrittr)
ikeogu.2017 %>%
  dplyr::filter(study.name == "C16Mcal") %>%
  dplyr::rename(reference = DMC.oven,
                unique.id = sample.id) %>%
  dplyr::select(unique.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  train_spectra(
    df = .,
    tune.length = 3,
    num.iterations = 3,
    best.model.metric = "RMSE",
    stratified.sampling = TRUE
  ) %>%
  summary()