moimport(file,...)###Basic data import
The function is called with moimport(<file>,...),
where the mandatory argument <file> is a string
specifying the path of the file to be imported. Consider the following
dataset:
provided by the package in the Excel file ‘testDatabasic.xlsx’. The
first column contains the sample IDs followed by marker columns. Marker
labels are in the first row. All marker columns are of type ‘STR’
(default for argument molecular). The code below imports
and transforms this dataset to standard format:
infile <- system.file("extdata", "testDatabasic.xlsx", package = "MLMOI")
Outfile <- moimport(file = infile)
#> Warning: There are no warnings!The first six rows of the imported data are printed using the command:
head(Outfile)
#>   Sample IDs  M1   M2   M3  M4
#> 1   s-id1005 258  200  200 175
#> 2   s-id1009 258  200  197 175
#> 3   s-id1009 260 <NA> <NA> 177
#> 4   s-id1016 258  200  197 175
#> 5   s-id1018 258  202  197 177
#> 6   s-id1024 258  200  197 175###Exporting the dataset in standard format
If export is not an absolute path, the dataset will be
stored in the current working directory:
###Importing datasets with metadata {#meta}
The number of metadata columns are specified via argument
nummtd. If Users want to retain metadata,
keepmtd = TRUE need to be set. The dataset
‘testDatametadata.xlsx’ contains three metadata variables, as shown
below:
The appropriate code for importing this dataset is:
infile <- system.file("extdata", "testDatametadata.xlsx", package = "MLMOI")
Outfile <- moimport(infile, nummtd = 3, keepmtd = TRUE)
#> Warning: There are no warnings!The first six rows of the imported data are printed using the command:
head(Outfile)
#>   Sample IDs     Location year  clonality locus1 locus2 locus3 locus4
#> 1      sam73 Lower Maroni 2006 polyclonal    188    256    102    316
#> 2      sam73 Lower Maroni 2006 polyclonal    166   <NA>   <NA>    336
#> 3      sam11 Lower Maroni 2008 monoclonal    188    250    102    336
#> 4      sam26 Lower Maroni 2010 polyclonal    160    256    102    320
#> 5      sam26 Lower Maroni 2010 polyclonal   <NA>   <NA>    110   <NA>
#> 6     sam427 Lower Maroni 2010 polyclonal    160    250    102    300If users forget to set nummtd, the import function
regards the metadata columns as marker columns. This usually causes
various warnings.
###Importing different types of molecular markers
The Excel file ‘testDatamt.xlsx’ provided by the package contains different types of molecular data. This dataset is shown below
Marker columns start from third column. They are from left to right
of data type ‘STR’, ‘amino-acid’, ‘SNP’, ‘SNP’, ‘amino-acid’ and
‘codon’. The argument molecular needs to be set as a
vector:
infile <- system.file("extdata", "testDatamt.xlsx", package = "MLMOI")
Outfile <- moimport(infile, nummtd = 1, molecular = c('str','amino','snp','snp','amino','codon'))
#> Warning: There are no warnings!The first six rows of the imported data are printed using the command:
head(Outfile)
#>   Sample IDs m1STR m1Amino m1SNP m2SNP m2Amino m1Codon
#> 1       sam1   132     HIS     C     A     ARG     AGC
#> 2       sam1   136    <NA>  <NA>  <NA>    <NA>    <NA>
#> 3       sam2   136     HIS     C     T     CYS     AGG
#> 4       sam3   140     HIS     G     T     GLY     GGG
#> 5       sam3  <NA>    <NA>  <NA>  <NA>     ARG    <NA>
#> 6       sam4   140     ASP     G     A     ARG     GGGSince the coding class of markers in this dataset are the default
ones, the argument coding is not specified.
###Different molecular types with different coding classes
If a dataset contains a molecular marker whose coding is not assumed
as default in moimport, the argument coding
needs to be set. Consider the following dataset:
called ‘testDatamtc.xlsx’ provided by the package. The molecular
markers in columns 3 (‘SNP’), 5 (‘STR’) and 6 (‘amino-acid’) are not in
default coding. Therefore, the argument coding needs to be
set:
infile <- system.file("extdata", "testDatamtc.xlsx", package = "MLMOI")
Outfile <- moimport(infile, nummtd = 1,
         molecular = c('snp', 'str', 'str', 'amino', 'snp', 'codon'),
         coding = c('iupac', 'integer', 'nearest', 'full', '4let', 'triplet'))
#> Warning: The cell on row 13 and marker 'snp-iupac' contains an unidentified
#> entry: 'Z'.Here a warning is issued because the entry ‘z’ is not in coding ‘iupac’. The last six rows of the imported data are printed using the command:
tail(Outfile)
#>    Sample IDs snp-iupac ms-integer ms-real aa-amino snp-nucleo aa-codon
#> 9        xid6      <NA>        148     192      ALA       <NA>      ATT
#> 10       xid7         T        148     188      ALA          G      ATT
#> 11       xid8         A        164     182      GLY          C      ATT
#> 12       xid8         C       <NA>     192      ALA          G      ATG
#> 13       xid9         T        160     188      GLY          C      ATG
#> 14      xid10         A        160     180      ALA          C      ATGIt can be seen that the entry ‘z’ is deleted.
###Data from several worksheets
The file ‘testDatacomplex.xlsx’ provided by the package contains datasets in multiple worksheets as shown below:
The numbering determines the order of worksheets in the Excel file.
Clearly, this is the case of multiple worksheet dataset. Hence the
option multsheets = TRUE needs to be set. Notice that
worksheet (2) is entered in transposed format. Thus, the option
transposed needs to be set as a logical vector, specifying
which worksheets are in transposed format. The following code imports
the dataset:
infile <- system.file("extdata", "testDatacomplex.xlsx", package = "MLMOI")
Outfile <- moimport(infile, multsheets = TRUE, keepmtd = TRUE, nummtd = c(0, 2, 2, 0), transposed = c(FALSE, TRUE, FALSE, FALSE),
         molecular = list(c('str', 'str', 'snp', 'snp'),
                          c('amino', 'codon', 'snp', 'amino', 'codon'),
                          c('codon', 'str'),
                          c('str', 'amino', 'snp', 'str')),
         coding = list(rep(c('integer', '4let'), each = 2), 
                       c('1let', 'triplet', 'iupac', '3let', 'triplet'),
                       c('triplet', 'integer'),
                       c('integer', '1let', 'iupac', 'integer')))
#> Warning: In worksheet 1: The cell in row 3 and marker 'x1tr' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 5 and marker 'x1tr' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 5 and marker 'y1tr' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 7 and marker 'y1tr' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 4 and marker 'x1np' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 4 and marker 'y1np' contains the
#> unexpected character '/'.
#> Warning: In worksheet 2: The cell in column 5 and marker 'x2mino' contains the
#> unexpected character '?'.
#> Warning: In worksheet 2: The cell in column 7 and marker 'x2odon' contains the
#> unexpected character '?'.
#> Warning: In worksheet 2: The cell in column 5 and marker 'x2odon' contains the
#> unexpected character '?'.
#> Warning: In worksheet 2: The cell in column 6 and marker 'y2mino' contains the
#> unexpected character '?'.Several warnings are issued because data in the first and second worksheets have several entry errors. The first seven rows of the imported data are printed using the command:
head(Outfile, n = 7)
#>   Sample IDs  site testDate year x1tr y1tr x1np y1np x2mino x2odon x2np y2mino
#> 1       sid4 Aioun    41794 1997  124   95    A    A    TYR   <NA>    C    CYS
#> 2       sid4 Aioun    41794 1997  134   98 <NA> <NA>   <NA>   <NA>    G    PHE
#> 3       sid4 Aioun    41794 1997 <NA> <NA> <NA> <NA>   <NA>   <NA> <NA>   <NA>
#> 4       sid8 Aioun     <NA> 1997 <NA>   95    A    A    TYR    CCG    A   <NA>
#> 5       sid8 Aioun     <NA> 1997 <NA> <NA> <NA> <NA>   <NA>    CGA    G   <NA>
#> 6       sid3  <NA>    41795 1997  120  102 <NA> <NA>   <NA>   <NA> <NA>   <NA>
#> 7       sid3  <NA>    41795 1997 <NA> <NA> <NA> <NA>   <NA>   <NA> <NA>   <NA>
#>   y2odon x3codon x3tr x4tr x4amino x4np y4tr
#> 1    AAA     CTA   95  280     THR    G  230
#> 2   <NA>     TTA  112  260    <NA>    T <NA>
#> 3   <NA>     CTT <NA> <NA>    <NA> <NA> <NA>
#> 4    AAG    <NA> <NA>  320     GLU    A  230
#> 5   <NA>    <NA> <NA> <NA>     THR <NA> <NA>
#> 6   <NA>     CCT  119  320     GLU    A  190
#> 7   <NA>    <NA> <NA>  280    <NA> <NA>  205First warning of worksheet (2), addresses an entry with consecutive amino acids. Notice that, in spite of the warning the consecutive amino acids got converted to their equivalent three-letter designations after all.
moimerge(file1, file2,...)###Merging datasets {#merge}
Consider the datasets ‘testDatamerge1.xlsx’ and ‘testDatamerge2.xlsx’ from the package:
These datasets are already in standard format. Running the following code imports and merges the datasets:
infile1 <- system.file("extdata", "testDatamerge1.xlsx", package = "MLMOI")
infile2 <- system.file("extdata", "testDatamerge2.xlsx", package = "MLMOI")
outfile <- moimerge(infile1, infile2, nummtd1 = 1, nummtd2 = 2, keepmtd = TRUE)
#> Warning:  Metadata of the following sample is not unique:
#>  'village' of sample 'samid10'.A warning is issued because the metadata information of sample ‘samid10’ differs in two datasets, implying a data entry error. The first six rows of the imported data are printed using the command:
head(outfile, n = 6)
#>   Sample IDs village patient marker1 marker2 marker3 position1 position2
#> 1 "samid10"  "1"     "idy"   "223"   "155"   "Lys"   "A"       "CCC"    
#> 2 "samid12"  "1"     "idx"   "223"   "202"   "Gln"   "T"       "CCA"    
#> 3 "samid12"  "1"     "idx"   NA      NA      NA      NA        "CCC"    
#> 4 "samid23"  "2"     NA      "223"   "109"   "Gln"   NA        NA       
#> 5 "samid23"  "2"     NA      NA      NA      "Lys"   NA        NA       
#> 6 "samid37"  "1"     "idz"   NA      NA      NA      "T"       "CCC"Users can set export = <Outfile> to export the
merged dataset.
moimle(file,...)###Estimating parameters
Consider again the dataset ‘testDatametadata.xlsx’ in section Importing datasets with metadata. The following code derives the maximum-likelihood estimates (MLE) of the MOI parameter and lineage frequencies for each marker separately
infile <- system.file("extdata", "testDatametadata.xlsx", package = "MLMOI")
mle <- moimle(file = infile, nummtd = 3)
mle$locus1
#> $`Sample Size`
#>       
#> N =  8
#> 
#> $`Allele Prevalence Counts`
#>       160 166 188
#> Nk =    4   2   3
#> 
#> $`Observed Prevalences`
#>         160  166   188
#> Nk/N =  0.5 0.25 0.375
#> 
#> $`Log likelihood at MLE`
#>                            
#> log likelihood =  -11.46237
#> 
#> $`MOI Parameter MLEstimate and Average MOI`
#>           lambda      psi
#> MLE =  0.3848811 1.204755
#> 
#> $`Lineage Frequencies MLEstimates`
#>                160       166       188
#> MLE_p =  0.4521839 0.2162672 0.3315488###Constraining parameters
Consider again the dataset ‘testDatasetmerge2’ from section Merging datasets.This dataset contains pathological
data on marker ‘position2’, resulting in meaningless estimate \(\lambda = 0\). By setting boundaries via
argument bounds, the MLE for lineage frequencies is derived
from profiling at the bounds using \(\lambda_{min}\) or \(\lambda_{max}\)
infile <- system.file("extdata", "testDatamerge2.xlsx", package = "MLMOI")
moimle(file = infile, nummtd = 2, bounds = c(1.5, 2))
#> Warning: The MLE of MOI parameter at marker 'position1'(1.3863) is lower than
#> minimum lambda.
#> $position1
#> $position1$`Sample Size`
#>       
#> N =  6
#> 
#> $position1$`Allele Prevalence Counts`
#>       A T
#> Nk =  4 4
#> 
#> $position1$`Observed Prevalences`
#>                 A         T
#> Nk/N =  0.6666667 0.6666667
#> 
#> $position1$`Log likelihood at MLE`
#>                            
#> log likelihood =  -6.599933
#> 
#> $position1$`MOI Parameter MLEstimate and Average MOI`
#>          lambda      psi
#> Fixed =     1.5 1.930825
#> 
#> $position1$`Lineage Frequencies MLEstimates`
#>            A   T
#> MLE_p =  0.5 0.5
#> 
#> 
#> $position2
#> $position2$`Sample Size`
#>       
#> N =  6
#> 
#> $position2$`Allele Prevalence Counts`
#>       CCA CCC CGC
#> Nk =    1   6   1
#> 
#> $position2$`Observed Prevalences`
#>               CCA CCC       CGC
#> Nk/N =  0.1666667   1 0.1666667
#> 
#> $position2$`Log likelihood at MLE`
#>                            
#> log likelihood =  -5.779715
#> 
#> $position2$`MOI Parameter MLEstimate and Average MOI`
#>          lambda      psi
#> Fixed =       2 2.313035
#> 
#> $position2$`Lineage Frequencies MLEstimates`
#>                 CCA       CCC        CGC
#> MLE_p =  0.07333523 0.8533295 0.07333523