| Title: | Ascent Training Datasets |
| Version: | 1.0.0 |
| Description: | Datasets to be used primarily in conjunction with Ascent training materials but also for the book 'SAMS Teach Yourself R in 24 Hours' (ISBN: 978-0-672-33848-9). Version 1.0-7 is largely for use with the book; however, version 1.1 has a much greater focus on use with training materials, whilst retaining compatibility with the book. |
| URL: | https://www.ascent.io/ |
| Depends: | R (≥ 3.5.0) |
| Suggests: | testthat |
| License: | GPL-2 |
| LazyLoad: | yes |
| LazyData: | yes |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.1.1 |
| BugReports: | https://github.com/HarryJAlexander/ascentTraining/issues |
| NeedsCompilation: | no |
| Packaged: | 2022-03-24 18:18:40 UTC; harry.alexander |
| Author: | Ascent [aut], Harry Alexander [aut, cre, ctb, dtc, rev] |
| Maintainer: | Harry Alexander <harry.alexander@ascent.io> |
| Repository: | CRAN |
| Date/Publication: | 2022-04-27 07:20:05 UTC |
Ascent Training Datasets
Description
Datasets designed to be used in conjunction with Ascent training materials.
Details
Datasets designed to be used in conjunction with Ascent training materials and the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9). The data cover a range of applications and have been collected from a number of sources. The airquality dataset, from the core R datasets package, is also provided in xlsx format in the extdata directory of this package.
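A minimal sketch of how the bundled spreadsheet might be located. It assumes the installed package is named ascentTraining (taken from the BugReports URL). The file name inside extdata is not stated here, so the sketch lists the directory instead of guessing it.

```r
# Assumption: the installed package name is "ascentTraining".
pkg <- "ascentTraining"

if (requireNamespace(pkg, quietly = TRUE)) {
  # system.file() returns the path of a directory inside an installed package.
  xlsx_dir <- system.file("extdata", package = pkg)
  # List any xlsx files rather than assuming a particular file name.
  xlsx_files <- list.files(xlsx_dir, pattern = "\\.xlsx$", full.names = TRUE)
} else {
  # Package not installed: nothing to list.
  xlsx_files <- character(0)
}

print(xlsx_files)
```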
Author(s)
Ascent
Contact: Ascent rin24hours@mango-solutions.com
Auto MPG Data Set
Description
Data on city-cycle fuel consumption, revised from the version held in the CMU StatLib library.
Usage
auto_mpg
Format
A matrix containing 398 observations and 10 attributes.
mpg: Miles per gallon of the engine. Predictor attribute
cylinders: Number of cylinders in the engine
displacement: Engine displacement
horsepower: Horsepower of the car
weight: Weight of the car (lbs)
acceleration: Acceleration of the car (seconds taken for 0-60mph)
model_year: Model year of the car in the 1900s
origin: Car origin
make: Car manufacturer
car_name: Name of the car
Source
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
References
Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
BBC articles data
Description
A collection of BBC news articles from the business or politics sections. There are a total of 927 articles used.
Usage
bbc_articles
Format
A tibble with 201,571 observations, each a word in a document.
word: A word in an article
document: The document/article ID where the word was taken from
Source
Full BBC Articles data
Description
Full BBC Articles data
Usage
bbc_articles_full
Format
A tibble with 927 observations, one per document, and two columns.
words: The words from a given article
document: The 'document' (article) ID
Details
A collection of business and politics BBC news articles. Each row represents one article (document),
with a document ID and a string of the text content with stop words removed. This is a 'dirty' version of the
bbc_articles dataset: each observation holds a string of text rather than a single word.
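The relationship between the one-word-per-row layout and the one-string-per-document layout can be sketched in base R. The toy words and document IDs below are hypothetical and do not come from the package data.

```r
# Hypothetical toy data in the one-word-per-row layout of bbc_articles.
tidy_words <- data.frame(
  word = c("shares", "rise", "profit", "vote", "election"),
  document = c("business_1", "business_1", "business_1",
               "politics_1", "politics_1"),
  stringsAsFactors = FALSE
)

# Collapse to one row per document, as in bbc_articles_full.
words <- tapply(tidy_words$word, tidy_words$document, paste, collapse = " ")
full <- data.frame(
  words = as.vector(words),
  document = names(words),
  stringsAsFactors = FALSE
)

# Going the other way: split each string back into one word per row.
split_words <- strsplit(full$words, " ", fixed = TRUE)
tidy_again <- data.frame(
  word = unlist(split_words),
  document = rep(full$document, lengths(split_words)),
  stringsAsFactors = FALSE
)
```

With tidyverse packages the same round trip is usually written with a tokenising helper, but the base R version keeps the sketch free of extra dependencies.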
Source
BBC Business article data
Description
A single BBC Business article (not included in the full BBC articles dataset), given in tidy, one word per row format.
Usage
bbc_business_123
Format
A tibble with 107 observations, each a word in a document.
word: A word in an article
document: The document/article ID where the word was taken from. Note: this has only one unique value; the column is kept for comparison with the other BBC datasets.
Source
BBC Politics article data
Description
A single BBC Politics article (not included in the full BBC articles dataset), given in tidy, one word per row format.
Usage
bbc_politics_123
Format
A tibble with 86 observations, each a word in a document.
word: A word in an article
document: The document/article ID where the word was taken from. Note: this has only one unique value; the column is kept for comparison with the other BBC datasets.
Source
Body image dataset
Description
Body image dataset
Usage
body_image
Format
A tibble of 246 observations on 8 attributes.
ethnicity: Subject's ethnicity (Asian, Europn, Maori, Pacific)
married: How many times have they been married?
bodyim: Subject's rating of themselves (slight.uw, right, slight.ow, mod.ow, very.ow)
sm.ever: Have they ever smoked?
weight: Weight in kilograms
height: Height in centimetres
age: Age in years
stressgp: What stress group are they in?
Details
A simulated dataset containing data on the self-image of subjects with differing body aesthetics.
Source
Simulated data
Gutenberg Project books dataset
Description
A mixed up collection of words from different book sections of two books.
Usage
book_sections
Format
A tibble with 108,657 observations, each a word in a document. This data set is designed to show how LDA can be used to separate a set of mixed documents into two distinct "topics" (or books).
word: Words from a given section within a book.
document: The book section ID that the word came from.
Source
Data taken from two books of the Gutenberg Project
Boston housing dataset
Description
Dataset containing housing values in the suburbs of Boston.
Usage
boston
Format
This data frame contains the following columns:
tract: Census tract
medv: Median value of owner-occupied homes in $1,000s.
crim: Per capita crime rate by town.
zn: Proportion of residential land zoned for lots over 25,000 sq.ft.
indus: Proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: Nitrogen oxides concentration (parts per 10 million).
rm: Average number of rooms per dwelling.
age: Proportion of owner-occupied units built prior to 1940.
dis: Weighted mean of distances to five Boston employment centres.
rad: Index of accessibility to radial highways.
tax: Full-value property-tax rate per $10,000.
ptratio: Pupil-teacher ratio by town.
b: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
lstat: Lower status of the population (percent).
Details
The boston data frame has 506 rows and 15 columns.
Source
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Wisconsin Diagnostic Breast Cancer (WDBC)
Description
The data contain measurements on cells in suspicious lumps in a woman's breast. Features are computed from a digitised image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. All samples are classified as either benign or malignant.
Usage
breast_cancer
Format
breast_cancer is a tibble with 22 columns. The first column
is an ID column. The second indicates whether the sample is classified as benign or malignant.
The remaining columns contain measurements for 20 features. Ten real-valued features are computed
for each cell nucleus. The references listed below contain detailed descriptions of how these features
are computed. The mean and the "worst" value (the mean of the three largest values) of each feature were computed
for each image, resulting in 20 features. Below are descriptions of these features, where *
should be replaced by either mean or worst.
*_radius: mean of distances from center to points on the perimeter
*_texture: standard deviation of gray-scale values
*_perimeter: perimeter value
*_area: area value
*_smoothness: local variation in radius lengths
*_compactness: perimeter^2 / area - 1.0
*_concavity: severity of concave portions of the contour
*_concave_points: number of concave portions of the contour
*_symmetry: symmetry value
*_fractal_dimension: "coastline approximation" - 1
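The naming pattern above can be expanded into the 20 expected feature names. This is a sketch built only from the description; the actual column order in breast_cancer is not stated here and may differ.

```r
# The ten base features listed above.
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave_points", "symmetry",
              "fractal_dimension")

# The two summary statistics that replace the * prefix.
stats <- c("mean", "worst")

# Every prefix/feature combination, e.g. "mean_radius" and "worst_radius".
feature_cols <- as.vector(outer(stats, features, paste, sep = "_"))

length(feature_cols)
```

A vector like this can be used to select only the mean or only the worst measurements, for example with grep("^mean_", feature_cols, value = TRUE).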
Note
This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.
Source
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository.
Irvine, CA: University of California, School of Information and Computer
Science.
References
O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).
Wisconsin Breast Cancer Database
Description
Wisconsin Breast Cancer Database
Usage
breast_cancer_clean_features
Format
A list containing a training and test dataset. These come from a data frame with 699 observations on 11 variables; the ID and class columns have been removed. The train to test ratio is 0.8.
Cl.thickness: Clump Thickness
Cell.size: Uniformity of Cell Size
Cell.shape: Uniformity of Cell Shape
Marg.adhesion: Marginal Adhesion
Epith.c.size: Single Epithelial Cell Size
Bare.nuclei: Bare Nuclei
Bl.cromatin: Bland Chromatin
Normal.nucleoli: Normal Nucleoli
Mitoses: Mitoses
Source
Creator: Dr. William H. Wolberg (physician), University of Wisconsin Hospital, Madison, Wisconsin, USA
Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
Received: David W. Aha (aha@cs.jhu.edu)
These data have been taken from the UCI Repository Of Machine Learning Databases at
and were converted to R format by Evgenia Dimitriadou.
References
1. Wolberg, W.H. & Mangasarian, O.L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences, 87, 9193-9196.
- Size of data set: only 369 instances (at that point in time)
- Collected classification results: 1 trial only
- Two pairs of parallel hyperplanes were found to be consistent with 50% of the data
- Accuracy on remaining 50% of dataset: 93.5%
- Three pairs of parallel hyperplanes were found to be consistent with 67% of data
- Accuracy on remaining 33% of dataset: 95.9%
2. Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470-479). Aberdeen, Scotland: Morgan Kaufmann.
- Size of data set: only 369 instances (at that point in time)
- Applied 4 instance-based learning algorithms
- Collected classification results averaged over 10 trials
- Best accuracy result:
  - 1-nearest neighbor: 93.7%
  - trained on 200 instances, tested on the other 169
- Also of interest:
  - Using only typical instances: 92.2% (storing only 23.1 instances)
  - trained on 200 instances, tested on the other 169
Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
Wisconsin Breast Cancer Database
Description
Wisconsin Breast Cancer Database
Usage
breast_cancer_clean_target
Format
A list containing a training and test dataset. These come from a data frame with 699 observations on 11 variables; only the target classes have been kept. The train to test ratio is 0.8.
Class.Benign: Whether the sample was classified as benign
Class.malignant: Whether the sample was classified as malignant
Source
Creator: Dr. William H. Wolberg (physician), University of Wisconsin Hospital, Madison, Wisconsin, USA
Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
Received: David W. Aha (aha@cs.jhu.edu)
These data have been taken from the UCI Repository Of Machine Learning Databases at
and were converted to R format by Evgenia Dimitriadou.
Carrier data
Description
This data comes from the RITA/Transtats database
Usage
carriers
Format
A data frame with 1492 observations and 2 variables.
Code: A character string giving the IATA code for the carrier
Description: Carrier name/description
R For Data Science tidytuesday commute dataset
Description
Data from the ACS Survey detailing the use of different transport modes
Usage
commute
Format
A tibble containing 3,496 observations of 9 variables
city: City
state: State
city_size: City size
  Small = 20K to 99,999
  Medium = 100K to 199,999
  Large = 200K or more
mode: Mode of transport, either walk or bike
n: Number of individuals
percent: Percent of total individuals
moe: Margin of Error (percent)
state_abb: Abbreviated state name
state_region: ACS State region
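The city_size bands can be reproduced from a population count with cut(). The population values below are hypothetical; the commute dataset stores only the resulting label.

```r
# Hypothetical city populations used only to illustrate the banding.
population <- c(25000, 99999, 100000, 199999, 200000, 1500000)

# Left-closed intervals: [20K, 100K), [100K, 200K), [200K, Inf).
city_size <- cut(
  population,
  breaks = c(20000, 100000, 200000, Inf),
  labels = c("Small", "Medium", "Large"),
  right = FALSE
)

data.frame(population, city_size)
```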
Source
American Community Survey, United States Census Bureau
R For Data Science repository: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-11-05
Article and underlying data can be found at: https://www.census.gov/library/publications/2014/acs/acs-25.html?#
Demographics data
Description
A simulated dataset containing demographic data about a number of subjects.
Usage
demo_data
demoData
Format
A data frame with 33 observations on the following 7 demographic variables. This data is designed so that it can be merged with the dataset pk_data.
Subject: A numeric vector giving the subject identifier
Sex: A factor with levels F M
Age: A numeric vector giving the age of the subject
Weight: A numeric vector giving weight in kg
Height: A numeric vector giving height in cm
BMI: A numeric vector giving the subject body mass index
Smokes: A factor with levels No Yes
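A sketch of merging demographic and PK data on Subject. The two small data frames are hypothetical stand-ins that reuse the documented column names; with the package loaded, demo_data and pk_data would take their place.

```r
# Hypothetical miniature stand-in for demo_data (one row per subject).
demo_small <- data.frame(
  Subject = 1:3,
  Sex = factor(c("F", "M", "F")),
  Age = c(34, 45, 29)
)

# Hypothetical miniature stand-in for pk_data (several rows per subject).
pk_small <- data.frame(
  Subject = rep(1:3, each = 2),
  Dose = rep(c(25, 50, 100), each = 2),
  Time = rep(c(0, 1), times = 3),
  Conc = c(0, 12.1, 0, 25.3, 0, 51.8)
)

# Each PK row picks up the demographics of its subject.
merged <- merge(pk_small, demo_small, by = "Subject")
```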
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
Simulated data
Dow Jones Index Data
Description
Dataset containing the Dow Jones Index between 2014-01-01 and 2015-01-01, which is a stock market index that measures the stock performance of 30 large companies listed on stock exchanges in the United States.
Usage
dow_jones_data
dowJonesData
Format
A data frame with 252 observations on the following 7 variables containing data from 2014-01-01 to 2015-01-01.
Date: Date of observation in character string format "%m/%d/%Y"
DJI.Open: Opening value of DJI on the specified date
DJI.High: High value of the DJI on the specified date
DJI.Low: Low value of the DJI on the specified date
DJI.Close: Closing value of the DJI on the specified date
DJI.Volume: The number of shares or contracts traded
DJI.Adj.Close: Close price adjusted for dividends and splits
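Because Date is stored as a character string, it usually needs converting before plotting or sorting. A short sketch using the documented format; the two date strings are made-up examples within the stated range.

```r
# Example strings in the documented "%m/%d/%Y" layout (month/day/year).
date_strings <- c("01/02/2014", "12/31/2014")

# Convert to Date objects so they sort and plot chronologically.
dates <- as.Date(date_strings, format = "%m/%d/%Y")

format(dates, "%Y-%m-%d")
```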
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
Data obtained using yahooSeries from the fImport package.
Repeated Measures Drug data
Description
Repeated Measures Drug data
Usage
drugs
Format
A data frame with 20 observations on the following 3 variables.
Subj: A numeric vector, giving the subject ID
Drug: A numeric vector giving the drug ID, numbered 1 to 4
Value: A numeric vector, giving the observation value
Source
Generated from example data used in https://www.stattutorials.com/SAS/TUTORIAL-PROC-GLM-REPEAT.htm
Data that can be used to fit or plot Emax models
Description
Data that can be used to fit or plot Emax models
Usage
emax_data
emaxData
Format
A data frame with 64 observations on the following 6 variables.
Subject: a numeric vector giving the unique subject ID
Dose: a numeric vector giving the dose group
E: a numeric vector giving the Emax
Gender: a numeric vector giving the gender
Age: a numeric vector giving the age of the subject
Weight: a numeric vector giving the weight
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
Simulated data
Function to calculate Emax
Description
Calculation used for Emax in Ascent materials. Note: This function has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the function has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Usage
emax_fun(Dose, E0 = 0, ED50 = 50, Emax = 100)
Arguments
| Dose | User provided dose values |
| E0 | Effect at time 0 |
| ED50 | 50% of maximum effect |
| Emax | Maximum effect |
Value
Numeric value/vector representing the response value.
Examples
emax_fun(Dose = 100)
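The function body is not shown in this manual. As a rough illustration, the standard hyperbolic Emax model can be written with the same argument names and defaults. The name emax_sketch is hypothetical, and the formula is an assumption about what emax_fun computes, not the package source.

```r
# Hypothetical re-implementation of the standard Emax model:
#   response = E0 + Emax * Dose / (ED50 + Dose)
# Assumption: emax_fun follows this textbook form.
emax_sketch <- function(Dose, E0 = 0, ED50 = 50, Emax = 100) {
  E0 + Emax * Dose / (ED50 + Dose)
}

# Vectorised over Dose: baseline at 0, half of Emax at Dose = ED50.
emax_sketch(Dose = c(0, 50, 100))
```

Under this form the response starts at E0, reaches E0 + Emax / 2 when Dose equals ED50, and approaches E0 + Emax as Dose grows.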
Function to fit logistic model
Description
Simple logistic function as used in Ascent training materials. Note: This function has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the function has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Usage
logistic_fun(Dose, E0 = 0, EC50 = 50, Emax = 1, rc = 5)
Arguments
| Dose | The dose value to calculate at |
| E0 | Effect at time 0 |
| EC50 | 50% of maximum effect |
| Emax | Maximum effect |
| rc | rate constant |
Value
Numeric value/vector representing the response value.
Examples
logistic_fun(Dose = 50)
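The exact formula behind logistic_fun is not given here. One common logistic dose-response form with the same arguments is sketched below. The name logistic_sketch is hypothetical, and treating rc as a scale on the dose axis is an assumption; the package may use a different parameterisation.

```r
# Hypothetical logistic dose-response curve:
#   response = E0 + Emax / (1 + exp((EC50 - Dose) / rc))
# Assumption: rc scales the distance from EC50; larger rc gives a flatter curve.
logistic_sketch <- function(Dose, E0 = 0, EC50 = 50, Emax = 1, rc = 5) {
  E0 + Emax / (1 + exp((EC50 - Dose) / rc))
}

# The curve rises from near E0 to near E0 + Emax, crossing the midpoint at EC50.
logistic_sketch(Dose = c(0, 50, 100))
```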
Messy clinical trial data
Description
Simulated dataset for examples of reshaping data
Usage
messy_data
messyData
Format
A data frame with 33 observations on the following 7 variables. This data has been designed to show reshaping/tidying of data.
Subject: A numeric vector giving the subject ID
Placebo.1: A numeric vector giving the subject's observed value on treatment Placebo at time 1
Placebo.2: A numeric vector giving the subject's observed value on treatment Placebo at time 2
Drug1.1: A numeric vector giving the subject's observed value on treatment Drug1 at time 1
Drug1.2: A numeric vector giving the subject's observed value on treatment Drug1 at time 2
Drug2.1: A numeric vector giving the subject's observed value on treatment Drug2 at time 1
Drug2.2: A numeric vector giving the subject's observed value on treatment Drug2 at time 2
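A base R sketch of the kind of tidying this layout is meant for: stacking the six value columns and splitting each column name into a treatment and a time. The two-subject data frame is a hypothetical stand-in that reuses the documented column names.

```r
# Hypothetical two-subject stand-in with the same columns as messy_data.
messy_small <- data.frame(
  Subject = 1:2,
  Placebo.1 = c(5.1, 4.8), Placebo.2 = c(5.0, 4.9),
  Drug1.1 = c(6.2, 5.9), Drug1.2 = c(6.8, 6.1),
  Drug2.1 = c(7.0, 6.5), Drug2.2 = c(7.4, 6.9)
)

value_cols <- setdiff(names(messy_small), "Subject")

# Stack the value columns: one row per subject, treatment and time.
long <- data.frame(
  Subject = rep(messy_small$Subject, times = length(value_cols)),
  Column = rep(value_cols, each = nrow(messy_small)),
  Value = unlist(messy_small[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split names such as "Drug1.2" into treatment "Drug1" and time 2.
long$Treatment <- sub("\\.[0-9]+$", "", long$Column)
long$Time <- as.integer(sub("^.*\\.", "", long$Column))
long$Column <- NULL
```

The same reshaping is often written with tidyr; base R is used here only to keep the sketch self-contained.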
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
Simulated data
Clinical trial data
Description
Clinical trial data
Usage
missing_pk
missingPk
Format
A data frame with 165 observations on the following 4 variables.
Subject: a numeric vector giving the subject identifier
Dose: a numeric vector giving the dose group
Time: a numeric vector giving the observation times
Conc: a numeric vector giving the observed concentration
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
Simulated from 'pk_data'
Typical PK data
Description
Typical PK data
Usage
pk_data
pkData
Format
A data frame with 165 observations on the following 4 variables.
Subject: a numeric vector giving the subject identifier
Dose: a numeric vector giving the dose group
Time: a numeric vector giving the observation times
Conc: a numeric vector giving the observed concentration
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
Simulated data
Insurance Policy Data
Description
Insurance Policy Data
Usage
policy_data
policyData
Format
A data frame with 926 observations on the following 13 variables.
Year: The four digit year of the policy
PolicyNo: The policy number
TotalPremium: The total insurance premium
BonusMalus: Discount level
WeightClass: The weight class of the car
Region: Region of the car owner
Age: Age of the main driver
Mileage: Estimated annual mileage
Usage: Car usage
PremiumClass: Class of the car
NoClaims: Number of previous claims
GrossIncurred: Claim cost
Exposure: How long they have been driving
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
Simulated based on details of how to simulate car insurance data in Modern Actuarial Risk Theory Using R 2nd Edition (Rob Kaas, Marc Goovaerts, Jan Dhaene, Michel Denuit)
Typical PK data
Description
Typical PK data
Usage
qtpk2
Format
A data frame with 2061 observations on the following 8 variables.
subjid: A numeric vector giving the subject ID
treat: A factor giving the treatment
time: A numeric vector giving the observation times
qt: A numeric vector giving the QT interval value
qtcb: A numeric vector giving corrected QT interval
hr: A numeric vector giving the heart rate
rr: A numeric vector giving the R-R interval
sex: A factor giving the subject sex
Source
A subset of the data qtpk originally provided in the QT package
An example of NONMEM run data
Description
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Usage
run_data
runData
Format
A data frame with 73 observations on the following 10 variables.
ID: a numeric vector giving the subject ID
DAY: a numeric vector giving the day of the observation
CL: a numeric vector giving the clearance value
V: a numeric vector giving the volume of distribution
WT: a numeric vector giving the weight
DV: a numeric vector giving the dependent variable
IPRE: a numeric vector giving the individual prediction
PRED: a numeric vector giving the population prediction
RES: a numeric vector giving the residual
WRES: a numeric vector giving the weighted residual
Source
Simulated Data
Students simulated data
Description
Students simulated data
Usage
students
Format
A tibble with 146 observations of 15 variables.
Grade: Final grade (A, B, C, D)
Pass: Did they pass the course? (No, Yes)
Exam: Mark in final exam (out of 100)
Degree: The degree type undertaken by the student
Gender: Gender of the student
Attend: Did they regularly attend class? (No, Yes)
Assign: Score obtained in mid-term assignment (out of 20)
Test: Score obtained in previous term test (out of 20)
B: Mark for short answer section (out of 20)
C: Mark for long answer section (out of 20)
MC: Mark for multiple choice section (out of 30)
Colour: Colour of exam booklet (Blue, Green, Pink, Yellow)
Stage1: Stage one grade (A, B, C)
Years.Since: Number of years since doing Stage 1
Repeat: Were they repeating the paper? (No, Yes)
Source
Simulated data
London Tube Performance data
Description
London Tube Performance data
Usage
tube_data
tubeData
Format
A data frame with 1050 observations on the following 9 variables.
Line: A factor with 10 levels, one for each London tube line
Month: A numeric vector indicating the month of the observation
Scheduled: A numeric vector giving the scheduled running time
Excess: A numeric vector giving the excess running time
TOTAL: A numeric vector giving the total running time
Opened: A numeric vector giving the year the line opened
Length: A numeric vector giving the line length
Type: A factor indicating the type of tube line
Stations: A numeric vector giving the number of stations on the line
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
This data was taken from "https://data.london.gov.uk/dataset/tube-network-performance-data-transport-committee-report"
Iris predictors data for Species classification
Description
This data was taken from Edgar Anderson's famous iris data set. This gives the measurements (in centimeters)
of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.
The species are Iris setosa, versicolor, and virginica. However, the species is seen as the target variable, and as such
has been removed from this dataset, whilst being added to the counterpart y_iris dataset. Furthermore, the 4 remaining
'predictor' variables have been separated into a training and test set with a ratio of 4:1, followed by centering and scaling.
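The preparation described above can be approximated with the iris data that ships with R. This is a sketch under stated assumptions: the seed is arbitrary, the actual rows in each split are unknown, and reusing the training centre and scale for the test rows is a common convention rather than a documented detail of x_iris.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Keep the four predictors and drop Species.
x <- as.matrix(iris[, 1:4])
n <- nrow(x)

# 4:1 split: 120 training rows, 30 test rows.
train_idx <- sample(n, size = round(0.8 * n))

# Centre and scale the training rows to mean 0 and standard deviation 1.
x_train <- scale(x[train_idx, ])

# Assumption: the test rows reuse the training centre and scale.
x_test <- scale(
  x[-train_idx, ],
  center = attr(x_train, "scaled:center"),
  scale = attr(x_train, "scaled:scale")
)

x_sketch <- list(train = x_train, test = x_test)
sapply(x_sketch, dim)
```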
Usage
x_iris
Format
A list of two named matrices, 'train' and 'test', representing the training and test sets for the predictors. These have 4 columns each, with 120 and 30 rows respectively.
Sepal.Length: Sepal length
Sepal.Width: Sepal width
Petal.Length: Petal length
Petal.Width: Petal width
Source
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188. The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2-5
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Typical NONMEM data
Description
Typical NONMEM data
Usage
xp_data
xpData
Format
A data frame with 1061 observations on the following 23 variables.
ID: a numeric vector giving the subject ID
SEX: a numeric vector giving the subject sex
RACE: a numeric vector giving the subject race
SMOK: a numeric vector giving the subject smoking status
HCTZ: a numeric vector giving the treatment status
PROP: a numeric vector giving the treatment status
CON: a numeric vector giving the treatment status
DV: a numeric vector giving the dependent variable
PRED: a numeric vector giving population prediction
RES: a numeric vector giving the residual
WRES: a numeric vector giving the weighted residual
AGE: a numeric vector giving the subject age
HT: a numeric vector giving the subject height
WT: a numeric vector giving the subject weight
SECR: a numeric vector giving the serum creatinine value
OCC: a numeric vector giving the occasion
TIME: a numeric vector giving the observation time
IPRE: a numeric vector giving individual prediction
IWRE: a numeric vector giving the individual weighted residual
SID: a numeric vector giving the site ID
CL: a numeric vector giving the clearance
V: a numeric vector giving the volume of distribution
KA: a numeric vector giving the absorption rate constant
Details
This dataset has been renamed using tidyverse-style snake_case naming conventions. However, the original name of the dataset has been kept to ensure backwards compatibility with the book SAMS Teach Yourself R in 24 Hours (ISBN: 978-0-672-33848-9).
Source
Simulated Data
Iris class data for Species classification
Description
This data was taken from Edgar Anderson's famous iris data set. This gives the measurements (in centimeters)
of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.
The species are Iris setosa, versicolor, and virginica. This is the target dataset (as a counterpart to the x_iris dataset)
and thus only retains the Species information. As with the x_iris dataset, the data has been split into a training and test
set with a ratio of 4:1. Following this the species class has been one-hot encoded to give three columns, one for each species level.
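One-hot encoding of the species can be sketched with the iris data that ships with R. The column names follow the Species.<level> pattern documented for this dataset; the train/test split is left out here and would have to use the same row indices as the predictors.

```r
species <- iris$Species

# One 0/1 indicator column per species level.
y <- sapply(levels(species), function(lv) as.integer(species == lv))
colnames(y) <- paste0("Species.", levels(species))

# Each row has exactly one 1, marking that flower's species.
head(y, 3)
```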
Usage
y_iris
Format
A list of two named matrices, 'train' and 'test', representing the training and test sets for the target classes. These have 3 indicator columns each, with 120 and 30 rows respectively.
Species.setosa: Indicator column for the species class setosa
Species.versicolor: Indicator column for the species class versicolor
Species.virginica: Indicator column for the species class virginica
Source
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188. The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2-5
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.