Help for package clickR

Type:

Package

Title:

Semi-Automatic Preprocessing of Messy Data with Change Tracking for Dataset Cleaning

Version:

0.9.45

Maintainer:

David Hervas Marin <ddhervas@yahoo.es>

Imports:

beeswarm, future, future.apply, methods, stringdist

Description:

Tools for assessing data quality, performing exploratory analysis, and semi-automatic preprocessing of messy data with change tracking for integral dataset cleaning.

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

Encoding:

UTF-8

RoxygenNote:

7.3.2

NeedsCompilation:

Packaged:

2024-12-05 12:50:20 UTC; aghil

LazyData:

true

Author:

David Hervas Marin [aut, cre]

Depends:

R (≥ 3.5.0)

Repository:

CRAN

Date/Publication:

2024-12-05 13:20:03 UTC

geq & not NA

Description

'>=' operator where NA values return FALSE

Usage

x %>=NA% y

Arguments

x

Vector for the left side of the operator

y

A Scalar or vector of the same length as x for the right side of the operator

Value

A logical vector of the same length as x

greater & NA

Description

'>' operator where NA values return FALSE

Usage

x %>NA% y

Arguments

x

Vector for the left side of the operator

y

A Scalar or vector of the same length as x for the right side of the operator

Value

A logical vector of the same length as x

leq & not NA

Description

'<=' operator where NA values return FALSE

Usage

x %<=NA% y

Arguments

x

Vector for the left side of the operator

y

A Scalar or vector of the same length as x for the right side of the operator

Value

A logical vector of the same length as x

less & NA

Description

'<' operator where NA values return FALSE

Usage

x %<NA% y

Arguments

x

Vector for the left side of the operator

y

A Scalar or vector of the same length as x for the right side of the operator

Value

A logical vector of the same length as x

between operator

Description

Operator equivalent to x >= lower.value & x <= upper.value

Usage

x %between% y

Arguments

x

Vector for the left side of the operator

y

A vector of length two with the lower and upper values of the interval

Value

A logical vector of the same length as x

between operator & not NA

Description

Operator equivalent to x >= lower.value & x <= upper.value & !is.na(x)

Usage

x %betweenNA% y

Arguments

x

Vector for the left side of the operator

y

A vector of length two with the lower and upper values of the interval

Value

A logical vector of the same length as x

Computes Goodman and Kruskal's tau

Description

Returns Goodman and Kruskal's tau measure of association between two categorical variables

Usage

GK_assoc(x, y)

Arguments

x

A categorical variable

y

A categorical variable

Value

Goodman and Kruskal's tau

Examples

data(infert)
GK_assoc(infert$education, infert$case)
GK_assoc(infert$case, infert$education) #Not the same

Get anti-mode

Description

Returns the least repeated value

Usage

antimoda(x)

Arguments

x

A categorical variable

Value

The anti-mode (least repeated value)

Check for bivariate outliers

Description

Checks for bivariate outliers in a data.frame

Usage

bivariate_outliers(x, threshold_r = 10, threshold_b = 1.5)

Arguments

x

A data.frame object

threshold_r

Threshold for the case of two continuous variables

threshold_b

Threshold for the case of one continuous and one categorical variable

Value

A data frame with all the observations considered as bivariate outliers

Examples

bivariate_outliers(iris)

Checks data quality of a variable

Description

Returns different data quality details of a numeric or categorical variable

Usage

check_quality(
  x,
  id = 1:length(x),
  plot = TRUE,
  numeric = NULL,
  k = 5,
  n = ifelse(is.numeric(x) | ttrue(numeric) | class(x) %in% "Date", 5, 2),
  output = FALSE,
  ...
)

Arguments

x

A variable from a data.frame

id

ID column to reference the found extreme values

plot

If the variable is numeric, should a boxplot be drawn?

numeric

If set to TRUE, forces the variable to be considered numeric

k

Number of different numeric values in a variable to be considered as numeric

n

Number of extreme values to extract

output

Format of the output. If TRUE, optimize for exporting as csv

...

further arguments passed to boxplot()

Value

A list of a data.frame with information about data quality of the variable

Examples

check_quality(airquality$Ozone)  #For one variable
lapply(airquality, check_quality)  #For a data.frame
lapply(airquality, check_quality, output=TRUE)  #For a data.frame, one row per variable

Clustering of variables

Description

Displays associations between variables in a data.frame in a heatmap with clustering

Usage

cluster_var(x, margins = c(8, 1))

Arguments

x

A data.frame

margins

Margins for the plot

Value

A heatmap with the variable associations

Examples

cluster_var(iris)
cluster_var(mtcars)

Detailed summary of the data

Description

Creates a detailed summary of the data

Usage

descriptive(x, z = 3, ignore.na = TRUE, by = NULL, print = TRUE)

Arguments

x

A data.frame

z

Number of decimal places

ignore.na

If TRUE NA values will not count for relative frequencies calculations

by

Factor variable definining groups for the summary

print

Should results be printed?

Value

Summary of the data

Examples

descriptive(iris)
descriptive(iris, by="Species")

Extreme values from a numeric vector

Description

Returns the nth lowest and highest values from a vector

Usage

extreme_values(x, n = 5, id = NULL)

Arguments

x

A vector

n

Number of extreme values to return

id

ID column to reference the found extreme values

Value

A matrix with the lowest and highest values from a vector

Find and replace

Description

Searches a data.frame for a specific character string and replaces it with another one

Usage

f_replace(
  x,
  string,
  replacement,
  complete = TRUE,
  select = 1:ncol(x),
  track = TRUE
)

Arguments

x

A data.frame

string

A character string to search in the data.frame

replacement

A character string to replace the old string (can be NA)

complete

If TRUE, search for complete strings only. If FALSE, search also for partial strings.

select

Numeric vector with the positions (all by default) to be affected by the function

track

Track changes?

Examples

iris2 <- f_replace(iris, "setosa", "ensata")
track_changes(iris2)

fix_NA

Description

Fixes miscoded missing values

Usage

fix_NA(
  x,
  na.strings = c("^$", "^ $", "^\\?$", "^-$", "^\\.$", "^NaN$", "^NULL$", "^N/A$"),
  track = TRUE,
  parallel = TRUE
)

Arguments

x

A data.frame

na.strings

Strings to be considered NA

track

Track changes?

parallel

Should the computations be performed in parallel? Set up strategy first with future::plan()

Examples

mydata <- data.frame(prueba = c("", NA, "A", 4, " ", "?", "-", "+"),
casa = c("", 1, 2, 3, 4, " ", 6, 7))
fix_NA(mydata)

fix_all

Description

Tries to automatically fix all problems in the data.frame

Usage

fix_all(x, select = 1:ncol(x), track = TRUE)

Arguments

x

A data.frame

select

Numeric vector with the positions (all by default) to be affected by the function

track

Track changes?

fix_concat

Description

Fixes concatenated values in a variable

Usage

fix_concat(x, varname, sep = ", |; | ", track = TRUE)

Arguments

x

A data.frame

varname

Variable name

sep

Separator for the different values

track

Track changes?

Examples

mydata <- data.frame(concat=c("a", "b", "a b" , "a b, c", "a; c"),
numeric = c(1, 2, 3, 4, 5))
fix_concat(mydata, "concat")

Fix dates

Description

Fixes dates. Dates can be recorded in numerous formats depending on the country, the traditions and the field of knowledge. fix.dates tries to detect all possible date formats and transforms all of them in the ISO standard favored by R (yyyy-mm-dd).

Usage

fix_dates(
  x,
  max.NA = 0.8,
  min.obs = nrow(x) * 0.05,
  use.probs = TRUE,
  select = 1:ncol(x),
  track = TRUE,
  parallel = TRUE
)

Arguments

x

A data.frame

max.NA

Maximum allowed proportion of NA values created by coercion. If the coercion to date creates more NA values than those specified in max.NA, then all changes will be reverted and the variable will remain unchanged.

min.obs

Minimum number of non-NA observations allowed per variable. If the variable has fewer non-NA observations, then it will be ignored by fix.dates.

use.probs

When there are multiple date formats in the same column, there can be ambiguities. For example, 04-06-2015 can be interpreted as 2015-06-04 or as 2015-04-06. If use.probs=TRUE, ambiguities will be solved by assigning to the most frequent date format in the column.

select

Numeric vector with the positions (all by default) to be affected by the function

track

Track changes?

parallel

Should the computations be performed in parallel? Set up strategy first with future::plan()

Examples

mydata<-data.frame(Dates1=c("25/06/1983", "25-08/2014", "2001/11/01", "2008-10-01"),
                   Dates2=c("01/01/85", "04/04/1982", "07/12-2016", "September 24, 2020"),
                   Numeric1=rnorm(4))
fix_dates(mydata)

Fix factors imported as numerics

Description

Fixes factors imported as numerics. It is usual in some fields to encode factor variables as integers. This function detects such variables and transforms them into factors. When drop=TRUE (by default) it detects multiple versions of the same levels due to different capitalization, whitespaces or non-ASCII characters.

Usage

fix_factors(x, k = 5, select = 1:ncol(x), drop = TRUE, track = TRUE)

Arguments

x

A data.frame

k

Maximum number of different numeric values to be converted to factor

select

Numeric vector with the positions (all by default) to be affected by the function

drop

Drop similar levels?

track

Keep track of changes?

Examples

# mtcars data has all variables encoded as numeric, even the factor variables.
descriptive(mtcars)
# After using fix_factors, factor variables are recognized as such.
descriptive(fix_factors(mtcars))

Fix levels

Description

Fixes levels of a factor

Usage

fix_levels(
  data,
  factor_name,
  method = "dl",
  levels = NULL,
  plot = FALSE,
  k = ifelse(!is.null(levels), length(levels), 2),
  track = TRUE,
  ...
)

Arguments

data

data.frame with the factor to fix

factor_name

Name of the factor to fix (as character)

method

Method from stringdist package to estimate distances

levels

Optional vector with the levels names. If "auto", levels are assigned based on frequency

plot

Optional: Plot cluster dendrogram?

k

Number of levels for clustering

track

Keep track of changes?

...

Further parameters passed to stringdist::stringdistmatrix function

Examples

mydata <- data.frame(factor1=factor(c("Control", "Treatment", "Tretament", "Tratment", "treatment",
"teatment", "contrl", "cntrol", "CONTol", "not available", "na")))
fix_levels(mydata, "factor1", k=4, plot=TRUE)   #Chose k to select matching levels
fix_levels(mydata, "factor1", levels=c("Control", "Treatment"), k=4)

Fix numeric data

Description

Fixes numeric data. In many cases, numeric data are not recognized by R because there are data inconsistencies (wrong decimal separator, whitespaces, typos, thousand separator, etc.). fix_numerics detects and corrects these variables, making them numeric again.

Usage

fix_numerics(
  x,
  k = 8,
  max.NA = 0.2,
  select = 1:ncol(x),
  track = TRUE,
  parallel = TRUE
)

Arguments

x

A data.frame

k

Minimum number of different values a variable has to have to be considered numerical

max.NA

Maximum allowed proportion of NA values created by coercion. If the coercion to numeric creates more NA values than those specified in max.NA, then all changes will be reverted and the variable will remain unchanged.

select

Numeric vector with the positions (all by default) to be affected by the function

track

Keep track of changes?

parallel

Should the computations be performed in parallel? Set up strategy first with future::plan()

Examples

mydata<-data.frame(Numeric1=c(7.8, 9.2, "5.4e+2", 3.3, "6,8", "3..3"),
                   Numeric2=c(3.1, 1.2, "3.4s", "48,500.04 $", 7, "$  6.4"))
descriptive(mydata)
descriptive(fix_numerics(mydata, k=5))

Forge

Description

Reshapes a data frame from wide to long format

Usage

forge(data, affixes, force.fixed = NULL, var.name = "time")

Arguments

data

data.frame

affixes

Affixes for repeated measures

force.fixed

Variables with matching affix to be excluded

var.name

Name for the new created variable (repetitions)

Examples

#Data frame in wide format
df1 <- data.frame(id = 1:4, age = c(20, 30, 30, 35), score1 = c(2,2,3,4),
                  score2 = c(2,1,3,1), score3 = c(1,1,0,1))
df1
#Data frame in long format
forge(df1, affixes= c("1", "2", "3"))

#Data frame in wide format with two repeated measured variables
df2 <- data.frame(df1, var1 = c(15, 20, 16, 19), var3 = c(12, 15, 15, 17))
df2
#Missing times are filled with NAs
forge(df2, affixes = c("1", "2", "3"))

#Use of parameter force.fixed
df3 <- df2[, -7]
df3
forge(df3, affixes=c("1", "2", "3"))
forge(df3, affixes=c("1", "2", "3"), force.fixed = c("var1"))

Internal function to fix_dates

Description

Function to format dates

Usage

fxd(d, use.probs = TRUE)

Arguments

d

A character vector

use.probs

Solve ambiguities by similarity to the most frequent formats

Good to go

Description

Loads all libraries used in scripts inside the selected path

Usage

good2go(path = getwd(), info = TRUE, load = TRUE)

Arguments

path

Path where the scripts are located

info

List the libraries found?

load

Should the libraries found be loaded?

Improved boxplot

Description

Creates an improved boxplot with individual data points

Usage

ipboxplot(formula, boxwex = 0.6, ...)

Arguments

formula

Formula for the boxplot

boxwex

Width of the boxes

...

further arguments passed to beeswarm()

Examples

ipboxplot(Sepal.Length ~ Species, data=iris)
ipboxplot(mpg ~ gear, data=mtcars)

Kill factors

Description

Changes factor variables to character

Usage

kill.factors(dat, k = 10)

Arguments

dat

A data.frame

k

Maximum number of levels for factors

Examples

d <- data.frame(Letters=letters[1:20], Nums=1:20)
d$Letters
d <- kill.factors(d)
d$Letters

Computes kurtosis

Description

Calculates kurtosis of a numeric variable

Usage

kurtosis(x)

Arguments

x

A numeric variable

Value

kurtosis value

Tracked manual fixes to data

Description

Tracks manual fixes performed on a variable in a data.frame

Usage

manual_fix(data, variable, subset, newvalues = NULL)

Arguments

data

A data.frame

variable

A character string with the name of the variable to be fixed

subset

A logical expression for selecting the cases to be fixed

newvalues

New value or values that will take the cases selected by subset parameter.

Examples

iris2 <- manual_fix(iris, "Petal.Length", Petal.Length < 1.2, 0)
track_changes(iris2)

Checks if each value might be numeric

Description

Checks if each value from a vector might be numeric

Usage

may.numeric(x)

Arguments

x

A vector

Value

A logical vector

Mine plot

Description

Creates a heatmap-like plot for exploring the data

Usage

mine.plot(
  x,
  fun = is.na,
  spacing = 5,
  sort = F,
  show.x = TRUE,
  show.y = TRUE,
  ...
)

Arguments

x

A data.frame

fun

A function that evaluates a vector and returns a logical vector

spacing

Numerical separation between lines at the y-axis

sort

If TRUE, variables are sorted according to their results

show.x

Should the x-axis be plotted?

show.y

Should the y-axis be plotted?

...

further arguments passed to order()

Examples

mine.plot(airquality)   #Displays missing data
mine.plot(airquality, fun=outliers)   #Shows extreme values

Get mode

Description

Returns the most repeated value

Usage

moda(x)

Arguments

x

A categorical variable

Value

The mode

Estimates number of modes

Description

Estimates the number of modes

Usage

moda_cont(x)

Arguments

x

A numeric variable

Value

Estimated number of modes.

Multiple tapply

Description

Modification of the tapply function to use with data.frames. Consider using aggregate()

Usage

mtapply(x, group, fun)

Arguments

x

A data.frame

group

Grouping variable

fun

Function to apply by group

Examples

mtapply(mtcars, mtcars$gear, mean)

Messy Motor Trend Car Road Tests Dataset

Description

Modified version of the mtcars dataset with different types of errors in the data. The dataset has 13 variables and 32 observations.

Usage

mtcars_messy

Format

A data frame with 32 observations and 13 variables

Source

datasets package

References

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

Examples

descriptive(mtcars_messy)

Internal function for descriptive()

Description

Finds positions for substitution of characters in Distribution column

Usage

nearest(x, to = seq(0, 1, length.out = 30))

Arguments

x

A numeric value between 0-1

to

Range of reference values

Value

The nearest position to the input value

Nice names

Description

Changes names of a data frame to ease work with them

Usage

nice_names(x, select = 1:ncol(x), tolower = TRUE, track = TRUE)

Arguments

x

A data.frame

select

Numeric vector with the positions (all by default) to be affected by the function

tolower

Set all names to lower case?

track

Track changes?

Value

The input data.frame x with the fixed names

Examples

d <- data.frame('Variable 1'=NA, '% Response'=NA, ' Variable     3'=NA,check.names=FALSE)
names(d)
names(nice_names(d))

Brute numeric coercion

Description

If possible, coerces values from a vector to numeric

Usage

numeros(x)

Arguments

x

A vector

Value

A numeric vector

outliers

Description

Function for detecting outliers based on the boxplot method

Usage

outliers(x, threshold = 1.5)

Arguments

x

A vector

threshold

Threshold (as multiple of the IQR) to consider an observation as outlier

Examples

outliers(iris$Petal.Length)
outliers(airquality$Ozone)

Peek

Description

Takes a peek into a data.frame returning a concise visualization about it

Usage

peek(x, n = 10, which = 1:ncol(x))

Arguments

x

A data.frame

n

Number of rows to include in output

which

Columns to include in output

Examples

peek(iris)

Gets proportion of most repeated value

Description

Returns the proportion for the most repeated value

Usage

prop_may(x, ignore.na = TRUE)

Arguments

x

A categorical variable

ignore.na

Should NA values be ignored for computing proportions?

Value

A proportion

Gets proportion of least repeated value

Description

Returns the proportion for the least repeated value

Usage

prop_min(x, ignore.na = TRUE)

Arguments

x

A categorical variable

ignore.na

Should NA values be ignored for computing proportions?

Value

A proportion

remove_empty

Description

Removes empty rows or columns from data.frames

Usage

remove_empty(x, remove_rows = TRUE, remove_cols = TRUE, track = TRUE)

Arguments

x

A data.frame

remove_rows

Remove empty rows?

remove_cols

Remove empty columns?

track

Track changes?

Examples

mydata <- data.frame(a = c(NA, NA, NA, NA, NA), b = c(1, NA, 3, 4, 5),
c=c(NA, NA, NA, NA, NA), d=c(4, NA, 5, 6, 3))
remove_empty(mydata)

Restore changes

Description

Restores original values after using a fix function

Usage

restore_changes(tracking)

Arguments

tracking

A data.frame generated by track_changes() function

Examples

mydata<-data.frame(Dates1=c("25/06/1983", "25-08/2014", "2001/11/01", "2008-10-01"),
                   Dates2=c("01/01/85", "04/04/1982", "07/12-2016", NA),
                   Numeric1=rnorm(4))
mydata <- fix_dates(mydata)
mydata
tracking <- track_changes(mydata)
mydata_r <- restore_changes(tracking)
mydata_r

Scales data between 0 and 1

Description

Escale data to 0-1

Usage

scale_01(x)

Arguments

x

A numeric variable

Value

Scaled data

Search scripts

Description

Searches for strings in R script files

Usage

search_scripts(string, path = getwd(), recursive = TRUE)

Arguments

string

Character string to search

path

Character vector with the path name

recursive

Logical. Should the search be recursive into subdirectories?

Value

A list with each element being one of the files containing the search string

Computes skewness

Description

Calculates skewness of a numeric variable

Usage

skewness(x)

Arguments

x

A numeric variable

Value

skewness value

Internal function for dates with text

Description

Function to transform text into dates

Usage

text_date(date, format = "%d/%Y %b")

Arguments

date

A date

format

Format of the date

track_changes

Description

Gets a data.frame with all the changes performed by the different fix functions

Usage

track_changes(x, subset)

Arguments

x

A data.frame

subset

Logical expression for subsetting the data.frame with the changes

Examples

mydata<-data.frame(Dates1=c("25/06/1983", "25-08/2014", "2001/11/01", "2008-10-01"),
                   Dates2=c("01/01/85", "04/04/1982", "07/12-2016", NA),
                   Numeric1=rnorm(4))
mydata <- fix_dates(mydata)
mydata
track_changes(mydata)

True TRUE

Description

Makes possible vectorized logical comparisons against NULL and NA values

Usage

ttrue(x)

Arguments

x

A logical vector

Value

A logical vector

Un-Forge

Description

Reshapes a data frame from long to wide format

Usage

unforge(data, origin, variables, prefix = origin)

Arguments

data

data.frame

origin

Character vector with variable names in data containing the values to be assigned to the different new variables

variables

Variable in data containing the variable names to be created

prefix

Vector with prefixes for the new variable names

Examples

#Data frame in wide format
df1 <- data.frame(id = 1:4, age = c(20, 30, 30, 35), score1 = c(2,2,3,4),
                  score2 = c(2,1,3,1), score3 = c(1,1,0,1))
df1
#Data frame in long format
df2 <- forge(df1, affixes= c("1", "2", "3"))
df2
#Data frame in wide format again
df3 <- unforge(df2, "score", "time", prefix="score")

Internal function to track_changes

Description

Function to track_changes

Usage

v_df_changes(x, y)

Arguments

x

Original data.frame

y

New data.frame

Explores global environment workspace

Description

Returns information regarding the different objects in global environment

Usage

workspace(table = FALSE)

Arguments

table

If TRUE a table with the frequencies of each type of object is given

Value

A list of object names by class or a table with frequencies if table = TRUE

Examples

df1 <- data.frame(x=rnorm(10), y=rnorm(10, 1, 2))
df2 <- data.frame(x=rnorm(20), y=rnorm(20, 1, 2))
workspace(table=TRUE)  #Frequency table of the different object classes
workspace()  #All objects in the global object separated by class

Applies a function over objects of a specific class

Description

Applies a function over all objects of a specific class in the global environment

Usage

workspace_sapply(object_class, action = "summary")

Arguments

object_class

Class of the objects where the function is to be applied

action

Name of the function to apply

Value

Results of the function

Examples

df1 <- data.frame(x=rnorm(10), y=rnorm(10, 1, 2))
df2 <- data.frame(x=rnorm(20), y=rnorm(20, 1, 2))
workspace_sapply("data.frame", "summary")  #Gives a summary of each data.frame

Estimate sample scores

Description

Calculates different scores to measure how much extreme are the different data points

Usage

xscores(x, type = "z")

Arguments

x

A vector

type

'z' calculates standard normal scores, 'z-out' calculates standard normal scores excluding each data point when computing the mean and the standard deviation, 't' calculates t scores, 'chisq' calculates chisquared scores, 'tukey' calculates scores based on the boxplot method, 'mad' calculates scores using median and mad instead of mean and sd.

Examples

xscores(iris$Sepal.Length, type="z-out")