This package contains R functions corresponding to useful Stata commands.
The package includes:
- panel data functions (monthly/quarterly dates, lead/lag, fillin)
- data.frame functions (tabulate, merge)
- vector functions (xtile, pctile, winsorize)
- graph functions (binscatter)

sum_up prints detailed summary statistics (corresponds to Stata summarize):
library(dplyr)   # provides tibble(), %>%, group_by()
library(statar)
N <- 100
df <- tibble(
id = 1:N,
v1 = sample(5, N, TRUE),
v2 = sample(1e6, N, TRUE)
)
sum_up(df)
df %>% sum_up(starts_with("v"), d = TRUE)
df %>% group_by(v1) %>% sum_up()

tab prints distinct rows with their count. Compared to the dplyr function count, this command adds frequency, percent, and cumulative percent.
N <- 1e2; K <- 10
df <- tibble(
id = sample(c(NA,1:5), N/K, TRUE),
v1 = sample(1:5, N/K, TRUE)
)
tab(df, id)
tab(df, id, na.rm = TRUE)
tab(df, id, v1)

join is a wrapper for dplyr merge functionalities, with a few added options.
The option check checks that there are no duplicates in the master or using datasets (as in Stata).
# merge m:1 v1
join(x, y, kind = "full", check = m~1)

The option gen specifies the name of a new variable that identifies non-matched and matched rows (as in Stata).
# merge m:1 v1, gen(_merge)
join(x, y, kind = "full", gen = "_merge")

The option update replaces missing values in the master dataset with the corresponding values from the using dataset.
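
The effect described for update can be illustrated in base R, without the package. This is only a sketch of the idea, not the package's implementation: the data frames `master` and `using`, the key `id`, and the column `v` are made up for the illustration.

```r
# Hypothetical master and using datasets sharing the key `id`
master <- data.frame(id = c(1, 2, 3), v = c(10, NA, 30))
using  <- data.frame(id = c(2, 3, 4), v = c(20, 99, 40))

# Full merge on the key; the overlapping column `v` gets a suffix per source
merged <- merge(master, using, by = "id", all = TRUE,
                suffixes = c(".master", ".using"))

# Keep the master value when present, otherwise fall back on the using value
merged$v <- ifelse(is.na(merged$v.master), merged$v.using, merged$v.master)
merged[, c("id", "v")]
```

Only missing master values are replaced: for `id` 3 the master value 30 is kept even though the using dataset holds 99.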
# pctile computes quantiles and weighted quantiles of type 2 (similar to Stata _pctile)
v <- c(NA, 1:10)
pctile(v, probs = c(0.3, 0.7), na.rm = TRUE)
# xtile creates integer variable for quantile categories (corresponds to Stata xtile)
v <- c(NA, 1:10)
xtile(v, n_quantiles = 3) # 3 groups based on terciles
xtile(v, probs = c(0.3, 0.7)) # 3 groups based on two quantiles
xtile(v, cutpoints = c(2, 3)) # 3 groups based on two cutpoints
# winsorize (default based on 5 x interquartile range)
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))

The classes "monthly" and "quarterly" print as dates and are compatible with the usual time extraction functions (e.g. month, year). Yet, they are stored as integers representing the number of elapsed periods since 1970/01/01 (in months and quarters, respectively). This is particularly handy for simple algebra:
# elapsed dates
library(lubridate)
date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992"))
datem <- as.monthly(date)
# displays as a period
datem
#> [1] "1992m04" "1992m01" "1992m03"
# behaves as an integer for numerical operations:
datem + 1
#> [1] "1992m05" "1992m02" "1992m04"
# behaves as a date for period extractions:
year(datem)
#> [1] 1992 1992 1992

tlag/tlead lag or lead a vector with respect to a number of periods, not with respect to the number of rows.
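
The difference between lagging by period and lagging by row can be sketched in base R. The helper `tlag_sketch` below is hypothetical and written only for this illustration; it is not the package's code and assumes a numeric time variable with unique values.

```r
# Hypothetical helper: value observed n periods earlier, NA if that period is absent
tlag_sketch <- function(x, n = 1, time) {
  x[match(time - n, time)]
}

year  <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)

# Lag by period: 1988 and 1990 are not observed, so only 1992 gets a value
by_period <- tlag_sketch(value, 1, time = year)   # NA NA 4.5

# Lag by row: simply shifts the vector, ignoring the gap between 1989 and 1991
by_row <- c(NA, head(value, -1))                  # NA 4.1 4.5
```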
year <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)
tlag(value, 1, time = year)
library(lubridate)
date <- mdy(c("01/04/1992", "03/15/1992", "04/03/1992"))
datem <- as.monthly(date)
value <- c(4.1, 4.5, 3.3)
tlag(value, time = datem)

In contrast to comparable functions in zoo and xts, these functions can be applied to any vector and be used within a dplyr chain:
df <- tibble(
id = c(1, 1, 1, 2, 2),
year = c(1989, 1991, 1992, 1991, 1992),
value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id) %>% mutate(value_l = tlag(value, time = year))

is.panel checks whether a dataset is a panel, i.e. the time variable is never missing and the combinations (id, time) are unique.
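
The two conditions can be written out in base R as a rough sketch. The helper `is_panel_sketch` is hypothetical, introduced here only to make the definition concrete; it takes the id columns as a data frame and the time variable as a vector.

```r
# Hypothetical helper: TRUE when time is never missing and (ids, time) rows are unique
is_panel_sketch <- function(ids, time) {
  !anyNA(time) && anyDuplicated(data.frame(ids, time = time)) == 0
}

df <- data.frame(
  id1  = c(1, 1, 1, 2, 2),
  id2  = 1:5,
  year = c(1991, 1993, NA, 1992, 1992)
)

is_panel_sketch(df["id1"], df$year)               # FALSE: a year is missing
df1 <- df[!is.na(df$year), ]
is_panel_sketch(df1["id1"], df1$year)             # FALSE: (2, 1992) appears twice
is_panel_sketch(df1[c("id1", "id2")], df1$year)   # TRUE: every (id1, id2, year) is unique
```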
df <- tibble(
id1 = c(1, 1, 1, 2, 2),
id2 = 1:5,
year = c(1991, 1993, NA, 1992, 1992),
value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id1) %>% is.panel(year)
df1 <- df %>% filter(!is.na(year))
df1 %>% is.panel(year)
df1 %>% group_by(id1) %>% is.panel(year)
df1 %>% group_by(id1, id2) %>% is.panel(year)

fill_gap transforms an unbalanced panel into a balanced panel. It corresponds to the Stata command tsfill. Missing observations are added as rows with missing values.
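
As a rough base-R illustration of the idea (not the package's implementation), one can build the full set of periods for each id and merge the data back onto it. The sketch assumes a yearly integer-valued time variable and fills only between each id's first and last observation; the data and column names are made up.

```r
# Hypothetical unbalanced panel: id 1 has no observation for 1990
df <- data.frame(
  id    = c(1, 1, 1, 2),
  year  = c(1989, 1991, 1992, 1991),
  value = c(4.1, 4.5, 3.3, 3.2)
)

# For each id, list every year between its first and last observation
grid <- do.call(rbind, lapply(split(df, df$id), function(d) {
  data.frame(id = d$id[1], year = seq(min(d$year), max(d$year)))
}))

# Merge the data onto the grid; periods that were absent get NA values
filled <- merge(grid, df, by = c("id", "year"), all.x = TRUE)
filled
```

The row for id 1 in 1990 is added with a missing `value`; id 2 has a single observation and gains no rows.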
df <- tibble(
id = c(1, 1, 1, 2),
datem = as.monthly(mdy(c("04/03/1992", "01/04/1992", "03/15/1992", "05/11/1992"))),
value = c(4.1, 4.5, 3.3, 3.2)
)
df %>% group_by(id) %>% fill_gap(datem)
df %>% group_by(id) %>% fill_gap(datem, full = TRUE)
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest")

stat_binmean() (a stat for ggplot2) returns the mean of y and x within 20 bins of x. It is a bare-bones version of the Stata command binscatter.
library(ggplot2)
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) + stat_binmean()
# change number of bins
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n = 10)
# add regression line
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) + stat_binmean() + stat_smooth(method = "lm", se = FALSE)

You can install