The rcdf package is a toolkit for securely working with RCDF (Encrypted Parquet) files in R. RCDF is a custom data format designed to provide strong encryption and metadata management for sensitive datasets. With rcdf, users can easily handle encrypted data, including reading, writing, and exporting data stored in this secure format.
This vignette will walk you through the key features of the package, including how to encrypt and save your data in RCDF format, how to decrypt and read RCDF files, and how to export data to other common formats.
To use the rcdf package, you’ll need to install it first. You can install the package directly from GitHub with devtools::install_github().
Once installed, you can load the package and start working with RCDF files.
The core function for writing data to the RCDF format is write_rcdf(). This function encrypts your data using AES encryption, generates encrypted metadata for version control using RSA encryption, and saves the data as encrypted Parquet files inside a zip archive. This ensures that the data is stored securely and can only be decrypted using the correct key.
Usage:
Parameters:
data: A list of data frames or tables to be written to RCDF format. Each element of the list represents a record.
path: The path where the RCDF file will be written. The file will be saved with a .rcdf extension if not already specified.
pub_key: The public RSA key used to encrypt the AES encryption keys.
...: Additional arguments passed to helper functions if needed.
metadata: A list of metadata to be included in the RCDF file. Can contain system information or other relevant details.
# Sample data (list of data frames)
data <- rcdf_list()
data$table1 <- data.frame(x = 1:10, y = letters[1:10])
data$table2 <- data.frame(a = rnorm(10), b = rnorm(10))
# Sample public RSA key (for encryption)
pub_key <- file.path(system.file("extdata", package = "rcdf"), "sample-public-key.pem")
# Write the data to an RCDF file
write_rcdf(data = data, path = "path/to/rcdf_file.rcdf", pub_key = pub_key)
In this example:
data is a list containing two data frames. These will be encrypted and saved as separate Parquet files within the RCDF.
pub_key is the RSA public key used to encrypt the AES keys. The AES keys are used for encrypting the data in a fast and secure manner.
The write_rcdf() function will create a zip archive containing the encrypted Parquet files and metadata, then save it to path.
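As a quick check of the write step, the following sketch writes a small table to a temporary file and reads it back. It assumes that the sample public and private keys shipped in the package's extdata folder form a matching pair; the file name round-trip.rcdf is only illustrative.

```r
library(rcdf)

# Sample keys shipped with the package (assumed to be a matching pair)
key_dir <- system.file("extdata", package = "rcdf")
pub_key <- file.path(key_dir, "sample-public-key.pem")
private_key <- file.path(key_dir, "sample-private-key.pem")

# Build a small RCDF list with a single table
data <- rcdf_list()
data$table1 <- data.frame(x = 1:10, y = letters[1:10])

# Write to a temporary location, then decrypt and read it back
rcdf_file <- file.path(tempdir(), "round-trip.rcdf")
write_rcdf(data = data, path = rcdf_file, pub_key = pub_key)
restored <- read_rcdf(path = rcdf_file, decryption_key = private_key)
restored
```

Writing to tempdir() keeps the example from touching your working directory.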
To read and decrypt an RCDF file, you can use the read_rcdf() function. This function extracts the encrypted Parquet files from the RCDF archive, decrypts them using the provided decryption key, and loads the data back into R as an RCDF object.
Usage:
Parameters:
path: A string specifying the path to the RCDF archive (zip file).
decryption_key: The key used to decrypt the RCDF contents. This can be an RSA or AES key, depending on how the RCDF was encrypted.
...: Additional parameters passed to other functions, if needed.
password: A password used for RSA decryption (optional).
metadata: An optional metadata object containing data dictionaries and value sets. This metadata is applied to the data if provided.
# Using sample RCDF data
dir <- system.file("extdata", package = "rcdf")
rcdf_path <- file.path(dir, 'mtcars.rcdf')
private_key <- file.path(dir, 'sample-private-key.pem')
rcdf_data <- read_rcdf(path = rcdf_path, decryption_key = private_key)
rcdf_data
# Using encrypted/password protected private key
rcdf_path_pw <- file.path(dir, 'mtcars-pw.rcdf')
private_key_pw <- file.path(dir, 'sample-private-key-pw.pem')
pw <- '1234'
rcdf_data_with_pw <- read_rcdf(
  path = rcdf_path_pw,
  decryption_key = private_key_pw,
  password = pw
)
rcdf_data_with_pw
In this example:
path is the path to the RCDF file that contains the encrypted data.
decryption_key is the key used to decrypt the AES keys and Parquet files. If the RCDF was encrypted using RSA, you’ll need the private RSA key to decrypt it.
The read_rcdf() function returns an RCDF object, which is essentially a list of decrypted Parquet files (one for each data frame in the original data) along with metadata about the file.
Once the data has been decrypted and read into R, you can export it to other formats using write_rcdf_as() or the write_rcdf_*() family of functions. These functions support a wide variety of common formats, including CSV, TSV, JSON, Parquet, Excel, Stata, SPSS, and SQLite.
The write_rcdf_csv() function allows you to export data stored in an RCDF object to CSV files. This is useful when you need to share or process the data in a non-encrypted, readable format.
Usage:
Parameters:
data: The RCDF object that contains the decrypted data. This is the data you obtained from calling read_rcdf() or other decryption methods.
path: The target directory or file where the CSV files will be saved.
...: Additional arguments passed to the write.csv() function for customizing the CSV export (e.g., setting delimiters, row names, etc.).
parent_dir: An optional parent directory to be included in the path where the files will be written.
This will save each table in the RCDF object as a separate CSV file in the specified directory.
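A minimal sketch of a CSV export, using the sample RCDF file and private key shipped with the package. The output directory name rcdf-csv is an arbitrary choice for illustration, and the exact file names produced are not assumed here.

```r
library(rcdf)

# Read the sample RCDF file shipped with the package
ext_dir <- system.file("extdata", package = "rcdf")
rcdf_data <- read_rcdf(
  path = file.path(ext_dir, "mtcars.rcdf"),
  decryption_key = file.path(ext_dir, "sample-private-key.pem")
)

# Export every table to CSV inside a temporary directory
out_dir <- file.path(tempdir(), "rcdf-csv")
dir.create(out_dir, showWarnings = FALSE)
write_rcdf_csv(data = rcdf_data, path = out_dir)

# Inspect what was written
list.files(out_dir, recursive = TRUE)
```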
The write_rcdf_tsv() function is similar to the CSV export function but writes the data as tab-separated values (TSV) files.
Usage:
Parameters:
data: The decrypted RCDF object containing the data to export.
path: The target directory or file for the output TSV files.
...: Additional arguments for customizing the TSV export passed to the write.table() function (e.g., setting delimiters, handling row names).
parent_dir: An optional parent directory to be included in the path where the files will be written.
This function will save each data frame in the RCDF object as a separate TSV file in the target location.
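The TSV export can be sketched the same way. This again relies on the sample file and key bundled with the package; the directory name rcdf-tsv is illustrative only.

```r
library(rcdf)

# Read the sample RCDF file shipped with the package
ext_dir <- system.file("extdata", package = "rcdf")
rcdf_data <- read_rcdf(
  path = file.path(ext_dir, "mtcars.rcdf"),
  decryption_key = file.path(ext_dir, "sample-private-key.pem")
)

# Export every table as a tab-separated file
out_dir <- file.path(tempdir(), "rcdf-tsv")
dir.create(out_dir, showWarnings = FALSE)
write_rcdf_tsv(data = rcdf_data, path = out_dir)

list.files(out_dir, recursive = TRUE)
```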
The write_rcdf_json() function allows you to export the decrypted RCDF data to JSON format. This is useful when working with APIs or other systems that require data in JSON.
Usage:
Parameters:
data: The decrypted RCDF object.
path: The target directory or file for saving the JSON files.
...: Additional arguments to customize the JSON export passed to jsonlite::toJSON() (such as specifying indentation or compactness of the JSON output).
parent_dir: An optional parent directory to be included in the path where the files will be written.
This will convert each data frame in the RCDF object into a separate JSON file and save them in the specified directory. The pretty = TRUE option ensures that the output JSON files are human-readable with proper indentation.
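A sketch of a JSON export with the pretty = TRUE option mentioned above, forwarded through ... to jsonlite::toJSON(). The sample data and key come from the package; the output directory name rcdf-json is an assumption made for the example.

```r
library(rcdf)

# Read the sample RCDF file shipped with the package
ext_dir <- system.file("extdata", package = "rcdf")
rcdf_data <- read_rcdf(
  path = file.path(ext_dir, "mtcars.rcdf"),
  decryption_key = file.path(ext_dir, "sample-private-key.pem")
)

# Export every table as indented, human-readable JSON
out_dir <- file.path(tempdir(), "rcdf-json")
dir.create(out_dir, showWarnings = FALSE)
write_rcdf_json(data = rcdf_data, path = out_dir, pretty = TRUE)

list.files(out_dir, recursive = TRUE)
```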
The write_rcdf_parquet() function exports the decrypted data back into the Parquet format. Parquet is a columnar storage format that is highly efficient for big data processing.
Usage:
Parameters:
data: The decrypted RCDF object.
path: The directory or file path where the Parquet files will be saved.
...: Additional arguments passed to the write_parquet() function for customization, such as specifying compression type.
parent_dir: An optional parent directory to be included in the path where the files will be written.
This function will write each data frame in the RCDF object into separate Parquet files, storing them in the specified directory.
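A minimal sketch of a plain (unencrypted) Parquet export, using the bundled sample file and key. No extra writer arguments are passed, and the directory name rcdf-parquet is illustrative.

```r
library(rcdf)

# Read the sample RCDF file shipped with the package
ext_dir <- system.file("extdata", package = "rcdf")
rcdf_data <- read_rcdf(
  path = file.path(ext_dir, "mtcars.rcdf"),
  decryption_key = file.path(ext_dir, "sample-private-key.pem")
)

# Export every table as an unencrypted Parquet file
out_dir <- file.path(tempdir(), "rcdf-parquet")
dir.create(out_dir, showWarnings = FALSE)
write_rcdf_parquet(data = rcdf_data, path = out_dir)

list.files(out_dir, recursive = TRUE)
```

Note that the exported files are no longer protected by RCDF encryption, so treat the output directory accordingly.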
The write_rcdf_xlsx() function is used to export the decrypted RCDF data to Excel (.xlsx) format. It’s helpful when sharing data with users who prefer spreadsheet software.
Usage:
Parameters:
data: The decrypted RCDF object.
path: The directory or file path where the Excel file will be saved.
...: Additional arguments to customize the Excel file export in the openxlsx package.
parent_dir: An optional parent directory to be included in the path where the files will be written.
The write_rcdf_dta() function allows you to export the data to Stata’s .dta file format. This is useful for users who need to work with the data in Stata.
Usage:
Parameters:
data: The decrypted RCDF object.
path: The path where the Stata .dta file will be saved.
...: Additional arguments passed to the write.dta() function (e.g., specifying version of Stata).
parent_dir: An optional parent directory to be included in the path where the files will be written.
The write_rcdf_sav() function is for exporting the decrypted RCDF data to SPSS’s .sav file format.
Usage:
Parameters:
data: The decrypted RCDF object.
path: The path where the .sav file will be saved.
...: Additional arguments for customizing the SPSS file export.
parent_dir: An optional parent directory to be included in the path where the files will be written.
The write_rcdf_sqlite() function allows you to export the decrypted RCDF data to an SQLite database (with .db extension). Each data frame is saved as a table within the SQLite database.
Usage:
Parameters:
data: The decrypted RCDF object.
path: The path where the SQLite database file will be created.
...: Additional arguments for customizing the SQLite export.
parent_dir: An optional parent directory to be included in the path where the files will be written.
The write_rcdf_as() function allows you to export decrypted RCDF data into multiple file formats simultaneously.
Usage:
Parameters:
data: A named list or RCDF object. Each element should be a table or tibble-like object (typically a dbplyr or dplyr table).
path: The target directory where output files should be saved.
formats: A character vector of file formats to export to. Supported formats include: "csv", "tsv", "json", "parquet", "xlsx", "dta", "sav", and "sqlite".
...: Additional arguments passed to the respective writer functions.
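A sketch of exporting to two formats in one call with the formats argument described above. It uses the sample file and key from the package; the directory name rcdf-export and the choice of "csv" and "json" are illustrative, and how the output is laid out within the directory is not assumed here.

```r
library(rcdf)

# Read the sample RCDF file shipped with the package
ext_dir <- system.file("extdata", package = "rcdf")
rcdf_data <- read_rcdf(
  path = file.path(ext_dir, "mtcars.rcdf"),
  decryption_key = file.path(ext_dir, "sample-private-key.pem")
)

# Export every table to both CSV and JSON in a single call
out_dir <- file.path(tempdir(), "rcdf-export")
dir.create(out_dir, showWarnings = FALSE)
write_rcdf_as(data = rcdf_data, path = out_dir, formats = c("csv", "json"))

list.files(out_dir, recursive = TRUE)
```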