Working with RCDF Files in R

Introduction

The rcdf package provides tools for working securely with RCDF (encrypted Parquet) files in R. RCDF is a custom data format that combines strong encryption with metadata management for sensitive datasets. With rcdf, you can read, write, and export data stored in this format.

This vignette will walk you through the key features of the package, including how to encrypt and save your data in RCDF format, how to decrypt and read RCDF files, and how to export data to other common formats.

Installation

To use the rcdf package, you’ll need to install it first. You can install the package directly from GitHub using the devtools package:

# Install the package from GitHub
devtools::install_github("yng-me/rcdf")
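
If this step is run repeatedly, for example in a script, it can be convenient to skip the installation when the package is already present. The following is a minimal sketch, assuming the remotes package is acceptable as the installer (devtools re-exports its install_github() function). The helper install_if_missing() is hypothetical and not part of rcdf.

```r
# Hypothetical helper: install a GitHub package only when it is not yet available
install_if_missing <- function(pkg, repo) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github(repo)
  invisible(TRUE)
}

# Usage (requires network access):
# install_if_missing("rcdf", "yng-me/rcdf")
```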

Once installed, you can load the package and start working with RCDF files.

library(rcdf)

Writing data to RCDF format

The core function for writing data to the RCDF format is write_rcdf(). This function encrypts your data using AES encryption, generates encrypted metadata for version control using RSA encryption, and saves the data as encrypted Parquet files inside a zip archive. This ensures that the data is stored securely and can only be decrypted using the correct key.


Usage:

write_rcdf(data, path, pub_key, ..., metadata = list())

Example:

# Sample data (list of data frames)
data <- rcdf_list()
data$table1 <- data.frame(x = 1:10, y = letters[1:10])
data$table2 <- data.frame(a = rnorm(10), b = rnorm(10))

# Sample public RSA key (for encryption)
pub_key <- file.path(system.file("extdata", package = "rcdf"), "sample-public-key.pem")

# Write the data to an RCDF file
write_rcdf(data = data, path = "path/to/rcdf_file.rcdf", pub_key = pub_key)

In this example, write_rcdf() creates a zip archive containing the encrypted Parquet files and metadata, then saves it to path.
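
Pairing a symmetric cipher (AES) for the data with a public-key cipher (RSA) is often called hybrid encryption. The sketch below illustrates that general pattern with the openssl package. It is not the internal implementation of write_rcdf(): the key pair is generated on the fly, serialize() stands in for a Parquet file, and all variable names are assumptions made for illustration.

```r
library(openssl)

# Throwaway RSA key pair, generated only for this illustration
key <- rsa_keygen(bits = 2048)
pub <- key$pubkey

# Serialize a small data frame to raw bytes (stand-in for a Parquet file)
original <- data.frame(x = 1:3, y = letters[1:3])
payload <- serialize(original, connection = NULL)

# Encrypt the payload with a random 256-bit AES key
aes_key <- rand_bytes(32)
iv <- rand_bytes(16)
cipher <- aes_cbc_encrypt(payload, key = aes_key, iv = iv)

# Encrypt the small AES key with the RSA public key
wrapped_key <- rsa_encrypt(aes_key, pub)

# Recipient side: recover the AES key with the private key, then decrypt
recovered_key <- rsa_decrypt(wrapped_key, key)
restored <- unserialize(aes_cbc_decrypt(cipher, key = recovered_key, iv = iv))

identical(restored, original)
```

The point of the split is that RSA can only encrypt short inputs, so the bulk data goes through AES and only the small AES key is protected with the public key.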

Reading and decrypting RCDF data

To read and decrypt an RCDF file, you can use the read_rcdf() function. This function extracts the encrypted Parquet files from the RCDF archive, decrypts them using the provided decryption key, and loads the data back into R as an RCDF object.


Usage:

read_rcdf(path, decryption_key, ..., password = NULL, metadata = NULL)

Example:

# Using sample RCDF data
dir <- system.file("extdata", package = "rcdf")
rcdf_path <- file.path(dir, 'mtcars.rcdf')
private_key <- file.path(dir, 'sample-private-key.pem')

rcdf_data <- read_rcdf(path = rcdf_path, decryption_key = private_key)
rcdf_data

# Using an encrypted (password-protected) private key
rcdf_path_pw <- file.path(dir, 'mtcars-pw.rcdf')
private_key_pw <- file.path(dir, 'sample-private-key-pw.pem')
pw <- '1234'

rcdf_data_with_pw <- read_rcdf(
  path = rcdf_path_pw,
  decryption_key = private_key_pw,
  password = pw
)
rcdf_data_with_pw

In this example, read_rcdf() returns an RCDF object, which is essentially a list of decrypted Parquet files (one for each data frame in the original data) along with metadata about the file.
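
The example above stores the key password as a literal string, which is fine for the bundled sample key but not for real keys. One option is to read the password from an environment variable. This is a minimal sketch: the helper get_key_password() and the variable name RCDF_KEY_PASSWORD are hypothetical and not part of rcdf.

```r
# Hypothetical helper: fetch the private key password from an environment variable
get_key_password <- function(var = "RCDF_KEY_PASSWORD") {
  pw <- Sys.getenv(var, unset = "")
  if (!nzchar(pw)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  pw
}

# Usage, with the objects defined in the example above:
# rcdf_data_with_pw <- read_rcdf(
#   path = rcdf_path_pw,
#   decryption_key = private_key_pw,
#   password = get_key_password()
# )
```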

Exporting data to other formats

Once the data has been decrypted and read into R, you can export it to other formats using write_rcdf_as() or the write_rcdf_*() family of functions. These functions support a variety of common formats, including CSV, TSV, JSON, Parquet, Excel, Stata, SPSS, and SQLite.

Exporting data to CSV format

The write_rcdf_csv() function allows you to export data stored in an RCDF object to CSV files. This is useful when you need to share or process the data in a non-encrypted, readable format.


Usage:

write_rcdf_csv(data, path, ..., parent_dir = NULL)

Example:

write_rcdf_csv(data = rcdf_data, path = "path/to/output", row.names = FALSE)

This will save each table in the RCDF object as a separate CSV file in the specified directory.
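
To make the "one file per table" layout concrete, the following base R sketch writes a named list of data frames to separate CSV files. It is an illustration of the idea only, not the implementation of write_rcdf_csv(): the list tables stands in for a decrypted RCDF object, and the file naming scheme is an assumption.

```r
# Stand-in for a decrypted RCDF object: a named list of data frames
tables <- list(
  table1 = data.frame(x = 1:3, y = letters[1:3]),
  table2 = data.frame(a = c(0.5, 1.5), b = c(2, 4))
)

out_dir <- file.path(tempdir(), "csv-export")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# One CSV file per table, named after the list element
for (name in names(tables)) {
  write.csv(
    tables[[name]],
    file = file.path(out_dir, paste0(name, ".csv")),
    row.names = FALSE
  )
}

list.files(out_dir)
```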

Exporting data to TSV format

The write_rcdf_tsv() function is similar to the CSV export function but writes the data as tab-separated values (TSV) files.


Usage:

write_rcdf_tsv(data, path, ..., parent_dir = NULL)

Example:

write_rcdf_tsv(data = rcdf_data, path = "path/to/output", row.names = FALSE)

This function will save each data frame in the RCDF object as a separate TSV file in the target location.

Exporting data to JSON format

The write_rcdf_json() function allows you to export the decrypted RCDF data to JSON format. This is useful when working with APIs or other systems that require data in JSON.


Usage:

write_rcdf_json(data, path, ..., parent_dir = NULL)

Example:

write_rcdf_json(data = rcdf_data, path = "path/to/output", pretty = TRUE)

This will convert each data frame in the RCDF object into a separate JSON file and save them in the specified directory. The pretty = TRUE option ensures that the output JSON files are human-readable with proper indentation.
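
The effect of pretty-printing can be seen directly with the jsonlite package. This sketch assumes that the pretty argument is forwarded to a JSON writer that behaves like jsonlite::toJSON(), which this vignette does not state explicitly.

```r
library(jsonlite)

df <- data.frame(x = 1:2, y = c("a", "b"))

compact <- toJSON(df)                 # single line, no extra whitespace
pretty  <- toJSON(df, pretty = TRUE)  # indented, one field per line

cat(compact, "\n")
cat(pretty, "\n")

# Both forms parse back to the same data frame
identical(fromJSON(compact), fromJSON(pretty))
```

Pretty-printing changes only whitespace, so it costs some file size but does not affect how the data is parsed.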

Exporting data to Parquet format

The write_rcdf_parquet() function exports the decrypted data back into the Parquet format, this time without encryption. Parquet is a columnar storage format that compresses well and allows individual columns to be read without loading the whole file.


Usage:

write_rcdf_parquet(data, path, ..., parent_dir = NULL)

Example:

write_rcdf_parquet(data = rcdf_data, path = "path/to/output")

This function will write each data frame in the RCDF object into separate Parquet files, storing them in the specified directory.
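
An exported Parquet file can be read back with the arrow package, including reading only selected columns. In this sketch the file is written with arrow itself so that the example is self-contained; it stands in for a file produced by write_rcdf_parquet(), whose exact output file names are not specified here.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Stand-in for a file produced by write_rcdf_parquet()
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(df, parquet_path)

# Read the whole table back
full <- read_parquet(parquet_path)

# Read a single column only, which the columnar layout makes cheap
only_x <- read_parquet(parquet_path, col_select = "x")

names(only_x)
```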

Exporting data to Excel format

The write_rcdf_xlsx() function is used to export the decrypted RCDF data to Excel (.xlsx) format. It’s helpful when sharing data with users who prefer spreadsheet software.


Usage:

write_rcdf_xlsx(data, path, ..., parent_dir = NULL)

Example:

write_rcdf_xlsx(data = rcdf_data, path = "path/to/output.xlsx", sheetName = "Sheet1")

Exporting data to Stata format

The write_rcdf_dta() function allows you to export the data to Stata’s .dta file format. This is useful for users who need to work with the data in Stata.


Usage:

write_rcdf_dta(data, path, ..., parent_dir = NULL)

Example:

write_rcdf_dta(data = rcdf_data, path = "path/to/output")

Exporting data to SPSS format

The write_rcdf_sav() function is for exporting the decrypted RCDF data to SPSS’s .sav file format.


Usage:

write_rcdf_sav(data, path, ..., parent_dir = NULL)

Example:

write_rcdf_sav(data = rcdf_data, path = "path/to/output")
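
To check an exported Stata or SPSS file from R without opening the respective application, the haven package can read both formats. This is a minimal sketch, assuming haven is installed. The files are written here with haven itself only to keep the example self-contained; they stand in for files produced by write_rcdf_dta() and write_rcdf_sav().

```r
library(haven)

df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")

# Stand-ins for the exported files
write_dta(df, dta_path)
write_sav(df, sav_path)

# Read them back for inspection
from_stata <- read_dta(dta_path)
from_spss  <- read_sav(sav_path)

from_stata
from_spss
```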

Exporting data to SQLite database format

The write_rcdf_sqlite() function allows you to export the decrypted RCDF data to an SQLite database (with .db extension). Each data frame is saved as a table within the SQLite database.


Usage:

write_rcdf_sqlite(data, path, ..., parent_dir = NULL)

Example:

write_rcdf_sqlite(data = rcdf_data, path = "path/to/output")
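
The "one table per data frame" layout can be illustrated, and an exported database inspected, with the DBI and RSQLite packages. This sketch is not the implementation of write_rcdf_sqlite(): the list tables stands in for a decrypted RCDF object, and the database is created in a temporary file.

```r
library(DBI)

# Stand-in for a decrypted RCDF object: a named list of data frames
tables <- list(
  table1 = data.frame(x = 1:3, y = letters[1:3]),
  table2 = data.frame(a = c(0.5, 1.5), b = c(2, 4))
)

db_path <- tempfile(fileext = ".db")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Each data frame becomes one table in the database
for (name in names(tables)) {
  dbWriteTable(con, name, tables[[name]], overwrite = TRUE)
}

# Inspect the result
table_names <- dbListTables(con)
table1_back <- dbReadTable(con, "table1")

dbDisconnect(con)

table_names
```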

Exporting data to multiple formats simultaneously

The write_rcdf_as() function allows you to export decrypted RCDF data into multiple file formats simultaneously.


Usage:

write_rcdf_as(data, path, formats, ...)

Parameters: