This document outlines the design decisions guiding the development strategies of the {readepi} R package, the reasoning behind them, as well as the possible pros and cons of each decision.
Importing data from various sources into the R environment is the first step in the outbreak analysis workflow. Health data are often stored in individual files of different formats or in relational database management systems (RDBMS). More importantly, many health organizations store their data in health information systems (HIS) that are exposed through specific Application Programming Interfaces (APIs).
Many R packages have been developed over the years to read data stored in a file or in a directory containing multiple files. We recommend the {rio} package for importing data that are relatively small in size and the {data.table} package for large files. For retrieving data from RDBMS, we recommend the {DBI} package.
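As a minimal sketch of the file-based case, the snippet below writes a small, made-up CSV file to a temporary location and reads it back with both recommended packages. It assumes {rio} and {data.table} are installed; the data frame and column names are invented for illustration.

```r
# Requires the {rio} and {data.table} packages to be installed.
# The data below are made up for this example.
cases <- data.frame(
  id = 1:3,
  date_onset = c("2024-01-03", "2024-01-05", "2024-01-09"),
  outcome = c("recovered", "died", "recovered")
)
csv_path <- tempfile(fileext = ".csv")
utils::write.csv(cases, csv_path, row.names = FALSE)

# {rio} picks the reader from the file extension
small <- rio::import(csv_path)

# {data.table} provides a fast reader suited to large delimited files
large <- data.table::fread(csv_path)

# Remove the temporary file; the imported objects stay in memory
unlink(csv_path)
```

Both calls return tabular objects with the same three rows; the choice between them mostly matters once files become large.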
There are several R packages for reading data from HIS, such as {fingertipsR}, {REDCapR}, {godataR}, and {globaldothealth}, which fetch data from Fingertips, REDCap, Go.Data, and Global.Health respectively. However, each of these packages is designed to read from one specific HIS and can’t be used to query others. Users who work with several HIS therefore depend on many packages and lack a unified framework for importing data. As such, we propose {readepi}, a centralized tool that gives users the capability of importing data from various HIS and RDBMS.
{readepi} aims at importing data from several potential sources in the same way. The data sources include distributed health information systems and public databases, as shown in the figure below.
The {readepi} package is designed to import data from two common sources of institutional health-related data: HIS wrapped with specific APIs and RDBMS that run on specific servers.
To import data from these sources, users must have read access and provide the relevant query parameters to fetch the target data. The current version of {readepi} supports importing data from:
- HISs: DHIS2 and SORMAS,
- RDBMS: MS SQL, MySQL, PostgreSQL, and SQLite.
In future releases, we plan to include features for reading data from additional HISs like GoData, Globaldothealth, and ODK, as well as RDBMS such as MS Access.
The main functions of the {readepi} package return a data frame object that contains the data fetched from the target source with the specified request parameters. The login() function returns a connection object that is used in the subsequent queries.
The aim of {readepi} is to simplify and standardize the process of fetching data from APIs and servers. We strive to make this easy for users by limiting the number of arguments required to access and retrieve the data of interest from the target source. As such, the package is structured around a few main functions, read_dhis2(), read_sormas(), and read_rdbms(), and one auxiliary function, login().
The login() function is used to establish a connection with the data source. It verifies the user’s identity and determines whether they are authorized to access the requested database or API. Establishing this connection is crucial for ensuring successful data import. However, basic authentication does not work for SORMAS. To maintain the design of the package across all HIS, the login() function returns a list object when importing data from SORMAS.
Once authentication credentials are provided, they are securely stored within the connection object. This removes the need to re-supply them for subsequent requests in other functions. The figure below lists the arguments needed to call the login() function.
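The idea of supplying connection details once and reusing the resulting object can be sketched with {pool} and {DBI} directly. This is not the implementation of login(); it only illustrates the pattern on a temporary SQLite file, and the table and column names are made up. It assumes {DBI}, {pool}, and {RSQLite} are installed.

```r
# Requires {DBI}, {pool} and {RSQLite}. Illustration only: this is not
# how login() is implemented, and the table below is made up.
db_path <- tempfile(fileext = ".sqlite")

# Connection details are supplied once, when the pool is created
con <- pool::dbPool(drv = RSQLite::SQLite(), dbname = db_path)

# Create a small example table to query
DBI::dbWriteTable(
  con, "patients",
  data.frame(id = 1:3, age = c(25L, 41L, 37L))
)

# Later requests reuse the same object without re-supplying any details
n_patients <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM patients")
over_30 <- DBI::dbGetQuery(
  con, "SELECT id FROM patients WHERE age > 30 ORDER BY id"
)

# Release the connections and remove the temporary database
pool::poolClose(con)
unlink(db_path)
```

A temporary file is used rather than an in-memory database because a pool may open more than one connection, and each in-memory SQLite connection would see its own empty database.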
The type argument refers to the name of the data source of interest. The current version of the package covers the following types:
- RDBMS: “ms sql”, “mysql”, “postgresql”, “sqlite”
- APIs: “dhis2”, “sormas”
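To make the role of the type value concrete, here is a small sketch of how a type string could be checked against the supported values and mapped to a source family. The names supported_types and source_family() are hypothetical helpers written for this example; they are not part of {readepi}.

```r
# Hypothetical helpers for illustration; not part of {readepi}.
supported_types <- list(
  rdbms = c("ms sql", "mysql", "postgresql", "sqlite"),
  api = c("dhis2", "sormas")
)

source_family <- function(type) {
  stopifnot(is.character(type), length(type) == 1L)
  # Normalise case and surrounding whitespace before comparing
  type <- tolower(trimws(type))
  if (type %in% supported_types$rdbms) {
    return("rdbms")
  }
  if (type %in% supported_types$api) {
    return("api")
  }
  stop("Unsupported data source type: ", type, call. = FALSE)
}

source_family("MySQL")
source_family("dhis2")
```

An unsupported value such as "odk" raises an error instead of silently falling through, which keeps a mistyped source name from reaching the connection step.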
You can use one of the functions below, depending on the data source.
- read_rdbms(): for importing data from RDBMS. Its arguments include the Pool object obtained from the login() function.
- read_dhis2(): for importing data from DHIS2. Its arguments include the httr2_response object returned by the login() function.
- read_sormas(): for importing data from SORMAS. Its arguments include the list object returned by the login() function. See also the sormas_get_diseases() function.
Note that, when reading from RDBMS, the query argument can be an SQL query or a list with a vector of table names, fields, and rows to subset on. For HIS, we strongly recommend reading the vignette on the query_parameters for more details about the request parameters that are supported in the current version of the package.
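The two forms of an RDBMS query mentioned above are equivalent in intent: a list of table, fields, and row filter can be translated into a SELECT statement. The sketch below shows one way such a translation could look. The function build_select_query() and the list element names table, fields, and filter are assumptions made for this example, not the package's actual interface.

```r
# Hypothetical helper for illustration; the element names `table`,
# `fields` and `filter` are assumptions, not the {readepi} interface.
build_select_query <- function(query) {
  stopifnot(
    is.list(query),
    is.character(query$table),
    length(query$table) == 1L
  )
  # Select every column when no fields are given
  fields <- if (is.null(query$fields)) {
    "*"
  } else {
    paste(query$fields, collapse = ", ")
  }
  sql <- sprintf("SELECT %s FROM %s", fields, query$table)
  # Append the row filter only when one is supplied
  if (!is.null(query$filter)) {
    sql <- paste(sql, "WHERE", query$filter)
  }
  sql
}

build_select_query(
  list(table = "patients", fields = c("id", "sex"), filter = "age > 30")
)
# returns "SELECT id, sex FROM patients WHERE age > 30"
```

This sketch pastes identifiers and the filter into the statement as-is, so it is only suitable for trusted input; real code should quote identifiers and parameterise values.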
The read_rdbms() function depends on the following packages:
These functions also require system dependencies on macOS and Linux systems, detailed in the install drivers vignette.
Additionally, the development of the package necessitates the inclusion of other required packages:
- {checkmate}
- {httptest2}
- {bookdown}
- {rmarkdown}
- {testthat} (>= 3.0.0)
- {knitr}
- {cli}
- {DiagrammeR}
- {cyclocomp}
There are no special requirements for contributing to {readepi}. Please follow the package contributing guide.