- hms::hms() and hms::as_hms() bindings implemented to create and manipulate time of day variables (#46206).
- atan(), sinh(), cosh(), tanh(), asinh(), acosh(), atanh(), and expm1() bindings added (#44953).
- check_directory_existence_before_creation added to S3FileSystem to reduce I/O calls on cloud storage (@HaochengLIU, #41998).
- case_when() now correctly detects objects that are not in the global environment (@etiennebacher, #46667).
- Binary data can now be returned as blob::blob in addition to arrow_binary when converted to R objects. This change is the first step in eventually deprecating the arrow_binary class in favor of the blob class in the blob package (See GH-45709).

This release primarily updates the underlying Arrow C++ version used by the package to version 19.0.1 and includes all changes from the 19.0.0 and 19.0.1 releases. For what’s changed in Arrow C++ 19.0.0, please see the blog post and changelog. For what’s changed in Arrow C++ 19.0.1, please see the blog post and changelog.
%in% (#43446)str_sub binding to properly handle negative
end values (@coussens, #44141)time_hours <- function(mins) mins / 60 worked, but
time_hours_rounded <- function(mins) round(mins / 60)
did not; now both work. These are automatic translations rather than
true user-defined functions (UDFs); for UDFs, see
register_scalar_function(). (#41223)mutate() expressions can now include aggregations, such
as x - mean(x). (#41350)summarize() supports more complex expressions, and
correctly handles cases where column names are reused in expressions.
(#41223)na_matches argument to the
dplyr::*_join() functions is now supported. This argument
controls whether NA values are considered equal when
joining. (#41358)pull on grouped
datasets, it now returns the expected column. (#43172)base::prod have been added so you can now
use it in your dplyr pipelines (i.e.,
tbl |> summarize(prod(col))) without having to pull the
data into R (@m-muecke, #38601).dimnames or colnames on
Dataset objects now returns a useful result rather than
just NULL (#38377).code() method on Schema objects now takes an
optional namespace argument which, when TRUE,
prefixes names with arrow:: which makes the output more
portable (@orgadish,
#38144).SystemRequirements (#39602).sub, gsub,
stringr::str_replace, stringr::str_replace_all
are passed a length > 1 vector of values in pattern
(@abfleishman,
#39219).?open_dataset
documenting how to use the ND-JSON support added in arrow 13.0.0 (@Divyansh200102,
#38258).s3_bucket, S3FileSystem), the debug log
level for S3 can be set with the AWS_S3_LOG_LEVEL
environment variable. See ?S3FileSystem for more
information. (#38267)to_duckdb()) no longer
results in warnings when quitting your R session. (#38495)LIBARROW_BINARY=true for old behavior (#39861).ARROW_R_ALLOW_CPP_VERSION_MISMATCH=true)
and requires at least Arrow C++ 13.0.0 (#39739).open_dataset(), the partition variables are now included in
the resulting dataset (#37658).write_csv_dataset() now wraps
write_dataset() and mirrors the syntax of
write_csv_arrow() (@dgreiss, #36436).open_delim_dataset() now accepts quoted_na
argument to empty strings to be parsed as NA values (#37828).schema() can now be called on data.frame
objects to retrieve their inferred Arrow schema (#37843).read_csv2_arrow() (#38002).CsvParseOptions object creation now
contains more information about default values (@angela-li, #37909).fixed(),
regex() etc.) now allow variables to be reliably used in
their arguments (#36784).ParquetReaderProperties, allowing users to
work with Parquet files with unusually large metadata (#36992).add_filename() are
improved (@amoeba,
#37372).create_package_with_all_dependencies() now properly
escapes paths on Windows (#37226).data.frame and no
other classes now have the class attribute dropped,
resulting in now always returning tibbles from file reading functions
and arrow_table(), which results in consistency in the type
of returned objects. Calling as.data.frame() on Arrow
Tabular objects now always returns a data.frame object
(#34775)open_dataset() now works with ND-JSON files
(#35055)schema() on multiple Arrow objects now returns
the object’s schema (#35543).by/by argument now supported in
arrow implementation of dplyr verbs (@eitsupi, #35667)dplyr::case_when() now accepts
.default parameter to match the update in dplyr 1.1.0
(#35502)arrow_array() can be used to
create Arrow Arrays (#36381)scalar() can be used to create
Arrow Scalars (#36265)RecordBatchReader::ReadNext() from DuckDB from the
main R thread (#36307)set_io_thread_count() with
num_threads < 2 (#36304)strptime() in arrow will return a timezone-aware
timestamp if %z is part of the format string (#35671)group_by() and
across() now matches dplyr (@eitsupi, #35473)read_parquet() and read_feather()
functions can now accept URL arguments (#33287, #34708).json_credentials argument in
GcsFileSystem$create() now accepts a file path containing
the appropriate authentication token (@amoeba, #34421, #34524).$options member of GcsFileSystem
objects can now be inspected (@amoeba, #34422, #34477).read_csv_arrow() and read_json_arrow()
functions now accept literal text input wrapped in I() to
improve compatibility with readr::read_csv() (@eitsupi, #18487,
#33968).$ and
[[ in dplyr expressions (#18818, #19706).FetchNode and
OrderByNode to improve performance and simplify building
query plans from dplyr expressions (#34437, #34685).arrow_table() (#35038,
#35039).data.frame with NULL column names to a
Table (#15247, #34798).open_csv_dataset() family of functions (#33998,
#34710).dplyr::n() function is now mapped to the
count_all kernel to improve performance and simplify the R
implementation (#33892, #33917).s3_bucket()
filesystem helper with endpoint_override and fixed
surprising behaviour that occurred when passing some combinations of
arguments (@cboettig, #33904, #34009).schema is supplied and
col_names = TRUE in open_csv_dataset()
(#34217, #34092).open_csv_dataset() allows a schema to be specified.
(#34217)dplyr:::check_names() (#34369)map_batches() is lazy by default; it now returns a
RecordBatchReader instead of a list of
RecordBatch objects unless lazy = FALSE.
(#14521)open_csv_dataset(),
open_tsv_dataset(), and open_delim_dataset()
all wrap open_dataset(); they don’t provide new
functionality, but allow for readr-style options to be supplied, making
it simpler to switch between individual file-reading and dataset
functionality. (#33614)col_names parameter allows specification of
column names when opening a CSV dataset. (@wjones127, #14705)parse_options, read_options, and
convert_options parameters for reading individual files
(read_*_arrow() functions) and datasets
(open_dataset() and the new open_*_dataset()
functions) can be passed in as lists. (#15270)read_csv_arrow(). (#14930)join_by() has been
implemented for dplyr joins on Arrow objects (equality conditions only).
(#33664)dplyr::group_by()/dplyr::summarise() calls are
used. (#14905)dplyr::summarize() works with division when divisor is
a variable. (#14933)dplyr::right_join() correctly coalesces keys.
(#15077)lubridate::with_tz() and
lubridate::force_tz() (@eitsupi, #14093)stringr::str_remove() and
stringr::str_remove_all() (#14644)POSIXlt objects.
(#15277)Array$create() can create Decimal arrays. (#15211)StructArray$create() can be used to create StructArray
objects. (#14922)lubridate::as_datetime() on Arrow objects can
handle time in sub-seconds. (@eitsupi, #13890)head() can be called after
as_record_batch_reader(). (#14518)as.Date() can go from timestamp[us] to
timestamp[s]. (#14935)check_dots_empty(). (@daattali, #14744)
Minor improvements and fixes:
.data pronoun in
dplyr::group_by() (#14484)Several new functions can be used in queries:
dplyr::across() can be used to apply the same
computation across multiple columns, and the where()
selection helper is supported in across();add_filename() can be used to get the filename a row
came from (only available when querying ?Dataset);slice_* family:
dplyr::slice_min(), dplyr::slice_max(),
dplyr::slice_head(), dplyr::slice_tail(), and
dplyr::slice_sample().The package now has documentation that lists all dplyr
methods and R function mappings that are supported on Arrow data, along
with notes about any differences in functionality between queries
evaluated in R versus in Acero, the Arrow query engine. See
?acero.
A few new features and bugfixes were implemented for joins:
keep argument is now supported, allowing separate
columns for the left and right hand side join keys in join output. Full
joins now coalesce the join keys (when keep = FALSE),
avoiding the issue where the join keys would be all NA for
rows in the right hand side without any matches on the left.
Some changes to improve the consistency of the API:
dplyr::pull() will return
a ?ChunkedArray instead of an R vector by default. The
current default behavior is deprecated. To update to the new behavior
now, specify pull(as_vector = FALSE) or set
options(arrow.pull_as_vector = FALSE) globally.dplyr::compute() on a query that is grouped
returns a ?Table instead of a query object.Finally, long-running queries can now be cancelled and will abort their computation immediately.
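As a minimal sketch of the new pull() behavior described above (the table and column here are illustrative):

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(x = c(1, 2, 3))

# Opt in to the new behavior now: return a ChunkedArray rather than an R vector
tbl |> pull(x, as_vector = FALSE)

# Or set the default globally, as noted above
options(arrow.pull_as_vector = FALSE)
```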
as_arrow_array() can now take blob::blob
and ?vctrs::list_of, which convert to binary and list
arrays, respectively. Also fixed an issue where
as_arrow_array() ignored type argument when passed a
StructArray.
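A brief sketch of these conversions, assuming the blob and vctrs packages are installed:

```r
library(arrow)

# blob vectors convert to binary arrays
as_arrow_array(blob::blob(as.raw(c(0x01, 0x02))))

# vctrs list_of vectors convert to list arrays
as_arrow_array(vctrs::list_of(1:3, 4:5))
```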
The unique() function works on ?Table,
?RecordBatch, ?Dataset, and
?RecordBatchReader.
write_feather() can take
compression = FALSE to choose writing uncompressed
files.
Also, a breaking change for IPC files in
write_dataset(): passing "ipc" or
"feather" to format will now write files with
.arrow extension instead of .ipc or
.feather.
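A short sketch of both writer changes (file paths are illustrative):

```r
library(arrow)

# Write an uncompressed Feather (IPC) file
write_feather(mtcars, "mtcars.arrow", compression = FALSE)

# Datasets written as "feather"/"ipc" now use the .arrow file extension
write_dataset(mtcars, "mtcars_ds", format = "feather")
list.files("mtcars_ds")  # e.g. "part-0.arrow"
```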
As of version 10.0.0, arrow requires C++17 to build.
This means that:
R >= 4.0. Version 9.0.0 was the
last version to support R 3.6.arrow,
but you first need to install a newer compiler than the default system
compiler, gcc 4.8. See
vignette("install", package = "arrow") for guidance. Note
that you only need the newer compiler to build arrow:
installing a binary package, as from RStudio Package Manager, or loading
a package you’ve already installed works fine with the system
defaults.dplyr::union and dplyr::union_all
(#13090)dplyr::glimpse (#13563)show_exec_plan() can be added to the end of a dplyr
pipeline to show the underlying plan, similar to
dplyr::show_query(). dplyr::show_query() and
dplyr::explain() also work and show the same output, but
may change in the future. (#13541)register_scalar_function() to create them. (#13397)map_batches() returns a RecordBatchReader
and requires that the function it maps returns something coercible to a
RecordBatch through the as_record_batch() S3
function. It can also run in streaming fashion if passed
.lazy = TRUE. (#13170, #13650)stringr::, lubridate::) within queries.
For example, stringr::str_length will now dispatch to the
same kernel as str_length. (#13160)lubridate::parse_date_time() datetime parser: (#12589,
#13196, #13506)
orders with year, month, day, hours, minutes, and
seconds components are supported.orders argument in the Arrow binding works as
follows: orders are transformed into formats
which subsequently get applied in turn. There is no
select_formats parameter and no inference takes place (like
is the case in lubridate::parse_date_time()).lubridate date and datetime parsers such as
lubridate::ymd(), lubridate::yq(), and
lubridate::ymd_hms() (#13118, #13163, #13627)lubridate::fast_strptime() (#13174)lubridate::floor_date(),
lubridate::ceiling_date(), and
lubridate::round_date() (#12154)strptime() supports the tz argument to
pass timezones. (#13190)lubridate::qday() (day of quarter)exp() and sqrt(). (#13517)read_ipc_file() and
write_ipc_file() are added. These functions are almost the
same as read_feather() and write_feather(),
but differ in that they only target IPC files (Feather V2 files), not
Feather V1 files.read_arrow() and write_arrow(), deprecated
since 1.0.0 (July 2020), have been removed. Instead of these, use the
read_ipc_file() and write_ipc_file() for IPC
files, or, read_ipc_stream() and
write_ipc_stream() for IPC streams. (#13550)write_parquet() now defaults to writing Parquet format
version 2.4 (was 1.0). Previously deprecated arguments
properties and arrow_properties have been
removed; if you need to deal with these lower-level properties objects
directly, use ParquetFileWriter, which
write_parquet() wraps. (#13555)write_dataset() preserves all schema metadata again. In
8.0.0, it would drop most metadata, breaking packages such as sfarrow.
(#13105)write_csv_arrow()) will automatically (de-)compress data if
the file path contains a compression extension
(e.g. "data.csv.gz"). This works locally as well as on
remote filesystems like S3 and GCS. (#13183)FileSystemFactoryOptions can be provided to
open_dataset(), allowing you to pass options such as which
file prefixes to ignore. (#13171)S3FileSystem will not create or delete
buckets. To enable that, pass the configuration option
allow_bucket_creation or
allow_bucket_deletion. (#13206)GcsFileSystem and gs_bucket() allow
connecting to Google Cloud Storage. (#10999, #13601)$num_rows() method returns a
double (previously integer), avoiding integer overflow on larger tables.
(#13482, #13514)arrow.dev_repo for nightly builds of the R package
and prebuilt libarrow binaries is now https://nightlies.apache.org/arrow/r/.open_dataset():
skip argument for skipping
header rows in CSV datasets.UnionDataset.{dplyr} queries:
RecordBatchReader. This allows, for
example, results from DuckDB to be streamed back into Arrow rather than
materialized before continuing the pipeline.dplyr::rename_with().dplyr::count() returns an ungrouped dataframe.write_dataset() has more options for controlling row
group and file sizes when writing partitioned datasets, such as
max_open_files, max_rows_per_file,
min_rows_per_group, and
max_rows_per_group.write_csv_arrow() accepts a Dataset or an
Arrow dplyr query.option(use_threads = FALSE) no longer crashes R. That
option is set by default on Windows.dplyr joins support the suffix argument to
handle overlap in column names.is.na() no longer
misses any rows.map_batches() correctly accepts Dataset
objects.read_csv_arrow()’s readr-style type T is
mapped to timestamp(unit = "ns") instead of
timestamp(unit = "s").{lubridate}
features and fixes:
lubridate::tz() (timezone),lubridate::semester(),lubridate::dst() (daylight savings time boolean),lubridate::date(),lubridate::epiyear() (year according to epidemiological
week calendar),lubridate::month() works with integer inputs.lubridate::make_date() &
lubridate::make_datetime() +
base::ISOdatetime() & base::ISOdate() to
create date-times from numeric representations.lubridate::decimal_date() and
lubridate::date_decimal()lubridate::make_difftime() (duration constructor)?lubridate::duration helper functions, such as
lubridate::dyears(), lubridate::dhours(),
lubridate::dseconds().lubridate::leap_year()lubridate::as_date() and
lubridate::as_datetime()base::difftime and
base::as.difftime()base::as.Date() to convert to datebase::format()strptime() returns NA instead of erroring
in case of format mismatch, just like
base::strptime().as_arrow_array() and as_arrow_table() for main
Arrow objects. This includes, Arrow tables, record batches, arrays,
chunked arrays, record batch readers, schemas, and data types. This
allows other packages to define custom conversions from their types to
Arrow objects, including extension arrays.?new_extension_type.vctrs::vec_is() returns TRUE (i.e.,
any object that can be used as a column in a
tibble::tibble()), provided that the underlying
vctrs::vec_data() can be converted to an Arrow Array.
Arrow arrays and tables can be easily concatenated:
- Arrays can be combined with concat_arrays() or, if zero-copy is desired and chunking is acceptable, using ChunkedArray$create().
- ChunkedArrays can be combined with c().
- RecordBatches and Tables support cbind().
- Tables support rbind(). concat_tables() is also provided to concatenate tables while unifying schemas.
sqrt(), log(), and
exp() with Arrow arrays and scalars.read_* and write_* functions support R
Connection objects for reading and writing files.median() and quantile() will warn only
once about approximate calculations regardless of interactivity.Array$cast() can cast StructArrays into another struct
type with the same field names and structure (or a subset of fields) but
different field types.set_io_thread_count() would set
the CPU count instead of the IO thread count.RandomAccessFile has a $ReadMetadata()
method that provides useful metadata provided by the filesystem.grepl binding returns FALSE for
NA inputs (previously it returned NA), to
match the behavior of base::grepl().create_package_with_all_dependencies() works on Windows
and Mac OS, instead of only Linux.{lubridate} features: week(),
more of the is.*() functions, and the label argument to
month() have been implemented.summarize(), such as
ifelse(n() > 1, mean(y), mean(z)), are supported.tibble and data.frame to create columns of
tibbles or data.frames respectively
(e.g. ... %>% mutate(df_col = tibble(a, b)) %>% ...).factor type) are supported inside
of coalesce().open_dataset() accepts the partitioning
argument when reading Hive-style partitioned files, even though it is
not required.map_batches() function for custom
operations on dataset has been restored.encoding argument when
reading).open_dataset() correctly ignores byte-order marks
(BOMs) in CSVs, as was already true for reading single
files.head() no longer hangs on large CSV datasets.write_csv_arrow() now follows the signature of
readr::write_csv().$code() method on a
schema or type. This allows you to easily get
the code needed to create a schema from an object that already has
one.Duration type has been mapped to R’s
difftime class.decimal256() type is supported. The
decimal() function has been revised to call either
decimal256() or decimal128() based on the
value of the precision argument.write_parquet() uses a reasonable guess at
chunk_size instead of always writing a single chunk. This
improves the speed of reading and writing large Parquet files.write_parquet() no longer drops attributes for grouped
data.frames.proxy_options.pkg-config to search
for system dependencies (such as libz) and link to them if
present. This new default will make building Arrow from source quicker
on systems that have these dependencies installed already. To retain the
previous behavior of downloading and building all dependencies, set
ARROW_DEPENDENCY_SOURCE=BUNDLED.glue, which
arrow depends on transitively, has dropped support for
it.str_count() in dplyr queriesThere are now two ways to query Arrow data:
dplyr::summarize(), both grouped and ungrouped, is now
implemented for Arrow Datasets, Tables, and RecordBatches. Because data
is scanned in chunks, you can aggregate over larger-than-memory datasets
backed by many files. Supported aggregation functions include
n(), n_distinct(), min(),
max(), sum(), mean(),
var(), sd(), any(), and
all(). median() and quantile()
with one probability are also supported and currently return approximate
results using the t-digest algorithm.
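As a rough sketch of what this enables (the dataset path and column names below are hypothetical):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/parquet-dir")  # hypothetical larger-than-memory dataset

ds %>%
  group_by(group_col) %>%                  # hypothetical column names
  summarize(
    n             = n(),
    total         = sum(value),
    avg           = mean(value),
    approx_median = median(value)          # approximate, via t-digest
  ) %>%
  collect()
```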
Along with summarize(), you can also call
count(), tally(), and distinct(),
which effectively wrap summarize().
This enhancement does change the behavior of summarize()
and collect() in some cases: see “Breaking changes” below
for details.
In addition to summarize(), mutating and filtering
equality joins (inner_join(), left_join(),
right_join(), full_join(),
semi_join(), and anti_join()) are also
supported natively in Arrow.
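For example, an equality join between two in-memory Arrow Tables might look like this (illustrative data):

```r
library(arrow)
library(dplyr)

orders    <- Table$create(id = c(1, 2, 3), total = c(10, 20, 30))
customers <- Table$create(id = c(1, 2), name = c("a", "b"))

orders %>%
  left_join(customers, by = "id") %>%
  collect()
```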
Grouped aggregation and (especially) joins should be considered somewhat experimental in this release. We expect them to work, but they may not be well optimized for all workloads. To help us focus our efforts on improving them in the next release, please let us know if you encounter unexpected behavior or poor performance.
New non-aggregating compute functions include string functions like
str_to_title() and strftime() as well as
compute functions for extracting date parts (e.g. year(),
month()) from dates. This is not a complete list of
additional compute functions; for an exhaustive list of available
compute functions see list_compute_functions().
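For instance, a quick illustrative check of which kernels are available in your build:

```r
library(arrow)

# All available compute functions
head(list_compute_functions())

# Functions whose names match a pattern
list_compute_functions(pattern = "^str")
```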
We’ve also worked to fill in support for all data types, such as
Decimal, for functions added in previous releases. All type
limitations mentioned in previous release notes should be no longer
valid, and if you find a function that is not implemented for a certain
data type, please report an
issue.
If you have the duckdb package
installed, you can hand off an Arrow Dataset or query object to DuckDB for further querying using the
to_duckdb() function. This allows you to use duckdb’s
dbplyr methods, as well as its SQL interface, to aggregate
data. Filtering and column projection done before
to_duckdb() is evaluated in Arrow, and duckdb can push down
some predicates to Arrow as well. This handoff does not copy
the data; instead it uses Arrow’s C interface (just like passing arrow
data between R and Python), so no serialization or data-copying
costs are incurred.
You can also take a duckdb tbl and call
to_arrow() to stream data to Arrow’s query engine. This
means that in a single dplyr pipeline, you could start with an Arrow
Dataset, evaluate some steps in DuckDB, then evaluate the rest in
Arrow.
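A sketch of the round trip described here (the dataset path and column names are hypothetical):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/dataset")   # hypothetical dataset

# Filter in Arrow, then hand off (zero copy) to duckdb for the aggregation
ds %>%
  filter(year == 2021) %>%
  to_duckdb() %>%
  group_by(region) %>%
  summarise(total = sum(sales)) %>%
  collect()

# Or take a duckdb tbl and stream it back into Arrow's engine
duck_tbl <- ds %>% to_duckdb() %>% mutate(share = sales / sum(sales))
back_in_arrow <- to_arrow(duck_tbl)
```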
arrange() the query result. For calls to
summarize(), you can set
options(arrow.summarise.sort = TRUE) to match the current
dplyr behavior of sorting on the grouping columns.dplyr::summarize() on an in-memory Arrow Table or
RecordBatch no longer eagerly evaluates. Call compute() or
collect() to evaluate the query.head() and tail() also no longer eagerly
evaluate, both for in-memory data and for Datasets. Also, because row
order is no longer deterministic, they will effectively give you a
random slice of data from somewhere in the dataset unless you
arrange() to specify sorting.sf::st_as_binary(col)) or using the sfarrow package
which handles some of the intricacies of this conversion process. We
have plans to improve this and re-enable custom metadata like this in
the future when we can implement the saving in a safe and efficient way.
If you need to preserve the pre-6.0.0 behavior of saving this metadata,
you can set
options(arrow.preserve_row_level_metadata = TRUE). We will
be removing this option in a coming release. We strongly recommend
avoiding using this workaround if possible since the results will not be
supported in the future and can lead to surprising and inaccurate
results. If you run into custom classes besides sf columns that are
impacted by this, please report an
issue.LIBARROW_MINIMAL=true. This will have the core
Arrow/Feather components but excludes Parquet, Datasets, compression
libraries, and other optional features.create_package_with_all_dependencies() function
(also available on GitHub without installing the arrow package) will
download all third-party C++ dependencies and bundle them inside the R
source package. Run this function on a system connected to the network
to produce the “fat” source package, then copy that .tar.gz package to
your offline machine and install. Special thanks to @karldw for the huge amount
of work on this.libz) by setting ARROW_DEPENDENCY_SOURCE=AUTO.
This is not the default in this release (BUNDLED,
i.e. download and build all dependencies) but may become the default in
the future.read_json_arrow()) are now
optional and still on by default; set ARROW_JSON=OFF before
building to disable them.options(arrow.use_altrep = FALSE)Field objects can now be created as non-nullable, and
schema() now optionally accepts a list of
Fieldswrite_parquet() no longer errors when used with a
grouped data.framecase_when() now errors cleanly if an expression is not
supported in Arrowopen_dataset() now works on CSVs without header
rowsT
and t were reversed in read_csv_arrow()log(..., base = b) where b is something
other than 2, e, or 10Table$create() now has alias
arrow_table()This patch version contains fixes for some sanitizer and compiler warnings.
There are now more than 250 compute functions available for use
in dplyr::filter(), mutate(), etc. Additions
in this release include:
strsplit() and
str_split(); strptime(); paste(),
paste0(), and str_c(); substr()
and str_sub(); str_like();
str_pad(); stri_reverse()lubridate methods such as
year(), month(), wday(), and so
onlog() et al.); trigonometry
(sin(), cos(), et al.); abs();
sign(); pmin() and pmax();
ceiling(), floor(), and
trunc()ifelse() and if_else() for all but
Decimal types; case_when() for logical,
numeric, and temporal types only; coalesce() for all but
lists/structs. Note also that in this release, factors/dictionaries are
converted to strings in these functions.is.* functions are supported and can be used inside
relocate()The print method for arrow_dplyr_query now includes
the expression and the resulting type of columns derived by
mutate().
transmute() now errors if passed arguments
.keep, .before, or .after, for
consistency with the behavior of dplyr on
data.frames.
write_csv_arrow() to use Arrow to write a data.frame to
a single CSV filewrite_dataset(format = "csv", ...) to write a Dataset
to CSVs, including with partitioningreticulate::py_to_r() and r_to_py()
methods. Along with the addition of the
Scanner$ToRecordBatchReader() method, you can now build up
a Dataset query in R and pass the resulting stream of batches to another
tool in process.Array$export_to_c(),
RecordBatch$import_from_c()), similar to how they are in
pyarrow. This facilitates their use in other packages. See
the py_to_r() and r_to_py() methods for usage
examples.data.frame to an Arrow
Table uses multithreading across columnsoptions(arrow.use_altrep = FALSE)is.na() now evaluates to TRUE on
NaN values in floating point number fields, for consistency
with base R.is.nan() now evaluates to FALSE on
NA values in floating point number fields and
FALSE on all values in non-floating point fields, for
consistency with base R.Array,
ChunkedArray, RecordBatch, and
Table: na.omit() and friends,
any()/all()RecordBatch$create() and
Table$create() are recycledarrow_info() includes details on the C++ build, such as
compiler versionmatch_arrow() now converts x into an
Array if it is not a Scalar,
Array or ChunkedArray and no longer dispatches
base::match().LIBARROW_MINIMAL=false) includes both
jemalloc and mimalloc, and it still has jemalloc as the default, though
this is configurable at runtime with the
ARROW_DEFAULT_MEMORY_POOL environment variable.LIBARROW_MINIMAL,
LIBARROW_DOWNLOAD, and NOT_CRAN are now
case-insensitive in the Linux build script.
Many more dplyr verbs are supported on Arrow
objects:
dplyr::mutate() is now supported in Arrow for many
applications. For queries on Table and
RecordBatch that are not yet supported in Arrow, the
implementation falls back to pulling data into an in-memory R
data.frame first, as in the previous release. For queries
on Dataset (which can be larger than memory), it raises an
error if the function is not implemented. The main mutate()
features that cannot yet be called on Arrow objects are (1)
mutate() after group_by() (which is typically
used in combination with aggregation) and (2) queries that use
dplyr::across().dplyr::transmute() (which calls
mutate())dplyr::group_by() now preserves the .drop
argument and supports on-the-fly definition of columnsdplyr::relocate() to reorder columnsdplyr::arrange() to sort rowsdplyr::compute() to evaluate the lazy expressions and
return an Arrow Table. This is equivalent to
dplyr::collect(as_data_frame = FALSE), which was added in
2.0.0.
Over 100 functions can now be called on Arrow objects inside a
dplyr verb:
nchar(), tolower(), and
toupper(), along with their stringr spellings
str_length(), str_to_lower(), and
str_to_upper(), are supported in Arrow dplyr
calls. str_trim() is also supported.sub(),
gsub(), and grepl(), along with
str_replace(), str_replace_all(), and
str_detect(), are supported.cast(x, type) and dictionary_encode()
allow changing the type of columns in Arrow objects;
as.numeric(), as.character(), etc. are exposed
as similar type-altering conveniencesdplyr::between(); the Arrow version also allows the
left and right arguments to be columns in the
data and not just scalarsdplyr verb. This enables you to access Arrow functions that
don’t have a direct R mapping. See list_compute_functions()
for all available functions, which are available in dplyr
prefixed by arrow_.dplyr::filter(arrow_dataset, string_column == 3) will error
with a message about the type mismatch between the numeric
3 and the string type of string_column.open_dataset() now accepts a vector of file paths (or
even a single file path). Among other things, this enables you to open a
single very large file and use write_dataset() to partition
it without having to read the whole file into memory.write_dataset() now defaults to
format = "parquet" and better validates the
format argumentschema in open_dataset()
is now correctly handledScanner$Scan() method has been removed; use
Scanner$ScanBatches()value_counts() to tabulate values in an
Array or ChunkedArray, similar to
base::table().StructArray objects gain data.frame-like methods,
including names(), $, [[, and
dim().<-) with either $ or
[[Schema can now be edited by assigning in new
types. This enables using the CSV reader to detect the schema of a file,
modify the Schema object for any columns that you want to
read in as a different type, and then use that Schema to
read the data.Table with a schema,
with columns of different lengths, and with scalar value recycling\0) characters, the error message now informs you that you
can set options(arrow.skip_nul = TRUE) to strip them out.
It is not recommended to set this option by default since this code path
is significantly slower, and most string data does not contain
nuls.read_json_arrow() now accepts a schema:
read_json_arrow("file.json", schema = schema(col_a = float64(), col_b = string()))vignette("install", package = "arrow") for details. This
allows a faster, smaller package build in cases where that is useful,
and it enables a minimal, functioning R package build on Solaris.FORCE_BUNDLED_BUILD=true.arrow now uses the mimalloc memory
allocator by default on macOS, if available (as it is in CRAN binaries),
instead of jemalloc. There are configuration
issues with jemalloc on macOS, and benchmark
analysis shows that this has negative effects on performance,
especially on memory-intensive workflows. jemalloc remains
the default on Linux; mimalloc is default on Windows.ARROW_DEFAULT_MEMORY_POOL environment
variable to switch memory allocators now works correctly when the Arrow
C++ library has been statically linked (as is usually the case when
installing from CRAN).arrow_info() function now reports on the additional
optional features, as well as the detected SIMD level. If key features
or compression libraries are not enabled in the build,
arrow_info() will refer to the installation vignette for
guidance on how to install a more complete build, if desired.vignette("developing", package = "arrow").ARROW_HOME to point to a specific directory where the Arrow
libraries are. This is similar to passing INCLUDE_DIR and
LIB_DIR.flight_get() and
flight_put() (renamed from push_data() in this
release) can handle both Tables and RecordBatchesflight_put() gains an overwrite argument
to optionally check for the existence of a resource with the same
namelist_flights() and flight_path_exists()
enable you to see available resources on a Flight serverSchema objects now have r_to_py and
py_to_r methods+, *, etc.) are
supported on Arrays and ChunkedArrays and can be used in filter
expressions in Arrow dplyr pipelines<-) with either $ or [[names()rlang pronouns .data and
.env are now fully supported in Arrow dplyr
pipelines.arrow.skip_nul (default FALSE, as
in base::scan()) allows conversion of Arrow string
(utf8()) type data containing embedded nul \0
characters to R. If set to TRUE, nuls will be stripped and
a warning is emitted if any are found.arrow_info() for an overview of various run-time and
build-time Arrow configurations, useful for debuggingARROW_DEFAULT_MEMORY_POOL
before loading the Arrow package to change memory allocators. Windows
packages are built with mimalloc; most others are built
with both jemalloc (used by default) and
mimalloc. These alternative memory allocators are generally
much faster than the system memory allocator, so they are used by
default when available, but sometimes it is useful to turn them off for
debugging purposes. To disable them, set
ARROW_DEFAULT_MEMORY_POOL=system.sf tibbles to faithfully preserved and
roundtripped (#8549).schema() for more details.write_parquet() can now write RecordBatchesreadr’s problems attribute is removed when
converting to Arrow RecordBatch and table to prevent large amounts of
metadata from accumulating inadvertently (#9092)SubTreeFileSystem gains a useful print method and no
longer errors when printingr-arrow
package are available with
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrowcmake
versionsvignette("install", package = "arrow"), especially for
known CentOS issuesdistro package. If
your OS isn’t correctly identified, please report an issue there.write_dataset() to Feather or Parquet files with
partitioning. See the end of
vignette("dataset", package = "arrow") for discussion and
examples.head(), tail(), and take
([) methods. head() is optimized but the
others may not be performant.collect() gains an as_data_frame argument,
default TRUE but when FALSE allows you to
evaluate the accumulated select and filter
query but keep the result in Arrow, not an R
data.frameread_csv_arrow() supports specifying column types, both
with a Schema and with the compact string representation
for types used in the readr package. It also has gained a
timestamp_parsers argument that lets you express a set of
strptime parse strings that will be tried to convert
columns designated as Timestamp type.libcurl and
openssl, as well as a sufficiently modern compiler. See
vignette("install", package = "arrow") for details.read_parquet(),
write_feather(), et al.), as well as
open_dataset() and write_dataset(), allow you
to access resources on S3 (or on file systems that emulate S3) either by
providing an s3:// URI or by providing a
FileSystem$path(). See
vignette("fs", package = "arrow") for examples.copy_files() allows you to recursively copy directories
of files from one file system to another, such as from S3 to your local
machine.Flight
is a general-purpose client-server framework for high performance
transport of large datasets over network interfaces. The
arrow R package now provides methods for connecting to
Flight RPC servers to send and receive data. See
vignette("flight", package = "arrow") for an overview.
==, >, etc.) and boolean
(&, |, !) operations, along
with is.na, %in% and match
(called match_arrow()), on Arrow Arrays and ChunkedArrays
are now implemented in the C++ library.min(), max(), and
unique() are implemented for Arrays and ChunkedArrays.dplyr filter expressions on Arrow Tables and
RecordBatches are now evaluated in the C++ library, rather than by
pulling data into R and evaluating. This yields significant performance
improvements.dim() (nrow) for dplyr queries on
Table/RecordBatch is now supportedarrow now depends on cpp11, which brings
more robust UTF-8 handling and faster compilationInt64 type when all
values fit with an R 32-bit integer now correctly inspects all chunks in
a ChunkedArray, and this conversion can be disabled (so that
Int64 always yields a bit64::integer64 vector)
by setting options(arrow.int64_downcast = FALSE).ParquetFileReader has additional methods for accessing
individual columns or row groups from the fileParquetFileWriter; invalid ArrowObject pointer
from a saved R object; converting deeply nested structs from Arrow to
Rproperties and arrow_properties
arguments to write_parquet() are deprecated%in% expression now faithfully returns all relevant
rows. or _; files and subdirectories starting
with those prefixes are still ignoredopen_dataset("~/path") now correctly expands the
pathversion option to write_parquet() is
now correctly implementedparquet-cpp library has been
fixedcmake
is more robust, and you can now specify a /path/to/cmake by
setting the CMAKE environment variablevignette("arrow", package = "arrow") includes tables
that explain how R types are converted to Arrow types and vice
versa.uint64, binary,
fixed_size_binary, large_binary,
large_utf8, large_list, list of
structs.character vectors that exceed 2GB are converted to
Arrow large_utf8 typePOSIXlt objects can now be converted to Arrow
(struct)attributes() are preserved in Arrow metadata when
converting to Arrow RecordBatch and table and are restored when
converting from Arrow. This means that custom subclasses, such as
haven::labelled, are preserved in round trip through
Arrow.batch$metadata$new_key <- "new value"int64, uint32, and
uint64 now are converted to R integer if all
values fit in boundsdate32 is now converted to R Date
with double underlying storage. Even though the data values
themselves are integers, this provides more strict round-trip
fidelityfactor, dictionary
ChunkedArrays that do not have identical dictionaries are properly
unifiedRecordBatch{File,Stream}Writer will
write V5, but you can specify an alternate
metadata_version. For convenience, if you know the consumer
you’re writing to cannot read V5, you can set the environment variable
ARROW_PRE_1_0_METADATA_VERSION=1 to write V4 without
changing any other code.ds <- open_dataset("s3://...").
Note that this currently requires a special C++ library build with
additional dependencies–this is not yet available in CRAN releases or in
nightly packages.sum() and
mean() are implemented for Array and
ChunkedArraydimnames() and as.list()reticulatecoerce_timestamps option to
write_parquet() is now correctly implemented.type
definition if provided by the userread_arrow and write_arrow are now
deprecated; use the read/write_feather() and
read/write_ipc_stream() functions depending on whether
you’re working with the Arrow IPC file or stream format,
respectively.FileStats,
read_record_batch, and read_table have been
removed.jemalloc included, and Windows packages
use mimallocCC and
CXX values that R usesdplyr 1.0reticulate::r_to_py() conversion now correctly works
automatically, without having to call the method yourselfThis release includes support for version 2 of the Feather file
format. Feather v2 features full support for all Arrow data types, fixes
the 2GB per-column limitation for large amounts of string data, and it
allows files to be compressed using either lz4 or
zstd. write_feather() can write either version
2 or version 1 Feather
files, and read_feather() automatically detects which file
version it is reading.
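For example (a sketch; file paths are illustrative):

```r
library(arrow)

# Write a compressed Feather V2 file, or version = 1 for compatibility
write_feather(mtcars, "mtcars.feather", compression = "zstd")
write_feather(mtcars, "mtcars_v1.feather", version = 1)

# read_feather() detects the file version automatically
df <- read_feather("mtcars.feather")
```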
Related to this change, several functions around reading and writing
data have been reworked. read_ipc_stream() and
write_ipc_stream() have been added to facilitate writing
data to the Arrow IPC stream format, which is slightly different from
the IPC file format (Feather v2 is the IPC file format).
Behavior has been standardized: all
read_<format>() functions return an R data.frame
(default) or a Table if the argument
as_data_frame = FALSE; all
write_<format>() functions return the data object,
invisibly. To facilitate some workflows, a special
write_to_raw() function is added to wrap
write_ipc_stream() and return the raw vector
containing the buffer that was written.
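A quick sketch of the write_to_raw() round trip:

```r
library(arrow)

# Serialize a data frame to a raw vector in the Arrow IPC stream format
buf <- write_to_raw(mtcars)

# ...and read it back
read_ipc_stream(buf)
```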
To achieve this standardization, read_table(),
read_record_batch(), read_arrow(), and
write_arrow() have been deprecated.
The 0.17 Apache Arrow release includes a C data interface that allows
exchanging Arrow data in-process at the C level without copying and
without libraries having a build or runtime dependency on each other.
This enables us to use reticulate to share data between R
and Python (pyarrow) efficiently.
See vignette("python", package = "arrow") for
details.
dim() method, which sums rows across
all files (#6635, @boshek)UnionDataset with the c() methodNA as FALSE,
consistent with dplyr::filter()vignette("dataset", package = "arrow") now has correct,
executable codeNOT_CRAN=true. See
vignette("install", package = "arrow") for details and more
options.unify_schemas() to create a Schema
containing the union of fields in multiple schemasread_feather() and other reader functions close any
file connections they openR.oo package is also loadedFileStats is renamed to FileInfo, and the
original spelling has been deprecatedinstall_arrow() now installs the latest release of
arrow, including Linux dependencies, either for CRAN
releases or for development builds (if nightly = TRUE)LIBARROW_DOWNLOAD or NOT_CRAN
environment variable is setwrite_feather(), write_arrow() and
write_parquet() now return their input, similar to the
write_* functions in the readr package (#6387,
@boshek)list and create a
ListArray when all list elements are the same type (#6275, @michaelchirico)This release includes a dplyr interface to Arrow
Datasets, which let you work efficiently with large, multi-file datasets
as a single entity. Explore a directory of data files with
open_dataset() and then use dplyr methods to
select(), filter(), etc. Work will be done
where possible in Arrow memory. When necessary, data is pulled into R
for further computation. dplyr methods are conditionally
loaded if you have dplyr available; it is not a hard
dependency.
See vignette("dataset", package = "arrow") for
details.
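A sketch of the basic workflow (the directory path and column names are hypothetical):

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/partitioned-parquet")   # hypothetical directory of files

ds %>%
  select(carrier, dep_delay) %>%   # hypothetical columns
  filter(dep_delay > 60) %>%
  collect()
```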
A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires.
See vignette("install", package = "arrow") for
details.
Tables and RecordBatches also have
dplyr methods. In addition to dplyr, [ methods
for Tables, RecordBatches, Arrays, and ChunkedArrays now support natural
row extraction operations. These use the C++ Filter,
Slice, and Take methods for efficient access,
depending on the type of selection vector.array_expression
class has also been added, enabling among other things the ability to
filter a Table with some function of Arrays, such as
arrow_table[arrow_table$var1 > 5, ] without having to
pull everything into R first.write_parquet() now supports compressioncodec_is_available() returns TRUE or
FALSE whether the Arrow C++ library was built with support
for a given compression library (e.g. gzip, lz4, snappy)character (as R factor levels are required to
be) instead of raising an errorClass$create() methods. Notably,
arrow::array() and arrow::table() have been
removed in favor of Array$create() and
Table$create(), eliminating the package startup message
about masking base functions. For more information, see the
new vignette("arrow").ARROW_PRE_0_15_IPC_FORMAT=1.as_tibble argument in the read_*()
functions has been renamed to as_data_frame (#5399, @jameslamb)arrow::Column class has been removed, as it was
removed from the C++ libraryTable and RecordBatch objects have S3
methods that enable you to work with them more like
data.frames. Extract columns, subset, and so on. See
?Table and ?RecordBatch for examples.read_csv_arrow() supports more parsing options,
including col_names, na,
quoted_na, and skipread_parquet() and read_feather() can
ingest data from a raw vector (#5141)~/file.parquet (#5169)double()), and time types can be created
with human-friendly resolution strings (“ms”, “s”, etc.). (#5198,
#5201)Initial CRAN release of the arrow package. Key features
include: