| Title: | Alt String Implementation |
| Version: | 0.19.0 |
| Date: | 2026-04-18 |
| Maintainer: | Travers Ching <traversc@gmail.com> |
| Description: | Provides an extendable, performant and multithreaded 'alt-string' implementation backed by 'C++' vectors and strings. |
| License: | GPL-3 |
| Biarch: | true |
| Encoding: | UTF-8 |
| Depends: | R (≥ 3.6.0) |
| SystemRequirements: | GNU make, C++17 |
| LinkingTo: | Rcpp (≥ 0.12.18.3), RcppParallel (≥ 5.1.4) |
| Imports: | Rcpp, RcppParallel |
| Suggests: | knitr, rmarkdown |
| VignetteBuilder: | knitr |
| RoxygenNote: | 7.3.3 |
| Copyright: | Copyright for the bundled 'PCRE2' library is held by University of Cambridge, Zoltan Herczeg and Tilera Corporation (Stack-less Just-In-Time compiler); Copyright for the bundled 'xxHash' code is held by Yann Collet. |
| URL: | https://github.com/traversc/stringfish |
| BugReports: | https://github.com/traversc/stringfish/issues |
| NeedsCompilation: | yes |
| Packaged: | 2026-04-18 03:31:12 UTC; ted |
| Author: | Travers Ching [aut, cre, cph], Phillip Hazel [ctb] (Bundled PCRE2 code), Zoltan Herczeg [ctb, cph] (Bundled PCRE2 code), University of Cambridge [cph] (Bundled PCRE2 code), Tilera Corporation [cph] (Stack-less Just-In-Time compiler bundled with PCRE2), Yann Collet [ctb, cph] (Yann Collet is the author of the bundled xxHash code) |
| Repository: | CRAN |
| Date/Publication: | 2026-04-21 11:00:03 UTC |
convert_to_sf_vector
Description
Converts a character vector to an 'sf_vec'-backed stringfish vector. 'convert_to_sf()' is a compatibility alias for 'convert_to_sf_vector()'.
Usage
convert_to_sf_vector(x, length.out = length(x))
Arguments
x |
A character vector |
length.out |
Optional output length used to recycle 'x' |
Details
Converts a character vector to a stringfish vector backed by 'sf_vec'. If 'length.out' is supplied, 'x' is recycled to that length before conversion. The opposite of 'materialize'.
Value
The converted character vector
Examples
x <- convert_to_sf_vector(letters)
convert_to_slice_store
Description
Converts a character vector to a slice-store-backed stringfish vector
Usage
convert_to_slice_store(x, length.out = length(x))
Arguments
x |
A character vector |
length.out |
Optional output length used to recycle 'x' |
Details
Converts a character vector to a stringfish vector backed by 'slice_store'. If 'length.out' is supplied, 'x' is recycled to that length before conversion. The converter pre-sizes the first 'slice_store' slice from the normalized string bytes when possible. The opposite of 'materialize'.
Value
The converted character vector
Examples
x <- convert_to_slice_store(letters)
get_string_type
Description
Returns the type of the character vector
Usage
get_string_type(x)
Arguments
x |
the vector |
Details
Returns the type of a character vector as a string. Possible values are "normal vector", "stringfish vector", "stringfish vector (materialized)", "stringfish slice store", "stringfish slice store (materialized)" or "other alt-rep vector".
Value
The type of vector
Examples
x <- sf_vector(10)
get_string_type(x) # returns "stringfish vector"
x <- character(10)
get_string_type(x) # returns "normal vector"
materialize
Description
Materializes an alt-rep object
Usage
materialize(x)
Arguments
x |
An alt-rep object |
Details
Materializes any alt-rep object and then returns it. Note: the object is materialized regardless of whether the return value is assigned to a variable.
Value
x
Examples
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
x <- materialize(x)
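The effect of materialization can be observed with 'get_string_type'. The following is a minimal sketch, assuming the package is installed and attached. The value reported after materialization is printed rather than asserted, since it depends on the backing store.

```r
library(stringfish)

x <- sf_vector(3)
sf_assign(x, 1, "hello world")

# A freshly created sf_vector is reported as "stringfish vector";
# sf_assign is documented not to materialize it.
before <- get_string_type(x)

# The object is materialized even though the return value is discarded.
materialize(x)
after <- get_string_type(x)

print(c(before = before, after = after))
```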
random_strings
Description
A function that generates random strings
Usage
random_strings(N, string_size = 50L, charset = "abcdefghijklmnopqrstuvwxyz",
vector_mode = "stringfish")
Arguments
N |
The number of strings to generate |
string_size |
Either a single non-negative integer applied to every output string, or a non-negative integer vector of length 'N'. |
charset |
The characters used to generate the random strings (default: abcdefghijklmnopqrstuvwxyz) |
vector_mode |
The type of character vector to generate (either stringfish or normal, default: stringfish) |
Details
A convenience function for generating test strings.
Value
A character vector of the random strings
Examples
set.seed(1)
x <- random_strings(1e6, 80L, "ACGT", vector_mode = "stringfish")
y <- random_strings(4, c(1L, 2L, 4L, 8L), "ACGT")
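As a quick sanity check of the 'string_size' and 'charset' arguments, the sketch below verifies the generated lengths and alphabet using other functions from this package. It assumes the package is attached; the variable names are illustrative only.

```r
library(stringfish)

set.seed(1)
y <- random_strings(4, c(1L, 2L, 4L, 8L), "ACGT")

# One length per element, matching the string_size vector
lens <- sf_nchar(y)

# Every string should consist only of characters from the charset
only_acgt <- all(sf_grepl(y, "^[ACGT]+$"))

print(lens)
print(only_acgt)
```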
sf_assign
Description
Assigns a new string to a stringfish vector or any other character vector
Usage
sf_assign(x, i, e)
Arguments
x |
the vector |
i |
the index to assign to |
e |
the new string to replace at i in x |
Details
A function to assign a new element to an existing character vector. If the vector is a stringfish vector, it does so without materialization.
Value
No return value, the function assigns an element to an existing stringfish vector
Examples
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
sf_collapse
Description
Pastes a series of strings together separated by the 'collapse' parameter
Usage
sf_collapse(x, collapse)
Arguments
x |
A character vector |
collapse |
A single string |
Details
This works the same way as 'paste0(x, collapse=collapse)'
Value
A single string with all values in 'x' pasted together, separated by 'collapse'.
See Also
paste0, paste
Examples
x <- c("hello", "\xe4\xb8\x96\xe7\x95\x8c")
Encoding(x) <- "UTF-8"
sf_collapse(x, " ") # "hello" followed by "world" in Japanese
sf_collapse(letters, "") # returns the alphabet
sf_compare
Description
Returns a logical vector testing equality of strings from two string vectors
Usage
sf_compare(x, y, nthreads = getOption("stringfish.nthreads", 1L))
sf_equals(x, y, nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector of length 1 or the same non-zero length as y |
y |
Another character vector of length 1 or the same non-zero length as x |
nthreads |
Number of threads to use |
Details
Note: the function tests semantic string equality. Non-byte text is normalized to a UTF-8 working representation, while 'CE_BYTES' strings are compared byte-for-byte.
Value
A logical vector
Examples
sf_compare(letters, "a")
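The sketch below illustrates both the length-1 recycling and the semantic comparison described in the Details. It assumes the package is attached; the latin1/UTF-8 pair mirrors the example used for 'string_identical'.

```r
library(stringfish)

# A length-1 argument is compared against every element of the other vector
res <- sf_compare(letters, "a")
base_res <- letters == "a"

# The same text in two encodings compares as equal
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
y <- iconv(x, "latin1", "UTF-8")
sem <- sf_compare(x, y)

print(all(res == base_res))
print(sem)
```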
sf_concat
Description
Appends vectors together
Usage
sf_concat(...)
sfc(...)
Arguments
... |
Any number of vectors, coerced to character vector if necessary |
Value
A concatenated stringfish vector
Examples
sf_concat(letters, 1:5)
sf_ends
Description
A function for detecting a pattern at the end of a string
Usage
sf_ends(subject, pattern, ...)
Arguments
subject |
A character vector |
pattern |
A string to look for at the end |
... |
Parameters passed to sf_grepl |
Value
A logical vector: TRUE if there is a match, FALSE if there is no match, and NA if the subject was NA
See Also
endsWith, sf_starts
Examples
x <- c("alpha", "beta", "gamma", "delta", "epsilon")
sf_ends(x, "a")
sf_grepl
Description
A function that matches patterns and returns a logical vector
Usage
sf_grepl(subject, pattern, encode_mode = "auto", fixed = FALSE,
nthreads = getOption("stringfish.nthreads", 1L))
Arguments
subject |
The subject character vector to search |
pattern |
The pattern to search for |
encode_mode |
"auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed |
determines whether the pattern parameter should be interpreted literally or as a regular expression |
nthreads |
Number of threads to use |
Details
The function uses the PCRE2 library, which is also used internally by R. The encoding is based on the pattern string (or forced via the encode_mode parameter). Note: the order of parameters is switched compared to the base R function 'grepl', with subject being first.
Value
A logical vector with the same length as subject
See Also
grepl
Examples
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
pattern <- "^hello"
sf_grepl(x, pattern)
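The following sketch shows the switched argument order relative to base R and the effect of 'fixed'. It assumes the package is attached; the input strings are made up for illustration.

```r
library(stringfish)

x <- c("a.b", "axb")

# As a regular expression, "." matches any character
regex_hits <- sf_grepl(x, "a.b")

# With fixed = TRUE the pattern is taken literally
literal_hits <- sf_grepl(x, "a.b", fixed = TRUE)

# Base R takes the pattern first and the subject second
base_hits <- grepl("a.b", x, fixed = TRUE)

print(regex_hits)
print(literal_hits)
```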
sf_gsub
Description
A function that performs pattern substitution
Usage
sf_gsub(subject, pattern, replacement, encode_mode = "auto", fixed = FALSE,
nthreads = getOption("stringfish.nthreads", 1L))
Arguments
subject |
The subject character vector to search |
pattern |
The pattern to search for |
replacement |
The replacement string |
encode_mode |
"auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed |
determines whether the pattern parameter should be interpreted literally or as a regular expression |
nthreads |
Number of threads to use |
Details
The function uses the PCRE2 library, which is also used internally by R. However, the syntax may differ slightly; for example, capture groups are referenced as "\1" in R but as "$1" in PCRE2 (as in Perl). The encoding of the output is determined by the pattern (or forced using the encode_mode parameter), and the encodings of the inputs should be compatible: mixing ASCII and UTF-8 is fine, but mixing UTF-8 and latin1 is not. Note: the order of parameters is switched compared to the base R function 'gsub', with subject being first.
Value
A stringfish vector in which each match of the pattern has been replaced by the replacement string
See Also
gsub
Examples
x <- "hello world"
pattern <- "^hello (.+)"
replacement <- "goodbye $1"
sf_gsub(x, pattern, replacement)
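To make the difference in capture-group syntax concrete, the sketch below performs the same substitution with base 'gsub' and with 'sf_gsub'. It assumes the package is attached.

```r
library(stringfish)

x <- "hello world"

# Base R: pattern first, back-reference written as \\1 in an R string literal
base_out <- gsub("^hello (.+)", "goodbye \\1", x)

# stringfish: subject first, back-reference written as $1
sf_out <- sf_gsub(x, "^hello (.+)", "goodbye $1")

print(base_out)
print(sf_out)
```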
sf_iconv
Description
Converts a character vector from one encoding to another
Usage
sf_iconv(x, from, to, nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
An alt-rep object |
from |
the encoding to assume of 'x' |
to |
the new encoding |
nthreads |
Number of threads to use |
Details
This is an analogue to the base R function 'iconv'. It converts a string from one encoding (e.g. latin1 or UTF-8) to another.
Value
the converted character vector as a stringfish vector
See Also
iconv
Examples
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
sf_iconv(x, "latin1", "UTF-8")
sf_match
Description
Returns a vector of the positions of x in table
Usage
sf_match(x, table, nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector to search for in table |
table |
A character vector to be matched against x |
nthreads |
Number of threads to use |
Details
Note: similarly to the base R function, long "table" vectors are not supported. This is due to the maximum integer value that can be returned ('.Machine$integer.max')
Value
An integer vector of the indices of each x element's position in table
See Also
match
Examples
sf_match("c", letters)
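A short comparison with base 'match' is sketched below, including an element that is absent from the table. It assumes the package is attached.

```r
library(stringfish)

needles <- c("c", "zz", "a")

# Positions of each needle in the table; NA where there is no match
sf_pos <- sf_match(needles, letters)
base_pos <- match(needles, letters)

print(sf_pos)
print(base_pos)
```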
sf_nchar
Description
Counts the number of characters in a character vector
Usage
sf_nchar(x, type = "chars", nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector |
type |
The type of counting to perform ("chars" or "bytes", default: "chars") |
nthreads |
Number of threads to use |
Details
Returns the number of characters per string. The type of counting only matters for UTF-8 strings, where a character can be represented by multiple bytes.
Value
An integer vector of the number of characters
See Also
nchar
Examples
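x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x <- sf_iconv(x, "latin1", "UTF-8")
sf_nchar(x) # 6 characters
sf_nchar(x, type = "bytes") # 7 bytes, since the c-cedilla takes two bytes in UTF-8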
sf_paste
Description
Pastes a series of strings together
Usage
sf_paste(..., sep = "", nthreads = getOption("stringfish.nthreads", 1L))
Arguments
... |
Any number of character vector strings |
sep |
The separating string between strings |
nthreads |
Number of threads to use |
Details
This works the same way as 'paste(..., sep = sep)'. With the default 'sep = ""' it is equivalent to 'paste0(...)'.
Value
A character vector where elements of the arguments are pasted together
See Also
paste0, paste
Examples
x <- letters
y <- LETTERS
sf_paste(x, y, sep = ":")
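The correspondence with base R can be checked directly. The sketch below, assuming the package is attached, compares 'sf_paste' against 'paste' with an explicit separator and against 'paste0' with the default.

```r
library(stringfish)

# With a separator, the base R counterpart is paste()
with_sep <- sf_paste(letters, LETTERS, sep = ":")
base_with_sep <- paste(letters, LETTERS, sep = ":")

# With the default sep = "", the counterpart is paste0()
no_sep <- sf_paste(letters, LETTERS)
base_no_sep <- paste0(letters, LETTERS)

print(head(with_sep, 3))
print(head(no_sep, 3))
```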
sf_readLines
Description
A function that reads a file line by line
Usage
sf_readLines(file, encoding = "UTF-8")
Arguments
file |
The file name |
encoding |
The encoding to use (Default: UTF-8) |
Details
A function for reading in text data using 'std::ifstream'.
Value
A stringfish vector of the lines in a file
See Also
readLines
Examples
file <- tempfile()
sf_writeLines(letters, file)
sf_readLines(file)
sf_split
Description
A function to split strings by a delimiter
Usage
sf_split(subject, split, encode_mode = "auto", fixed = FALSE,
nthreads = getOption("stringfish.nthreads", 1L))
Arguments
subject |
A character vector |
split |
A delimiter to split the string by |
encode_mode |
"auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed |
determines whether the split parameter should be interpreted literally or as a regular expression |
nthreads |
Number of threads to use |
Value
A list of stringfish character vectors
See Also
strsplit
Examples
sf_split(datasets::state.name, "\\s") # split U.S. state names by any space character
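For a literal delimiter, 'fixed = TRUE' avoids regular-expression interpretation. The sketch below, assuming the package is attached, splits on a comma and compares the result with 'strsplit' for this simple input. Edge cases such as empty or trailing fields are not covered here.

```r
library(stringfish)

x <- c("a,b,c", "d")

# One list element per input string
parts <- sf_split(x, ",", fixed = TRUE)
base_parts <- strsplit(x, ",", fixed = TRUE)

print(parts)
```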
sf_starts
Description
A function for detecting a pattern at the start of a string
Usage
sf_starts(subject, pattern, ...)
Arguments
subject |
A character vector |
pattern |
A string to look for at the start |
... |
Parameters passed to sf_grepl |
Value
A logical vector: TRUE if there is a match, FALSE if there is no match, and NA if the subject was NA
See Also
startsWith, sf_ends
Examples
x <- c("alpha", "beta", "gamma", "delta", "epsilon")
sf_starts(x, "a")
sf_substr
Description
Extracts substrings from a character vector
Usage
sf_substr(x, start, stop, nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector |
start |
The position of the first character to extract |
stop |
The position of the last character to extract |
nthreads |
Number of threads to use |
Details
This works the same way as 'substr', but in addition allows negative indexing. Negative indices count backwards from the end of the string, with -1 being the last character.
Value
A stringfish vector of substrings
See Also
substr
Examples
x <- c("fa\xE7ile", "hello world")
Encoding(x) <- "latin1"
x <- sf_iconv(x, "latin1", "UTF-8")
sf_substr(x, 4, -1) # extracts from the 4th character to the last
## [1] "ile" "lo world"
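Negative indices can also be used for both ends. The sketch below, assuming the package is attached, extracts the last five characters and shows the equivalent base R call, which has to compute the positions from 'nchar'.

```r
library(stringfish)

x <- "hello world"

# -5 is the fifth character from the end, -1 is the last character
tail5 <- sf_substr(x, -5, -1)

# Base R equivalent with positive positions
base_tail5 <- substr(x, nchar(x) - 4, nchar(x))

print(tail5)
print(base_tail5)
```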
sf_tolower
Description
A function converting a string to all lowercase
Usage
sf_tolower(x)
Arguments
x |
A character vector |
Details
Note: the function only converts ASCII characters.
Value
A stringfish vector where all uppercase is converted to lowercase
See Also
tolower
Examples
x <- LETTERS
sf_tolower(x)
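The ASCII-only behaviour noted in the Details can be illustrated with a string that contains a non-ASCII uppercase letter. This is a sketch assuming the package is attached and that non-ASCII characters pass through unchanged, as the note implies.

```r
library(stringfish)

# "\u00c9" is an uppercase E with acute accent, which is not ASCII
x <- "\u00c9COLE"

# Only the ASCII letters C, O, L, E are expected to be lowercased
out <- sf_tolower(x)

print(out)
```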
sf_toupper
Description
A function converting a string to all uppercase
Usage
sf_toupper(x)
Arguments
x |
A character vector |
Details
Note: the function only converts ASCII characters.
Value
A stringfish vector where all lowercase is converted to uppercase
See Also
toupper
Examples
x <- letters
sf_toupper(x)
sf_trim
Description
A function to remove leading/trailing whitespace
Usage
sf_trim(subject, which = c("both", "left", "right"), whitespace = "[ \\t\\r\\n]", ...)
Arguments
subject |
A character vector |
which |
"both", "left", or "right" determines which white space is removed |
whitespace |
Whitespace characters (default: "[ \\t\\r\\n]") |
... |
Parameters passed to sf_gsub |
Value
A stringfish vector with the selected whitespace removed
See Also
trimws
Examples
x <- c(" alpha ", " beta", " gamma ", "delta ", "epsilon ")
sf_trim(x)
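The 'which' argument restricts trimming to one side. The sketch below, assuming the package is attached, trims only leading whitespace and compares the result with base 'trimws'.

```r
library(stringfish)

x <- c("  alpha  ", "beta  ")

# Only leading whitespace is removed; trailing whitespace is kept
left_trimmed <- sf_trim(x, which = "left")
base_left <- trimws(x, which = "left")

print(left_trimmed)
```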
sf_vector
Description
Creates a new empty stringfish vector
Usage
sf_vector(len)
Arguments
len |
length of the new vector |
Details
This is a backwards-compatible alias for 'sf_vector_create(len)'. It creates a new stringfish vector, an alt-rep character vector backed by a C++ "std::vector" as the internal memory representation. The vector type is "sfstring", which is a simple C++ class containing a "std::string" and a single byte (uint8_t) representing the encoding.
Value
A new empty stringfish vector
Examples
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
sf_vector_create
Description
Creates a new empty stringfish vector
Usage
sf_vector_create(len)
Arguments
len |
length of the new vector |
Details
This function creates a new empty 'sf_vec'-backed stringfish vector. If you want to fill the vector from character data, use 'convert_to_sf_vector'.
Value
A new stringfish vector
Examples
x <- sf_vector_create(4)
sf_writeLines
Description
A function that writes text line by line
Usage
sf_writeLines(text, file, sep = "\n", na_value = "NA", encode_mode = "UTF-8")
Arguments
text |
A character vector to write to file |
file |
Name of the file to write to |
sep |
The line separator character(s) |
na_value |
What to write in place of an NA string |
encode_mode |
"UTF-8" or "byte". If "UTF-8", text strings are normalized to UTF-8 while 'CE_BYTES' strings are written as raw bytes. |
Details
A function for writing text data using 'std::ofstream'.
See Also
writeLines
Examples
file <- tempfile()
sf_writeLines(letters, file)
sf_readLines(file)
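The 'na_value' argument controls how missing strings appear in the output file. The sketch below assumes the package is attached. The placeholder text "<missing>" is arbitrary, and base 'readLines' is used to read the file back.

```r
library(stringfish)

file <- tempfile()

# The NA element is written as the chosen placeholder
sf_writeLines(c("a", NA, "c"), file, na_value = "<missing>")

lines <- suppressWarnings(readLines(file))
print(lines)

unlink(file)
```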
slice_store_create
Description
Creates a new empty slice-store-backed stringfish vector
Usage
slice_store_create(len)
Arguments
len |
length of the new vector |
Details
This function creates a new stringfish vector backed by 'slice_store', which stores string bytes in append-only slices plus per-element records. If you want to fill the vector from character data, use 'convert_to_slice_store'.
Value
A new slice-store-backed stringfish vector
Examples
x <- slice_store_create(4)
slice_store_create_with_size
Description
Creates a new empty slice-store-backed stringfish vector with a fixed initial slice size
Usage
slice_store_create_with_size(len, initial_slice_size)
Arguments
len |
length of the new vector |
initial_slice_size |
Initial size of the first underlying 'slice_store' slice. |
Details
This function creates a new stringfish vector backed by 'slice_store', and uses 'initial_slice_size' for the first slice allocation instead of the default heuristic. If you want to fill the vector from character data, use 'convert_to_slice_store'.
Value
A new slice-store-backed stringfish vector
Examples
x <- slice_store_create_with_size(4, 256)
string_identical
Description
Compare strings semantically or exactly
Usage
string_identical(x, y, mode = c("semantic", "exact"))
Arguments
x |
A character vector |
y |
Another character vector to compare to x |
mode |
Either '"semantic"' to compare text after normalizing non-byte strings to UTF-8, or '"exact"' to additionally require matching encoding. Strings marked as '"bytes"' are always compared exactly. |
Value
TRUE if strings are identical under the selected comparison mode
See Also
identical
Examples
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
y <- iconv(x, "latin1", "UTF-8")
identical(x, y) # TRUE
string_identical(x, y) # TRUE
string_identical(x, y, mode = "exact") # FALSE