tok: Fast Text Tokenization

Interfaces with the 'Hugging Face' tokenizers library to provide implementations of today's most widely used tokenizers, such as the 'Byte-Pair Encoding' algorithm <https://huggingface.co/docs/tokenizers/index>. It is extremely fast for both training new vocabularies and tokenizing text.
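
For readers unfamiliar with the 'Byte-Pair Encoding' algorithm named above, the following is a minimal, illustrative sketch of the idea in Python. It is not the package's implementation or API, and it is not how the 'Hugging Face' tokenizers library is written; the function names (train_bpe, merge_pair, encode) and the toy word list are hypothetical and exist only for this example. The sketch assumes character-level starting symbols and breaks ties by taking the pair seen first.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Take the most frequent pair; on ties, the pair counted first wins.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus, made up for illustration only.
words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = train_bpe(words, 3)
print(merges)
print(encode("lowest", merges))
```

Training repeatedly merges the most frequent adjacent pair, so frequent substrings such as "low" become single tokens, while rarer words fall back to smaller pieces. Production tokenizers add byte-level handling, pre-tokenization, and special tokens that this sketch leaves out.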

Version: 0.2.0
Depends: R (≥ 4.2.0)
Imports: R6, cli
Suggests: rmarkdown, testthat (≥ 3.0.0), hfhub (≥ 0.1.1), withr
Published: 2025-08-27
Author: Daniel Falbel [aut, cre], Regouby Christophe [ctb], Posit [cph]
Maintainer: Daniel Falbel <daniel at posit.co>
BugReports: https://github.com/mlverse/tok/issues
License: MIT + file LICENSE
URL: https://github.com/mlverse/tok
NeedsCompilation: yes
SystemRequirements: Cargo (Rust's package manager), rustc >= 1.75
Materials: README, NEWS
CRAN checks: tok results

Documentation:

Reference manual: tok.html, tok.pdf

Downloads:

Package source: tok_0.2.0.tar.gz
Windows binaries: r-devel: not available, r-release: not available, r-oldrel: not available
macOS binaries: r-release (arm64): not available, r-oldrel (arm64): not available, r-release (x86_64): not available, r-oldrel (x86_64): not available
Old sources: tok archive

Linking:

Please use the canonical form https://CRAN.R-project.org/package=tok to link to this page.