tok: Fast Text Tokenization

Interfaces with the 'Hugging Face' tokenizers library (<https://huggingface.co/docs/tokenizers/index>) to provide implementations of today's most widely used tokenizers, such as the 'Byte-Pair Encoding' (BPE) algorithm. It is extremely fast for both training new vocabularies and tokenizing text.
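As a sketch of typical usage, the snippet below loads a pretrained tokenizer from the Hugging Face Hub and round-trips a string through encoding and decoding. The `tokenizer` R6 class and its `from_pretrained()`, `encode()`, and `decode()` methods follow the package's README; treat the exact method names as assumptions against any given version, and note that `from_pretrained()` requires network access (and the suggested 'hfhub' package) to download the tokenizer definition.

```r
library(tok)

# Download and load the tokenizer definition for a Hub model id
# (here "gpt2"; assumes network access and the 'hfhub' package).
tokenizer <- tokenizer$from_pretrained("gpt2")

# Encode text into token ids, then decode the ids back to text.
enc <- tokenizer$encode("Hello, world!")
enc$ids                    # integer token ids for the input string
tokenizer$decode(enc$ids)  # reconstructs the original text
```

A tokenizer can also be loaded from a local `tokenizer.json` file when one is available, which avoids the network dependency.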

Version: 0.1.4
Depends: R (≥ 4.2.0)
Imports: R6, cli
Suggests: rmarkdown, testthat (≥ 3.0.0), hfhub (≥ 0.1.1), withr
Published: 2024-09-04
DOI: 10.32614/CRAN.package.tok
Author: Daniel Falbel [aut, cre], Posit [cph]
Maintainer: Daniel Falbel <daniel at posit.co>
BugReports: https://github.com/mlverse/tok/issues
License: MIT + file LICENSE
URL: https://github.com/mlverse/tok
NeedsCompilation: yes
SystemRequirements: Rust toolchain with cargo, libclang/llvm-config
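Because the package compiles Rust code at install time, a source install needs the tools above on the `PATH`. A minimal sketch for checking the prerequisites before running `install.packages("tok")` (command names are the usual ones; locations vary by platform):

```shell
# Report whether each build prerequisite for a source install is present.
for tool in cargo rustc llvm-config; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: OK"
  else
    echo "$tool: missing"
  fi
done
```

The pre-built Windows and macOS binaries below do not require the Rust toolchain.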
Materials: README NEWS
CRAN checks: tok results [issues need fixing before 2024-10-11]

Documentation:

Reference manual: tok.pdf

Downloads:

Package source: tok_0.1.4.tar.gz
Windows binaries: r-devel: tok_0.1.4.zip, r-release: tok_0.1.4.zip, r-oldrel: tok_0.1.4.zip
macOS binaries: r-release (arm64): tok_0.1.4.tgz, r-oldrel (arm64): tok_0.1.4.tgz, r-release (x86_64): tok_0.1.4.tgz, r-oldrel (x86_64): tok_0.1.4.tgz
Old sources: tok archive

Reverse dependencies:

Reverse imports: sacRebleu

Linking:

Please use the canonical form https://CRAN.R-project.org/package=tok to link to this page.