token-counter

Crates.io: token-counter
lib.rs: token-counter
version: 0.1.0
source: src
created_at: 2024-07-04 21:35:24.178371
updated_at: 2024-07-04 21:35:24.178371
description: `wc` for tokens: count tokens in files with HF Tokenizers
repository: https://github.com/EndlessReform/token-counter
id: 1292122
size: 41,209
Jacob Keisling (EndlessReform)

README

tc - Token Count

tc is a CLI tool for counting tokens in text files, as a lightweight wrapper around the HuggingFace Tokenizers crate. It's like the Unix wc command, but for tokens instead of words.

Features

  • Count tokens in files or from stdin
  • Support for multiple files and glob patterns
  • Uses any tokenizer in HuggingFace Tokenizers
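
The wc-style counting described above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's actual source: the function names `count_tokens` and `count_all` are hypothetical, a whitespace splitter stands in for the HuggingFace tokenizer so the example runs without downloading a model, and the trailing total line is an assumption borrowed from `wc`.

```rust
/// Hypothetical stand-in for a real tokenizer: splits on whitespace.
/// The actual tool delegates tokenization to the HuggingFace Tokenizers crate,
/// so real token counts will differ from these.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Counts tokens for each (name, contents) pair and returns the per-input
/// counts together with their sum, mirroring the shape of `wc` output.
fn count_all(inputs: &[(&str, &str)]) -> (Vec<(String, usize)>, usize) {
    let counts: Vec<(String, usize)> = inputs
        .iter()
        .map(|(name, text)| (name.to_string(), count_tokens(text)))
        .collect();
    let total = counts.iter().map(|(_, n)| n).sum();
    (counts, total)
}

fn main() {
    // In-memory stand-ins for files on disk, to keep the sketch self-contained.
    let inputs = [
        ("file1.md", "# Title\nsome text here"),
        ("file2.md", "more text"),
    ];
    let (counts, total) = count_all(&inputs);
    for (name, n) in &counts {
        println!("{n:>8} {name}");
    }
    println!("{total:>8} total");
}
```

Swapping `count_tokens` for a call into a real tokenizer would leave the per-file and total logic unchanged, which is why the counting step is kept behind a single function here.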

Installation

cargo install token-counter

Usage

Using the default tokenizer (cl100k, the tokenizer for GPT-3.5 and GPT-4):

tc file1.md file2.md

Using globs:

tc *.md

Arguments:

  • -m, --model: HuggingFace ID of the model whose tokenizer should be used (e.g. google-bert/bert-base-uncased)