| Crates.io | drama_llama |
| --- | --- |
| lib.rs | drama_llama |
| version | 0.5.2 |
| source | src |
| created_at | 2024-04-20 20:13:24.956307 |
| updated_at | 2024-06-19 20:30:36.65483 |
| description | A library for language modeling and text generation. |
| homepage | |
| repository | https://github.com/mdegans/drama_llama |
| max_upload_size | |
| id | 1214837 |
| size | 3,085,489 |
# drama_llama

`drama_llama` is yet another Rust wrapper for llama.cpp. It is a work in progress and not intended for production use. The API will change.

For examples, see the `bin` folder. There are two example binaries.
## Features

- `cuda` and `cuda_f16` features for CUDA-accelerated builds (e.g. `cargo build --release --features cuda`).
- Code formatted with `rustfmt`.
## Roadmap

- `SampleOptions`: the single `mode` will become `modes`, applied one after another until only a single `Candidate` token remains (a sketch of this chaining follows this list).
- `Modelfile` support.
- A longest prefix cache (`llama.cpp` style). `llama.cpp` does not seem to manage a longest prefix cache automatically, so one will have to be written (a sketch also follows this list).
- Support for backends other than `llama.cpp` (eg. MLC, TensorRT-LLM, Ollama).
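To make the planned chaining concrete, here is a minimal sketch. It is a hypothetical illustration, not `drama_llama`'s actual API: the `Candidate` struct, the `Mode` enum, and the mode names are all stand-ins.

```rust
/// Stand-in for a token candidate: an id plus its logit.
#[derive(Clone, Copy, Debug)]
struct Candidate {
    id: u32,
    logit: f32,
}

/// Hypothetical sampling modes, applied in order.
enum Mode {
    /// Keep only the k highest-logit candidates.
    TopK(usize),
    /// Reduce to the single best candidate.
    Greedy,
}

fn apply(mode: &Mode, mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    // Sort descending by logit so the best candidates come first.
    candidates.sort_by(|a, b| b.logit.total_cmp(&a.logit));
    match mode {
        Mode::TopK(k) => candidates.truncate(*k),
        Mode::Greedy => candidates.truncate(1),
    }
    candidates
}

/// Apply `modes` one after another until only a single candidate remains.
fn sample(modes: &[Mode], mut candidates: Vec<Candidate>) -> Candidate {
    for mode in modes {
        if candidates.len() <= 1 {
            break;
        }
        candidates = apply(mode, candidates);
    }
    candidates[0]
}

fn main() {
    let candidates = vec![
        Candidate { id: 0, logit: 0.1 },
        Candidate { id: 1, logit: 2.3 },
        Candidate { id: 2, logit: 1.7 },
    ];
    // Top-2 keeps ids 1 and 2; greedy then picks id 1.
    println!("{:?}", sample(&[Mode::TopK(2), Mode::Greedy], candidates));
}
```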
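Similarly, the core of a longest prefix cache is finding how many leading tokens of a new prompt match what has already been evaluated, so only the differing suffix needs re-evaluation. A minimal sketch of that matching step (the `i32` token type is an assumption, mirroring llama.cpp's `llama_token`):

```rust
/// Length of the longest shared prefix between the tokens already
/// evaluated (and held in the KV cache) and a new prompt. Tokens before
/// this index can be reused; tokens from it onward must be evaluated.
fn longest_common_prefix(cached: &[i32], prompt: &[i32]) -> usize {
    cached
        .iter()
        .zip(prompt.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

fn main() {
    let cached = [1, 42, 7, 9];
    let prompt = [1, 42, 7, 100, 3];
    // Prints 3: the first three tokens are shared and can stay cached.
    println!("{}", longest_common_prefix(&cached, &prompt));
}
```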
## Known issues

- To generate with the unrestricted vocabulary, `--vocab unsafe` must be passed as a command-line argument, or `VocabKind::Unsafe` used for an `Engine` constructor (a self-contained sketch of the idea follows these notes).
- Because `mmap` is used, on subsequent process launches the model should already be cached by the OS.
- Documentation is not available on docs.rs because `llama.cpp`'s `CMakeLists.txt` generates code, and writing to the filesystem is not supported. For the moment, use `cargo doc --open` instead. Others have fixed this by patching `llama.cpp` in their bindings, but I'm not sure I want to do that for now.
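The vocabulary gating above amounts to filtering candidate tokens against a banned set unless the unsafe vocabulary is selected. Below is a self-contained sketch of that idea only; the types are stand-ins, not the crate's actual `VocabKind` or `Engine`, and the banned-set mechanism is an assumption.

```rust
use std::collections::HashSet;

/// Stand-in for the crate's vocabulary selection (not the real type).
#[derive(PartialEq)]
enum VocabKind {
    Safe,
    Unsafe,
}

/// Drop banned token ids unless the unsafe vocabulary was selected.
fn allowed_tokens(vocab: &VocabKind, candidates: &[u32], banned: &HashSet<u32>) -> Vec<u32> {
    candidates
        .iter()
        .copied()
        .filter(|id| *vocab == VocabKind::Unsafe || !banned.contains(id))
        .collect()
}

fn main() {
    let banned: HashSet<u32> = [7].into_iter().collect();
    let candidates = [1, 7, 9];
    // Safe vocabulary: token 7 is filtered out.
    assert_eq!(allowed_tokens(&VocabKind::Safe, &candidates, &banned), vec![1, 9]);
    // Unsafe vocabulary: everything passes through.
    assert_eq!(allowed_tokens(&VocabKind::Unsafe, &candidates, &banned), vec![1, 7, 9]);
}
```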