Crates.io | rusty-dawg |
lib.rs | rusty-dawg |
version | 0.2.2 |
source | src |
created_at | 2024-10-01 21:48:19.728775 |
updated_at | 2024-10-03 03:37:27.041497 |
description | A Rust library for building and querying Directed Acyclic Word Graphs (DAWGs) and Compacted DAWGs (CDAWGs) for efficient string indexing and searching. |
homepage | |
repository | https://github.com/viking-sudo-rm/rusty-dawg |
max_upload_size | |
id | 1393442 |
size | 353,452 |
A library for building directed acyclic word graphs (DAWGs) or compacted DAWGS (CDAWGS) for string indexing and searching in Rust.
The key features are:
This library was started by Will Merrill and Yanai Elazar as part of an internship project at AI2. Ananya Jha, Rodney Kinney, David Wadden, Pete Walsh have all since contributed core engineering features. We've also appreciated the support of Michal Guerquin, Johann Dahm, and other members of the Beaker team at AI2 for getting the library to run at very large scale, as well as the data structures expertise of Shunsuke Inenaga.
Simply use the one-liner here.
To run tests, you can call Cargo (which should have been installed with Rust) from inside the repo directory:
cargo test
To compile an optimized release build, you can run:
cargo build --release
Note that the --release
flag is very important for performance. The code will be 10-100x slower without it.
To run the benchmarking script, you need the Wikitext2/103 data. You can either download this to rusty-dawg/data path or point to an existing repository (easy on beaker, you can use my copy of the data).
You first need to download the data directory, unzip it, and put it in the root of the repository directory (i.e., rusty-dawg/data). Then you can run:
./scripts/benchmark.sh wikitext-2-raw
If the data is stored somewhere else, you can do:
DATA=/home/willm/splits ./scripts/benchmark.sh wikitext-2-raw
The core functionality of Rusty-DAWG is to build DAWGs and CDAWGs, which are indexing structures for large corpora. The CDAWG is a strict improvement of the DAWG, so we recommend using the CDAWG if you are building a new index from scratch.
To get started building a CDAWG on your corpus, we recommend adapting the scripts/cdawg/run_pile.sh script. This script was written to build a CDAWG (memory-efficient improvement of DAWG) on the Pile. It could be run as follows:
scripts/cdawg/run_pile.sh /home/willm/data/pile/00_0.json.gz /home/willm/cdawgs/00_0
Here the first argument is the path where the input corpus can be found.
By default, this should be a path to an input file in .jsonl.gz
format, where each line looks like:
{"text": "this is a document", "meta": {"data": "here"}}
The meta
key must be present but "meta": {}
can be specified if no metadata exists. If you wish to pass input data in a different format, you can change the --data-reader
flag to a different option.
The second argument is a path at which output CDAWG will get created (as well as a disk vector storing a copy of the training tokens and a log of CDAWG stats during building).
Other arguments in the scripts/cdawg/run_pile.sh script:
N_TOKENS
, NODES_RATIO
, and EDGES_RATIO
: These are used to allocate memory for the CDAWG. N_TOKENS
should be an upper bound on the number of tokens in the dataset. NODES_RATIO
and EDGES_RATIO
should be upper bounds on the # of nodes and # of edges per input token. For the DAWG, these have an upper bound of 2 and 3, and for the CDAWG, they will typically be (well) below 1 and 2. You can estimate these values for a large dataset by simply building on a smaller chunk of the data first and extrapolating.gpt2
tokenizer. You might consider using a different tokenizer, since gpt2
treats whitespace somewhat poorly.The library is implemented in Rust, but DAWGs, once built, can be loaded and used easily in Python! You can even build DAWGs from scratch using the Python bindings, though we don't necessarily recommend that.
The Python bindings are generated using maturin. First install maturin in your Python environment:
pip install maturin
Then, you should be able to build (or rebuild) the Python bindings with:
source scripts/rebuild_bindings.sh
(If above doesn't work) Alternatively, cd
into the Python bindings directory (bindings/python
) and run:
make install
(If above still doesn't work) You can build the bindings in two steps:
python -m maturin build --release
pip install target/wheels/*.whl
After installing the bindings, you should be able to import the library:
from rusty_dawg import Cdawg, DiskCdawg
Refer to scripts/cdawg/test_cdawg_matches_dawg.py for an example of how to build and use a CDAWG in RAM with the Python bindings. To use a disk CDAWG instead, you can use DiskCdawg
instead of Cdawg
. scripts/cdawg/test_load_cdawg.py shows an example of how to load a pre-built DiskCdawg
.
@misc{merrill2024evaluatingngramnoveltylanguage,
title={Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG},
author={William Merrill and Noah A. Smith and Yanai Elazar},
year={2024},
eprint={2406.13069},
archivePrefix={arXiv},
primaryClass={cs.CL}
url={https://arxiv.org/abs/2406.13069},
}