Crates.io | tantivy_warc_indexer |
lib.rs | tantivy_warc_indexer |
version | 0.2.0 |
source | src |
created_at | 2021-08-19 18:03:43.83464 |
updated_at | 2021-08-19 18:03:43.83464 |
description | Builds a tantivy index from common crawl warc.wet files |
homepage | |
repository | https://github.com/ahcm/tantivy_warc_indexer |
max_upload_size | |
id | 439713 |
size | 43,015 |
tantivy_warc_indexer builds a tantivy index from Common Crawl warc.wet files.
Install Rust (e.g. via rustup), then build the binary and show its help:
make
./target/release/tantivy_warc_indexer --help
WARC Indexer
Usage:
warc_parser [-t <threads>] [--from <from>] [--to <to>] <index> <warc_dir>
warc_parser (-h | --help)
Options:
-h --help Show this help
-t <threads> number of threads to use, default 4
--from <from> skip files until from
--to <to> skip files after to
For example, with <index> set to ../common_crawl_tantivy_index and <warc_dir> set to ../wet:
./target/release/tantivy_warc_indexer ../common_crawl_tantivy_index ../wet
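The --from and --to options above restrict which files in <warc_dir> get indexed. The following is a minimal sketch of one plausible reading of that behavior, not the crate's actual code: it assumes the file names are processed in a fixed order, that the bounds match whole file names, and that both bounds are inclusive. The function name select_files and the sample file names are hypothetical.

```rust
/// Hypothetical helper (not part of tantivy_warc_indexer): keep the files
/// from `from` up to and including `to`, in the order they are given.
/// Assumption: both bounds are inclusive and match whole file names.
fn select_files<'a>(files: &[&'a str], from: Option<&str>, to: Option<&str>) -> Vec<&'a str> {
    let mut selected = Vec::new();
    // Without --from, selection starts at the first file.
    let mut started = from.is_none();
    for &name in files {
        if !started {
            if Some(name) == from {
                started = true; // reached the --from file, start keeping files
            } else {
                continue; // "skip files until from"
            }
        }
        selected.push(name);
        if Some(name) == to {
            break; // "skip files after to"
        }
    }
    selected
}

fn main() {
    // Made-up file names for illustration only.
    let files = ["a.warc.wet", "b.warc.wet", "c.warc.wet", "d.warc.wet"];
    let picked = select_files(&files, Some("b.warc.wet"), Some("c.warc.wet"));
    println!("{:?}", picked); // prints ["b.warc.wet", "c.warc.wet"]
}
```

Under these assumptions, a long indexing job could be split into several runs by giving each run a different --from/--to window over the same directory.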
To create a new index directory before the first run:
mkdir ../common_crawl_tantivy_index
cp template/meta.json ../common_crawl_tantivy_index/
Best,
Andreas