tantivy_warc_indexer

Crates.io: tantivy_warc_indexer
lib.rs: tantivy_warc_indexer
version: 0.3.0
created_at: 2021-08-19 18:03:43.83464+00
updated_at: 2024-12-22 00:25:44.88342+00
description: Builds a tantivy index from common crawl warc.wet files
repository: https://github.com/ahcm/tantivy_warc_indexer
id: 439713
size: 77,066
author: Andreas Hauser (ahcm)

tantivy_warc_indexer

tantivy_warc_indexer builds a tantivy index from Common Crawl warc.wet files and PubMed Entrez articles.

Build

Install Rust (e.g. via rustup), then run:

make

Usage

./target/release/tantivy_warc_indexer --help
WARC Indexer

Usage:
  warc_parser [-t <threads>] [--from <from>] [--to <to>] -s <format> <index> <warc_dir>
  warc_parser (-h | --help)

Options:
  -h --help      Show this help
  -s <source>    type of source files (WARC or ENTREZ or WIKIPEDIA_ABSTRACT)
  -t <threads>   number of threads to use, default 4
  --from <from>  skip files until from
  --to <to>      skip files after to

Run

Where <index> is the directory of an empty index you created, e.g. with tantivy-cli, and <warc_dir> is the path to the directory with the Common Crawl warc.wet or warc.wet.gz files. Depending on your system, this might take a few days or weeks.

./target/release/tantivy_warc_indexer -s WARC ../common_crawl_tantivy_index ../wet

To create an index:

mkdir ../common_crawl_tantivy_index
cp template/meta.json ../common_crawl_tantivy_index/
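
Since a full run can take days, it may be worth checking the inputs before starting. The following is a minimal sketch, not part of the project: the helper names `count_wet_files` and `prepare_index` are hypothetical, and it assumes the WET files sit directly in <warc_dir> (not in subdirectories) with names ending in .warc.wet or .warc.wet.gz.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with tantivy_warc_indexer):
# count the WET inputs found directly in the given directory.
# Assumption: files are named *.warc.wet or *.warc.wet.gz.
count_wet_files() {
    find "$1" -maxdepth 1 -type f \
        \( -name '*.warc.wet' -o -name '*.warc.wet.gz' \) | wc -l | tr -d ' '
}

# Hypothetical helper: create the index directory if it is missing
# and seed it with a meta.json template, mirroring the two commands above.
prepare_index() {
    index_dir=$1
    template=$2
    mkdir -p "$index_dir" && cp "$template" "$index_dir/"
}
```

With these defined, one could call `prepare_index ../common_crawl_tantivy_index template/meta.json`, confirm that `count_wet_files ../wet` prints a non-zero number, and only then start the indexer.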

Best, Andreas
