Crates.io | ungoliant |
lib.rs | ungoliant |
version | 2.0.0 |
source | src |
created_at | 2021-02-15 03:33:33.298064 |
updated_at | 2023-02-24 12:17:31.531827 |
description | The pipeline for the OSCAR corpus. |
homepage | https://github.com/oscar-project/ungoliant |
repository | https://github.com/oscar-project/ungoliant |
id | 355317 |
size | 884,985 |
🕷️ Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl. 🕷️
It is currently the generation pipeline for the OSCAR corpus, built from CommonCrawl. Ungoliant is a replacement for goclassy.
Via cargo: cargo install ungoliant
Via git: cargo install --git https://github.com/oscar-corpus/ungoliant
Ungoliant has numerous dependencies, which are compiled during installation. cmake and gcc may also be required, since the project uses fasttext-rs.
The KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are not correct.
To enable it, install the KenLM requirements:
apt install -y libboost-all-dev libeigen3-dev
then use cargo install ungoliant --features kenlm, or cargo b --features kenlm if you're building from source.
To get the fastText language identification model, use curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin.
The usual way of generating corpora is:
1. Fetch the wet.paths.gz file from the last CommonCrawl dump and decompress it.
2. Download the files using the download command.
3. Generate the corpus using the pipeline command (it may take some time).
You can find more information in each command's --help.
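To make the role of the wet.paths file more concrete, here is a minimal sketch of how its entries can be turned into downloadable URLs. It assumes that each line of the decompressed file is a path relative to Common Crawl's public endpoint, https://data.commoncrawl.org/. The function name wet_urls and the constant COMMON_CRAWL_BASE are hypothetical and are not taken from Ungoliant's source; the download command may well do this differently.

```rust
/// Base URL under which Common Crawl serves its files.
/// Assumption: the public HTTPS endpoint; not read from Ungoliant's code.
const COMMON_CRAWL_BASE: &str = "https://data.commoncrawl.org/";

/// Turns the contents of a decompressed `wet.paths` file into full URLs.
/// Hypothetical helper for illustration only.
/// Surrounding whitespace is trimmed and blank lines are skipped.
pub fn wet_urls(paths_file: &str) -> Vec<String> {
    paths_file
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        // Avoid a double slash if a path happens to start with '/'.
        .map(|line| format!("{}{}", COMMON_CRAWL_BASE, line.trim_start_matches('/')))
        .collect()
}
```

Calling wet_urls on the text of wet.paths yields one URL per WET shard, which is roughly the list a downloader would then fetch one by one.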
ungoliant 2
corpus generation tool.

USAGE:
    ungoliant <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.
Ungoliant is not yet on docs.rs: use cargo doc --bins --open to open the documentation.
Head over to the OSCAR Documentation for more information about the project.