| field | value |
|-------|-------|
| Crates.io | bert_create_pretraining |
| lib.rs | bert_create_pretraining |
| version | 0.1.3 |
| source | src |
| created_at | 2023-02-16 11:37:06.408881 |
| updated_at | 2023-02-16 16:25:59.87023 |
| description | This crate is a Rust port of Google's BERT create pretraining data. |
| homepage | |
| repository | |
| max_upload_size | |
| id | 786685 |
| size | 63,742 |
This crate provides a Rust port of the original `create_pretraining_data.py` script from the Google BERT repository.
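The original script expects plain-text input with one sentence per line and a blank line between documents; it is assumed here that the port keeps this convention. A minimal input file would look like:

    The first sentence of the first document.
    The second sentence of the first document.

    The first sentence of the second document.

The binary itself is installed with cargo: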
$ cargo install bert_create_pretraining
You can use the `bert_create_pretraining` binary to create the pretraining data for BERT in parallel. The following `find`/`xargs` pipeline processes every `.txt` file in `${DATA_DIR}`, running `${NUM_PROC}` processes at once, and shows the supported arguments:
$ find "${DATA_DIR}" -name "*.txt" | xargs -I% -P $NUM_PROC -n 1 \
basename % | xargs -I% -P ${NUM_PROC} -n 1 \
"${TARGET_DIR}/bert_create_pretraining" \
--input-file="${DATA_DIR}/%" \
--output-file="${OUTPUT_DIR}/%.tfrecord" \
--vocab-file="${VOCAB_DIR}/vocab.txt" \
--max-seq-length=512 \
--max-predictions-per-seq=75 \
--masked-lm-prob=0.15 \
--random-seed=12345 \
--dupe-factor=5
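In the original Python script, `--masked-lm-prob` sets the fraction of tokens masked for the masked-LM objective and `--dupe-factor` sets how many times the corpus is duplicated with different random masks; the port is assumed to mirror these semantics. For a quick test on a single file, the binary can also be invoked directly with the same flags (the file paths below are placeholders):

$ bert_create_pretraining \
    --input-file=corpus.txt \
    --output-file=corpus.tfrecord \
    --vocab-file=vocab.txt \
    --max-seq-length=512 \
    --max-predictions-per-seq=75 \
    --masked-lm-prob=0.15 \
    --random-seed=12345 \
    --dupe-factor=5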
You can check the full list of options with the following command:
$ bert_create_pretraining --help
Licensed under the MIT license. See the LICENSE file for the full license text.