| Crates.io | tokenizations |
| lib.rs | tokenizations |
| version | 0.4.2 |
| created_at | 2020-01-02 13:54:23.190952+00 |
| updated_at | 2021-04-01 15:26:20.969103+00 |
| description | Tokenizations alignments library |
| homepage | https://github.com/tamuhey/tokenizations |
| repository | https://github.com/tamuhey/tokenizations |
| max_upload_size | |
| id | 194488 |
| size | 50,588 |

Demo: demo
Rust documentation: docs.rs
Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly
$ pip install -U pip # update pip
$ pip install pytokenizations
This library uses maturin to build the wheel.
$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build
The wheel is now in the python/target/wheels directory, and you can install it with pip install *whl.
get_alignments
def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...
Returns alignment mappings for two different tokenizations:
>>> import tokenizations
>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"]  # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]
a2b[i] is the list of indices in tokens_b that are aligned with tokens_a[i]. Likewise, b2a[j] is the list of indices in tokens_a that are aligned with tokens_b[j].
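As a rough illustration of how these mappings can be consumed, the sketch below carries per-token labels from tokens_a over to tokens_b. It does not call the library: the alignment is hard-coded from the example above, and the helper names (invert_alignment, project_labels) and the labels "X" and "Y" are hypothetical. It also assumes that a2b and b2a are mutual inverses, as they are in the example.

```python
from typing import List, Sequence


def invert_alignment(a2b: Sequence[Sequence[int]], len_b: int) -> List[List[int]]:
    """Build b2a from a2b: b2a[j] collects every i whose a2b[i] contains j."""
    b2a: List[List[int]] = [[] for _ in range(len_b)]
    for i, targets in enumerate(a2b):
        for j in targets:
            b2a[j].append(i)
    return b2a


def project_labels(
    labels_a: Sequence[str], b2a: Sequence[Sequence[int]]
) -> List[List[str]]:
    """For each token in b, gather the labels of the a-tokens aligned to it."""
    return [[labels_a[i] for i in sources] for sources in b2a]


# Alignment copied from the example above:
# tokens_a = ["å", "BC"], tokens_b = ["abc"]
a2b = [[0], [0]]
b2a = invert_alignment(a2b, len_b=1)

labels_a = ["X", "Y"]  # hypothetical labels, one per token in tokens_a
labels_b = project_labels(labels_a, b2a)

print(b2a)       # [[0, 1]]
print(labels_b)  # [['X', 'Y']]
```

A token in tokens_b that covers several tokens in tokens_a receives several labels, so the caller still has to decide how to merge them (for example, by taking the first one).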
For Rust usage, see the documentation on docs.rs.