Crates.io | sesdiff |
lib.rs | sesdiff |
version | 0.3.1 |
source | src |
created_at | 2020-08-14 09:49:45.828937 |
updated_at | 2024-10-14 13:38:02.988475 |
description | Generates a shortest edit script (Myers' diff algorithm) to indicate how to get from the strings in column A to the strings in column B. Also provides the edit distance (levenshtein). |
homepage | https://github.com/proycon/sesdiff/ |
repository | https://github.com/proycon/sesdiff |
max_upload_size | |
id | 276524 |
size | 92,697 |
This is a small and fast command line tool and Rust library that reads two-column tab separated input from standard input and computes the shortest edit script (Myers' diff algorithm) to go from the string in column A to the string in column B. It also computes the edit distance (also known as Levenshtein distance).
There is also a Python binding available if you want to use sesdiff from Python. The documentation here covers the command-line version.
It was written to build lemmatisers.
Install it using Rust's package manager:
cargo install sesdiff
No cargo/rust on your system yet? Run sudo apt install cargo on Debian/Ubuntu based systems, brew install rust on macOS, or use rustup.
This tool builds upon Dissimilar, which provides the actual diff algorithm (it is downloaded and compiled in automatically).
$ sesdiff < input.tsv
Example input and output (reformatted for legibility, the first two columns correspond to the input). Output is in a four-column tab separated format:
hablaron hablar =[hablar]-[on] 2
contaron contar =[contar]-[on] 2
pidieron pedir =[p]-[i]+[e]=[di]-[eron]+[r] 6
говорим говорить =[говори]-[м]+[ть] 3
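Because the real output is tab separated, it is easy to post-process. The following is a minimal Python sketch, assuming exactly four columns per line as in the example above. The names SesdiffRow and parse_row are made up for this illustration and are not part of sesdiff or its Python binding:

```python
from typing import NamedTuple


class SesdiffRow(NamedTuple):
    source: str    # column A of the input
    target: str    # column B of the input
    script: str    # edit script to get from A to B
    distance: int  # edit distance reported in the fourth column


def parse_row(line: str) -> SesdiffRow:
    """Split one tab separated output line into its four columns."""
    source, target, script, distance = line.rstrip("\n").split("\t")
    return SesdiffRow(source, target, script, int(distance))


if __name__ == "__main__":
    row = parse_row("hablaron\thablar\t=[hablar]-[on]\t2\n")
    print(row.script, row.distance)
```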
By default the full edit script will be provided in a simple language:
=[] - The text between brackets is identical in strings A and B
=[#n] - If you use the --abstract parameter, this will be used instead, where n represents a number indicating the length of the text that is identical in strings A and B
-[] - The text between brackets is removed to get to string B
+[] - The text between brackets is added to get to string B
For lemmatisation purposes, it makes sense for many languages to look at suffixes (from right to left) and strip common prefixes. Pass the --suffix option for that behaviour; the output is now:
$ sesdiff --suffix < input.tsv
hablaron hablar -[on] 2
contaron contar -[on] 2
pidieron pedir -[eron]+[r]=[di]-[i]+[e] 6
говорим говорить -[м]+[ть] 3
Note that the edit scripts in suffix mode are formulated differently than in normal mode (they start from the right too). There is also a --prefix option that strips common suffixes.
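To make the difference in reading direction concrete, here is a small Python sketch that rebuilds both strings from an edit script, either left to right (normal mode) or right to left (suffix mode). It is an illustration only, not how sesdiff itself is implemented: the names parse_script and reconstruct are hypothetical, and the sketch assumes the bracketed text never contains a ] character.

```python
import re
from typing import List, Tuple

# One instruction looks like =[text], -[text] or +[text].
INSTRUCTION = re.compile(r"([=+-])\[([^\]]*)\]")


def parse_script(script: str) -> List[Tuple[str, str]]:
    """Return the script as a list of (operator, text) pairs."""
    return INSTRUCTION.findall(script)


def reconstruct(script: str, from_right: bool = False) -> Tuple[str, str]:
    """Rebuild the parts of strings A and B that the script describes."""
    a, b = "", ""
    for op, text in parse_script(script):
        if op in "=-":  # present in A: either kept or removed
            a = text + a if from_right else a + text
        if op in "=+":  # present in B: either kept or added
            b = text + b if from_right else b + text
    return a, b


if __name__ == "__main__":
    # Normal mode: the script covers both strings completely.
    print(reconstruct("=[p]-[i]+[e]=[di]-[eron]+[r]"))
    # Suffix mode: the common prefix "p" is stripped and not part of the script.
    print(reconstruct("-[eron]+[r]=[di]-[i]+[e]", from_right=True))
```

In suffix mode the result only covers the part of each string after the stripped common prefix, which is why the script for pidieron and pedir above never mentions the leading p.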
Use the --abstract parameter to get a slightly more abstract edit script that refers to the length of unchanged parts rather than their contents. You would then get:
pidieron pedir -[eron]+[r]=[#2]-[i]+[e] 6
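The relation between the two notations can be sketched in a few lines of Python. The function name abstract_script is hypothetical, and the sketch assumes that n counts characters. That is consistent with =[#2] standing for di in the example, but the unit is not otherwise specified here.

```python
import re


def abstract_script(script: str) -> str:
    """Replace every =[text] by =[#n], where n is the length of text."""
    return re.sub(
        r"=\[([^\]]*)\]",
        lambda match: "=[#{}]".format(len(match.group(1))),
        script,
    )


if __name__ == "__main__":
    print(abstract_script("-[eron]+[r]=[di]-[i]+[e]"))
```

Deletions and additions are left untouched; only the unchanged parts lose their contents.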
Sesdiff can also apply edit scripts to your input: use the --apply flag and feed the tool tab separated input with a string in the first column and an edit script in the second, as in the following example input2.tsv:
$ cat input2.tsv
pidieron -[eron]+[r]=[di]-[i]+[e]
Run sesdiff as follows and a third column will be added with the solution:
$ sesdiff --suffix --apply < input2.tsv
pidieron -[eron]+[r]=[di]-[i]+[e] pedir
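What applying a suffix-mode script involves can be sketched in Python as follows. This is an approximation of the behaviour shown above, not sesdiff's actual code: the name apply_suffix_script is hypothetical, only the non-abstract notation is handled, and returning None when the script does not match is a choice made for this sketch (what the tool does in that case is not covered here).

```python
import re
from typing import Optional

# One instruction looks like =[text], -[text] or +[text].
INSTRUCTION = re.compile(r"([=+-])\[([^\]]*)\]")


def apply_suffix_script(word: str, script: str) -> Optional[str]:
    """Apply a suffix-mode edit script to word, working from the right."""
    rest = word   # part of the word not yet consumed, shrinks from the right
    result = ""   # output suffix, built up from right to left
    for op, text in INSTRUCTION.findall(script):
        if op == "+":
            result = text + result          # added text appears in the output only
            continue
        if not rest.endswith(text):
            return None                     # script does not fit this word
        rest = rest[: len(rest) - len(text)]
        if op == "=":
            result = text + result          # unchanged text is carried over
    # Whatever is left on the left is the untouched common prefix.
    return rest + result


if __name__ == "__main__":
    print(apply_suffix_script("pidieron", "-[eron]+[r]=[di]-[i]+[e]"))
```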
When using --apply, you can also use the extra --infix parameter to indicate that the tool must attempt to match the edit script against any infix in the string, including multiple occurrences. Consider the following example that replaces all letters a with o:
$ cat input3.tsv
hahaha -[a]+[o]
$ sesdiff --infix --apply < input3.tsv
hahaha -[a]+[o] hohoho
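One way to think about infix matching is as a search and replace: the kept and deleted parts of the script form the text to look for, and the kept and added parts form what it becomes. The Python sketch below follows that reading. The name apply_infix_script is hypothetical, and the assumption that occurrences are replaced left to right without overlap is made for this sketch rather than taken from sesdiff.

```python
import re

# One instruction looks like =[text], -[text] or +[text].
INSTRUCTION = re.compile(r"([=+-])\[([^\]]*)\]")


def apply_infix_script(word: str, script: str) -> str:
    """Apply a script to every non-overlapping matching infix of word."""
    ops = INSTRUCTION.findall(script)
    # Text that must be present in the word: unchanged and deleted parts.
    pattern = "".join(text for op, text in ops if op in "=-")
    # Text that takes its place: unchanged and added parts.
    replacement = "".join(text for op, text in ops if op in "=+")
    if not pattern:
        return word  # nothing to match against, leave the word as it is
    return word.replace(pattern, replacement)


if __name__ == "__main__":
    print(apply_infix_script("hahaha", "-[a]+[o]"))
```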
In --apply mode, you can also make edit scripts applicable to multiple patterns by using the | operator. This is only allowed for deletions (-[]) and equality checks (=[]):
$ cat input4.tsv
hihaho -[a|i|o]+[e]
$ sesdiff --infix --apply < input4.tsv
hihaho -[a|i|o]+[e] hehehe
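The | operator behaves much like alternation in a regular expression, which suggests the following Python sketch. It is an illustration under assumptions, not the tool's implementation: the name apply_infix_alternatives is hypothetical, each alternative is treated as literal text, and matches are replaced left to right without overlap.

```python
import re

# One instruction looks like =[text], -[text] or +[text].
INSTRUCTION = re.compile(r"([=+-])\[([^\]]*)\]")


def apply_infix_alternatives(word: str, script: str) -> str:
    """Apply a script with | alternatives to every matching infix of word."""
    ops = INSTRUCTION.findall(script)
    groups = []
    for op, text in ops:
        if op == "+":
            continue  # additions are not matched against the word
        # One capturing group per =[...] or -[...], alternatives taken literally.
        groups.append("(" + "|".join(re.escape(alt) for alt in text.split("|")) + ")")
    if not groups:
        return word
    regex = re.compile("".join(groups))

    def rewrite(match):
        out, index = [], 0
        for op, text in ops:
            if op == "+":
                out.append(text)                # insert the added text
            else:
                index += 1
                if op == "=":
                    out.append(match.group(index))  # keep what was matched
                # for "-" the matched text is dropped
        return "".join(out)

    return regex.sub(rewrite, word)


if __name__ == "__main__":
    print(apply_infix_alternatives("hihaho", "-[a|i|o]+[e]"))
```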
GNU General Public Licence v3