Crates.io | powierza-coefficient |
lib.rs | powierza-coefficient |
version | 1.0.2 |
source | src |
created_at | 2021-10-14 20:23:52.459458 |
updated_at | 2022-12-29 13:21:31.151447 |
description | Powierża coefficient is a statistic for gauging if one string is an abbreviation of another |
homepage | |
repository | https://github.com/micouy/powierza-coefficient |
max_upload_size | |
id | 465090 |
size | 35,891 |
Powierża coefficient is a statistic on strings for gauging whether a string is an "abbreviation" of another. The function is not symmetric so it is not a metric.
Let:

- `T` (text) be a non-empty string.
- `P` (pattern) be a non-empty subsequence of `T`.
- `p` be a partition of `P` and `p_i` be its elements, where:
  - `p_i` is equal to some substring of `T`, `t_i`.
  - `t_i` do not overlap.
  - `t_i` are in the same order as `p_i`.

Powierża coefficient is the number of elements of the shortest partition `p`, less one. Alternatively, it is the number of gaps between the substrings `t_i`.
Used terms: `xz` is a subsequence of `xyz` but it is not its substring.

Take all characters from the pattern and, while preserving the original order, align them with the same characters in the text so that there are as few groups of characters as possible. The coefficient is the number of gaps between these groups.
| P | T | p | Powierża coefficient |
|---|---|---|---|
| `powcoeff` | `powierża coefficient` | `pow`, `coeff` | 1 |
| `abc` | `a_b_c` | `a`, `b`, `c` | 2 |
| `abc` | `abc` | `abc` | 0 |
| `abc` | `xyz` | (none) | not defined |
For more examples, see tests.
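The definition can also be checked directly with a brute-force search over alignments. The sketch below is not the crate's implementation: `min_groups` and `coefficient_by_definition` are hypothetical names introduced here, and the search is exponential in the worst case, so it is only meant for short strings such as the ones in the table.

```rust
/// Fewest groups needed to align `p[pi..]` inside `t[ti..]`.
/// `adjacent` is true when `p[pi - 1]` was matched at `t[ti - 1]`, so a match
/// at `t[ti]` extends the current group instead of opening a new one.
fn min_groups(p: &[char], t: &[char], pi: usize, ti: usize, adjacent: bool) -> Option<u32> {
    if pi == p.len() {
        return Some(0); // the whole pattern has been placed
    }
    if ti == t.len() {
        return None; // text exhausted before the pattern
    }
    // Option 1: skip this text character, which breaks any running group.
    let skip = min_groups(p, t, pi, ti + 1, false);
    // Option 2: match the characters if they are equal.
    let take = if p[pi] == t[ti] {
        min_groups(p, t, pi + 1, ti + 1, true).map(|g| g + u32::from(!adjacent))
    } else {
        None
    };
    match (skip, take) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}

/// Number of groups in the best alignment, less one (hypothetical helper).
/// Returns `None` when the pattern is empty or is not a subsequence of the text.
fn coefficient_by_definition(pattern: &str, text: &str) -> Option<u32> {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    if p.is_empty() {
        return None;
    }
    min_groups(&p, &t, 0, 0, false).map(|groups| groups - 1)
}
```

Calling `coefficient_by_definition("abc", "a_b_c")` finds three groups and therefore returns `Some(2)`, and the "not defined" row of the table corresponds to `None`.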
The Powierża coefficient is used in kn and in nushell to determine which of the directories' names better matches the abbreviation. Many other string coefficients and metrics were found unsuitable, including Levenshtein distance. Levenshtein distance is biased in favour of short strings. For example, the Levenshtein distance from `gra` to `programming` is greater than to `gorgia`, even though `gorgia` does not "resemble" the abbreviation. Powierża coefficient for these pairs of strings is 0 and 2 respectively, so `programming` would be chosen (correctly).
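The bias is easy to reproduce with a plain textbook Levenshtein distance. The function below is a minimal sketch written for this illustration (a two-row Wagner–Fischer table); it is not taken from this crate or from strsim.

```rust
/// Textbook Levenshtein distance, keeping only two rows of the table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // Distance from the empty prefix of `a` to every prefix of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = Vec::with_capacity(b.len() + 1);
        curr.push(i + 1); // delete the first i + 1 characters of `a`
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}
```

Because `gra` is a subsequence of both candidates, the distance is just the number of inserted characters: 8 for `programming` and 3 for `gorgia`. Ranking by Levenshtein distance would therefore pick `gorgia`.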
The algorithm was inspired by the Wagner–Fischer algorithm. It is also very similar to a solution to the longest common subsequence problem. All of these algorithms are based on a matrix. Whereas in the Wagner–Fischer algorithm (WF) there are three types of moves (horizontal, diagonal and vertical), in my algorithm there are only two: horizontal and diagonal. The main idea is that the 'cost' of a gap is always 1, no matter how long. (In WF the cost of a gap is its length.)
That means the algorithm must differentiate between cells that were filled in horizontal moves and the ones that were filled in diagonal moves. The first type of cells are cells containing `Gap(score)`; the second type, `Continuation(score)`. A horizontal move results in `Gap(score)` if the original cell contains `Gap(score)` and in `Gap(score + 1)` if the original cell contains `Continuation(score)`. The algorithm prefers moves that result in a lower score, and a diagonal move over a horizontal move if they result in the same score.
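The two kinds of cells and the effect of each move can be written down as a small type. This is a sketch for illustration only: the names `Cell`, `horizontal` and `diagonal` are assumptions, not the crate's API, and the diagonal rule (the score is carried over unchanged) is taken from the rules listed further down.

```rust
/// A filled matrix cell, tagged with the kind of move that produced it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Cell {
    /// Reached by a horizontal move: a text character was skipped.
    Gap(u32),
    /// Reached by a diagonal move: a pattern character was matched.
    Continuation(u32),
}

impl Cell {
    /// The score stored in the cell, whatever its kind.
    fn score(self) -> u32 {
        match self {
            Cell::Gap(s) | Cell::Continuation(s) => s,
        }
    }

    /// Result of a horizontal move out of this cell.
    /// Extending an existing gap is free; opening a new one costs 1.
    fn horizontal(self) -> Cell {
        match self {
            Cell::Gap(s) => Cell::Gap(s),
            Cell::Continuation(s) => Cell::Gap(s + 1),
        }
    }

    /// Result of a diagonal move out of this cell (cost 0).
    fn diagonal(self) -> Cell {
        Cell::Continuation(self.score())
    }
}
```

With this encoding, a gap of any length costs exactly 1: `Cell::Continuation(0).horizontal()` is `Gap(1)`, and every further horizontal move leaves it at `Gap(1)`.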
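Create a matrix with `n` rows and `m` columns, where `m` is the length of `S` (the text, called `T` above) and `n` is the length of `P`, so that each row corresponds to an element of `P` and each column to an element of `S`. `n` must be less than or equal to `m`. Each cell can either be empty (that's the initial state) or contain either `Gap(score)` or `Continuation(score)`.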
Begin filling the matrix from left to right and from top to bottom. The first row is special: the cell in column `x` and row `y` is set to `Continuation(0)` if the `x`th element of `S` and the `y`th element of `P` are equal. Otherwise, it is set to `Gap(score + cost)` where `score` is the score of its left neighbor and `cost` is the cost of a horizontal move. If its left neighbor is empty, the cell is left empty as well.
Other cells are filled according to these rules:
Let `x` be `a`'s upper-left neighbor and `y` be its left neighbor:
x _
y a
- The cost of a diagonal move is 0, but such a move is only possible if the element of `S` in `a`'s column and the element of `P` in `a`'s row are equal and if `x` isn't empty. After the move, `a` is set to `Continuation(score)` where `score` is `x`'s score.
- The cost of a horizontal move is 0 if `y` contains `Gap` and 1 if `y` contains `Continuation`. Such a move is only possible if `y` isn't empty. After the move, `a` is set to `Gap(score + cost)` where `score` is `y`'s score.
- If neither move is possible, `a` is left empty.

Powierża coefficient is the least value in the last row. In some cases there are no values in the last row and the coefficient is not defined.
Cells with G's were filled in horizontal moves and those with C's were filled in diagonal moves. The numbers next to the letters are cells' scores. Red cells were skipped because of an optimization. Yellow cells were left empty. The coefficient is 2.
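The matrix idea can be sketched in a few lines of Rust. This is a simplified variation, not the crate's implementation: instead of storing a single `Gap` or `Continuation` per cell and choosing between moves, it keeps the best score for each kind side by side, which avoids any tie-breaking, and it omits the cell-skipping optimization. The names `coefficient_dp` and `min_opt` are hypothetical.

```rust
/// Minimum of two optional scores, where `None` means "empty cell".
fn min_opt(a: Option<u32>, b: Option<u32>) -> Option<u32> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x.min(y)),
        (x, y) => x.or(y),
    }
}

/// Row-by-row sketch: rows follow the pattern, columns follow the text.
fn coefficient_dp(pattern: &str, text: &str) -> Option<u32> {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    if p.is_empty() || p.len() > t.len() {
        return None;
    }
    // Best score of each cell in the previous row, regardless of its kind.
    let mut prev_best: Vec<Option<u32>> = vec![None; t.len()];
    for (row, &pc) in p.iter().enumerate() {
        // Scores of cells reached by a diagonal / horizontal move in this row.
        let mut cont: Vec<Option<u32>> = vec![None; t.len()];
        let mut gap: Vec<Option<u32>> = vec![None; t.len()];
        for (col, &tc) in t.iter().enumerate() {
            if pc == tc {
                // Diagonal move: free, needs a filled upper-left neighbor.
                // In the first row a match always starts at score 0.
                cont[col] = if row == 0 {
                    Some(0)
                } else if col > 0 {
                    prev_best[col - 1]
                } else {
                    None
                };
            }
            if col > 0 {
                // Horizontal move: free out of a gap, costs 1 out of a match.
                let extend = gap[col - 1];
                let open = cont[col - 1].map(|s| s + 1);
                gap[col] = min_opt(extend, open);
            }
        }
        prev_best = cont
            .iter()
            .zip(gap.iter())
            .map(|(&c, &g)| min_opt(c, g))
            .collect();
    }
    // After the loop `prev_best` holds the last row; take its least value.
    prev_best.into_iter().flatten().min()
}
```

For the pairs discussed earlier, this sketch returns `Some(0)` for `gra` in `programming` and `Some(2)` for `gra` in `gorgia`. It runs in time proportional to the product of the two lengths and stores only one row at a time.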
The algorithm was compared with strsim's `levenshtein` in a benchmark run on the author's computer:
[1.2908 µs 1.2946 µs 1.2987 µs]
[1.7718 µs 1.7748 µs 1.7778 µs]