# `smlr`

![smlr](smlr.png)

## Install

    cargo install smlr

## Help

This is the documentation for `smlr v0.1.3`.

Calling `smlr --help` will show the available options:

    smlr 0.1.3
    Wilfried Kopp
    Find similar lines in a file or stdin.

    USAGE:
        smlr [FLAGS] [OPTIONS] [FILE]

    FLAGS:
            --case-sensitive    Weather the checks consider the case or not
        -c, --count             Show the number of occurences
            --debug             Debug mode, slower
        -f, --full              Set the score to the max
            --help              Prints help information
        -h, --headers           Show the headers
        -b, --ignore-spaces     Ignore spaces
        -i, --invert            Invert the result
        -l, --line-numbers      How many of the next lines are considered in the match. A larger value requires more processing.
            --no-index          Prevent saving index to disk
            --strict            = Max distance = 0
        -V, --version           Prints version information
        -v, --verbose           Sets the level of verbosity

    OPTIONS:
            --algo                  Algo used to calculate the distance [default: Levenshtein] [possible values: Levenshtein, DamerauLevenshtein]
        -d, --distance              The max distance [default: 3]
            --persistence-folder    Location where the persistence files will be stored if the mode is custom [default: temp]
            --persistence-mode      Location where the persistence files will be stored. In 'beside' mode, the current folder will be used. [default: temp] [possible values: temp, custom, beside]
        -s                          How many of the next lines are considered in the match. A larger value requires more processing. [default: 10]

    ARGS:
        Sets the input file to use. Alternatively, you may pipe .

The rustdoc can be found [here](https://chevdor.gitlab.io/smlr/smlr).

## Overview

`smlr` is a command line utility that helps you find **similar** entries in text data. It does not aim at replacing tools such as `uniq` or `awk`.

The following data is available in the `samples` folder. Let’s consider a few examples:

## Performance

A test run on my iMac (fusion drive, no SSD) shows 11 MB processed in under 30 s.

    $ TARGET=/var/log/install.log; du -h $TARGET | head -n 1; time smlr -lcb $TARGET > /tmp/output.txt
     11M    /var/log/install.log
        Finished release [optimized] target(s) in 0.01s
         Running `target/release/smlr -lcb /var/log/install.log`

    real    0m19.603s
    user    0m18.310s
    sys     0m1.202s

### Example \#1

Say we want to find duplicates in the following data:
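
Table: `uniq` vs `smlr`

| input: `samples/file03.txt` | `uniq -i -c file03.txt` | `smlr -cbf file03.txt` |
|-----------------------------|-------------------------|------------------------|
| Pizza                       | 1 Pizza                 | 7 Pizza                |
| Ice Cream                   | 1 Ice Cream             | 1 Ice Cream            |
| Waffle                      | 1 Waffle                | 2 Waffle               |
| PiZZa                       | 1 PiZZa                 | 1 Peanuts              |
| Waffle                      | 1 Waffle                | 1 Beef Jerky           |
| pizza                       | 1 pizza                 | 1 Popcorn              |
| Pizza                       | 1 Pizza                 |                        |
| Pizaz                       | 1 Peanuts               |                        |
| Peanuts                     | 1 PIZZA                 |                        |
| PIZZA                       | 1 Beef Jerky            |                        |
| Beef Jerky                  | 1 Popcorn               |                        |
| Popcorn                     | 2 pizza                 |                        |
| pizza                       |                         |                        |
| pizza                       |                         |                        |
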
To my surprise, `uniq` seems to fail at finding some of the case-insensitive matches. The result from `uniq` is a bit surprising, as I expected to find 4x pizza and not 2 here and 2 there. The reason is that `uniq` only merges duplicates that sit on adjacent lines. This issue can easily be solved using the `sort` command.

In the following example, we process the input using `sort` to group all duplicates. We then use `sort` again to show the most frequent duplicates first.

    $ cat file03.txt | sort | uniq -i -c | sort -r
       3 pizza
       3 PiZZa
       2 Waffle
       1 Popcorn
       1 Peanuts
       1 PIZZA
       1 Ice Cream
       1 Beef Jerky

`uniq` still has an issue with duplicates that differ in case, despite using `-i`, because a plain `sort` does not place them next to each other. `sort -f`, which ignores case, helps further.

    $ cat file03.txt | sort -f | uniq -i -c | sort -r
       7 PIZZA
       2 Waffle
       1 Popcorn
       1 Peanuts
       1 Ice Cream
       1 Beef Jerky

### Example \#2

If we introduce some variance in some fields of our input, `uniq` will fail to find the duplicates. Here is the data we work on:

    2020-01-01 Pizza
    2020-01-01 Ice Cream
    2020-01-02 Waffle
    2020-01-03 PiZZa
    2020-03-05 Waffle
    2020-03-05 pizza
    2020-03-06 pizza
    2020-03-06 Peanuts
    2020-03-06 PIZZA
    2020-03-06 Beef Jerky
    2020-03-07 Popcorn
    2020-03-07 pizza
    2020-03-08 pizza

Here we are stuck with our previous `sort+uniq` method. None of the following really works as expected:

- `cat file04.txt | sort -f | uniq -i -c -f 1`
- `cat file04.txt | sort -f | uniq -i -c -s 10`

The reason for the failure is that our `sort` trick no longer works: the lines are now ordered by their leading date, so identical entries no longer end up next to each other.
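
To see why the `sort` trick breaks, here is a minimal sketch on three lines shaped like `file04.txt`. The variable names (`sample`, `by_line`, `by_field`) are made up for illustration, and the `sort -k 2` variant is only one possible workaround shown for contrast, not something the original text relies on. It assumes a `uniq` that supports `-i`, as in the commands above.

```shell
# Hypothetical sample: three lines in the same "date food" shape as file04.txt.
sample='2020-01-01 Pizza
2020-01-02 Waffle
2020-03-05 pizza'

# Sorting whole lines orders them by date, so the two pizzas stay apart.
# uniq skips the date (-f 1) but only merges adjacent lines: nothing is merged.
by_line=$(printf '%s\n' "$sample" | sort -f | uniq -i -c -f 1 | wc -l | tr -d ' ')

# Assumption for contrast: sorting on the second field makes the pizzas adjacent,
# so uniq can merge them into one group.
by_field=$(printf '%s\n' "$sample" | sort -f -k 2 | uniq -i -c -f 1 | wc -l | tr -d ' ')

echo "groups when sorting whole lines: $by_line"
echo "groups when sorting on field 2:  $by_field"
```

The first count stays equal to the number of input lines, which is the failure described above; the second drops by one because the two pizza lines are merged.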

We call `awk` to the rescue:

    $ cat file04.txt | awk '{print $2}' | sort -f | uniq -c -i | sort -r
       7 PIZZA
       2 Waffle
       1 Popcorn
       1 Peanuts
       1 Ice
       1 Beef

### Example \#3

Here is a nasty example where `uniq`, even with the help of some other friendly commands, won’t be of much use:

    2020-01-01 Pizza
    2020-01-01 Ice Cream
    2020-01-01 Ice Cream
    2020-01-02 Waffle
    2020-01-03 PiZZa
    2020-03-05 Waflle
    2020-03-05 pizza
    2020-03-06 piiza
    2020-03-06 Peanuts
    2020-03-06 PlZZA
    2020-03-06 Beef Jerky
    2020-03-07 POPCORN
    2020-03-07 P0PCORN
    2020-03-07 pizza
    2020-03-08 pizza

This example contains a few typos on purpose. These are the kind of typos that are typically hard to spot (depending on your font and concentration level!). Let’s see how `smlr` can handle that.

    $ time smlr -cbf file05.txt | sort -r
    7       2020-01-01 Pizza
    2       2020-03-07 POPCORN
    2       2020-01-02 Waffle
    2       2020-01-01 Ice Cream
    1       2020-03-06 Peanuts
    1       2020-03-06 Beef Jerky

    real    0m0.005s
    user    0m0.002s
    sys     0m0.004s

While `smlr` can do more, it has a cost in CPU and memory. Beware when parsing huge files!

## Install from GIT

    cargo install --git https://gitlab.com/chevdor/smlr.git --tag v0.1.3

Installing the version from `master` is not recommended for production: do **NOT** omit the `--tag v0.1.3` in the previous command.

## License

MIT License

Copyright (c) 2019-2020 Wilfried Kopp - Chevdor

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.