:toc: right
:sectnums:
:sectanchors:
:cmd: smlr
:proj: pass:q[`smlr`]
:version: v0.1.3
= {proj}

image::smlr.png[align="center", width=20%]

== Install

[source,bash,subs="attributes+"]
cargo install smlr

== Help

This is the documentation for `{cmd} {version}`.

Calling `{cmd} --help` will show the available options:

----
include::help.txt[]
----

The rustdoc can be found https://chevdor.gitlab.io/smlr/smlr[here].

== Overview

{proj} is a command line utility that helps you find *similar* entries in text data. It does not aim to replace tools such as `uniq` or `awk`.

The following data is available in the `samples` folder. Let's consider a few examples:

== Performance

A test run on my iMac (fusion drive, no SSD) shows 11 MB processed in just under 20 seconds.

----
$ TARGET=/var/log/install.log; du -h $TARGET | head -n 1; time smlr -lcb $TARGET > /tmp/output.txt
 11M    /var/log/install.log
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/smlr -lcb /var/log/install.log`

real    0m19.603s
user    0m18.310s
sys     0m1.202s
----

=== Example #1

Say we want to find duplicates in the following data:

.Table uniq vs smlr
:file: file03.txt
[cols="<.<,<.<,<.<", options="header"]
|===
| input: samples/{file} | `uniq -i -c {file}` | `smlr -cbf {file}`

a|
----
include::samples/file03.txt[]
----

a|
----
   1 Pizza
   1 Ice Cream
   1 Waffle
   1 PiZZa
   1 Waffle
   1 pizza
   1 Pizza
   1 Peanuts
   1 PIZZA
   1 Beef Jerky
   1 Popcorn
   2 pizza
----

a|
----
   7 Pizza
   1 Ice Cream
   2 Waffle
   1 Peanuts
   1 Beef Jerky
   1 Popcorn
----
|===

NOTE: To my surprise, `uniq` fails to find some of the case-insensitive matches. The result from `uniq` is a bit surprising, as I expected to find 4x pizza and not 2 here and 2 there.

This happens because `uniq` only compares adjacent lines, so the issue can easily be solved using the `sort` command.
In the following example, we process the input using `sort` to group all duplicates. We then use `sort` again to show the most frequent duplicates first.

[source,bash]
----
$ cat file03.txt | sort | uniq -i -c | sort -r
   3 pizza
   3 PiZZa
   2 Waffle
   1 Popcorn
   1 Peanuts
   1 PIZZA
   1 Ice Cream
   1 Beef Jerky
----

`uniq` still misses some duplicates that differ only in case, despite `-i`, because a plain `sort` does not necessarily place those variants next to each other. Sorting case-insensitively with `sort -f` helps further:

----
$ cat file03.txt | sort -f | uniq -i -c | sort -r
   7 PIZZA
   2 Waffle
   1 Popcorn
   1 Peanuts
   1 Ice Cream
   1 Beef Jerky
----

=== Example #2

If we introduce some variance in some fields of our input, `uniq` will fail to find the duplicates. Here is the data we work on:

----
include::samples/file04.txt[]
----

Here we are stuck with our previous `sort+uniq` method. None of the following really works as expected:

- `cat file04.txt | sort -f | uniq -i -c -f 1`
- `cat file04.txt | sort -f | uniq -i -c -s 10`

The reason for the failure is that our `sort` trick no longer works: sorting whole lines no longer groups the duplicates together. We call `awk` to the rescue:

----
$ cat file04.txt | awk '{print $2}' | sort -f | uniq -c -i | sort -r
   7 PIZZA
   2 Waffle
   1 Popcorn
   1 Peanuts
   1 Ice
   1 Beef
----

=== Example #3

Here is a nasty example where `uniq`, even with the help of other friendly commands, is of no use:

----
include::samples/file05.txt[]
----

NOTE: This example contains a few typos on purpose. These are the kind of typos that are typically hard to spot (depending on your font and concentration level!).

Let's see how {proj} handles that.

----
$ time smlr -cbf file05.txt | sort -r
   7 2020-01-01 Pizza
   2 2020-03-07 POPCORN
   2 2020-01-02 Waffle
   2 2020-01-01 Ice Cream
   1 2020-03-06 Peanuts
   1 2020-03-06 Beef Jerky

real    0m0.005s
user    0m0.002s
sys     0m0.004s
----

WARNING: While {proj} can do more, this comes at a cost in CPU and memory. Beware when parsing huge files!

== Install from Git

[source,bash,subs="attributes+"]
cargo install --git https://gitlab.com/chevdor/smlr.git --tag {version}

[subs="attributes"]
WARNING: Installing the version from `master` is not recommended for production: do *NOT* omit the `--tag {version}` in the previous command.

== License

----
include::LICENSE[]
----