:toc: right
:sectnums:
:sectanchors:

:cmd: smlr
:proj: pass:q[`smlr`]
:version: v0.1.3
= {proj}

image::smlr.png[align="center", width=20%]

== Install
[source,bash,subs="attributes+"]
    cargo install smlr

== Help
This is the documentation for `{cmd} {version}`.
Calling `{cmd} --help` will show the available options:
----
include::help.txt[]
----

The rustdoc can be found https://chevdor.gitlab.io/smlr/smlr[here].

== Overview

{proj} is a command line utility that helps you find *similar* entries in text data.
It does not aim to replace tools such as `uniq` or `awk`.

The sample data used in the examples below is available in the `samples` folder.

== Performance

A test run on my iMac (Fusion Drive, no SSD) processed an 11 MB log file in about 20 seconds:

----
   $ TARGET=/var/log/install.log; du -h $TARGET | head -n 1; time smlr -lcb $TARGET > /tmp/output.txt
   11M    /var/log/install.log
   Finished release [optimized] target(s) in 0.01s
   Running `target/release/smlr -lcb /var/log/install.log`
   
   real    0m19.603s
   user    0m18.310s
   sys     0m1.202s
----

=== Example #1

Say we want to find duplicates in the following data:

.Table uniq vs smlr
:file: file03.txt
[cols="<.<,<.<,<.<", options="header"]
|===
| input: samples/{file} | `uniq -i -c {file}` | `smlr -cbf {file}` 

a|
----
include::samples/file03.txt[]
----
a|  
----
   1 Pizza
   1 Ice Cream
   1 Waffle
   1 PiZZa
   1 Waffle
   1 pizza
   1 Pizza
   1 Peanuts
   1 PIZZA
   1 Beef Jerky
   1 Popcorn
   2 pizza
----
a|
----
7       Pizza
1       Ice Cream
2       Waffle
1       Peanuts
1       Beef Jerky
1       Popcorn
----
|===

NOTE: To my surprise, `uniq` seems to miss some of the case-insensitive matches.

The result from `uniq` is a bit surprising, as I expected all the pizza entries to be counted together and not spread across several lines. The cause is that `uniq` only compares adjacent lines. This issue can be easily solved using the `sort` command.
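
To see this in isolation: `uniq` merges only lines that directly follow each other, even with `-i`. The following is a minimal sketch using made-up inline data (it is not the content of `samples/file03.txt`):

```shell
# Hypothetical sample data, chosen only to illustrate the behavior.
data=$(printf '%s\n' Pizza Waffle pizza PIZZA Waffle)

# No sorting: only the adjacent "pizza"/"PIZZA" pair is merged.
unsorted=$(printf '%s\n' "$data" | uniq -i -c)
printf '%s\n' "$unsorted"
# Prints four lines: 1 Pizza, 1 Waffle, 2 pizza, 1 Waffle
# (the width of the count column varies between platforms).
```

The first `Pizza` and both `Waffle` lines stay separate because another entry sits between them and their duplicates.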

In the following example, we process the input using `sort` to group all duplicates. We then use `sort` again to show the most duplicates first. 

[source,bash]
----
$ cat file03.txt | sort | uniq -i -c | sort -r
   3 pizza
   3 PiZZa
   2 Waffle
   1 Popcorn
   1 Peanuts
   1 PIZZA
   1 Ice Cream
   1 Beef Jerky
----


`uniq` still has trouble with duplicates that differ in case, despite the `-i` flag, because a plain `sort` does not necessarily place them next to each other. Sorting with `-f` (fold case) helps further:

----
$ cat file03.txt | sort -f | uniq -i -c | sort -r
   7 PIZZA
   2 Waffle
   1 Popcorn
   1 Peanuts
   1 Ice Cream
   1 Beef Jerky
----
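
One caveat about the final `sort -r`: it compares lines as text, so the ordering of the counts can depend on how they are padded and on the locale. Adding `-n` makes `sort` compare the leading counts numerically. A small sketch, using made-up counts in the shape produced by `uniq -c`:

```shell
# Hypothetical counts, formatted like `uniq -c` output.
counts=$(printf '%4d %s\n' 9 Waffle 10 Pizza 2 Popcorn)

# -n compares the leading numbers, -r puts the largest first.
numeric=$(printf '%s\n' "$counts" | sort -rn)
printf '%s\n' "$numeric"
# Order of the counts: 10, 9, 2
```

For small samples like the ones in this document, the difference is unlikely to show up.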

=== Example #2

If we introduce some variance in some fields of our input, `uniq` will fail to find the duplicates.

Here is the data we work on:
----
include::samples/file04.txt[]
----

Here we are stuck with our previous `sort+uniq` method. Neither of the following really works as expected:

- `cat file04.txt | sort -f | uniq -i -c -f 1`
- `cat file04.txt | sort -f | uniq -i -c -s 10`

The reason for the failure is that our `sort` trick no longer works: sorting on the whole line no longer brings matching entries next to each other. We call `awk` to the rescue:

----
$ cat file04.txt | awk '{print $2}' | sort -f | uniq -c -i | sort -r
   7 PIZZA
   2 Waffle
   1 Popcorn
   1 Peanuts
   1 Ice
   1 Beef
----
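
Note that `awk '{print $2}'` keeps only the second field, which is why `Ice Cream` and `Beef Jerky` show up as `Ice` and `Beef`. One possible workaround is to blank out the first field and keep the rest of the line. The sketch below assumes a layout where a single varying field precedes the item name; the inline data is made up and is not the content of `samples/file04.txt`:

```shell
# Hypothetical input: a varying first field followed by the item name.
data=$(printf '%s\n' \
  '2020-01-01 Pizza' \
  '2020-01-02 Ice Cream' \
  '2020-01-03 pizza' \
  '2020-01-04 Ice Cream')

# Blank out the first field, trim the leading separator, then count.
result=$(printf '%s\n' "$data" \
  | awk '{ $1 = ""; sub(/^ +/, ""); print }' \
  | sort -f | uniq -i -c | sort -rn)
printf '%s\n' "$result"
```

Both entries end up with a count of 2, and `Ice Cream` is kept whole. This still relies on exact (case-insensitive) matches, so it does not help with typos.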


=== Example #3

Here is a nasty example where `uniq`, even with the help of some other friendly commands, won't be of much use:

----
include::samples/file05.txt[]
----

NOTE: This example contains a few typos on purpose. These are the kind of typos that are typically hard to spot (depending on your font and concentration level!).

Let's see how {proj} can handle that.

----
$ time smlr -cbf file05.txt | sort -r
7       2020-01-01  Pizza
2       2020-03-07  POPCORN
2       2020-01-02  Waffle
2       2020-01-01  Ice Cream
1       2020-03-06  Peanuts
1       2020-03-06  Beef Jerky

real    0m0.005s
user    0m0.002s
sys     0m0.004s
----

WARNING: {proj} can do more, but this comes at a cost in CPU and memory. Beware when parsing huge files!

== Install from Git

[source,bash,subs="attributes+"]
    cargo install --git https://gitlab.com/chevdor/smlr.git --tag {version}

[subs="attributes"]
WARNING: Installing the version from `master` is not recommended for production: do *NOT* omit the `--tag {version}` in the previous command.

== License

----
include::LICENSE[]
----