# Extended Thoughts on _Super Speedy Syslog Searcher_! (s4)<!-- omit in TOC -->

Pontifications and realizations from the developer's point of view
about _Super Speedy Syslog Searcher!_.

---

<!-- TOC generated by Markdown All In One -->
- [Background](#background)
  - [That is a long name!](#that-is-a-long-name)
  - [Learn Rust](#learn-rust)
- [The Problem](#the-problem)
  - [The Solution](#the-solution)
    - [Solution Difficulties](#solution-difficulties)
      - [_differing sysline formats, differing datetime formats_](#differing-sysline-formats-differing-datetime-formats)
      - [_Which is the correct datetime stamp?_](#which-is-the-correct-datetime-stamp)
      - [_awareness of multi-line syslines adds some layer of program "state"_](#awareness-of-multi-line-syslines-adds-some-layer-of-program-state)
      - [_What year? What timezone?_](#what-year-what-timezone)
      - [_Unicode decoding is a surprisingly expensive operation._](#unicode-decoding-is-a-surprisingly-expensive-operation)
      - [_Binary-search in a file that has varying character encodings._](#binary-search-in-a-file-that-has-varying-character-encodings)
      - [_What defines the boundaries of a sysline?_](#what-defines-the-boundaries-of-a-sysline)

---

## Background

Why _Super Speedy Syslog Searcher_?

### That is a long name!

I prefer unique, searchable, memorable, and self-explanatory names for all
programs.

To explain another way, try an internet search for "syslog finder", or
"syslog datetime sort", etc.
Many varying results; not a memorable phrase.

### Learn Rust

Great language! Great ecosystem!

Not perfect but a necessary generational improvement over competitors C and C++.

## The Problem

As a Test Engineer, I'm often investigating problems that affect complex
systems.
I'm usually only superficially familiar with the underlying subsystems
(Test Engineers are often system generalists, not specialists).
During an investigation, I may know _when_ a problem occurred but I have
little certainty about what system components might be involved, and so I
don't what system logs ("syslogs") are relevant to review.

For example, the frobulator failed around minute _2022/01/01 14:33_.
I know the frobulator emitted an error but my hunch is this error only first
_appeared_ in the frobulator but the source error is somewhere else. I know
many system components interact with the frobulator but I'm not certain of that
knowledge.
So I'd like to search _all_ other system components for the preceding five
minutes (i.e. search all other syslog files).

I have several-times written scripts that:

1. parse varying datetime formats in lines
2. sort lines by datetime
3. filter lines by some datetime period ("I only want syslogs during this
   minute")
4. display syslog lines ("syslines") by datetime order and prepended file name

Doing this in a scripting language like Python is do-able but tedious.
It's also very slow for large amounts of syslogs and/or compressed syslogs.
Additional complexities make scripting difficult or imperfect.

- Handling compressed or archived files adds scripting complexity and runtime.
- Storing all the syslines in memory may be impossible.
- Dealing with multi-line syslog lines ("syslines") means typical
  "line by line" processing is inadequate. e.g. this is one sysline

```text
  2022/01/01 14:33:12 Frobulator: failed to initialize the woozybucket
        woozybucket: I/O failure /dev/sdc4
```

I once attempted to create such a script for a multi-host data storage system.
The system would generate tens of Gigabytes of syslogs per day
(it was running in Debug mode).
Those syslogs were periodically offloaded onto a busy central NAS that
resided in a remote office, where they were further compressed
and archived (`.tar` files in `.tar` files).
I would investigate errors days after they occurred.
Searching a large amount of syslogs for "everything that happened between
_datetime A_ and _datetime A plus five minutes_" was incredibly slow
(tens of minutes to hours).

I observed that if the Searching Tool was datetime-aware, and did a
binary-search instead of a linear search, then searching could be very fast.
Additionally, if the tool would release memory for syslines that have been
processed then it could handle arbitrarily large syslog files.

### The Solution

Enter _Super Speedy Syslog Searcher_ (s4).

_s4_ searches syslogs by datetime in a binary-search manner. That is, given
datetime A, a syslog (e.g. a file with lines having datetime-stamps; "syslines")
a binary-search for the sysline at or nearest to datetime A is possible.
Only small blocks of the file need to be read to discover the sysline with
datetime A (or nearest to datetime A).

Surprisingly, nothing exists that can do this.

At the same time, I was interested in learning about the new cool thing,
the _rust_ programming language. _rust_ seemed ideal for this situation where
speed and efficiency was a defining feature of the solution.

#### Solution Difficulties

Why do no other tools exist for this common need?

When moving from implementing a script for a known system, to implementing a
general purpose tool, here are some difficulties I discovered during design and
implementation.

##### _differing sysline formats, differing datetime formats_

In practice there are many reasonably possible sysline formats
(datetime format, plus arrangement).
Handling them all can be done but adds complexity. Handling them all speedily
adds further complexity.

These are a few system log lines ("syslines") I've found:

```text
2022/01/01 12:34:56 There was an error
There was an error January 01 2022 12:34:56PM
[Error] There was an error Sat Jan 1 12:34:56 2022
```

##### _Which is the correct datetime stamp?_

e.g.

```text
Frobulator 2022/01/01 12:34:50 there was an error 2022/01/01 12:34:56
```

How do you tell your parser that the _second_ datetime stamp is the
_real_ datetime stamp of the sysline?

##### _awareness of multi-line syslines adds some layer of program "state"_

```text
2022/01/01 12:34:56 Frobulator: failed to initialize the woozybucket
      woozybucket: I/O failure /dev/sdc4
```

Line-by-line parsing won't suffice.

##### _What year? What timezone?_

It is common to have syslog datetime formats the do not include a year or a
timezone.

For example, your syslog searching script finds these three syslines from three
different syslogs,
what is the correct chronological order?

```text
Jan. 1 11:00:00 Error
Jan. 1 12:00:00 PST Error
Jan. 1 13:00:00 UTC Error
```

What year and timezone did the first sysline occur? (Answer: you don't know).
What year did the second and third sysline occur? (Answer: you don't know).

There are ways to get a "probably correct" answer. But how to do this very
quickly (i.e. without reading the entire file? without spinning up some
fancy Learning AI computer?)?
It's not a straightforward solution.

In a multi-host system that spans different timezones (which I've worked with)
then this problem moves from theory to reality.

##### _Unicode decoding is a surprisingly expensive operation._

See `benches/bench_decode_utf.rs`.

##### _Binary-search in a file that has varying character encodings._

What if the "byte cursor" of your binary search lands on the middle byte that
is part of three-byte long unicode character?

##### _What defines the boundaries of a sysline?_

A sysline is a line or lines that come after a sysline (or the beginning of a
file), and before a sysline (or the end of a file).

(I hope you are understanding the implied difficulty here).