# linux-package-analyzer

`linux-package-analyzer` is a binary Rust crate providing the `lpa` command-line
executable. This CLI tool indexes and then inspects the contents of
Linux package repositories. Both Debian- and RPM-based repositories are supported.

Run `lpa help` for more details.

# Installing

```
# From the latest released version on crates.io:
$ cargo install linux-package-analyzer

# From the latest commit in the canonical Git repository:
$ cargo install --git https://github.com/indygreg/linux-packaging-rs linux-package-analyzer

# From the root directory of a Git source checkout:
$ cargo install --path linux-package-analyzer
```

# How It Works

`lpa` exposes sub-commands for importing the contents of a specified package
repository into a local SQLite database. It retrieves the package lists from
the remote repository, downloads each referenced package, and indexes its
content. The indexed content includes:

* Files installed by the package
* ELF file content
  * File header values
  * Section metadata
  * Dynamic library dependencies
  * Symbols
  * x86 instruction counts

Additional sub-commands analyze the indexed content in the SQLite database.
However, much of the data in the database is not exposed or queryable via
the CLI.

# Example

The following command will import all packages from Ubuntu 21.10 Impish for
amd64 into the SQLite database `ubuntu-impish.db`:

```
lpa --db ubuntu-impish.db \
    import-debian-repository \
    --components main,multiverse,restricted,universe \
    --architectures amd64 \
    http://us.archive.ubuntu.com/ubuntu impish
```

This should download ~96 GB of packages (as of January 2022) and create a
~12 GB SQLite database.

Once we have a populated database, we can run commands to query its content.

To see which files import (and presumably call) a specific C function:

```
lpa --db ubuntu-impish.db \
    elf-files-importing-symbol OPENSSL_init_ssl
```

To see the most common ELF section names:

```
lpa --db ubuntu-impish.db elf-section-name-counts
```

Power users may want to write their own queries against the database. To
get started, open the SQLite database and poke around:

```
$ sqlite3 ubuntu-impish.db
SQLite version 3.35.5 2021-04-19 18:32:05
Enter ".help" for usage hints.

sqlite> .tables
elf_file                          package_file
elf_file_needed_library           symbol_name
elf_file_x86_base_register_count  v_elf_needed_library
elf_file_x86_instruction_count    v_elf_symbol
elf_file_x86_register_count       v_package_elf_file
elf_section                       v_package_file
elf_symbol                        v_package_instruction_count
package

sqlite> select * from v_elf_needed_library where library_name = "libc.so.6" order by package_name asc limit 1;
0ad|0.0.25b-1|http://us.archive.ubuntu.com/ubuntu/pool/universe/0/0ad/0ad_0.0.25b-1_amd64.deb|usr/games/pyrogenesis|libc.so.6
```

The `v_`-prefixed tables are views that pull in data from multiple tables.
For example, `v_elf_symbol` has all the columns of `elf_symbol` but also
expands the package name, version, file path, etc.
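
Custom queries can also be scripted. The following is a minimal Python sketch, not part of `lpa`, that counts how many distinct packages need each shared library. It relies only on the `package_name` and `library_name` columns of `v_elf_needed_library` that appear in the query above. The helper names (`most_needed_libraries`, `demo_connection`) are hypothetical, and the demo builds a tiny in-memory stand-in table rather than the real view, which has more columns.

```python
import sqlite3


def most_needed_libraries(conn, limit=10):
    """Return (library_name, package_count) rows, most widely needed first."""
    query = """
        SELECT library_name, COUNT(DISTINCT package_name) AS package_count
        FROM v_elf_needed_library
        GROUP BY library_name
        ORDER BY package_count DESC, library_name ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


def demo_connection():
    """Build an in-memory stand-in for the view (assumption: demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE v_elf_needed_library (package_name TEXT, library_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO v_elf_needed_library VALUES (?, ?)",
        [
            ("0ad", "libc.so.6"),
            ("0ad", "libm.so.6"),
            ("bash", "libc.so.6"),
            ("bash", "libtinfo.so.6"),
        ],
    )
    return conn


for library, package_count in most_needed_libraries(demo_connection(), limit=3):
    print(library, package_count)
```

To run the same query against an imported database, pass `sqlite3.connect("ubuntu-impish.db")` instead of the demo connection.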

# Constants and Special Values

Various ELF data use integer constants to define attributes. For example,
`elf_file.machine` is an integer holding the ELF machine type. A good
reference for the values of these constants is
https://docs.rs/object/0.28.2/src/object/elf.rs.html#1-6256.

`lpa` also exposes various `reference-*` commands for printing known
values.

# Known Issues

## x86 Disassembly Quirks

During package import, `lpa` attempts to disassemble x86 / x86-64 files so
that instruction counts and register usage can be stored in the database.

We disassemble all sections marked as executable. Instructions in other
sections may not be found (this is hopefully rare).

We disassemble using the [iced_x86](https://crates.io/crates/iced-x86) Rust crate,
so any limitations in that crate also apply to the disassembler.

We disassemble by iterating over the content of each executable section and
decoding instructions until the end of the section. Executable sections can
contain NUL bytes, inline data, and other bytes that do not represent valid
instructions. As a result, many byte sequences decode to the special
*invalid* instruction. In some cases, a byte sequence decodes to a valid
instruction even though the underlying data is not code, so instruction
counts can include false positives.
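
To illustrate the false-positive effect, here is a toy linear-sweep counter in Python. It is a sketch under stated assumptions, not how `lpa` or `iced_x86` decode instructions: `TOY_OPCODES` is a hypothetical five-entry table, whereas a real x86 decoder handles prefixes, ModRM/SIB forms, and thousands of encodings. It does reflect one real property of the instruction set: the byte pair `00 00` is a valid encoding of an `add` instruction, so NUL padding is counted as code.

```python
from collections import Counter

# Hypothetical toy table: first opcode byte -> (mnemonic, total length in bytes).
TOY_OPCODES = {
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
    0xCC: ("int3", 1),
    0xB8: ("mov", 5),  # mov eax, imm32: opcode plus a 4-byte immediate
    0x00: ("add", 2),  # add r/m8, r8: opcode plus ModRM (no SIB/displacement assumed)
}


def linear_sweep(section: bytes) -> Counter:
    """Decode from the first byte to the last, counting mnemonics."""
    counts = Counter()
    offset = 0
    while offset < len(section):
        mnemonic, length = TOY_OPCODES.get(section[offset], ("(invalid)", 1))
        if offset + length > len(section):
            # Truncated instruction at the end of the section.
            mnemonic, length = "(invalid)", 1
        counts[mnemonic] += 1
        offset += length
    return counts


code = bytes([0xB8, 0x01, 0x00, 0x00, 0x00, 0xC3])  # mov eax, 1; ret
padding = bytes(4)                                   # four NUL bytes, not code
stray = bytes([0xFF])                                # byte unknown to the toy table
counts = linear_sweep(code + padding + stray)
print(dict(counts))
```

Only `mov` and `ret` are real code in this input, yet the four padding bytes are counted as two `add` instructions and the stray byte as one invalid instruction.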

## Intermittent HTTP Failures on Package Retrieval

Intermittent HTTP GET failures when importing packages are expected due to
intrinsic network unreliability. They often manifest as an error like the
following:

```
error processing package (ignoring): repository I/O error on path pool/universe/g/gcc-10/gnat-10_10.3.0-11ubuntu1_amd64.deb: Custom { kind: Other, error: "error sending HTTP request: reqwest::Error { kind: Request, url: Url { scheme: \"http\", cannot_be_a_base: false, username: \"\", password: None, host: Some(Domain(\"us.archive.ubuntu.com\")), port: None, path: \"/ubuntu/pool/universe/g/gcc-10/gnat-10_10.3.0-11ubuntu1_amd64.deb\", query: None, fragment: None }, source: hyper::Error(IncompleteMessage) }" }
```

If you see failures like this, simply retry the import operation. Already
imported packages should automatically be skipped.

## Package Server Throttling

`lpa` can issue parallel HTTP requests to retrieve content. By default, the
maximum number of parallel requests equals the number of CPU cores/threads.

Some package repositories limit the number of simultaneous HTTP
connections/requests per client. If your machine has many CPU cores, you may run
into these limits and get a high volume of HTTP errors when fetching packages.
To mitigate this, reduce the number of simultaneous I/O operations via
`--threads`, e.g. `lpa --threads 4 ...`.
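
The general idea of capping parallel fetches can be sketched in Python with a bounded thread pool. This illustrates the concept only and is not `lpa`'s implementation: `fetch_all` and `PeakTracker` are hypothetical names, and the "fetch" is a short sleep standing in for an HTTP request.

```python
import os
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(paths, fetch, threads=None):
    """Call fetch(path) for every path using at most `threads` workers.

    When `threads` is not given, fall back to the CPU count.
    """
    max_workers = threads or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, paths))


class PeakTracker:
    """Fake fetcher that records the highest number of concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, path):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # stand-in for network I/O
        with self._lock:
            self.active -= 1
        return path


tracker = PeakTracker()
paths = [f"pool/example/pkg-{i}.deb" for i in range(20)]  # made-up paths
fetch_all(paths, tracker, threads=4)
print("peak concurrent fetches:", tracker.peak)
```

With `threads=4`, the tracker never observes more than four fetches in flight, which is the effect a lower `--threads` value is meant to have on the number of simultaneous requests a server sees.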

## SQLite Integrity Weakening

To maximize speed of import operations, SQLite databases have their content
integrity and durability guarantees weakened via `PRAGMA` statements issued
on database open. A process or machine crash during a write operation could
corrupt the SQLite database more easily than it otherwise would.
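
The exact `PRAGMA` statements `lpa` issues are not listed here. As a hedged illustration of this kind of trade-off, the Python sketch below applies two standard SQLite pragmas that exchange durability for write speed (`journal_mode = MEMORY` and `synchronous = OFF`, chosen as an assumption, not taken from `lpa`) and then runs SQLite's built-in `PRAGMA integrity_check`, which can be useful after an interrupted import. The function names `open_fast` and `is_intact` are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_fast(path):
    """Open a database with weakened durability settings (assumed pragmas)."""
    conn = sqlite3.connect(path)
    # Keep the rollback journal in memory instead of on disk.
    conn.execute("PRAGMA journal_mode = MEMORY").fetchone()
    # Do not wait for writes to reach stable storage.
    conn.execute("PRAGMA synchronous = OFF")
    return conn


def is_intact(conn):
    """Return True when SQLite's integrity check reports no problems."""
    return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]


with tempfile.TemporaryDirectory() as tmp:
    conn = open_fast(os.path.join(tmp, "demo.db"))
    conn.execute("CREATE TABLE t (value INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()
    print("intact:", is_intact(conn))
    conn.close()
```

Running `PRAGMA integrity_check` from the `sqlite3` shell against an imported database performs the same check.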