CSV Fish
========

CSV categorical data analyzer. Generates 2x2 contingency tables
according to specified row/column group conditions, and applies
Fisher's exact test.


Installation
------------

If you don't want to build the program yourself (see instructions
below for that), you can get it via the Cargo tool. First [install
Cargo](https://crates.io/) and then run

    cargo install csv_fish


Usage
-----

To use the program, point to the data, group specifications, and
desired output location:

    csv-fish --data data.csv --groups groups.csv --output output.csv

See the
[example files](https://gitlab.com/jistr/csv-fish/tree/main/example)
for inspiration about the content of the files, and read the below
input/output specification before using the program.

The Groups CSV file contains SQL queries executed in SQLite. Do not
run the program on input files you cannot trust.

### Inputs

* All CSV inputs are expected to be separated by semicolons.

* Data CSV - values of categorical variables for individual samples.

  * First row is a header - names of categorical variables in the data
    set. Each subsequent row represents one sample.

  * Each column represents a categorical variable.

  * Each cell is a value of a given categorical variable for a given
    sample.

  * Before running any queries on data, all cell values are trimmed of
    leading and trailing whitespace, to prevent accidental
    mismatches. (Don't use leading and trailing whitespace as intended
    differentiators in data.)

  * (More than two possible values per a categorical variable are
    supported.)

* Groups CSV - SQL for selection of the groups being tested.

  * If you're not familiar with SQL, you can get a quick overview of
    how the
    [where conditions](https://www.w3schools.com/sql/sql_where.asp)
    work, and the
    [logical operators](https://www.w3schools.com/sql/sql_and_or.asp)
    you can use in them.

  * First row is a header, subsequent rows represent groups
    (conditions for selecting groups).

  * The conditions are either for selecting rows of the contingency
    table, or for selecting columns, or for selecting both rows and
    columns at the same time.

    * When using row-only and column-only conditions, the program then
      combines each of the row-only conditions with each of the
      column-only conditions (Cartesian product) to generate
      row-and-column conditions, which are in turn used for generating
      contingency tables.

    * When using row-and-column conditions, each such row in Groups
      CSV file corresponds to one contingency table.

    * It is possible to combine row-only and column-only conditions
      with row-and-column conditions in a single Groups CSV file.

    * **See the [example files](https://gitlab.com/jistr/csv-fish/tree/main/example),
      which might be better than a thousand words of description.**

  * The header of Groups CSV must be:

        condtype;filter;r1cond;r2cond;c1cond;c2cond

  * `condtype` - *required* - the values can be `row`, `col`, or
    `rowcol`. The `row` and `col` types are for specifying conditions
    for rows and columns separately, the `rowcol` type specifies them
    together.

  * `filter` - *optional* - initial SQL WHERE condition applied to all
    data before conducting the fisher exact test. Can be used to
    shrink the sample set before performing any further operations.

    * **If your data contains samples where some categorical variables
      are unknown (empty cell), you probably want to add filter
      conditions so that queries working with that variable don't use
      those samples.** E.g. `my_var != ''`. See the
      [groups.csv example](https://gitlab.com/jistr/csv-fish/blob/main/example/groups.csv).
      As `filter` is SQL, you can use the operators (like `AND`) in
      the condition if you need to reference multiple colums.

  * `r1cond` - *required when `condtype` is `row` or `rowcol`* - SQL
    WHERE condition for selecting row 1.

  * `r2cond` - *optional* - SQL WHERE condition for selecting
    row 2. If empty, the complement of `r1cond` is used (the result is
    still limited by `filter`).

  * `c1cond` - *required when `condtype` is `col` or `rowcol`* - SQL
    WHERE condition for selecting column 1.

  * `c2cond` - *optional* - SQL WHERE condition for selecting
    column 2. If empty, the complement of `c1cond` is used (the result
    is still limited by `filter`).

  * To avoid accidental sample overlaps between `r1cond/r2cond` or
    `c1cond/c2cond`, it is recommended that `r2cond` and `c2cond` are
    not used empty, and `filter` is used to pre-select a narrower
    sample set when desired (e.g. skip samples with unknown value for
    given variables), and then only `r1cond` and `c1cond` are used for
    row/column selection, making the second row/column always the
    complement of the first one.

### Output

* Results CSV - Fisher's exact test results.

  * First row is a header, each subsequent row represents one Fisher's
    exact test.

  * The header is:

        filter;r1cond;r2cond;c1cond;c2cond;r1c1;r1c1;r2c1;r2c2;fisher_l;fisher_r;fisher_2t

  * The values in the first 5 columns are the same as in the input
    Groups CSV.

  * The values in the columns `r1c1,r1c2,r2c1,r2c2` are counts of
    samples in the contingency table, which satisfy the respective
    conditions.

  * The `fisher_l,fisher_r,fisher_2t` are left, right, and 2-tail
    p-values of Fischer's exact test.


Building
--------

Only tested on Linux. To build the binary:

    make

To run all tests:

    make test

If you're feeling adventurous, you can also cross-compile the program
for Windows. This requires `podman`. Make sure to read the script
before running it. Then run it:

    bash tools/compile-for-windows.sh


License
-------

GNU GPL v3+