Crates.io | csv_fish |
lib.rs | csv_fish |
version | 0.2.4 |
source | src |
created_at | 2019-03-21 23:36:49.820959 |
updated_at | 2021-02-13 20:04:59.960595 |
description | CSV categorical data analyzer. Generates 2x2 contingency tables according to specified row/column group conditions, and applies Fisher's exact test. |
homepage | |
repository | https://gitlab.com/jistr/csv-fish |
max_upload_size | |
id | 123006 |
size | 71,674 |
CSV categorical data analyzer. Generates 2x2 contingency tables according to specified row/column group conditions, and applies Fisher's exact test.
If you don't want to build the program yourself (see instructions below for that), you can get it via the Cargo tool. First install Cargo and then run
cargo install csv_fish
To use the program, point to the data, group specifications, and desired output location:
csv-fish --data data.csv --groups groups.csv --output output.csv
See the example files for inspiration about the content of the files, and read the below input/output specification before using the program.
The Groups CSV file contains SQL queries executed in SQLite. Do not run the program on input files you cannot trust.
All CSV inputs are expected to be separated by semicolons.
Data CSV - values of categorical variables for individual samples.
First row is a header - names of categorical variables in the data set. Each subsequent row represents one sample.
Each column represents a categorical variable.
Each cell is a value of a given categorical variable for a given sample.
Before running any queries on data, all cell values are trimmed of leading and trailing whitespace, to prevent accidental mismatches. (Don't use leading and trailing whitespace as intended differentiators in data.)
(More than two possible values per a categorical variable are supported.)
Groups CSV - SQL for selection of the groups being tested.
If you're not familiar with SQL, you can get a quick overview of how the where conditions work, and the logical operators you can use in them.
First row is a header, subsequent rows represent groups (conditions for selecting groups).
The conditions are either for selecting rows of the contingency table, or for selecting columns, or for selecting both rows and columns at the same time.
When using row-only and column-only conditions, the program then combines each of the row-only conditions with each of the column-only conditions (Cartesian product) to generate row-and-column conditions, which are in turn used for generating contingency tables.
When using row-and-column conditions, each such row in Groups CSV file corresponds to one contingency table.
It is possible to combine row-only and column-only conditions with row-and-column conditions in a single Groups CSV file.
See the example files, which might be better than a thousand words of description.
The header of Groups CSV must be:
condtype;filter;r1cond;r2cond;c1cond;c2cond
condtype
- required - the values can be row
, col
, or
rowcol
. The row
and col
types are for specifying conditions
for rows and columns separately, the rowcol
type specifies them
together.
filter
- optional - initial SQL WHERE condition applied to all
data before conducting the fisher exact test. Can be used to
shrink the sample set before performing any further operations.
my_var != ''
. See the
groups.csv example.
As filter
is SQL, you can use the operators (like AND
) in
the condition if you need to reference multiple colums.r1cond
- required when condtype
is row
or rowcol
- SQL
WHERE condition for selecting row 1.
r2cond
- optional - SQL WHERE condition for selecting
row 2. If empty, the complement of r1cond
is used (the result is
still limited by filter
).
c1cond
- required when condtype
is col
or rowcol
- SQL
WHERE condition for selecting column 1.
c2cond
- optional - SQL WHERE condition for selecting
column 2. If empty, the complement of c1cond
is used (the result
is still limited by filter
).
To avoid accidental sample overlaps between r1cond/r2cond
or
c1cond/c2cond
, it is recommended that r2cond
and c2cond
are
not used empty, and filter
is used to pre-select a narrower
sample set when desired (e.g. skip samples with unknown value for
given variables), and then only r1cond
and c1cond
are used for
row/column selection, making the second row/column always the
complement of the first one.
Results CSV - Fisher's exact test results.
First row is a header, each subsequent row represents one Fisher's exact test.
The header is:
filter;r1cond;r2cond;c1cond;c2cond;r1c1;r1c1;r2c1;r2c2;fisher_l;fisher_r;fisher_2t
The values in the first 5 columns are the same as in the input Groups CSV.
The values in the columns r1c1,r1c2,r2c1,r2c2
are counts of
samples in the contingency table, which satisfy the respective
conditions.
The fisher_l,fisher_r,fisher_2t
are left, right, and 2-tail
p-values of Fischer's exact test.
Only tested on Linux. To build the binary:
make
To run all tests:
make test
If you're feeling adventurous, you can also cross-compile the program
for Windows. This requires podman
. Make sure to read the script
before running it. Then run it:
bash tools/compile-for-windows.sh
GNU GPL v3+