CSV Fish
========

CSV categorical data analyzer. Generates 2x2 contingency tables according to specified row/column group conditions, and applies Fisher's exact test.

Installation
------------

If you don't want to build the program yourself (see instructions below for that), you can get it via the Cargo tool. First [install Cargo](https://crates.io/) and then run

    cargo install csv_fish

Usage
-----

To use the program, point to the data, group specifications, and desired output location:

    csv-fish --data data.csv --groups groups.csv --output output.csv

See the [example files](https://gitlab.com/jistr/csv-fish/tree/main/example) for inspiration about the content of the files, and read the input/output specification below before using the program.

The Groups CSV file contains SQL queries executed in SQLite. Do not run the program on input files you cannot trust.

### Inputs

* All CSV inputs are expected to be separated by semicolons.

* Data CSV - values of categorical variables for individual samples.

  * First row is a header - names of categorical variables in the data set. Each subsequent row represents one sample.
  * Each column represents a categorical variable.
  * Each cell is a value of a given categorical variable for a given sample.
  * Before running any queries on data, all cell values are trimmed of leading and trailing whitespace, to prevent accidental mismatches. (Don't use leading and trailing whitespace as intended differentiators in data.)
  * (More than two possible values per categorical variable are supported.)

* Groups CSV - SQL for selection of the groups being tested.

  * If you're not familiar with SQL, you can get a quick overview of how the [where conditions](https://www.w3schools.com/sql/sql_where.asp) work, and the [logical operators](https://www.w3schools.com/sql/sql_and_or.asp) you can use in them.
  * First row is a header, subsequent rows represent groups (conditions for selecting groups).
  * The conditions are either for selecting rows of the contingency table, or for selecting columns, or for selecting both rows and columns at the same time.
  * When using row-only and column-only conditions, the program combines each of the row-only conditions with each of the column-only conditions (Cartesian product) to generate row-and-column conditions, which are in turn used for generating contingency tables.
  * When using row-and-column conditions, each such row in the Groups CSV file corresponds to one contingency table.
  * It is possible to combine row-only and column-only conditions with row-and-column conditions in a single Groups CSV file.
  * **See the [example files](https://gitlab.com/jistr/csv-fish/tree/main/example), which might be better than a thousand words of description.**
  * The header of Groups CSV must be: `condtype;filter;r1cond;r2cond;c1cond;c2cond`
  * `condtype` - *required* - the values can be `row`, `col`, or `rowcol`. The `row` and `col` types are for specifying conditions for rows and columns separately, the `rowcol` type specifies them together.
  * `filter` - *optional* - initial SQL WHERE condition applied to all data before conducting Fisher's exact test. Can be used to shrink the sample set before performing any further operations.
    * **If your data contains samples where some categorical variables are unknown (empty cell), you probably want to add filter conditions so that queries working with that variable don't use those samples.** E.g. `my_var != ''`. See the [groups.csv example](https://gitlab.com/jistr/csv-fish/blob/main/example/groups.csv). As `filter` is SQL, you can use operators (like `AND`) in the condition if you need to reference multiple columns.
  * `r1cond` - *required when `condtype` is `row` or `rowcol`* - SQL WHERE condition for selecting row 1.
  * `r2cond` - *optional* - SQL WHERE condition for selecting row 2. If empty, the complement of `r1cond` is used (the result is still limited by `filter`).
  * `c1cond` - *required when `condtype` is `col` or `rowcol`* - SQL WHERE condition for selecting column 1.
  * `c2cond` - *optional* - SQL WHERE condition for selecting column 2. If empty, the complement of `c1cond` is used (the result is still limited by `filter`).
  * To avoid accidental sample overlaps between `r1cond`/`r2cond` or `c1cond`/`c2cond`, it is recommended to leave `r2cond` and `c2cond` empty. Use `filter` to pre-select a narrower sample set when desired (e.g. to skip samples with an unknown value for given variables), and then use only `r1cond` and `c1cond` for row/column selection, so that the second row/column is always the complement of the first one.

### Output

* Results CSV - Fisher's exact test results.

  * First row is a header, each subsequent row represents one Fisher's exact test.
  * The header is: `filter;r1cond;r2cond;c1cond;c2cond;r1c1;r1c2;r2c1;r2c2;fisher_l;fisher_r;fisher_2t`
  * The values in the first 5 columns are the same as in the input Groups CSV.
  * The values in the columns `r1c1`, `r1c2`, `r2c1`, `r2c2` are counts of samples in the contingency table which satisfy the respective conditions.
  * The columns `fisher_l`, `fisher_r`, `fisher_2t` are the left, right, and 2-tail p-values of Fisher's exact test.

Building
--------

Only tested on Linux. To build the binary:

    make

To run all tests:

    make test

If you're feeling adventurous, you can also cross-compile the program for Windows. This requires `podman`. Make sure to read the script before running it. Then run it:

    bash tools/compile-for-windows.sh

License
-------

GNU GPL v3+
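
Appendix: cross-checking one result by hand
-------------------------------------------

The input and output rules above can be illustrated with a small, self-contained Python sketch. It is not the program's implementation, only an approximation of the documented behaviour for a single group: cells are trimmed, data is loaded into an in-memory SQLite table, row 2 and column 2 are taken as complements of `r1cond` and `c1cond`, and the three p-values are computed from the hypergeometric distribution. The table name `data`, the sample data, and all function names are assumptions made for this example. The two-tail value sums the probabilities of all tables no more likely than the observed one, which is a common convention but may differ from what the program does.

```python
import csv
import io
import sqlite3
from math import comb

# Hypothetical sample data, semicolon-separated as the README requires.
# One cell has a leading space on purpose, to show the trimming step.
DATA_CSV = """smoker;outcome
yes;ill
yes;ill
yes; ill
yes;healthy
no;ill
no;healthy
no;healthy
no;healthy
"""


def load_data(csv_text):
    """Load the Data CSV into an in-memory SQLite table named `data` (assumed name)."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=";"))
    header, samples = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{name.strip()}" TEXT' for name in header)
    conn.execute(f"CREATE TABLE data ({cols})")
    marks = ", ".join("?" for _ in header)
    # Trim leading/trailing whitespace from every cell before querying.
    trimmed = [[cell.strip() for cell in row] for row in samples]
    conn.executemany(f"INSERT INTO data VALUES ({marks})", trimmed)
    return conn


def contingency_table(conn, r1cond, c1cond, filter_cond="1"):
    """Return (r1c1, r1c2, r2c1, r2c2), using complements for row 2 and column 2."""
    r2cond = f"NOT ({r1cond})"
    c2cond = f"NOT ({c1cond})"

    def count(row_cond, col_cond):
        sql = (
            "SELECT COUNT(*) FROM data "
            f"WHERE ({filter_cond}) AND ({row_cond}) AND ({col_cond})"
        )
        return conn.execute(sql).fetchone()[0]

    return (
        count(r1cond, c1cond),
        count(r1cond, c2cond),
        count(r2cond, c1cond),
        count(r2cond, c2cond),
    )


def fisher_exact(a, b, c, d):
    """Return (left, right, two_tail) p-values for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    # Hypergeometric probability of each possible value of the r1c1 cell.
    probs = {
        x: comb(row1, x) * comb(n - row1, col1 - x) / denom
        for x in range(lo, hi + 1)
    }
    p_obs = probs[a]
    left = sum(p for x, p in probs.items() if x <= a)
    right = sum(p for x, p in probs.items() if x >= a)
    # Small relative tolerance so ties are not lost to rounding.
    two_tail = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
    return left, right, two_tail


if __name__ == "__main__":
    conn = load_data(DATA_CSV)
    table = contingency_table(conn, "smoker = 'yes'", "outcome = 'ill'")
    print(table, fisher_exact(*table))
```

With the sample data above, the contingency table is 3, 1, 1, 3, and the left, right, and 2-tail p-values are 69/70, 17/70, and 34/70. As with the program itself, the conditions are interpolated into SQL as-is, so the sketch should only be run with conditions you trust.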