In [6]:
import os

This notebook provides utilities for running grid searches over the index-building parameters of Seismic.

#### Build

This section generates a `bash` script, named by the `output_file` variable, that builds one index for every combination of the values listed in `all_n_posting`, `all_energies`, and `all_centroid_fractions`. Execute the generated script to start the grid search. Before running the cell, set `documents_path`, the path to the documents in Seismic's inner format, and `index_dir_path`, the directory where the generated indexes will be saved.

In [2]:
output_file = "../grid_index.sh"

all_n_posting = [1000, 1500, 2000]
all_energies = [0.1, 0.2]
all_centroid_fractions = [0.05, 0.75, 0.1]



executable = "./target/release/build_inverted_index"
documents_path = ""
index_dir_path = ""
for n_postings in all_n_posting:
    for energy in all_energies:
        for centroid_fraction in all_centroid_fractions:
            # The index name encodes the configuration, so each index can be traced back to its parameters.
            name = f"GlobalThreshold_n-postings_{n_postings}_energy_{energy}_centroid-fraction_{centroid_fraction}.seismic_index"
            index_path = os.path.join(index_dir_path, name)
            string = f"{executable} -i {documents_path} --centroid-fraction {centroid_fraction} -s {energy} --n-postings {n_postings} -o {index_path}\n"
            # Append mode: delete output_file before re-running this cell, or the commands will be duplicated.
            with open(output_file, "a") as f:
                f.write(string)
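
The nested loops above enumerate the Cartesian product of the three parameter lists. As a minimal sketch of the same idea, the cell below collects the commands in a list with `itertools.product` instead of writing them to a file. The helper name `build_commands` and the placeholder paths `docs.bin` and `indexes` are assumptions made for illustration; the executable and its flags are copied from the cell above. Quoting the paths with `shlex.quote` is an addition that keeps paths containing spaces intact in the generated script.

```python
import itertools
import os
import shlex


def build_commands(executable, documents_path, index_dir_path,
                   all_n_posting, all_energies, all_centroid_fractions):
    """Return one build command per parameter combination (hypothetical helper)."""
    commands = []
    # product() iterates in the same order as the three nested for loops.
    for n_postings, energy, centroid_fraction in itertools.product(
            all_n_posting, all_energies, all_centroid_fractions):
        name = (f"GlobalThreshold_n-postings_{n_postings}_energy_{energy}"
                f"_centroid-fraction_{centroid_fraction}.seismic_index")
        index_path = os.path.join(index_dir_path, name)
        commands.append(
            f"{executable} -i {shlex.quote(documents_path)} "
            f"--centroid-fraction {centroid_fraction} -s {energy} "
            f"--n-postings {n_postings} -o {shlex.quote(index_path)}"
        )
    return commands


# Placeholder paths, for illustration only.
commands = build_commands(
    "./target/release/build_inverted_index", "docs.bin", "indexes",
    [1000, 1500, 2000], [0.1, 0.2], [0.05, 0.75, 0.1],
)
print(len(commands))
```

Returning the commands as a list makes it easy to inspect them first and then write them in a single pass, for example with `open(output_file, "w")`, so that re-running the cell overwrites the script rather than appending to it.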

#### Search (MSMARCO)

This section generates the script that runs the grid search, assuming you have already built a set of indexes with the code in the **Build** section. Set the following variables:
- `queries_path`: path to the queries in the inner format.
- `groundtruth_path`: path to the groundtruth file generated with the `generate_groundtruth` binary. It is needed to compute recall with respect to the exact search.
- `index_folder`: path to the directory that contains the indexes, usually the same as `index_dir_path` above.
- `result_folder`: directory where the results (in `.tsv` format) will be saved. The script generates one output file per index, with the same name as the index. You can change the name of the output file by modifying `result_path` inside the `for` loop below.
- `qrels_path`: path to the qrels file.
- `original_queries_path`: path to the `.tsv` file containing the original queries.


The qrels and the original queries files can be downloaded [here](http://hpc.isti.cnr.it/~rulli/seismic-sigir2024/aux_data/).
 

In [2]:
base_command = "bash scripts/grid_search.sh" 

queries_path = ""
groundtruth_path = ""

index_folder = "" # Same as index_dir_path
result_folder = ""

qrels_path = ""
original_queries_path = ""

In [4]:
# You can apply different filters based on the index name, e.g. `"4000" in x`.
# A list (rather than a one-shot `filter` iterator) can be iterated more than once.
files = [x for x in os.listdir(index_folder) if x.startswith("Global")]
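
Because the index names produced in the **Build** section encode the parameters, a filter can also compare parsed values instead of substrings. The cell below is a minimal sketch under that assumption: the regular expression mirrors the `GlobalThreshold_..._.seismic_index` naming scheme used above, and the helper `parse_index_name` is hypothetical, not part of Seismic. It only handles values that Python formats in plain decimal notation, such as `0.1`.

```python
import re

# Mirrors the naming scheme of the Build section (assumed unchanged).
INDEX_NAME_RE = re.compile(
    r"^GlobalThreshold_n-postings_(?P<n_postings>\d+)"
    r"_energy_(?P<energy>\d+(?:\.\d+)?)"
    r"_centroid-fraction_(?P<centroid_fraction>\d+(?:\.\d+)?)"
    r"\.seismic_index$"
)


def parse_index_name(name):
    """Return the parameters encoded in an index file name, or None if it does not match."""
    match = INDEX_NAME_RE.match(name)
    if match is None:
        return None
    return {
        "n_postings": int(match.group("n_postings")),
        "energy": float(match.group("energy")),
        "centroid_fraction": float(match.group("centroid_fraction")),
    }


example = "GlobalThreshold_n-postings_1000_energy_0.1_centroid-fraction_0.05.seismic_index"
print(parse_index_name(example))
```

With this helper, a filter such as `[x for x in names if (p := parse_index_name(x)) and p["n_postings"] >= 1500]` selects indexes by value, where `names` stands for the listing of `index_folder`.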

In [6]:
grid_file = "../grid_search.sh"

In [22]:
for file in files:
    index_path = os.path.join(index_folder, file)
    result_path = os.path.join(result_folder, file)
    string = f"{base_command} {index_path} {result_path} {queries_path} {groundtruth_path}\n"
    with open(grid_file, "a") as f:
        f.write(string)
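
The loop above can also be written as a function that returns the commands, which makes them easy to inspect before writing the script. This is a minimal sketch: the helper name `search_commands` and the placeholder paths are assumptions, and the argument order (index, result, queries, groundtruth) simply follows the loop above rather than being checked against `scripts/grid_search.sh`. Sorting the names is an addition that makes the script order reproducible, since `os.listdir` returns entries in arbitrary order.

```python
import os
import shlex


def search_commands(index_names, index_folder, result_folder,
                    queries_path, groundtruth_path,
                    base_command="bash scripts/grid_search.sh"):
    """Return one search command per index name (hypothetical helper)."""
    commands = []
    for name in sorted(index_names):
        index_path = os.path.join(index_folder, name)
        result_path = os.path.join(result_folder, name)
        args = [index_path, result_path, queries_path, groundtruth_path]
        # Quote each argument so that paths with spaces survive in the shell script.
        commands.append(base_command + " " + " ".join(shlex.quote(a) for a in args))
    return commands


# Placeholder names and paths, for illustration only.
cmds = search_commands(
    ["b.seismic_index", "a.seismic_index"],
    "idx", "res", "queries.bin", "groundtruth.tsv",
)
for cmd in cmds:
    print(cmd)
```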