## Question generation

### Training sets

```bash
python scripts/generate_data/generate_pdfs_questions.py \
--pdf_folder data/scrapped_pdfs_split/pages_extracted/artificial_intelligence_train \
--n_samples 4000 \
--hub_dataset_name "coldoc/syntheticDocQA_artificial_intelligence_train" \
--split_name "train" \
--vertex_ai
```

```bash
python scripts/generate_data/generate_pdfs_questions.py \
--pdf_folder data/scrapped_pdfs_split/pages_extracted/energy_train \
--n_samples 4000 \
--hub_dataset_name "coldoc/syntheticDocQA_energy_train" \
--split_name "train" \
--vertex_ai
```

```bash
python scripts/generate_data/generate_pdfs_questions.py \
--pdf_folder data/scrapped_pdfs_split/pages_extracted/government-reports_train \
--n_samples 4000 \
--hub_dataset_name "coldoc/syntheticDocQA_government_reports_train" \
--split_name "train" \
--vertex_ai &> logs/government-reports_train_25-05-24.log
```

```bash
python scripts/generate_data/generate_pdfs_questions.py \
--pdf_folder data/scrapped_pdfs_split/pages_extracted/healthcare_industry_train \
--n_samples 4000 \
--hub_dataset_name "coldoc/syntheticDocQA_healthcare_industry_train" \
--split_name "train" \
--vertex_ai &> logs/healthcare_industry_train_26-05-24.log
```
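The four training commands above differ only in the domain name, so they can be sketched as a loop. This is a dry run that prints each command rather than executing it. The helper name `build_train_cmd` is hypothetical, and the hyphen-to-underscore mapping for the Hub dataset name is inferred from the `government-reports` command above.

```bash
# Dry-run sketch: print the training-set commands instead of running them.
# build_train_cmd is a hypothetical helper, not part of the repository.
build_train_cmd() {
  # Hub dataset names use underscores even where the folder name has a hyphen.
  hub_suffix=$(printf '%s' "$1" | tr '-' '_')
  echo "python scripts/generate_data/generate_pdfs_questions.py" \
    "--pdf_folder data/scrapped_pdfs_split/pages_extracted/${1}_train" \
    "--n_samples 4000" \
    "--hub_dataset_name \"coldoc/syntheticDocQA_${hub_suffix}_train\"" \
    "--split_name \"train\"" \
    "--vertex_ai"
}

# Domain names copied from the commands above.
for domain in artificial_intelligence energy government-reports healthcare_industry; do
  build_train_cmd "$domain"
done
```

To run a printed command, copy it into the shell and add a log redirect if needed.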

### Test sets

```bash
nohup python scripts/generate_data/generate_pdfs_questions.py \
--pdf_folder data/scrapped_pdfs_split/pages_extracted/artificial_intelligence_test \
--n_samples 150 \
--hub_dataset_name "coldoc/syntheticDocQA_artificial_intelligence_test" \
--split_name "test" \
--vertex_ai \
&> logs/generate_pdfs_questions/artificial_intelligence_test-28-05-24.log &
```

```bash
nohup python scripts/generate_data/generate_pdfs_questions.py \
--pdf_folder data/scrapped_pdfs_split/pages_extracted/energy_test \
--n_samples 150 \
--hub_dataset_name "coldoc/syntheticDocQA_energy_test" \
--split_name "test" \
--vertex_ai \
&> logs/generate_pdfs_questions/energy_test-28-05-24.log &
```

```bash
nohup python scripts/generate_data/generate_pdfs_questions.py \
--pdf_folder data/scrapped_pdfs_split/pages_extracted/government-reports_test \
--n_samples 150 \
--hub_dataset_name "coldoc/syntheticDocQA_government_reports_test" \
--split_name "test" \
--vertex_ai \
&> logs/generate_pdfs_questions/government-reports_test-28-05-24.log &
```

```bash
nohup python scripts/generate_data/generate_pdfs_questions.py \
--pdf_folder data/scrapped_pdfs_split/pages_extracted/healthcare_industry_test \
--n_samples 150 \
--hub_dataset_name "coldoc/syntheticDocQA_healthcare_industry_test" \
--split_name "test" \
--vertex_ai \
&> logs/generate_pdfs_questions/healthcare_industry_test-28-05-24.log &
```

### Shift dataset: question generation

```bash
nohup python scripts/generate_data/get_shift_dataset.py approvisionnement_petrolier --generate-questions &> logs/generate_data/get_shift_dataset/approvisionnement_petrolier.log &
nohup python scripts/generate_data/get_shift_dataset.py decarbonner_sante --generate-questions &> logs/generate_data/get_shift_dataset/decarbonner_sante.log &
nohup python scripts/generate_data/get_shift_dataset.py aviation --generate-questions &> logs/generate_data/get_shift_dataset/aviation.log &
nohup python scripts/generate_data/get_shift_dataset.py cartographie_transition --generate-questions &> logs/generate_data/get_shift_dataset/cartographie_transition.log &
nohup python scripts/generate_data/get_shift_dataset.py rapport_avancement --generate-questions &> logs/generate_data/get_shift_dataset/rapport_avancement.log &
```

### Shift dataset: filtering questions

```bash
python scripts/generate_data/get_shift_dataset.py approvisionnement_petrolier
python scripts/generate_data/get_shift_dataset.py decarbonner_sante
python scripts/generate_data/get_shift_dataset.py aviation
python scripts/generate_data/get_shift_dataset.py cartographie_transition
python scripts/generate_data/get_shift_dataset.py rapport_avancement
```

### Shift dataset: concatenate datasets

```bash
python scripts/generate_data/get_shift_dataset.py concat
```
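The three Shift dataset stages can be listed in one place. The sketch below only prints the commands, in the order the headings above suggest (generate questions, filter, concatenate). It assumes each stage must finish before the next starts, which the source does not state explicitly. The names `shift_reports` and `print_shift_pipeline` are hypothetical.

```bash
# Dry-run sketch: print the Shift dataset commands stage by stage.
# Report names are copied from the commands above.
shift_reports="approvisionnement_petrolier decarbonner_sante aviation cartographie_transition rapport_avancement"
script="scripts/generate_data/get_shift_dataset.py"

print_shift_pipeline() {
  # Stage 1: question generation for every report.
  for name in $shift_reports; do
    echo "python $script $name --generate-questions"
  done
  # Stage 2: filtering (same script, no flag).
  for name in $shift_reports; do
    echo "python $script $name"
  done
  # Stage 3: concatenate the per-report datasets.
  echo "python $script concat"
}

print_shift_pipeline
```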

## Push datasets to Hub

### Train sets

```bash
nohup python scripts/generate_data/make_benchmarks.py docvqa_train &> logs/make_benchmarks/docvqa_train-28-05-24.log &
python scripts/generate_data/make_benchmarks.py infovqa_train
```

### Eval sets

```bash
nohup python scripts/generate_data/make_benchmarks.py docvqa_eval &> logs/make_benchmarks/docvqa_eval-28-05-24.log &
python scripts/generate_data/make_benchmarks.py infovqa_eval
python scripts/generate_data/make_benchmarks.py tabfquad_test
nohup python scripts/generate_data/make_benchmarks.py tatdqa &> logs/make_benchmarks/tatdqa-28-05-24.log &
python scripts/generate_data/make_benchmarks.py arxivqa
```

## Unstructured

```bash
nohup python scripts/baselines/captioning_unstructured.py coldoc/tatdqa_test &> logs/baselines/captioning_unstructured/tatdqa_test-06-06-24.log &
nohup python scripts/baselines/captioning_unstructured.py coldoc/infovqa_test_subsampled &> logs/baselines/captioning_unstructured/infovqa_test_subsampled-05-06-24.log &
nohup python scripts/baselines/captioning_unstructured.py coldoc/docvqa_test_subsampled &> logs/baselines/captioning_unstructured/docvqa_test_subsampled-04-06-24.log &
nohup python scripts/baselines/captioning_unstructured.py coldoc/arxivqa_test_subsampled &> logs/baselines/captioning_unstructured/arxivqa_test_subsampled-04-06-24.log &
```

### Synthetic

```bash
nohup python scripts/baselines/captioning_unstructured.py coldoc/syntheticDocQA_energy_test &> logs/baselines/captioning_unstructured/syntheticDocQA_energy_test-05-06-24.log &
nohup python scripts/baselines/captioning_unstructured.py coldoc/syntheticDocQA_artificial_intelligence_test &> logs/baselines/captioning_unstructured/syntheticDocQA_artificial_intelligence_test-05-06-24.log &
nohup python scripts/baselines/captioning_unstructured.py coldoc/syntheticDocQA_government_reports_test &> logs/baselines/captioning_unstructured/government_reports_test-05-06-24.log &
nohup python scripts/baselines/captioning_unstructured.py coldoc/syntheticDocQA_healthcare_industry_test &> logs/baselines/captioning_unstructured/healthcare_industry_test-05-06-24.log &
nohup python scripts/baselines/captioning_unstructured.py coldoc/shiftproject_test &> logs/baselines/captioning_unstructured/shiftproject_subsampled-05-06-24.log &
```

## Measure latency

### Measure time to save embeddings with ColPali

```bash
python scripts/measure_latency/measure_saving_embeds.py --n-iter 1
```


## Tesseract

### Run Tesseract on a collection of datasets

```bash
python scripts/baselines/tesseract.py --collection-name "coldoc/syntheticdocqa-test-6655c59cfda461267c0d9ac8" &> logs/tesseract/synthdocqa-12-06-24.log &
python scripts/baselines/tesseract.py --collection-name "coldoc/existing-datasets-test-6655c5e0504da7ec0c14253c" &> logs/tesseract/existing-12-06-24.log &
```