# Ersatz make file
The goal of this file is to explain the structure of the different code files, which data products are made by which pieces of code, and where they are consumed. 

In the end, the goal is to be able to run the figures_01.ipynb notebook, which produces all the final paper figures. 

llama_nlp.yml contains the conda environment used to run the code. 

Some of the data was generated on a Slurm cluster with H100 GPUs. In practice, this is not necessary: the total GPU time for this project was less than a few hundred minutes on an H100, so a single H100 can handle the entire job. Running sweep_toy_noise.py directly with the appropriate command-line parameters removes the need for the Slurm submission scripts. The Slurm submission scripts attempt to activate the conda environment "llama_nlp".

## figures_01.ipynb -> all final paper figures
required data files: 

- synthetic_pruning_dataset.pkl
- all_accuracies.pickle
- data/sweep_toy_noise_mult_noise_summary.pickle
- data/sweep_toy_noise_summary.pickle
- block_diagonal_matrix.pickle
- wikipedia_empirics_results.pickle
- sparsed_P_matrices.pickle
- sparsed_performance_lognormed.pickle
- sparsed_performance_normed.pickle
- sparsified_eigenvalue_decay.pickle
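
Before opening the notebook, it can help to confirm that every file in the list above exists. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and it assumes the paths above are relative to the repository root.

```python
from pathlib import Path

# File names copied from the list above. Assumption: paths are relative
# to the repository root.
REQUIRED_FILES = [
    "synthetic_pruning_dataset.pkl",
    "all_accuracies.pickle",
    "data/sweep_toy_noise_mult_noise_summary.pickle",
    "data/sweep_toy_noise_summary.pickle",
    "block_diagonal_matrix.pickle",
    "wikipedia_empirics_results.pickle",
    "sparsed_P_matrices.pickle",
    "sparsed_performance_lognormed.pickle",
    "sparsed_performance_normed.pickle",
    "sparsified_eigenvalue_decay.pickle",
]


def missing_files(root=".", required=REQUIRED_FILES):
    """Return the required files that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing data files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All required data files are present.")
```

Each missing file can then be traced to the section below whose heading lists it as an output.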

## pruning_script.py  -> synthetic_pruning_dataset.pkl
Performs the vocabulary pruning experiment. Runs in a few minutes on an H100.

required data files:

- None!

## cooccur_stats.ipynb -> all_accuracies.pickle
Measures the accuracies for different matrix targets on the Wikipedia dataset.
required data files: 
- qwem/analogies.pickle
- qwem/word_counts.pickle
- qwem/corpus_stats.pickle

## analyze_toy_noise.ipynb -> data/sweep_toy_noise_summary.pickle
Processes the data/sweep_toy_noise/*.dat data into a summary file. 
required data files:
- data/sweep_toy_noise/*.dat

## analyze_toy_noise_mult_noise.ipynb -> data/sweep_toy_noise_mult_noise_summary.pickle
Processes the data/sweep_toy_noise_mult_noise/*.dat data into a summary file.

required data files:

- data/sweep_toy_noise_mult_noise/*.dat

## wikipedia_empirics.ipynb -> block_diagonal_matrix.pickle, wikipedia_empirics_results.pickle, wikipedia_empirics_results_individual.pickle
Runs the analysis of the corpus pruning experiment (zeroing out explicit examples of analogies in the Wikipedia data).

Outputs:

(i) block_diagonal_matrix.pickle
- data product is just a nice example of the block-diagonal matrix structure that tends to occur along the lines of analogies. 

(ii) wikipedia_empirics_results.pickle
- data product contains accuracies when pruning *all* the analogies from the co-occurrence matrix.

(iii) wikipedia_empirics_results_individual.pickle
- contains the accuracies when pruning one family of analogies from the co-occurrence matrix at a time.

required data files:
- qwem/analogies.pickle
- qwem/word_counts.pickle
- qwem/corpus_stats.pickle

## prototyping_pruning.ipynb -> sparsed_P_matrices.pickle, sparsed_performance_lognormed.pickle, sparsed_performance_normed.pickle, sparsified_eigenvalue_decay.pickle
Runs a few simulations of the performance for sparsified matrices (i.e. removing some words from the vocabulary). 

Outputs:

(i) sparsed_P_matrices.pickle
- example matrices for sparsified P

(ii) sparsed_performance_lognormed.pickle
- performance on the log M matrix target

(iii) sparsed_performance_normed.pickle 
- performance on the raw M matrix target. 

(iv) sparsified_eigenvalue_decay.pickle
- top d=12 eigenvalues as a function of sparsification density f


required data files:
- None!

## slurm_scripts/queue_noise_sweep.sh -> data/sweep_toy_noise/*.dat
Runs a sweep over different dimensions of the synthetic system, for different distributions of s_k.

required data files:

- None!

## slurm_scripts/queue_noise_sweep_mult_noise.sh -> data/sweep_toy_noise_mult_noise/*.dat
Runs a sweep over different dimensions of the synthetic system, for different distributions of s_k, and for different multiplicative noises.

required data files:

- None!
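
Both analysis notebooks read the raw sweep outputs, so it can be useful to confirm that the sweeps actually wrote files before running them. The sketch below only counts files matching the two documented patterns. It makes no assumption about the contents of the .dat files, and the helper name `count_sweep_outputs` is hypothetical.

```python
from pathlib import Path

# Glob patterns copied from the section headings above. Assumption: paths
# are relative to the repository root.
SWEEP_PATTERNS = [
    "data/sweep_toy_noise/*.dat",
    "data/sweep_toy_noise_mult_noise/*.dat",
]


def count_sweep_outputs(root=".", patterns=SWEEP_PATTERNS):
    """Return a dict mapping each pattern to the number of matching files."""
    root = Path(root)
    return {pattern: len(list(root.glob(pattern))) for pattern in patterns}


if __name__ == "__main__":
    for pattern, count in count_sweep_outputs().items():
        status = "ok" if count > 0 else "no files found"
        print(f"{pattern}: {count} ({status})")
```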

## compute_coocurrence.py -> corpus_stats.pickle 
Generates the co-occurrence matrices from Wikipedia word statistics.

Outputs: 

(i) corpus_stats.pickle 

- Contains the co-occurrence matrix for the 20,000 most common words on the English Wikipedia.
- Has both a re-weighted and an un-weighted variant of the co-occurrence matrix.

required data files:

- qwem/word_counts.pickle
- qwem/article_arr_idxs.npy
- qwem/enwiki.bin

## analogies_prepare_enwiki.py -> qwem/article_arr_idxs.npy, qwem/word_counts.pickle, qwem/enwiki.bin

Generates the basic Wikipedia word statistics, using HuggingFace's 2023 version of the English Wikipedia, restricted to articles with at least 500 tokens.
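
The sections above describe a dependency graph, and a valid run order can be derived from it mechanically. The sketch below encodes each step together with the steps whose outputs it consumes, then sorts them with the standard-library `graphlib` module (Python 3.9+). It rests on two assumptions: that the corpus_stats.pickle written by compute_coocurrence.py is the same file as qwem/corpus_stats.pickle, and that qwem/analogies.pickle, for which no producer is documented here, already exists.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Maps each step to the steps whose outputs it consumes, as documented in
# the sections above. Assumption: corpus_stats.pickle from
# compute_coocurrence.py is the qwem/corpus_stats.pickle read downstream.
WIKI_INPUTS = {"analogies_prepare_enwiki.py", "compute_coocurrence.py"}

DEPENDS_ON = {
    "analogies_prepare_enwiki.py": set(),
    "compute_coocurrence.py": {"analogies_prepare_enwiki.py"},
    "cooccur_stats.ipynb": WIKI_INPUTS,
    "wikipedia_empirics.ipynb": WIKI_INPUTS,
    "pruning_script.py": set(),
    "prototyping_pruning.ipynb": set(),
    "slurm_scripts/queue_noise_sweep.sh": set(),
    "slurm_scripts/queue_noise_sweep_mult_noise.sh": set(),
    "analyze_toy_noise.ipynb": {"slurm_scripts/queue_noise_sweep.sh"},
    "analyze_toy_noise_mult_noise.ipynb": {
        "slurm_scripts/queue_noise_sweep_mult_noise.sh"
    },
    "figures_01.ipynb": {
        "pruning_script.py",
        "cooccur_stats.ipynb",
        "analyze_toy_noise.ipynb",
        "analyze_toy_noise_mult_noise.ipynb",
        "wikipedia_empirics.ipynb",
        "prototyping_pruning.ipynb",
    },
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())

if __name__ == "__main__":
    for position, step in enumerate(order, start=1):
        print(f"{position:2d}. {step}")
```

Steps with no dependency on each other may appear in any relative order, but figures_01.ipynb always comes last because every other step feeds into it directly or indirectly.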