# Source code for "The emergence of the left-right asymmetry in predicting brain activity from LLMs' representation specifically correlates with their formal linguistic performance"

![The emergence of the left-right asymmetry in predicting brain activity from LLMs' representation specifically correlates with their formal linguistic performance](phase_transition.png)

## Dependencies

### Python modules

See `requirements.txt` for the full list of packages used in this work. This file pins the exact versions that were used, but the code is expected to work with other versions as well.

It is recommended to install the Python modules in a virtual environment, for example:

With Anaconda:

    conda create --name llm_brain python=3.10
    conda activate llm_brain
    pip install -r requirements.txt

Or with Pyenv:

    pyenv virtualenv 3.10.0 llm_brain
    pyenv activate llm_brain
    pip install -r requirements.txt

Or with uv:

    uv venv --python 3.10
    uv pip install -r requirements.txt
    source .venv/bin/activate

### fMRI data

The fMRI data can be obtained at openneuro.org: [doi:10.18112/openneuro.ds003643.v2.0.5](https://doi.org/10.18112/openneuro.ds003643.v2.0.5).

Note: the dataset is described in Li, J., Bhattasali, S., Zhang, S., Franzluebbers, B., Luh, W., Spreng, R. N., Brennen, J., Yang, Y., Pallier, C., & Hale, J. (2022). **Le Petit Prince multilingual naturalistic fMRI corpus**. _Scientific Data_, 9, 530.

Use the processing pipeline from Bonnasse-Gahot & Pallier (2024), available at [https://github.com/l-bg/llms_brain_lateralization](https://github.com/l-bg/llms_brain_lateralization), to compute the average English (or French) subject.

The pipeline will create:
* a folder for the average subject, `lpp_en_average_subject`, containing 9 files, one per run
* a `mask_lpp_en.gz` file
* an `isc_10trials_en.gz` file containing the inter-subject correlations

### Textual data

The text files of Le Petit Prince (e.g. `lpp_en_text.zip`) from [https://github.com/l-bg/llms_brain_lateralization](https://github.com/l-bg/llms_brain_lateralization) are needed to extract the internal activations from the LLMs. These files must be copied into the main directory (`home_folder`).

## Setting up paths

First, set all folder variables (in particular `home_folder` and `lpp_path`) in [llm_brain_asym.py](llm_brain_asym.py).

* `lpp_path` must point to the dataset downloaded from OpenNeuro
* `home_folder` must point to the folder containing the average subject folder, the mask file and the ISC file
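Before launching the pipeline, it can help to check that `home_folder` contains the files listed in the fMRI data section. The following is a minimal sketch, not part of the repository: the function name `missing_inputs` is hypothetical, and the French file names (`lpp_fr_average_subject`, `mask_lpp_fr.gz`, `isc_10trials_fr.gz`) are assumed by analogy with the English ones.

```python
from pathlib import Path


def missing_inputs(home_folder, lang="en", n_runs=9):
    """Return a list of problems found under home_folder (empty if all is in place).

    Hypothetical helper: file names follow the English examples in this README;
    names for other languages are an assumption.
    """
    home = Path(home_folder)
    problems = []

    # The average-subject folder should contain one file per run.
    subject_dir = home / f"lpp_{lang}_average_subject"
    if not subject_dir.is_dir():
        problems.append(f"missing folder: {subject_dir.name}")
    else:
        n_files = sum(1 for p in subject_dir.iterdir() if p.is_file())
        if n_files != n_runs:
            problems.append(
                f"{subject_dir.name}: expected {n_runs} files, found {n_files}"
            )

    # The mask and inter-subject correlation files sit next to that folder.
    for name in (f"mask_lpp_{lang}.gz", f"isc_10trials_{lang}.gz"):
        if not (home / name).is_file():
            problems.append(f"missing file: {name}")

    return problems
```

Calling `missing_inputs(home_folder)` with the value set in `llm_brain_asym.py` returns an empty list when everything is in place.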

## Main processing pipeline: OLMo-2-1124-7B model

1. Extract activations from the LLM and fit the average subject, using the two Python scripts `extract_llm_activations.py` and `fit_average_subject.py`, adapted from [https://github.com/l-bg/llms_brain_lateralization](https://github.com/l-bg/llms_brain_lateralization) by Bonnasse-Gahot & Pallier (2024). The following code is for English. For French, simply replace `en` with `fr` after the `--lang` option.
```sh
step=("150" "600" "1000" "3000" "7000" "19000" "51000" "133000" "352000" "928646")
tokens=("1" "3" "5" "13" "30" "80" "214" "558" "1477" "3896")
for i in "${!step[@]}"; do
    python extract_llm_activations.py --model allenai/OLMo-2-1124-7B --revision stage1-step${step[$i]}-tokens${tokens[$i]}B --lang en
    python fit_average_subject.py --model allenai_OLMo-2-1124-7B_stage1-step${step[$i]}-tokens${tokens[$i]}B --lang en
done
```
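The two shell arrays above must stay aligned, since each checkpoint revision combines a training step with a token count. As a hedged illustration, the naming convention used in these commands can be written in Python as follows; the helper names (`revision_name`, `fit_model_name`, `checkpoints`) are hypothetical and not part of the repository.

```python
# Training steps and token counts (in billions), copied from the shell arrays above.
STEPS = [150, 600, 1000, 3000, 7000, 19000, 51000, 133000, 352000, 928646]
TOKENS_B = [1, 3, 5, 13, 30, 80, 214, 558, 1477, 3896]


def revision_name(step, tokens_b):
    """Revision string passed to --revision, e.g. 'stage1-step150-tokens1B'."""
    return f"stage1-step{step}-tokens{tokens_b}B"


def fit_model_name(model, revision):
    """Name passed to fit_average_subject.py in the commands above:
    the '/' in the model id becomes '_' and the revision is appended."""
    return f"{model.replace('/', '_')}_{revision}"


def checkpoints(model="allenai/OLMo-2-1124-7B"):
    """Return (revision, fit_model_name) pairs for every checkpoint."""
    if len(STEPS) != len(TOKENS_B):
        raise ValueError("STEPS and TOKENS_B must have the same length")
    pairs = []
    for step, tokens_b in zip(STEPS, TOKENS_B):
        revision = revision_name(step, tokens_b)
        pairs.append((revision, fit_model_name(model, revision)))
    return pairs
```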

1. Evaluate performance on the minimal-pair benchmarks: BLiMP, Zorro, Arithmetic, and Dyck.
First download the data for BLiMP and Zorro: in the home directory, run `git clone https://github.com/alexwarstadt/blimp` and `git clone https://github.com/phueb/Zorro`. Then run the following commands (adjust the batch size to your hardware):
```sh
step=("150" "600" "1000" "3000" "7000" "19000" "51000" "133000" "352000" "928646")
tokens=("1" "3" "5" "13" "30" "80" "214" "558" "1477" "3896")
for i in "${!step[@]}"; do
    python evaluate_llm_blimp.py --model allenai/OLMo-2-1124-7B --revision stage1-step${step[$i]}-tokens${tokens[$i]}B --batch_size 64 --output_folder 'blimp_results' --device cuda
    python evaluate_llm_zorro.py --model allenai/OLMo-2-1124-7B --revision stage1-step${step[$i]}-tokens${tokens[$i]}B --batch_size 64 --output_folder 'zorro_results' --device cuda
    python evaluate_llm_arithmetic.py --model allenai/OLMo-2-1124-7B --revision stage1-step${step[$i]}-tokens${tokens[$i]}B --batch_size 128 --output_folder 'arithmetic_results' --seed 12345 --device cuda
    python evaluate_llm_dyck.py --model allenai/OLMo-2-1124-7B --revision stage1-step${step[$i]}-tokens${tokens[$i]}B --batch_size 32 --output_folder 'dyck_results' --seed 12345 --device cuda
done
```
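For readers unfamiliar with minimal-pair benchmarks, the sketch below shows how accuracy is usually computed under the BLiMP convention: a pair counts as correct when the model assigns a higher probability to the acceptable sentence than to the unacceptable one. This is an illustration under that assumption, not the code of the evaluation scripts above; `minimal_pair_accuracy`, `toy_score` and the log-probabilities in `TOY_LOGPROBS` are made up for the example.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence gets the strictly higher score.

    `score` maps a sentence to a number, typically the sentence
    log-probability under the language model.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Made-up log-probabilities standing in for a real model (illustration only).
TOY_LOGPROBS = {
    "The cats sleep.": -12.0,
    "The cats sleeps.": -15.5,
    "She has left.": -9.0,
    "She have left.": -8.5,
}


def toy_score(sentence):
    """Look up the made-up log-probability of a sentence."""
    return TOY_LOGPROBS[sentence]


pairs = [
    ("The cats sleep.", "The cats sleeps."),  # toy scorer prefers the acceptable one
    ("She has left.", "She have left."),      # toy scorer prefers the unacceptable one
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
```

With a real model, `score` would sum the token log-probabilities of the sentence; chance level on such benchmarks is 0.5.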

1. Evaluate the linguistic acceptability of generated texts. First generate the texts for each checkpoint using `generate_texts.py`:
```sh
step=("150" "600" "1000" "3000" "7000" "19000" "51000" "133000" "352000" "928646")
tokens=("1" "3" "5" "13" "30" "80" "214" "558" "1477" "3896")
for i in "${!step[@]}"; do
    python generate_texts.py --model allenai/OLMo-2-1124-7B --revision stage1-step${step[$i]}-tokens${tokens[$i]}B --output_folder llm_gen --seed 12345 --device cuda
done
```
Then evaluate all these texts using `evaluate_linguistic_acceptability.py`:
```sh
python evaluate_linguistic_acceptability.py
```

1. Evaluate performance on Hellaswag using `lm_eval` from EleutherAI.
```sh
step=("150" "600" "1000" "3000" "7000" "19000" "51000" "133000" "352000" "928646")
tokens=("1" "3" "5" "13" "30" "80" "214" "558" "1477" "3896")
for i in "${!step[@]}"; do
    lm_eval --model hf \
      --model_args pretrained=allenai/OLMo-2-1124-7B,revision=stage1-step${step[$i]}-tokens${tokens[$i]}B,dtype="float16" \
      --tasks hellaswag \
      --num_fewshot 5 \
      --device cuda:0 \
      --batch_size auto:4 \
      --output_path lm_eval_results/OLMo-2-1124-7B_stage1-step${step[$i]}-tokens${tokens[$i]}B_hellaswag.json
done
```
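The `lm_eval` invocations in this README differ only in the task name and batch size. As a hedged sketch, the command for one checkpoint and one task can be assembled in Python as shown below; `lm_eval_command` is a hypothetical helper, and it uses only the flags that appear in the shell commands of this README.

```python
def lm_eval_command(
    step,
    tokens_b,
    task,
    batch_size="auto:4",
    model="allenai/OLMo-2-1124-7B",
    num_fewshot=5,
    device="cuda:0",
    output_dir="lm_eval_results",
):
    """Build the lm_eval argument list for one checkpoint and one task.

    Hypothetical helper mirroring the shell commands in this README;
    it only builds the command and does not run it.
    """
    revision = f"stage1-step{step}-tokens{tokens_b}B"
    short_name = model.split("/")[-1]  # e.g. 'OLMo-2-1124-7B'
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model},revision={revision},dtype=float16",
        "--tasks", task,
        "--num_fewshot", str(num_fewshot),
        "--device", device,
        "--batch_size", str(batch_size),
        "--output_path", f"{output_dir}/{short_name}_{revision}_{task}.json",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids shell quoting issues.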

1. Same for ARC:
```sh
step=("150" "600" "1000" "3000" "7000" "19000" "51000" "133000" "352000" "928646")
tokens=("1" "3" "5" "13" "30" "80" "214" "558" "1477" "3896")
for i in "${!step[@]}"; do
    lm_eval --model hf \
      --model_args pretrained=allenai/OLMo-2-1124-7B,revision=stage1-step${step[$i]}-tokens${tokens[$i]}B,dtype="float16" \
      --tasks ai2_arc \
      --num_fewshot 5 \
      --device cuda:0 \
      --batch_size auto:8 \
      --output_path lm_eval_results/OLMo-2-1124-7B_stage1-step${step[$i]}-tokens${tokens[$i]}B_ai2_arc.json
done
```

1. Same for French grammar (fr-grammar):
```sh
step=("150" "600" "1000" "3000" "7000" "19000" "51000" "133000" "352000" "928646")
tokens=("1" "3" "5" "13" "30" "80" "214" "558" "1477" "3896")
for i in "${!step[@]}"; do
    lm_eval --model hf \
      --model_args pretrained=allenai/OLMo-2-1124-7B,revision=stage1-step${step[$i]}-tokens${tokens[$i]}B,dtype="float16" \
      --tasks french_bench_grammar \
      --num_fewshot 5 \
      --device cuda:0 \
      --batch_size auto:8 \
      --output_path lm_eval_results/OLMo-2-1124-7B_stage1-step${step[$i]}-tokens${tokens[$i]}B_french_bench_grammar.json
done
```

1. Same for French Hellaswag:
```sh
step=("150" "600" "1000" "3000" "7000" "19000" "51000" "133000" "352000" "928646")
tokens=("1" "3" "5" "13" "30" "80" "214" "558" "1477" "3896")
for i in "${!step[@]}"; do
    lm_eval --model hf \
       --model_args pretrained=allenai/OLMo-2-1124-7B,revision=stage1-step${step[$i]}-tokens${tokens[$i]}B,dtype="float16" \
       --tasks french_bench_hellaswag \
       --num_fewshot 5 \
       --device cuda:0 \
       --batch_size 4 \
       --output_path lm_eval_results/OLMo-2-1124-7B_stage1-step${step[$i]}-tokens${tokens[$i]}B_french_bench_hellaswag.json
done
```


1. Analyze and visualize the results, reproducing all the figures in the paper, by running the Jupyter notebook `analyze_results_olmo2.ipynb`.

## Reproduce the results using models from the Pythia family

Replicate the results on EleutherAI/pythia-2.8b and EleutherAI/pythia-6.9b by following the same procedure.
