# Supplementary Material: Circuits, features, and heuristics in molecular transformers

This repository contains the code and resources needed to reproduce the findings presented in the paper. The files are organized by the main components of our research: the core library, experiment scripts, analysis notebooks, and summary and plotting scripts.

### Core Library (`lmkit/`)

This directory contains the foundational code for the models and analysis tools used throughout the project.

-   **`lmkit/impl/`**: Core implementation of the transformer model, including attention mechanisms and caching.
-   **`lmkit/sparse/`**: Implementation of the Sparse Autoencoder (SAE) (`sae.py`), tools for chemical feature analysis (`fragment_mapper.py`, `fragment_metrics.py`), and the SMARTS pattern library (`MolSAE_SMARTS_v1.1_leadlike.yaml`).
-   **`lmkit/tools/`**: General utilities for data processing, training, and SMILES manipulation.
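
For readers unfamiliar with sparse autoencoders, the sketch below shows one common formulation: a ReLU encoder, a linear decoder, and a loss that combines reconstruction error with an L1 sparsity penalty. It is a toy in pure Python and is not the code in `lmkit/sparse/sae.py`. The class name `TinySAE`, the initialization scheme, and the `l1_coeff` default are all assumptions made for illustration.

```python
import random


def relu(x):
    """Rectified linear unit for a single float."""
    return x if x > 0.0 else 0.0


class TinySAE:
    """Hypothetical toy sparse autoencoder (not the repository's implementation)."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / d_model ** 0.5
        # Encoder weights: one row per SAE feature, one column per model dimension.
        self.w_enc = [[rng.gauss(0.0, scale) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        # Decoder weights: one row per model dimension, one column per SAE feature.
        self.w_dec = [[rng.gauss(0.0, scale) for _ in range(d_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map an activation vector to non-negative feature activations."""
        return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, f):
        """Reconstruct the activation vector from feature activations."""
        return [sum(w * fi for w, fi in zip(row, f)) + b
                for row, b in zip(self.w_dec, self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Mean squared reconstruction error plus an L1 penalty on features."""
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(abs(v) for v in f)
        return mse + l1_coeff * l1


# Usage with an assumed 4-dimensional activation and 8 SAE features.
sae = TinySAE(d_model=4, d_features=8, seed=0)
x = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(x)
reconstruction = sae.decode(features)
```

The feature dimension is usually larger than the model dimension, so that individual features can specialize; the actual sizes and training details used in the paper are defined by the repository code, not by this sketch.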

### Experiments & Analyses

These scripts are used to run the primary experiments and generate the main results of the paper.

-   **`experiments/pointers/`** and **`experiments/pointer_suite_all/`**: Scripts for analyzing syntactic circuits. They identify and validate the attention heads that handle SMILES grammar, such as ring and branch closures.
-   **`experiments/valence/`**: Scripts for probing and analyzing the model's understanding of chemical valence. This includes training linear probes, performing causal interventions, and localizing valence-related heads.
-   **`fragment_scan.py`**: The main script for screening thousands of SAE features against a library of chemical substructures (SMARTS patterns) to identify interpretable "detector" features.
-   **`experiments/tdc_benchmarks.py`**: Script for evaluating the learned SAE features on a suite of downstream ADMET and pharmacokinetic prediction tasks from the Therapeutics Data Commons (TDC).
-   **`downstream_feature_importance.py`**: Analyzes the results of downstream tasks to determine which SAE features are most predictive.
-   **`verify_by_ablation.py`**: Performs causal validation by ablating (zeroing out) specific, important SAE features and measuring the impact on downstream task performance.
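
The ablation check described in the last item can be sketched as follows: zero out selected feature columns, re-evaluate a fixed downstream predictor, and report the drop in the metric. This is a minimal illustration under stated assumptions, not the logic of `verify_by_ablation.py`. The linear readout, the accuracy metric, the toy data, and every function name here are hypothetical, and the sketch does not retrain the predictor after ablation.

```python
def predict(features, weights, bias=0.0):
    """Linear readout: 1 if the weighted feature sum exceeds zero, else 0."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0.0 else 0


def accuracy(rows, labels, weights):
    """Fraction of rows whose prediction matches the label."""
    correct = sum(predict(r, weights) == y for r, y in zip(rows, labels))
    return correct / len(labels)


def ablate(rows, feature_ids):
    """Return a copy of rows with the given feature columns set to zero."""
    drop = set(feature_ids)
    return [[0.0 if j in drop else v for j, v in enumerate(r)] for r in rows]


def ablation_effect(rows, labels, weights, feature_ids):
    """Accuracy before ablation minus accuracy after ablation."""
    base = accuracy(rows, labels, weights)
    ablated = accuracy(ablate(rows, feature_ids), labels, weights)
    return base - ablated


# Toy data (assumed): 4 molecules, 3 SAE features, binary labels.
rows = [[1.0, 0.2, 0.0], [0.9, 0.0, 0.3], [0.0, 0.1, 0.4], [0.0, 0.3, 0.1]]
labels = [1, 1, 0, 0]
weights = [2.0, 0.1, -1.0]

effect_important = ablation_effect(rows, labels, weights, [0])
effect_unimportant = ablation_effect(rows, labels, weights, [1])
```

A large positive effect suggests the predictor relies on the ablated features, while an effect near zero suggests they are redundant for that task.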

### Analysis & Visualization Notebooks (`notebooks/`)

These notebooks provide interactive examples and detailed workflows for the key analyses.

-   **`notebooks/atlas.ipynb`**: The primary notebook for exploring and annotating the SAE feature atlas. It demonstrates the process of linking abstract features to concrete chemical concepts.
-   **`notebooks/visualizer.ipynb`**: Contains code for generating detailed visualizations of individual SAE features, showing their top-activating molecular contexts.
-   **`notebooks/downstream.ipynb`**: Details the setup, execution, and analysis of the downstream benchmark evaluations.
-   **`notebooks/artifacts.ipynb`**: Aggregates and summarizes results across experiments at a high level.
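
As a rough illustration of what "top-activating molecular contexts" means, the sketch below ranks token positions by the activation of one feature and returns a small window of surrounding tokens for each. It is not taken from `notebooks/visualizer.ipynb`. The function name `top_activating`, the character-level tokenization, and the activation values are assumptions chosen only to make the example self-contained.

```python
import heapq


def top_activating(activations, tokens, feature_id, k=3, window=2):
    """Return (activation, position, context) for the k strongest positions.

    activations: one list of feature values per token position.
    tokens: the token strings, aligned with activations.
    """
    scored = [(acts[feature_id], i) for i, acts in enumerate(activations)]
    results = []
    for value, i in heapq.nlargest(k, scored):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        results.append((value, i, "".join(tokens[lo:hi])))
    return results


# Acetic acid, CC(=O)O, split per character (an assumed tokenization).
tokens = ["C", "C", "(", "=", "O", ")", "O"]
# One hypothetical feature that fires mostly on oxygen atoms.
activations = [[0.0], [0.1], [0.0], [0.2], [0.9], [0.0], [0.7]]

contexts = top_activating(activations, tokens, feature_id=0, k=2, window=1)
```

Here `contexts` holds the two oxygen positions with their neighboring tokens, which is the kind of evidence used to judge whether a feature tracks a recognizable chemical pattern.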

### Summary & Plotting Scripts

These scripts, located in the root directory and in `experiments/`, process raw experimental outputs and generate the figures and tables presented in the paper.

-   `summarize_*.py` files: A collection of scripts that aggregate raw data from different experiments (e.g., `summarize_fragment_scan_wsd.py`, `summarize_downstream_importance.py`) into coherent tables and summaries.
-   `make_*.py` and `*_figures.py` files: Scripts dedicated to generating the final plots and figures for the paper, such as `experiments/pointer_suite_all/make_plots.py` and `make_downstream_plots.py`.