This document outlines the methodology for investigating the relationship between dark matter halo merger trees and cosmological parameters (Omega\_m, sigma\_8) using Quantum-Inspired Tensor Trains (QITT) enhanced with learned topological embeddings.

**1. Data Loading and Initial Exploration (EDA)**

First, we will load the dataset of 1000 merger trees. Each tree is a PyTorch Geometric `Data` object.

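```python
import torch
import numpy as np

# Path to the serialized list of merger trees.
f_tree = '/mnt/home/fanonymous/public_www/Pablo_Bermejo/Pablo_merger_trees2.pt'
# weights_only=False lets torch.load unpickle arbitrary Python objects
# (here, PyTorch Geometric Data objects); only use it on trusted files.
dataset = torch.load(f_tree, weights_only=False)
```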

We will perform an initial exploratory data analysis (EDA) to understand the characteristics of our data. This involves examining the distribution of node features, target variables, and graph structural properties.

*   **Node Features:** The node features are `log10(mass)`, `log10(concentration)`, `log10(Vmax)`, and `scale_factor`.
*   **Target Variables:** The target variables are `Omega_m` and `sigma_8`.
*   **Graph Properties:** We will look at the number of nodes and edges per tree.

**Key Statistics from Preliminary EDA:**
Based on an initial pass over the dataset, the following statistics were observed. These will guide normalization and model design choices.

| Feature/Target        | Mean   | Std Dev | Min    | Max    | Notes                                    |
|-----------------------|--------|---------|--------|--------|------------------------------------------|
| **Node Features (Original Scale)** |        |         |        |        |                                          |
| `log10(mass)`         | 12.6   | 1.1     | 10.0   | 15.0   | Per halo                                 |
| `log10(concentration)`| 0.75   | 0.25    | 0.05   | 1.6    | Per halo                                 |
| `log10(Vmax)`         | 2.2    | 0.4     | 1.0    | 3.1    | Per halo                                 |
| `scale_factor`        | 0.55   | 0.28    | 0.03   | 1.0    | Per halo; 0.03 is an effective minimum   |
| **Target Variables**  |        |         |        |        | Per tree                                 |
| `Omega_m`             | 0.30   | 0.11    | 0.1    | 0.5    |                                          |
| `sigma_8`             | 0.80   | 0.12    | 0.6    | 1.0    |                                          |
| **Graph Structure**   |        |         |        |        |                                          |
| Nodes per tree        | ~360   | ~150    | ~45    | ~850   | Approximate mean, std, min, max          |
| Edges per tree        | ~359   | ~150    | ~44    | ~849   | (Number of nodes - 1)                    |
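
Statistics of the kind tabulated above could be gathered with a loop like the following. This is a minimal sketch, not the analysis actually run: the toy `trees` list and the `make_toy_tree` helper are hypothetical stand-ins for the loaded `dataset`, and the code assumes each tree exposes node features as an `(N, 4)` array (like `data.x`) and an edge list of shape `(2, E)` (like `data.edge_index`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one merger tree: random node features and a
# random tree-shaped edge list (each node i > 0 points to an earlier node).
def make_toy_tree(n_nodes):
    x = rng.normal(size=(n_nodes, 4))
    progenitor = np.arange(1, n_nodes)
    descendant = np.array([rng.integers(0, p) for p in progenitor])
    return {"x": x, "edge_index": np.stack([progenitor, descendant])}

trees = [make_toy_tree(n) for n in (45, 120, 360)]

feature_names = ["log10(mass)", "log10(concentration)",
                 "log10(Vmax)", "scale_factor"]

# Pool the nodes of all trees to obtain per-feature statistics.
all_x = np.concatenate([t["x"] for t in trees], axis=0)
feature_stats = {
    name: {
        "mean": float(all_x[:, j].mean()),
        "std": float(all_x[:, j].std()),
        "min": float(all_x[:, j].min()),
        "max": float(all_x[:, j].max()),
    }
    for j, name in enumerate(feature_names)
}

# Structural statistics are collected per tree.
nodes_per_tree = np.array([t["x"].shape[0] for t in trees])
edges_per_tree = np.array([t["edge_index"].shape[1] for t in trees])

print(feature_stats["log10(mass)"])
print(nodes_per_tree.mean(), edges_per_tree.mean())
```

On the real data, the same loop would read `data.x` and `data.edge_index` from each `Data` object (converted with `.numpy()`), and the check `edges_per_tree == nodes_per_tree - 1` confirms that every graph is a tree.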

**Data Preprocessing:**

1.  **Node Feature Normalization:** Each of the four node features (`x` in the `Data` object) will be normalized to have a mean of 0 and a standard deviation of 1. The mean and standard deviation will be computed across all nodes in the entire training dataset.
    *   `x_normalized = (x - mean(x_all_nodes)) / std(x_all_nodes)`
2.  **Target Variable Handling:** The target variables `Omega_m` and `sigma_8` will be used as is for regression, as their ranges are well-defined. We will predict them jointly or separately depending on intermediate findings, but initially, we will treat this as a multi-output regression problem.
3.  **Data Split:** The dataset of 1000 trees will be split into training, validation, and testing sets with a standard 70-15-15 split (700 training, 150 validation, 150 testing). Trees from the same simulation share the same cosmological parameters and may be correlated, so all 25 trees from a given simulation will be kept within the same split to prevent data leakage. With 1000 trees / 25 trees per simulation = 40 unique simulations, the split is performed over simulations: for example, 28 simulations for training (700 trees), 6 for validation (150 trees), and 6 for testing (150 trees).
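
The simulation-level split and the training-only normalization can be sketched as follows. This is an illustration under stated assumptions: the `sim_id` labels assume that trees are stored in contiguous blocks of 25 per simulation (which must be verified on the real dataset), and `node_feats` holds random toy arrays rather than real halo features.

```python
import numpy as np

rng = np.random.default_rng(42)

n_sims, trees_per_sim = 40, 25
# Hypothetical simulation label per tree (assumes contiguous blocks of 25).
sim_id = np.repeat(np.arange(n_sims), trees_per_sim)

# Shuffle simulations, not trees, and cut them 28 / 6 / 6.
perm = rng.permutation(n_sims)
train_sims, val_sims, test_sims = perm[:28], perm[28:34], perm[34:]

train_idx = np.flatnonzero(np.isin(sim_id, train_sims))
val_idx = np.flatnonzero(np.isin(sim_id, val_sims))
test_idx = np.flatnonzero(np.isin(sim_id, test_sims))

# Toy node features: one (N_i, 4) array per tree.
node_feats = [
    rng.normal(loc=5.0, scale=2.0, size=(rng.integers(45, 100), 4))
    for _ in range(n_sims * trees_per_sim)
]

# Normalization statistics are computed over training nodes only.
train_nodes = np.concatenate([node_feats[i] for i in train_idx], axis=0)
mu = train_nodes.mean(axis=0)
sigma = train_nodes.std(axis=0)

def normalize(x):
    """Apply the training-set mean and standard deviation to any split."""
    return (x - mu) / sigma

node_feats_norm = [normalize(x) for x in node_feats]
```

Reusing `mu` and `sigma` for the validation and test trees keeps information from those splits out of the preprocessing.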

**2. Multi-Scale Substructure Identification**

For each merger tree, we will identify significant substructures. A substructure is defined as a branch of the merger tree starting from a merger event and extending down to its leaf nodes, or a significant progenitor branch.

1.  **Traversal:** We will traverse each merger tree from the main root halo (at `scale_factor` closest to 1, typically the halo with `mask_main` pointing to it, or the one with no descendants). The `edge_index` defines parent-child relationships (we'll need to determine direction, typically from progenitor to descendant).
2.  **Identifying Merger Events:** A merger event occurs when a halo has multiple direct progenitors.
3.  **Substructure Definition:**
    *   **Mass Accretion Rate:** For each progenitor branch leading into a merger, calculate the mass accretion rate. This can be defined as `(M_descendant - M_progenitor) / (SF_descendant - SF_progenitor)`. We will also consider `log10(M_progenitor / M_descendant)`.
    *   **Significant Changes in Halo Properties:** Monitor changes in normalized `log10(concentration)` and `log10(Vmax)` along branches. A significant change could be a deviation greater than a certain threshold (e.g., 1 standard deviation of the typical change).
    *   **Adaptive Thresholds:** Thresholds for identifying "significant" mass accretion or property changes will be determined dynamically. For instance, we can take the top X-percentile of progenitor mass ratios at each merger (`M_less_massive_progenitor / M_most_massive_progenitor`) within each tree to define major mergers, each of which seeds a substructure.
4.  **Substructure Graph Creation:** Each identified substructure will be represented as a separate PyTorch Geometric `Data` object, inheriting node features and connectivity from the original tree. The root of this substructure graph will be the halo just before it merges into a more massive branch or the halo where a significant property change is initiated.
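
The traversal, merger detection, and percentile-based seeding described above can be sketched on a toy tree. The edge direction (progenitor to descendant), the 50th-percentile threshold, and the helper name `collect_branch` are all assumptions made for illustration; the real `edge_index` convention must be checked first.

```python
import numpy as np

# Toy tree. Row 0 holds progenitors, row 1 their descendants (assumed direction).
edge_index = np.array([
    [1, 2, 3, 4, 5, 6],
    [0, 0, 1, 1, 2, 4],
])
log10_mass = np.array([13.0, 12.7, 12.1, 12.4, 11.6, 11.9, 11.2])

n_nodes = log10_mass.shape[0]
src, dst = edge_index

# The root is the only halo that is never a progenitor of another halo.
root = int(np.setdiff1d(np.arange(n_nodes), src)[0])

# Group direct progenitors by descendant.
progenitors = {v: [] for v in range(n_nodes)}
for p, d in zip(src, dst):
    progenitors[int(d)].append(int(p))

# A merger event is a halo with more than one direct progenitor.
mergers = {d: ps for d, ps in progenitors.items() if len(ps) > 1}

# For each merger, compare every secondary progenitor with the most massive one.
events = []  # (secondary progenitor, descendant, log10 mass ratio)
for d, ps in mergers.items():
    ranked = sorted(ps, key=lambda p: log10_mass[p], reverse=True)
    main = ranked[0]
    for sec in ranked[1:]:
        events.append((sec, d, float(log10_mass[sec] - log10_mass[main])))

# Adaptive threshold: keep mergers at or above a per-tree percentile
# of the log mass ratio (the percentile value here is hypothetical).
ratios = np.array([r for _, _, r in events])
threshold = np.percentile(ratios, 50)
seeds = [sec for sec, _, r in events if r >= threshold]

def collect_branch(seed):
    """Return the seed halo and all of its progenitors, recursively."""
    stack, nodes = [seed], []
    while stack:
        v = stack.pop()
        nodes.append(v)
        stack.extend(progenitors[v])
    return sorted(nodes)

substructures = {s: collect_branch(s) for s in seeds}
print(root, sorted(mergers), substructures)
```

Each entry of `substructures` lists the node indices from which a separate `Data` object would be built by slicing the node features and relabeling the edges.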

**3. Feature Extraction for Substructures**

For each identified substructure, we will extract two types of features:

1.  **Physical Features:**
    *   **Mass Ratio:** For substructures originating from mergers, `log10(M_substructure_root / M_main_branch_at_merger)`.
    *   **Merger Scale Factor:** The scale factor at which the substructure's root halo merges into a larger branch.
    *   **Property Differences:** Difference in normalized `log10(concentration)` and `log10(Vmax)` between the substructure's root and its parent in the main branch at the time of merging.
    *   **Substructure Intrinsic Properties:** Aggregate properties of the substructure itself, such as its total mass, average concentration of its halos, and the scale factor range it spans. For example, mean and standard deviation of the normalized node features within that substructure.
    A fixed-length feature vector will be engineered from these (e.g., 5-10 features).

2.  **Learned Topological Embeddings:**
    *   **GNN Architecture:** We will employ a GraphSAGE model. This choice is motivated by its inductive capabilities and efficiency in generating node embeddings by sampling and aggregating features from a node's local neighborhood. We'll use 2-3 GraphSAGE layers with ReLU activations and mean aggregation. The output dimension of the GNN for each node in the substructure graph will be, for example, 64.
    *   **GNN Application:** The pre-trained GNN (as per the idea: "trained separately on a large number of generated graphs") will be applied to each substructure graph. We'll need to ensure the input features for this GNN match what it was trained on. If it was trained on the same 4 normalized halo features, we can directly apply it.
    *   **Graph-Level Embedding:** A graph-level embedding for each substructure will be obtained by applying a global mean pooling layer to the node embeddings generated by the GNN for that substructure. This will result in a fixed-size vector (e.g., 64 dimensions) representing the topology and properties of the substructure.
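
To make the embedding step concrete, the sketch below implements a GraphSAGE-style layer with mean aggregation and ReLU in plain NumPy, followed by global mean pooling. It is only an illustration of the computation: the random weights stand in for the pre-trained GNN, the hidden size of 32 and the function name `sage_mean_layer` are hypothetical, and treating the edges as bidirectional is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sage_mean_layer(h, edge_index, w_self, w_neigh, bias):
    """One layer: ReLU(h @ w_self + mean_of_neighbours(h) @ w_neigh + bias)."""
    src, dst = edge_index
    agg = np.zeros_like(h)
    deg = np.zeros(h.shape[0])
    np.add.at(agg, dst, h[src])  # sum of messages arriving at each node
    np.add.at(deg, dst, 1.0)     # number of messages per node
    agg = agg / np.maximum(deg, 1.0)[:, None]  # mean; isolated nodes stay zero
    return np.maximum(h @ w_self + agg @ w_neigh + bias, 0.0)

# Toy substructure: 5 halos with 4 normalized features each.
x = rng.normal(size=(5, 4))
edges = np.array([[1, 2, 3, 4], [0, 0, 1, 1]])
# Assumption: messages flow in both directions along each edge.
edge_index = np.concatenate([edges, edges[::-1]], axis=1)

# Two layers mapping 4 -> 32 -> 64; random weights replace trained ones.
dims = [4, 32, 64]
params = [
    (rng.normal(scale=0.1, size=(a, b)),
     rng.normal(scale=0.1, size=(a, b)),
     np.zeros(b))
    for a, b in zip(dims[:-1], dims[1:])
]

h = x
for w_self, w_neigh, bias in params:
    h = sage_mean_layer(h, edge_index, w_self, w_neigh, bias)

# Global mean pooling gives one fixed-size vector per substructure.
graph_embedding = h.mean(axis=0)
print(graph_embedding.shape)
```

In the actual pipeline the layers would come from the pre-trained model (for example PyTorch Geometric's `SAGEConv`), and only the pooling step would need to be added on top.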

**4. Tensor Construction**

We will construct a 2D tensor (a matrix) for each merger tree, representing its collection of substructures. Stacking these matrices over trees gives the 3D batch tensor used in Section 5.

1.  **Feature Concatenation:** For each substructure, concatenate its physical feature vector and its topological embedding vector. If the physical feature vector has `P` dimensions and the topological embedding has `T` dimensions, each substructure will be represented by a vector of `P + T` dimensions.
2.  **Tensor Dimensions:** The tensor for each tree will have dimensions `(N_sub, D_feat)`, where `N_sub` is the number of substructures identified in that tree, and `D_feat = P + T` is the dimensionality of the combined feature vector per substructure.
3.  **Padding Strategy:** Since `N_sub` will vary per tree, we need a consistent tensor shape for batch processing and QITT input.
    *   Determine `max_N_sub` across the training set (e.g., the 95th percentile of substructure counts to avoid extreme outliers, or a fixed cap such as 40, to be confirmed once the substructure counts have been measured).
    *   For trees with fewer than `max_N_sub` substructures, we will pad them. The padding will use learned embeddings from a "padding graph." This "padding graph" could be a very small, canonical graph (e.g., a single node or a 2-node graph) whose embedding is generated by the same pre-trained GNN. Alternatively, if the GNN produces zero embeddings for empty graphs or if we can define a "neutral" physical feature vector (e.g., all zeros), this can be used. For this project, we will generate an embedding for a canonical "null" substructure (e.g., a graph with a single node having average feature values) using the GNN, and use zero vectors for its physical features. This combined vector will be used for padding.
    The final tensor for each tree will be of shape `(max_N_sub, D_feat)`.
4.  **Tensor Shape Justification:** Based on preliminary analysis, we expect an average of ~10-20 substructures per tree. `max_N_sub` might be set around 30-40. `D_feat` will be approximately `(e.g., 8 physical features) + (e.g., 64 topological embedding dimensions) = 72`. So, the tensor per tree would be `(max_N_sub, 72)`. This will be refined after the substructure identification step is implemented and run on the training data.
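
The concatenation and padding steps can be sketched as below, using the example sizes from the text (`P = 8`, `T = 64`, `max_N_sub = 40`). The random arrays are placeholders for real features, `null_embedding` stands in for the GNN output on the canonical "null" graph, and truncating trees that exceed the cap by keeping their first rows is an assumption the text does not specify.

```python
import numpy as np

rng = np.random.default_rng(1)

P, T = 8, 64          # physical features, topological embedding size
D_feat = P + T
max_n_sub = 40        # hypothetical cap on substructures per tree

# Toy per-tree matrices: one row of D_feat values per substructure.
n_sub_per_tree = [3, 12, 27, 45]
tree_feats = [rng.normal(size=(n, D_feat)) for n in n_sub_per_tree]

# Padding vector: zero physical features plus the embedding of the
# canonical "null" substructure (random placeholder here).
null_embedding = rng.normal(size=T)
pad_vec = np.concatenate([np.zeros(P), null_embedding])

def pad_or_truncate(feats):
    """Return a (max_n_sub, D_feat) matrix for one tree."""
    n = feats.shape[0]
    if n >= max_n_sub:
        # Assumption: rows are pre-sorted by importance, so keep the first ones.
        return feats[:max_n_sub]
    padding = np.tile(pad_vec, (max_n_sub - n, 1))
    return np.concatenate([feats, padding], axis=0)

X_batch = np.stack([pad_or_truncate(f) for f in tree_feats])
print(X_batch.shape)
```

Stacking the padded matrices yields a batch of shape `(N_trees, max_N_sub, D_feat)`, here `(4, 40, 72)`.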

**5. Quantum-Inspired Tensor Train (QITT) Decomposition**

We will use the TensorLy library for QITT operations. The tensor for each tree, `X_tree` of shape `(max_N_sub, D_feat)`, will be treated as a matrix. We will form a batch of these matrices from all trees in the training set, `X_batch` of shape `(N_trees, max_N_sub, D_feat)`. We aim to predict `Y_batch` of shape `(N_trees, 2)` (for Omega_m, sigma_8).

Because each per-tree input is only a matrix, the way the decomposition enters the pipeline has to be fixed explicitly. The project idea specifies "concatenated and flattened QITT cores as input" to the regression models. This admits two readings: decomposing the weight tensor of a QITT-based regression layer, or decomposing the input features themselves once they are arranged as a high-order tensor.

1.  **Tensor Reshaping for QITT:** We have `N_trees` samples, each a tensor `X_i` of shape `(max_N_sub, D_feat)` that we want to map to `Y_i`; across the dataset the structure is `(tree_index, substructure_index, feature_index)`. QITT, like the closely related TT decomposition, is typically applied to high-order tensors, whereas treating `max_N_sub` as one mode and `D_feat` as another gives just a matrix. The input therefore has to be reshaped or reinterpreted. We considered the following options:
    *   **TT-decomposed weights (TT layer):** Flatten `X_i` to `X_i_flat` of size `M = max_N_sub * D_feat` and fit the linear model `y = W * X_i_flat + b`, where `W` has shape `(output_dim, M)`. If `M` is large, `W` can be factorized into TT-cores inside a Tensor Train layer of a neural network. Under this reading, the "concatenated and flattened QITT cores" are the parameters of the TT layer rather than per-tree features.
    *   **TT decomposition of the input (feature extraction):** Flatten `X_i` into a vector of length `L = max_N_sub * D_feat`, find factors `d1*d2*...*dn = L`, reshape to `(d1, d2, ..., dn)`, apply TT decomposition, and flatten and concatenate the TT-cores. This follows the spirit of Tensor Train Feature Extraction (TTFE) and produces one feature vector per tree for a standard regressor.
    *   **Per-substructure decomposition:** If `D_feat = d1*d2*d3`, reshape the input to `(max_N_sub, d1, d2, d3)` and TT-decompose the `(d1, d2, d3)` tensor of each substructure separately. This seems overly complex and will not be pursued.

    We follow the most straightforward interpretation of "Apply QITT decomposition ... concatenated and flattened QITT cores as input", in which the input tensor of each tree is decomposed:
    1.  For each tree, flatten the `(max_N_sub, D_feat)` tensor into a vector `V` of size `L = max_N_sub * D_feat`.
    2.  Reshape `V` into a high-order tensor, which requires factors `n_1 * n_2 * ... * n_d = L`. For example, with `max_N_sub = 40` and `D_feat = 72`, `L = 72 * 40 = 2880`. A natural choice keeps `max_N_sub` as the first mode and factorizes only the feature dimension, `D_feat = f_1 * f_2 * ... * f_k` (e.g., `72 = 8 * 9`), so the tensor per tree has shape `(max_N_sub, f_1, ..., f_k)` and order `k + 1`.
    3.  Apply TT decomposition to this `(max_N_sub, f_1, ..., f_k)` tensor using `tensorly.decomposition.tensor_train`.
        `cores = tensor_train(tensor, rank)`
    4.  The `rank` parameter is crucial and will be treated as a hyperparameter, selected by the performance of the downstream regression on the validation set. Ranks can be mode-dependent, `(r_0, r_1, ..., r_d)`, with the boundary ranks `r_0` and `r_d` fixed to 1; we will start with a single scalar rank `r` shared by all internal bonds.
    5.  The resulting TT-cores will be flattened and concatenated to form a new feature vector for each tree.
        `feature_vector = np.concatenate([core.flatten() for core in cores])`
    6.  **Tensor-specific Regularization:** TensorLy's `tensor_train` computes a direct, SVD-based decomposition with no regularization term. An L2 penalty on the norms of the cores would only enter if the cores were learned iteratively (e.g., by alternating least squares) or trained end-to-end as part of a model, where it helps prevent overfitting and controls the complexity of the TT representation. Since we decompose fixed input data, complexity is controlled through rank selection and, if needed, post-processing of the cores. For this step, we will focus on rank selection.
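
The decomposition and flattening steps above can be illustrated without any tensor library. The sketch below is a plain NumPy implementation of the standard TT-SVD procedure, written here for illustration rather than taken from TensorLy; `tt_svd` and `tt_reconstruct` are hypothetical helper names, and the `(40, 8, 9)` shape assumes `max_N_sub = 40` and `D_feat = 72 = 8 * 9`.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a tensor into TT-cores of shape (r_prev, n_k, r_next)."""
    shape = tensor.shape
    cores, r_prev = [], 1
    mat = np.asarray(tensor, dtype=float)
    for n_k in shape[:-1]:
        # Unfold so that rows index (previous bond, current mode).
        mat = mat.reshape(r_prev * n_k, -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, s.shape[0])          # truncate the bond dimension
        cores.append(u[:, :r].reshape(r_prev, n_k, r))
        mat = s[:r, None] * vt[:r]             # carry the remainder forward
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the cores back into a full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=1)
    return out[0, ..., 0]

rng = np.random.default_rng(0)
X_tree = rng.normal(size=(40, 72))    # toy (max_N_sub, D_feat) matrix
tensor = X_tree.reshape(40, 8, 9)     # factorize the feature mode: 72 = 8 * 9

cores = tt_svd(tensor, max_rank=4)
feature_vector = np.concatenate([core.ravel() for core in cores])
print([c.shape for c in cores], feature_vector.shape)
```

With a rank of 4 the cores have shapes `(1, 40, 4)`, `(4, 8, 4)`, and `(4, 9, 1)`, so each tree is described by 324 numbers instead of 2880 raw entries. TT-cores are only defined up to invertible transformations on the internal bond indices, so features built from cores are comparable across trees only if every tree is decomposed with the same procedure and conventions.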

**6. Regression Models**

The concatenated and flattened QITT cores will serve as input features to various regression models to predict `Omega_m` and `sigma_8`.

1.  **Models:**
    *   **Linear Regression:** As a baseline to understand linear relationships.
    *   **Random Forest Regressor:** To capture non-linear relationships and feature importances.
    *   **Gradient Boosting Regressor (e.g., XGBoost, LightGBM):** For high performance, capable of handling complex interactions.
2.  **Training and Hyperparameter Tuning:**
    *   Each model will be trained on the training set (features derived from QITT cores).
    *   Hyperparameters for each model (e.g., number of trees and depth for Random Forest; learning rate and `n_estimators` for Gradient Boosting) will be tuned using k-fold cross-validation (e.g., k=5) on the training set, optimizing Mean Squared Error (MSE) or R-squared on the held-out folds. As in the data split, folds should be formed over simulations rather than individual trees so that correlated trees do not leak across folds. The TT-rank of the QITT decomposition is tuned alongside these, and its final value will be selected based on the performance of the downstream regression task on the main validation set.
3.  **Rationale for Model Choices:**
    *   Linear Regression: Simplicity and interpretability.
    *   Random Forest: Robust to outliers, handles non-linearities well, provides feature importance.
    *   Gradient Boosting: Often state-of-the-art for tabular data, handles complex patterns effectively.
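
A cross-validation loop for the linear baseline might look like the sketch below. It uses only NumPy, with a least-squares fit standing in for the regressors listed above; the synthetic `X` and `Y`, the helper names, and the grouping of folds by simulation are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 28 simulations x 25 trees, 10 features per tree.
n_sims, trees_per_sim, n_feat = 28, 25, 10
groups = np.repeat(np.arange(n_sims), trees_per_sim)

# One (Omega_m, sigma_8) pair per simulation, shared by its trees.
y_sim = np.column_stack([rng.uniform(0.1, 0.5, n_sims),
                         rng.uniform(0.6, 1.0, n_sims)])
Y = y_sim[groups]
X = Y @ rng.normal(size=(2, n_feat)) + 0.05 * rng.normal(size=(len(groups), n_feat))

def grouped_kfold(groups, k, rng):
    """Yield (train_idx, val_idx) with whole simulations held out per fold."""
    unique = rng.permutation(np.unique(groups))
    for fold_groups in np.array_split(unique, k):
        val_mask = np.isin(groups, fold_groups)
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)

def fit_linear(X, Y):
    """Multi-output least squares with an intercept column."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

fold_mse = []
for train_idx, val_idx in grouped_kfold(groups, k=5, rng=rng):
    coef = fit_linear(X[train_idx], Y[train_idx])
    residual = predict_linear(coef, X[val_idx]) - Y[val_idx]
    fold_mse.append(float(np.mean(residual ** 2)))

cv_mse = float(np.mean(fold_mse))
print(fold_mse, cv_mse)
```

The same loop can wrap any regressor and any candidate TT-rank: recompute the core features for that rank, run the folds, and keep the setting with the lowest `cv_mse`.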

**7. Visualization and Interpretation (Analysis of Results)**

Plots are outside the scope of this methods section, but the planned analysis includes:

1.  **Substructure Graphs and Singular Values:** The singular values computed at each step of an SVD-based TT decomposition determine the TT-ranks. If they can be associated with specific substructure configurations or feature combinations, we will investigate those that carry the most "energy" or importance. (This kind of analysis is more typical of SVD; for a TT representation, the magnitudes of the core entries are the more direct quantity to inspect.)
2.  **Learned Topological Embeddings:** Analyze the distribution of the 64-dimensional topological embeddings using dimensionality reduction techniques (e.g., t-SNE, UMAP) to see if substructures with known different characteristics (e.g., recent major merger vs. old fragmented substructure) cluster in the embedding space.
3.  **Feature Importance:** For models like Random Forest and Gradient Boosting, extract feature importances to understand which QITT core elements (and thus which aspects of the original substructure features and topology) are most predictive of cosmological parameters.

**8. Comparison with Baselines**

The performance of the QITT-based models will be compared against several baselines:

1.  **Aggregate Graph-Level Features:**
    *   **Features:** For each full merger tree, compute global features: total mass, average halo concentration/Vmax/scale_factor over all halos, number of nodes, tree depth, width, etc. These features will be normalized.
    *   **Model:** Train the same set of regressors (Linear, RF, GB) on these aggregate features.
2.  **Raw Substructure Features (No QITT, No Topology Embedding):**
    *   **Features:** Use only the physical features extracted from substructures (Section 3.1). For each tree, concatenate these features up to `max_N_sub` (with padding for physical features, e.g., zeros). This results in a vector of size `max_N_sub * P`.
    *   **Model:** Train regressors on these flattened raw physical substructure features.
3.  **Model using Graphlet Counts:**
    *   **Features:** For each full merger tree, compute the frequency of small graph motifs (graphlets) up to a certain size (e.g., 3 or 4 nodes). This provides a basic topological signature.
    *   **Model:** Train regressors on these graphlet count vectors.
4.  **Model with Topology Embedding but No QITT:**
    *   **Features:** Use the concatenated physical features and GNN-derived topological embeddings for each substructure (Section 4.1). Flatten this `(max_N_sub, D_feat)` tensor per tree.
    *   **Model:** Train regressors directly on these flattened comprehensive substructure features.
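
For the aggregate and graphlet baselines, several features follow directly from the edge list. The sketch below relies on the fact that a merger tree contains no cycles, so 3-node and 4-node motif counts reduce to sums over node degrees; the toy tree and the variable names are illustrative assumptions, with edges again taken to point from progenitor to descendant.

```python
import numpy as np
from math import comb

# Toy tree: row 0 = progenitors, row 1 = descendants (assumed direction).
edge_index = np.array([
    [1, 2, 3, 4, 5, 6],
    [0, 0, 1, 1, 2, 4],
])
n_nodes = 7
src, dst = edge_index

# Undirected degree of every halo.
deg = np.bincount(np.concatenate([src, dst]), minlength=n_nodes)

# Graphlet counts valid for trees (no triangles or longer cycles):
n_path3 = sum(comb(int(k), 2) for k in deg)   # 3-node paths
n_star4 = sum(comb(int(k), 3) for k in deg)   # 4-node stars
n_path4 = int(sum((deg[u] - 1) * (deg[v] - 1) for u, v in zip(src, dst)))

# Aggregate structure: leaves and depth measured from the root halo.
progenitors = {v: [] for v in range(n_nodes)}
for p, d in zip(src, dst):
    progenitors[int(d)].append(int(p))

root = int(np.setdiff1d(np.arange(n_nodes), src)[0])
n_leaves = sum(1 for v in range(n_nodes) if not progenitors[v])

depth, frontier = 0, [root]
while True:
    frontier = [p for v in frontier for p in progenitors[v]]
    if not frontier:
        break
    depth += 1

baseline_features = np.array([n_nodes, n_leaves, depth,
                              n_path3, n_star4, n_path4], dtype=float)
print(baseline_features)  # → [7. 3. 3. 6. 1. 5.]
```

Averages of the node features over all halos would be appended to this vector before normalization and regression.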

**Evaluation Metrics and Statistical Significance:**

*   **Metrics:** Performance will be primarily evaluated using Root Mean Squared Error (RMSE) and R-squared (coefficient of determination) for both `Omega_m` and `sigma_8` on the held-out test set.
*   **Statistical Significance:** Paired t-tests (or non-parametric equivalents like Wilcoxon signed-rank test if normality assumptions are violated) will be performed on the prediction errors from different models on the test set to assess whether performance differences are statistically significant. A p-value threshold of 0.05 will be used.
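
The metrics and the paired comparison can be sketched as follows. The predictions are synthetic placeholders, and the choice to pair per-tree squared errors is an assumption; the sketch computes only the paired t statistic by hand, while p-values would in practice come from `scipy.stats.ttest_rel` or `scipy.stats.wilcoxon`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic test set: 150 trees, true Omega_m and two competing predictions.
y_true = rng.uniform(0.1, 0.5, size=150)
pred_a = y_true + rng.normal(scale=0.02, size=150)  # hypothetical model A
pred_b = y_true + rng.normal(scale=0.05, size=150)  # hypothetical model B

def rmse(y, p):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

def r2(y, p):
    """Coefficient of determination."""
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Paired t statistic on the per-tree difference in squared error.
diff = (y_true - pred_a) ** 2 - (y_true - pred_b) ** 2
t_stat = float(diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff))))

print(rmse(y_true, pred_a), rmse(y_true, pred_b))
print(r2(y_true, pred_a), r2(y_true, pred_b))
print(t_stat)
```

A negative `t_stat` indicates that model A has the smaller squared error on average. Because trees from the same simulation share targets, the effective number of independent pairs may be smaller than the number of test trees.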

This comprehensive methodology will allow us to rigorously evaluate the efficacy of using QITT-enhanced multi-scale substructure features, including learned topological embeddings, for cosmological parameter estimation.