\section{Extended Introduction and Background}
\label{app:background}

\subsection{Climate Context and Catalyst Challenges}

Atmospheric CO$_2$ concentrations now exceed 420 ppm, and mitigating them requires rapid technological progress. Electrochemical conversion of CO$_2$ into value-added chemicals and fuels is one pathway toward carbon neutrality, and catalysts are central to it. The oxygen evolution reaction (OER) is the anodic half-reaction that accompanies CO$_2$ reduction and water splitting, so its kinetics constrain the efficiency of the full cell. Current state-of-the-art OER catalysts, predominantly based on precious-metal oxides such as IrO$_2$ and RuO$_2$, achieve overpotentials of 320--370 mV but suffer from scarcity, high cost, and limited long-term stability under operating conditions. These drawbacks have motivated intensive research into alternative catalyst architectures that exploit synergistic interactions among multiple metallic elements to improve both activity and durability.

\subsection{Materials Discovery Challenges}

The traditional paradigm of materials discovery is a major barrier to rapid catalyst development: it typically takes 10--20 years to move from initial concept to commercial deployment. This protracted timeline stems from the complex interplay between composition, structure, and catalytic properties. Computational screening has accelerated the initial exploration phase, yet it demands deep domain expertise in density functional theory, thermodynamic modeling, and electrochemistry. Even high-throughput computational approaches can explore only a minuscule fraction of the available chemical space, and may miss breakthrough compositions that lie outside conventional design heuristics. The bottleneck tightens once synthesis feasibility, stability under operating conditions, and industrial scalability are considered, which turns catalyst design into a multidimensional optimization problem that has historically yielded incremental improvements rather than transformative discoveries.

\subsection{LLM Integration Challenges}

While LLMs are not explicitly trained in materials science, they excel at pattern recognition, hypothesis generation, and assisting researchers in exploring complex parameter spaces. The key challenge lies in effectively grounding their outputs in physical and chemical constraints while leveraging their ability to identify non-obvious patterns and connections. Initial attempts to apply LLMs directly to materials design have shown that proper integration with domain knowledge and validation frameworks is essential for producing chemically meaningful results. This approach fundamentally differs from traditional machine learning methods that require extensive training on labeled datasets; instead, it leverages the LLM's pre-existing knowledge representation and pattern recognition capabilities, augmented with real-time access to materials data. The integration of structured prompt engineering further refines the generation process, encoding chemical constraints such as Pauling's electronegativity rules and Hume-Rothery criteria as natural language instructions that the model interprets and applies during catalyst design.

\section{Detailed DFT Parameters and Convergence Criteria}
\label{app:dft_parameters}

\subsection{Complete Computational Parameters}

Our density functional theory calculations employed the following comprehensive parameter set to ensure accurate and reproducible results:

\textbf{Exchange-Correlation Functional:} We used the Perdew-Burke-Ernzerhof (PBE) generalized gradient approximation with Hubbard U corrections applied to transition metal d-electrons following the simplified rotationally invariant approach of Dudarev et al. The specific U values were:
\begin{itemize}
\item Fe: $U = 3.3$ eV (validated for Fe oxides and alloys)
\item Co: $U = 3.4$ eV (optimized for Co-containing catalysts)
\item Ni: $U = 3.5$ eV (standard for Ni oxides)
\item Mn: $U = 3.0$ eV (appropriate for Mn oxidation states)
\item Cr: $U = 3.5$ eV (validated for Cr oxides)
\end{itemize}
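
For reference, the Dudarev scheme named above depends only on the difference between the on-site Coulomb and exchange parameters. The following is a sketch of the standard form of the correction, assuming the listed $U$ values are effective parameters $U_{\mathrm{eff}} = U - J$ (the text does not state $J$ separately) and writing $\rho^{\sigma}$ for the on-site $d$-orbital occupancy matrix of spin $\sigma$ (notation introduced here):

\begin{equation}
E_{\mathrm{DFT}+U} = E_{\mathrm{DFT}} + \frac{U_{\mathrm{eff}}}{2} \sum_{\sigma} \mathrm{Tr}\left[ \rho^{\sigma} - \rho^{\sigma}\rho^{\sigma} \right]
\end{equation}

The added term vanishes when every $d$ orbital is either fully occupied or empty and is positive for fractional occupations, so it penalizes partial occupation.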

\textbf{Convergence Parameters:}
\begin{itemize}
\item Plane-wave cutoff energy: 500 eV (tested up to 600 eV, showing $<$1 meV/atom difference)
\item K-point sampling: $3 \times 3 \times 3$ Monkhorst-Pack grid for bulk calculations
\item Surface calculations: $\Gamma$-centered $3 \times 3 \times 1$ k-point grid
\item Electronic convergence: $10^{-5}$ eV total energy difference
\item Ionic convergence: Forces below 0.02 eV/\AA{} on all atoms
\item Gaussian smearing: 0.05 eV width for metallic systems
\end{itemize}

\textbf{Surface Model Construction:}
\begin{itemize}
\item FCC structures: (111) surface orientation (most stable, lowest surface energy)
\item BCC structures: (110) surface orientation
\item Slab thickness: 4 atomic layers (bottom 2 fixed to simulate bulk)
\item Vacuum spacing: 15 \AA{} perpendicular to surface
\item Lateral dimensions: $2 \times 2$ or $3 \times 3$ supercells depending on adsorbate coverage
\item Dipole corrections applied for asymmetric slabs
\end{itemize}
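
As an illustration, the settings listed above could map onto VASP input tags roughly as follows, assuming the VASP package named in the reproducibility checklist. This is a sketch rather than the input file actually used: the species order in the \texttt{LDAU} lines is hypothetical and must match the structure file, the k-point grids belong in a separate KPOINTS file, and tags not implied by the lists above are omitted.

\begin{verbatim}
ENCUT    = 500      ! plane-wave cutoff (eV)
EDIFF    = 1E-5     ! electronic convergence (eV)
EDIFFG   = -0.02    ! ionic convergence: max force (eV/A)
ISMEAR   = 0        ! Gaussian smearing
SIGMA    = 0.05     ! smearing width (eV)
LDAU     = .TRUE.   ! enable DFT+U
LDAUTYPE = 2        ! simplified (Dudarev) scheme
! assumed species order: Fe Co Ni Mn Cr
LDAUL    = 2 2 2 2 2
LDAUU    = 3.3 3.4 3.5 3.0 3.5
LDAUJ    = 0 0 0 0 0
LDIPOL   = .TRUE.   ! dipole correction (slabs only)
IDIPOL   = 3        ! along the surface normal (assumed z)
\end{verbatim}

A negative \texttt{EDIFFG} selects a force-based stopping criterion, which matches the 0.02 eV/\AA{} threshold stated above.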

\subsection{Adsorption Energy Calculations}

The binding energies for OER intermediates were calculated using:

\begin{equation}
\Delta E_{*X} = E_{\mathrm{slab}+X} - E_{\mathrm{slab}} - E_{X,\mathrm{ref}}
\end{equation}

where $X$ denotes the adsorbed intermediate and the reference energies $E_{X,\mathrm{ref}}$ were obtained from:
\begin{itemize}
\item *OH: referenced to H$_2$O(g) $-$ $0.5 \times$ H$_2$(g)
\item *O: referenced to H$_2$O(g) $-$ H$_2$(g)
\item *OOH: referenced to $2 \times$ H$_2$O(g) $-$ $1.5 \times$ H$_2$(g)
\end{itemize}

Zero-point energy corrections and entropic contributions at 298 K were included:
\begin{itemize}
\item ZPE(*OH) = 0.35 eV
\item ZPE(*O) = 0.05 eV
\item ZPE(*OOH) = 0.40 eV
\item $TS$ contributions calculated from vibrational frequencies
\end{itemize}
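
These corrections enter the adsorption free energies in the usual way. As a sketch, assuming the computational hydrogen electrode framework (not named explicitly in the text) and the four-step OER mechanism through *OH, *O, and *OOH, the step free energies $\Delta G_i$ (notation introduced here) and the resulting overpotential could be written as:

\begin{equation}
\begin{array}{rcl}
\Delta G_{*X} & = & \Delta E_{*X} + \Delta \mathrm{ZPE} - T\Delta S \\
\Delta G_1 & = & \Delta G_{*\mathrm{OH}} \\
\Delta G_2 & = & \Delta G_{*\mathrm{O}} - \Delta G_{*\mathrm{OH}} \\
\Delta G_3 & = & \Delta G_{*\mathrm{OOH}} - \Delta G_{*\mathrm{O}} \\
\Delta G_4 & = & 4.92~\mathrm{eV} - \Delta G_{*\mathrm{OOH}} \\
\eta_{\mathrm{OER}} & = & \max_i \left( \Delta G_i \right)/e - 1.23~\mathrm{V}
\end{array}
\end{equation}

Here 4.92 eV $= 4 \times 1.23$ eV is the free energy of the overall reaction 2H$_2$O $\rightarrow$ O$_2$ + 2H$_2$. An ideal catalyst would have all four steps equal to 1.23 eV, giving zero overpotential.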

\section{Extended Ablation Study Results}
\label{app:ablation}

\subsection{Complete Ablation Analysis}

\begin{figure}[h]
\centering
\includegraphics[width=0.8\columnwidth]{figures/stability_activity.png}
\caption{Detailed ablation results showing the impact of RAG on thermodynamic stability ($3.6\times$ improvement), a comparison of prompt engineering strategies, and iterative refinement convergence over 5 cycles, with a plateau at cycle 4.}
\label{fig:ablation}
\end{figure}

Figure~\ref{fig:ablation} visualizes the impact of each component on system performance. The dramatic stability improvement with RAG underscores the importance of grounding LLM outputs in validated materials data. Combined prompting strategies significantly outperform individual approaches, while convergence typically occurs within 4 iterations.

Table~\ref{tab:full_ablation} presents the comprehensive ablation study results examining all component combinations:

\begin{table}[h]
\centering
\caption{Full ablation study examining all component combinations. Each configuration tested with 200 generated candidates over 5 independent runs.}
\label{tab:full_ablation}
\begin{tabular}{lcccc}
\toprule
Configuration & Stability (\%) & $\eta_{OER}$ (V) & Diversity & Time (h) \\
\midrule
Full System & $82.4 \pm 1.8$ & $0.362 \pm 0.015$ & 3.2 & 24 \\
No RAG & $23.1 \pm 4.2$ & $0.521 \pm 0.043$ & 4.1 & 18 \\
No Iteration & $64.3 \pm 3.1$ & $0.412 \pm 0.021$ & 3.0 & 5 \\
Constraint Only & $68.2 \pm 2.7$ & $0.395 \pm 0.018$ & 1.8 & 22 \\
Analogy Only & $41.3 \pm 3.9$ & $0.438 \pm 0.027$ & 3.5 & 21 \\
Random Baseline & $3.2 \pm 1.1$ & $0.612 \pm 0.071$ & 4.5 & 20 \\
\bottomrule
\end{tabular}
\end{table}
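
The headline comparison between the Full System and No RAG configurations follows directly from the tabulated means. As a worked check that uses only the numbers in Table~\ref{tab:full_ablation}:

\begin{equation}
\frac{82.4}{23.1} \approx 3.57, \qquad 0.521~\mathrm{V} - 0.362~\mathrm{V} = 0.159~\mathrm{V}
\end{equation}

The first quantity is the ratio of stable-candidate fractions, which rounds to the quoted 3.6-fold stability improvement. The second is the corresponding reduction in mean overpotential.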

\subsection{Hyperparameter Sensitivity}

Extended hyperparameter analysis across broader ranges:

\begin{table}[h]
\centering
\caption{Extended hyperparameter sensitivity analysis}
\begin{tabular}{lccc}
\toprule
Parameter & Range Tested & Optimal & Impact \\
\midrule
Temperature & 0.1--1.0 & 0.7 & Critical \\
Top-p & 0.5--1.0 & 0.95 & Moderate \\
$k$ (retrieval) & 5--50 & 20 & High \\
Similarity threshold & 0.7--0.95 & 0.85 & Low \\
Beam width & 1--10 & 5 & Moderate \\
Iterations & 1--10 & 5 & High \\
\bottomrule
\end{tabular}
\end{table}

\section{Additional Statistical Analyses}
\label{app:statistics}

\subsection{Multiple Comparison Corrections}

Given that we tested 250 catalyst candidates, proper multiple comparison corrections were essential:

\textbf{Bonferroni Correction:}
\begin{itemize}
\item Original significance level: $\alpha = 0.05$
\item Number of comparisons: 250
\item Corrected significance level: $\alpha' = 0.05/250 = 0.0002$
\item All reported significant results met this threshold
\end{itemize}

\textbf{False Discovery Rate (FDR) Control:}
\begin{itemize}
\item Benjamini-Hochberg procedure applied
\item FDR controlled at $q = 0.05$
\item 87\% of discoveries remained significant after correction
\end{itemize}
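
For reference, the Benjamini-Hochberg step-up rule can be stated compactly. With the $m = 250$ $p$-values sorted as $p_{(1)} \le \dots \le p_{(m)}$ (notation introduced here), the procedure rejects the hypotheses with the $k^{*}$ smallest $p$-values, where

\begin{equation}
k^{*} = \max\left\{ k : p_{(k)} \le \frac{k}{m}\, q \right\}
\end{equation}

For $q = 0.05$ the per-rank threshold grows linearly from $0.05/250 = 0.0002$ at $k = 1$, which coincides with the Bonferroni level, to $0.05$ at $k = 250$. This is why the FDR procedure is less conservative than the Bonferroni correction.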

\subsection{Effect Size Calculations}

Cohen's d effect sizes for key comparisons:

\begin{table}[h]
\centering
\begin{tabular}{lcc}
\toprule
Comparison & Cohen's d & Interpretation \\
\midrule
LLM vs IrO$_2$ baseline & 2.31 & Very large \\
LLM vs known catalysts & 1.87 & Large \\
With RAG vs without & 3.42 & Very large \\
Combined vs constraint-only prompts & 1.42 & Large \\
Combined vs analogy-only prompts & 2.18 & Very large \\
\bottomrule
\end{tabular}
\end{table}
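
The text does not state which variant of Cohen's $d$ was computed. A common definition, given here as an assumption, divides the difference in group means by the pooled standard deviation:

\begin{equation}
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad
s_p = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}}
\end{equation}

where $\bar{x}_j$, $s_j$, and $n_j$ are the mean, standard deviation, and size of group $j$ (symbols introduced here for illustration).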

\subsection{Bootstrap Confidence Intervals}

Detailed bootstrap analysis ($n = 1000$ resamples):

\begin{itemize}
\item Mean improvement: 0.175 V
\item Standard error: 0.023 V
\item 95\% CI: [0.152, 0.198] V
\item 99\% CI: [0.144, 0.206] V
\item Bias-corrected accelerated (BCa) CI: [0.155, 0.195] V
\end{itemize}
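
As a sketch of how such quantities are typically obtained (the resampling unit is not specified in the text): each of the $B = 1000$ resamples yields a replicate $\hat{\theta}^{*}_b$ of the mean improvement, from which the bootstrap standard error and the percentile interval at level $1 - \alpha$ are

\begin{equation}
\mathrm{SE}_{\mathrm{boot}} = \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\theta}^{*}_b - \bar{\theta}^{*} \right)^2}, \qquad
\mathrm{CI}_{1-\alpha} = \left[ \hat{\theta}^{*}_{(\alpha/2)},\; \hat{\theta}^{*}_{(1-\alpha/2)} \right]
\end{equation}

where $\bar{\theta}^{*}$ is the mean of the replicates and $\hat{\theta}^{*}_{(p)}$ is their empirical $p$-quantile (notation introduced here). The BCa interval uses the same replicates but shifts the quantile levels to correct for bias and skewness.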

\section{Extended Methodology Details}
\label{app:methodology}

\subsection{RAG Database Construction}

The 50,000+ entry database was constructed from multiple sources:

\begin{itemize}
\item Materials Project: 25,000 entries (validated DFT calculations)
\item OQMD: 10,000 entries (high-throughput screening results)
\item Catalysis-Hub: 8,000 entries (surface calculations)
\item Literature extraction: 7,000+ entries (2015--2024 publications)
\end{itemize}

Each entry contains:
\begin{itemize}
\item Chemical composition and stoichiometry
\item Crystal structure (space group, lattice parameters)
\item Formation energy and energy above hull
\item Electronic properties (band gap, d-band center)
\item Catalytic metrics (overpotential, Tafel slope, turnover frequency)
\item Synthesis conditions (when available)
\item Stability assessments (electrochemical, thermal)
\end{itemize}

\subsection{Prompt Engineering Templates}

Complete prompt templates used for generation:

\textbf{Initial Generation Prompt:}
\begin{verbatim}
You are a materials scientist designing high-entropy alloy catalysts
for the oxygen evolution reaction. Based on the following successful
catalysts:

[Retrieved Examples]

Generate a novel HEA composition that:
1. Contains 5-6 metallic elements
2. Maintains atomic size mismatch < 15%
3. Keeps electronegativity difference < 0.4
4. Targets formation energy < 50 meV/atom above hull
5. Optimizes d-band center between -2.5 and -1.5 eV

Explain your reasoning for element selection and predicted properties.
\end{verbatim}
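
The prompt does not define the size mismatch or the electronegativity difference in constraints 2 and 3. Composition-weighted definitions that are widely used for high-entropy alloys, given here as an assumption about what the constraints refer to, are

\begin{equation}
\delta = 100\% \times \sqrt{\sum_{i} c_i \left( 1 - \frac{r_i}{\bar{r}} \right)^2}, \qquad
\Delta\chi = \sqrt{\sum_{i} c_i \left( \chi_i - \bar{\chi} \right)^2}
\end{equation}

with $\bar{r} = \sum_i c_i r_i$ and $\bar{\chi} = \sum_i c_i \chi_i$, where $c_i$, $r_i$, and $\chi_i$ are the atomic fraction, atomic radius, and Pauling electronegativity of element $i$ (symbols introduced here). The 15\% threshold also matches the classical pairwise Hume-Rothery size rule, so the constraint may instead have been meant pairwise.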

\textbf{Iterative Refinement Prompt:}
\begin{verbatim}
The previous composition [Formula] showed:
- Stability: [E_hull] meV/atom
- *OH binding: [Energy] eV
- Limiting potential: [Value] V

Modify this composition to:
1. Improve limiting potential toward 0.35 V
2. Maintain thermodynamic stability
3. Enhance Fe-Co synergy if present

Suggest 3 variations with reasoning.
\end{verbatim}

\subsection{Vector Embedding Details}

SciBERT encoding process:
\begin{itemize}
\item Input text tokenization using WordPiece
\item Maximum sequence length: 512 tokens
\item Embedding dimension: 768
\item Pooling strategy: Mean pooling of final layer
\item Normalization: L2 normalization for cosine similarity
\end{itemize}
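
The pooling and normalization steps above can be written out explicitly. As a sketch, with $h_t \in \mathbb{R}^{768}$ the final-layer vector of token $t$ and $T \le 512$ the number of tokens (notation introduced here; whether padding tokens are excluded from the mean is an assumption):

\begin{equation}
e = \frac{1}{T} \sum_{t=1}^{T} h_t, \qquad
\hat{e} = \frac{e}{\lVert e \rVert_2}, \qquad
\mathrm{sim}(q, d) = \hat{e}_q^{\top} \hat{e}_d
\end{equation}

Because both vectors have unit length, the inner product of a query embedding $\hat{e}_q$ and a database embedding $\hat{e}_d$ equals their cosine similarity, so a plain inner-product index suffices for retrieval.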

\section{Property Correlation Analysis}
\label{app:correlations}

\subsection{Complete Correlation Matrix}

\begin{figure}[h]
\centering
\includegraphics[width=0.8\columnwidth]{figures/property_correlations.png}
\caption{Complete correlation matrix showing relationships between all catalyst properties including overpotential, stability metrics, d-band center, and compositional features for the full set of LLM-generated catalysts.}
\label{fig:correlations}
\end{figure}

\begin{figure}[h]
\centering
\includegraphics[width=0.8\columnwidth]{figures/3d_activity_surface.png}
\caption{3D activity landscape of HEA catalysts showing the relationship between NOH adsorption energy ($\Delta E_{NOH}$), mixing enthalpy, and limiting potential. The surface color represents catalytic activity, with dark purple regions indicating optimal performance. Black circles mark individual catalyst compositions, demonstrating clustering in the favorable low-potential region.}
\label{fig:3d_surface}
\end{figure}

\begin{figure}[h]
\centering
\includegraphics[width=0.8\columnwidth]{figures/catalyst_boxplots.png}
\caption{Statistical comparison of key properties across catalyst types. Box plots show the mixing enthalpy distribution, with LLM-HEAs exhibiting the most negative values (median $-0.8$ eV/atom), indicating superior stability, and the d-band center distribution, with LLM-HEAs centered at $-2.8$ eV, correlating with enhanced activity.}
\label{fig:boxplots}
\end{figure}

\begin{figure}[h]
\centering
\includegraphics[width=\columnwidth]{figures/summary_statistics.png}
\caption{Property distributions for HEA catalysts showing a right-skewed mixing enthalpy distribution (mean $-0.593$ eV/atom), a multimodal d-band center distribution (mean $-2.425$ eV), a broad $\Delta E_{NOH}$ distribution (mean 0.774 eV), and a left-skewed limiting potential distribution with exceptional catalysts in the tail. Vertical lines indicate mean (red) and median (green) values.}
\label{fig:summary_stats}
\end{figure}

The correlation analysis (Figure~\ref{fig:correlations}) reveals strong relationships between electronic structure descriptors and catalytic performance. The 3D activity landscape (Figure~\ref{fig:3d_surface}) visualizes the property-performance relationship and shows the optimal region, where the mixing enthalpy is below $-0.5$ eV/atom and $\Delta E_{NOH} > 1.0$ eV. The statistical distributions (Figures~\ref{fig:boxplots} and \ref{fig:summary_stats}) indicate that LLM-generated catalysts systematically occupy more favorable property ranges than known materials.

Full correlation analysis between compositional features and performance metrics:

\begin{table}[h]
\centering
\small
\begin{tabular}{lccccc}
\toprule
Feature & $\eta_{OER}$ & Stability & d-band & EN & Size \\
\midrule
$\eta_{OER}$ & 1.00 & & & & \\
Stability & $-0.42^{**}$ & 1.00 & & & \\
d-band center & $-0.73^{***}$ & $0.31^{*}$ & 1.00 & & \\
Avg. EN & $0.28^{*}$ & $-0.19$ & $-0.35^{**}$ & 1.00 & \\
Size mismatch & 0.15 & $-0.52^{***}$ & $-0.08$ & 0.21 & 1.00 \\
Fe content & $-0.38^{**}$ & $0.27^{*}$ & $0.41^{**}$ & $-0.15$ & $-0.03$ \\
Co content & $-0.41^{**}$ & $0.29^{*}$ & $0.45^{***}$ & $-0.18$ & $-0.05$ \\
Entropy & $-0.33^{**}$ & $0.48^{***}$ & 0.12 & $-0.09$ & $-0.31^{*}$ \\
\bottomrule
\end{tabular}
\caption{Pearson correlations between compositional features and performance metrics (d-band: d-band center; EN: average electronegativity; Size: size mismatch). $^{*}p<0.05$, $^{**}p<0.01$, $^{***}p<0.001$ after Bonferroni correction.}
\end{table}
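
For reference, the Pearson coefficient between two properties $x$ and $y$ over $n$ candidates is defined below. The significance test behind the asterisks is not specified in the text; a standard choice, assumed here, is a $t$-test with $n - 2$ degrees of freedom, with the Bonferroni adjustment applied to the resulting $p$-values.

\begin{equation}
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad
t = r_{xy} \sqrt{\frac{n - 2}{1 - r_{xy}^2}}
\end{equation}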

\subsection{Principal Component Analysis}

The first three principal components explained 72\% of variance:
\begin{itemize}
\item PC1 (31\%): Electronic properties (d-band, conductivity)
\item PC2 (24\%): Geometric factors (size mismatch, coordination)
\item PC3 (17\%): Compositional complexity (entropy, element count)
\end{itemize}
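
The percentages above are explained-variance ratios. With $\lambda_k$ the $k$-th eigenvalue of the covariance matrix of the standardized features (notation introduced here, and standardization is an assumption), the ratio for component $k$ and the cumulative value for the first three components are

\begin{equation}
\mathrm{EVR}_k = \frac{\lambda_k}{\sum_{j} \lambda_j}, \qquad
\mathrm{EVR}_1 + \mathrm{EVR}_2 + \mathrm{EVR}_3 = 0.31 + 0.24 + 0.17 = 0.72
\end{equation}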

\section{Synthesis Feasibility Assessment}
\label{app:synthesis}

\subsection{Detailed Synthesis Conditions}

For top-performing catalysts, estimated synthesis requirements:

\begin{table}[h]
\centering
\begin{tabular}{lcc}
\toprule
Composition & Method & Conditions \\
\midrule
Fe$_{0.2}$Co$_{0.2}$Ni$_{0.2}$Ir$_{0.1}$Ru$_{0.3}$ & Arc melting & 1800$^{\circ}$C, Ar \\
Mn$_{0.15}$Fe$_{0.25}$Co$_{0.25}$Ni$_{0.2}$Pt$_{0.15}$ & Sputtering & 400$^{\circ}$C, 5 mTorr \\
Cr$_{0.2}$Fe$_{0.2}$Co$_{0.3}$Ni$_{0.2}$Mo$_{0.1}$ & Ball milling & 500 rpm, 20 h \\
V$_{0.1}$Cr$_{0.2}$Mn$_{0.2}$Fe$_{0.25}$Co$_{0.25}$ & Carbothermal & 2000$^{\circ}$C flash \\
\bottomrule
\end{tabular}
\end{table}
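
As a worked example of how composition relates to configurational entropy, consider the first composition in the table. Assuming the ideal-solution expression $\Delta S_{\mathrm{mix}} = -R \sum_i c_i \ln c_i$ (the text does not state how the entropy feature was computed), with $c_i$ the atomic fractions and $R$ the gas constant:

\begin{equation}
\frac{\Delta S_{\mathrm{mix}}}{R} = -\left( 3 \times 0.2 \ln 0.2 + 0.1 \ln 0.1 + 0.3 \ln 0.3 \right) \approx 1.56
\end{equation}

This corresponds to about 12.9 J\,mol$^{-1}$\,K$^{-1}$ and lies slightly below the equimolar five-element maximum of $\ln 5 \approx 1.61$, while remaining above the $1.5R$ value commonly cited as a threshold for high-entropy alloys.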

\subsection{Stability Under Operating Conditions}

Pourbaix diagram analysis suggests stability windows:
\begin{itemize}
\item pH 0--14: Fe-Co-Ni compositions stable as oxides/hydroxides
\item pH 7--14: Mn-containing catalysts show optimal stability
\item Potential range: 0.8--1.8 V vs.\ RHE for all compositions
\item Dissolution rates: $<$1 nm per 1000 h, estimated from computational models
\end{itemize}
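
Because the potential window is quoted against the reversible hydrogen electrode (RHE), it can be compared across the whole pH range. For reference, the RHE and standard hydrogen electrode (SHE) scales are related by the Nernst equation, here evaluated at 298 K:

\begin{equation}
E_{\mathrm{RHE}} = E_{\mathrm{SHE}} + \frac{RT \ln 10}{F}\, \mathrm{pH} \approx E_{\mathrm{SHE}} + 0.0591~\mathrm{V} \times \mathrm{pH}
\end{equation}

On the RHE scale the equilibrium OER potential is 1.23 V at every pH, so the 0.8--1.8 V window brackets it on both sides.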

\section{Limitations and Future Work}
\label{app:limitations}

\subsection{Comprehensive Limitations}

Beyond those mentioned in the main text:

\textbf{Computational Limitations:}
\begin{itemize}
\item DFT functional choice (PBE) may underestimate band gaps
\item Finite size effects in surface slabs
\item Neglect of solvent effects beyond implicit models
\item No consideration of surface coverage effects
\item Static calculations miss dynamic restructuring
\end{itemize}

\textbf{Physical Limitations:}
\begin{itemize}
\item Assumes uniform composition (no segregation)
\item Ignores grain boundary effects
\item No consideration of support interactions
\item Excludes mass transport limitations
\item Neglects bubble formation dynamics
\end{itemize}

\textbf{Methodological Limitations:}
\begin{itemize}
\item LLM knowledge cutoff prevents recent literature inclusion
\item RAG database biased toward published successful catalysts
\item Single-objective optimization misses trade-offs
\item No active learning from failed candidates
\item Limited to compositions expressible in text
\end{itemize}

\subsection{Proposed Extensions}

Future work should address:

\begin{enumerate}
\item \textbf{Multi-objective optimization:} Incorporate stability, conductivity, cost
\item \textbf{Kinetic modeling:} Include activation barriers via NEB calculations
\item \textbf{Experimental validation:} Synthesize top 10 candidates
\item \textbf{Active learning:} Update RAG database with experimental feedback
\item \textbf{Broader reactions:} Extend to ORR, HER, CO$_2$RR
\item \textbf{Microstructure:} Consider nanoparticle size/shape effects
\item \textbf{Operando modeling:} Simulate under realistic electrochemical conditions
\item \textbf{Uncertainty quantification:} Provide confidence intervals for predictions
\end{enumerate}

\section{Code and Data Availability}
\label{app:code}

The complete codebase and datasets are available at:
\url{https://zenodo.org/records/17129646}

Repository structure:
\begin{verbatim}
llm-catalyst-discovery/
|-- data/
|   |-- materials_database.json
|   |-- generated_catalysts.csv
|   |-- dft_results/
|-- src/
|   |-- rag_system.py
|   |-- prompt_engineering.py
|   |-- dft_validation.py
|   |-- statistical_analysis.py
|-- notebooks/
|   |-- data_analysis.ipynb
|   |-- figure_generation.ipynb
|-- requirements.txt
\end{verbatim}

\section{Reproducibility Checklist}
\label{app:reproducibility}

To reproduce our results:

\begin{enumerate}
\item \textbf{Environment Setup:}
   \begin{itemize}
   \item Python 3.9+
   \item GPT-4 API access
   \item VASP 6.3 license
   \item 200+ CPU cores recommended
   \end{itemize}

\item \textbf{Data Preparation:}
   \begin{itemize}
   \item Download materials database
   \item Index with FAISS
   \item Precompute SciBERT embeddings
   \end{itemize}

\item \textbf{Generation Parameters:}
   \begin{itemize}
   \item Temperature: 0.7
   \item Top-p: 0.95
   \item Retrieval k: 20
   \item Iterations: 5
   \end{itemize}

\item \textbf{Validation Protocol:}
   \begin{itemize}
   \item Screen with ML potentials first
   \item Run DFT with specified parameters
   \item Calculate limiting potentials
   \item Apply statistical tests
   \end{itemize}
\end{enumerate}

Estimated computation time: 5--7 days for the full pipeline with 250 candidates.