\section{Detailed DFT Parameters and Convergence Criteria}
\label{app:dft_parameters}

\subsection{Complete Computational Parameters}

Our density functional theory (DFT) calculations used the following parameters:

\textbf{Exchange-Correlation Functional:} We used the Perdew-Burke-Ernzerhof (PBE) generalized gradient approximation with Hubbard $U$ corrections applied to transition-metal $d$ electrons, following the simplified rotationally invariant approach of Dudarev et al. The specific $U$ values were:
\begin{itemize}
\item Fe: $U = 3.3$ eV (validated for Fe oxides and alloys)
\item Co: $U = 3.4$ eV (optimized for Co-containing catalysts)
\item Ni: $U = 3.5$ eV (standard for Ni oxides)
\item Mn: $U = 3.0$ eV (appropriate for Mn oxidation states)
\item Cr: $U = 3.5$ eV (validated for Cr oxides)
\end{itemize}

\textbf{Convergence Parameters:}
\begin{itemize}
\item Plane-wave cutoff energy: 500 eV (tested up to 600 eV, showing $<$1 meV/atom difference)
\item $k$-point sampling: $3 \times 3 \times 3$ Monkhorst-Pack grid for bulk calculations
\item Surface calculations: $\Gamma$-centered $3 \times 3 \times 1$ $k$-point grid
\item Electronic convergence: $10^{-5}$ eV total energy difference
\item Ionic convergence: Forces below 0.02 eV/\AA{} on all atoms
\item Gaussian smearing: 0.05 eV width for metallic systems
\end{itemize}

\textbf{Surface Model Construction:}
\begin{itemize}
\item FCC structures: (111) surface orientation (most stable, lowest surface energy)
\item BCC structures: (110) surface orientation
\item Slab thickness: 4 atomic layers (bottom 2 fixed to simulate bulk)
\item Vacuum spacing: 15 \AA{} perpendicular to surface
\item Lateral dimensions: $2 \times 2$ or $3 \times 3$ supercells depending on adsorbate coverage
\item Dipole corrections applied for asymmetric slabs
\end{itemize}
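The settings listed above can be collected into a single input file. The following is a minimal sketch, assuming VASP (listed in the reproducibility checklist) and its standard INCAR tags; the helper name \texttt{render\_incar}, the example species order, and the assumption that the surface normal lies along the third lattice vector are illustrative and are not taken from our input files.

```python
# Illustrative sketch only: maps the parameters listed above onto standard
# VASP INCAR tags. Helper names and the species order are assumptions.

# Hubbard U values (eV) from the list above; all act on d electrons.
HUBBARD_U = {"Fe": 3.3, "Co": 3.4, "Ni": 3.5, "Mn": 3.0, "Cr": 3.5}


def render_incar(species, slab=False):
    """Return INCAR text; `species` must follow the POSCAR species order."""
    tags = {
        "ENCUT": "500",      # plane-wave cutoff (eV)
        "EDIFF": "1E-5",     # electronic convergence (eV)
        "EDIFFG": "-0.02",   # negative value selects a force criterion (eV/A)
        "ISMEAR": "0",       # Gaussian smearing
        "SIGMA": "0.05",     # smearing width (eV)
        "LDAU": ".TRUE.",
        "LDAUTYPE": "2",     # Dudarev simplified rotationally invariant scheme
        # l = 2 (d shell) where a U is defined, -1 (no correction) otherwise
        "LDAUL": " ".join("2" if s in HUBBARD_U else "-1" for s in species),
        "LDAUU": " ".join(f"{HUBBARD_U.get(s, 0.0):.1f}" for s in species),
        "LDAUJ": " ".join("0.0" for _ in species),
    }
    if slab:
        tags["LDIPOL"] = ".TRUE."  # dipole correction for asymmetric slabs
        tags["IDIPOL"] = "3"       # assumes the surface normal is the c axis
    return "\n".join(f"{key} = {value}" for key, value in tags.items())


if __name__ == "__main__":
    print(render_incar(["Fe", "Co", "Ni", "Ir", "Ru"], slab=True))
```

Elements without a listed $U$ value (Ir and Ru in this example) receive no correction in the sketch; whether that matches a given production run should be checked against the actual input files.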

\subsection{Adsorption Energy Calculations}

The binding energies for OER intermediates were calculated using:

\begin{equation}
\Delta E_{*X} = E_{\mathrm{slab}+X} - E_{\mathrm{slab}} - E_{X,\mathrm{ref}}
\end{equation}

where the reference energies were obtained from gas-phase H$_2$O and H$_2$:
\begin{itemize}
\item *OH: referenced to H$_2$O(g) $-$ $0.5 \times$ H$_2$(g)
\item *O: referenced to H$_2$O(g) $-$ H$_2$(g)
\item *OOH: referenced to $2 \times$ H$_2$O(g) $-$ $1.5 \times$ H$_2$(g)
\end{itemize}
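Written out explicitly, with $E_{\mathrm{H_2O}}$ and $E_{\mathrm{H_2}}$ denoting the DFT total energies of the gas-phase molecules (symbols introduced here for illustration), the three reference energies listed above read:

\[
\begin{array}{rcl}
E_{\mathrm{OH},\mathrm{ref}} & = & E_{\mathrm{H_2O}} - \frac{1}{2}\, E_{\mathrm{H_2}}, \\[4pt]
E_{\mathrm{O},\mathrm{ref}} & = & E_{\mathrm{H_2O}} - E_{\mathrm{H_2}}, \\[4pt]
E_{\mathrm{OOH},\mathrm{ref}} & = & 2\, E_{\mathrm{H_2O}} - \frac{3}{2}\, E_{\mathrm{H_2}}.
\end{array}
\]

Each expression balances the O and H content of the adsorbate: for example, *OOH contains two O atoms and one H atom, which corresponds to two H$_2$O molecules minus three H atoms.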

Zero-point energy (ZPE) corrections and entropic contributions at 298 K were included:
\begin{itemize}
\item ZPE(*OH) = 0.35 eV
\item ZPE(*O) = 0.05 eV
\item ZPE(*OOH) = 0.40 eV
\item $TS$ contributions calculated from vibrational frequencies
\end{itemize}
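As a hedged illustration of how binding energies and the corrections above translate into an overpotential, the sketch below assumes the widely used four-step computational hydrogen electrode (CHE) scheme (* $\rightarrow$ *OH $\rightarrow$ *O $\rightarrow$ *OOH $\rightarrow$ O$_2$). The function names, the 1.23 V equilibrium potential used as a constant, and all example numbers are assumptions for illustration and are not results of our calculations.

```python
# Minimal sketch, assuming the standard four-step CHE mechanism.
# Function names and example values are hypothetical.

E_EQ = 1.23          # OER equilibrium potential (V vs RHE)
DG_TOTAL = 4 * E_EQ  # free energy of 2 H2O -> O2 + 2 H2 (eV)

# Reference stoichiometry (n_H2O, n_H2): E_ref = n_H2O*E_H2O - n_H2*E_H2
REFERENCES = {"OH": (1, 0.5), "O": (1, 1.0), "OOH": (2, 1.5)}


def binding_energy(adsorbate, e_slab_ads, e_slab, e_h2o, e_h2):
    """Delta E = E(slab+X) - E(slab) - E(X, ref), all in eV."""
    n_h2o, n_h2 = REFERENCES[adsorbate]
    e_ref = n_h2o * e_h2o - n_h2 * e_h2
    return e_slab_ads - e_slab - e_ref


def free_energy(delta_e, delta_zpe, t_delta_s):
    """Delta G = Delta E + Delta ZPE - T * Delta S, all in eV."""
    return delta_e + delta_zpe - t_delta_s


def oer_overpotential(dg_oh, dg_o, dg_ooh):
    """Return (overpotential in V, list of the four step free energies)."""
    steps = [
        dg_oh,              # * + H2O -> *OH + (H+ + e-)
        dg_o - dg_oh,       # *OH -> *O + (H+ + e-)
        dg_ooh - dg_o,      # *O + H2O -> *OOH + (H+ + e-)
        DG_TOTAL - dg_ooh,  # *OOH -> * + O2 + (H+ + e-)
    ]
    return max(steps) - E_EQ, steps


if __name__ == "__main__":
    eta, steps = oer_overpotential(1.0, 2.6, 4.1)
    print(f"overpotential = {eta:.2f} V, steps = {steps}")
```

For the hypothetical inputs in the usage example, the *OH $\rightarrow$ *O step (1.6 eV) is potential-limiting, which gives an overpotential of 0.37 V.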

\section{Extended Ablation Study Results}
\label{app:ablation}

\subsection{Complete Ablation Analysis}

Table~\ref{tab:full_ablation} reports the complete ablation results across component combinations:

\begin{table}[h]
\centering
\caption{Full ablation study examining all component combinations. Each configuration was tested with 200 generated candidates over 5 independent runs.}
\label{tab:full_ablation}
\begin{tabular}{lcccc}
\toprule
Configuration & Stability (\%) & $\eta_{\mathrm{OER}}$ (V) & Diversity & Time (h) \\
\midrule
Full System & $82.4 \pm 1.8$ & $0.362 \pm 0.015$ & 3.2 & 24 \\
No RAG & $23.1 \pm 4.2$ & $0.521 \pm 0.043$ & 4.1 & 18 \\
No Iteration & $64.3 \pm 3.1$ & $0.412 \pm 0.021$ & 3.0 & 5 \\
Constraint Only & $68.2 \pm 2.7$ & $0.395 \pm 0.018$ & 1.8 & 22 \\
Analogy Only & $41.3 \pm 3.9$ & $0.438 \pm 0.027$ & 3.5 & 21 \\
Random Baseline & $3.2 \pm 1.1$ & $0.612 \pm 0.071$ & 4.5 & 20 \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Hyperparameter Sensitivity}

Extended hyperparameter analysis across broader ranges:

\begin{table}[h]
\centering
\caption{Extended hyperparameter sensitivity analysis}
\begin{tabular}{lccc}
\toprule
Parameter & Range Tested & Optimal & Impact \\
\midrule
Temperature & 0.1--1.0 & 0.7 & Critical \\
Top-p & 0.5--1.0 & 0.95 & Moderate \\
$k$ (retrieval) & 5--50 & 20 & High \\
Similarity threshold & 0.7--0.95 & 0.85 & Low \\
Beam width & 1--10 & 5 & Moderate \\
Iterations & 1--10 & 5 & High \\
\bottomrule
\end{tabular}
\end{table}

\section{Additional Statistical Analyses}
\label{app:statistics}

\subsection{Multiple Comparison Corrections}

Because we tested 250 catalyst candidates, we applied multiple-comparison corrections:

\textbf{Bonferroni Correction:}
\begin{itemize}
\item Original significance level: $\alpha = 0.05$
\item Number of comparisons: 250
\item Corrected significance level: $\alpha' = 0.05/250 = 0.0002$
\item All reported significant results met this threshold
\end{itemize}

\textbf{False Discovery Rate (FDR) Control:}
\begin{itemize}
\item Benjamini-Hochberg procedure applied
\item FDR controlled at $q = 0.05$
\item 87\% of discoveries remained significant after correction
\end{itemize}
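Both corrections can be sketched in a few lines. The block below is a minimal pure-Python illustration of the Bonferroni threshold and the Benjamini-Hochberg step-up procedure; the function names and the example $p$-values are hypothetical and do not come from our analysis scripts.

```python
# Minimal sketch of the two corrections; names and inputs are hypothetical.


def bonferroni_threshold(alpha, n_comparisons):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_comparisons


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= k * q / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(bonferroni_threshold(0.05, 250))
    print(benjamini_hochberg([0.02, 0.03, 0.035, 0.9], q=0.05))
```

In the usage example the two smallest $p$-values exceed their own rank thresholds, yet they are still rejected because the third one passes; this step-up behavior is what makes the procedure less conservative than Bonferroni.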

\subsection{Effect Size Calculations}

Cohen's d effect sizes for key comparisons:

\begin{table}[h]
\centering
\begin{tabular}{lcc}
\toprule
Comparison & Cohen's d & Interpretation \\
\midrule
LLM vs IrO$_2$ baseline & 2.31 & Very large \\
LLM vs known catalysts & 1.87 & Large \\
With RAG vs without & 3.42 & Very large \\
Combined vs constraint-only prompts & 1.42 & Large \\
Combined vs analogy-only prompts & 2.18 & Very large \\
\bottomrule
\end{tabular}
\end{table}
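For reference, and assuming the pooled-standard-deviation form of Cohen's $d$ (the symbols $\bar{x}_i$, $s_i$, and $n_i$ for the group means, standard deviations, and sample sizes are introduced here for illustration), the effect size for two groups is

\begin{equation}
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad
s_p = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}} .
\end{equation}

Under Cohen's conventional benchmarks, $|d| \geq 0.8$ is read as a large effect.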

\subsection{Bootstrap Confidence Intervals}

Detailed bootstrap analysis ($n = 1000$ resamples):

\begin{itemize}
\item Mean improvement: 0.175 V
\item Standard error: 0.023 V
\item 95\% CI: [0.152, 0.198] V
\item 99\% CI: [0.144, 0.206] V
\item Bias-corrected accelerated (BCa) CI: [0.155, 0.195] V
\end{itemize}
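The resampling step can be illustrated with a short percentile-bootstrap sketch. It uses only the Python standard library, operates on made-up improvement values, and does not implement the BCa adjustment; the function name and the data are hypothetical.

```python
# Percentile-bootstrap sketch on synthetic data (not our measurements).
import random
import statistics


def percentile_bootstrap_ci(sample, n_resamples=1000, level=0.95, seed=0):
    """Return (bootstrap mean, bootstrap standard error, (lower, upper))."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return statistics.fmean(means), statistics.stdev(means), (lower, upper)


if __name__ == "__main__":
    # Hypothetical overpotential improvements in V.
    improvements = [0.12, 0.21, 0.18, 0.15, 0.25, 0.09, 0.19, 0.17, 0.22, 0.14]
    mean, se, (lo, hi) = percentile_bootstrap_ci(improvements)
    print(f"mean={mean:.3f} V, SE={se:.3f} V, 95% CI=[{lo:.3f}, {hi:.3f}] V")
```

Fixing the seed makes the resampling reproducible, so intervals at different confidence levels are computed from the same set of resampled means.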

\section{Extended Methodology Details}
\label{app:methodology}

\subsection{RAG Database Construction}

The 50,000+ entry database was constructed from multiple sources:

\begin{itemize}
\item Materials Project: 25,000 entries (validated DFT calculations)
\item OQMD: 10,000 entries (high-throughput screening results)
\item Catalysis-Hub: 8,000 entries (surface calculations)
\item Literature extraction: 7,000+ entries (2015--2024 publications)
\end{itemize}

Each entry contains:
\begin{itemize}
\item Chemical composition and stoichiometry
\item Crystal structure (space group, lattice parameters)
\item Formation energy and energy above hull
\item Electronic properties (band gap, d-band center)
\item Catalytic metrics (overpotential, Tafel slope, turnover frequency)
\item Synthesis conditions (when available)
\item Stability assessments (electrochemical, thermal)
\end{itemize}

\subsection{Prompt Engineering Templates}

Complete prompt templates used for generation:

\textbf{Initial Generation Prompt:}
\begin{verbatim}
You are a materials scientist designing high-entropy alloy catalysts
for the oxygen evolution reaction. Based on the following successful
catalysts:

[Retrieved Examples]

Generate a novel HEA composition that:
1. Contains 5-6 metallic elements
2. Maintains atomic size mismatch < 15%
3. Keeps electronegativity difference < 0.4
4. Targets formation energy < 50 meV/atom above hull
5. Optimizes d-band center between -2.5 and -1.5 eV

Explain your reasoning for element selection and predicted properties.
\end{verbatim}

\textbf{Iterative Refinement Prompt:}
\begin{verbatim}
The previous composition [Formula] showed:
- Stability: [E_hull] meV/atom
- *OH binding: [Energy] eV
- Limiting potential: [Value] V

Modify this composition to:
1. Improve limiting potential toward 0.35 V
2. Maintain thermodynamic stability
3. Enhance Fe-Co synergy if present

Suggest 3 variations with reasoning.
\end{verbatim}

\subsection{Vector Embedding Details}

SciBERT encoding process:
\begin{itemize}
\item Input text tokenization using WordPiece
\item Maximum sequence length: 512 tokens
\item Embedding dimension: 768
\item Pooling strategy: Mean pooling of final layer
\item Normalization: L2 normalization for cosine similarity
\end{itemize}
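The pooling, normalization, and retrieval steps can be sketched without any deep-learning dependency. The block below is a simplified illustration on plain Python lists: masking of padding tokens, the use of the similarity threshold as a hard filter, and all function names are assumptions of this sketch, and the actual pipeline operates on 768-dimensional SciBERT outputs indexed with FAISS.

```python
# Simplified sketch of mean pooling, L2 normalization, and top-k retrieval.
# Function names are hypothetical; vectors are toy-sized for readability.
import math


def mean_pool(token_vectors, attention_mask):
    """Average final-layer token vectors over positions where the mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec):
    """Scale to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query, corpus, k=20, threshold=0.85):
    """Return up to k (index, similarity) pairs at or above the threshold."""
    scored = [
        (i, sum(a * b for a, b in zip(query, doc)))
        for i, doc in enumerate(corpus)
    ]
    scored = [pair for pair in scored if pair[1] >= threshold]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]  # last token is padding
    query = l2_normalize(mean_pool(tokens, [1, 1, 0]))
    corpus = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
    print(top_k(query, corpus, k=2, threshold=0.5))
```

The defaults \texttt{k=20} and \texttt{threshold=0.85} mirror the optimal values in the hyperparameter table; the usage example overrides them only because the toy corpus has three entries.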

\section{Property Correlation Analysis}
\label{app:correlations}

\subsection{Complete Correlation Matrix}

Full correlation analysis between compositional features and performance metrics:

\begin{table}[h]
\centering
\small
\begin{tabular}{lccccc}
\toprule
Feature & $\eta_{\mathrm{OER}}$ & Stability & $d$-band & EN & Size \\
\midrule
$\eta_{\mathrm{OER}}$ & 1.00 & & & & \\
Stability & $-0.42^{**}$ & 1.00 & & & \\
$d$-band center & $-0.73^{***}$ & $0.31^{*}$ & 1.00 & & \\
Avg. EN & $0.28^{*}$ & $-0.19$ & $-0.35^{**}$ & 1.00 & \\
Size mismatch & 0.15 & $-0.52^{***}$ & $-0.08$ & 0.21 & 1.00 \\
Fe content & $-0.38^{**}$ & $0.27^{*}$ & $0.41^{**}$ & $-0.15$ & $-0.03$ \\
Co content & $-0.41^{**}$ & $0.29^{*}$ & $0.45^{***}$ & $-0.18$ & $-0.05$ \\
Entropy & $-0.33^{**}$ & $0.48^{***}$ & 0.12 & $-0.09$ & $-0.31^{*}$ \\
\bottomrule
\end{tabular}
\caption{Pearson correlations between compositional features and performance metrics. EN: electronegativity. $^{*}p<0.05$, $^{**}p<0.01$, $^{***}p<0.001$ after Bonferroni correction.}
\end{table}

\subsection{Principal Component Analysis}

The first three principal components explained 72\% of variance:
\begin{itemize}
\item PC1 (31\%): Electronic properties (d-band, conductivity)
\item PC2 (24\%): Geometric factors (size mismatch, coordination)
\item PC3 (17\%): Compositional complexity (entropy, element count)
\end{itemize}

\section{Synthesis Feasibility Assessment}
\label{app:synthesis}

\subsection{Detailed Synthesis Conditions}

For top-performing catalysts, estimated synthesis requirements:

\begin{table}[h]
\centering
\begin{tabular}{lcc}
\toprule
Composition & Method & Conditions \\
\midrule
Fe$_{0.2}$Co$_{0.2}$Ni$_{0.2}$Ir$_{0.1}$Ru$_{0.3}$ & Arc melting & 1800$^{\circ}$C, Ar \\
Mn$_{0.15}$Fe$_{0.25}$Co$_{0.25}$Ni$_{0.2}$Pt$_{0.15}$ & Sputtering & 400$^{\circ}$C, 5 mTorr \\
Cr$_{0.2}$Fe$_{0.2}$Co$_{0.3}$Ni$_{0.2}$Mo$_{0.1}$ & Ball milling & 500 rpm, 20 h \\
V$_{0.1}$Cr$_{0.2}$Mn$_{0.2}$Fe$_{0.25}$Co$_{0.25}$ & Carbothermal & 2000$^{\circ}$C flash \\
\bottomrule
\end{tabular}
\end{table}
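As a quick consistency check on the compositions in the table above, the sketch below computes the ideal configurational mixing entropy, $\Delta S_{\mathrm{mix}} = -R \sum_i c_i \ln c_i$, in units of $R$. Treating the alloys as ideal solutions is an assumption of this sketch, and the function and variable names are hypothetical.

```python
# Ideal-solution mixing entropy for the tabulated compositions.
# The ideal-mixing formula is an assumption; names are hypothetical.
import math


def mixing_entropy(fractions):
    """Return -sum(c_i * ln c_i), i.e. the mixing entropy in units of R."""
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"atomic fractions sum to {total}, expected 1")
    return -sum(c * math.log(c) for c in fractions.values() if c > 0)


TOP_CANDIDATES = {
    "FeCoNiIrRu": {"Fe": 0.2, "Co": 0.2, "Ni": 0.2, "Ir": 0.1, "Ru": 0.3},
    "MnFeCoNiPt": {"Mn": 0.15, "Fe": 0.25, "Co": 0.25, "Ni": 0.2, "Pt": 0.15},
    "CrFeCoNiMo": {"Cr": 0.2, "Fe": 0.2, "Co": 0.3, "Ni": 0.2, "Mo": 0.1},
    "VCrMnFeCo": {"V": 0.1, "Cr": 0.2, "Mn": 0.2, "Fe": 0.25, "Co": 0.25},
}

if __name__ == "__main__":
    for name, fractions in TOP_CANDIDATES.items():
        print(f"{name}: {mixing_entropy(fractions):.3f} R")
```

The upper bound for any five-component alloy is $\ln 5 \approx 1.61\,R$, reached at equiatomic composition; all four tabulated compositions fall between $1.5\,R$ and that bound under this idealization.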

\subsection{Stability Under Operating Conditions}

Pourbaix diagram analysis suggests stability windows:
\begin{itemize}
\item pH 0--14: Fe-Co-Ni compositions stable as oxides/hydroxides
\item pH 7--14: Mn-containing catalysts show optimal stability
\item Potential range: 0.8--1.8 V vs RHE for all compositions
\item Dissolution rates: $<$1 nm per 1000 h, estimated from computational models
\end{itemize}

\section{Limitations and Future Work}
\label{app:limitations}

\subsection{Comprehensive Limitations}

Beyond those mentioned in the main text:

\textbf{Computational Limitations:}
\begin{itemize}
\item DFT functional choice (PBE) may underestimate band gaps
\item Finite size effects in surface slabs
\item Neglect of solvent effects beyond implicit models
\item No consideration of surface coverage effects
\item Static calculations miss dynamic restructuring
\end{itemize}

\textbf{Physical Limitations:}
\begin{itemize}
\item Assumes uniform composition (no segregation)
\item Ignores grain boundary effects
\item No consideration of support interactions
\item Excludes mass transport limitations
\item Neglects bubble formation dynamics
\end{itemize}

\textbf{Methodological Limitations:}
\begin{itemize}
\item LLM knowledge cutoff prevents recent literature inclusion
\item RAG database biased toward published successful catalysts
\item Single-objective optimization misses trade-offs
\item No active learning from failed candidates
\item Limited to compositions expressible in text
\end{itemize}

\subsection{Proposed Extensions}

Future work should address:

\begin{enumerate}
\item \textbf{Multi-objective optimization:} Incorporate stability, conductivity, cost
\item \textbf{Kinetic modeling:} Include activation barriers via NEB calculations
\item \textbf{Experimental validation:} Synthesize top 10 candidates
\item \textbf{Active learning:} Update RAG database with experimental feedback
\item \textbf{Broader reactions:} Extend to ORR, HER, CO$_2$RR
\item \textbf{Microstructure:} Consider nanoparticle size/shape effects
\item \textbf{Operando modeling:} Simulate under realistic electrochemical conditions
\item \textbf{Uncertainty quantification:} Provide confidence intervals for predictions
\end{enumerate}

\section{Code and Data Availability}
\label{app:code}

The complete codebase and datasets are available at:
\url{https://github.com/anonymous/llm-catalyst-discovery}

Repository structure:
\begin{verbatim}
llm-catalyst-discovery/
|-- data/
|   |-- materials_database.json
|   |-- generated_catalysts.csv
|   |-- dft_results/
|-- src/
|   |-- rag_system.py
|   |-- prompt_engineering.py
|   |-- dft_validation.py
|   |-- statistical_analysis.py
|-- notebooks/
|   |-- data_analysis.ipynb
|   |-- figure_generation.ipynb
|-- requirements.txt
\end{verbatim}

\section{Reproducibility Checklist}
\label{app:reproducibility}

To reproduce our results:

\begin{enumerate}
\item \textbf{Environment Setup:}
   \begin{itemize}
   \item Python 3.9+
   \item GPT-4 API access
   \item VASP 6.3 license
   \item 200+ CPU cores recommended
   \end{itemize}

\item \textbf{Data Preparation:}
   \begin{itemize}
   \item Download materials database
   \item Index with FAISS
   \item Precompute SciBERT embeddings
   \end{itemize}

\item \textbf{Generation Parameters:}
   \begin{itemize}
   \item Temperature: 0.7
   \item Top-p: 0.95
   \item Retrieval k: 20
   \item Iterations: 5
   \end{itemize}

\item \textbf{Validation Protocol:}
   \begin{itemize}
   \item Screen with ML potentials first
   \item Run DFT with specified parameters
   \item Calculate limiting potentials
   \item Apply statistical tests
   \end{itemize}
\end{enumerate}

Estimated computation time: 5--7 days for the full pipeline with 250 candidates.