\section{Methodology}
\label{sec:methodology}

\subsection{Overview}

Our retrieval-augmented generation (RAG) framework enables GPT-4 to discover novel high-entropy alloy (HEA) catalysts without fine-tuning by integrating three components: (1) a database of more than 50,000 materials for chemical grounding, (2) structured prompt engineering for directed exploration, and (3) density functional theory (DFT) validation for performance verification. Pre-trained models encode implicit scientific knowledge \cite{bubeck2023sparks}; RAG \cite{lewis2020retrieval} grounds this knowledge by retrieving relevant known catalysts while preserving the model's capacity for creative exploration. The framework achieves an 82\% thermodynamic stability rate and a 25\% performance improvement over baselines.

\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{figures/pipeline.png}
\caption{LLM-driven catalyst discovery pipeline: RAG retrieval $\rightarrow$ LLM generation $\rightarrow$ DFT validation.}
\label{fig:pipeline}
\end{figure}

\subsection{RAG Architecture}

Our vector database contains more than 50,000 materials entries \cite{carlucci2023high}, each encoded with SciBERT \cite{beltagy2019scibert} as a 768-dimensional vector. A two-stage retrieval selects $k=20$ relevant catalysts: a cosine-similarity search returns the top 100 entries, which are then filtered chemically ($\geq$3 elements, overpotential $<$500\,mV). Each retrieved example is formatted as ``[composition] \textbar{} $E_{hull}$=[X] eV \textbar{} $\eta$=[Y] mV'', giving the LLM both successful designs and stability boundaries from which to extract patterns.
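As a minimal sketch of the two-stage retrieval described above (not the production implementation), the following uses plain Python lists in place of SciBERT embeddings and a vector index. The names \texttt{Entry}, \texttt{retrieve}, and \texttt{format\_entry} are hypothetical and introduced only for illustration; the thresholds are those stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    composition: dict  # element symbol -> atomic fraction
    e_hull: float      # energy above the convex hull, in eV
    eta_mv: float      # overpotential, in mV
    embedding: list    # stand-in for a 768-dimensional SciBERT vector


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, entries, top_n=100, k=20, min_elements=3, max_eta_mv=500.0):
    """Two-stage retrieval: similarity ranking, then chemical filtering."""
    # Stage 1: keep the top_n entries most similar to the query.
    ranked = sorted(entries, key=lambda e: cosine(query_vec, e.embedding), reverse=True)
    shortlist = ranked[:top_n]
    # Stage 2: keep multi-element entries with a sufficiently low overpotential.
    kept = [
        e for e in shortlist
        if len(e.composition) >= min_elements and e.eta_mv < max_eta_mv
    ]
    return kept[:k]


def format_entry(e):
    """Render one retrieved entry in the prompt format given in the text."""
    comp = "".join(f"{el}{frac:g}" for el, frac in e.composition.items())
    return f"{comp} | E_hull={e.e_hull:.3f} eV | eta={e.eta_mv:.0f} mV"
```

Because the chemical filter runs after the similarity cut, fewer than $k$ examples may be returned when many of the top-100 neighbours fail the filter.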

\subsection{Prompt Engineering}

We employ three prompting strategies: (1) Constraint-based: encoding Pauling's rules \cite{pauling1929principles} and the Hume-Rothery rules, empirical guidelines that predict alloy stability from atomic size difference ($<$15\%), electronegativity difference ($\Delta\chi<0.4$), and valence electron concentration (VEC 4--9); (2) Analogical: transferring properties from known catalysts \cite{jain2013commentary} (``IrO$_2$ has d$^5$ configuration $\rightarrow$ design HEA with similar d-count''); (3) Iterative: incorporating DFT feedback with uncertainty bounds over 4--5 cycles. The initial generation produces 50 candidates, which are pruned by beam search using performance metrics and their 95\% confidence intervals.
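The constraint-based strategy can be illustrated with a small checker. This is a sketch under stated assumptions rather than the rule set used in the prompts: the size difference is taken as $(r_{\max}-r_{\min})/r_{\min}$, the electronegativity difference as the maximum minus the minimum Pauling value, and the VEC as a composition-weighted mean. The element data are approximate textbook values supplied only for illustration, and \texttt{check\_constraints} is a hypothetical helper name.

```python
# Approximate, illustrative element data (assumed for this sketch only):
# symbol -> (metallic radius in pm, Pauling electronegativity, valence electron count)
ELEMENTS = {
    "Cr": (128.0, 1.66, 6),
    "Mn": (127.0, 1.55, 7),
    "Fe": (126.0, 1.83, 8),
    "Co": (125.0, 1.88, 9),
    "Ni": (124.0, 1.91, 10),
}


def check_constraints(composition, max_size_diff=0.15, max_dchi=0.4, vec_range=(4.0, 9.0)):
    """Evaluate the three empirical criteria for a {symbol: fraction} composition."""
    radii = [ELEMENTS[el][0] for el in composition]
    chis = [ELEMENTS[el][1] for el in composition]
    total = sum(composition.values())

    # Assumed definitions: extreme-value differences and a weighted-mean VEC.
    size_diff = (max(radii) - min(radii)) / min(radii)
    dchi = max(chis) - min(chis)
    vec = sum(frac * ELEMENTS[el][2] for el, frac in composition.items()) / total

    passes = (
        size_diff < max_size_diff
        and dchi < max_dchi
        and vec_range[0] <= vec <= vec_range[1]
    )
    return {"size_diff": size_diff, "dchi": dchi, "vec": vec, "passes": passes}


# Example: an equimolar five-element alloy and a binary that exceeds the VEC window.
cantor = check_constraints({"Cr": 0.2, "Mn": 0.2, "Fe": 0.2, "Co": 0.2, "Ni": 0.2})
binary = check_constraints({"Co": 0.5, "Ni": 0.5})
```

Other definitions of the mismatch parameters (for example, composition-weighted standard deviations) are common in the HEA literature and would change the numerical thresholds.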

\subsection{DFT Validation and Synthesis Feasibility}

Candidates were validated by three-tier screening: (1) Thermodynamic stability via the convex hull ($E_{hull}<50$ meV/atom), using CHGNet pre-screening followed by VASP calculations \cite{jain2013commentary,chen2024chgnet}; (2) Electronic structure using PBE+U ($U$ values: Fe = 3.3, Co = 3.4, Ni = 3.5, Mn = 3.0 eV) with a 500\,eV cutoff, $3\times3\times3$ $k$-points for bulk and $3\times3\times1$ for surfaces, and $10^{-5}$\,eV convergence \cite{perdew1996generalized,dudarev1998electron}; (3) OER activity via the limiting potential, expressed as the overpotential $\eta_{OER} = \max_i\{\Delta G_i\}/e - 1.23$\,V, where the $\Delta G_i$ are the free-energy changes of the four proton-coupled electron-transfer steps, computed from the *OH, *O, and *OOH intermediates with ZPE corrections (0.35, 0.05, and 0.40 eV, respectively) \cite{norskov2004origin}.
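To make the overpotential expression concrete, the sketch below evaluates it from the three adsorbate free energies using the standard four-step OER mechanism, in which the step energies sum to $4\times1.23$\,eV. It assumes the inputs are adsorption free energies in eV that already include the ZPE and entropy corrections; the function name and the example values are hypothetical and are not results from this work.

```python
def oer_overpotential(dg_oh, dg_o, dg_ooh, e_eq=1.23):
    """Return (overpotential in V, index of the limiting step, 1 to 4).

    Inputs are adsorption free energies (eV) of *OH, *O and *OOH.
    Each step transfers one electron, so a step energy in eV is
    numerically equal to the corresponding potential in V.
    """
    steps = [
        dg_oh,              # H2O + *  -> *OH  + H+ + e-
        dg_o - dg_oh,       # *OH      -> *O   + H+ + e-
        dg_ooh - dg_o,      # *O + H2O -> *OOH + H+ + e-
        4 * e_eq - dg_ooh,  # *OOH     -> * + O2 + H+ + e-
    ]
    limiting = max(range(4), key=lambda i: steps[i])
    return steps[limiting] - e_eq, limiting + 1


# Illustrative inputs only: step energies are 1.0, 1.8, 1.4 and 0.72 eV,
# so the second step (*OH -> *O) is limiting.
eta, step = oer_overpotential(1.0, 2.8, 4.2)
```

An ideal catalyst with step energies of exactly 1.23\,eV each would give zero overpotential under this model.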

Synthesis feasibility was assessed via melting point calculations using empirical correlations, phase diagram analysis for processing windows, and literature precedents for similar compositions. Of the top candidates, 65\% require $<$1500\,$^{\circ}$C (feasible by arc melting), 25\% need 1500--2000\,$^{\circ}$C (specialized techniques), and 10\% exceed 2000\,$^{\circ}$C (challenging but achievable via flash sintering).

\subsection{Cost Analysis and Computational Efficiency}

A computational cost comparison shows a substantial advantage: LLM-RAG requires 4,200 CPU-hours for 250 candidates, versus 840,000 CPU-hours for exhaustive DFT screening of $10^6$ compositions. API costs total \$450 for GPT-4 generation (\$0.03 per 1k tokens, $\sim$15M tokens), versus an estimated \$84,000 in cloud computing for traditional screening. The environmental impact is 0.2 kg CO$_2$ emissions (API calls) versus 42 kg CO$_2$ (HPC cluster usage). The 200$\times$ efficiency gain scales to 300,000$\times$ for 6-element HEAs, making previously intractable searches feasible.
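The headline ratios follow from simple arithmetic on the figures above. The cloud rate per CPU-hour is not stated explicitly; the value below is only what the quoted totals imply.
\[
\frac{840{,}000\ \mathrm{CPU\mbox{-}hours}}{4{,}200\ \mathrm{CPU\mbox{-}hours}} = 200,
\qquad
\frac{4{,}200\ \mathrm{CPU\mbox{-}hours}}{250\ \mathrm{candidates}} = 16.8\ \mathrm{CPU\mbox{-}hours\ per\ candidate}
\]
\[
15\times10^{6}\ \mathrm{tokens} \times \frac{\$0.03}{10^{3}\ \mathrm{tokens}} = \$450,
\qquad
\frac{\$84{,}000}{840{,}000\ \mathrm{CPU\mbox{-}hours}} = \$0.10\ \mathrm{per\ CPU\mbox{-}hour}
\]
\[
\frac{42\ \mathrm{kg\ CO_2}}{0.2\ \mathrm{kg\ CO_2}} = 210
\]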

Iterative refinement incorporates DFT feedback over 4--5 cycles, with diminishing returns beyond cycle 5. Statistical validation using Bonferroni-corrected tests (250 comparisons, per-comparison $\alpha=0.0002$) confirms significance. A bootstrap confidence interval ($n=1000$) yields an improvement of $\Delta\eta = 0.175 \pm 0.023$\,V (CI: 0.152--0.198\,V) across validated catalysts.
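The two statistical steps can be sketched as follows. This is an illustration under assumptions, not the analysis script: the family-wise level of 0.05 is inferred from $0.0002 \times 250$, $n=1000$ is read as the number of bootstrap resamples, the interval uses the percentile method, and the sample values are made up for demonstration.

```python
import random
import statistics


def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return family_alpha / n_comparisons


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Assumed family-wise level of 0.05 across the 250 comparisons in the text.
alpha = bonferroni_alpha(0.05, 250)

# Synthetic overpotential improvements (V), for illustration only.
sample = [0.15, 0.16, 0.17, 0.175, 0.18, 0.19, 0.20]
lo, hi = bootstrap_ci(sample)
```

Fixing the random seed keeps the interval reproducible between runs; the width of the interval depends on the sample size as well as the number of resamples.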

\subsection{Failure Mode Analysis and Generalizability}

Systematic failure analysis identified three primary modes: (1) Chemically implausible compositions (18\% of candidates) featuring incompatible elements (e.g., alkali-refractory combinations with an electronegativity difference $>$2.0); (2) Thermodynamically unstable phases (15\%) with $E_{hull}>100$ meV/atom; (3) Synthesis-prohibitive compositions (10\%) requiring $>$2500\,$^{\circ}$C or extreme pressures. For example, ``Li$_{0.3}$W$_{0.3}$Fe$_{0.2}$Co$_{0.2}$'' violated both the electronegativity ($\Delta\chi=2.4$) and size-mismatch (42\%) constraints.

Framework generalizability was tested on the hydrogen evolution reaction (HER) and the CO$_2$ reduction reaction (CO$_2$RR) by modifying the prompts and retrieval databases. The HER adaptation achieved a 73\% stability rate, with Pt-free catalysts showing overpotentials $<$50\,mV. CO$_2$RR tests yielded 68\% selectivity for C$_{2+}$ products. Cross-reaction learning was also observed: OER-optimized prompts transferred to HER with a 15\% performance penalty, suggesting shared design principles.

\textbf{Open-Source LLM Evaluation:} We tested LLaMA-2 (70B) \cite{touvron2023llama} and Mistral (7B) \cite{jiang2023mistral} as accessible alternatives. LLaMA-2 achieved 70\% of GPT-4's performance (58\% stability rate, mean $\eta=0.385$\,V), while Mistral reached 62\% (51\% stability rate, $\eta=0.412$\,V). Fine-tuning on materials literature improved LLaMA-2 to 76\% relative performance. The total cost was \$45 (local GPU) versus \$450 (GPT-4 API), demonstrating feasibility for resource-constrained settings. In our implementation, GPT-4, LLaMA-2, or Mistral with FAISS-indexed RAG processes 50--100 candidates per day on 200 CPUs and 8 GPUs.