\section{Methodology}
\label{sec:methodology}

\subsection{Overview}

Our retrieval-augmented generation (RAG) framework enables GPT-4 to discover novel high-entropy alloy (HEA) catalysts without fine-tuning by integrating three components: (1) a database of more than 50,000 materials for chemical grounding, (2) structured prompt engineering for directed exploration, and (3) density functional theory (DFT) validation for performance verification. Pre-trained models encode implicit scientific knowledge \cite{bubeck2023sparks}; RAG \cite{lewis2020retrieval} grounds this knowledge by retrieving relevant known catalysts while leaving room for creative exploration. With this approach, 82\% of generated candidates are thermodynamically stable, and performance improves by 25\% over baselines.

\begin{figure}[h]
\centering
\includegraphics[width=0.65\textwidth]{figures/pipeline.png}
\caption{LLM-driven catalyst discovery pipeline: RAG retrieval $\rightarrow$ LLM generation $\rightarrow$ DFT validation.}
\label{fig:pipeline}
\end{figure}

\subsection{RAG Architecture}

Our vector database contains more than 50,000 materials entries \cite{carlucci2023high}, each encoded with SciBERT \cite{beltagy2019scibert} as a 768-dimensional vector. A two-stage retrieval selects $k=20$ relevant catalysts: a cosine-similarity search returns the top 100 entries, which are then filtered chemically ($\geq$3 elements, overpotential $<$500\,mV). Each retrieved example is formatted as ``[composition] $|$ $E_{\mathrm{hull}}$=[X] eV $|$ $\eta$=[Y] mV'', giving the LLM both successful designs and stability boundaries from which to extract patterns.
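
One way to formalize the two-stage retrieval above is sketched here. The symbols $q$, $d_i$, $s$, $C_{100}$, $n_{\mathrm{el}}$, and $R$ are illustrative notation, and ranking the filtered set by similarity to obtain the final $k=20$ is an assumption. For a 768-dimensional query embedding $q$ and database embeddings $d_i$, the similarity score is
\[
s(q, d_i) = \frac{q^{\top} d_i}{\|q\|\,\|d_i\|}.
\]
Stage one keeps the set $C_{100}$ of the 100 entries with the largest $s(q, d_i)$. Stage two applies the chemical filter,
\[
R = \{\, i \in C_{100} : n_{\mathrm{el}}(i) \geq 3,\ \eta_i < 500\ \mathrm{mV} \,\},
\]
where $n_{\mathrm{el}}(i)$ is the number of elements in entry $i$ and $\eta_i$ its overpotential. The $k=20$ highest-scoring members of $R$ are then passed to the prompt.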

\subsection{Prompt Engineering}

We employ three prompting strategies: (1) Constraint-based: encoding Pauling's rules \cite{pauling1929principles} and the Hume-Rothery rules (atomic size mismatch $<$15\%, electronegativity difference $\Delta\chi<0.4$, valence electron concentration (VEC) of 4--9); (2) Analogical: transferring properties from known catalysts \cite{jain2013commentary} (``IrO$_2$ has d$^5$ configuration $\rightarrow$ design HEA with similar d-count''); (3) Iterative: incorporating DFT feedback over 4--5 cycles (``Fe$_{0.2}$Co$_{0.2}$Ni$_{0.2}$Cr$_{0.2}$Mn$_{0.2}$ gave $-1.8$\,eV *OH $\rightarrow$ modify for $-1.6$\,eV''). The initial generation produces 50 candidates, which are then pruned by beam search according to performance.
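
The constraint-based strategy can be read as a set of inequalities. The sketch below assumes the classical pairwise form of the Hume-Rothery rules and a composition-weighted VEC; the prompts themselves may encode different definitions. The symbols $r_i$, $\chi_i$, and $c_i$ are illustrative notation for the atomic radius, Pauling electronegativity, and atomic fraction of element $i$:
\[
\frac{|r_i - r_j|}{r_j} < 0.15, \qquad |\chi_i - \chi_j| < 0.4 \quad \mbox{for all pairs } i \neq j,
\]
\[
4 \leq \mathrm{VEC} = \sum_i c_i\,\mathrm{VEC}_i \leq 9.
\]
For the equimolar Fe-Co-Ni-Cr-Mn composition quoted above, with valence electron counts of 8, 9, 10, 6, and 7, this gives $\mathrm{VEC} = 0.2\,(8+9+10+6+7) = 8.0$, which lies inside the allowed range.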

\subsection{DFT Validation and Multi-Objective Screening}

Our validation employs a five-tier screening that extends beyond single-objective optimization: (1) thermodynamic stability via the convex hull ($E_{\mathrm{hull}}<50$\,meV/atom) \cite{jain2013commentary,chen2024chgnet}; (2) electronic structure using PBE+U \cite{perdew1996generalized,dudarev1998electron} (500\,eV cutoff, $3\times3\times3$ $k$-point mesh, $10^{-5}$\,eV convergence); (3) oxygen evolution reaction (OER) activity via the limiting potential \cite{norskov2004origin}, $\eta_{\mathrm{OER}} = \max_i\{\Delta G_i\}/e - 1.23$\,V, where $\Delta G_i$ are the elementary step energies and $e$ is the elementary charge; (4) electronic conductivity assessed through band structure analysis, targeting metallic character (band gap $<$0.1\,eV) to ensure efficient electron transport; (5) cost evaluation using commodity prices (Fe: \$0.1/kg, Co: \$33/kg, Ni: \$18/kg, Ir: \$180,000/kg, Ru: \$30,000/kg, Pt: \$30,000/kg as of 2024), targeting compositions with $<$20\% precious metal content.
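
For reference, the limiting-potential expression in tier (3) is commonly evaluated with the four-step adsorbate mechanism (*OH, *O, *OOH) of the computational hydrogen electrode approach. The sketch below assumes that mechanism; the numerical values in the example are illustrative and are not results of this work. With adsorption free energies $\Delta G_{\ast\mathrm{OH}}$, $\Delta G_{\ast\mathrm{O}}$, and $\Delta G_{\ast\mathrm{OOH}}$, the step energies are
\[
\Delta G_1 = \Delta G_{\ast\mathrm{OH}}, \qquad \Delta G_2 = \Delta G_{\ast\mathrm{O}} - \Delta G_{\ast\mathrm{OH}},
\]
\[
\Delta G_3 = \Delta G_{\ast\mathrm{OOH}} - \Delta G_{\ast\mathrm{O}}, \qquad \Delta G_4 = 4.92\ \mathrm{eV} - \Delta G_{\ast\mathrm{OOH}},
\]
so that $\sum_i \Delta G_i = 4.92$\,eV $= 4 \times 1.23$\,eV. For hypothetical step energies of 0.90, 1.65, 1.50, and 0.87\,eV, the second step is limiting and $\eta_{\mathrm{OER}} = 1.65\,\mathrm{V} - 1.23\,\mathrm{V} = 0.42$\,V.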

Because full multi-objective Pareto optimization is computationally prohibitive for 250+ candidates, we instead apply constraint-based filtering with three criteria: a conductivity threshold (metallic character required), a cost ceiling (\$5,000/kg maximum), and a mechanical stability estimate via Pugh's ratio ($B/G > 1.75$ for ductility) \cite{pugh1954xcii}. These constraints are encoded in our prompts: ``Generate HEA compositions maintaining metallic conductivity while minimizing Ir/Pt/Ru content below 30\%.'' Performance metrics are validated with bootstrap confidence intervals ($n=1000$) and paired $t$-tests. Details are given in Appendix A.
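
The cost ceiling can be made concrete as a mass-weighted price. The sketch below assumes that prices are weighted by mass fraction, since the text does not state whether atomic or mass fractions are used. The symbols $c_i$, $M_i$, $p_i$, $w_i$, and $P$ are illustrative notation for atomic fraction, molar mass, price per kilogram, mass fraction, and alloy price:
\[
w_i = \frac{c_i M_i}{\sum_j c_j M_j}, \qquad P = \sum_i w_i\, p_i,
\]
with $P$ required to stay below the \$5,000/kg ceiling. For an equimolar Fe-Co-Ni alloy, using the 2024 prices listed above and molar masses of 55.85, 58.93, and 58.69\,g/mol, the mass fractions are approximately 0.322, 0.340, and 0.338, giving $P \approx$ \$17/kg.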

\subsection{Statistical Analysis}

Iterative refinement over 4--5 cycles incorporates DFT feedback into subsequent prompts, for example ``Fe-Co enhances *OH $\rightarrow$ generate Fe$_{0.15-0.25}$Co$_{0.15-0.25}$''. For statistical validation we use bootstrap confidence intervals (95\%, $n=1000$) and Wilcoxon tests ($p<0.01$), which yield a mean improvement of $\Delta\eta = 0.175 \pm 0.023$\,V (CI: 0.152--0.198\,V) across 42 catalysts. Convergence criteria: stability $>$80\%, variance $<$0.05\,V, diversity $>$2.5\,bits.
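
The two statistics above can be sketched as follows, assuming a percentile bootstrap and Shannon entropy as the diversity measure; neither choice is specified in the text, and $x_j$, $N$, $B$, $q_\alpha$, $p_m$, and $H$ are illustrative notation. Given the per-catalyst improvements $x_1, \dots, x_N$ ($N=42$), each of $B=1000$ resamples draws $N$ values with replacement and records their mean,
\[
\bar{x}^{(b)} = \frac{1}{N} \sum_{j=1}^{N} x^{(b)}_j, \qquad b = 1, \dots, B,
\]
and the 95\% interval is $[\,q_{0.025},\, q_{0.975}\,]$, where $q_\alpha$ denotes the $\alpha$-quantile of the $B$ resampled means. For diversity, with $p_m$ the relative frequency of category $m$ (for example, an element or composition family) among the generated candidates,
\[
H = -\sum_m p_m \log_2 p_m .
\]
Under this reading, a threshold of 2.5\,bits corresponds to roughly $2^{2.5} \approx 5.7$ equally likely categories.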

\subsection{Implementation}

We use GPT-4 \cite{openai2023gpt4} (temperature 0.7, top-$p$ 0.95) with FAISS-indexed RAG; the pipeline processes 50--100 candidates per day on 200 CPUs and 8 GPUs. Limitations: validation is purely computational, ideal surfaces are assumed, and synthesis feasibility is not addressed. Extended implementation details and complete DFT parameters are provided in Appendix A.