% This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
\pdfoutput=1
% In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.

\documentclass[11pt]{article}

% Remove the "review" option to generate the final version.
\usepackage[review]{acl}

% Standard package includes
\usepackage{times}
\usepackage{bm}
\usepackage{latexsym}
\usepackage{multirow}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{siunitx}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{makecell}
\usepackage[disable]{todonotes}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{microtype}
\usepackage[most]{tcolorbox}
\usepackage{enumitem}% http://ctan.org/pkg/enumitem
\usepackage{rotating}

\DeclareMathOperator*{\argmax}{arg\,max}
\let\nvec\vec
\def\vec#1{\nvec{\vphantom t\smash{#1}}}

\newcommand{\mcva}{}% To make sure that \mcva isn't already defined
\def\mcva{My Climate Advisor\ } % add View if gets accepted

\title{My Climate Advisor:\\ %add view if it gets accepted
An Application of NLP in Climate Adaptation for Agriculture}
% A Dialogue-based Question Answering System for Climate Adaptation}


\author{\begin{tabular}{ccccc} 
Vincent Nguyen$^{1}$ & Sarvnaz Karimi$^{1}$ & Willow Hallgren$^{2}$ & Ashley Harkin$^{3}$ & Mahesh Prakash$^{1}$\\
\end{tabular}\\
\begin{tabular}{cccc}
\multicolumn{4}{c}{$^{1}$CSIRO Data61, Australia}\\
\multicolumn{4}{c}{$^{2}$CSIRO Agriculture and Food, Australia}\\
\multicolumn{4}{c}{\tt \{firstname.lastname\}@csiro.au}\\
\multicolumn{4}{c}{$^{3}$Bureau of Meteorology, Australia}\\
\multicolumn{4}{c}{\tt \{ashley.harkin\}@bom.gov.au}\\
\end{tabular}
}

\begin{document}
\maketitle
\begin{abstract}
Climate adaptation for agriculture necessitates tools that equip farmers and farm advisors with relevant and trustworthy knowledge to help make them more resilient to climate change. We introduce {\em My Climate Advisor}, a question-answering (QA) prototype that synthesizes information from different data sources, such as peer-reviewed scientific literature, and high-quality, industry-relevant grey literature, to generate answers, with references, to a given user's question. Our prototype uses open-source generative models for data privacy and intellectual property protection, and retrieval augmented generation for answer generation, grounding and provenance. While there are standard evaluation metrics for QA systems, there is no evaluation framework that suits our LLM-based QA application in the climate adaptation domain. We design an evaluation framework with seven metrics based on the requirements of the domain experts to judge the generated answers from 12 different LLM-based models. Our initial evaluations through a user study via domain experts show promising usability results.
\end{abstract}

\section{Introduction} 
Climate change impacts are seen across the globe in many different ways, from an increase in temperatures to an increase in the frequency of natural disasters. According to the United Nations Framework Convention on Climate Change~\cite{bodansky1993united}, climate change adaptations are increasingly necessary to adjust and respond to the impacts of climate change. These can include technological developments~\cite{smithers2001technology_climate_adaptation}, behavioral changes~\cite{lenzholzer2020awareness_climate_adaptation_behaviour}, early warning systems for extreme events~\cite{de2022adapting_weather_climate_adaptation}, and improved risk management~\cite{massetti2018measuring_climate_adaptation}. In agriculture, climate adaptation means improving farmers' capacity to deal with climate change. This adaptation can include the development and use of tools to increase their knowledge of climate change and methods to make them more resilient~\cite{cradock2020climate_adaptation_farmer}.

% what are the contributions 
% 1- a tool to help with climate adaptation 2- evaluation guidelines 3- initial evaluations
Our contributions are two-fold: (1) To make the evolving knowledge of climate change and adaptation practices accessible, we have developed a question-answering tool called {\em My Climate Advisor} (MCA).\footnote{This tool is currently private while further developments and testing are underway.} It is a prototype online service that gives farmers and farm advisors easier access to information from scientific literature, grey literature and reports, as well as future climate projection data. Given a question from a farmer or farm advisor, it responds with information synthesized from the literature alongside references for further reading; and (2) We propose a novel framework for evaluating such a system, with seven evaluation criteria designed by domain experts, which we share through an annotation guideline together with our initial experimental results.

%The tool will integrate with {\em My Climate View}'s API, allowing access to climate historical and projection data within a 100-year window for a breadth of Representative Concentration Pathway (RCP) emission scenarios~\cite{van2011representative_RCP}.

% https://www.hindawi.com/journals/ajgwr/2023/5025359/

\section{Background and Related Work}
We provide a background on climate adaptation, related tools and research in the climate change-agriculture space below.
%\todo[inline]{Perhaps this climate adaptation section goes after?}
\paragraph{Climate Adaptation}
%"Climate adaptation refers to adjustments in ecological-socio-economic systems in response to actual or expected climatic changes and their impacts (Source: 10.5194/nhess-15-2511-2015). It is a crucial response to climate change aimed at reducing vulnerability and minimizing the adverse impacts of changing weather patterns on natural and human systems. Climate adaptation encompasses various strategies, including autonomous, planned, anticipatory, reactive, individual, and collective approaches (Source: 10.1088/1755-1315/487/1/012005). These efforts aim to enhance resilience, protect ecosystems, and ensure the well-being of communities in the face of increasing climate variability and extremes."

%In the context of agriculture and farming, this means that farmers must adapt their practices and strategies to cope with the changing climate conditions, such as variations in temperature, precipitation patterns, and extreme weather events. This may involve implementing new technologies, crop varieties, irrigation systems, and management techniques to maintain or enhance productivity, ensure food security, and protect the environment. By adapting to climate change, agriculture and farming communities can build resilience and continue to provide essential goods and services while minimizing negative impacts on natural resources and ecosystems.

Climate adaptation is described as an adjustment in a social, economic or ecological setting in response to actual or expected climate change~\cite{climate_adaptation_frameworks_definitions}. In agriculture, farmers need to adjust their practices to improve resilience to variations in temperature, precipitation patterns and extreme weather events~\cite{su11071921_climate_adaptation_farmer}. Farmers may need to implement new technologies, crop cultivars and management techniques to ensure food security or economic security in a sustainable manner~\cite{fosu2012farmers_socioeconomic}. To help farmers adapt to climate change, a goal of \mcva is to produce location and commodity-relevant, up-to-date management advice from the literature.

%\paragraph{My Climate View} %one paragraph -- we add it back if accepted
%My Climate View~\cite{webb2023climate_csa} is a service that provides climate projections for commodities and regions within Australia. The service is backed by climate indices constructed by climate and commodity experts and climate information from the Australian Bureau of Meteorology. The service is being continually updated with a continuing user engagement initiative. 

\paragraph{NLP for Climate Science}
Machine learning in the climate science domain has been prevalent for years. Many efforts have been dedicated to climate modelling~\cite{dueben2018challenges_climate_modeling_ml,bittner2023lstm_downscaling}, disaster prediction~\cite{haggag2021deep_climate_disaster,keum2020real_climate_disaster_floods}, climate change in finance and commerce~\cite{nguyen2021predicting_climate_finance_carbon_footprint}, climate forecasting~\cite{NguyenBKGG23_foundation_climate_model} and informing policy change~\cite{milojevic2021machine_climate_policy}. However, natural language processing (NLP) for climate science remains under-explored.

NLP techniques have been used as an analysis tool to provide an overview of climate sentiment on social media~\cite{prasse2023towards,pupneja2023understanding}, for events such as the Conference of the Parties on Climate Change~\cite{pupneja2023understanding}, or for government policies~\cite{greenwell2023all_nlp_government_policy}. Aside from analysis, NLP techniques have helped with the monitoring of climate technology innovation~\cite{toetzke2023leveraging_climate_tech_innovation}, strategies for Environmental, Social and Governance (ESG) investment decision-making~\cite{visalli2023esg_enterprise} and the filtering of literature related to adaptation or mitigation strategies for climate-change-related health problems~\cite{berrang2021systematic}.

Annotated datasets are crucial for evaluating NLP models. The existing datasets include stance detection for climate change mitigation on social media~\cite{vaid-etal-2022-towards}, and global warming in the news~\cite{luo-etal-2020-detecting}, claim verification for climate change~\cite{leippold2020climatefever} and question-answering for both carbon disclosure and climate risk disclosure~\cite{climabench}. Climate-aware or Green Machine Learning has become more relevant over the years~\cite{Cowls2023}. This is also reflected in the NLP community, in the form of Green NLP intending to reduce carbon emissions in the training process of NLP models by re-using pretrained models~\cite{wolf-etal-2020-transformers} or in the disclosing or tracking of carbon emissions from NLP models~\cite{strubell-etal-2019-energy,hershcovich-etal-2022-towards}.

A common approach in NLP is to pre-train foundation models with a language model objective for downstream tasks~\cite{orig-bert-2018}. These models have been used in the form of Transformer~\cite{transformer_original} encoder-based models such as ClimateBERT~\cite{bingler2022cheap_climatebert}, which was pretrained on climate-related news articles, research abstracts, and corporate climate reports using domain-adaptive pre-training~\cite{gururangan-etal-2020-dont-dapt}, and CliMedBERT~\cite{jalalzadehfard2022climedbert} which proposed pre-training on climate science literature~\cite{berrang2021systematic}, climate-policy documents and IPCC reports. However, such approaches using masked language modeling~\cite{orig-bert-2018} are becoming less prevalent in the question-answering space.

Instead, there has recently been a shift in the NLP community towards adopting Large Language Models (LLMs) pretrained on an autoregressive language modeling task~\cite{gpt_3_paper} and fine-tuned with instructions and human preference labels~\cite{instructgpt}. These have been used in a chatbot question-answering context~\cite{vaghefi2023chatclimate} to provide climate-related information from a combination of Intergovernmental Panel on Climate Change (IPCC) reports and internal LLM knowledge.

% Although the use of machine learning is prevalent in the climate science domain~\cite{}, 
% - Personalizing Sustainable Agriculture with Causal Machine (?)

% the use of natural language processing is less explored. 
% - How convincing are AI-generated moral arguments for climate action
% - chatClimate - Grounding Conversational AI in Climate Science
% - Leveraging large language models to monitor climate technology innovation
% - CliMedBERT A Pre-trained Language Model for Climate and Health-related Text
% - ClimateBERT
% - ESG Data Collection with Adaptive AI
% - Climate Fever
% - Understanding Opinions Towards Climate Change on Social Media
% - https://aclanthology.org/P19-1355/

% https://paperswithcode.com/paper/climabench-a-benchmark-dataset-for-climate
% https://paperswithcode.com/dataset/climabench

% Related work impact statement
However, absent from the literature is NLP for climate change-related agriculture or climate adaptation management advice for agriculture. To the best of our knowledge, we present the first study that collates agriculture-related literature to answer climate change-related agricultural questions and provides climate risk management options to farmers and farm advisors.

\section{Methods}
My Climate Advisor is currently designed as a question-answering tool\footnote{The restrictions on the inputs and outputs for users will require a thorough investigation. See Appendix~\ref{appendix:user_restriction} for more details.} with several components and data sources. We detail our data collection method and corpora used for the Retrieval Augmented Generation (RAG) and the retrieval algorithm to search over the corpora. For generation, we detail the Large Language Model (LLM) used in the study and the decoding algorithms and hyperparameters used for answer generation. 

\subsection{Data Collection and Indexing}
% What is this data for? Why is it useful?
Climate adaptation information needs to be trustworthy and relevant. We therefore gather information from reputable sources such as peer-reviewed published agriculture literature, books, expert-curated documents and high-quality industry grey literature.

For peer-reviewed agriculture literature, we gather articles from the S2ORC corpus~\cite{lo-etal-2020-s2orc}, using the snapshot from 2023-11-03. The initial size of the corpus was 12.4 million articles. We filter the corpus using the `fields of study' facet provided by Semantic Scholar~\cite{semantic_scholar}. Documents matching the fields of study `Agricultural and Food Sciences' and `Environmental Science' are retained, resulting in 1.88 million documents. We then remove documents without body text or a Digital Object Identifier (DOI), leaving a final set of 1.36 million articles. We use this corpus for general-purpose agriculture-related questions in our first index.

From this corpus, we select the documents published in the top 100 agriculture journals ranked by impact score (13,400 documents). However, not all of these journals could be found within S2ORC. We supplement the remainder from the Elsevier\footnote{\url{https://www.elsevier.com/en-au/about}} snapshot of 2023-11-03, leading to a total of 126,000 articles. We use this corpus for more precise climate adaptation advice, forming our second index.

For our third index, we use an expert-curated document set containing information on agricultural commodities, climate adaptation and growing conditions for the ANONYMIZED LOCATION\footnote{Location anonymized for reviewing process.} climate. We augment it with information from books and industry reports containing information on climate risk and adaptation methods relevant to the ANONYMIZED LOCATION climate. This corpus is highly specialized; as such, it is the smallest of the three indexes, with 28 documents.

For indexing, we chunk all documents using a semantic chunking parser\footnote{\url{https://crates.io/crates/text-splitter}, (Accessed: 15/5/24)} to 400 tokens, roughly the size of a paragraph, and ensure we split at sensible sentence boundaries. For each chunk, we use a sentence encoder~\cite{reimers2019}, JinaBERT~\cite{jina_bert}, to produce contextual embeddings which are then normalized and byte quantized. Further details on the statistics of the datasets can be found in Table~\ref{tab-statistics}.

\begin{table}[tb]
\centering
\footnotesize
%\resizebox{\columnwidth}{!}{%
\tabcolsep 2pt
\begin{tabular}{@{}llll@{}}
\toprule
\bf Corpus      & \# Documents & \# Chunks (C=400) & Size (GB) \\ \midrule
S2ORC           & 1.36M        & 30.6M             & 124       \\
Top Journals    & 126K         & 221K              & 8.3       \\
Grey Literature & 28           & 1513 & 0.008 \\ 
\bottomrule
\end{tabular}
\caption{Corpus statistics.\label{tab-statistics}}
\end{table}

%- Perhaps can be generic here for these resources
%Should we mention indices (?) or save for a future citation
%Should we mention Stokes and Howden 

%- Australian grey literature: industry reports for tried and tested (proven) methods

\subsection{Generative Models}
Causal LLMs provide a conditional probability distribution over an output vocabulary, $V$, for the next word, $w_n$, given an input sequence or preceding context, $S = (w_1, \ldots, w_{n-1})$~\cite{jurafsky2009speech}:%
\begin{equation}
P(w_{n} \mid w_{1}, \ldots, w_{n-1}), \quad w_{n} \in V.
\end{equation}
To select the word to decode from the probability distribution at each autoregressive timestep, $t$, we use maximum likelihood (greedy decoding) to enable reproducibility and reduce hallucinations from pseudo-randomness~\cite{ippolito-etal-2019-lm-decodingcomparison,peng-etal-2023-towards-temperature-chatgpt}:
\begin{equation}
\hat{w}_{t} = \argmax_{w\in V}P(w|\boldsymbol{w}_{<t}).
\end{equation}
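As a small illustration, consider a hypothetical three-word vocabulary $V = \{\text{drought}, \text{frost}, \text{heat}\}$ with assumed probabilities chosen only for exposition: $P(\text{drought} \mid \boldsymbol{w}_{<t}) = 0.5$, $P(\text{frost} \mid \boldsymbol{w}_{<t}) = 0.2$ and $P(\text{heat} \mid \boldsymbol{w}_{<t}) = 0.3$. Greedy decoding then gives
\begin{equation*}
\hat{w}_{t} = \argmax_{w\in V}P(w \mid \boldsymbol{w}_{<t}) = \text{drought},
\end{equation*}
and it returns the same word every time the same context is presented, whereas sampling directly from this assumed distribution would return a different word about half of the time.
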
When LLMs are fine-tuned with instructions~\cite{t5-flan}, they can generate responses given a prompt $S_p$ as an assistant rather than behaving as a text completion language model~\cite{instructgpt}.

We use an open-source LLM, in this case Llama 3 8b~\cite{touvron2023llama}, which has been instruction fine-tuned. Using an open-source model allows control over the privacy of the user's data, compliance with API agreements, use of scientific literature and, most importantly, reliability, which cannot be achieved with proprietary mixture-of-experts models as they are non-deterministic~\cite{buffer_overflow_moe}. Open-source models also provide access to the weights, which can be beneficial for precise safeguarding with control vectors~\cite{control_vectors}. Furthermore, although they have more representational power, proprietary models tend to be more resource-heavy, contributing to climate change~\cite{rillig2023risks_llm_climate}.

\subsection{Retrieval Augmented Generation}
We use retrieval augmented generation (RAG) to generate answers using scientific document snippets as context. Using RAG emphasizes the provenance of scientific literature as the LLM can be instructed via system prompt to provide the DOI of any relevant document snippets used to generate the answer. We also provide these references in our user interface for further transparency. 
%It also uses an API from My Climate View~\cite{webb2023climate_csa} for location and commodity-specific information, such as noteworthy climate factors\footnote{API access was not for the evaluation experiments.}. 
We use Naive RAG~\cite{rag_survey} to synthesize information from an inverted index with a Hierarchical Navigable Small World (HNSW) vector store. For retrieval, we use hybrid scoring to capture orthogonal signals from keyword matching and semantic similarity~\cite{bert_bm25_orthogonal,search-like-an-expert-2022}. The hybrid score, $S$, combines an exact-matching (lexical overlap) component and a soft-matching (vector embedding) component.\footnote{We use the terminology from~\cite{gao-etal-2021-coil}.} The hybrid scorer ranks pairs of a query $q \in Q$ and a document $d \in D$ as follows:
\begin{equation}
    S(q,d) = \beta\left(\alpha\sum_{t\in q \cap d} f(t) + (1-\alpha) \frac{\vec{q} \cdot \vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert}\right),
\end{equation}
\noindent
where $f(t)$ is a term-scoring function that uses document-level or term-level statistics to produce a score for each exact match between query and document terms. The vector (embedding) representations, $\vec{x} = \mathrm{Enc}(x)$ for $x \in \{q, d\}$, are given by a universal embedding model, $\mathrm{Enc}$, and the soft match is computed as the cosine similarity between them. The hyperparameter $\alpha$ weights the linear combination of the exact-matching and soft-matching components. Finally, the entire score is multiplied by an index-specific weight, $\beta$, which denotes the importance of the index/corpus. We set $\beta = 1$ and $\alpha = 0.02$ in our experiments. The matching components can be interchanged with any model; currently, we use BM25~\cite{BM25} for our exact-matching component and JinaBERT~\cite{jina_bert} for soft-matching.
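
As a worked illustration of the hybrid score, consider hypothetical component values that are assumed for exposition and are not measurements from the system: an exact-matching sum of $\sum_{t\in q \cap d} f(t) = 12.5$ and a cosine similarity of $0.80$. With $\beta = 1$ and $\alpha = 0.02$, the score would be
\begin{equation*}
S(q,d) = 1 \times \left(0.02 \times 12.5 + 0.98 \times 0.80\right) = 0.25 + 0.784 = 1.034.
\end{equation*}
Under these assumed values, the small $\alpha$ scales the lexical score, which is not bounded by one, to a magnitude comparable to the cosine similarity, which lies in $[-1, 1]$.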

\section{Experiments}
To understand how our tool performs, we benchmark it against existing open-source and proprietary models. In consultation with climate risk and adaptation experts, we created 15 questions % about Australian climate adaptation (Appendix~\ref{appendix:climate-questions}), 
which we used to generate responses. These questions range from general climate change and adaptation questions to more difficult commodity and region-specific questions.

\subsection{Evaluation}
Evaluating the capabilities of abstractive QA systems using standardized benchmarks remains challenging due to problems such as data contamination~\cite{sainz-etal-2023-nlp-llm-data-contamination}, hallucination~\cite{li-etal-2023-halueval-llm-hallucination} and sycophancy~\cite{sycophancy_llms}. Automatic metrics for abstractive question answering, such as BERTScore, METEOR and ROUGE, suffer from lexical insensitivity and negation errors, which distort the semantics of text~\cite{saadany-orasan-2021-bleu-automatic-metric-bad}, and are biased towards machine-written text~\cite{caglayan-etal-2020-curious-automatic-metric-bad}, leading to a low alignment with human annotators~\cite{liu-etal-2023-g-eval}.

We therefore rely on two experts, a climate scientist and an agronomist, to evaluate the responses of our system (with and without RAG) alongside those of the proprietary models GPT-3.5, GPT-4, Gemini and Claude, and the open-source Mistral 7b and Llama 3 70b models, in a single-blind study. For all models, including ours, we use the default settings aside from temperature, which we manually set to 0. Specifically for the Llama models, we use the defaults from the llama.cpp library\footnote{\url{https://github.com/ggerganov/llama.cpp}, (Accessed: 15/5/24)}. The Llama 3 models used in the experiments are all instruct-tuned variants from Meta's official repository. However, for Mistral~\cite{mistral_7b}, we use a variant that is instruction fine-tuned with OpenHermes 2.5~\cite{OpenHermes2.5} and preference-aligned using direct preference optimization (DPO)~\cite{rafailov2024direct_dpo} with Argilla's DPO mix~\cite{argilla-dpo-mix-7k}.

Given that the Llama family models do not come with a default system prompt, we use a customized system prompt depending on whether or not RAG was used. Details of these prompts can be found in Appendix~\ref{appendix:prompt_details}. 

%For the Llama3 models, we used a custom prompt (Appendix Figure~\ref{fig:llama3rag}) for RAG and another prompt (Appendix Figure~\ref{fig:llama3}) otherwise. For the Mistral model, we used a similar prompt (Appendix Figure~\ref{fig:mistral_rag}) for RAG and a standard prompt (Appendix Figure~\ref{fig:mistral}) otherwise.

%(See Appendix~\ref{appendix:experimental} for details) %Preliminary testing with ChatClimate~\cite{vaghefi2023chatclimate} was done (Appendix~\ref{}). However, we found that the responses were unsuitable for climate adaptation-related questions. 

The expert annotators curated the following set of 15 questions for the ANONYMIZED LOCATION climate to which each system generated responses: 
\begin{enumerate}
\item What are the ideal pollination conditions for growing almonds?
\item What can I do to prevent sunburn risk in apples?
\item What varieties of apples are more tolerant to sunburn?
\item What regions will support growing cotton in 2070?
%\item How does the climate in South West Western Australia compare from 1970 to now?
\item How does the climate in South West ANONYMIZED LOCATION compare from 1970 to now? % Western 
\item What will be the greatest climate risk for growing wheat in the wheatbelt in 2050?
%\item Will my rainfall continue to increase in variability in Northern NSW?
\item Will my rainfall continue to increase in variability in Northern ANONYMIZED LOCATION?
\item In north-east ANONYMIZED LOCATION, how many days will I likely experience over 45 degrees? %SA
\item How accurate are climate projections?
\item What is the difference between a heatwave and a hot day?
\item Will we likely see less cold risk days over the lambing season in central ANONYMIZED LOCATION?
\item How will climate change impact cherry production in Young?
\item What is the production cycle of potatoes?
\item Are there regions in ANONYMIZED LOCATION where agriculture will not be viable in 2050?
\item Will commodity distribution in ANONYMIZED LOCATION change under a future climate?
\end{enumerate}

We used maximum likelihood decoding for each model by setting the temperature to zero. The annotators were given the generated responses without knowing which model had generated each response. They were asked to evaluate the 15 question-response pairs according to the following annotation criteria on a Likert scale~\cite{likert1932technique}:

\begin{enumerate}
    \item Context: Does the LLM provide enough background information to understand its response?
      \begin{enumerate}[label*=\arabic*.]
        \item Attempts to give some broader context to explain the issue.
        \item Provides an introductory paragraph to introduce the topic.
        \item Provides a summary paragraph at the end.
      \end{enumerate}
    \item Readability: Is the response of the LLM easy to read? 
      \begin{enumerate}[label*=\arabic*.]
        \item Overall, the response is well-structured and easy to read.
        \item Headings and subheadings are well structured and logical and with appropriate categories.
        \item Used dot points appropriately.
      \end{enumerate}
    \item Language: Does the LLM use fluent industry terminology?
      \begin{enumerate}[label*=\arabic*.]
        \item Phrasing is appropriate (easy to read, fluent) and not awkward or incorrect.
        \item Correct use of grammar.
        \item Consistent with the language used within the industry.
      \end{enumerate}
    \item Provenance: Does the LLM provide relevant citations to its answers?
      \begin{enumerate}[label*=\arabic*.]
        \item Citations are used appropriately with respect to the context.
        \item The number of citations used is appropriate (not too few, not too many, regarding what we might expect for the topic).
      \end{enumerate}
    \item Specificity: Is the information in the response relevant? For instance, to location, time and commodity in question?
      \begin{enumerate}[label*=\arabic*.]
        \item Gives information that is specific to a commodity.
        \item Gives information specific to the location/region in question, where applicable.
        \item Where there is no information specific to a location, the LLM admits this (and, preferably, gives information for the appropriate broader region).
      \end{enumerate}
    \item Comprehensiveness: Does the LLM respond with a complete answer?
      \begin{enumerate}[label*=\arabic*.]
      \item The LLM's response is comprehensive and does not just give a partial, incomplete answer.
      \end{enumerate}
    \item Scientific accuracy: Is the information correct, given the source material?
      \begin{enumerate}[label*=\arabic*.]
        \item The citations used accurately cite their source material.
        \item The cited source material provides high-quality, reliable scientific information.
        \item No obvious hallucinations.
      \end{enumerate}
\end{enumerate}

%use of background information (context), readability, use of industry language (language), provenance of answers (explainability), specificity to location and commodity, comprehensiveness and scientific accuracy. 
We then normalize each annotator's scores before combining them. This allows us to capture the overall ranking preference of the systems rather than an absolute score. The raw unnormalized scores can be found in Appendix Tables~\ref{annotator-ash-tab} and~\ref{annotator-willow-tab}.
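
The normalization can be sketched as follows, under the assumption (not stated above) of a per-annotator min-max rescaling, which would be consistent with the 0 to 2 range of the combined values in Table~\ref{tbl:combined-annotator-normalized-results}. Writing $s_{a}(m)$ for the raw score that annotator $a$ gives model $m$ on a criterion, an illustrative form is
\begin{equation*}
\tilde{s}_{a}(m) = \frac{s_{a}(m) - \min_{m'} s_{a}(m')}{\max_{m'} s_{a}(m') - \min_{m'} s_{a}(m')}, \qquad \tilde{s}(m) = \tilde{s}_{1}(m) + \tilde{s}_{2}(m),
\end{equation*}
so that each $\tilde{s}_{a}(m) \in [0, 1]$ and the combined value $\tilde{s}(m) \in [0, 2]$. The symbols $s_{a}$, $\tilde{s}_{a}$ and $\tilde{s}$ are notation introduced here for illustration only.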

\begin{table*}[t]
\centering
\footnotesize
\tabcolsep 2pt
\begin{tabular}{lrrrrrrrr}
\toprule
&\multicolumn{7}{c}{\bf Evaluation Criteria}\\
\cmidrule{2-8}
\bf Model & Context & Structure & Language & Specificity & Comprehensiveness & Accuracy & Citation & Avg. Score \\
\midrule
GPT 4-Turbo & \textbf{2.00} & \textbf{2.00} & \textbf{2.00} & \textbf{2.00} & \textbf{2.00} & 1.05 & 0.00 & \textbf{2.00} \\
Llama 3 70b & 1.83 & 1.83 & 1.68 & 1.96 & 1.61 & 1.05 & 0.16 & 1.85 \\
Claude 3 Opus & 1.52 & 1.56 & 1.57 & 0.83 & 1.52 & \textbf{1.69} & 0.00 & 1.69 \\
Llama 3 8b + RAG (Ours) & 1.15 & 0.94 & 1.29 & 0.84 & 1.11 & 1.04 & \textbf{2.00} & 1.54 \\
Gemini 1.5 Pro & 1.40 & 1.50 & 1.57 & 1.44 & 1.65 & 0.92 & 0.00 & 1.54 \\
Llama 3 8b & 1.59 & 1.44 & 1.51 & 1.60 & 1.29 & 0.64 & 0.04 & 1.46 \\
Mistral 7b + RAG & 1.39 & 0.89 & 1.20 & 0.73 & 0.93 & 0.90 & 1.65 & 1.39 \\
Claude 3 Haiku & 1.20 & 1.44 & 1.30 & 1.01 & 1.30 & 0.82 & 0.00 & 1.23 \\
Mistral 7b & 1.34 & 1.11 & 1.34 & 1.06 & 0.94 & 0.61 & 0.48 & 1.15 \\
Llama 3 70b + RAG & 0.94 & 0.72 & 0.94 & 0.64 & 0.70 & 0.80 & 1.94 & 1.08 \\
Gemini 1.0 Pro & 0.00 & 0.39 & 0.23 & 1.17 & 1.02 & 0.31 & 0.00 & 0.54 \\
GPT 3.5-Turbo & 0.20 & 0.00 & 0.36 & 0.00 & 0.00 & 0.00 & 0.08 & 0.00 \\
\bottomrule
\end{tabular} %
\caption{Responses generated by 12 models to climate adaptation-related questions were annotated on seven criteria (raw scores of 0 to 4). Structure corresponds to the Readability criterion and Citation to the Provenance criterion. The values in the table are the normalized sums of the two annotators' scores. The models are ranked by average score.}\label{tbl:combined-annotator-normalized-results}
\end{table*}

%Similar to reinforcement learning from human feedback, the annotators were also asked to provide a preference for the responses irrespective of the annotation criteria. \todo[inline]{Rephrase XXX This is potentially a less noisy way to evaluate the systems.}

\section{Results and Analysis}
In the literature, proprietary generalist models are often reported to perform better than open-source models~\cite{llm_survey_paper,lmsys}. However, we found no clear distinction between proprietary and open-source models (Table~\ref{tbl:combined-annotator-normalized-results}). The GPT-4 responses were preferred most across all metrics except accuracy and citation. When inspecting the raw scores, however, the open-source models, Llama and Mistral, were either tied with or marginally worse than GPT-4. This is encouraging because, given the privacy of our data, we cannot use proprietary models in our application.\footnote{Raw scores are in Appendix Tables~\ref{annotator-ash-tab} and~\ref{annotator-willow-tab}.}

In line with prior work, we found that model scale was generally indicative of model performance~\cite{c1e2faff_chincilla_scaling,caballeroGRK23_broken_scaling_law}: without RAG, Llama 3 70b outperformed Llama 3 8b and Mistral 7b; in the Claude family, Opus outperformed Haiku; Gemini 1.5 outperformed Gemini 1.0; and GPT-4 outperformed GPT-3.5.

% Evidence of this fact is that the top-scoring individual question-response pairs were from open-source + RAG combos
% Access to weights allows for greater contorl over outputs e.g. control vectors and safeguarding

% Access to the harder helps alleviate concerns with LLM contribution to climate change

\paragraph{Agreement} Inter-annotator agreement measured with Kendall's Tau~\cite{kendall1938tau} was 0.319 (moderate), with an overlap of 41.5\%. Although the annotators jointly drafted the evaluation criteria, {\em scientific accuracy} was a source of significant disagreement (Figure~\ref{fig:disagreements}). One annotator penalized responses that were not self-contained; that is, the response must contain scientifically robust sources to back up any claims. The other annotator used their own knowledge to determine the scientific validity of the claims. We note that verification of climate-related claims has been established as a low-agreement task~\cite{leippold2020climatefever}.

Another source of disagreement was specificity; however, upon inspection, many of these disagreements were within one point and can be attributed to human error or bias. We can further support this claim by looking at the sentiment of the scores. When the labels are binarized, scores higher than 2 become positive, and scores of 2 or less become negative. In this binary setting, Kendall's Tau agreement is 0.488 (moderate), with an overlap of 76.6\%, which can be interpreted as the annotators' overall sentiments towards the responses being closely aligned. When accuracy annotations are removed from this calculation, strong agreement is reached at 0.635 with an overlap of 85.4\%, indicating that much of the remaining disagreement stems from the accuracy criterion.
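
The binarization can be sketched as follows, where $s$ denotes a raw score and the symbol $b$ is introduced here only for illustration:
\begin{equation*}
b(s) =
\begin{cases}
1 \ (\text{positive}) & \text{if } s > 2,\\
0 \ (\text{negative}) & \text{if } s \le 2.
\end{cases}
\end{equation*}
Agreement and overlap are then recomputed on the pairs $(b(x_k), b(y_k))$ rather than on the raw pairs $(x_k, y_k)$, where $x_k$ and $y_k$ are the two annotators' scores for the same response and criterion. For example, scores of 3 and 4 count as agreeing, while scores of 2 and 3 do not.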

\begin{figure}[tb]
\centering
\includegraphics[width=\columnwidth]{figures/disagreements.pdf}
\caption{The number of disagreements between annotators for each criterion in the annotation task. A disagreement is counted whenever the two annotators assign different scores to the same response.}\label{fig:disagreements}
\end{figure}

\begin{figure*}[t]
\centering
\includegraphics[width=\textwidth]{figures/individual_model_scores.pdf}
\caption{The normalized sum of the two annotators' scores for each response generated by the 12 models for each of the 15 questions. Each sub-graph shows the normalized score sum of a particular model plotted against the question number.}\label{fig:individual-model-scores}
\end{figure*}

\paragraph{System Preference} Both annotators preferred GPT-4, with Llama-3 70B also faring well. For one annotator, the most scientifically accurate model was Claude Opus. Both annotators agreed that ChatGPT (GPT-3.5 turbo) was the worst model, which is noteworthy given that it is currently the most popular public-facing chat model. When analyzing the combined raw distribution of scores (Figure~\ref{fig:combined-boxplot}), we note that the highest-performing question-response pairs came from the open-source RAG models, Llama 3 8b + RAG and Mistral 7b + RAG, for questions 6 and 15 respectively, one for each annotator (see Appendix~\ref{appendix:additional-results}). These responses were not only scientifically accurate but also stylistically similar to the responses from GPT-4, in which a list of dot points is given, with a summary and references at the end. We therefore see potential for our tool to outperform GPT-4 once it is aligned with this style of response. Both annotators agreed on the worst-performing question-response pair, in which Gemini 1.0-pro responded to question 3 with a hallucinated \textit{Apples do not get sunburned} response. An initial hypothesis could be that the model was trained on incorrect data. However, this did not occur with Gemini 1.5-pro, which we assume was trained on similar data and which responded with the correct strategies to reduce sunburn risk.

Regarding individual scores, the first annotator (Table~\ref{annotator-ash-tab}) generally preferred the non-RAG models due to the stylistic issues mentioned earlier. In contrast, the second annotator (Table~\ref{annotator-willow-tab}) preferred the RAG models due to their scientific accuracy and provenance. 

\paragraph{Question difficulty} 
A reasonable hypothesis is that LLMs should struggle with questions that are more specific to locations, commodities and time periods. However, we did not see this trend in our annotations. Instead, Figure~\ref{fig:individual-model-scores} shows that questions requiring more reasoning (questions 3, 8, 11) tended to be more difficult for the LLMs than knowledge-recall-oriented questions (questions 5, 9, 15). In particular, question 8 was difficult because many models responded by telling the user to check the weather forecasts rather than giving a concrete answer. Of all models, GPT-4 fared the worst on question 13; although its response was stylistically well received, it used generic terminology that is not in line with the industry standard, opting for the term \textit{growth} over the more accurate \textit{vegetative growth} or \textit{tuber bulking}. GPT-4 also had a problem with question 8, where it explained what climate projections were but did not elaborate on their accuracy.

Some questions, such as question 12, were underspecified to test the applicability to the ANONYMIZED LOCATION climate. Surprisingly, only four models failed to recognize that Young was a town in ANONYMIZED LOCATION. Claude Opus performed the worst on this question, giving a generic response about its inability to access climate projection data and therefore declining to answer. Claude Haiku gave a similar generic response but still went on to provide an answer. Mistral 7b and Claude Haiku had a similar issue with questions 7 and 11, respectively, where they gave a generic response about being unable to predict weather patterns. The RAG models underperformed on some questions on which their non-RAG counterparts did not. Detailed results for each question and model pair can be found in Appendix Table~\ref{tab:question_model_score_table}.

\paragraph{Ablation on RAG} Our ablation analysis reveals that our in-house RAG models were more scientifically accurate than their non-RAG counterparts. However, this came at the expense of other metrics, such as readability and background context. We suspect that the models adopt terminology from the academic sources and omit context on the assumption that the user has read the retrieved literature. Furthermore, annotators mentioned that the models included references within their responses, making them longer and more challenging to read. On the other hand, including references allows users to read further and verify information. Although our method is scientifically robust, it may not align with users who prefer responses to be structured in a particular way. Fine-tuning the model to place its references at the end of the answer is left as future work.

The most surprising observation was that the Llama3 70b RAG variant under-performed. Questions for which the retriever failed to find relevant documents affected the models the most. In particular, as Llama3 70b is more aligned with instruction-following, it suffered the largest performance drop because it refused to answer questions whose answers could not be found in the documents. This was seen in question 3, where the documents referred to sunburn as \textit{sunscald} and did not contain relevant information related to sunburn risk. A similar situation occurred with question 8, where the retriever found information about the number of days over 40 degrees in ANONYMIZED LOCATION, but the models were either too aligned with instruction-following (Llama3 70b) or misinterpreted the locations (Mistral 7b + RAG). % SA as South Asia
Overall, we observe that the relevance of the retrieved documents affected the RAG models. However, smaller models were less inclined to follow instructions, answered using their internal knowledge rather than our documents, and scored higher.

%we will integrate it with my climate view for more accurate climate projections.

%We can reduce the Likert scale to a preference scale (1-3), in order to delineate between positive or negative leaning responses. We find that XXX agreement and XXX

%One annotator 
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/combined_raw_scores_boxplot.pdf}
\caption{The raw sum of the two annotators' scores for the 12 models. Model families are grouped by color.}\label{fig:combined-boxplot}
\end{figure}

\section{Conclusions}
{\em My Climate Advisor} is a question-answering tool designed to provide trustworthy climate change risk and adaptation information for farmers and their advisors. Our tool is built on an in-house Llama 3 model with RAG, which synthesizes information from peer-reviewed scientific literature and trustworthy grey literature. With the assistance of domain experts, we designed an evaluation framework whose criteria differentiate LLM-generated answers to a set of questions.
While our initial evaluations show a gap between our tool and the leading proprietary systems, the outcome is still encouraging. Our analysis shows that our tool is on par for scientific accuracy while providing provenance for explainability. 

Our system can be fine-tuned for further improvements in the near future. Note that due to privacy concerns and the financial and environmental costs of proprietary LLMs, we are limited to open-source models. We will refine the prompting strategy to synthesize climate adaptation information better without sacrificing readability. Finally, we plan to expand the input to multimodal data, including numerical data and graphs, for more accurate representations of climate data including climate projections.

%Our initial evaluations show a gap between our tool and the leading proprietary systems. However, our analysis shows that our tool is on par for scientific accuracy while providing provenance for explainability. We are continually updating this tool, and we aim to integrate it with the My Climate View tool in the future to help Australian farmers and their advisors adapt to climate change while providing the LLM with accurate climate and weather projections.

\todo[inline]{Include information about the usability of open-source LLMs given the promising results}

%There remains a gap between our system, My Climate Advisor, which relies on open-source models and the leading proprietary systems. In the future, we aim to fine-tune our model using internal data as domain-specific models tend to perform better than generalist ones~\cite{shi2023generalist}. We will refine the prompting strategy to synthesize climate adaptation information better without sacrificing readability. Finally, we plan to expand the input to multimodal data, including numerical data and graphs, for more accurate climate projections.

\section{Limitations}
One limitation is the lack of prompt engineering for each model: we used the default settings, aside from the temperature setting. We nevertheless believe that using the default settings gives a fair comparison. Our tool is also limited in comparison to proprietary offerings, but given that it will be continually updated and supported, we believe that it will eventually surpass them while reaping the benefits of open-source models, such as mitigating privacy concerns, protecting intellectual property, integrating with control vectors and reducing carbon emissions.
Finally, although the annotation guidelines were created jointly by the experts, the annotators interpreted some of the criteria differently during annotation. We tried to overcome this limitation by normalizing the scores and considering the ranks of the models rather than the raw scores.
Despite these limitations, the findings of this study should inform similar studies on the capabilities of proprietary models and open-source LLMs for answering questions in the climate change adaptation domain.

\todo[inline]{Did not use the API of MCV}

\section{Ethical Concerns}
We use open-source LLMs to ensure user data privacy and intellectual property protection. We do not use cookies or any tracking mechanism for the users interacting with the My Climate Advisor tool. Given the climate impact of LLMs, it is critical to use power-efficient hardware alongside local LLMs where environmental impacts can be minimized.

\todo[inline]{climate change impact of LLMS}

\bibliography{custom,main}

\appendix

\begin{table*}[t]
\resizebox{\textwidth}{!}{%
\begin{tabular}{lrrrrrrrr}
\toprule
&\multicolumn{7}{c}{\bf Evaluation Criteria}\\
\cmidrule{2-8}
\bf Model & Context & Structure & Language & Specificity & Comprehensiveness & Accuracy & Citation & Avg. Score \\
\midrule
GPT 4-Turbo & \textbf{3.90} & \textbf{3.70} & \textbf{3.70} & \textbf{3.70} & \textbf{3.70} & \textbf{3.80} & 0.00 & \textbf{3.20} \\
Llama 3 70b & 3.70 & 3.60 & 3.40 & \textbf{3.70} & 3.40 & \textbf{3.80} & 0.00 & 3.10 \\
Gemini 1.5 Pro & 3.40 & 3.30 & 3.50 & 3.40 & 3.40 & 3.70 & 0.00 & 3.00 \\
Claude 3 Opus & 3.50 & 3.50 & 3.50 & 3.10 & 3.30 & 3.40 & 0.00 & 2.90 \\
Claude 3 Haiku & 3.60 & 3.60 & 3.50 & 3.30 & 3.10 & 3.30 & 0.00 & 2.90 \\
Llama 3 8b & 3.40 & 3.10 & 3.30 & 3.50 & 2.90 & 3.30 & 0.00 & 2.80 \\
Mistral 7b & 3.10 & 2.90 & 3.20 & 3.30 & 2.70 & 3.30 & 0.27 & 2.70 \\
Mistral 7b + RAG & 3.20 & 2.70 & 2.90 & 2.80 & 2.50 & 3.20 & \textbf{1.10} & 2.60 \\
Llama 3 8b + RAG & 2.60 & 2.50 & 2.90 & 2.90 & 2.70 & 3.30 & \textbf{1.10} & 2.60 \\
Llama 3 70b + RAG & 2.70 & 2.40 & 2.80 & 2.80 & 2.20 & 3.10 & \textbf{1.10} & 2.40 \\
Gemini 1.0 Pro & 1.70 & 2.50 & 2.70 & 3.20 & 2.70 & 2.90 & 0.00 & 2.30 \\
GPT 3.5-Turbo & 1.90 & 1.90 & 2.40 & 2.80 & 1.60 & 2.50 & 0.00 & 1.90 \\
\bottomrule
\end{tabular} % 
}
\caption{First annotator's average scores. In the first column, the models are sorted based on average scores. Bold numbers indicate the highest in the column.\label{annotator-ash-tab}}
\end{table*}

\begin{table*}[t]
\resizebox{\textwidth}{!}{%
\begin{tabular}{lrrrrrrrr}
\toprule
&\multicolumn{7}{c}{\bf Evaluation Criteria}\\
\cmidrule{2-8}
\bf Model & Context & Structure & Language & Specificity & Comprehensiveness & Accuracy & Citation & Avg. Score \\
\midrule
GPT 4-Turbo & \textbf{3.70} & \textbf{3.80} & \textbf{4.00} & \textbf{3.30} & \textbf{3.50} & 0.13 & 0.00 & \textbf{2.60} \\
Llama 3 8b + RAG & 3.00 & 3.10 & 3.90 & 2.70 & 2.50 & 1.10 & \textbf{1.70} & \textbf{2.60} \\
Claude 3 Opus & 2.90 & 3.20 & 3.70 & 2.20 & 2.80 & \textbf{2.60} & 0.00 & 2.50 \\
Llama 3 70b & 3.50 & 3.60 & 3.90 & 3.20 & 2.90 & 0.13 & 0.27 & 2.50 \\
Mistral 7b + RAG & 2.90 & 2.80 & 3.80 & 2.70 & 2.30 & 0.93 & 1.10 & 2.40 \\
Llama 3 8b & 3.20 & 3.40 & 3.80 & 2.90 & 2.70 & 0.07 & 0.07 & 2.30 \\
Gemini 1.5 Pro & 2.70 & 3.30 & 3.70 & 2.80 & 3.00 & 0.00 & 0.00 & 2.20 \\
Llama 3 70b + RAG & 2.30 & 2.80 & 3.60 & 2.50 & 2.10 & 0.87 & 1.60 & 2.20 \\
Mistral 7b & 2.90 & 3.00 & 3.70 & 2.20 & 2.10 & 0.00 & 0.40 & 2.00 \\
Claude 3 Haiku & 1.90 & 2.90 & 3.40 & 2.10 & 2.50 & 0.53 & 0.00 & 1.90 \\
Gemini 1.0 Pro & 1.00 & 2.10 & 2.90 & 2.70 & 2.30 & 0.00 & 0.00 & 1.60 \\
GPT 3.5-Turbo & 1.30 & 2.00 & 3.30 & 1.10 & 1.10 & 0.00 & 0.13 & 1.30 \\
\bottomrule
\end{tabular} % 
}
\caption{Second annotator's average scores. In the first column, the models are sorted based on average scores. Bold numbers indicate the highest in the column.\label{annotator-willow-tab}}
\end{table*}

\section{Interfaces}
\subsection{My Climate Advisor interface}\label{appendix:mcva-interface}
We present the user interface of our tool, My Climate Advisor, in Figure~\ref{fig:mcva_interface}. The tool is currently in the early stages of development. The interface's main use is to collect feedback from users to improve the retrieval and generation capabilities of the system.

\subsection{Annotation interface}\label{appendix:annotation-interface}
Each annotator was tasked with annotating 180 samples (12 models $\times$ 15 questions) in a single-blind study. We used the Label Studio library and interface~\cite{label_studio}, hosted locally (Figure~\ref{fig-annotation-interface}). Each annotator was free to choose when to do their annotations and which samples to start from.

\section{Additional experimental results}\label{appendix:additional-results}
The individual scores from the annotators are also included for completeness. Tables~\ref{annotator-ash-tab} and~\ref{annotator-willow-tab} show the individual raw scores of each annotator, which were combined and normalized to produce Table~\ref{tbl:combined-annotator-normalized-results}.
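
As a worked check, the Avg.\ Score column in these two tables is consistent with an unweighted mean of the seven criteria, reported to one decimal place (this is a reading of the reported values, not a separately documented formula). For GPT 4-Turbo, writing $\bar{s}_1$ and $\bar{s}_2$ (notation introduced here for illustration) for the first and second annotator's average:
\begin{align*}
\bar{s}_1 &= \tfrac{1}{7}\,(3.90 + 4 \times 3.70 + 3.80 + 0.00)\\
          &= 22.50 / 7 \approx 3.2,\\
\bar{s}_2 &= \tfrac{1}{7}\,(3.70 + 3.80 + 4.00 + 3.30\\
          &\qquad {} + 3.50 + 0.13 + 0.00)\\
          &= 18.43 / 7 \approx 2.6.
\end{align*}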

We also include boxplots to show the variance of each method across the questions in Figures~\ref{fig:boxplot-ash} \&~\ref{fig:boxplot-willow}, which were combined to produce Figure~\ref{fig:combined-boxplot}.

The average scores of individual questions and corresponding models are given in Table~\ref{tab:question_model_score_table}, which provides additional information on Figure~\ref{fig:individual-model-scores}.

\begin{figure}[t]
\centering
\includegraphics[width=.98\columnwidth]{figures/ash_raw_scores_boxplot.pdf}
\caption{First annotator’s average scores. Model families are grouped together by color.}\label{fig:boxplot-ash}
\end{figure}

\begin{figure}[t]
\centering
\includegraphics[width=.98\columnwidth]{figures/willow_raw_scores_boxplot.pdf}
\caption{Second annotator’s average scores. Model families are grouped together by color.}\label{fig:boxplot-willow}
\end{figure}

\section{Additional experimental details: Prompts}\label{appendix:prompt_details}
We provide additional details on the prompts used in our study for the open-source variants. As these models do not have a default system prompt, we included two styles of system prompts: one that used RAG and one that did not. For the Llama3 models, we used a custom prompt (Appendix Figure~\ref{fig:llama3rag}) for RAG and another prompt (Appendix Figure~\ref{fig:llama3}) otherwise. For the Mistral model, we used a similar prompt (Appendix Figure~\ref{fig:mistral_rag}) for RAG and a standard prompt (Appendix Figure~\ref{fig:mistral}) otherwise.

\section{Restrictions on User Inputs or Outputs}\label{appendix:user_restriction}
Given the problems of LLMs with regard to reward hacking and teacher forcing~\cite{llm_survey_paper}, which can lead to hallucination or misinformation, it is prudent to consider the ways in which farmers or their advisors will interact with our tool. We identify three possible usage variants, which differ in how open the inputs (questions) and the outputs (LLM responses) are to the user:

\begin{enumerate}
    \item Input Open, Output Open: Chat-style interface. Users can freely input questions to produce outputs. This requires the most safeguarding and may be difficult to control reliably in practice.
    \item Input Open, Output Closed: Users may submit questions; however, they are given responses drawn from a pre-filled list of frequently asked questions (FAQ). This FAQ is continually updated with LLM responses, which can be checked beforehand.
    \item Input Closed, Output Closed: The user cannot control the inputs and is instead given a response by the LLM based on location and commodity information that has been pre-filled for a related service.
\end{enumerate}

% \section{Climate Adaptation Probing Questions}\label{appendix:climate-questions}
% The following climate adaptation questions were used to probe the LLM systems and were then annotated using the criteria in Appendix~\ref{appendix:annotation-guidelines}.

% \section{Annotation Guidelines}\label{appendix:annotation-guidelines}
% The following criteria were used by the experts to annotate the climate adaptation responses to the questions in Appendix~\ref{appendix:climate-questions}. 
% An example of the annotation interface is shown in Figure~\ref{fig-annotation-interface}

% \section{Preliminary Comparisons}
% A problem with using LLMs is the difficulty of using automatic evaluation measures that are not aligned with human preference~\cite{}, ill-defined ranges~\cite{} or are not interpretable~\cite{}. 

% Therefore, we rely on experts to compare our tools against GPT-4, a proprietary state-of-the-art chat model that can offer climate adaptation advice and ChatClimate~\cite{}. We use the following criteria to analyze the outputs of MCV advisor against the GPT models: 

% \todo[inline]{
% XXX Rephrase for later, when the proprietary models beat us XXX
% Given that this is an ongoing work, we do not expect to do better than the proprietary LLMs just yet but allows us to contextualise our model's capabilities and how much improvement is left.

% XXX insert Results here XXX

% XXX What are the types of mistakes our model makes?, What does it improve upon?
% }

\begin{figure*}
    \centering
\begin{tcolorbox}[title= Llama3 RAG prompt, label=llama3rag]
\small
<|begin\_of\_text|><|start\_header\_id|>system<|end\_header\_id|> \\
 
You are a helpful AI assistant designed to help answer a farmer's agriculture-related questions. Use the following documents to help answer the user's questions.
 
If you are unsure of your answer, inform the user to check the information with their farm advisor. <|eot\_id|><|start\_header\_id|>user<|end\_header\_id|> \\

What are the ideal pollination conditions for growing almonds? <|eot\_id|><|start\_header\_id|>assistant<|end\_header\_id|>
\end{tcolorbox}
    \caption{Prompt used for Llama3 + RAG.}
    \label{fig:llama3rag}
\end{figure*}

\begin{figure*}
    \centering
\begin{tcolorbox}[title= Llama3 prompt, label=llama3]
\small
<|begin\_of\_text|><|start\_header\_id|>system<|end\_header\_id|> \\

You are a helpful AI assistant designed to help answer a farmer's agriculture-related questions. 

If you are unsure of your answer, inform the user to check the information with their farm advisor. <|eot\_id|><|start\_header\_id|>user<|end\_header\_id|> \\
    
What are the ideal pollination conditions for growing almonds? <|eot\_id|><|start\_header\_id|>assistant<|end\_header\_id|>
\end{tcolorbox}
    \caption{Prompt used for Llama3.}
    \label{fig:llama3}
\end{figure*}

\begin{figure*}
\small
    \centering
\begin{tcolorbox}[title= Mistral RAG prompt, label=mistral_rag]
<s><|im\_start|>system
    You are a helpful AI assistant designed to help answer a farmer's agriculture-related questions. Use the following documents to help answer the user's questions.
    
    If you are unsure of your answer, inform the user to check the information with their farm advisor.<|im\_end|>
    <|im\_start|>user
    What are the ideal pollination conditions for growing almonds?<|im\_end|>
    <|im\_start|>assistant
\end{tcolorbox}
    \caption{Prompt used for Mistral 7b + RAG.}
    \label{fig:mistral_rag}
\end{figure*}

\begin{figure*}
\small
    \centering
\begin{tcolorbox}[title= Mistral prompt, label=mistral]
<s><|im\_start|>system
    You are a helpful AI assistant designed to help answer a farmer's agriculture-related questions.
    
    If you are unsure of your answer, inform the user to check the information with their farm advisor.<|im\_end|>
    <|im\_start|>user
    What are the ideal pollination conditions for growing almonds?<|im\_end|>
    <|im\_start|>assistant
\end{tcolorbox}
    \caption{Prompt used for Mistral 7b.}
    \label{fig:mistral}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[angle=90,origin=c,scale=0.38]{pictures/chat_interface1.png}
\caption{User interface of the prototype My Climate Advisor. The user inputs their question to the LLM, and the response and the references used to generate that response are provided.} \label{fig:mcva_interface}
\end{figure*}

\begin{figure*}
\centering
\includegraphics[angle=90,origin=c,scale=0.30]{pictures/annotation.png}
\caption{Annotation interface used to grade LLM responses to agriculture questions.}\label{fig-annotation-interface}
\end{figure*}

\clearpage

\begin{sidewaystable*}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrr}
\hline
    & Claude 3 Opus & Claude 3 Haiku & Gemini 1.0 Pro & GPT 4-Turbo & Mistral 7b + RAG & Llama 3 70b & Gemini 1.5 Pro & Mistral 7b & Llama 3 8b & Llama 3 8b + RAG & Llama 3 70b + RAG & GPT 3.5-Turbo \\ \hline
Q1  & 1.79          & 1.00           & 1.61           & 1.18        & 0.79             & 0.83        & 1.42           & 1.02       & 1.08       & 0.41             & 0.52              & 1.18          \\
Q2  & 1.05          & 1.73           & 0.00           & 1.86        & 1.22             & 0.81        & 1.77           & 1.55       & 1.00       & 1.43             & 1.10              & 0.87          \\
Q3  & 1.71          & 1.82           & 0.64           & 1.29        & 0.14             & 1.00        & 0.74           & 1.22       & 1.04       & 0.50             & 0.07              & 1.26          \\
Q4  & 1.59          & 1.49           & 1.00           & 2.00        & 1.50             & 0.36        & 1.06           & 1.19       & 0.96       & 1.28             & 0.99              & 1.03          \\
Q5  & 1.93          & 1.73           & 1.31           & 1.86        & 1.25             & 1.73        & 1.06           & 1.01       & 1.04       & 1.36             & 1.28              & 1.55          \\
Q6  & 1.53          & 1.86           & 0.88           & 1.18        & 1.57             & 1.27        & 1.62           & 1.01       & 0.83       & 1.03             & 0.99              & 0.91          \\
Q7  & 1.26          & 1.15           & 1.80           & 1.04        & 0.94             & 1.89        & 1.69           & 0.00       & 0.88       & 1.59             & 0.62              & 0.31          \\
Q8  & 1.88          & 1.31           & 0.45           & 0.38        & 0.00             & 0.92        & 0.46           & 0.81       & 0.62       & 0.68             & 0.05              & 0.19          \\
Q9  & 1.79          & 1.17           & 2.00           & 1.86        & 1.21             & 1.09        & 1.29           & 1.58       & 1.88       & 1.49             & 1.87              & 1.13          \\
Q10 & 1.55          & 1.82           & 1.25           & 1.86        & 0.94             & 1.27        & 0.30           & 1.16       & 1.37       & 1.21             & 0.82              & 1.29          \\
Q11 & 1.69          & 0.00           & 1.68           & 1.73        & 1.07             & 1.10        & 1.29           & 0.83       & 0.71       & 1.66             & 1.00              & 0.00          \\
Q12 & 0.00          & 0.92           & 1.35           & 1.73        & 1.35             & 1.27        & 1.62           & 1.94       & 2.00       & 1.03             & 1.13              & 1.17          \\
Q13 & 2.00          & 2.00           & 1.13           & 0.43        & 1.27             & 0.77        & 1.82           & 1.55       & 1.67       & 1.50             & 1.26              & 1.44          \\
Q14 & 1.92          & 1.33           & 1.42           & 1.32        & 0.98             & 1.43        & 1.90           & 0.93       & 0.83       & 0.86             & 0.99              & 2.00          \\
Q15 & 1.13          & 1.67           & 1.25           & 1.71        & 1.77             & 1.45        & 1.80           & 1.76       & 1.08       & 2.00             & 1.42              & 1.55          \\ \hline
\end{tabular} %
}
\caption{Normalized sum of average scores from both annotators for each question and model.}
\label{tab:question_model_score_table}
\end{sidewaystable*}
\end{document}
