





\section{Zero-Shot Learning}
\label{sec:ZSL}




In this section, we introduce ZSL in the context of transfer learning and language models (LMs). 
Furthermore, we show that ZSL is in fact an extreme version of transfer learning that takes advantage of LMs.

\subsection{Transfer Learning}
\label{sec:TL}

The traditional supervised ML paradigm is based on learning in isolation, where a single predictive model is trained for a task using a single dataset \cite{ruder2019transfer}. This paradigm requires a large amount of labelled training data and performs best for well-defined and narrow tasks \cite{ruder2019transfer}. However, it breaks down when we do not have sufficient labelled data for the desired task to train a reliable model. Transfer learning allows us to deal with this problem by leveraging the knowledge gained in solving the source task in the source domain and transferring it to the target task and target domain \cite{ruder2019neural,ruder2019transfer,pan2009survey,weiss2016survey}. 

Transfer learning can be broadly classified into two types, based on the relationship between the \textit{source} and \textit{target} tasks or domains: \textit{inductive transfer learning} refers to the setting where the target task is different from the source task while the source and target domains can be the same or different; \textit{transductive transfer learning} refers to the setting where the source and target tasks are the same, but the source and target domains are different \cite{pan2009survey}. 
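
The two settings can be stated compactly in the notation of Pan and Yang \cite{pan2009survey}, which we recall here only as a reading aid (these symbols are not used elsewhere in this paper). A domain is a pair $\mathcal{D} = \{\mathcal{X}, P(X)\}$ of a feature space and a marginal distribution, and a task is a pair $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$ of a label space and a predictive function. With subscripts $S$ and $T$ for source and target:
\[
\mbox{inductive: } \mathcal{T}_S \neq \mathcal{T}_T, \qquad
\mbox{transductive: } \mathcal{T}_S = \mathcal{T}_T \mbox{ and } \mathcal{D}_S \neq \mathcal{D}_T .
\]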

Currently, the most promising transfer learning approach is \textit{sequential inductive transfer learning} \cite{ruder2019transfer}, where a task is learned in sequence, consisting of a \textit{pretraining} and an \textit{adaptation} phase. In the pretraining phase, a source model is normally trained on large quantities of data to learn a general representation that can be adapted to a wide range of target tasks in the adaptation phase \cite{ruder2019neural}. Normally, model pretraining uses \textit{unsupervised learning} on unlabeled data whereas model adaptation uses \textit{supervised learning} on task-specific labelled data \cite{ruder2019neural,ruder2019transfer,peters2019tune}. 

Ideally, pretraining should aim to train a source model with general-purpose abilities and knowledge that can then be transferred to a wide range of target tasks \cite{raffel2020exploring}. The pretraining phase is therefore generally (much) more expensive than the adaptation phase \cite{ruder2019neural,peters2019tune}.

\subsection{Language Models}
\label{sec:LM}

Transfer learning is a central concept in \textit{language models} (LMs) \cite{chronopoulou2019embarrassingly,raffel2020exploring}. An LM is a language representation developed for deep learning based NLP tasks. LMs are typically pretrained \textit{unsupervised} on unlabeled text data available \textit{en masse} through the Internet \cite{raffel2020exploring}. They can then be applied --- i.e., \textit{transferred} --- to different downstream tasks \cite{radford2019language}. Today, transfer learning with LMs has achieved state-of-the-art results across a wide range of NLP tasks (e.g., sentence and token prediction \cite{devlin2018bert}, named entity recognition \cite{peters-etal-2018-deep}, machine translation \cite{ramachandran-etal-2017-unsupervised} and text classification \cite{howard2018universal}).

There are two general approaches for transferring pretrained LMs to downstream tasks: \textit{feature-based} and \textit{fine-tuning} \cite{devlin2018bert}. 
The feature-based approach, such as ELMo \cite{peters-etal-2018-deep}, uses a pretrained model as the source of input features for the downstream task without touching the pretrained model; the representations resulting from such an LM are used to enrich the features used in the downstream supervised learning. On the other hand, the fine-tuning approach aims to modify the weights and parameters in certain layers of the LM in order to train the LM to perform a specific downstream NLP task such as classification, sentiment analysis, and question-answering; for example, LMs such as OpenAI GPT \cite{radford2018improving} and BERT \cite{devlin2018bert} can be trained to perform downstream tasks by fine-tuning all or part of the parameters of these pretrained models.  

In the last few years, \textit{word embeddings} --- continuous vector representations of words and phrases based on word co-occurrences in large corpora --- became a popular method for language representations \cite{mikolov2013distributed}. Word embeddings capture syntactic and semantic characteristics of the words and have been shown to be very effective on word similarity tasks \cite{sappadla2016using}. The early word embedding models include Skip-gram \cite{mikolov2013distributed}, Word2Vec \cite{mikolov2013efficient}, GloVe \cite{pennington2014glove}, and FastText \cite{bojanowski2017enriching}. More recently, LMs such as BERT, ELMo, and GPT-2 offer an advantage over the early embedding models. In contrast to the fixed representations produced by the early models, LMs produce word embeddings that are dynamically informed by the words around them, known as \textit{contextual word embeddings}~\cite{ethayarajh2019contextual}. Contextual word embeddings can tell whether the word ``mouse'' refers to a computer device or a small rodent from the surrounding context in which the word is mentioned. 



\subsection{Zero-Shot Learning: Concepts and Approaches}
\label{sec:ZSL-approach}

ZSL was originally used in image processing to predict unseen images \cite{romera2015embarrassingly}, and has recently been adapted to text classification to predict unseen classes \cite{pushp2017train}. ZSL has shown great potential in various NLP tasks, such as entity recognition \cite{ma2016label}, relation extraction \cite{levy2017zero}, document classification \cite{nam2016all}, and text classification \cite{pushp2017train}. Unlike supervised classification models that are trained to classify a text as one of the possible classes, ZSL models predict whether a text is related to a label or not. Thus, ZSL treats a classification task (binary or multi-class) as a problem of finding relatedness between texts (e.g., requirements) and labels (classes). Furthermore, unlike the aforementioned transfer learning with LMs, ZSL simply applies LMs without fine-tuning or feature extraction, thus taking transfer learning to the extreme \cite{radford2019language}. 


According to the literature, there are two approaches to ZSL: \textit{entailment-based} and \textit{embedding-based}. The entailment-based approach, initially proposed in \cite{yinroth2019zeroshot}, treats a classification task as a natural language inference (NLI) task.
In the context of ZSL text classification, the entailment-based approach treats an input text sequence as a premise and the labels as hypotheses, and then infers whether the input text entails any of the labels \cite{yinroth2019zeroshot}. For example, given the sentence \textit{``the system must be deployed on Azure''} as a premise and the label string \textit{``this is about software maintenance''} as a hypothesis, the entailment-based classifier provides a score which is then translated into one of the following outputs: entailment (yes), contradiction (no), or undecided. This ZSL approach requires a large inference-based LM that can interpret the entailment relation between an input sequence and a label. 


The embedding-based ZSL approach was introduced by Veeranna \textit{et al.}~\cite{sappadla2016using}. Under this approach, both class labels (e.g., Usability, Security) and input text are represented as word sequences using word embeddings. Text classification then involves computing the \textit{semantic similarity} between each label sequence and the text sequence. If the similarity score is greater than a certain threshold, then the document can be classified into a specific category represented by the label; otherwise, the text document does not belong to that category. Note that as a label is treated as a sequence of words, it can contain any number of words or their combinations. 

Due to the simplicity of the embedding-based approach to ZSL, we have applied it in our study. However, unlike the original proposal by Veeranna \textit{et al.} \cite{sappadla2016using}, who used the static word-embedding technique (Skip-gram) for word representation, we take advantage of state-of-the-art LMs and use them to produce contextual word-embeddings for both labels and input text. In so doing, our embeddings not only capture syntactic and semantic characteristics of the words, but also their context. 

Another difference between our approach and the original one is that we do not use similarity thresholds to determine the predicted label; instead, we treat all text classification as a multi-label classification task and rank-order all the labels with their similarity scores. For a binary or multi-class classification task, we select the label with the highest similarity score as the predicted label. For a multi-label task, we check the top-$n$ labels to ensure we do not miss the predicted labels.
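
As a compact restatement, under notation that we introduce here only for illustration, let $e(\cdot)$ denote the embedding produced by the LM, $t$ the input text, $L$ the set of candidate labels, $s(t,l)$ the similarity between $e(t)$ and $e(l)$, and $\tau$ a similarity threshold. The threshold-based rule of the original proposal and the ranking-based rule described above can then be written as
\[
\hat{Y}_{\tau}(t) = \{\, l \in L : s(t,l) > \tau \,\}, \qquad
\hat{y}(t) = \arg\max_{l \in L} \; s(t,l) .
\]
For a multi-label task, the single maximiser $\hat{y}(t)$ is replaced by the set of the $n$ labels with the highest values of $s(t,l)$.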

The \textit{contextual word embedding-based} approach adopted in our study is depicted in Figure~\ref{ZSLApproach}. We illustrate this approach with a simple example. Suppose we have a requirement: \textit{``CNG shall support mechanisms for secure authentication and communication with the remote management system''}. We want to classify this requirement according to two labels: ``Usability'' and ``Security''. We feed these three word sequences (a requirement plus two labels) into an LM to produce three contextual word embeddings. We then compute the cosine similarity\footnote{The cosine similarity function is the standard way to compute semantic similarity between a label and a text \cite{mikolov2013efficient,pennington2014glove,bojanowski2017enriching,ethayarajh2019contextual}.} between each label and the requirement and obtain two similarity scores: $0.86$ for Security and $0.25$ for Usability. Based on these scores, we deduce that the given requirement is a security requirement. 
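
The ranking step of this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the three-dimensional vectors are made-up stand-ins for LM embeddings (real embeddings have hundreds of dimensions), so the scores do not reproduce the values reported above, and the function names \texttt{cosine} and \texttt{rank\_labels} are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_labels(text_vec, label_vecs):
    """Return (label, score) pairs sorted by decreasing similarity."""
    scores = {label: cosine(text_vec, vec) for label, vec in label_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings; a real LM would produce these vectors.
requirement_vec = [0.9, 0.1, 0.3]
label_vecs = {
    "Security": [0.8, 0.2, 0.4],
    "Usability": [0.1, 0.9, 0.2],
}

ranking = rank_labels(requirement_vec, label_vecs)
predicted = ranking[0][0]                     # top-1 label (binary or multi-class)
top_n = [label for label, _ in ranking[:2]]   # top-n labels (multi-label)
print(predicted, top_n)
```

With these toy vectors the Security label ranks first, mirroring the outcome of the example; only the ordering, not the magnitude of the scores, carries over to real embeddings.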

\begin{figure}[t!]
    \centering
    \includegraphics[width=\textwidth]{Figures/ZSL-Approach.png}
    \caption{\footnotesize The Contextual Word Embedding-Based ZSL Approach.}
    \label{ZSLApproach}
\end{figure}

Clearly, the accuracy of this approach depends heavily on the choice of 1) the \textbf{labels} and 2) the \textbf{LMs}. The above example only shows single-word labels, but to fully exploit the benefit and potential of ZSL and contextual word embeddings for requirements classification, our study will investigate different label configurations. For example, by composing the Usability label as ``instructive, easy, helpful, useful, learnable, explainable, affordable, intuitive, or understandable'', the LM can produce a more meaningful embedding that captures a range of connotations of the Usability requirement, rather than just its face value. Our study will also explore different LMs to find out which one is most effective for requirements classification. 








\section{Conclusion}
\label{sec:con}
This study presents an investigation of the performance of zero-shot learning (ZSL) for the classification of requirements. We consider three main classification tasks, namely: 1) FR/NFR---binary classification into functional and non-functional requirements; 2) NFR---binary and multi-class classification into different non-functional classes, e.g., performance, usability, etc., also considering the multi-label case; and 3) Security---classification of security vs non-security requirements. For each task, we identify the most suitable language model (LM) to be used for requirements representation and the most suitable label selection strategy, as both the encoding of requirements and the terms used as labels to be predicted influence the performance of the approach. Our results show that generic LMs and simple labelling strategies perform better than software engineering/RE-specific LMs and complex labelling strategies. The best results are: F1=0.66 for FR/NFR; F1=$\sim$0.72--0.80 for binary NFR; up to F1=0.76 for multi-class single-label; up to F1=0.98 for multi-class multi-label NFR; F1=0.66 for Security, and up to F1=0.78 for part of the Security dataset. Although performance is lower than that of previously explored supervised techniques, it is worth highlighting that ZSL is fully unsupervised. Therefore, it does not require a pre-labelled dataset, except for an initial evaluation carried out to check its performance. Furthermore, it is inherently flexible to changes in the types of requirements classes to consider, and in the associated labels. As classification schemes change, the approach naturally adapts to the new labelling. 
As future work, we will consider the following directions: 1) assess ZSL for the classification of app reviews, using existing datasets made available by previous studies (cf. Dabrowski for a complete list~\cite{dkabrowski2022analysing}); 2) consider other RE tasks, and corresponding strategies to frame them in terms of classification problems suitable for ZSL; 3) replicate the current experiments with the entailment-based ZSL approach, to explore whether better performance can be achieved; 4) consider the few-shot learning approach, and assess to what extent the shortcomings of ZSL can be addressed by including a limited set of labelled examples.  

\section*{Replication}

We share our experimental settings, including a Colab notebook and the results obtained from all the ZSL classifiers, at \url{https://github.com/waadalhoshan/ZSL4REQ}.





\section{Experimental Design}
\label{sec:exp}

\subsection{Research Questions}
The purpose of this paper is to provide insights into how embedding-based ZSL can be used for requirements classification. In particular, we want to understand which LMs are most suitable, how the naming of the labels used for requirements classification influences the results, and how ZSL performs compared to previous approaches. 
Accordingly, this paper aims to answer the following research questions (RQs):
\\
\\
\textbf{RQ1:} \textit{Which \textbf{language model} achieves the best performance for zero-shot requirements classification?} 
 \\
 In embedding-based zero-shot classification, the LM influences the representation of the requirements. To answer this RQ, we compare how generic and domain-specific LMs perform when used in zero-shot requirements classification. The answer to this question will indicate which LMs are most suitable. 
\\
\\
\textbf{RQ2:} \textit{To what extent can different \textbf{label configurations} affect the performance of zero-shot requirements classification?} 
\\ 
As the embedding-based ZSL approach aims to find semantic similarity between a requirement sentence and a label, the choice of words for labels influences the performance of the zero-shot classifier. The answer to this RQ will suggest which label configuration can improve the performance of the classifier.
\\
\\
\textbf{RQ3:} \textit{Is zero-shot learning \textbf{effective} for requirements classification?} 
\\
This involves a comparison of the best results obtained in our study against the best results of previous proposals available in the literature. With this RQ, we also aim to reflect on the practical implications and lessons learned throughout the experiments. 
\\

To answer these RQs, we conduct a series of experiments based on the ZSL approach described in Section~\ref{sec:ZSL}.
Each experiment is designed with a specific LM to perform different requirements classification tasks using relevant datasets and label configurations. Our experimental design involves five steps: selection of datasets and tasks; LM selection; label configuration; performance measure selection; experiments.





  



These steps are detailed in the subsections below.

\subsection{Dataset and Task Selection}

We select the following two datasets for our experiments: 

\begin{itemize}
   
    \item \textbf{PROMISE NFR dataset} \cite{jane_cleland_huang_2007_268542}, introduced by Cleland-Huang \textit{et al.} \cite{cleland2007automated}: This dataset contains 625 requirements, partitioned into 255 FRs and 370 NFRs. The NFRs are further partitioned into 11 different classes, namely: A = Availability (21 requirements), L = Legal (13), LF = Look and feel (38), MN = Maintainability (17), O = Operational (62), PE = Performance (54), SC = Scalability (21), SE = Security (66), US = Usability (67), FT = Fault tolerance (10), and PO = Portability (1). These classes are unevenly distributed, ranging from 67 requirements for Usability to one for Portability. Each of the large classes (Usability, Security, Operational, and Performance) has more than 50 examples, while the small classes (Fault Tolerance, Legal, Maintainability, and Portability) have from one to 17 requirements each. The dataset has been widely used in the literature, e.g., by Kurtanovi{\'c} and Maalej \cite{kurtanovic2017automatically}, and by Hey \textit{et al.} \cite{hey2020norbert}.
    
    \item \textbf{SecReq dataset} \cite{knausseric20214530183}, introduced by Knauss \textit{et al.} \cite{knauss2011supporting}: This dataset contains 510 requirements, comprising security-related requirements (187) and non-security-related requirements (323). The requirements were collected from three projects: Common Electronic Purse (ePurse), Customer Premises Network (CPN), and Global Platform Spec (GPS). The dataset has been used, e.g., by Varenov \textit{et al.}~\cite{varenov2021security}.

\end{itemize}
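
The figures reported for the two datasets can be tallied with a short script, which also makes the imbalance between large and small classes explicit. The counts are copied from the descriptions above; the variable names are illustrative only.

```python
# NFR class sizes of the PROMISE NFR dataset, as listed in the text.
nfr_counts = {
    "A": 21, "L": 13, "LF": 38, "MN": 17, "O": 62, "PE": 54,
    "SC": 21, "SE": 66, "US": 67, "FT": 10, "PO": 1,
}
fr_count = 255

nfr_total = sum(nfr_counts.values())   # all non-functional requirements
promise_total = fr_count + nfr_total   # whole PROMISE NFR dataset

# Classes with more than 50 examples are the "large" ones.
large_classes = sorted(c for c, n in nfr_counts.items() if n > 50)

# SecReq: security-related vs. non-security-related requirements.
secreq_counts = {"security": 187, "non-security": 323}
secreq_total = sum(secreq_counts.values())

print(nfr_total, promise_total, large_classes, secreq_total)
```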



We select the following representative classification tasks for our study:


\begin{itemize}
    \item \textbf{Task FR/NFR}---\textit{Binary Classification of FRs vs. NFRs.}
    With this task we aim to distinguish FRs from NFRs, assuming that a requirement belongs to either the FR or the NFR class. We use the PROMISE NFR dataset for this task.
    
    \item \textbf{Task NFR} --- \textit{Binary, Multi-class and Multi-label Classification of NFRs}. This task aims to classify different types of NFRs based on the 10 different classes of the PROMISE NFR dataset (we excluded the Portability class as it has only a single sample in the PROMISE dataset). We perform three sub-tasks to understand how ZSL reacts to different ways of classifying NFRs: 1) binary classification, which discerns whether an NFR belongs to a particular class or not; 2) multi-class single-label classification (simply, \textit{multi-class classification}), which assigns an NFR to one of the top or all NFR classes; 3) multi-class multi-label classification (simply, \textit{multi-label classification}), which allocates an NFR to one or more NFR classes. The purpose of the third sub-task is to check whether the top-$n$ NFR classes returned by the ZSL classifier correlate with the assigned NFR label in the dataset.
    
    \item \textbf{Task Security} ---\textit{Binary Classification of security related vs. non-security related requirements.} This task assumes that a requirement belongs to only one of these two classes: security related and non-security related. We use the SecReq dataset for this task.
\end{itemize}

These datasets and tasks are selected for our experiments as they are frequently considered in the literature (cf. Sect.~\ref{sec:related}) and will enable us to compare our results directly with those obtained in previous work. Table \ref{tab:Task_Desc} summarizes the above tasks and their associated datasets.

\input{Tables/TaskDesc}

\subsection{Language Model Selection}

 We select four BERT-based LMs for our study: two generic and two SE domain-specific. We focus on BERT-based models because of their popularity and their suitability for requirements classification~\cite{hey2020norbert}. Other state-of-the-art LMs, such as GPT-2 and GPT-3 by OpenAI and XLNet by Carnegie Mellon University~\cite{yang2019xlnet}, have not been included in our study, as we consider them less suitable than BERT for a requirements classification task such as ours. GPT-2 and GPT-3 are mainly oriented to generation-related tasks, such as language translation and text summarization~\cite{brown2020language}, while XLNet targets tasks that involve long contexts, e.g., paragraphs~\cite{yang2019xlnet}, whereas requirements are typically sentences. 
 
The first two LMs are generic, made available by HuggingFace\footnote{\url{https://huggingface.co}}, the well-known NLP community sharing LMs and other resources. The remaining two are domain-specific for requirements and software engineering. These models are: 
    \begin{itemize}
   
   

        \item \textbf{Sentence-BERT (\textit{Sbert}):} This generic LM was proposed by Reimers and Gure\-vych~\cite{reimers2019sentence} as a fine-tuned version of the BERT LM that enriches the semantic embedding representation, i.e., it helps derive semantically meaningful sentence embeddings that can be compared with the unseen labels using the cosine-similarity measure. The solution overcomes a drawback of plain BERT, for which a sentence embedding is typically computed by averaging the BERT output layer over the tokens of the given sentence. This approach was observed to lead to poor embeddings, and thus to poor semantic representations of sentences/requirements~\cite{reimers2019sentence}. The implementation of the Sbert LM is provided on the HuggingFace model hub by deepset.ai contributors\footnote{\url{huggingface.co/deepset/sentence\_bert}}. 
       
        \item \textbf{All-MiniLM-L12 (\textit{AllMini}):} This generic LM was introduced by Wang \textit{et al.}~\cite{wang2020minilm}, from Microsoft Research. It addresses the complexity of BERT, which requires substantial resources and is often not applicable in real-life applications where low latency is required. The LM uses an approach that reduces (\textit{distils}) the BERT LM while preserving its performance. It is specifically targeted to sentence embeddings, and is therefore suitable for encoding requirements as contextual embeddings. In this experiment, we use a fine-tuned version trained on a dataset of 1B sentence pairs (All-MiniLM v2), which is used as part of HuggingFace's project for encoding sentences and short paragraphs\footnote{\url{huggingface.co/sentence-transformers/all-MiniLM-L12-v2}}. The LM has proved efficient for semantic search and sentence clustering tasks.
       
        
        \item \textbf{BERT4RE:} Developed by Ajagbe and Zhao~\cite{ajagbe2022RE}, this RE domain-specific LM was built on the BERT\textsubscript{base} model and trained with more than seven million words from different RE-related datasets, including the PROMISE NFR dataset, the PURE dataset \cite{ferrari2017pure}, and app reviews from Google Playstore and App Store. Although BERT4RE aims to support a wide range of RE tasks, it has only been tested on the task of identifying semantic roles in requirements documents. As this is the only publicly available RE-specific LM, we include it in our study. The BERT4RE LM is provided by the authors in a Zenodo repository\footnote{\url{zenodo.org/record/6354280}}. 
        
        \item \textbf{BERTOverflow (\textit{SObert}):} This SE domain-specific LM was developed by Tabassum \textit{et al.}~\cite{tabassum-etal-2020-code} and was trained on 152 million sentences from \textsc{StackOverflow}. BERTOverflow shares the same vision as BERT4RE, aiming to overcome the problem that generic LMs such as BERT may not fully capture the semantics of SE terminology. Although BERTOverflow has been trained to perform domain-specific \textit{named entity recognition (NER)} tasks in SE, our research shows that it is among the few software engineering-specific LMs that can potentially be adopted for requirements classification\footnote{Note that another LM for SE is CodeBERT~\cite{feng-etal-2020-codebert}, but this is trained on both natural language and source code, and it is oriented to tasks that involve these two representations, such as code retrieval based on natural language queries.}. The implementation of SObert can be found on the HuggingFace model hub\footnote{\url{huggingface.co/jeniya/BERTOverflow}}.
    \end{itemize}
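
The averaging strategy mentioned in the Sbert item, in which a sentence embedding is obtained by averaging the output vectors of the tokens, can be sketched as follows. This is a toy illustration with made-up two-dimensional token vectors and a hypothetical helper named \texttt{mean\_pool}; it is not the code of any of the LMs listed above.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding has mask 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical output vectors for three real tokens and one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The mask keeps padding positions out of the average, so the padding vector does not pull the sentence embedding towards zero.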


\input{Tables/LMsSpecs}

\paragraph{Technical Specifications} The technical specifications of the selected LMs are listed in Table~\ref{tab:LMs-specs}. As shown in the table, all the LMs share the same underlying architecture, namely BERT, with an embedding size of 512 tokens, except for Sbert, which embeds sentences with a maximum length of only 128 tokens. The embedding size is not a major concern in our experimental setting, since the average length of requirements in the selected datasets is lower than the specified LM embedding sizes; even the longest requirement, found in the SecReq dataset, has 94 words, which is less than the listed embedding sizes. As for the vocabulary size, all the LMs have a vocabulary of approximately 30k words, except for the domain-specific SObert, which holds more than 80k words in its context files. The vocabulary size does not affect the ability of these LMs to handle single words, as any word that does not occur in the vocabulary is split into sub-words; for example, in Sbert the word \textit{usability} is processed as us and \verb|#|\verb|#|ability. This processing enables the contextual embeddings of these LMs to perform well even with out-of-vocabulary words, unlike static word embeddings such as Word2Vec. The other specifications in the table are related to the attention mechanism of the BERT-based architecture. Typically, BERT-based LMs have 12 transformer layers, each containing 12 attention heads. These attention heads compute the interrelationships between input tokens at each layer: each layer processes the token representations using its attention heads, and the output of one layer is the input of the next. 
For instance, with 12 attention heads and 12 layers, each token in a given input is analysed by these heads on 12 distinct interrelationships with the other tokens, i.e., how likely that token is related to the other tokens, and this process is repeated at each of the 12 layers. Therefore, the attention heads compute the probabilities of different kinds of constituent combinations between tokens. This extensive computational process enriches the contextual understanding of words and sentences in these LMs. As shown in the LM specification table, the selected LMs share similar attributes for processing tokens, mostly with 12 attention heads and 12 layers. One difference concerns AllMini, whose hidden size of 384 (i.e., the output representation has 384 dimensions, also known as the width of the layer output)
is smaller than that of the other selected LMs (768). The other concerns BERT4RE, which has only 6 attention heads compared to 12 in the other selected LMs. These variations might affect the contextual representations, which could lead to different representations of the input requirements across the selected LMs~\cite{kovaleva2019revealing}. 
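
The sub-word splitting described in this paragraph can be illustrated with a greedy longest-match-first procedure of the kind used by WordPiece tokenizers, in which non-initial pieces carry a \verb|#|\verb|#| prefix. The sketch below is a simplification under stated assumptions: the vocabulary is a tiny made-up set rather than the vocabulary of any selected LM, and \texttt{wordpiece\_split} is a hypothetical helper name.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces get a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no vocabulary entry fits at this position
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
vocab = {"us", "use", "##ability", "##able", "security"}

print(wordpiece_split("usability", vocab))  # ['us', '##ability']
print(wordpiece_split("security", vocab))   # ['security']
```

Because every piece is looked up in the vocabulary, a word that is absent as a whole can still be represented through known sub-words instead of being mapped to a single unknown token.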


It is worth remarking that the vector-based representations of sentences, requirements, and labels derived from these transformer-based LMs are very different from those of more traditional LMs, such as Word2Vec and GloVe. In particular, transformer-based LMs are able to identify the deep meaning of sentences, while traditional \textit{static} LMs, such as Word2Vec word embeddings, simply expand words in a sentence with related ones. In practice, while with traditional word embeddings a sentence like ``this is not about usability'' is mapped into a vector that is very similar to that of ``this is about usability'' (they differ by only one word), with transformer-based, \textit{contextual} LMs the vectors are different, since the meaning is actually the opposite. This characteristic naturally plays a crucial role in requirements analysis, as requirements sentences often use a very restricted vocabulary~\cite{ferrari2017pure} but convey different meanings.
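
The point about static embeddings can be made concrete with a toy computation. Assuming, for illustration only, that a sentence vector is the average of fixed per-word vectors (the vectors below are invented, not taken from Word2Vec or GloVe, and the helper names are hypothetical), the two example sentences end up almost parallel because they share all but one word.

```python
import math

# Invented static word vectors: each word has one fixed vector.
static = {
    "this": [0.2, 0.1, 0.0],
    "is": [0.1, 0.2, 0.1],
    "not": [0.0, 0.3, 0.4],
    "about": [0.3, 0.1, 0.2],
    "usability": [0.9, 0.2, 0.1],
}


def sentence_vector(sentence):
    """Average the static vectors of the words in the sentence."""
    vectors = [static[word] for word in sentence.split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


positive = sentence_vector("this is about usability")
negative = sentence_vector("this is not about usability")
similarity = cosine(positive, negative)
print(round(similarity, 3))  # close to 1 despite the opposite meaning
```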

    
\subsection{Label Configuration}
\label{sec:labels}

Different labelling strategies, and combinations thereof, are used to define the strings that represent the labels of each class. 

\begin{itemize}

    \item \textbf{Original labels} are the labels \textit{derived} from the original class names used in the datasets. For example, for the task FR/NFR, the original label for the class FR is ``functional'', while for NFR we use two types of labelling: 1) the expression ``not about functional''\footnote{Preliminary experiments have shown that this term is more effective for ZSL than ``non functional''.} or 2) a string including all the NFR class names (``usability, security, availability, ...''). Though the resulting labels are different, the strategy is common, as no additional knowledge is used to derive the labels, except for the information coming from the names used in the datasets. 
    
    \item \textbf{Expert curated labels} are the labels \textit{selected} by the three authors of this paper via brainstorming, based on their understanding of the requirements classes. The goal of the brainstorming activity was to produce a limited set of relevant terms that represent the fundamental meaning of the class, and that could be combined in a single string representing the label. For example, for the task FR/NFR, the label for the class FR is ``functional, system, behavior, shall or must''.  
   

    \item \textbf{Word-embedding generated labels} are the terms \textit{extracted} and \textit{selected} by considering terms that are similar to the class name, according to a word-embedding LM learned from the text of Wikipedia pages belonging to the Computer Science (CS) portal. The idea is that the LM learned from the CS portal represents the meaning of words in the CS domain, and is therefore more suitable than a generic LM for providing similar terms in our CS context. The model was learned based on the approach and code by Ferrari and Esuli~\cite{ferrari2019nlp}. Given the LM and a class name (e.g., security), we produce the label for the class as follows: (1) we identify the top-\textit{n} most similar terms according to the LM; (2) each of the three authors independently tags each term with \textit{yes} (the term is representative of the class), \textit{no} (the term is not representative of the class), or \textit{maybe} (the term could be representative of the class); (3) each author revises their \textit{maybe} tags to \textit{yes} or \textit{no}, also considering the annotations of the other authors; 
    (4) the final tags are decided through majority voting (e.g., \textit{yes}, \textit{yes}, \textit{no} $\rightarrow$ \textit{yes}; \textit{no}, \textit{yes}, \textit{no} $\rightarrow$ \textit{no}). 
    After this procedure, the terms tagged with \textit{yes} are included in the string that represents the label; e.g., for security, the label is ``vulnerability, securing, protecting, protection, cybersecurity, assurance, cyber, countermeasure, threat, privacy, authentication, prevention, or confidentiality''. 
    To assess the consistency of annotations between the three authors at step (2), we evaluate the overall percentages of agreement (i.e., all three annotators assigned the same \textit{yes} or \textit{no} tag to the term), partial agreement (i.e., two of the three annotators assigned the same \textit{yes} or \textit{no} tag to the term), and disagreement (i.e., the three annotators selected three different tags: \textit{yes}, \textit{no}, and \textit{maybe}). In addition, after resolving the \textit{maybe} tags at step (3), we consider the inter-rater reliability (IRR) based on Krippendorff's alpha \cite{krippendorff2018content} as well as Fleiss's Kappa \cite{fleiss1973equivalence}. These statistics measure the level of agreement among multiple annotators and categories. The interpretation of the results follows the guidelines of the Landis and Koch benchmark \cite{landis1977measurement}. 
\end{itemize}    
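
For reference, the agreement statistics mentioned above are commonly defined as follows; the notation is introduced here only for illustration. Let $N$ be the number of candidate terms, $n = 3$ the number of annotators, and $n_{ij}$ the number of annotators who assigned tag $j \in \{\textit{yes}, \textit{no}\}$ to term $i$. Fleiss's Kappa is then
\[
P_i = \frac{1}{n(n-1)} \sum_{j} n_{ij}\,(n_{ij}-1),
\qquad
p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij},
\]
\[
\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i,
\qquad
\bar{P}_e = \sum_{j} p_j^{2},
\qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}.
\]
For instance, a term tagged (\textit{yes}, \textit{yes}, \textit{no}) has $P_i = (2 \cdot 1 + 1 \cdot 0)/(3 \cdot 2) = 1/3$, while a unanimously tagged term has $P_i = 1$. Krippendorff's alpha has the analogous form $\alpha = 1 - D_o/D_e$, where $D_o$ and $D_e$ denote the observed and expected disagreement. In the benchmark of Landis and Koch, values of 0.41--0.60 are usually read as moderate, 0.61--0.80 as substantial, and 0.81--1.00 as almost perfect agreement.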

Note that these are general strategies, which are then combined and instantiated for each specific task. The different combinations of labels produced, and the variants of the strategies, are reported in Sect.~\ref{sec:res} for each task. Not all possible combinations are tested and reported, but only those that were incrementally defined based on the results obtained from the application of single strategies.
    


     
    
   
   
    
   


       
        
       

           
            
          






\subsection{Performance Measures}


The results from each task, with each LM and label configuration, are measured using the standard metrics of precision (P), recall (R), and F1-score (F1). In particular, for the binary classification task, we apply the weighted measures of P, R and F1 (denoted wP, wR and wF1, respectively) to account for unbalanced classes and to enable comparison with other studies in requirements classification~\cite{hey2020norbert,kurtanovic2017automatically}. We also report the same unweighted measures for each class for the best-performing classifier, in order to discuss possible issues that may not emerge from the weighted measures. For the multi-class and multi-label classification tasks, we report the standard P, R and F1 by class for the best-performing classifiers, and then the overall weighted F1 scores across all classes, based on the class distributions in the test dataset. 
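
As a reminder of the usual formulation (the symbols $TP_c$, $FP_c$, $FN_c$, $n_c$, and $N$ are introduced here for illustration), for a class $c$ with $TP_c$ true positives, $FP_c$ false positives, and $FN_c$ false negatives:
\[
P_c = \frac{TP_c}{TP_c + FP_c},
\qquad
R_c = \frac{TP_c}{TP_c + FN_c},
\qquad
\mathrm{F1}_c = \frac{2\,P_c\,R_c}{P_c + R_c}.
\]
The weighted variants average the per-class values by support, where $n_c$ is the number of test instances whose true class is $c$ and $N = \sum_c n_c$:
\[
\mathrm{wF1} = \sum_{c} \frac{n_c}{N}\,\mathrm{F1}_c,
\]
and likewise for wP and wR. As a hypothetical example (the numbers are not taken from the reported results), two classes with $n_1 = 300$, $\mathrm{F1}_1 = 0.80$ and $n_2 = 100$, $\mathrm{F1}_2 = 0.60$ give $\mathrm{wF1} = 0.75 \cdot 0.80 + 0.25 \cdot 0.60 = 0.75$, whereas the unweighted mean would be $0.70$.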





\subsection{Experimental Setup}


Each experiment is a combination of one LM and one label configuration (i.e., the settings of the ZSL classifier), applied to a dataset to address a task. 
We use the Hugging Face Transformers Python package\footnote{\url{huggingface.co/docs/transformers}} to import and prepare the selected LMs with their transformer-like tokenizers, and the \texttt{torch.nn} module of PyTorch\footnote{\url{pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html}} to compute the cosine similarity score between two tensors (i.e., the PyTorch tensor objects holding the contextual representations produced by the selected LMs). To report the results in terms of P, R, and F1, we use the classification report provided by the scikit-learn package. In more detail, the implementation steps of the embedding-based ZSL text classification approach are:

\textbf{\textit{Preparation.}}
We import the pretrained transformer-like model with its tokenizer, keeping the model configuration unmodified. Then, we import the test dataset, and prepare the sequence \textit{X} (the requirement to be labelled) and the label names \textit{Y} (a list of string labels) to be processed by the LM.

\textbf{\textit{Encoding.}}
We tokenise the input sequence (\textit{X}) and each label in the list (\textit{Y}) using the model's tokenizer to prepare the input tokens to be processed by the LM. If the sequence and the labels differ in length, zero-padding is applied to fill the gaps and make them equal in length. The output of the model's tokenizer is a PyTorch tensor object for each input, with two parts: 1) the input IDs and 2) the attention mask. The input IDs are simply the mappings between tokens and their respective IDs, while the attention mask prevents the model from attending to padding tokens. We then pass the input IDs with their corresponding attention mask to the model to obtain the contextualised embeddings. This is achieved by retrieving the first layer result, i.e., the ``contextualised word embedding'', to obtain the sequence and label representations.

\textbf{\textit{Classification.}}
We apply the cosine similarity measure from PyTorch between the sequence embedding and each label embedding. The output is a list of similarity scores, one for each label, as in multi-label classification. We then sort the similarity scores from highest to lowest. For both the binary and multi-class classification tasks, we select the label with the highest score as the predicted label; for the multi-label classification task, we retrieve the top-\textit{n} labels as the selected labels. 
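
The decision rule of this step can be summarised as follows, in notation introduced here for illustration: $s(x, y)$ denotes the cosine similarity between the embedding of the requirement $x$ and the embedding of a label $y$ taken from the label list $Y$. For the binary and multi-class tasks, the predicted label is
\[
\hat{y} = \arg\max_{y \in Y} \; s(x, y),
\]
while for the multi-label task the prediction is the set of the $n$ labels in $Y$ with the largest values of $s(x, y)$. As a hypothetical example, if three labels $y_1, y_2, y_3$ obtain the scores $0.42$, $0.31$, and $0.57$, the single-label prediction is $y_3$, and the top-2 prediction is $\{y_3, y_1\}$.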





\section{Research Findings}
\label{sec:find}


This section derives the answers to the three RQs from the reported results and provides some general observations from our study.

\subsection{Best Language Model (RQ1)}


To answer this RQ, we focus on the performance of LMs and ignore their label configurations. 

\begin{itemize}
    \item For \textit{Task FR/NFR}, the best overall performer is Sbert, achieving a wF1 score of 0.66, with wP = 0.71 and wR = 0.66. This indicates that the generic Sbert model, designed to provide a semantic representation for generic sentences, substantially outperforms the other LMs for this task.
    \item For \textit{Task NFR}, we show the performance of LMs in each sub-task as follows: 
    
    \begin{itemize}
    \item For binary classification of NFR, the generic AllMini model outperforms the other three LMs, particularly on the SE class. 
    \item For multi-class classification of NFR, in the majority of the cases, the best results are obtained by the generic LMs (Sbert and AllMini). 
    \item For multi-label classification of NFR, there is no clear winner as each LM appears to be suitable for a certain requirement class.
  \end{itemize}
    \item For \textit{Task Security}, the best overall performer is the generic AllMini model; however, the generic Sbert model achieves the worst results (wF1 = 0.31). This suggests that generic models are not necessarily better for this specific task, and that a careful selection of the best generic LM is key to the success of ZSL.
\end{itemize}

Based on the above findings, we can state that: 
\begin{mdframed}
\faLightbulbO{} In the majority of the cases, generic LMs perform better than domain-specific LMs on requirements classification tasks.
\end{mdframed}

Our findings thus contrast with the claims that generic LMs do not perform particularly well on domain-specific tasks, as they cannot recognize highly domain-specific vocabulary \cite{beltagy2019scibert,chalkidis2020legal,lee2020biobert,sainani2020extracting,ajagbe2022RE}.

From this we can conclude that \textit{generic LMs, being trained on generic data, are more generalizable and adaptable}, which is the very sense of being generic; by contrast, \textit{domain-specific LMs, being trained on domain-specific data, are less generalizable and adaptable}, which is the very sense of being specific. Future developments of LMs, we posit, should not differentiate between generic vs. specific, but rather should focus on continual learning on new tasks and new data \cite{ruder2019transfer}. As LMs retain and accumulate knowledge across many tasks, they will become more adaptable to new tasks, domain-specific or otherwise.

\subsection{Best Label Configuration (RQ2)}



From the results reported in Section~\ref{sec:res}, we found:

\begin{enumerate}
    \item For \textit{Task FR/NFR}, the best label configuration is \textit{FR\_E} for Sbert. This configuration is composed of the Expert curated and the Original labels, and identifies the NFRs using the names of the NFR classes (Usability, Security, Availability, etc.). This shows that knowledge of NFR characteristics plays an important role in label configuration for this task.
    
    \item For \textit{Task NFR}, we show the performance of label configurations in each sub-task as follows: 
    
    \begin{itemize}
    \item For binary classification of NFR, the best label configuration is \textit{SE\_D} for AllMini, which uses the word embedding with top-20 words for the SE class, and the original NFR labels for the ``Other'' class. 
    \item For multi-class classification of NFR, in the majority of the cases, the best label configurations for individual NFR classes are \textit{MultiNFR\_A} (Original label) and \textit{MultiNFR\_B} (Expert curated label) for Sbert and AllMini. 
    \item For multi-label classification of NFR, simple label configurations based on either the original label (\textit{MultiNFR\_A}) or the expert curated label (\textit{MultiNFR\_B}) appear to be most effective for all classes except US, for which the embedding-based labels (\textit{MultiNFR\_D} and \textit{MultiNFR\_E}) are more effective. This exception could be because US requirements, in comparison with other NFR classes, require more contextual information to identify. 
  \end{itemize}
    
    \item For \textit{Task Security}, the best label configuration is \textit{Sec\_B}. Curated by experts, this label contains only a limited set of three security-related words. However, the results show that this simple label configuration is sufficient to identify security requirements in the given dataset. 
\end{enumerate}

Based on the above findings, we can conclude that: 
\begin{mdframed}

\faLightbulbO{}  In general, simple label configurations with the original class names or with a combination of original and expert curated labels appear to be more effective than more complex word-embedding generated labels. 

\end{mdframed}



Selecting the best (more precisely, the most effective) label configuration is a difficult task and requires testing many different labels by trial and error. Our study shows how we handcrafted each label using one of the three strategies (namely, original labels, expert curated labels, and word-embedding generated labels). More work is needed to find a more systematic way of configuring labels. We argue that expert knowledge of RE, as well as domain-specific and possibly project-specific knowledge, plays an important part in choosing the correct terms for the labels.

\subsection{Effectiveness of ZSL for RE (RQ3)}

Here we address the effectiveness of ZSL by first comparing our best ZSL results to the state-of-the-art results achieved by Kurtanovi\'c and Maalej (K\&M) \cite{kurtanovic2017automatically}, Hey \textit{et al.} (NoRBERT) \cite{hey2020norbert} and Knauss \textit{et al.} (Knauss) \cite{knauss2011supporting}, with respect to exactly the same classification tasks (i.e., binary and multi-class classification). Second, we compare our best ZSL results obtained from multi-label classification with the state-of-the-art results and provide our insight into ZSL classification. 
\input{Tables/CompareFRvsNFR}

\input{Tables/CompareNFRbinary}

\input{Tables/CompareNFRmulti}
\input{Tables/CompareSec}

\begin{enumerate}
    \item Binary Classification of FR vs. NFR: Table \ref{tab:compareFRvsNFR} shows that both K\&M and NoRBERT outperform all our ZSL classifiers. In particular, on FR, K\&M produces the best results with an SVM model that applies all the word features in the PROMISE dataset (i.e., without feature selection), achieving F1 = 0.93. On NFR, NoRBERT produces the best results with the fine-tuned BERT\textsubscript{large} model, with F1 = 0.93. By contrast, the best ZSL classifier (with Sbert LM) only achieves F1 = 0.66 on FR and F1 = 0.65 on NFR. On average, the performance of the best ZSL classifier is 0.27 lower than that of K\&M and NoRBERT. Clearly, these results show that the ZSL approach is (much) less effective than K\&M and NoRBERT with respect to this particular task. 
    
    \item Binary Classification of NFRs: Table \ref{tab:compareNFRbinary} shows that overall, both K\&M and NoRBERT outperform our best ZSL classifier, with wF1 = 0.83 achieved by their best model; on the other hand, the performance of our best classifier (ZSL with Sbert) is only slightly worse, with wF1 = 0.73. Examining the results obtained for each class, on US, our ZSL classifier (with Sbert) performs slightly worse than K\&M, but outperforms NoRBERT. On SE, although both K\&M and NoRBERT outperform our best ZSL classifier (with AllMini), the difference is not large. A similar observation can be made for classes O and PE.
    
    \item Multi-class Classification of NFRs: For this task, we notice in Table \ref{tab:compareNFRmulti} large gaps between the results of K\&M and NoRBERT and our results on every class. As the purpose of this task is basically the same as the binary classification of NFR task, the inconsistent results achieved by ZSL in these two tasks indicate that when requirements belonging to many classes are considered, ZSL does not appear to be sufficiently effective.
   
    \item Binary Classification of Security vs Non-Security Requirements: Table \ref{tab:compareSec} reveals very interesting results: when treating all the security requirements as a whole (i.e., without separating them into different projects), Knauss outperforms the best ZSL classifier by 0.18 points on wF1. However, when the requirements are divided into the three projects (i.e., CPN, GPS and ePurse), ZSL outperforms Knauss on all individual projects. In particular, ZSL (AllMini + Sec\_B) achieves a high wF1 = 0.78 on CPN, compared to Knauss's wF1 = 0.40. This again seems to suggest that ZSL performs well with binary classification of security requirements when opposite labels are clearly defined.
\end{enumerate}

Based on the above findings, we can conclude that:

\begin{mdframed}
\faLightbulbO{} Unsupervised learning with ZSL achieves acceptable performance for binary and multi-class classification tasks. However, it does not outperform supervised classification models, as RE tasks are narrowly defined and often require well-trained models that are specifically fine-tuned on specifically labelled datasets. Nevertheless, without training or fine-tuning, ZSL is more flexible, open to less data-rich tasks, and easily adaptable to the evolution of classification schemes. 
\end{mdframed} 

In fact, we suggest that ZSL is better viewed as a \textit{universal multi-label classification} approach. Intuitively, the more labels there are, the better the chance of finding a match between a requirement and a label. In our study, we have demonstrated how conventional binary and multi-class classification can be cast as single-label classification in ZSL, and we still achieved encouraging results.



In relation to multi-label classification, the following results are obtained:

\begin{itemize}
    \item From Table~\ref{tab:TopNFRmultiLabelResults}, concerning the four largest NFR classes, the best performance for each class is F1 $\sim 0.83$--$0.94$, which is comparable with the average results of NoRBERT (large + ep.32) for multi-class classification (average F1 = 0.84, cf. Table~\ref{tab:compareNFRmulti}), and higher than those of K\&M.
    \item From Table~\ref{tab:AllNFRmultiLabelResults}, concerning all NFR classes, we observe that the best F1 $\sim 0.64$--$0.98$, with variations that depend on the requirements class.
   
\end{itemize}

\begin{mdframed}
\faLightbulbO{} To achieve state-of-the-art performance of ZSL for multi-class classification, a multi-label strategy is recommended. In practice, this implies that a semi-automated classification approach should be followed, in which a human operator is asked to select the most suitable class among the top ones returned by the classifier. 
\end{mdframed}




 
 






\section{Introduction}
\label{sec:intro}

In requirements engineering (RE), system and software requirements specifications are typically written in natural language (NL) \cite{zhao2021natural,kassab2014state}. In the last few years, natural language processing (NLP) techniques based on supervised machine learning (ML), and, more recently, deep learning (DL), have been applied to address several RE tasks, driven by the success of these techniques in a range of domains, including medical diagnosis, credit card fraud detection, and sentiment analysis \cite{sidey2019machine,sarker2021machine,minaee2021deep}. The majority of RE tasks can be framed as NLP \textit{classification} tasks, to be solved with supervised ML techniques~\cite{alhoshan2022zero}. Relevant examples are: 
classifying requirements into different categories \cite{cleland2007automated,kurtanovic2017automatically,dalpiaz2019requirements,hey2020norbert}; identifying requirements from software contracts \cite{sainani2020extracting}; discerning requirements and non-requirements \cite{abualhaija2020automated}; 
detecting ambiguity from regulatory requirements \cite{massey2014identifying}; establishing relationships between requirements \cite{mills2018automatic}; and discovering requirements-relevant content from app reviews \cite{dkabrowski2022analysing,maalej2016automatic}.
To date, research on ML-based RE has primarily focused on \textit{supervised} classification approaches \cite{binkhonain2019review}. 
However, 
supervised ML has some major limitations. The most notable one is that supervised ML methods need to be trained on a large amount of labelled data before they can predict the outcomes on new data \cite{sarker2021machine,wang2019survey,alcoforado2022zeroberto,ferrari2017natural}. This problem is exacerbated in domains like RE, where collecting sufficient training data and labelling them is often expensive, time-consuming, and error-prone~\cite{dalpiaz2019requirements,sainani2020extracting}, and requires sufficient domain- and even project-specific knowledge~\cite{ferrari2017natural}. 
Furthermore, labelled data used in previous studies are frequently unavailable. This happens even in the lively area of app review analysis, in which most studies have not released their labelled datasets, according to a recent survey (cf.~\cite{dkabrowski2022analysing}, p. 34).

Another limitation of supervised learning methods is that they can only classify the data belonging to \textit{seen classes} (i.e., classes covered by the training data), but they cannot classify the data into previously \textit{unseen classes} (i.e., classes not covered by the training data) \cite{wang2019survey,socher2013zero}. Although this limitation is inherent in supervised learning, the ability to deal with previously unseen classes can bring huge benefit to many real-world applications where classes are artificially defined, with no common consensus, or where classes may evolve over time, with new classes emerging and old ones becoming obsolete. One such example is requirements classification where several classification schemes exist for non-functional requirements (NFRs) \cite{glinz2007non,eckhardt2016non}. As software applications, their requirements, and the theory of NFRs itself evolve over time, so do the classification schemes. Consequently, a dataset labelled using one set of classes (e.g., the PROMISE NFR Dataset \cite{jane_cleland_huang_2007_268542}) cannot be reused to train a method that intends to predict a different set of classes (e.g., based on the latest ISO/IEC/IEEE 29148 standard \cite{ieeestandard}). Each time a new classification scheme is used, the datasets must be relabelled accordingly, incurring expensive data-labelling costs.

To address these problems, different learning paradigms have been proposed in recent years \cite{sarker2021machine}. One such paradigm is \textit{transfer learning}. Based on the idea of knowledge transfer and domain adaptation \cite{pan2009survey,weiss2016survey}, transfer learning aims at improving the performance of target learners or models on target domains by transferring the knowledge contained in different but related source domains \cite{zhuang2020comprehensive}. In doing so, transfer learning intends to alleviate the problems of data shortages and expensive data labelling efforts. 

Another paradigm is \textit{zero-shot learning} (ZSL) (also known as \textit{zero-data learning} \cite{larochelle2008zero}). ZSL intends to overcome the two aforementioned limitations of supervised learning by predicting both seen and unseen classes, i.e., by also classifying instances that belong to classes with no labelled instances \cite{larochelle2008zero,lampert2009learning,palatucci2009zero}. 

Expanding on our preliminary study \cite{alhoshan2022zero}, this paper aims to study \textit{how Zero-Shot Learning can be used for requirements classification} and to gain insight into this new paradigm in the context of RE. Accordingly, we intend to make the following contributions:

\begin{itemize}

\item We introduce the concept of ZSL and its relationship with transfer learning.
\item We demonstrate how ZSL can be applied to requirements classification through a series of experiments.
\item We discuss the potential and challenges of using this new learning approach in RE.

\end{itemize}

The preliminary study~\cite{alhoshan2022zero} only assessed ZSL on the classification of security and usability requirements selected from a portion of the PROMISE NFR dataset. This paper substantially extends the previous contribution by evaluating ZSL on different tasks, namely classification of functional vs non-functional requirements, identification of non-functional requirements classes, and classification of security vs non-security requirements. To this end, the full PROMISE NFR dataset and the SecReq dataset~\cite{knausseric20214530183} are used. Furthermore, four different language models are compared, two domain-generic ones, i.e., Sentence-BERT and All-MiniLM, and two domain-specific ones based on documents from the software engineering domain, i.e., Bert4RE and BERTOverflow.

The remainder of the paper is structured as follows: Section \ref{sec:related} briefly reviews the current ML approaches for requirements classification. Section \ref{sec:ZSL} introduces zero-shot learning.
Section \ref{sec:exp} states our research questions and describes our experimental design. Section \ref{sec:res} analyzes the experimental results, while Section \ref{sec:find} answers our research questions based on these results. Section \ref{sec:threats} examines the validity threats to our experiments and our mitigation strategies. Finally, Section \ref{sec:con} concludes the paper.   



\section{Approach}
\label{sec:approach}
To demonstrate how ZSL can be used for requirements classification, we adapt a simple zero-shot multi-label text classification approach proposed by Veeranna et al. \cite{sappadla2016using}. The simplicity of this approach enables us to better understand the use of ZSL in the context of requirements classification. Below, we briefly introduce this approach.

The basic idea of this approach is that each document and each label can be respectively represented as a sequence of words and thus we can predict if a label can be assigned to a given document by calculating the semantic similarity between the label and the document. If the similarity score is greater than a certain threshold, then we can classify the document into a specific category represented by the label; otherwise, the document does not belong to that category. Thus, this approach is ZSL, as it does not use any labelled training documents at all to make predictions. 

Clearly, the accuracy of this approach highly depends on the choice of 1) the \textbf{labels} and 2) the \textbf{representations of words}.

In the last few years, \textit{word embeddings} (continuous vector representations of words and phrases based on word co-occurrences in large corpora) have become a popular method for NLP tasks \cite{mikolov2013distributed}. Word embeddings are capable of capturing syntactic and semantic characteristics of words and have been shown to be very effective on word similarity tasks \cite{sappadla2016using}. The early word embedding models include Skip-gram \cite{mikolov2013distributed}, Word2Vec \cite{mikolov2013efficient}, GloVe \cite{pennington2014glove}, and FastText \cite{bojanowski2017enriching}. More recently, LMs such as BERT, ELMo, and GPT-2 have offered an advantage over the early embedding models. In contrast to the fixed representations produced by the early models, LMs produce word embeddings that are dynamically informed by the words around them, known as \textit{contextualized word embeddings} or word embeddings in context \cite{ethayarajh2019contextual}. For example, contextualized word embeddings can tell whether the word ``mouse'' refers to a computer device or a small rodent. 

The approach by Veeranna et al. \cite{sappadla2016using} is embedding-based and uses Skip-gram to produce word embeddings for both documents and labels. In our study, however, we take advantage of LMs for word embeddings. Figure \ref{ZSLApproach} depicts the ZSL approach used in our study.

\begin{figure}[t!]
    \centering
    \includegraphics[width=\textwidth]{Figures/ZSL_Approach.png}
    \caption{\footnotesize An Embedding-Based ZSL Approach.}
    \label{ZSLApproach}
\end{figure}

Regardless of which embedding model is used, the cosine similarity function is the standard way to compute semantic similarity for word embeddings \cite{mikolov2013efficient,pennington2014glove,bojanowski2017enriching,ethayarajh2019contextual}. Accordingly, this function is also used in our study to measure the semantic similarity between requirements sentences and labels.
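
For two embedding vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{d}$ (symbols introduced here for illustration), the cosine similarity is defined as
\[
\cos(\mathbf{u}, \mathbf{v})
= \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}
= \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^{2}}\;\sqrt{\sum_{i=1}^{d} v_i^{2}}}.
\]
The score lies in $[-1, 1]$ and depends only on the angle between the two vectors, not on their lengths. As a toy example, for $\mathbf{u} = (1, 2, 2)$ and $\mathbf{v} = (2, 1, 2)$ we have $\mathbf{u} \cdot \mathbf{v} = 8$ and $\|\mathbf{u}\| = \|\mathbf{v}\| = 3$, hence $\cos(\mathbf{u}, \mathbf{v}) = 8/9 \approx 0.89$.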

As a label is treated as a sequence of words, we can configure a label in different ways. In its simplest form, a label may contain only a single word. A more sophisticated label may be composed of single words, multi-word terms, and phrases. Our study explores different label configurations and selects the label with the highest similarity score as the predicted, exclusive category for each given requirement.

\subsection{Comparative Studies}





 






\section{Classification of Non-Functional Requirements}
\label{sec:task1}
To classify non-functional requirements ...


\section{Classification of Functional Requirements}
\label{sec:task2}
To classify functional requirements ...




\section{Related Work}
\label{sec:related}


Most studies in ML-based requirements classification focus on categorisation between functional (FR) and non-functional (NFR, or ``quality''~\cite{li2014non}) requirements, and on further categorisation of different NFR classes, such as security, performance, usability, \textit{etc.} However, the distinction between FR and NFR has been a matter of debate in the RE community~\cite{broy2015rethinking,dalpiaz2019requirements}, and the empirical study by Eckhardt~\textit{et al.}~\cite{eckhardt2016non} shows that NFRs can include functional aspects. Furthermore, there is a more fine-grained representation of FRs and NFRs given by the ISO/IEC/IEEE 29148:2018(E) Standard~\cite{ieeestandard}, which distinguishes between functional/performance, quality, usability, interface, and other classes, thus refining the conceptualisation already elaborated by the NFR classification from Glinz~\cite{glinz2007non}. Yet, despite the lack of consensus on what NFRs are, and on how we should classify and represent them, the differentiation between FRs and NFRs is a common categorisation in RE, and in the following we will use this distinction, keeping in mind that it is an artificial construct~\cite{eckhardt2016non}.  


ML-based approaches for requirements classification were examined in a systematic literature review by Binkhonain and Zhao \cite{binkhonain2019review}. One of the findings of that review was that, while the majority of proposed solutions focus on the classification of FRs and different NFR categories, there is a group of solutions that specifically targets security requirements. Accordingly, our discussion of related work also differentiates between these two groups of studies. 

\subsection{Classification of FRs and NFRs}

One of the earliest adoptions of ML to RE was due to Cleland-Huang \textit{et al.}~\cite{cleland2007automated,cleland2006detection}, who proposed to use a set of indicator terms to identify different classes of NFR. The approach was supervised, in that it first identified a set of indicator terms on a set of manually annotated requirements, and then used this set to classify unseen cases. The approach achieved a recall up to 0.80, but suffered from low precision, up to 0.21. This study also introduced the PROMISE NFR dataset~\cite{jane_cleland_huang_2007_268542}, which has been widely used by the research community, and it is also one of the benchmarks of our work. 

To mitigate the problem of dataset annotation, Casamayor \textit{et al.}~\cite{casamayor2010identification} proposed a semi-supervised method, based on an iterative process similar to active learning, in which the user provided feedback to the classifier. Their approach used Naive Bayes (NB) as a classification algorithm and the PROMISE NFR dataset as the training set. After multiple iterations in which an increasing number of training examples were used, they obtained a maximum precision of above 0.80 and a maximum recall of above 0.70 on most classes, except the underrepresented ones.
A semi-supervised learning approach was also proposed by Younas \textit{et al.}~\cite{younas2020extraction}. Similar to Cleland-Huang \textit{et al.}, indicator terms/keywords for NFR classes were selected from the literature, and the semantic similarity was calculated between the requirement statements and the indicator keywords, using the word2vec language model~\cite{mikolov2013distributed}. The method achieved 0.75 precision and 0.59 recall. These results were lower than those of Casamayor \textit{et al.}~\cite{casamayor2010identification}, but considerably less effort was required, as the method did not require labelled requirements as input.

A better-known ML approach is provided by Kurtanovi\'c and Maalej~\cite{kurtanovic2017automatically}, who applied a Support Vector Machine (SVM) to requirements classification. They selected relevant features with an ensemble of different supervised classifiers and achieved a precision and recall of up to 0.92 for identifying FRs and NFRs. For the identification of specific NFR classes, they achieved the highest precision and recall for the security and performance classes, with 0.92 precision and 0.90 recall. Dalpiaz \textit{et al.}~\cite{dalpiaz2019requirements} reconstructed the study by Kurtanovi\'c and Maalej and used the results obtained as a baseline to evaluate their proposed approach, which uses interpretable linguistic features.

To overcome the problem of labour-intensive feature engineering, Navarro \textit{et al.}~\cite{navarro2017towards} proposed one of the first approaches using a deep learning (DL) model.
They used a Convolutional Neural Network (CNN) model on the PROMISE dataset, and obtained precision and recall of 0.80 and 0.79, respectively, thus addressing the problem of limited precision observed by Cleland-Huang \textit{et al.} Similarly, Dekhtyar and Fong~\cite{dekhtyar2017re} also used a CNN, together with language representations based on pre-trained word2vec embeddings of the words found in the requirements. They used the PROMISE dataset and obtained precision and recall of 0.93 and 0.92, respectively. However, their classification was only binary, distinguishing between FRs and NFRs. More recently, Aldhafer \textit{et al.}~\cite{aldhafer2022end} used a Bidirectional Gated Recurrent Neural Network (BiGRU) model to classify requirements. On the PROMISE dataset, they reached 0.93 precision and 0.95 recall for FRs vs NFRs. On the different NFR classes, they obtained 0.78 precision and 0.76 recall.


Work more closely related to ours is that of Hey \textit{et al.}~\cite{hey2020norbert}, who proposed NoRBERT, a transfer learning approach for requirements classification. Their approach is based on fine-tuning the BERT model (Bidirectional Encoder Representations from Transformers)~\cite{devlin2018bert}. They achieved similar or better results with respect to previous works, with F1-scores of up to 0.94 on the PROMISE dataset for FR vs NFR classification. NoRBERT also outperformed recent approaches at classifying NFR classes. The most frequent classes were classified with an average F1-score of 0.87. The proposed solution was also applied to the classification of different types of functional requirements concerns in the PROMISE dataset, achieving an F1-score of up to 0.92.



\subsection{Classification of Security Requirements}


One of the early works on security requirements classification was by Knauss \textit{et al.}~\cite{knauss2011supporting}, who used a Bayesian classifier to identify security-relevant requirements on three industrial datasets. These datasets are also used in our paper (aggregated into the \textit{SecReq} dataset). They achieved precision $> 0.8$ and recall $> 0.9$. In another work, Riaz \textit{et al.}~\cite{riaz2014hidden} proposed an approach to extract security-relevant sentences from requirements documents. They used a dataset of 10,963 sentences belonging to six different documents from the healthcare domain. The proposed approach was semi-automatic and based on KNN (K-nearest Neighbours) classification. The authors achieved a precision of $0.82$ and a recall of $0.79$.

Addressing the lack of domain-specific datasets, Munaiah \textit{et al.}~\cite{munaiah2017domain} proposed a domain-independent classification model for identifying domain-specific security requirements. The proposed approach, a one-class SVM classifier, was used to identify general descriptions related to software security weaknesses,
but not the actual security requirements \textit{per se}. The authors showed that the one-class classifier achieved an average precision, recall and F1-score of 0.67, 0.70 and 0.68, respectively. Varenov \textit{et al.}~\cite{varenov2021security} compared the performance of different LMs, namely BERT, XLNet, and DistilBERT, for security requirements classification. They identified 1,086 security requirements of seven different classes, collected from multiple existing datasets, such as PURE~\cite{ferrari2017pure}, SecReq~\cite{knauss2011supporting} and Riaz's dataset~\cite{riaz2014hidden}. Unlike previous studies, this work aimed to classify security requirements into more fine-grained classes, i.e., Confidentiality, Integrity, Availability, Accountability, Operational, Access Control, and Other. DistilBERT achieved the best results, with an F1-score of 0.78.

\subsection{Contribution}
In comparison with the related work, our study presents a comparative analysis of different ZSL configurations for the classification of requirements. Similarly to the proposal of Hey \textit{et al.}~\cite{hey2020norbert}, we explore the potential of a DL solution on the widely used PROMISE dataset.
Unlike Hey \textit{et al.}~\cite{hey2020norbert}, however, we use ZSL for the classification task, and ours is the first work in RE to do so. While Hey \textit{et al.} focus on addressing the generalisability of the classifier by means of transfer learning, our proposal: (1) avoids the need for a tagged dataset, therefore addressing the well-known problem of the scarcity of annotated datasets in RE~\cite{zhao2021natural,ferrari2017natural,dalpiaz2018natural,ferrari2017pure}; (2) is inherently generalisable to different projects, thus addressing the problem of decreasing performance on unseen projects, which typically affects requirements classifiers~\cite{hey2020norbert,dalpiaz2019requirements}. Concerning security requirements classification, our proposal overcomes the problem of dataset annotation, as does that of Munaiah \textit{et al.}~\cite{munaiah2017domain}. However, their approach is specific to security requirements, while our proposal is more generalisable and adaptable to different classification tasks.












































































 




\section{Experimental Results}
\label{sec:res}
In this section, we report the results by task. 
For each task, we discuss the label configurations and the results. 
\subsection{Task FR/NFR}
For the FR/NFR task we perform a binary classification, which aims to classify a requirement as either FR or NFR, where ``or'' is considered exclusive.


\subsubsection{Label Configuration}
\input{Tables/labelsFRvsNFR}

The label configurations for the FR/NFR Task are reported in Table~\ref{tab:labelsFRvsNFR}. The labels consist of two groups, one representing the FR class and the other the NFR class. Six configurations are used, which combine the different strategies discussed in Sect.~\ref{sec:labels}. We did not apply all the possible configurations derived from the strategies, and in particular we did not apply the word embedding strategy alone. Indeed, the word-embedding LM did not lead to representative words related to the term ``functional'', with the exception of the terms ``procedural'', ``structural'' and ``characterize''. This is due to the abstract nature of the word ``functional'', which is hard to associate with concrete words that could be used to enrich the labels. For this reason, the word embedding strategy, considering the top-20 terms, is used solely in combination with other strategies. In the selection of the word-embedding generated terms, the annotation procedure produced the following statistics: 75\% perfect agreement, 25\% partial agreement, and 0\% disagreement. 
We also computed the inter-rater reliability (IRR), obtaining a Krippendorff's alpha of 0.41 and a Fleiss' kappa of 0.40. Both results indicate a moderate agreement between the three annotators. 

Concerning the other strategies, it is worth noting the use of ``functional'' vs ``not about functional'' (strategy, FR\_1 Original 1). This type of strategy, in which the original label is negated with the prefix ``not about'', is also applied to the NFR class label of FR\_B and FR\_C, and is applied again later in this paper to represent the negation of a class in a binary classification. This expression was chosen after preliminary experiments with other forms of negation, e.g., ``no'' and ``not'', which led to lower performance.
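
To make the role of the labels concrete, the following is a minimal sketch of an embedding-based ZSL decision, under the assumption that each class is represented by a single label embedding and that the requirement is assigned to the class whose label embedding has the highest cosine similarity. The vectors are toy placeholders standing in for LM embeddings, and the names \texttt{cosine} and \texttt{zero\_shot\_classify} are hypothetical, introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(requirement_vec, label_vecs):
    """Return the class whose label embedding is closest to the requirement.

    label_vecs maps a class name to the embedding of its label text
    (e.g., "functional" vs "not about functional").
    """
    scores = {name: cosine(requirement_vec, vec) for name, vec in label_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-dimensional vectors (placeholders, not real LM embeddings).
label_vecs = {"FR": [0.9, 0.1, 0.2], "NFR": [0.1, 0.8, 0.3]}
requirement_vec = [0.2, 0.7, 0.4]

predicted, scores = zero_shot_classify(requirement_vec, label_vecs)
print(predicted, scores)
```

In this sketch, changing the label text only changes the label embeddings, which is why the choice of the negation form can affect the classification outcome without any training step.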









\subsubsection{FR vs NFR Binary Classification}
\input{Tables/resultsFRvsNFR}
Table~\ref{tab:resultsFRvsNFR} reports the overall classification results for all LMs and labelling strategy combinations. In \textbf{bold}, we highlight the best combination for each LM. 

The overall best combination is Sbert + FR\_E, achieving a wF1 score of 0.66, with wP = 0.71 and wR = 0.66. This indicates that the domain-agnostic Sbert model, designed to provide a semantic-laden representation for generic sentences, substantially outperforms the other models for this task. Furthermore, the best labelling strategy for Sbert is FR\_E, i.e., the one that uses the Expert curated labels + Original labels, which identifies the NFRs using the names of the NFR classes (Usability, Security, Availability, \textit{etc.}).


\input{Tables/resultsdetailFRvsNFR}
Looking at Table~\ref{tab:resultsFRvsNFRdet}, we can see how the performance is divided between FR and NFR classification. We see that the model tends to have higher precision on NFR (P = 0.82), and higher recall on FR (R = 0.82). This is an interesting result, as FRs are less frequent in the dataset (255 FR, 370 NFR), and one would expect the opposite. Indeed, the most frequent class is typically returned more frequently in ML approaches, as happens, e.g., for NoRBERT (cf. Hey \textit{et al.}~\cite{hey2020norbert}, Table III of their paper). This phenomenon also occurs for the other best configurations of LMs. This highlights a characterising element of ZSL: the performance does not depend on the size of the dataset for each class, because no actual \textit{learning} is performed on the tagged data.
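
For reference, and assuming that wP, wR and wF1 follow the usual support-weighted definition (an assumption here, since the metrics are defined elsewhere in the paper), the weighted scores can be written as
\[
\mathrm{wP} = \sum_{c \in C} \frac{n_c}{N}\, P_c, \qquad
\mathrm{wR} = \sum_{c \in C} \frac{n_c}{N}\, R_c, \qquad
\mathrm{wF1} = \sum_{c \in C} \frac{n_c}{N}\, F1_c,
\]
where $C = \{\mathrm{FR}, \mathrm{NFR}\}$, $n_c$ is the number of requirements annotated with class $c$ ($n_{\mathrm{FR}} = 255$, $n_{\mathrm{NFR}} = 370$), $N = \sum_{c \in C} n_c = 625$, and $P_c$, $R_c$ and $F1_c$ are the per-class precision, recall and F1. Under this definition, the class sizes enter only as averaging weights at evaluation time, and wF1 is not guaranteed to lie between wP and wR.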


The second-best model is still a domain-agnostic one, AllMini + FR\_D, with wF1 = 0.59, i.e., 0.07 points lower than the first-best. The labelling scheme is only slightly different, as for FR we use solely the label ``functional'', whereas for NFR we use the same set of labels as for the first-best case. Overall, with the exception of SObert (the third-best model), all models have their best performance when using this labelling strategy for NFR.

In Table~\ref{tab:resultsFRvsNFR}, we also highlight in \textit{italic} those classifiers whose performance could be misleading. 
These classifiers are BERT4RE and SObert with FR\_A as the label configuration. Both achieved a modest F1 score, which is not the lowest, with wF1 = 0.44, wR = 0.59 and wP = 0.35. 
These figures do not reflect the suitability of the two classifiers, as they tend to classify \textit{all} the input requirements into a single class, NFR (the most frequent class in the dataset). This is unrelated to over-fitting, since these classifiers are not trained on, and do not learn from, any features of the given dataset. Rather, we attribute these poor results to the fact that 
FRs may share several terms and expressions with NFRs, and domain-specific LMs, which are trained on smaller datasets than generic LMs, can have difficulty distinguishing them. 
Consider these two example requirements from the PROMISE dataset: \textit{Any disputes cases that have been closed for over 6 months must be purged from the online disputes database} and \textit{All actions that modify an existing dispute case must be recorded in the case history}, which are tagged with the FR and NFR labels, respectively, but share several terms. Using the original labels, as in FR\_A, the distinction is harder to make for a domain-specific LM
than for the generic LMs (i.e., Sbert and AllMini), which correctly classified both example requirements. This issue could potentially be solved with a more project-specific labelling strategy for functional requirements (i.e., choosing terms that characterise functional requirements in the specific project).
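
The reported values are consistent with a classifier that assigns every requirement to the NFR class. The following worked check is a sketch under two assumptions: the weighted metrics are support-weighted averages of the per-class scores, and the precision, recall and F1 of a class that is never predicted are taken as 0. The function name is hypothetical.

```python
def weighted_scores_single_class(n_predicted_class, n_other_class):
    """Weighted P, R and F1 when every item is assigned to one class.

    n_predicted_class: number of items truly in the class that is always predicted
    n_other_class:     number of items truly in the class that is never predicted
    """
    total = n_predicted_class + n_other_class
    weight = n_predicted_class / total  # support weight of the predicted class

    # Predicted class: all its items are found, but all other items are also
    # assigned to it, so precision equals its share of the dataset.
    precision = n_predicted_class / total
    recall = 1.0
    f1 = 2 * precision * recall / (precision + recall)

    # The never-predicted class contributes 0 to each weighted sum
    # (assumed convention for undefined precision).
    return weight * precision, weight * recall, weight * f1


# PROMISE class sizes reported above: 370 NFR, 255 FR.
wP, wR, wF1 = weighted_scores_single_class(370, 255)
print(round(wP, 2), round(wR, 2), round(wF1, 2))
```

Rounded to two decimals, the three values match the wP, wR and wF1 reported for BERT4RE and SObert with FR\_A.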

\subsection{Task NFR}

In this task, we performed three classification sub-tasks: a binary classification to detect a specific NFR category (e.g., ``usability'' vs ``other''); a multi-class classification to assign a requirement to one class out of a set of NFR classes; and a multi-label classification, in which each requirement is associated with a ranked list of NFR classes, and we check whether the correct label is among the top-\textit{k} classes. This last approach can be applied in a semi-automatic classification context, in which the requirements analyst is presented with the top-\textit{k} classes and asked to select the correct one. For all the sub-tasks, we evaluate the results: (1) considering only requirements in the largest classes, namely security, usability, performance and operational, which include the majority of the requirements; (2) considering all the classes, except the portability class, which includes only one requirement. For the multi-label classification case, we consider $k = 2$ when only the largest classes are considered, and $k = 3$ when all the classes are considered.
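
The top-\textit{k} criterion can be sketched as follows. This is an illustrative example only: the similarity scores are invented placeholders, and the helper names \texttt{top\_k\_labels} and \texttt{is\_top\_k\_hit} are hypothetical.

```python
def top_k_labels(scores, k):
    """Return the k class names with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def is_top_k_hit(scores, true_label, k):
    """True if the annotated class appears among the k highest-ranked classes."""
    return true_label in top_k_labels(scores, k)


# Placeholder similarity scores for one requirement over the four largest classes.
scores = {"usability": 0.41, "security": 0.63, "performance": 0.58, "operational": 0.22}

print(top_k_labels(scores, 2))
print(is_top_k_hit(scores, "performance", 2))
print(is_top_k_hit(scores, "usability", 2))
```

With $k = 1$ this criterion reduces to the multi-class case, in which only the highest-ranked class is compared with the annotation.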


\subsubsection{Label Configuration}
\label{sec:NFRlabels}
\input{Tables/NFRlabelsExample}
\input{Tables/TopNFRmultiLabels}

Two labelling configurations are used for the three sub-tasks, one for the binary case (Table~\ref{tab:labelsNFRbinary}), and the other for both the multi-class and multi-label classification cases (Table~\ref{tab:labelsNFRmulti}). 

\paragraph{Binary classification} For the binary case, we have 5 configurations for each NFR class considered, i.e., each binary ZSL classifier. In Table~\ref{tab:labelsNFRbinary} we report only the labels for the usability and security classes, while the other labels are reported in the Appendix Table \ref{tab:appendixNFRbinLabels}. The strategies are analogous to those already discussed for the FR/NFR Task. The only main difference is the usage of the top-50 words from the word embeddings, besides the top-20. We performed some preliminary experiments and saw that the list of similar words for the NFR class names included relevant words also beyond the top-20, and therefore we considered it reasonable to extend the list of terms to be included in the labels. 

Concerning the agreement in the word selection (top-50), we have the following statistics: 52\% perfect agreement, 44\% partial agreement, and 2\% disagreement. We computed the IRR using Krippendorff's alpha at micro level (IRR over all NFR labels) and macro level (average of the IRR per NFR label). We obtained an alpha of 0.45 and 0.53 at macro and micro level, respectively. To confirm the IRR results, we also computed Fleiss' kappa, obtaining 0.42 and 0.52 at macro and micro level, respectively. The overall IRR results indicate a moderate agreement between the three annotators.
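
As an illustration of how such an agreement score is computed, the sketch below implements the standard Fleiss' kappa formula for a fixed number of raters per item. The rating matrix is a toy example with two hypothetical categories (``include'' and ``exclude''), not the annotation data of this study, and the function name is an assumption.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same
    number of raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: 4 candidate words, 3 annotators, columns = [include, exclude].
ratings = [[3, 0], [2, 1], [0, 3], [3, 0]]
kappa = fleiss_kappa(ratings)
print(kappa)
```

In this reading, a micro-level score would be obtained from one table containing the candidate words of all NFR labels, and a macro-level score by averaging the values computed separately per NFR label.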

\paragraph{Multi-class and multi-label classification.} For these tasks, we have a list of labels for each configuration, cf. Table~\ref{tab:labelsNFRmulti}. The list is represented with square brackets, the elements in the list are separated by commas, and each element is a label, expressed between quotes, as in Python code. In the table, each label in the list is associated with one of the top-4 largest classes, namely usability, security, performance and operational NFR. The label configurations considering \textit{all} classes are reported in the Appendix Table~\ref{tab:appendixNFRmultiLabels}. We did not use combinations of labelling strategies for these cases, given the extensive number of experiments and the exploratory nature of the study. 








\subsubsection{NFR Binary Classification}
\input{Tables/TopNFRbinaryResults}
Table~\ref{tab:TopNFRbinresults} reports the classification results for the 4 largest NFR classes. The overall results indicate acceptable performance, with a wF1 of about 0.7 or higher for all the best configurations reported in the table. 

The highest F1 score of 0.84 is achieved for the security class, with AllMini + SE\_D, which uses the word embedding selected labels (top-20) for the security class, and the original NFR labels for the ``other'' class. Next is the usability class with wF1 = 0.80, using Sbert + US\_E, which includes the word embedding selected labels (top-50) for the usability class, and the original labels for the ``other'' class. This suggests that, for this task and for highly represented classes such as security and usability, generic LMs combined with selected terms from word embeddings as labels appear to be the most effective configuration. However, it should be noted that this also depends on the specific requirement class considered. Indeed, for performance requirements, generic models still outperform the others (F1 = 0.78 for both Sbert and AllMini), but the labelling strategy B (Expert curated) appears to be the most effective for AllMini. Furthermore, for operational requirements, the best performance is achieved with BERT4RE (F1 = 0.72), though it is comparable with that of Sbert (F1 = 0.70). While, in general, generic models outperform the specific ones, the performance may depend on the class considered, i.e., the specific task addressed. 


\input{Tables/resultsdetailBinaryNFR}
Table~\ref{tab:resultsdetailNFRbin} reports the detailed classification report of the best ZSL classifiers from Table~\ref{tab:TopNFRbinresults}, considering P, R, and F1 for each class. We see that all the best classifiers tend to achieve higher performance on the ``other'' class (the best F1 for the NFR class is 0.70, vs 0.89 for the ``other'' class). This suggests that the ZSL binary classifier encounters some difficulty in associating the requirements with the specific labels, despite the extensive set of terms used. A more accurate selection of terms, or the usage of terms directly coming from the requirements themselves\footnote{We did not consider this option, as it would have biased the classification. However, it is a viable choice in practical contexts.}, could overcome this issue. 


\input{Tables/BestZSLsys4NFRbinary}
Finally, Table~\ref{tab:bestZSL4NFRbinary} lists the top performance rates considering all classes and the entire set of requirements. Comparing these results with Table~\ref{tab:resultsdetailNFRbin}, we see that there is no substantial decrease in performance for the largest classes, e.g., US still achieves wF1 = 0.80, while SE achieves wF1 = 0.85, which is even higher than wF1 = 0.84 in Table~\ref{tab:resultsdetailNFRbin}. However, the results are \textit{all} biased towards the ``other'' class, since, even in the best case, F1 for the NFR class is lower than 0.50. This problem was not so evident in Table~\ref{tab:resultsdetailNFRbin}, at least for the US and SE classes, where F1 is still acceptable also for the NFR class. We can therefore conclude that, when requirements belong to many different classes, a binary ZSL classification leads to poor classification results with the selected labelling strategy. 

\subsubsection{NFR Multi-class Classification}
\input{Tables/TopNFRmultiResults}

Table \ref{tab:TopNFRmultiesults} reports the multi-class classification results for the 4 largest NFR classes. We see that, compared to the binary classification, results are substantially lower, although still acceptable for SE (F1 = 0.76), O (F1 = 0.64) and PE (F1 = 0.64). In the majority of the cases, the best results are obtained again with the domain-generic LMs, and using MultiNFR\_A or MultiNFR\_B as labels. These are the shortest labels, which do not use the word embedding strategies. This result is the opposite of what was observed for binary classification for NFR. We argue therefore that, in a multi-class classification setting, longer and more informative labels can lead to some possible overlapping between the represented meaning of each class. Instead, in a binary classification setting, more informative labels, i.e., using word embeddings, appear to be more effective. 

\input{Tables/AllNFRmultiResults}

Table~\ref{tab:NFRmultiAll} reports the multi-class classification results for all the NFR classes, except for the portability class. The performance in terms of F1 remains acceptable only for the SE class (F1 = 0.69), while poor results are obtained for the other classes. Similarly to what was observed for the binary classification, when requirements belonging to many classes are considered, ZSL does not appear to be sufficiently effective.  

\subsubsection{NFR Multi-label Classification}

\input{Tables/TopNFRmultiLabelResults}
Table \ref{tab:TopNFRmultiLabelResults} reports the multi-label classification results for the 4 largest NFR classes, considering the top-2 labels returned by the classifier. In other terms, when the right label is returned by the classifier in the top-2 labels, we consider it a true positive. We see that in this case performance  substantially increases with respect to the multi-class classification case in Table~\ref{tab:TopNFRmultiesults}, e.g., reaching $F1= 0.94$ for US requirements,  $0.89$ for SE, $0.83$ for O, and $0.89$ for PE. This suggests that the multi-label classification strategy may be the most effective when dealing with NFR classification. 

Looking at the results based on the LMs, we do not have a clear pattern, and each LM appears to be suitable for a certain requirement type. Concerning labels, simple configurations such as MultiNFR\_A and MultiNFR\_B appear to be the most effective for all classes, except US, for which the embedding-based labels are more effective. This could be due to a better and more clear-cut characterisation of US requirements with respect to other types. 

\input{Tables/AllNFRmultiLabelResults}
Table~\ref{tab:AllNFRmultiLabelResults} reports the multi-label classification results for all the NFR classes, considering the top-3 results, i.e., if the right label is returned among the top-3 labels, we consider it a true positive. Also in this case, the performance remains rather high, frequently with F1 above $0.90$ for the best configurations. In terms of LMs, a clear pattern cannot be identified, although we see that, for more populated classes (top), the highest F1 is always achieved by a generic LM, while for less populated classes (bottom, e.g., SC, A, MN) the best results are obtained with specific models. Overall, we can say that ZSL in the multi-label classification context appears to be effective also in the case of NFRs belonging to many classes.

\subsection{Task Security}
This task is based on the SecReq dataset. We performed a series of binary classification tasks on all or parts of this dataset. In the following sub-sections, we report the label selection and the experimental results.

\subsubsection{Security Label Configuration}
\input{Tables/labelsSecurity}




The labelling of the security class (Table~\ref{tab:labelsSecurity}) is very similar to the label groups related to security as an NFR class in the binary classification task (cf. Sect.~\ref{sec:NFRlabels}). The agreement statistics obtained for the word-embedding terms related to ``security'' are the following: 50\% perfect agreement, 48\% partial agreement, and 2\% disagreement. For IRR, we obtained a Krippendorff's alpha of 0.46 and a Fleiss' kappa of 0.45. Both results indicate a moderate agreement between the three annotators. 





\subsubsection{Security Binary Classification}
\input{Tables/resultsSecurity}

Table~\ref{tab:resultsSecurity} reports the results for the Security task, considering all the requirements in the three datasets. The best performance is achieved by AllMini + Sec\_B, with a wF1 score of 0.66, wP = 0.68 and wR = 0.65. The generic LM AllMini thus achieves the best results. On the other hand, the other generic model, Sbert, achieves the worst results (wF1 = 0.31), suggesting that generic models are not necessarily better for this specific task. 
The best set of labels, Sec\_B, is the expert-curated one, which includes a limited set of three security-related words. This suggests that a limited number of well-selected terms is sufficient to identify security requirements in this dataset.

\input{Tables/resultsdetailSecurity}

Table~\ref{tab:resultsdetailSecurity} shows the performance for the two classes. We see that better performance in terms of F1 is achieved for Non-Security requirements (F1 = 0.70 vs 0.58), using AllMini + Sec\_B, i.e., the best configuration. Looking in more detail, the best recall (R = 0.92) is obtained by Sbert + Sec\_D. Therefore, if one seeks a better ability to identify security requirements, i.e., high recall on this set, this configuration should be preferred, despite having the worst overall performance. This is an important observation, since for many requirements tasks, including this one, high recall is more important than high precision, as remarked by Berry~\cite{berry2021empirical}: if one searches for security requirements, one wants as few false negatives as possible.

\input{Tables/resultssubsetSecurity}
Table~\ref{tab:resultssubsetSecurity} reports the results for the Security task, divided by each dataset included in SecReq. The best results are achieved for CPN (wF1 = 0.78), while the worst results are for GPS (wF1 = 0.63). This could be due to the specific characteristics of the datasets. In some cases, security and non-security requirements in GPS are expressed with very similar sentences, which are likely to be classified similarly although they belong to different classes (e.g., class Security: \textit{The Load File Data Block Hash is used in the computation of the Load File Data Block Signature} vs Non-Security: 
\textit{The Load File Data Block Hash is used in the computation of The Load Token}). Misclassifications can also be due to the quality of the tagging of the requirements in the annotated datasets. For example, in CPN we have the following requirement marked as non-security related: \textit{CNG shall support mechanisms for secure authentication and communication with the remote management system}. In GPS, we have another example, also marked as non-security: \textit{The security level of the communication with an off-card entity does not necessarily apply to each individual message being transmitted but can only apply to the environment and/or context in which messages are transmitted}.

\section{Threats to Validity}
\label{sec:threats}

\paragraph{Construct Validity} The main threat to construct validity in our study is the adopted distinction between FRs and NFRs. As previously discussed, this is an artificial distinction that does not often apply in practice~\cite{eckhardt2016non}, as non-functional requirements are better identified as \textit{qualities}~\cite{ieeestandard}, and the classification is frequently non-binary, i.e., a multi-label classification. However, FR/NFR is a traditional distinction, still common in industrial practice and in research. Furthermore, ZSL is inherently designed for single-label classification, and using it for multi-label classification in the binary case would introduce the need for threshold values in the classification (i.e., when both classes have a correlation score above a certain threshold, the requirement is classified as both FR and NFR). To obtain rigorous results, the threshold value would need to be assessed with a sensitivity analysis, which is not realistic with the limited datasets available and would thus lead to non-representative thresholds. For this reason, we excluded the annotation of PROMISE by Dalpiaz \textit{et al.}~\cite{dalpiaz2019requirements} from our evaluation. For security vs non-security requirements, the same observations as for FR/NFR hold. Finally, the adopted metrics for evaluation (precision, recall, weighted F1, accuracy) are those typically used for ML systems, so we do not foresee any major construct validity issue in this respect.
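
To illustrate why a threshold would be needed, the sketch below shows one possible threshold-based multi-label decision. It is not part of our evaluation: the scores and threshold values are invented placeholders, the function name is hypothetical, and falling back to the single best class when no score reaches the threshold is an assumed design choice.

```python
def labels_above_threshold(scores, threshold):
    """Return every class whose score reaches the threshold.

    If no class reaches it, fall back to the single best-scoring class,
    so that each requirement always receives at least one label.
    """
    selected = [name for name, score in scores.items() if score >= threshold]
    if not selected:
        selected = [max(scores, key=scores.get)]
    return selected


# Placeholder scores for one requirement against the two class labels.
scores = {"FR": 0.62, "NFR": 0.58}

for threshold in (0.5, 0.6, 0.9):
    print(threshold, labels_above_threshold(scores, threshold))
```

The same requirement is labelled as both FR and NFR at a threshold of 0.5, but as FR only at 0.6, which shows how strongly the outcome depends on a value that would have to be tuned on annotated data.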

\paragraph{Internal Validity} This is an experiment with software subjects, and the intervention of the researchers is limited, thus minimising researcher bias. In the evaluation, we have used established and widely used annotated datasets from the literature. Therefore, the internal validity threats are to some extent inherited from the labelling performed by previous work. While the labelling of the PROMISE dataset has been questioned by previous work~\cite{dalpiaz2019requirements,hey2020norbert} (requirements have been labelled by students, classification is binary, and some classes are not sufficiently represented), it represents a classical benchmark, which can be used to compare our results with previous state-of-the-art proposals. Concerning internal threats due to implementation issues, we have adopted widely used LMs, made available by Hugging Face (domain-generic models), or developed by authors from the software engineering community (domain-specific models). These models have been tested in other environments, thus increasing the confidence in their reliability. Concerning the implementation of the ZSL algorithm, we have used the Transformers package in Python to retrieve the LMs from the Hugging Face Hub and to encode the labels and the requirements. This package is also widely used, and we have made our code available for inspection in a Google Colab notebook, so that results can be replicated.

\paragraph{External Validity}
The PROMISE dataset includes requirements written by students, which may not be representative of industrial requirements. In contrast, the SecReq dataset includes requirements from three industrial projects. Our evaluation of ZSL is limited to the three tasks of FR/NFR, NFR, and Security, and therefore our conclusions apply only to these tasks. Different results may be obtained when other classification schemes are used, or when other types of requirements-related information (e.g., user stories or app reviews) are considered. Another limitation is that we experimented with the embedding-based ZSL approach only, without considering other ZSL strategies, such as the entailment approach. This is due to the exploratory nature of the study, which is focused on the comparison of different types of LMs and different labelling strategies. Future work will provide a thorough comparison with other ZSL approaches, stemming from our results. Concerning the coverage of possible embedding-based ZSL configurations, we have considered both state-of-the-art domain-generic LMs (SentenceBERT, and All Mini LM) and domain-specific ones (BERT4RE and BERTOverflow), all based on the BERT LM, which is widely used and has been shown to achieve high performance also for RE classification tasks~\cite{hey2020norbert}. Furthermore, we have used multiple labelling strategies, designed for the specific tasks addressed. Therefore, we argue that our analysis can be considered representative of the usage of embedding-based ZSL for requirements classification, using LMs derived from BERT.





























