\section{Introduction}
Human Activity Recognition (HAR) is the understanding of human behaviour from data captured by pervasive sensors, such as cameras or wearable devices. It is a powerful tool in medical application areas, where consistent and continuous patient monitoring can be insightful. Wearable devices provide an unobtrusive platform for such monitoring, and due to their increasing market penetration, feel intrinsic to the user. This daily integration into a user's life is crucial for increasing the understanding of overall human health and wellbeing. This is referred to as the ``quantified self" movement. 


Wearables, such as actigraph accelerometers, generate a continuous time series of a person's daily physical exertion and rest. This ubiquitous monitoring presents substantial amounts of data, which can ({\em i})~ provide new insights by enriching the feature set in health studies, and ({\em ii})~ enhance the personalisation and effectiveness of health, wellness, and fitness applications. By decomposing an accelerometer's time series into distinctive activity modes or actions, a  comprehensive understanding of an individual's daily physical activity can be inferred. The advantages of longitudinal data are however complemented by the potential of noise in data collection from an uncontrolled environment. Therefore, the data sensitivity calls for robust automated evaluation procedures.


In this paper, we present a robust automated human activity recognition (RAHAR) algorithm. We test our algorithm in the application area of sleep science by providing a novel framework for evaluating sleep quality and examining the correlation between the aforementioned and an individual's physical activity. Even though we evaluate the performance of the proposed HAR algorithm on sleep analysis, RAHAR can be employed in other research areas such as obesity, diabetes, and cardiac diseases.

\section{Related Work}

Human activity recognition (HAR) has been an active research area in computer vision and machine learning for many years. A variety of approaches have been investigated to accomplish HAR ranging from analysis of still images and videos to motion capture and inertial sensor data.
 
Video has been the most widely studied data source in HAR literature. Hence, there exists a wealth of papers in this particular domain. The most recent literature on HAR from videos include trajectory-based descriptors~\cite{ HWang:IJCV15, BFernando:TPAMI16, IAtmosukarto:WACV15}, spatio-temporal feature representations~\cite{ SMa:CVPR15, ZZhou:TMM15, DTran:ICCV15}, feature encoding~\cite{ VKantorov:CVPR14, XPeng:ECCV14, HKuehne:WACV16}, and deep learning~\cite{ JDonahue:CVPR15, LWang:CVPR15, LSun:ICCV15}. Reviewing the extensive list of video-based HAR studies, however, goes beyond the scope of this study and we refer the reader to~\cite{ SKe:Computers13, PBorges:TCSVT13} for a collection of more comprehensive surveys on the topic. 

Unlike HAR from video, existing approaches for HAR from still images are somewhat limited, and range from histogram-based representations~\cite{NIkizler:ICPR08, CThurau:CVPR08} and color descriptors~\cite{FKhan:IJCV13} to pose-, appearance- and parts-based representations~\cite{WYang:CVPR10, SMaji:CVPR11, BYao:ICCV11, GSharma:TPAMI16}. Guo and Lai recently provided a comprehensive survey of the studies on still image-based HAR in~\cite{GGuo:PatRec14}.

Several techniques have been proposed, on the other hand, for HAR from 3D data, encompassing representations based on bag-of-words~\cite{ WLi:CVPRW10, LXia:CVPRW12}, eigen-joints~\cite{ XYang:CVPRW12}, sequence of most informative joints~\cite{ FOfli:JVCI14}, linear dynamical systems~\cite{ RChaudhry:CVPRW13}, actionlets~\cite{ JWang:CVPR12}, Lie algebra embedding~\cite{ RVemulapalli:CVPR14}, covariance descriptors~\cite{ MHussein:IJCAI13}, hidden Markov models~\cite{ FLv:ECCV06}, subspace view-invariant metrics~\cite{ YSheikh:ICCV05} and occupancy patterns~\cite{ JWang:ECCV12, AVieira:CIARP12}. Aggarwal and Xia presented a recent survey summarizing state-of-the-art techniques in HAR from 3D data~\cite{JAggarwal:PatRec14}.

Unlike vision-based HAR systems, sensor-based HAR technologies commonly deal with time series of state changes and/or various parameter values collected from a wide range of sensors such as contact sensors, accelerometers, audio and motion detectors, etc. Chen et al.~\cite{LChen:TCMCC12} and Bulling et al.~\cite{ABulling:CSUR14} present comprehensive reviews of sensor-based activity recognition literature. The most recent work in this domain includes knowledge-based inference~\cite{ACalzada:EMBC14, DBiswas:HUMOV15}, ensemble methods~\cite{ATripathi:EAIS15, CCatal:ASOC15}, data-driven approaches~\cite{RAkhavian:WSC15, LLiu:KNOSYS15}, and ontology-based techniques~\cite{GOkeyo:PMCJ14}.

All of the aforementioned studies investigate recognition/classification of fully observed action or activity, e.g., jumping, walking, running, drinking, etc. (i.e., activities of daily living), using well-curated datasets. However, thanks to the ``quantified self" movement, myriad of consumer-grade wearable devices have become available for individuals who have started monitoring their physical activity on a continuous basis, generating tremendous amount of data. Therefore, there is an urgent need for automatic analysis of data coming from fitness trackers to assess the physical activity levels and patterns of individuals for the ultimate goal of quantifying their overall wellbeing. This task requires understanding of longitudinal, noisy physical activity data at a rather higher (coarser) level than specific action/activity recognition level. Main challenges as well as opportunities of HAR from personalized data and lifelogs have been discussed in several dimensions in~\cite{BDobkin:CurrOpinNeurol13, OLara:SURV13, JBort-Roig:Sports14, MRehman:Sensors15, SMukhopadhyay:JSEN15}.

There has been a number of initiatives to overcome the challenge of collecting annotated personalized data to further research on HAR from continuous measurement of real-world physical activities~\cite{MZhou:ICMR13, CMeurisch:UbiComp15}. Even though such systems exhibit a crucial attempt in furthering research in mining personalized data, they have limited practical importance as they rely on manual annotation of the acquired data. There has also been recent attempts to automatically recognize human activities from continuous personalized data~\cite{JHamm:MobiCASE13, CDobbins:CIT15, MUddin:WearSys15, OBanos:EMBC15}. However, most of these studies are designed to recognize only a predefined set of activities, and hence, not comprehensive and robust enough to quantify the physical activity levels for the overall assessment of individuals' wellbeing.


\section{Background}
Sleep pattern evaluation is a paragon of cumbersome testing and requires extensive manual evaluation and interpretation by clinical experts. Unhealthy sleep habits can impede physical, mental and emotional wellbeing, and lead to exacerbated health consequences~\cite{strine2005associations}. Since patient referral to sleep specialists is often based on self-reported abnormalities, exacerbation often precedes diagnosis. 

Clinical diagnosis of complex sleep disorders involves a variety of tests, including an overnight lab stay with oxygen and brain wave monitoring (polysomnography and electroencephalogram, respectively), and a daily sleep history log with a subjective questionnaire. The daily sleep logs and questionnaires are often found to be unreliable and inconsistent with actual observed activity. This is especially true in adolescents~\cite{arora2013investigation}. The overnight stay allows specialists to manually monitor the patient's sleep period. This requires the active involvement of a clinical sleep specialist. Furthermore, the monitoring is only for one night and in a clinical setting, rather than the patient's own home. Using wearable devices provides both a context-aware and longitudinal monitoring. 

The inconvenience and inaccuracy of daily logs, coupled with the invasiveness of an overnight lab stay, substantiate the need and adoption of wearable devices for first pass diagnostic screening. More generally, using our HAR approach with a wearable device empowers users to self-monitor their sleep patterns, and reform their activity habits for optimised sleep and an improved quality of life. 


\section{Preliminaries}
In this section we present a description of the dataset and the context-aware definitions used for our application area.

\subsection{Data}
Data was collected as part of a research study to examine the impact of sleep on health and performance in adolescents by Weil COrnell Medical COllege - Qatar. Two international high schools were selected for cohort development. Student volunteers were provided with an actigraph accelerometer, ActiGraph GT3X+\footnote{http://actigraphcorp.com/support/activity-monitors/gt3xplus/}, to wear on their non-dominant wrist, continuously throughout the study (i.e. even when sleeping). Deidentified data collected in the study were used in the current analysis.

The ActiGraph GT3X+ is a clinical-grade wearable device that has been previously validated against clinical polysomnography~\cite{PFreedson:MedSciSports98}. The device samples the user's sleep-wake activity at 30-100 Hertz. Currently sleep experts use this device in conjunction with the accompanying software, ActiLife\footnote{http://actigraphcorp.com/products-showcase/software/actilife/}, to evaluate an individual's sleep period. We evaluate our results side-by-side with ActiLife's results. 

\subsection{Definitions}

\begin{figure*}
\caption{Sleep science definitions on an example accelerometer data extract}
\centering
   \includegraphics[scale=0.25]{ActivityExample.jpg}
\label{fig:sleep_example}
\end{figure*}

\begin{table*}
\centering
\caption{Relevant sleep science equations~\cite{CIber:AASM07}}
\centering
\begin{tabular}{c|c}
    \hline
    Sleep Period &  $ \big[ \text{Sleep Onset Time}, \text{Sleep Awakening Time} \big] $ \T\B \\
    \hline
    Sleep Period Duration  &   $ \left \| \text{Sleep Awakening Time} - \text{Sleep Onset Time}  \right \| $ \T\B \\
    \hline
    Wake After Sleep Onset (WASO)    &  $ \sum_{\text{n}=\text{onset}}^{\text{awake}} \left \| \text{Wakefulness} \right \| $ \T\B \\
    \hline
    Latency    &  $ \big[ \text{Preceding Sedentary Time}, \text{Sleep Onset Time} \big] $ \T\B \\
    \hline
    Total Minutes in Bed    &   $ \left \| \text{Sleep Awakening Time} - \text{Preceding Sedentary Time} \right \| $ \T\B \\
    \hline
    Total Sleep Time    &   $ \left \| \text{Sleep Period Duration} - \text{WASO} - \text{Latency}  \right \| $ \T\B \\
    \hline
    Sleep Efficiency    &  $ \text{Total Sleep Time} / \text{Total Minutes in Bed}  $ \T\B \\
    \hline
\end{tabular}
\label{tab:sleep_eqs}
\end{table*}

To apply our methodology to the area of sleep science, it is important to note the definitions mentioned in this section. In traditional sleep study literature, a sleep period is bounded between the sleep-onset-time and sleep-awakening-time~\cite{CIber:AASM07}. Experts characterise the sleep-onset-time as the first minute after a self-reported bedtime, that is followed by 15 minutes of continuous sleep~\cite{sadeh2000sleep}. We propose a modified definition, that allows for automatic evaluation and deems sleep diaries unnecessary. As a result, we can infer the ``bedtime" of an individual in reverse, based on their sedentary activity before the onset of sleep. Epoch records that contain no triaxial movement, 0 steps taken, and an inclinometer output of not lying down, are candidate sleep records, and are further tested  for whether they are a component of the sleep period. We define sleep-onset-time as the first candidate epoch record in a series of 15 continuous candidate sleep minutes. Likewise, the sleep-awakening-time is defined as the last epoch record in a series of 15 continuous candidate sleep minutes, that is followed by 30 continuous non-candidate sleep minutes, (i.e. 30 minutes of active awake time). The sleep period duration can be computed as the time passed between sleep onset and sleep awakening.

Within the sleep period, there are periods of unrest or wakefulness. For example, when a user re-adjusts positions, or uses the bathroom. If the duration of movement exceeds 5 consecutive minutes of activity, it is marked as a time of ``wakefulness." The total sum of all moments of wakefulness is referred to as wake-after-sleep-onset, also known as WASO. 

Immediately preceding the start of the sleep onset, is the time-in-bed, which quantifies the sedentary time an individual spends before they have fallen asleep. This sedentary time can be observed in the actigraph accelerometer data. The time that the preceding sedentary activity begins until the time of the sleep onset is called the sleep latency. 

From the aforementioned values, total sleep time and an overall sleep efficiency score can be deduced. Total sleep time covers the defined sleep period, less the wake after sleep onset time and less the latency. Lastly, sleep efficiency is the ratio of total sleep time to total minutes in bed. All of the above definitions are summarised in Table~\ref{tab:sleep_eqs}, and visualised in Fig.~\ref{fig:sleep_example}. In this study, we use sleep efficiency as the metric to measure sleep quality~\cite{JCacioppo:PsycSci02} among other metrics such as latency, wake after sleep onset, awakening index, total sleep time, etc.~\cite{SScholle:SleepMedicine11}.


\section{Methodology}
Our methodology for RAHAR is shown algorithmically in Fig.~\ref{algo:rahar}. We elaborate on the details of our algorithm in the sequel.
\begin{figure}
\begin{algorithmic}[1]
\STATE{\textbf{input:} Raw accelerometer data}
\STATE{\textbf{output:} Time-series segments with activity intensity level annotations}
\FORALL{segment (daily or otherwise)}
	\FORALL{epoch (minutes, hour, etc.)}
		\STATE{implement activity cut points}
	\ENDFOR
	\STATE{change points $\gets$ implement hierarchical divisive estimation}
	\STATE{change point intervals $\gets$ divide time series by change points}
\ENDFOR
\FORALL{change point interval}
	\STATE{activity mode $\gets$ statistical mode of cut points}
\ENDFOR
\end{algorithmic}
\caption{Algorithm for Robust Automated Human Activity Recognition (RAHAR).}
\label{algo:rahar}
\end{figure}




\begin{figure*}
\centering
  \caption{Classification labelling of each change point interval during an example awake time}
  \includegraphics[width=1.0\linewidth]{RAHAR.png}
  \label{fig:change_point_labeling}
\end{figure*}

\subsection{Pre-Processing}
The accelerometer of choice, Actigraph GT3X+, sampled each person's activity at 30-100 Hertz. The stored data included the triaxial accelerometer coordinates as well as a computed epoch step count based on the vertical axis, and post-processed inclinometer orientation.  This raw data was downloaded and aggregated to a minute-by-minute granularity. An epoch of one minute was selected in order to optimise the interpretability of the physical activity~\cite{KGabriel:IJBNPA10}, as well as for implementing the state-of-the-art cut point methodology~\cite{troiano2008physical}. In other contexts, a different granularity may be sufficient.

\subsection{Automated Annotation and Segmentation}
Due to the context of sleep disorders, sleep periods needed to be annotated within the raw ActiGraph output. Candidate sleep records, epochs with no triaxial movement, 0 steps taken, and an inclinometer output of not lying down, were identified in the time series and tested to find the sleep onset time, and sleep awakening time. The details of this terminology is elaborated in the preliminaries section. All test instances that fell within these two boundary times, were annotated as ``Sleep," and constituted the sleep period.  

Whilst analysing the data, we found that several participants had multiple sleep periods in a day, implying that they took daily naps or followed a polyphasic, or biphasic, sleep pattern. Upon closer analysis of the length and time of the sleep period, no discernible patterns were visible. Thus we opted to segment the time series by the end of a sleep period rather than the traditional approach of segmenting by day. Each sleep period was linked to its preceding activity, extending until the previous sleep period. We refer to these segments as sleep-wake segments. The result of this decision is that the activity immediately before each sleep period is used for the correlation analysis for its subsequent sleep period, rather than the total for that day. 

\begin{figure*}
    \centering
     \caption{ROC curves for sleep efficiency}
    \begin{subfigure}[b]{0.49\linewidth}       
        \centering
        \caption{RAHAR}
        \includegraphics[width=\linewidth]{Ours_ROC.png}
        \label{fig:roc_RAHAR}
    \end{subfigure}
    \begin{subfigure}[b]{0.49\linewidth}       
        \centering
        \caption{Sleep Expert + ActiLife}
        \includegraphics[width=\linewidth]{SleepExpert_ROC.png}
        \label{fig:roc_SEAL}
    \end{subfigure}
    \label{fig:roc_curve}
\end{figure*}

\begin{table*}
\centering
\caption{Sleep efficiency results}
	\begin{tabular}{c|cc|cc|cc|cc|cc}
	    \hline
	    &\multicolumn{2}{c}{AU-ROC} & \multicolumn{2}{c}{F1 Score}& \multicolumn{2}{c}{Recall}& \multicolumn{2}{c}{Precision}& \multicolumn{2}{c}{Accuracy}
	    \\
	    \hline
	    &				SE+AL	 &	 RAHAR	 & 	SE+AL 	& 	RAHAR 	&	 SE+AL 	& 	RAHAR 	& 	SE+AL 	& 	RAHAR 	&	 SE+AL 	& 	RAHAR 
	    \\
	    \hline
	    \hline
	    Ada		&	0.7489	&	0.8132	&	0.5574	&	0.6885	&	0.5484	&	0.5526	&	0.5667	&	0.9130	&	0.6966	&	0.7206\\
	    RF 		&	0.8115	&	0.8746	&	0.6885	&	0.7500	&	0.6774	&	0.6316	&	0.7000	&	0.9231	&	0.7865	&	0.7647\\
	    SVM		&	0.7497	&	0.7895	&	0.3721	&	0.7077	&	0.2581	&	0.6053	&	0.6667	&	0.8519	&	0.6966	&	0.7206\\
	    LogR 		&	0.5884	&	0.8649	&	-		&	0.6875	&	-		&	0.5789	&	-		&	0.8462	&	-		&	0.7059\\
	    \hline
	    Average	&	0.7246	&	0.8355	&	0.5393*	&	0.7154*	&	0.4946*	&	0.5965*	&	0.6445*	&	0.8960*	&	0.7266*	&	0.7353*\\
	    \hline
	    \multicolumn{11}{@{}l}{{\scriptsize * logistic regression (LogR) score is not included in averaging.}} \\
	\end{tabular}
	\label{tab:slp_eff_res}
\end{table*}

\subsection{Activity Mode Detection}


The actigraph accelerometer data contains post-filtered ``counts" for each of the axes. These counts quantify the frequency and intensity of the user's activity\footnote{http://actigraphcorp.com/wp-content/uploads/2015/06/ActiGraph-White-Paper\_What-is-a-Count\_.pdf}. Using Troiano's cut point scale~\cite{troiano2008physical}, the age of a user, and their accelerometer triaxial count, each epoch is labeled with an intensity level: Sedentary, Light, Moderate, or Vigorous. Since each epoch is 1 minute in length, this provides an unnecessary granularity to an individual's activity levels and is highly subject to noise. We ``smooth" the activity intensity levels over activity modes using change point detection.

Once the time series is segmented into sleep-wake segments, we identify the distinctive activity modes using the multiple change point detection algorithm, hierarchical divisive estimation~\cite{NJames:JSS15}. We tested the change points to a statistical significance level of 0.01 and used a maximum number of random permutations of 99. Each change point result is treated as the interval boundaries for distinctive activity modes. 
 
Each sleep-wake segment now consists of a series of change point intervals. The activity intensity classification label for each change point interval is computed by taking the statistical mode of the minute-by-minute labels over every epoch existing within the interval. Fig.~\ref{fig:change_point_labeling} illustrates the classification labelling of an individual's awake time.

\subsection{Modeling}
In sleep science, sleep quality is defined by a number of metrics, including total sleep time, wake after sleep onset, awakening index, and sleep efficiency~\cite{SScholle:SleepMedicine11}. In our analysis, we focus on sleep efficiency metric for our experiments~\cite{JCacioppo:PsycSci02}. Sleep efficiency is computed as a numerical value ranging from 0 to 1. According to specialists, a sleep efficiency below 0.85 (i.e., 85\%) indicates poor sleep quality. Thus, each sleep period can be classified as having ``good sleep efficiency" or ``poor sleep efficiency"~\cite{williams1974electroencephalography}. 

To model the effect of daily physical activity on sleep, the duration of each intensity level of activity was aggregated over the sleep-wake segment. The percentage of awake time in each mode was used as the model input. 



\section{Experiments and Results}

\begin{figure*}
    \centering
     \caption{Comparison of the performance of random forest model for each approach}
    \begin{subfigure}[b]{0.49\linewidth}       
        \centering
        \caption{ROC}
        \includegraphics[width=\linewidth]{RandomForest_comp.png}
        \label{fig:rf_roc_comp}
    \end{subfigure}
    \begin{subfigure}[b]{0.49\linewidth}       
        \centering
        \caption{Sensitivity-Specificity}
        \includegraphics[width=\linewidth]{SensSpec_comp.png}
        \label{fig:rf_ss_comp}
    \end{subfigure}
    \label{fig:rf_comp}
\end{figure*}

\begin{table*}
\centering
\caption{Sleep efficiency -- sensitivity and specificity}
	\begin{tabular}{c|cc|cc|cc|cc}
	    \hline
	    &\multicolumn{2}{c}{AU-ROC} & \multicolumn{2}{c}{F1 Score}& \multicolumn{2}{c}{Sensitivity}& \multicolumn{2}{c}{Specificity}
	    \\
	    \hline
	    &				SE+AL	 &	 RAHAR	 & 	SE+AL 	& 	RAHAR 	&	 SE+AL 	& 	RAHAR 	& 	SE+AL 	& 	RAHAR 
	    \\
	    \hline
	    \hline
	    Ada		&	0.7489	&	0.8132	&	0.5574	&	0.6885	&	0.5484	&	0.5526	&	0.7759	&	0.9333\\
	    RF 		&	0.8115	&	0.8746	&	0.6885	&	0.7500	&	0.6774	&	0.6316	&	0.8448	&	0.9333\\
	    SVM		&	0.7497	&	0.7895	&	0.3721	&	0.7077	&	0.2581	&	0.6053	&	0.9310	&	0.8667\\
	    LogR 		&	0.5884	&	0.8649	&	-		&	0.6875	&	-		&	0.5789	&	-		&	0.8667\\
	    \hline
	    Average	&	0.7246	&	0.8355	&	0.5393*	&	0.7154*	&	0.4946*	&	0.5965*	&	0.8505*	&	0.9111*\\
	    \hline
	    \multicolumn{9}{@{}l}{{\scriptsize * logistic regression (LogR) score is not included in averaging.}} \\
	\end{tabular}
	\label{tab:sens_spec}
\end{table*}

RAHAR is fundamentally a feature extraction algorithm for HAR in the context of quantifying daily physical activity levels of individuals. We therefore test the quality of activity recognition by RAHAR as compared to an expert-based HAR using a tool on continuous physical activity data from a wearable sensor. Since there is \textit{no} ground truth on human activity in this context, our objective is to evaluate which HAR approach leads to better quality models for sleep research, i.e., models for predicting sleep quality, specifically, sleep efficiency.

We selected four models for evaluating the performance of RAHAR against the performance of an expert-based HAR using a tool on the described actigraphy dataset: logistic regression, support vector machines with radial basis function kernel, random forest, and adaboost.

\begin{itemize}
  \item Logistic Regression (LogR): We chose this model because it is an easily interpretable binary classifier. It is also relatively robust to noise, which as explained earlier is a complication on data collected in an uncontrolled environment.\footnote{Even though we included logistic regression (LogR) in our experiments, it is important to note that LogR model failed to stratify the dataset successfully for the state-of-the-art baseline approach, and predicted all cases to be in a single class. Therefore, we excluded LogR score of RAHAR from analysis whenever corresponding LogR score of the state-of-the-art baseline approach was not available.}
   \item Support Vector Machine (SVM): This model was selected because it, also, is a binary classifier. We chose a radial basis function kernel, and so it differs from logistic regression in that it does not linearly divide the data. 
    \item Random Forest (RF): This model was tested as an alternative because of its easy straightforward interpretation, which is particularly relevant in the healthcare or consumer domains. It also is not restricted to linearly dividing the data. 
  \item Adaboost (Ada): Lastly, Adaboost was tested because it is less prone to overfitting than random forest. 
\end{itemize}

For comparison purposes, we use the results from a sleep specialist using Actigraph's ActiLife software as a baseline. The sleep specialist segmentation of the ActiLife results uses the preceding day's activity for each sleep period, and aggregates the activity to an epoch of an hour. ActiLife requires the sleep specialist to manually adjust the sleep period boundaries, and then automatically computes the efficiency and other sleep metrics. 


 
Figs.~\ref{fig:roc_RAHAR} and~\ref{fig:roc_SEAL} show the ROC curves for RAHAR and the sleep expert using ActiLife software (denoted as ``SE+AL"), respectively, while Table~\ref{tab:slp_eff_res} summarises the results for both RAHAR and SE+AL. One of the most important performance measures for HAR is the area under ROC (AU-ROC). Based on AU-ROC scores, both RAHAR and SE+AL performed best with random forest model. Furthermore, SE+AL achieved an average AU-ROC of 0.7246 whereas RAHAR achieved 0.8355, a 15\% improvement of AU-ROC score on average by our algorithm as opposed to the sleep expert using ActiLife. With an AU-ROC score of 0.5884 for SE+AL approach, the logistic regression model was, however, unable to stratify the dataset, and so predicted all cases to be in a single class. We considered this to be a failure of the logistic regression model for this problem, and thus, did not include its results in our discussion whenever it was appropriate to do so. For this reason, the misleading results have also been removed from Table~\ref{tab:slp_eff_res}.

Another important performance measure for HAR is the F1 score, which is computed as the harmonic mean of precision and recall. According to Table~\ref{tab:slp_eff_res}, RAHAR performed better than SE+AL in terms of precision and recall for all models, and hence, yielded significantly higher F1 scores. Specifically, F1 score for RAHAR, on average, was 0.7154 whereas it was 0.5393 for SE+AL (excluding logistic regression in both cases), yielding a solid margin of about 0.18 points (i.e., more than 30\% improvement). On the other hand, the accuracy scores, on average, were 0.7353 for RAHAR and 0.7266 for SE+AL (again excluding logistic regression), and exhibited a relatively less significant difference still in favor of RAHAR.









\section{Discussion of Results in Medical Context}

In this section we discuss the results of the best performing model and its broader impact to the area of sleep science. As seen in Fig.~\ref{fig:roc_curve} random forest and logistic regression were the two best performing models with the RAHAR algorithm. Based on the desired threshold value of true positive rate, TPR, (i.e., sensitivity), either model could be preferred to minimize false positive rate, FPR, (i.e., 1-specificity), which is equivalent to maximizing specificity. Random forest was also the best performing model for the SE+AL approach as mentioned earlier. If we compare the ROC as well as the sensitivity-specificity plots of the best model of each approach (i.e., random forest), we see that RAHAR outperforms SE+AL almost always as illustrated in Fig.~\ref{fig:rf_comp}.

Table~\ref{tab:sens_spec}, on the other hand, summarises sensitivity and specificity scores for RAHAR and SE+AL. Average sensitivity score for SE+AL and RAHAR across all models except logistic regression were 0.4946 and 0.5965, respectively. In other words, average sensitivity score for RAHAR is 20\% higher than that of SE+AL. As for specificity, RAHAR with an average score of 0.9111 outperforms SE+AL with an average score of 0.8505, which corresponds to a 7\% improvement.

As we seek to determine in our study whether a person had a ``good quality sleep" based on his physical activity levels during awake period prior to sleep, a false positive occurs when the model predicts ``good quality sleep" when the person actually had a ``poor quality sleep." Therefore, the number of false positives needs to be kept at a minimum for a desired number of true positives. In other words, a high specificity score is sought after while keeping the sensitivity score at the desired level. As can be seen from Fig.~\ref{fig:rf_ss_comp} with this perspective in mind, for a large range of sensitivity scores, RAHAR achieved higher specificity scores almost all the time than SE+AL did. For example, RAHAR achieved a sensitivity score of 0.9 with a specificity score of 0.8 whereas SE+AL remained at a specificity score of 0.6 for the same sensitivity threshold.

In summary, RAHAR outperforms state-of-the-art procedure in sleep research in many aspects. However, its application is not limited to sleep and it can be used for understanding and treatment of other health issues such as obesity, diabetes, or cardiac diseases. Moreover, RAHAR allows for fully automated interpretation without the necessity of manual input or subjective self-reporting.

Given the current interest in deep learning, a natural question that may arise is why an approach based on feature extraction and model building has been used instead of using deep learning models directly on the raw sensor data for HAR. In medical community, the explainability of a model is of utmost important as the medical professionals are interested in learning cause-and-effect relationships and using this knowledge to support their decision making processes. In this particular case, for example, sleep experts are interested in understanding how and when certain physical activity levels effect sleep in order to make decisions to improve sleep quality of individuals accordingly. However, it is an interesting idea to explore deep learning to see what is the best model from a model accuracy perspective to understand the limits of the value of continuous monitoring of individuals' physical activity, not only from a medical perspective in particular but also from a ``quantified self" perspective in general.



\section{Conclusion}
In this paper, we presented a robust automated human activity recognition (RAHAR) algorithm for multi-modal phenomena, and evaluated its performance in the application area of sleep science. We tested the results of RAHAR  against the results of a sleep expert using ActiLife for predicting sleep quality, specifically, sleep efficiency. Our model a) automated the activity recognition, and b) improved the current state-of-the-art results, on average, by ~15\% in terms of AU-ROC and ~30\% in terms of F1 scores across different models. Automating the human activity recognition puts sleep science evaluation in the hands of wearable device users. This empowers users to self-monitor their sleep-wake habits, and take action to improve the quality of their life. The improved results demonstrate the robustness of RAHAR as well as the capabilities of implementing the algorithm within clinical software such as ActiLife. 

The application of RAHAR is, however, not limited to sleep science. It can be used to monitor physical activity levels and patterns of individuals with other health issues such as obesity, diabetes, and cardiac diseases. Besides, RAHAR can also be used in the general context of the ``quantified self" movement, and provide individuals actionable information about their overall fitness levels.









\bibliographystyle{IEEEtran}
