%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{subcaption}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
%\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}{Corollary}[theorem]

\title{Keep-Alive Caching for the Hawkes process}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[]{Sushirdeep Narayana}
\author[]{Ian A. Kash}
\author[]{\\ snaray25@uic.edu}
\author[]{iankash@uic.edu}
% Add affiliations after the authors
\affil[]{%
    Department of Computer Science \\
    University of Illinois at Chicago \\
    Chicago, Illinois, USA
}

  
  
  
\begin{document}
\maketitle

\begin{abstract}
We study the design of caching policies in applications such as serverless computing where there is not a fixed size cache to be filled, but rather there is a cost associated with the time an item stays in the cache.  We present a model for such caching policies which captures the trade-off between this cost and the cost of cache misses. We characterize optimal caching policies in general and apply this characterization by deriving a closed form for Hawkes processes. Since optimal policies for Hawkes processes depend on the history of arrivals, we also develop history-independent policies which achieve near-optimal average performance. We evaluate the performances of the optimal policy and approximate polices using simulations and a data trace of Azure Functions, Microsoft's FaaS (Function as a Service) platform for serverless computing
\end{abstract}

\section{Introduction}\label{sec:intro}


% Motivation
Datacenters provide a variety of caching services in the interest of reducing latency.  While traditional caches have a fixed size determined by the underlying hardware, datacenter resources are fungible and could be used for a variety of purposes.  For example, in serverless computing (also known as Function-as-a-Service) the cloud provider handles all the provisioning and configuration of resources for each user, while the user simply pays for the amount of time of the function execution. The cloud provider could, in principle, keep a container in memory for every user to ensure that function calls can be executed immediately. However, this would be excessively expensive in terms of resources allocated but unused. Instead, user application code which has not been used recently may be kept in persistent storage~\citep{shahrad2020serverless}. Another example involves Time-to-Live (TTL) caches where the cache controller has a timer-based parameter for when to evict objects from the cache. Work on TTL-based caches for Content Delivery Networks (CDNs) has observed that hit rate guarantees can be provided by controlling the cache space used by a customer and suggested pricing based on this space consumption~\citep{basu2018adaptive}.

% Problem
In contrast to traditional caches, in keep-alive caching the decision is not about \em{what} object to evict from the cache when space is needed.  Rather the relevant question is \em{when} is it worth keeping the object in the cache.  Such a decision trades off the opportunity cost of not putting those resources to other uses (which may be caching other objects or some entirely different purpose) against the cost of a cache miss. While prior work has examined the trade-off between cache misses and the overall cache size~\citep{basu2018adaptive}, we focus on precise answers regarding optimal and approximately optimal caching decisions for a single object.  

% Solution to the problems
We model this problem as trading off between the expected {\em time} an object is kept in the cache before it is next accessed and the probability of a {\em cache miss}.  Since a given object has a fixed size, this lets us precisely quantify the two relevant costs.  We characterize the optimal cache policy and show how this characterization yields simple policies for objects whose arrivals have a monotone hazard rate.  This includes Poisson and Hawkes processes~\citep{laub2021elements}.  

One downside of the optimal policy for Hawkes processes is that it depends on the history of arrivals for the object. So, it needs to be recalculated after every arrival.  A natural alternative is a simple TTL-style policy which keeps the object in the cache for a fixed amount of time after each arrival before evicting it. Such policies have been used in practice in serverless systems such as AWS Lambda and Azure Functions~\citep{shahrad2020serverless}.  The classic ski rental problem shows that, if the TTL is optimized solely based on costs, the resulting policy is a worst-case 2-approximation. We derive an approach to optimize the TTL based on the parameters of the Hawkes process (but still independent of the history of arrivals).

We simulate the average performance of all three policies (optimal, optimized-TTL, and fixed-TTL) on arrival requests that follow a Hawkes process. The simulations illustrate how an optimized-TTL in general and our specific optimization procedure in particular yield near-optimal performance.

We also evaluate these policies on Azure traces released by \citet{shahrad2020serverless}.  Here, Hawkes processes naturally capture applications such as web servers where a recent function invocation makes it more likely for additional invocations to occur (perhaps because the same user makes another request).  We show that applying the optimal policy to the 25\% of applications best fit by a  Hawkes process yields meaningful improvements over the fixed-TTL policy at the scale of a datacenter.  Furthermore, our optimized-TTL approach again yields near-optimal performance.  

\section{Related Work} \label{sec:related-work}


Our modeling decisions draw motivation from serverless computing. Cloud providers that use Function as a Service (FaaS) serverless models, such as AWS Lambda, Google Cloud Functions, IBM Cloud Functions, and Azure Functions, handle all the system administration and resource allocations for customers. As \citet{jonas2019cloud} highlight, one of the advantages of serverless computing is that the users have to pay solely for the usage of resources of their applications. The customers do not pay for maintaining and starting up the resources when their application is not running. On the other hand, they point out the challenges for customers in terms of the unpredictable latency of cold starts \footnote{Cold start refers to a cache miss in the context of serverless computing}. \citet{wang2018peeking} measure the cold start latency of popular FaaS serverless platforms. They discuss the historical usage of fixed caching policies by these platforms and quantify the behavior assumed by our model: cloud providers regularly shutting down instances providing FaaS services to reallocate those resources to other uses. \citet{lin2020serverless} discuss the desirability of approaches, such as ours, that could allow customers to agree to pay a higher price in exchange for personalized warm-start performance guarantees.  

Closest to our work, \citet{shahrad2020serverless} propose a keep-alive cache policy based on an approach to predict future arrivals. They show a significant reduction in memory use when compared with the standard fixed keep-alive policies. In contrast, we take the predictions as given and optimize decisions based on them. Other works that have considered approaches to mitigating cold starts include having multiple tiers of hardware~\citep{roy2022icebreaker}, pre-warming just the networking components~\citep{mohan2019agile}, caching common Python libraries~\citep{oakes2018sock}, and overbooking~\citep{kesidis2019overbooking}. \citet{fuerst2021faascache} build a system which allows the size of a FaaS cache to be scaled dynamically based on arrival rates, but they do not provide any theoretical analysis. ~\citep{romero2021faa,mvondo2021ofc} have looked at caching of other aspects pertaining to function execution such as data access.

Most work on cache algorithms works to optimize their decisions about which items to evict when the cache is full.  Closer to our work is work on TTL caches which evict objects after a fixed amount of time.  These are studied both for applications in settings like Content Delivery Networks (CDNs) as well as their use as a more tractable way of approximately analyzing the performance of traditional caches introduced by \citet{che2002hierarchical}. \citet{berger2014exact} analyze TTL cache networks. They consider the inter-request arrival times of objects and TTL values to be two independent renewal processes. They  study three types of TTL cache policies based on the TTL resets and eviction times (we show what they term $\mathcal{R}$ policies are optimal in our setting). \citet{basu2018adaptive} design two TTL based caching algorithms for CDNs like Akamai, but focus on working within a fixed target cache size. \citet{ferragut2016optimizing} study optimal TTL cache policies. Like us they examine the consequences of the monotonicity of the Hazard rate of the inter-arrival distributions. However, unlike us they focus on trading off hit rate and overall cache size.  They formulate the TTL caching problem as a non-linear optimization problem with non-linear constraint, and prove that the convexity of the optimization problem is related to monotonicity  of the Hazard rate. They provide explicit solutions when the inter-arrival distribution follows the Zipf's law. They evaluate the performance of their optimal policies by comparing their policy hit rate against the LRU (Least Recently Used) policy hit rate. \citet{ali2011survey, alamash2004overview} give a broader introduction to and survey of web page caches.


Closer to our work, \citet{dehghan2019utility} share our notion that users may have utilities which depend on the hit rate and they include the possibility of increasing the cache size at a cost. They assume the utility functions for each file to be concave and the arrival requests to follow a Poisson point process, while our model is more general. \citet{babaie2019cache} extend this to a cache hierarchy network. \citet{panigrahy2017and} allow heterogeneity of user preferences for hit rates, but focus on network of capacity-constrained caches with requests modeled as simple Poisson arrivals.

There is substantial work in the AI literature on other optimization problems that arise in the context of cloud computing such as pricing~\cite{blocq2014shared,friedman2015dynamic,babaioff2017era,kash2019simple,dierks2021competitive}, reservation scheduling~\cite{azar2015truthful,wang2015selling}, information elicitation~\cite{ceppi2015personalized,dierks2019cluster}, and fair division of resources~\cite{parkes2015beyond,kash2014no,friedman2014strategyproof,narayana2021fair}.


\section{Model} \label{sec:model}

We consider a cache system where there is a cost for an object to stay in the cache. There is also a cost for a cache miss. For concreteness we describe our model using terms from one natural application (serverless computing), but it is also relevant for other applications such as CDNs~\cite{ferragut2016optimizing}. First, we describe a cache policy in this setting. Next, we detail the parameters of the cache associated with the cloud provider. Then, we express the cost of a cache policy. Since the time of arrival of future requests is unknown, we assume that the cloud provider has access to the distribution of the arrival requests, as they could be estimated by the past arrival requests. 

\textbf{Cache Policy}
We assume the cache has infinite capacity because the provider can always dedicate more resources to its serverless offering, which differentiates our model from those driven by capacity. Thus, for us a cache policy is not about which object should be evicted when space is needed but rather how long (or more generally when) we should keep it in the cache.

Let $\mathcal{H}_{m-1} = \{t_{1}, \; t_{2}, \; \cdots, \; t_{m-1}\}$ denote the history of $m-1$ previous requests for the application.  Here, $t_{1}, \; t_{2}, \cdots \text{and}, \; t_{m-1}$ denote the time of 1st, 2nd , $\cdots$ and, (m-1)-st request respectively. Let $x_{m} = t_{m} - t_{m-1}$ denote the \textit{m}-th inter-arrival time. Our analysis of policies is based on these inter-arrival times.

A cache policy $\pi$ of an application is a sequence of time windows during which a possible requested application is moved in and out of the cache. A keep-alive window is the time interval during which the application is kept in the cache. The policy is reset after each arrival request for an application. More formally, a policy $\pi(x|\mathcal{H}_{m-1})$ can be expressed as an indicator function of a sequence of keep-alive windows for the $m$-th inter-arrival as: 
\begin{equation*}
\pi(x|\mathcal{H}_{m-1}) =  \begin{cases}1 \; , \quad x \in [L_{0} , \; L_{1}] \; \bigcup \cdots \; [L_{2k-2}, L_{2k-1}]\\ 0 \; , \quad \text{otherwise} \end{cases}  
\end{equation*}

\noindent Here, $L_{2i}$ denotes amount of time after $t_{m-1}$ (the time of the most recent request) where the $i$-th keep-alive window starts, while $L_{2i+1}$ denotes the point where the $i$-th keep-alive window end for some $k$ and all \; $i \in \{0,\; 1,\;2, \cdots,k-1\}$.  

A policy $\pi(\cdot)$ consisting of a single keep-alive window across the $m$-th inter-arrival can be represented by the length of pre-warming window and the length of keep-alive window. 
That is, if
\begin{equation*}
\pi(x) =  \begin{cases}1 \; , \quad \text{for} \quad x \; \in [\tau_{\text{pw}}, \; \tau_{\text{pw}} + \tau_{\text{ka}}] \\ 0 \; , \quad \text{otherwise} \end{cases}   
\end{equation*}

\noindent then, the policy can be summarized by the parameters $\tau_{\text{pw}}$, and $\tau_{\text{ka}}$. The parameter $\tau_{\text{pw}}$ refers to the length of pre-warming window which is the time interval during which the policy waits before bringing in the possibly requested application into the cache. The parameter $\tau_{\text{ka}}$ denotes the length of keep-alive window which is the time interval when the application is kept in the cache. The pre-warming window is especially useful when the requests for an application occur at regular intervals, such as requests governed by a timer function.   

\textbf{Cost of a Cache Policy}
In the context of serverless computing, an application encounters a \textit{warm} start (cache hit) if its invocation (arrival request) occurs when the keep-alive window is active. An application has a \textit{cold} start (cache miss) when the keep-alive window is not active during its invocation. Thus, we can describe the costs of a policy using:
\begin{itemize}
    \item $c_{cs}$ denotes the cost associated with a cold start.  This is primarily the latency cost experienced by the user when there is a cold start.  It may also include the cost for the cloud provider  to load the application image (requested object) into the cache.
    \item $c_{p}$ denotes the cost per unit time for the cloud provider for keeping the application image (requested object) active in the cache.
\end{itemize}
    
\noindent  The cost of a policy with a single keep-alive window
%$(\tau_{\text{pw},\mathcal{H}_{m-1}}, \tau_{\text{ka},\mathcal{H}_{m-1}})$ 
$(\tau_{\text{pw}}, \tau_{\text{ka}})$ 
when the inter-arrival time is $x_m$ is given by $cost(x_{m}, (\tau_{\text{pw}, \mathcal{H}_{m-1}}, \tau_{\text{ka}, \mathcal{H}_{m-1}}))=$
\begin{align*} \begin{cases}
&c_{cs}, \mbox{ if }  x_m < \tau_{\text{pw}, \mathcal{H}_{m-1}} \\
&c_{p} x_m, \mbox{ if }   \tau_{\text{pw}, \mathcal{H}_{m-1}} \; \leq x_m \; \leq \; \tau_{\text{pw},\mathcal{H}_{m-1}} + \tau_{\text{ka},\mathcal{H}_{m-1}}  \\
&c_{p}  \tau_{\text{ka}, \mathcal{H}_{m-1}} + c_{cs}, \mbox{ if }  x_m > \; \tau_{\text{pw}, \mathcal{H}_{m-1}} + \tau_{\text{ka}, \mathcal{H}_{m-1}} 
\end{cases}
\end{align*}
The three cases for the cost of a policy correspond to that of a cold start if the invocation occurs during pre-warming, a warm start if the invocation occurs when the keep-alive window is active, and a cold start if the invocation is after the keep-alive window being active, respectively. In the more general case of multiple keep-alive windows the cost $c_p$ must be paid for all prior windows as well (see Equation 1 in the proof of Lemma \ref{expect_cost} for the full formula). This formula implicitly assumes that the time and costs associated with a decision to load the application into the cache (e.g. in the case of multiple windows where it may be moved in and out repeatedly) are zero, and so is in that sense a lower bound on the ``true'' cost. However, we show that for Hawkes processes in particular a single window which starts immediately ($\tau_{\text{pw}} = 0$) is optimal. Such a policy never has to pay such cost except of course during cold starts, but those are already accounted for by $c_{cs}$.  

To compute the cost of a cache policy, $x_{m}\;$ must be known. However, the cloud provider  does not have access to such information. Instead, they can estimate the distribution of $x_m$ from past arrival requests $\mathcal{H}_{m-1}$. We model the application invocations as a point process. Let $f(x| \mathcal{H}_{m-1})$ and $F(x| \mathcal{H}_{m-1})$ denote the conditional probability distribution and the conditional cumulative distribution of an invocation at $x$ units after the most recent invocation given the history $\mathcal{H}_{m-1}$ of previous invocations, respectively. We assume that both are continuous. Let the hazard rate of the based on the inter-arrival be expressed as $\lambda(x|\mathcal{H}_{m-1}) = \displaystyle \frac{f(x|\mathcal{H}_{m-1})}{1 - F(x|\mathcal{H}_{m-1})}$. We use these probability distributions to derive an expression for the expected cost of a cache policy.  


\section{Characterization of Optimal Policies} \label{sec:opt-policies}

We start this section by deriving the expected cost of a caching policy over an inter-arrival. Then in Theorem \ref{opt_policy}, we characterize the optimal policy for application invocations in terms of the behavior of the Hazard rate. We apply this to derive optimal policies when the arrival of application invocations follow a Poisson process or a Hawkes process, which are specific cases of arrival requests that have a constant Hazard rate and a monotone decreasing Hazard rate respectively.     

Proofs omitted from this and subsequent sections of the paper can be found in the Appendix.

\begin{lemma} \label{expect_cost}
The expected cost of a cache policy over an inter-arrival is
$\mathbb{E}[cost(\pi(\cdot |\mathcal{H}_{m-1}))] =$
\begin{equation*}
c_{cs} + \displaystyle \int_{0}^{\infty} \pi(x|\mathcal{H}_{m-1}) \cdot g(x|\mathcal{H}_{m-1})\; dx ,    
\end{equation*} 
where the instantaneous cost at \textit{x} units after the most recent arrival at $t_{m-1}$ is
\begin{equation*}
g(x|\mathcal{H}_{m-1}) = c_{p} \cdot \big(1 - F(x|\mathcal{H}_{m-1})\big) - c_{cs} \cdot f(x| \mathcal{H}_{m-1}).
\end{equation*} 
\end{lemma}

Using the characterization of the cost of a policy from Lemma~\ref{expect_cost}, Theorem \ref{opt_policy} observes that the sign of $g$ (which determines the optimal policy) is entirely determined by the hazard rate and costs. 


\begin{theorem} \label{opt_policy}
The points $L_i$ of the sequence of keep-alive windows over an inter-arrival for the optimal policy $\pi_{\text{opt}}(\cdot | \mathcal{H}_{m-1})$ are at 0, $\infty$, or solutions to the equation $c_{p} - (c_{cs} \cdot \lambda(x | \mathcal{H}_{m-1})) = 0 \;$ where the sign changes.
\end{theorem}


We now examine several special cases of Theorem~\ref{opt_policy} with particularly natural structure.  Our first, Corollary \ref{optimal_decreasing}, describes the optimal policy when the hazard rate of the arrival requests are (weakly) decreasing. This case includes Poisson and Hawkes processes and is the main case we evaluate in our simulations.

\begin{corollary} \label{optimal_decreasing}
If the hazard rate is weakly decreasing, the optimal policy $\pi_{\text{opt}}(x|\mathcal{H}_{m-1})$  is a single keep-alive window starting at $\tau_{pw}=0$, and is given by \\ $\pi_{\text{opt}}(\cdot |\mathcal{H}_{m-1}) =  \begin{cases}1 \; , \quad \forall x \in [0 , \; \tau_{\text{opt}, \mathcal{H}_{m-1}}]\\ 0 \; , \quad \text{otherwise}\end{cases},$ where
\begin{enumerate}
    \item $\tau_{\text{opt}, \mathcal{H}_{m-1}} = \infty$, i.e., to have the keep-alive window always be active when  $\forall x, \; \displaystyle \frac{c_{p}}{c_{cs}} < \; \lambda(x | \mathcal{H}_{m-1}), \; $
    \item $\tau_{\text{opt}, \mathcal{H}_{m-1}} = 0$, i.e.,  not cache and always have a cold start when $\displaystyle \frac{c_{p}}{c_{cs}} >\; \lambda(x = 0 | \mathcal{H}_{m-1})\;$
    \item a keep-alive window of length $\tau_{\text{opt}, \mathcal{H}_{m-1}}$ given by the solution to the equation \\ $ \displaystyle  \frac{c_{p}}{c_{cs}} = \frac{f(x = \tau_{\text{opt}, \mathcal{H}_{m-1}} | \mathcal{H}_{m-1})}{1 - F(x = \tau_{\text{opt}, \mathcal{H}_{m-1}} | \mathcal{H}_{m-1})}$, otherwise. 
\end{enumerate}
\end{corollary}


Next, Corollary \ref{optimal_increasing}, states the optimal policy when the hazard rate is instead (weakly) increasing. 

\begin{corollary} \label{optimal_increasing}
If the hazard rate is weakly increasing, the optimal policy $\pi_{\text{opt}}(\cdot|\mathcal{H}_{m-1})$  is a single keep-alive window with $\tau_{ka} = \infty$ and a pre-warming window, and is given by \\ $\pi_{\text{opt}}(x|\mathcal{H}_{m-1}) =  \begin{cases}1 , \; \tau_{\text{pw}, \mathcal{H}_{m-1}} \leq x \\ 0, \; \text{otherwise}\end{cases},$ where
\begin{enumerate}
    \item $\tau_{\text{pw}, \mathcal{H}_{m-1}} = 0$, i.e., to have the keep-alive window always be active when $\forall x, \quad \displaystyle \frac{c_{p}}{c_{cs}} < \; \lambda(x | \mathcal{H}_{m-1}), \; $
    \item $\tau_{\text{pw}, \mathcal{H}_{m-1}} = \infty$, i.e., to always have a cold start when $\forall x, \; \displaystyle \frac{c_{p}}{c_{cs}} > \; \lambda(x | \mathcal{H}_{m-1}) \;$. 
    \item $\tau_{\text{pw}, \mathcal{H}_{m-1}}$ satisfies the equation\\ $\; \displaystyle \frac{c_{p}}{c_{cs}} = \displaystyle \frac{f(x = \tau_{\text{pw}, \mathcal{H}_{m-1}} |  \mathcal{H}_{m-1})}{1 - F(x = \tau_{\text{pw}, \mathcal{H}_{m-1}} |  \mathcal{H}_{m-1})}$, i.e., an infinite keep-alive window after a pre-warming window of length $\tau_{\text{pw}, \mathcal{H}_{m-1}}$ when $c_{p} - c_{cs} \lambda(x = 0 | \mathcal{H}_{m-1}) \; > 0$ and changes sign. 
\end{enumerate}
\end{corollary}

More generally, we can combine these to understand 
the optimal policy when the hazard rate has a single peak: it is first increasing and then decreasing.  In this case using both $\tau_{pw}$ and $\tau_{ka}$ is optimal (apart from degenerate cases).  This is the form used by \citet{shahrad2020serverless}, and so our results characterize the class of applications for which their approach could be optimal if properly tuned as well as determining how to optimally tune the parameters.\footnote{They use a simple rule of pre-warming at the $5^{th}$ percentile and ending the keep-alive at the $99^{th}$ percentile while our approach would optimize those based on the costs and distribution characteristics.} One notable example of this class of application is applications triggered by a timer.  We would expect $\lambda$ to be 0 until the timer is due to elapse, rapidly increase as we approach our estimate of when the timer will trigger, and then eventually decrease if we appear to be wrong in our estimate of when the timer will next trigger.

We can also apply Theorem \ref{opt_policy} to applications where the hazard rate of the arrival requests for application invocation has a single valley. This might be the case if the initial invocation is likely to trigger several others (in the same spirit as a Hawkes process) but then once the application finishes there is a gap before it is invoked again, perhaps due to a timer.  As Figure~\ref{opt_single_peak} shows the optimal keep-alive policy has at-most two keep-alive windows, whose lengths can be computed using Theorem \ref{opt_policy}.  Of course, we can apply this approach to more complex situations resulting in even more windows if the distributional information available supports that (e.g. for a timer that only causes an arrival under certain additional conditions).
\begin{figure}[t!]
    \centering
    \includegraphics[height=1.5in,trim={6cm 3cm 7cm 0cm},clip]{images/single_peak.png}
        \caption{Optimal policy with a single valley hazard rate}
    \label{opt_single_peak}
\end{figure}


\subsection{Poisson Process} \label{sec:poisson}

Our first concrete application of Theorem~\ref{opt_policy}  is when the distribution of the arrival requests follows a Poisson process. For the Poisson process, the relevant quantities are:
\begin{itemize}
    \item $f(t) = \lambda \cdot e^{- \lambda \cdot t}$ for $t \geq 0$, where $\lambda > 0$ is a constant parameter.
    \item $F(t) = \displaystyle  1 - e^{-\lambda \cdot t}$.
    \item $ \displaystyle \lambda(t) =\frac{f(t)}{1 - F(t)} = \lambda$.
\end{itemize}


From Corollary~\ref{optimal_decreasing} we see that, barring degeneracy, the optimal policy for a Poisson process is one of two possibilities since the Hazard rate is constant. The optimal policy  when $\displaystyle \frac{c_{p}}{c_{cs}} \leq \; \lambda$ is to have a keep-alive window active at all times, while when $\displaystyle \frac{c_{p}}{c_{cs}} > \; \lambda$ the optimal policy is to always experience a cold start. The cost of the optimal policy when $\displaystyle \frac{c_{p}}{c_{cs}} > \; \lambda $ is $c_{cs}$. Since the expected inter-arrival time is $\displaystyle \frac{1}{\lambda}$, the expected cost of the optimal policy when $c_{p} - (c_{cs} \cdot \lambda) \leq 0$  is $\displaystyle \frac{c_p}{\lambda}$.


\subsection{Hawkes Process} \label{sec:hawkes_optimal}

In this subsection, we explore the more interesting case of optimal policies when the distribution of the application invocations follow a Hawkes process. The Hawkes process is a point process whose hazard rate, or conditional intensity, is given by 
\begin{equation*}
    \lambda(t| \mathcal{H}) = \frac{f(t| \mathcal{H})}{1 - F(t| \mathcal{H})} = \lambda_{0} + \sum_{t_{j} \in \mathcal{H}} \mu(t - t_{j})
\end{equation*}
for all $t > t_{i-1}$, where $\mathcal{H} = \{t_{1}, t_{2}, \cdots, t_{i-1} \}$ denotes the history of invocations of the function, $\lambda_{0}$ refers to the background intensity, and $\mu$ is the excitation function. We limit our analysis to the exponential excitation function because we need to pick a concrete instantiation to solve for a closed form for optimal window length. The exponential excitation function is expressed as, $\mu(t) = \alpha e^{ - \beta \cdot t}$. The constant $\alpha > 0$ captures the increase in the intensity from an arrival, while the constant $\beta > 0$ is the decay rate of the arrival's influence. The conditional intensity of the self-exciting Hawkes process with an exponential excitation function is thus,
\begin{equation*}
    \lambda(t| \mathcal{H}) = \lambda_{0} + \sum_{t_{j} \in \mathcal{H}} \alpha \cdot e^{- \beta \cdot (t - t_{j})}  \quad, \forall \; t > t_{i-1} .
\end{equation*}

Corollary \ref{optimal_decreasing} characterizes the optimal policy when the distribution of arrival requests follows the Hawkes process to be one of the following policies,
\begin{itemize}
    \item The keep-alive window is to always be active with $\tau_{\text{opt}, \mathcal{H}_{m-1}} = \infty$ when, as in the Poisson case, the background intensity is sufficiently high: $\displaystyle \frac{c_{p}}{c_{cs}} < \; \lambda_{0}$.
    \item Experience a cold start with $\tau_{\text{opt}, \mathcal{H}_{m-1}} = 0$ when $\displaystyle \frac{c_{p}}{c_{cs}} > \; \lambda(x| \mathcal{H}_{m-1}))$, after the most recent request.
    \item The keep-alive window is given by the expression
    \begin{align*}
    \tau_{\text{opt}, \mathcal{H}_{m-1}} =& \frac{1}{\beta} \Big( \log \alpha + \log \Big(\sum_{j=1}^{m-1} e^{\beta  (t_{j}-t_{m-1})} \Big) \\& - \log \Big( \frac{c_{p}}{c_{cs}} - \lambda_{0}\Big) \Big)
    \end{align*}
    otherwise. This expression is obtained by substituting the conditional intensity of the Hawkes process in Corollary \ref{optimal_decreasing} and solving for $\tau_{\text{opt}, \mathcal{H}_{m-1}}$ as detailed in the Appendix.
\end{itemize}


\subsection{Optimized-TTL Keep-Alive Windows for Hawkes Processes} \label{subsec:avg_optimal}

Corollary~\ref{optimal_decreasing} provides the optimal history-dependent policy for Hawkes processes. We conclude this section by showing how it also provides motivation for a history-independent heuristic for Hawkes processes. In Section~\ref{sec:simulations} we show this heuristic has strong empirical performance.

Our proposal is to empirically determine a keep-alive window length that works well for ``typical'' windows.  Intuitively, this can be done by sampling a number of histories, computing the optimal policy for each history, and computing a summary statistic.  We find that the average of the optimal policies works well.  We discuss the simulation of a Hawkes process from its parameters in Section~\ref{sec:simulations}

\begin{corollary}
When the parameters of the Hawkes process are such that $c_{p} - (c_{cs} \cdot \lambda(x | \mathcal{H})) = 0$ has a solution, the optimal policy has a history independent lower bound,  and an upper bound expressed as follows
\begin{align*}
\tau_{\text{opt}, \mathcal{H}} & \geq \frac{1}{\beta} \cdot \left(\log \alpha - \log \Big(\frac{c_{p}}{c_{cs}} - \lambda_{0}\Big)\right) \\
\tau_{\text{opt}, \mathcal{H}} & \leq \frac{1}{\beta} \cdot \left(\log \alpha + \log \delta + 1 - \log \Big(\frac{c_{p}}{c_{cs}} - \lambda_{0}\Big)\right)
\end{align*}
\noindent where $\delta$ satisfies
\begin{equation*}
\sum_{i=m-\delta}^{m-1} e^{\beta \cdot (t_i - t_{m-1})} 
\geq \frac{1}{2} \sum_{i=1}^{m-1} e^{\beta \cdot (t_i - t_{m-1})} 
\end{equation*}
\end{corollary}


That is, the most recent $\delta$ arrivals provide at least half the total weight of the history dependent term.  This can be thought of as only having $\delta$ arrivals that are recent enough to matter.  Apart from rare scenarios where $\delta$ is much larger or smaller than typical given the Hawkes process parameters, this bound is relatively insensitive to the exact history due to the log.  In our simulations, particularly Section \ref{subsec:azure_experiments}, and Section \ref{azure_results}, rather than estimate $\delta$ we directly estimate a threshold by simulating sample points for the Hawkes process with the estimated parameters.

We also considered other approaches to deriving history-independent policies.  The solution to the classic ski rental problem shows that setting $\tau_{ka} = c_{cs} / c_p$ always has a cost within a factor of 2 of that of the optimal history-dependent policy for Hawkes processes.  This bound can be tightened by setting $\tau_{ka}$ in a way that depends on the parameters of the Hawkes process.  However, we found that in practice this approach was overly conservative, so we defer its analysis to the appendix.


\section{Simulations} \label{sec:simulations}

\subsection{Performances on Simulated Hawkes Processes} \label{sec:hawkes_sim}

Our theoretical results derived the optimal policy and argued that optimized-TTL, which uses averaging, provided a good heuristic which is independent of history.  To better understand its expected performance, we evaluate both policies on simulated application invocations governed by a Hawkes process. We use \textit{Ogata's modified thinning algorithm} \cite{ogata1981lewis} to generate the samples of the Hawkes process.  We generate 600 sample points of function invocations in a single realization of the Hawkes process. We evaluate the policies by taking the mean over 100 realizations. 

\begin{figure*}[t!]
    \centering
    \begin{subfigure}[t]{0.3\textwidth}
        \centering
        \includegraphics[height=1.4in]{images/CostSimulationZeroWind.png}
        \caption{$c_{p} =1, c_{cs} = 1$}
    \end{subfigure}
    ~
    \begin{subfigure}[t]{0.3\textwidth}
        \centering
        \includegraphics[height=1.4in]{images/CostSimulationFiniteOpt.png}
        \caption{$c_{p} =1, c_{cs} = 10$}
    \end{subfigure}
    ~
    \begin{subfigure}[t]{0.3\textwidth}
        \centering
        \includegraphics[height=1.4in]{images/CostSimulationInfniteWind.png}
        \caption{$c_{p} =1, c_{cs} = 120$}
    \end{subfigure}    
    \caption{Comparisons between optimal and fixed policies with Hawkes process parameters $\lambda_{0} = 0.01, \alpha = 0.5, \beta = 1.0$}
    \label{cost_sim}
\end{figure*}


Figure \ref{cost_sim} shows the average cost of the policies for the three possible cases of the optimal policy given in Corollary \ref{optimal_decreasing}.  It does so by varying $c_{cs}$ while the value of $c_{p}$ is normalized to 1. The red curve shows the performance of the fixed policy as the length of the fixed keep-alive window is varied, while the blue horizontal line indicates the performance of the optimal history dependent policy. The red dot indicates the cost of the fixed policy evaluated at the length of the average optimal keep-alive window (i.e optimized-TTL). 


In Figure \ref{cost_sim} (a), since $c_{p}/c_{cs}$ is large compared to $\lambda_{0}$, the optimal keep-alive policy is to have a keep-alive window length of 0. In other words, the optimal policy is to encounter a cold start for every invocation. Here, the optimized-TTL policy is the same as optimal since the optimized-TTL window is 0.


In Figure \ref{cost_sim} (b), the red dot indicating the  average window length used by the optimal policy is also near the point where the average cost of the fixed window keep-alive policy curve changes from a decreasing function of window length to an increasing function of window length. Intuitively, this point on the red curve corresponds to the point $\tau_{\text{opt}}$ where the averaged expected cost function $\overline{g}(x) = c_{p}(1 - \overline{F}(x)) - c_{cs}\overline{f}(x)$, where the averages are taken over the different histories and runs,  changes from negative to positive. It is suggestive of some of the underlying regularities of Hawkes processes that the average of the optimal solutions and the optimal solution to the average problem are close to each other.  Thus, optimized-TTL finds a near-optimal point in the space of fixed policies and the performance there is close to the true optimum.


In Figure \ref{cost_sim} (c), we see that the optimal keep-alive policy is to always be active. Here, there is no indication of a red dot as it is out of bounds for the plot, where the red curve asymptotes toward the blue. 


In the appendix we provide results for additional parameter settings showing the strong performance of optimized-TTL across a wide range of parameter settings.


\subsection{Azure Data trace Experimental Setup} \label{subsec:azure_experiments}

% Experimental Setup
We evaluate the performance of the optimal policy and the optimized-TTL policy by comparing them with the fixed keep-alive policy on a subset of Azure traces released by \citet{shahrad2020serverless}.\footnote{These traces are available at \url{https://github.com/Azure/AzurePublicDatset}}. The traces  collect invocation counts of functions binned in 1-minute intervals. In Azure Functions, an application comprises of multiple functions where each function performs a specific task for the application. Since allocation of resources is based on applications (which are the unit of caching), we aggregate the bin counts of the function invocations belonging to the same application.  

We evaluate performance using two metrics. The first is the amount of memory time that is wasted, normalized to the amount wasted by the default policy of a 10 minute keep-alive window. We accumulate the wasted memory time across all applications for a given policy. We assume the function execution times to be zero, to quantify the worst case wasted memory time. For this calculation we assume that all the applications use the same amount of memory. The second is the number of cold starts. We evaluate the cold start behavior by computing the average number of cold starts per application. We assume the first invocation of an application to be a cold start. These modeling decisions are generally consistent with those of \citet{shahrad2020serverless}.


The fixed keep-alive policy was implemented by adding a fixed keep-alive window after an application invocation of, 5 minutes, 10 minutes, 20 minutes, 30 minutes, 45 minutes, 60 minutes, 90 minutes, or 120 minutes. The other two policies required fitting a Hawkes Process to the invocation pattern of an application. The Azure traces collect the data of application invocations for 14 days from July 15th to July 28th 2019. To avoid horizon effects or assuming unreasonable amounts of prior data about an application, we estimated the Hawkes process parameters based on the application invocations on day 8. We test for the appropriateness of the estimated Hawkes process parameters by using the corresponding application invocations on day 7 . We use a separate day of application invocations to check for the appropriateness of the Hawkes process parameters, since the goodness of fit test is known to have some limitations when the same data is used both to estimate the parameters and to compute the KS- statistic. \cite{reynaud2014goodness} propose a solution associated with sub-sampling. (\cite{van2016deep}  show similar issues for training and applying double Deep Q-learning Networks.) In the Appendix, we analyse this issue in detail. For simplicity, rather than sub-sampling the data we take advantage of additional data we are not currently using (e.g. day 7). We evaluate the policies for the application invocations during day 9. In our initial exploration we found that the results were largely the same when the policies were tested on other days instead, so to save on simulation time (since we are working with a datacenter-scale trace) we limited the evaluation to a single day. 


We know that the data contains applications triggered by timers and other patterns which are quite different from Hawkes processes. Therefore, we applied our policies only to those applications which were a good fit to a  Hawkes process. The estimates of the Hawkes process parameters were computed as the minimum of the negative log-likelihood as described by \citet{laub2021elements}. The log-likelihood for the estimation of Hawkes process parameters $\lambda_{0}, \; \alpha , \;\text{and} \; \beta$  is
\begin{align*}
    \sum_{i=1}^{k} \log(\lambda_{0} + \alpha A(i)) - \lambda t_{k} + \frac{\alpha}{\beta} \sum_{i=1}^{k} \Big[ e^{- \beta(t_{k} - t_{i})} - 1 \Big]
\end{align*}
where $t_{1}, \; t_{2}, \cdots, \; t_{k}$ are the \textit{k} invocations of the application, and $A(i) = \sum_{j=1}^{i-1} e^{- \beta (t_{i} - t_{j})}$. The appropriateness of the application invocations being modeled by a Hawkes process is then determined by applying the Random time change theorem on the estimated parameters as detailed by  \citet{laub2021elements}. The similarity measure of the distribution of the application invocations to a Hawkes process was determined via the Kolmogorov–Smirnov (KS) test.  After testing various thresholds, we applied the optimal keep-alive policy based on the estimated Hawkes process parameters to the 25\% of application processes which had the best goodness of fit.  The default fixed policy was used for the remaining 75\%. 


From Section \ref{sec:hawkes_optimal}, we know that, apart from degenerate cases, the optimal policy is given by $\tau_{\text{opt}, \mathcal{H}} = \frac{1}{\beta} \cdot \left( \log \alpha + \log \big(\sum_{j=1}^{m-1}e^{\beta (t_{j}-t_{m-1})} \big) - \log \Big( \frac{c_{p}}{c_{cs}} - \lambda_{0}\Big) \right)$. The second term of logarithmic sum of exponential expressed as $\log \big(\sum_{j=1}^{m-1}e^{\beta (t_{j}-t_{m-1})} \big)$ represents the weight of the previous arrivals. In order to compute the optimal keep-alive window efficiently for applications with frequent arrivals we consider no more than 200  previous arrivals. We normalize $c_{p} = 1$ and compute the optimal policy for $c_{cs} =$ 5, 10, 20, 30, 45, 60, 90, and 120. These values of $c_{cs}$ are chosen in order to match the lengths used by the fixed policy.


After obtaining the applications that are a good fit for the Hawkes process with its estimated parameters, we apply the optimized-TTL policy described as the average optimal keep-alive window policy over simulated arrivals. We compute the optimized-TTL policy based on the simulated arrivals instead of the arrivals available in the dataset because we eliminate any possible bias from using the same data to both fit the distribution and optimize the window lengths. Also, this shows that if the serverless provider has a lack of data for a particular Hawkes process application, it is still possible to get nearly equivalent performance to the optimal policy. We generate a single simulation of arrivals for each application based on the estimated Hawkes process parameters for 24 hours (1440 minutes). The average of the optimal keep-alive windows were computed for each application as their optimized-TTL based on the simulated arrivals for the various values of $c_{cs}$. 


\subsection{Azure Datatrace Performance Results} \label{azure_results}


% Made Figures larger since we are allowed 9 pages in Camera Ready 

\begin{figure*}[t!]
    \centering
    \begin{subfigure}[t]{0.44\textwidth}
        \centering
        \includegraphics[height=2.1in]{images/TradeoffCurve_fitted_apps_updated _evaluated_day9.png}
        \caption{Hawkes process applications}
    \end{subfigure}%
    ~ 
    \begin{subfigure}[t]{0.44\textwidth}
        \centering
        \includegraphics[height=2.1in]{images/TradeoffCurve_all_apps_updated_evaluated_day9.png}
        \caption{All applications}
    \end{subfigure}
    
    \begin{subfigure}[t]{0.44\textwidth}
        \centering
        \includegraphics[height=2.1in]{images/TradeoffCurveOfflineOptimal25p.png}
        \caption{Hawkes process applications}
    \end{subfigure}%
    ~ 
    \begin{subfigure}[t]{0.44\textwidth}
        \centering
        \includegraphics[height=2.1in]{images/TradeoffCurveOfflineOptimalAllApps.png}
        \caption{All applications}
    \end{subfigure}
   \caption{Trade-off curve of average number of cold starts vs normalized wasted memory for various policies}
    \label{azure_fig}
\end{figure*}


\begin{table*}[t]
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
Policy & Avg. Cold Start Savings (Hawkes) & (All) & Avg. Memory Savings (Hawkes) & (All)\\
\hline
Optimal & 1.012 &  0.1296 &  0.0457 & 0.0095 \\
\hline
Optimized-TTL & 0.965 & 0.1196 & 0.0436 & 0.0088\\
\hline
Offline-Optimal & 4.038 & 1.29 & 0.269 & 0.0948\\
\hline
\end{tabular}
\caption{\label{tab:1} Average performance improvement over fixed policy}
\end{table*}



Overall, our results show that applying the optimal policy to applications fit by a Hawkes process yields benefits that are economically significant at the scale of a datacenter.  Furthermore, the optimized-TTL policy yields near-equivalent results with no need to update based on history.

% Describe the insights of results
Figure \ref{azure_fig} (a,b) plots the trade-off curve between the average number of cold starts per application and the normalized wasted memory for optimal, optimized-TTL, and fixed policies. Figure \ref{azure_fig} (a) includes only the treated (Hawkes-process-like) applications, while Figure \ref{azure_fig} (b) includes all applications. We observe from Figure \ref{azure_fig}, that the optimal policy Pareto dominates the fixed keep-alive policy.  We see that for high values of $c_{cs}$ both metrics are almost the same for all policies. This is because, the optimal policy for many applications was selected to be the upper bound.  Similarly, for low values of $c_{cs}$ many applications use the lower bound. For intermediate values of $c_{cs}$ we see that the optimal policy has almost the same number of normalized cold starts but lower amount of wasted memory. Furthermore, the performance of the optimized-TTL policy is very similar to the optimal policy so we can get these benefits.  Of course, since we only treat 25\% of applications, these benefits are attenuated when considering all applications.

While the benefits are small in absolute terms we argue they are still economically significant.  We quantify them by computing the area between the Pareto curve of the fixed and the other two policies. By  dividing the area between the Pareto curves with the maximum number of average aggregate cold starts encountered by an application (effective x-axis length) or maximum normalized wasted memory (effective y-axis length). This gives a sense of the average improvement across the curve. We record the results in Table 1.

%We see that the area between fixed policy and optimal policy is considerable when only considering Hawkes process applications and is smaller when we look at all applications. This is because when considering all applications, as the fixed keep-alive window and cost of cold start increase (moving to the left of the x-axis) the amount of equal wasted memory added to both policies due to fixed keep-alive window increases for both policies. This causes the positive area where the optimal policy performs better to decrease and the negative area where the fixed policy performs better to increase when considering all applications.

They show that the improvement of the optimized-TTL policy over the fixed policy is around 95\% of that of the optimal policy indicating that the optimized-TTL policy performs effectively close to the history dependent policy. Given that wasted memory in the existing system is normalized to 1,  average improvements of approximately 0.045 on treated applications and 0.0095 overall represent an overall decrease in resource use for the cache on the order of 4.5\% on treated applications (0.95\% overall).\footnote{Because of the shape of the curves in Figure~\ref{azure_fig}, the benefits may be modestly larger in practice because current operating point is toward the right end of the plots where the gap tends to be larger.} While small in absolute terms, percentage improvements of this scale translate to tens or hundreds of millions of dollars in cost savings for a major cloud provider~\cite{dierks2019cluster}.  Alternatively, this same improvement could be used to improve customer experience,  reducing the number of cold starts by an average of about 1.012 per day for treated customers.


To put these results in further context, recall that we are comparing against the fixed policy {\em tuned optimally per our theory}.  In terms of average cost, simply using the default choice of 10 minutes would be substantially worse in many cases.  Furthermore, our theory shows that this fixed policy is not a weak baseline but has strong theoretical properties in its own right. (See discussion at the end of Section~\ref{subsec:avg_optimal}.)

Finally, to give a bound on how much of the possible performance improvement our approach achieves, we also evaluate the performance of the offline optimal policy in Figure \ref{azure_fig} for (c) only the applications that have a Hawkes process distribution (d) and all applications. Here, the offline optimal policy chooses to have a cold start if $c_{p} x_m \geq c_{cs}$, otherwise the keep-alive window is of length $x_m$. Here, $x_{m}$ is the inter-arrival time for the $m$-th arrival of the application invocation. That is, this policy cheats in that it is tuned to the actual realized pattern of arrivals rather than any prediction. In this sense it provides a bound on how much we could hope to achieve. The results from Figure \ref{azure_fig} (c), and also quantified in Table 1, show that the optimal Hawkes policy achieves a meaningful fraction of it.  Figure \ref{azure_fig} (d) includes a comparison to the improvements reported by \citet{shahrad2020serverless}.  They report a 1.5x improvement in wasted memory at essentially no cost in cold starts, which is substantially larger than what even the offline optimal can achieve (1.05x).  This highlights how much of their improvement comes from pre-warming apps, for example timers, which our theoretical results show is not necessary for Hawkes processes.  They combine sophisticated predictive modeling with a simple rule for determining pre-warming and keep-alive decisions.  In contrast, our results provide a sophisticated rule for these decisions, making them complementary.


\section{Conclusion} \label{sec:conclusion}

Motivated by applications such as serverless computing, we presented a model of caching policies which captures the trade-off between the cost of keeping objects in the cache and the cost of cache misses.
We characterized optimal caching policies and examined the optimal policies in detail for Hawkes processes. Since optimal policies for Hawkes processes depend on the history of arrivals, we also developed history-independent policies based on the heuristic of averaging the optimal keep-alive window from simulated predictions of arrivals. Evaluation on Hawkes process simulations provided insights into the tuning and expected performance of these approximations.  Evaluation on a data trace of Azure functions showed this approach can yield small,yet economically meaningful improvements at the scale of a datacenter. Our results point to several avenues for future work.  Since our approach allows us to characterize optimal policies on a per-item basis, it is naturally suited for exploring customization based on individual customer utilities rather than an overall system average as done in our experiments and most prior work.  Another direction would be to use our model to examine optimal policies for more complex scenarios, such as a hierarchy of caches.




% References
\bibliography{narayana_621}

\end{document}