\clearpage
\section{Derivation of Variational Lower Bound}
\label{appendix:derivation}

In this appendix, we derive the variational lower bound used in
Eq.~\eqref{eq:psc_lfq_objective_revised}. The model represents each input video
$\xb^{(n)}$ using a continuous state latent $\sbb$ and a lookup-free identity
representation $\zb_{\mathrm{id}}$. The variational lower bound provides the
reconstruction and state-regularization terms, while the lookup-free
quantization, temporal identity-consistency, and contrastive identity--state
separation terms are added as regularizers for stable and disentangled object
representations.

\subsection{Variational Lower Bound}

For each sample $\xb^{(n)}$, the identity encoder produces a deterministic
identity representation $\zb_{\mathrm{id}}^{(n)}$. Conditioned on this identity
representation, the marginal likelihood is
\begin{align}
    \log p_{\theta}(\xb^{(n)} \mid \zb_{\mathrm{id}}^{(n)})
    &=
    \log \int p_{\theta}(\xb^{(n)}, \sbb \mid \zb_{\mathrm{id}}^{(n)})\,d\sbb \\
    &=
    \log \int
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    \frac{
    p_{\theta}(\xb^{(n)}, \sbb \mid \zb_{\mathrm{id}}^{(n)})
    }{
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    }
    \,d\sbb \\
    &\geq
    \int
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    \log
    \frac{
    p_{\theta}(\xb^{(n)}, \sbb \mid \zb_{\mathrm{id}}^{(n)})
    }{
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    }
    \,d\sbb \\
    &=
    \int
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    \log
    \frac{
    p_{\theta}(\xb^{(n)} \mid \sbb, \zb_{\mathrm{id}}^{(n)})p(\sbb)
    }{
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    }
    \,d\sbb \\
    &=
    \mathbb{E}_{\sbb \sim q_{\phi_s}(\sbb \mid \xb^{(n)})}
    \left[
    \log p_{\theta}
    \left(
    \xb^{(n)} \mid \sbb, \zb_{\mathrm{id}}^{(n)}
    \right)
    \right]
    -
    D_{\mathrm{KL}}
    \left(
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    \,\|\,p(\sbb)
    \right).
\end{align}
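The inequality follows from Jensen's inequality applied to the concave
logarithm. The subsequent factorization
$p_{\theta}(\xb^{(n)}, \sbb \mid \zb_{\mathrm{id}}^{(n)})
= p_{\theta}(\xb^{(n)} \mid \sbb, \zb_{\mathrm{id}}^{(n)})\,p(\sbb)$
assumes that the state prior does not depend on the identity representation,
i.e., $p(\sbb \mid \zb_{\mathrm{id}}^{(n)}) = p(\sbb)$. Maximizing the lower
bound is therefore equivalent to minimizing the sum of the negative expected
reconstruction log-likelihood and the state KL regularization term.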
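
As a complement to the derivation above, the slack in the bound can be written
explicitly. Let $\mathcal{L}_{\mathrm{ELBO}}^{(n)}$ (a shorthand introduced here
for illustration only) denote the final expression of the derivation. Under the
same factorization of the joint, the standard identity is
\begin{equation*}
    \log p_{\theta}(\xb^{(n)} \mid \zb_{\mathrm{id}}^{(n)})
    -
    \mathcal{L}_{\mathrm{ELBO}}^{(n)}
    =
    D_{\mathrm{KL}}
    \left(
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    \,\|\,
    p_{\theta}(\sbb \mid \xb^{(n)}, \zb_{\mathrm{id}}^{(n)})
    \right)
    \geq 0,
\end{equation*}
so the bound is tight when the approximate posterior matches the true posterior
over the state. This appendix does not fix a distributional family for
$q_{\phi_s}$ or $p(\sbb)$. Purely as an example, if one assumes a diagonal
Gaussian posterior
$q_{\phi_s}(\sbb \mid \xb^{(n)})=\mathcal{N}(\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\sigma}^2))$
in $d_s$ dimensions and a standard normal prior
$p(\sbb)=\mathcal{N}(\mathbf{0},\mathbf{I})$, where
$\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, and $d_s$ are hypothetical symbols
not defined elsewhere in this appendix, the state KL term has the closed form
\begin{equation*}
    D_{\mathrm{KL}}
    \left(
    q_{\phi_s}(\sbb \mid \xb^{(n)})
    \,\|\,p(\sbb)
    \right)
    =
    \frac{1}{2}
    \sum_{j=1}^{d_s}
    \left(
    \mu_j^2+\sigma_j^2-\log\sigma_j^2-1
    \right).
\end{equation*}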

\subsection{Lookup-Free Identity Representation}

For object track $i$, the identity encoder maps the temporally aggregated object
feature $\bar{\zb}^{(i)}$ to LFQ logits:
\begin{equation}
    \ab^{(i)}
    =
    g_{\phi_z}(\bar{\zb}^{(i)})
    \in \mathbb{R}^{B},
    \qquad
    B=\log_2 K_{\mathrm{id}},
\end{equation}
where $K_{\mathrm{id}}=2^{B}$ is the size of the implicit identity vocabulary,
so $K_{\mathrm{id}}$ must be a power of two. The binary identity code
$\bb^{(i)}\in\{-1,+1\}^{B}$ is obtained by independent sign quantization:
\begin{equation}
    \bb^{(i)}
    =
    q_{\mathrm{LFQ}}(\ab^{(i)}),
    \qquad
    b^{(i)}_r
    =
    2\mathbf{1}[a^{(i)}_r \geq 0]-1,
    \quad r=1,\ldots,B .
\end{equation}
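As a small numerical illustration of the quantizer (the values below are
hypothetical and chosen only for this example), take $K_{\mathrm{id}}=16$, so
that $B=4$:
\begin{equation*}
    \ab^{(i)}=(0.7,\,-1.2,\,0,\,2.1)^{\top}
    \quad\Longrightarrow\quad
    \bb^{(i)}=(+1,\,-1,\,+1,\,+1)^{\top}.
\end{equation*}
Because the indicator uses $a^{(i)}_r\geq 0$, a logit that is exactly zero is
mapped to $+1$.
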
Since the sign operation is non-differentiable, we use the straight-through
estimator
\begin{equation}
    \tilde{\bb}^{(i)}
    =
    \ab^{(i)}
    +
    \mathrm{sg}
    \left[
    \bb^{(i)}-\ab^{(i)}
    \right],
\end{equation}
where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation. The projected
identity representation used by the decoder is
\begin{equation}
    \zb_{\mathrm{id}}^{(i)}
    =
    W_{\mathrm{id}}\tilde{\bb}^{(i)}+\cb_{\mathrm{id}}.
\end{equation}
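
The effect of the stop-gradient can be spelled out directly from the two
definitions above; the identity matrix $\mathbf{I}_B$ is notation introduced
here for illustration. In the forward pass the logit terms cancel, while in the
backward pass the argument of $\mathrm{sg}[\cdot]$ is treated as a constant:
\begin{equation*}
    \tilde{\bb}^{(i)}=\bb^{(i)}
    \quad\text{(forward value)},
    \qquad
    \frac{\partial \tilde{\bb}^{(i)}}{\partial \ab^{(i)}}=\mathbf{I}_B,
    \qquad
    \frac{\partial \zb_{\mathrm{id}}^{(i)}}{\partial \ab^{(i)}}=W_{\mathrm{id}}
    \quad\text{(backward pass)}.
\end{equation*}
The decoder therefore always receives an exactly binary code, while gradients
reach the identity encoder as if the quantizer were the identity map.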

\subsection{LFQ Commitment Loss}

The LFQ commitment loss encourages the continuous logits $\ab^{(i)}$ to remain
close to their assigned binary identity code. Since LFQ does not use a learned
embedding codebook, the commitment penalty is applied directly between the
pre-quantized logits and the stop-gradient binary code:
\begin{equation}
    \mathcal{L}_{\mathrm{commit}}
    =
    \frac{1}{N_{\mathrm{trk}}}
    \sum_{i=1}^{N_{\mathrm{trk}}}
    \left\|
    \ab^{(i)}
    -
    \mathrm{sg}
    \left[
    \bb^{(i)}
    \right]
    \right\|_2^2.
\end{equation}
This term stabilizes LFQ training by preventing the encoder outputs from drifting
away from their discrete binary assignments.
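
As a sketch of how this term acts, note that the stop-gradient makes
$\bb^{(i)}$ a constant target, so the gradient with respect to the logits is
\begin{equation*}
    \nabla_{\ab^{(i)}}\mathcal{L}_{\mathrm{commit}}
    =
    \frac{2}{N_{\mathrm{trk}}}
    \left(
    \ab^{(i)}-\bb^{(i)}
    \right),
\end{equation*}
which pulls every logit toward its own $\pm 1$ target. For a hypothetical track
with $\ab^{(i)}=(0.7,\,-1.2,\,0,\,2.1)^{\top}$ and hence
$\bb^{(i)}=(+1,\,-1,\,+1,\,+1)^{\top}$, the per-track penalty is
\begin{equation*}
    \left\|\ab^{(i)}-\bb^{(i)}\right\|_2^2
    =
    (-0.3)^2+(-0.2)^2+(-1)^2+(1.1)^2
    =
    2.34 .
\end{equation*}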

\subsection{Entropy Codebook-Utilization Loss}

To encourage confident and balanced binary assignments, we use the LFQ entropy
codebook-utilization loss:
\begin{equation}
    \mathcal{L}_{\mathrm{entropy}}
    =
    \mathbb{E}
    \left[
        H(q_{\mathrm{LFQ}}(\ab))
    \right]
    -
    H
    \left(
        \mathbb{E}
        \left[
            q_{\mathrm{LFQ}}(\ab)
        \right]
    \right).
\end{equation}
The first term penalizes high entropy for individual assignments, encouraging
each object to make a confident binary decision. The second term encourages the
average assignment distribution to have high entropy, which promotes balanced
usage of the identity code space.

For binary LFQ, we approximate the probability that bit $r$ of track $i$ takes
the value $+1$ as
\begin{equation}
    p^{(i)}_r = \sigma(a^{(i)}_r/\tau),
    \qquad
    \bar p_r =
    \frac{1}{N_{\mathrm{trk}}}
    \sum_{i=1}^{N_{\mathrm{trk}}} p^{(i)}_r,
\end{equation}
where $\sigma(\cdot)$ is the sigmoid function and $\tau>0$ is a temperature
parameter. Treating the $B$ bits as independent, so that entropies add across
dimensions, the entropy codebook-utilization loss becomes
\begin{equation}
    \mathcal{L}_{\mathrm{entropy}}
    =
    \frac{1}{N_{\mathrm{trk}}}
    \sum_{i=1}^{N_{\mathrm{trk}}}
    \sum_{r=1}^{B}
    h(p^{(i)}_r)
    -
    \sum_{r=1}^{B}
    h(\bar p_r),
\end{equation}
where
\begin{equation}
    h(p)=-p\log p-(1-p)\log(1-p)
\end{equation}
is the binary entropy function.
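
Two properties of this expression may be useful when monitoring training. Both
follow from $h$ being concave with $0\leq h(p)\leq\log 2$, assuming the natural
logarithm (the base is not stated in the text):
\begin{equation*}
    -B\log 2
    \;\leq\;
    \mathcal{L}_{\mathrm{entropy}}
    \;\leq\;
    0 .
\end{equation*}
The upper bound is Jensen's inequality,
$\frac{1}{N_{\mathrm{trk}}}\sum_i h(p^{(i)}_r)\leq h(\bar p_r)$, and is attained
when all tracks share the same probabilities, i.e., when the codes collapse.
The lower bound is approached when every $p^{(i)}_r$ is close to $0$ or $1$
while $\bar p_r=1/2$. As a hypothetical example with $B=1$ and two tracks,
$p^{(1)}_1=0.99$ and $p^{(2)}_1=0.01$ give $\bar p_1=0.5$ and
\begin{equation*}
    \mathcal{L}_{\mathrm{entropy}}
    =
    h(0.99)-h(0.5)
    \approx
    0.056-0.693
    =
    -0.637 .
\end{equation*}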

\subsection{Temporal Identity-Consistency Loss}

Temporal identity consistency encourages all frame-level proposals assigned to
the same object track to agree with the track-level identity representation. Let
$\mathcal{M}_i$ denote the set of frame-level proposals assigned to object track
$i$, and let $\ab_{t,k}$ denote the LFQ logits predicted from proposal $(t,k)$.
The aggregated track-level logits $\ab^{(i)}$ serve as the identity anchor for
all proposals in the same track. We define
\begin{equation}
    \mathcal{L}_{\mathrm{temp}}
    =
    \frac{1}{\sum_i |\mathcal{M}_i|}
    \sum_i
    \sum_{(t,k)\in \mathcal{M}_i}
    \left\|
    \tanh(\ab_{t,k})
    -
    \mathrm{sg}
    \left[
    \tanh(\ab^{(i)})
    \right]
    \right\|_2^2.
\end{equation}
The hyperbolic tangent maps the logits to the soft binary range $(-1,1)$,
matching the $\{-1,+1\}$ values of the quantized code, before comparison.
The stop-gradient operation treats the aggregated identity as a fixed temporal
anchor. This term discourages identity drift by requiring frame-level identity
evidence for the same physical object to remain consistent across time.
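
A sketch of the resulting gradient may help in interpreting this term. It
follows from the chain rule with the anchor held constant; the component index
$r$ on $a_{t,k,r}$ is notation introduced here for illustration. For a proposal
$(t,k)\in\mathcal{M}_i$,
\begin{equation*}
    \frac{\partial \mathcal{L}_{\mathrm{temp}}}{\partial a_{t,k,r}}
    =
    \frac{2}{\sum_j |\mathcal{M}_j|}
    \left(
    \tanh(a_{t,k,r})-\tanh(a^{(i)}_r)
    \right)
    \left(
    1-\tanh^2(a_{t,k,r})
    \right).
\end{equation*}
The factor $1-\tanh^2(a_{t,k,r})$ vanishes for large $|a_{t,k,r}|$, so
saturated frame-level logits receive only small corrections from this loss,
even when they disagree with the anchor.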

\subsection{Contrastive Identity--State Separation Loss}

The variational lower bound encourages accurate reconstruction from both
$\sbb$ and $\zb_{\mathrm{id}}$, but reconstruction alone does not guarantee that
the two variables encode different information. In particular, the state latent
may still encode identity-specific information. To discourage this leakage, we
add a contrastive identity--state separation loss.

Because the state latent and identity representation may have different
dimensions, we first map them to a shared contrastive space using projection
heads:
\begin{equation}
    r_s:\mathcal{S}\rightarrow\mathbb{R}^{d_c},
    \qquad
    r_{\mathrm{id}}:\mathcal{Z}_{\mathrm{id}}\rightarrow\mathbb{R}^{d_c}.
\end{equation}
For object track $i$ at frame $t$, define the normalized projected embeddings
\begin{equation}
    \ub_{t,i}
    =
    \frac{
        r_s(\sbb^{(t,i)})
    }{
        \|r_s(\sbb^{(t,i)})\|_2
    },
    \qquad
    \vb_i
    =
    \frac{
        r_{\mathrm{id}}(\zb_{\mathrm{id}}^{(i)})
    }{
        \|r_{\mathrm{id}}(\zb_{\mathrm{id}}^{(i)})\|_2
    } .
\end{equation}
Their inner product
\begin{equation}
    \ub_{t,i}^{\top}\vb_i
\end{equation}
is the cosine similarity between the projected state and identity
representations. We then define
\begin{equation}
    \mathcal{L}_{\mathrm{con}}^{s,\mathrm{id}}
    =
    \frac{1}{\sum_i |\mathcal{T}_i|}
    \sum_i
    \sum_{t\in\mathcal{T}_i}
    \left[
        \max
        \left(
            0,
            \ub_{t,i}^{\top}\vb_i
            -
            m
        \right)
    \right]^2 ,
\end{equation}
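where $\mathcal{T}_i$ is the set of visible frames for object track $i$, and
$m$ is a similarity margin; meaningful values lie in $[-1,1)$ because the
cosine similarity is bounded in $[-1,1]$. The loss is zero whenever
$\ub_{t,i}^{\top}\vb_i\leq m$. When $m=0$, it penalizes only positive cosine
similarity, so orthogonal or negatively correlated pairs incur no penalty.
When $m<0$, it enforces a stronger separation by pushing the cosine similarity
below $m$, which encourages the two representations to be anti-correlated.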

This term treats the state and identity representations from the same object as
factors that should be separated rather than aligned. Therefore, unlike standard
contrastive objectives that pull positive pairs together, this loss explicitly
pushes the state representation away from the identity representation.
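
To make the role of the margin concrete, consider the per-frame penalty
$\left[\max\left(0,\ub_{t,i}^{\top}\vb_i-m\right)\right]^2$ for a few
hypothetical similarity values, chosen only for illustration:
\begin{align*}
    m=0,\;\; \ub_{t,i}^{\top}\vb_i=0.6
    &:\quad (0.6-0)^2=0.36, \\
    m=0,\;\; \ub_{t,i}^{\top}\vb_i=-0.2
    &:\quad 0, \\
    m=-0.5,\;\; \ub_{t,i}^{\top}\vb_i=-0.2
    &:\quad (-0.2+0.5)^2=0.09, \\
    m=-0.5,\;\; \ub_{t,i}^{\top}\vb_i=-0.7
    &:\quad 0 .
\end{align*}
The same pair that is unpenalized at $m=0$ is still penalized at $m=-0.5$,
which is the sense in which a negative margin demands stronger separation.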

\subsection{Final Training Objective}

Taking the negative of the lower bound and adding the LFQ identity losses,
temporal consistency loss, and contrastive identity--state separation loss gives
the final minimization objective:
\begin{equation}
    \begin{aligned}
    \mathcal{L}_{\mathrm{PSC\text{-}LFQ}}
    =
    &
    -
    \mathbb{E}_{\sbb \sim q_{\phi_s}(\sbb \mid \xb)}
    \left[
    \log p_{\theta}
    \left(
    \xb \mid \sbb, \zb_{\mathrm{id}}
    \right)
    \right] \\
    &+
    \lambda_s
    D_{\mathrm{KL}}
    \left(
    q_{\phi_s}(\sbb \mid \xb)
    \,\|\,p(\sbb)
    \right) \\
    &+
    \beta \mathcal{L}_{\mathrm{commit}}
    +
    \lambda_{\mathrm{entropy}}\mathcal{L}_{\mathrm{entropy}}
    +
    \lambda_{\mathrm{temp}}\mathcal{L}_{\mathrm{temp}} \\
    &+
    \lambda_{\mathrm{con}}
    \mathcal{L}_{\mathrm{con}}^{s,\mathrm{id}} .
    \end{aligned}
\end{equation}
The first two terms come from the variational lower bound: setting
$\lambda_s=1$ recovers the negative bound exactly, while other values reweight
the state KL term. The sample index $(n)$ is omitted for brevity. The
commitment loss keeps the continuous logits close to their binary codes, the
entropy codebook-utilization loss encourages confident and balanced use of the
binary identity code space, and the temporal identity-consistency loss
discourages the same object's identity representation from drifting across
frames. The contrastive identity--state separation loss discourages the
continuous state latent from encoding identity information already captured by
$\zb_{\mathrm{id}}$. We minimize this objective with respect to
$(\phi_s,\phi_z,\theta,W_{\mathrm{id}},\cb_{\mathrm{id}})$ and the parameters
of the projection heads $r_s$ and $r_{\mathrm{id}}$ used in the contrastive
separation term.