## Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
### paper ID: 4451
***

### This code snippet contains the model architecture and the loss functions we use to optimize the model.  
### Because the snippet is incomplete (args, train_loader, and optimizer are not defined), it cannot be run as is.
### We will release the complete code in the future.


<br />

### Correspondence between code variables and paper notation:
 - x1 (line 148) denotes the audio features $f^a_t$
 - x2 (line 160) denotes the visual features $f^v_t$
 - x1, x2 (line 163) denote the features $\hat{f}^a_t$ and $\hat{f}^v_t$, respectively, output from HAN
 - frame_prob (line 168) denotes the segment-level audio and visual event probabilities $p^m_t, m\in\{a,v\}$
 - frame_att (line 171) denotes the temporal attention weights $A^m$
 - av_att (line 172) denotes the modality attention weights $B$
 - global_prob (line 174) denotes the video-level event probabilities $p$
 - a_prob (line 176) denotes the video-level audio event probabilities $p^a$
 - v_prob (line 177) denotes the video-level visual event probabilities $p^v$
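
To show how the quantities listed above could fit together, here is a minimal pure-Python sketch of attentive pooling. It is an illustration under assumptions, not the released code. It assumes the pooling follows the publicly available HAN baseline: temporal attention is a softmax over segments, modality attention is a softmax over the two modalities, and the video-level probabilities are attention-weighted sums of the segment-level ones. The function names `softmax` and `attentive_pooling`, the argument names `frame_logits` and `av_logits`, and the `[t][m][c]` list layout are hypothetical and chosen only for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pooling(frame_prob, frame_logits, av_logits):
    """Pool segment-level probabilities into video-level ones (assumed scheme).

    All inputs are nested lists indexed [t][m][c]: T segments,
    2 modalities (0 = audio, 1 = visual), C event classes.
    Returns (global_prob, a_prob, v_prob), each a list of length C.
    """
    T, M, C = len(frame_prob), len(frame_prob[0]), len(frame_prob[0][0])

    # frame_att (A^m): softmax over time, per modality and class.
    frame_att = [[[0.0] * C for _ in range(M)] for _ in range(T)]
    for m in range(M):
        for c in range(C):
            column = softmax([frame_logits[t][m][c] for t in range(T)])
            for t in range(T):
                frame_att[t][m][c] = column[t]

    # av_att (B): softmax over the two modalities, per segment and class.
    av_att = [[[0.0] * C for _ in range(M)] for _ in range(T)]
    for t in range(T):
        for c in range(C):
            column = softmax([av_logits[t][m][c] for m in range(M)])
            for m in range(M):
                av_att[t][m][c] = column[m]

    # global_prob (p): weighted sum over both segments and modalities.
    global_prob = [
        sum(frame_prob[t][m][c] * frame_att[t][m][c] * av_att[t][m][c]
            for t in range(T) for m in range(M))
        for c in range(C)
    ]
    # a_prob (p^a) and v_prob (p^v): weighted sum over segments only.
    a_prob = [sum(frame_prob[t][0][c] * frame_att[t][0][c] for t in range(T))
              for c in range(C)]
    v_prob = [sum(frame_prob[t][1][c] * frame_att[t][1][c] for t in range(T))
              for c in range(C)]
    return global_prob, a_prob, v_prob
```

With all attention logits set to zero, both attentions are uniform, so each video-level probability reduces to a plain average of the corresponding segment-level probabilities. In the actual model these tensors are batched and the attention logits come from learned layers on the HAN outputs.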

