# FlexiCodec: Supplementary code

This is the code package for the paper "FlexiCodec: a dynamic neural audio codec for low frame rates". It is a preview version. To maintain anonymity, we have removed the copyright information from some third-party files; we will cite them properly when we open-source the code.

## Architecture

### Core Model (`modeling_flexicodec.py`)
The main model implementation featuring:
- **Dual Stream Encoder**: Processes both acoustic and semantic features
- **Dynamic Frame Rate Mechanism**: Adaptive frame merging/un-merging based on semantic content
- **Transformer Bottlenecks**: Specialized transformers for frame rate conversion
- **Semantic Feature Integration**: Supports multiple pre-trained semantic features (SenseVoice, Whisper, W2vBERT2)
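
As an illustration of frame merging and un-merging driven by semantic content, the following is a minimal sketch, not the implementation in `modeling_flexicodec.py`. It assumes that adjacent frames are merged when the cosine similarity of their semantic features reaches a threshold, that a merged frame is the mean of its group, and that un-merging repeats each merged frame for its original duration. The function names and the `threshold` value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def merge_frames(frames, threshold=0.9):
    """Greedily group adjacent frames whose similarity to the previous
    frame is at least `threshold` (hypothetical rule), then average each group.
    Returns (merged_frames, group_lengths)."""
    groups = []
    for frame in frames:
        if groups and cosine(groups[-1][-1], frame) >= threshold:
            groups[-1].append(frame)   # similar to the previous frame: same group
        else:
            groups.append([frame])     # dissimilar: start a new group
    merged = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    lengths = [len(g) for g in groups]
    return merged, lengths

def unmerge_frames(merged, lengths):
    """Restore the original frame count by repeating each merged frame."""
    restored = []
    for vec, n in zip(merged, lengths):
        restored.extend(list(vec) for _ in range(n))
    return restored
```

With this rule, stretches of similar frames collapse into a single frame, so the effective frame rate varies with the content, and the stored group lengths are enough to return to the original rate.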

### Neural Audio Processing (`modules/dac_model.py`)
Codec encoder and decoder components including:
- **Residual Units**: Residual blocks with Snake activation and weight normalization
- **Convolutional Architecture**: Efficient 1D convolutions for audio feature extraction
- **Encoder-Decoder Structure**: Symmetric design for audio reconstruction
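
For reference, the Snake activation is commonly defined as `x + sin²(αx) / α`. The sketch below applies it to a single scalar with a fixed `alpha`; in codec implementations `alpha` is usually a learnable per-channel parameter, so treat the scalar form and the function name as simplifying assumptions rather than the code in `modules/dac_model.py`.

```python
import math

def snake(x, alpha=1.0):
    """Snake activation: identity plus a periodic term, x + sin^2(alpha * x) / alpha."""
    return x + (math.sin(alpha * x) ** 2) / alpha
```

The periodic term lets the network model oscillating signals such as audio waveforms while staying close to the identity for small inputs.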

### Quantization (`modules/quantize.py`, `modules/fsq_quantizer.py`, `modules/fsq_wrapper.py`)
Vector quantization system for discrete token generation:
- **Residual Vector Quantization (RVQ)**
- **Finite Scalar Quantization (FSQ)**
- **Codebook Learning**: End-to-end optimization of quantization codebooks with a straight-through estimator
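
To make the FSQ idea concrete, here is a minimal sketch of the forward quantization step, assuming each latent dimension is bounded with `tanh` and rounded to one of an odd number of integer levels. It is not the code in `modules/fsq_quantizer.py`: the function names are hypothetical, even level counts are not handled, and the straight-through gradient used during training is omitted because the sketch has no autograd.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each dimension of `z` to an integer in [-half, half],
    where half = (levels[d] - 1) // 2. Odd level counts only."""
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (n_levels - 1) // 2
        codes.append(round(half * math.tanh(value)))  # bound, then round
    return codes

def fsq_index(codes, levels):
    """Flatten per-dimension codes into one token id (mixed-radix encoding)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)  # shift code to [0, n_levels)
    return index
```

With `levels = [5, 5, 5]` the implicit codebook has 5 × 5 × 5 = 125 entries, and no codebook vectors need to be stored or learned.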

### Discriminative Training (`modules/discriminator.py`)
Adversarial training components:
- **Multi-Period Discriminator (MPD)**
- **Multi-Resolution Spectrogram Discriminator**

### Loss Functions (`modules/loss.py`)
Comprehensive loss suite for training:
- **Reconstruction Losses**: L1, Mel-spectrogram, multi-band Mel-spectrogram losses
- **Adversarial Losses**: GAN losses for discriminator training

### Transformer Components (`modules/transformer.py`)
The transformer used by FlexiCodec's frame merging and un-merging modules. It uses local windowed attention.
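
As a rough illustration of local windowed attention, the sketch below builds a boolean mask in which each frame may attend only to frames within a fixed distance. The symmetric window shape, the `window` parameter, and the function name are assumptions; `modules/transformer.py` may use a different window layout.

```python
def local_attention_mask(num_frames, window):
    """mask[i][j] is True when query frame i may attend to key frame j,
    i.e. when the two frames are at most `window` positions apart."""
    return [
        [abs(i - j) <= window for j in range(num_frames)]
        for i in range(num_frames)
    ]
```

For example, `local_attention_mask(5, 1)` lets frame 2 attend to frames 1, 2, and 3 only, so attention cost grows with the window size rather than the full sequence length.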

### CNN Components (`modules/cnn.py`)
Convolutional neural network building blocks:
- **ConvNeXt Blocks**: Modern CNN architecture for feature extraction

## Data Processing

### Dataset (`dataloaders/dataset.py`)
Audio dataset implementation:
- **Audio Loading**: Supports various audio formats and sampling rates
- **Feature Extraction**: Integrated FBank and SeamlessM4T feature extraction
- **Segment Processing**: Crop to fixed-length audio segments
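
Cropping to fixed-length segments can be sketched as follows. This assumes a random crop for clips longer than the target length and zero-padding at the end for shorter ones; the padding behavior and the function name are assumptions, not a description of `dataloaders/dataset.py`.

```python
import random

def crop_or_pad(samples, segment_length, rng=random):
    """Return exactly `segment_length` samples from a list of audio samples:
    a random window if the clip is long enough, else the clip zero-padded."""
    if len(samples) >= segment_length:
        start = rng.randint(0, len(samples) - segment_length)  # inclusive bounds
        return samples[start:start + segment_length]
    return samples + [0.0] * (segment_length - len(samples))
```

Passing a seeded `random.Random` instance as `rng` makes the crops reproducible across runs.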

### Feature Extractors (`dataloaders/feature_extractors.py`)
Specialized feature extractors for each ASR/SSL model.

## Training Infrastructure

### Main Training Script (`train_codec.py`)
Complete training pipeline:
- **Distributed Training**: Multi-GPU training support with DDP
- **Checkpoint Management**: Resume training and model saving
- **Logging and Monitoring**: TensorBoard integration for experiment tracking
- **Configuration Management**: YAML-based configuration system

### Configuration (`conf/flexicodec_example.yaml`)
The configuration that we use to instantiate the main class in `modeling_flexicodec.py`.

### Evaluation (`evaluation/eval_codec.py`)
We also attach our development evaluation code. This includes audio quality metrics, semantic preservation metrics, code utilization, and real-time factor evaluations.
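
Two of these metrics have simple common definitions, sketched below. Real-time factor is taken as processing time divided by audio duration, and code utilization as the fraction of codebook entries that appear at least once in the produced tokens. These definitions and function names are assumptions for illustration; `evaluation/eval_codec.py` may compute them differently.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds

def codebook_utilization(token_ids, codebook_size):
    """Fraction of the codebook that appears at least once in `token_ids`."""
    return len(set(token_ids)) / codebook_size
```

For instance, encoding 10 seconds of audio in 0.5 seconds gives a real-time factor of 0.05.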