Jasper-Flash: Elastic Token Compression and Progressive Distillation for Inference-Scalable Text Embedding Models
Keywords: Text Embedding Model, Knowledge Distillation, Token Compression
Abstract: Deploying text embedding models under resource constraints is hindered by their large parameter counts and by the quadratic complexity of standard self-attention in sequence length. Existing sequence reduction strategies, however, remain predominantly static. To address this, and inspired by Matryoshka Representation Learning (MRL), we propose an Elastic Token Compression (ETC) framework that enables flexible sequence-length scaling at inference time. To stabilize training, we further introduce Compression-Adaptive Progressive Distillation (CAPD), which uses multi-teacher fusion and dynamic sampling to construct a robust, compression-tolerant semantic space. We present Jasper-Token-Compression-600M, which allows on-the-fly adjustment of encoding latency to match available resources while maintaining highly competitive performance and demonstrating superior representation capacity across varying compression bounds.
Paper Type: Long
Research Area: Efficient Methods for NLP
Research Area Keywords: distillation, NLP in resource-constrained settings, LLM efficiency, dense retrieval, representation learning
Contribution Types: NLP engineering experiment, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 15165