Jasper-Flash: Elastic Token Compression and Progressive Distillation for Inference-Scalable Text Embedding Models

ACL ARR 2026 May Submission15165 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: Text Embedding Model, Knowledge Distillation, Token Compression

Abstract: Deploying text embedding models under resource constraints is hindered by their large parameter counts and the quadratic complexity of standard self-attention. Existing sequence reduction strategies, however, remain predominantly static. Inspired by Matryoshka Representation Learning (MRL), we propose Elastic Token Compression (ETC), a framework that enables flexible sequence scaling at inference time. To stabilize training, we further introduce Compression-Adaptive Progressive Distillation (CAPD), which uses multi-teacher fusion and dynamic sampling to construct a robust, compression-tolerant semantic space. We present Jasper-Token-Compression-600M, which allows encoding latency to be adjusted on the fly according to available resources while maintaining highly competitive performance and demonstrating superior representation capacity across varying compression bounds.

Paper Type: Long

Research Area: Efficient Methods for NLP

Research Area Keywords: distillation, NLP in resource-constrained settings, LLM efficiency, dense retrieval, representation learning

Contribution Types: NLP engineering experiment, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models

Languages Studied: English, Chinese

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 15165
