Language-driven Semantic Segmentation

Boyi Li; Kilian Q Weinberger; Serge Belongie; Vladlen Koltun; Rene Ranftl

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, Rene Ranftl

Published: 28 Jan 2022, Last Modified: 22 Jun 2025ICLR 2022 PosterReaders: Everyone

Keywords: language-driven, semantic segmentation, zero-shot, transformer

Abstract: We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., ``grass'' or ``building'') together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., ``cat'' and ``furry''). This allows LSeg to generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample. We demonstrate that our approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided. Code and demo are available at https://github.com/isl-org/lang-seg.

One-sentence Summary: We present a language-driven approach that enables synthesis of zero-shot semantic segmentation models from arbitrary label sets at test time.

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/language-driven-semantic-segmentation/code)

18 Replies

Loading