---
annotations_creators:
- crowdsourced
language_creators:
- found
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- sentiment-classification
paperswithcode_id: sst
pretty_name: Stanford Sentiment Treebank v2
dataset_info:
  features:
  - name: idx
    dtype: int32
  - name: sentence
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': negative
          '1': positive
  splits:
  - name: train
    num_bytes: 4690022
    num_examples: 67349
  - name: validation
    num_bytes: 106361
    num_examples: 872
  - name: test
    num_bytes: 216868
    num_examples: 1821
  download_size: 7439277
  dataset_size: 5013251
---

# Dataset Card for [Dataset Name]

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://nlp.stanford.edu/sentiment/
- **Repository:**
- **Paper:** [Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank](https://www.aclweb.org/anthology/D13-1170/)
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the
compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005)
and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and
includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

Binary classification experiments on full sentences (negative or somewhat negative vs somewhat positive or positive
with neutral sentences discarded) refer to the dataset as SST-2 or SST binary.

### Supported Tasks and Leaderboards

- `sentiment-classification`

### Languages

The text in the dataset is in English (`en`).

## Dataset Structure

### Data Instances

```
{'idx': 0,
 'sentence': 'hide new secretions from the parental units ',
 'label': 0}
```

### Data Fields

- `idx`: Monotonically increasing index ID.
- `sentence`: Complete sentence expressing an opinion about a film.
- `label`: Sentiment of the opinion, either "negative" (0) or positive (1). The test set labels are hidden (-1).

### Data Splits

|                    |    train | validation | test |
|--------------------|---------:|-----------:|-----:|
| Number of examples |    67349 |        872 | 1821 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

Rotten Tomatoes reviewers.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Unknown.

### Citation Information

```bibtex
@inproceedings{socher-etal-2013-recursive,
    title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank",
    author = "Socher, Richard  and
      Perelygin, Alex  and
      Wu, Jean  and
      Chuang, Jason  and
      Manning, Christopher D.  and
      Ng, Andrew  and
      Potts, Christopher",
    booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
    month = oct,
    year = "2013",
    address = "Seattle, Washington, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D13-1170",
    pages = "1631--1642",
}
```

### Contributions

Thanks to [@albertvillanova](https://github.com/albertvillanova) for adding this dataset.