Movie Year Prediction from Dialogue Transcripts

Problem statement
Participants are given thousands of full movie dialogue transcripts. The goal is to predict the theatrical release year of each movie from its transcript text. This is a challenging temporal text regression problem: linguistic style, named entities, topics, cultural references, and dialogue structure all evolve over time.

Files provided to participants
- train.csv: Two columns [id, year]. Each id corresponds to a single transcript file in train_texts/ and the target is the movie release year (integer).
- test.csv: One column [id]. Each id corresponds to a single transcript file in test_texts/ without labels.
- train_texts/: Folder containing the training transcripts as plain UTF-8 .txt files. Filenames are anonymized ids (no title/year to avoid leakage).
- test_texts/: Folder containing the test transcripts as plain UTF-8 .txt files. Filenames are anonymized ids (no title/year).
- sample_submission.csv: Example submission file with the correct columns and random, valid placeholder years.
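
As an illustration of how the files above fit together, here is a minimal loading sketch using only the Python standard library. It assumes each transcript is stored as <id>.txt inside its folder; that naming is an assumption and should be checked against the actual directory listing. The helper name load_split is hypothetical.

```python
import csv
from pathlib import Path


def load_split(csv_path, text_dir):
    """Return a list of (id, text, year) tuples; year is None when unlabeled."""
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumption: the transcript for an id lives at <text_dir>/<id>.txt.
            path = Path(text_dir) / f"{row['id']}.txt"
            # errors="replace" keeps loading robust if a file has stray bytes.
            text = path.read_text(encoding="utf-8", errors="replace")
            # test.csv has no year column, so the label is left as None.
            year = int(row["year"]) if row.get("year") else None
            rows.append((row["id"], text, year))
    return rows
```

For example, load_split("train.csv", "train_texts") would return labeled rows, and load_split("test.csv", "test_texts") would return rows whose year is None.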

Do NOT rely on any information outside the text content itself. Filenames and directory names are anonymized specifically to avoid revealing labels. Any attempt to infer labels from original filenames, metadata, or external resources is disallowed.

Task
Given the transcript text of each movie in train_texts/ with labels in train.csv, learn to predict the release year for the transcripts in test_texts/ (ids in test.csv).

Evaluation
- Metric: Mean Absolute Error (MAE) on the predicted year.
- Lower is better.
- Predictions may be any real number; the metric is the mean absolute difference between the predicted and true years.
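
The metric can be reproduced locally with a few lines of Python. The sketch below is not the official scoring code; the function name and the toy values are made up for illustration. It also shows a constant baseline: the median of a set of years minimizes MAE on that set among all constant predictions, which makes it a reasonable floor to compare models against.

```python
from statistics import median


def mean_absolute_error(y_true, y_pred):
    """Mean of |prediction - truth| over all rows."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, for illustration only.
true_years = [1994, 2001, 2010]
preds = [1995.5, 2001.0, 2006.0]
print(mean_absolute_error(true_years, preds))  # (1.5 + 0 + 4) / 3

# Constant baseline: predict the median year for every row.
constant = median(true_years)
print(mean_absolute_error(true_years, [constant] * len(true_years)))
```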

Submission format
- CSV with exactly two columns: id, year
- Header row required; ids must exactly match those in test.csv (no missing or extra rows, no duplicates).
- The year column should be numeric. Floating-point predictions are allowed.
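
A submission file that follows these rules could be written and sanity-checked as sketched below. The helper names write_submission and check_submission are hypothetical, and the checks only mirror the requirements listed above; the organizers' own validation may differ.

```python
import csv
import math


def write_submission(path, ids, preds):
    """Write an id,year CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "year"])
        for id_, pred in zip(ids, preds):
            writer.writerow([id_, float(pred)])


def check_submission(path, test_ids):
    """Return a list of problems found; an empty list means no problem was detected."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["id", "year"]:
            return [f"unexpected columns: {reader.fieldnames}"]
        rows = list(reader)

    problems = []
    seen = [r["id"] for r in rows]
    if len(seen) != len(set(seen)):
        problems.append("duplicate ids")
    if set(seen) != set(test_ids):
        problems.append("ids do not match test.csv")
    for r in rows:
        # A year must parse as a finite number (floats are allowed).
        try:
            ok = math.isfinite(float(r["year"]))
        except (TypeError, ValueError):
            ok = False
        if not ok:
            problems.append(f"non-numeric year for id {r['id']}")
    return problems
```

Running check_submission on the written file with the ids from test.csv before uploading catches missing, extra, or duplicated rows early.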

Data construction and split
- The data are prepared from a large corpus of full-length movie transcripts. Years are parsed from the original filenames and validated to lie between 1900 and 2026. Files are then anonymized and split into train/test, approximately stratified by decade to reduce temporal covariate shift between the two sets.
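
Because the train/test split is approximately stratified by decade, a local validation split built the same way is likely to track test performance more closely than a purely random one. The following is a minimal sketch of such a split, not the organizers' actual procedure; the function name, the 20% validation fraction, and the seed are all illustrative assumptions.

```python
import random
from collections import defaultdict


def decade_stratified_split(ids, years, val_fraction=0.2, seed=0):
    """Split ids into (train_ids, val_ids), sampling each decade separately."""
    by_decade = defaultdict(list)
    for id_, year in zip(ids, years):
        by_decade[(year // 10) * 10].append(id_)

    rng = random.Random(seed)
    train_ids, val_ids = [], []
    for decade in sorted(by_decade):
        group = by_decade[decade]
        rng.shuffle(group)
        # Keep at least one item in train; tiny decades may get no validation rows.
        n_val = min(len(group) - 1, round(len(group) * val_fraction))
        val_ids.extend(group[:n_val])
        train_ids.extend(group[n_val:])
    return train_ids, val_ids
```

Decades with very few transcripts contribute little or nothing to the validation set under this scheme, so per-decade error estimates for those decades will be noisy.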

Data files
- train.csv
- test.csv
- sample_submission.csv
- train_texts/ (anonymized .txt files)
- test_texts/ (anonymized .txt files)

Notes
- Some transcripts include non-English dialogue; robust models should handle multilingual inputs.
- Transcripts vary widely in length; consider length normalization or truncation strategies tailored to your model.
- This competition is intended for research and educational purposes.
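
To make the note on transcript length concrete, here is one possible truncation strategy and one possible length normalization, sketched with the standard library only. Keeping both the head and the tail of a long transcript, and dividing term counts by document length, are illustrative choices rather than recommendations from the organizers; the function names, the 2000-token budget, and the whitespace tokenization are assumptions.

```python
from collections import Counter


def truncate_head_tail(text, max_tokens=2000, head_fraction=0.5):
    """Keep the first and last whitespace tokens when text exceeds max_tokens."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return " ".join(tokens)
    n_head = int(max_tokens * head_fraction)
    n_tail = max_tokens - n_head
    head = tokens[:n_head]
    # Guard n_tail == 0 explicitly, since tokens[-0:] would return everything.
    tail = tokens[len(tokens) - n_tail:] if n_tail else []
    return " ".join(head + tail)


def relative_term_frequencies(text):
    """Term counts divided by document length, so long and short texts are comparable."""
    tokens = text.lower().split()
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}
```

Whitespace splitting is a crude tokenizer and will behave poorly on languages that do not separate words with spaces, which matters given the multilingual transcripts mentioned above.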
