Competition: arXiv Subject Area Tagging from Titles and Abstracts

Problem statement
Build a multi-label text classification model that predicts the arXiv subject area tags for a paper given its title and abstract. Each paper may belong to one or more subject areas (e.g., cs.CV, cs.LG). Your goal is to produce, for every test paper, the set of predicted tags.

Data description
- train.csv: Training set with columns
  - id: Unique identifier of a paper
  - title: Paper title (string)
  - summary: Paper abstract (string)
  - labels: Space-separated list of subject area tags for the paper (multi-label)
- test.csv: Test set with columns
  - id: Unique identifier of a paper
  - title: Paper title (string)
  - summary: Paper abstract (string)
- sample_submission.csv: Example submission with columns
  - id
  - labels: Space-separated list of predicted subject area tags
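
As a rough illustration of the layout above, the sketch below reads rows in the train.csv format and turns the space-separated labels field into a set of tags. The sample rows are made up for illustration, and the helper name read_train_rows is hypothetical. Concatenating title and summary into one text field is only one possible choice.

```python
import csv
import io

# Tiny made-up sample in the train.csv layout described above (not real data).
SAMPLE_TRAIN = '''id,title,summary,labels
1,"A title","An abstract, with a comma.",cs.CV cs.LG
2,"Another title","Second abstract.",stat.ML
'''

def read_train_rows(handle):
    """Yield (id, text, tag_set) for each row of a train.csv-style file."""
    for row in csv.DictReader(handle):
        # Assumption: title and summary are simply joined into one text field.
        text = row["title"] + " " + row["summary"]
        # The labels column is space-separated, so split() yields the tags.
        tags = set(row["labels"].split())
        yield row["id"], text, tags

rows = list(read_train_rows(io.StringIO(SAMPLE_TRAIN)))
for paper_id, text, tags in rows:
    print(paper_id, sorted(tags))
```

For the real file, the same function could be called on open("train.csv", newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.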

Notes
- The label vocabulary is the set of unique tags that appear in train.csv. All predicted tags must come from this vocabulary.
- Text fields may include punctuation and embedded newlines; read the CSV files with a parser that supports quoted multi-line fields, and normalize whitespace as needed in your preprocessing pipeline.
- There is no public ground truth for the test set; test labels are held out for evaluation.
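
The notes above can be sketched as three small helpers: one that collapses newlines and repeated whitespace, one that builds the label vocabulary from the training labels, and one that filters predictions down to that vocabulary. All function names are hypothetical, and collapsing whitespace is only one reasonable way to handle newlines.

```python
import re

def normalize_text(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def build_vocabulary(label_strings):
    """Return the sorted list of unique tags seen in the training labels."""
    vocab = set()
    for labels in label_strings:
        vocab.update(labels.split())
    return sorted(vocab)

def restrict_to_vocabulary(predicted_tags, vocab):
    """Drop any predicted tag that never appears in the training labels."""
    allowed = set(vocab)
    return [tag for tag in predicted_tags if tag in allowed]

# Made-up label strings standing in for the labels column of train.csv.
vocab = build_vocabulary(["cs.CV cs.LG", "cs.LG stat.ML"])
print(vocab)
print(normalize_text("Deep learning\n  for   vision "))
print(restrict_to_vocabulary(["cs.LG", "math.XX"], vocab))
```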

Evaluation metric
Submissions are evaluated using micro-averaged F1 score over the multi-label predictions.
- Let Y_true[i] be the set of true tags for paper i and Y_pred[i] the set of predicted tags.
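- True Positives (TP): number of tags that appear in both Y_true[i] and Y_pred[i], summed over all papers i
- False Positives (FP): number of tags that appear in Y_pred[i] but not in Y_true[i], summed over all papers i
- False Negatives (FN): number of tags that appear in Y_true[i] but not in Y_pred[i], summed over all papers i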
- micro_precision = TP / (TP + FP) (0 if denominator is 0)
- micro_recall = TP / (TP + FN) (0 if denominator is 0)
- micro_F1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall) (0 if denominator is 0)
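
For local validation, the metric defined above can be sketched directly from per-paper tag sets. This is a minimal illustration rather than the official scoring code; the function name micro_f1 and the example tag sets are made up.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over parallel lists of true and predicted tag sets."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        tp += len(true_tags & pred_tags)   # tags in both sets
        fp += len(pred_tags - true_tags)   # predicted but not true
        fn += len(true_tags - pred_tags)   # true but not predicted
    # Each ratio is defined as 0 when its denominator is 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up example: TP = 2, FP = 1, FN = 1, so precision = recall = 2/3.
y_true = [{"cs.CV", "cs.LG"}, {"stat.ML"}]
y_pred = [{"cs.CV"}, {"stat.ML", "cs.LG"}]
score = micro_f1(y_true, y_pred)
print(round(score, 4))
```

Because the counts are pooled across all papers before the ratios are taken, frequent tags influence the score more than rare ones.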

Submission format
- A CSV file with exactly two columns: id, labels
- ids must match the ids in test.csv exactly, with no duplicates or missing entries
- labels for each row must be a space-separated list of tags from the training label vocabulary; an empty string is permitted to indicate predicting no tags for that row
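
A submission that follows the rules above could be written and sanity-checked roughly as follows. The helper names write_submission and check_submission are hypothetical, the ids and tags are made up, and the checks mirror only the rules listed in this section, not the organizers' actual validator.

```python
import csv
import io

def write_submission(handle, test_ids, predictions):
    """Write one row per test id; papers without predictions get an empty string."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "labels"])
    for paper_id in test_ids:
        tags = predictions.get(paper_id, [])
        writer.writerow([paper_id, " ".join(tags)])

def check_submission(handle, test_ids, vocab):
    """Return a list of problems found; an empty list means the checks passed."""
    reader = csv.DictReader(handle)
    problems = []
    if reader.fieldnames != ["id", "labels"]:
        problems.append("columns must be exactly: id, labels")
        return problems
    rows = list(reader)
    ids = [row["id"] for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate ids")
    if set(ids) != set(test_ids):
        problems.append("ids do not match test.csv")
    allowed = set(vocab)
    for row in rows:
        unknown = [t for t in row["labels"].split() if t not in allowed]
        if unknown:
            problems.append("unknown tags for id " + row["id"])
    return problems

# Made-up ids and predictions; paper "2" gets no tags.
buffer = io.StringIO()
write_submission(buffer, ["1", "2"], {"1": ["cs.CV", "cs.LG"]})
print(buffer.getvalue())
print(check_submission(io.StringIO(buffer.getvalue()), ["1", "2"], ["cs.CV", "cs.LG"]))
```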

Final files provided
- train.csv
- test.csv
- sample_submission.csv

Good luck and have fun building robust multi-label text classifiers!