Competition: PubMed Scientific Article Summarization

Problem statement
You are given the full text of scientific articles and must generate high‑quality abstracts that faithfully summarize the core findings and conclusions. This is an abstractive/extractive text summarization task focused on biomedical literature. The challenge is to produce fluent, concise, and information‑preserving summaries from long, noisy inputs.

Files provided
- train.csv: Training set with columns [id, article, abstract].
- test_data.csv: Test set inputs with columns [id, article].
- sample_submission.csv: Submission template with columns [id, abstract].

Goal
For each id in test_data.csv, predict a single free‑text summary in the abstract column of your submission file.
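A minimal end-to-end sketch of producing a submission is a lead-n baseline: take the first few sentences of each article as its predicted abstract. The sentence splitter, the toy in-memory `test_df`, and the choice of n = 3 are illustrative assumptions, not part of the competition specification; in practice you would read `test_data.csv` directly.

```python
import re

import pandas as pd

def lead_n(article: str, n: int = 3) -> str:
    """Return the first n sentences of an article as a naive baseline summary."""
    # Crude sentence split on ., !, ? followed by whitespace -- an assumption,
    # not the official preprocessing.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:n])

# Toy stand-in for the test set; in practice: test_df = pd.read_csv("test_data.csv")
test_df = pd.DataFrame({
    "id": [1, 2],
    "article": [
        "Aim one. Method two. Result three. Detail four.",
        "Background. Findings. Conclusion.",
    ],
})

# Build the two-column submission frame and write it without an index column.
submission = pd.DataFrame({
    "id": test_df["id"],
    "abstract": test_df["article"].map(lead_n),
})
submission.to_csv("submission.csv", index=False)
```

Even a trivial baseline like this is useful for verifying the pipeline and submission format before investing in a stronger model.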

Evaluation
Submissions are evaluated by the average of ROUGE‑1 F1, ROUGE‑2 F1, and ROUGE‑L F1 across all test examples:
- Tokenization: lowercased, punctuation removed, whitespace tokenization.
- ROUGE‑N: computed on word n‑grams (n∈{1,2}). Precision and recall are computed from overlapping n‑gram counts; F1 = 2·P·R/(P+R), defined as 0 when the denominator is 0.
- ROUGE‑L: computed as F1 based on the Longest Common Subsequence (LCS) of word tokens.
Each ROUGE variant's F1 is averaged across test examples; the final metric is the mean of the three averaged F1 scores.
The evaluation is case‑insensitive and robust to minor formatting differences, but it rewards content coverage and penalizes omissions and hallucinations.
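The rules above can be sketched directly in code. This is an illustrative re-implementation under the stated tokenization (lowercase, punctuation stripped, whitespace split), with n-gram overlap counts clipped per n-gram; it is not the official scoring script.

```python
import string
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase, strip punctuation, split on whitespace -- the stated rules.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def rouge_n(pred: list[str], ref: list[str], n: int) -> float:
    """ROUGE-N F1 from clipped word n-gram overlap counts."""
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not pred_ngrams or not ref_ngrams:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())
    return f1(overlap / sum(pred_ngrams.values()), overlap / sum(ref_ngrams.values()))

def rouge_l(pred: list[str], ref: list[str]) -> float:
    """ROUGE-L F1 based on the longest common subsequence of word tokens."""
    m, k = len(pred), len(ref)
    if m == 0 or k == 0:
        return 0.0
    # Standard O(m*k) LCS dynamic program.
    dp = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(k):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == ref[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][k]
    return f1(lcs / m, lcs / k)

def score(pred_text: str, ref_text: str) -> float:
    """Mean of ROUGE-1, ROUGE-2, and ROUGE-L F1 for one example."""
    p, r = tokenize(pred_text), tokenize(ref_text)
    return (rouge_n(p, r, 1) + rouge_n(p, r, 2) + rouge_l(p, r)) / 3
```

Averaging `score` over all test examples yields the leaderboard metric under these assumptions; a local scorer like this helps you compare models without submitting.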

Submission format
- CSV file with exactly two columns named id and abstract.
- The id set must match test_data.csv exactly (order may differ). Duplicate or missing ids, extra rows/columns, or empty abstracts will be rejected.
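Before submitting, it is worth checking these constraints programmatically. A small validator, sketched here with pandas (the function name and toy data are illustrative):

```python
import pandas as pd

def validate_submission(submission: pd.DataFrame, test_ids) -> None:
    """Raise AssertionError if the submission violates the stated format rules."""
    assert list(submission.columns) == ["id", "abstract"], "need exactly columns [id, abstract]"
    assert not submission["id"].duplicated().any(), "duplicate ids"
    assert set(submission["id"]) == set(test_ids), "id set must match test_data.csv"
    assert submission["abstract"].fillna("").str.strip().ne("").all(), "empty abstracts"

# Toy example; in practice test_ids = pd.read_csv("test_data.csv")["id"]
sub = pd.DataFrame({"id": [2, 1], "abstract": ["summary b", "summary a"]})
validate_submission(sub, [1, 2])  # row order may differ from test_data.csv
```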

Final data split
- Train: union of the provided training and validation portions of the source dataset.
- Test: the provided held‑out portion.

Important
- Do not include any extra columns in your submission.
- Ensure deterministic preprocessing and inference for reproducible results.
- Articles can be very long; plan your memory budget and truncation strategy so that long sequences are handled correctly rather than silently clipped.
