Title: ESG Score Prediction from Sustainability Reports

Problem statement
Participants are challenged to build models that predict four continuous ESG dimensions for S&P 500 firms directly from their sustainability report text:
- e_score (Environmental)
- s_score (Social)
- g_score (Governance)
- total_score (Overall ESG)

This is a multi-target regression problem grounded in long-form documents that combine narrative text with named entities extracted via NER. The task emphasizes document understanding, feature engineering, and robust modeling.

Files provided
- train.csv: training targets with one row per document id. Columns: id, e_score, s_score, g_score, total_score.
- test.csv: test ids (targets withheld). Column: id.
- text/train/*.txt and text/test/*.txt: cleaned report text files per document id.
- ner/train/*.txt and ner/test/*.txt: named entities extracted from each document (one plain-text blob per document id).
- sample_submission.csv: example submission file with valid schema and placeholder numeric predictions.

Each id corresponds to one document. The raw text and NER text for each id are available in the text/ and ner/ directories. File names are anonymized to avoid label leakage.
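The layout above can be loaded with a small helper; a minimal sketch assuming pandas is available (the function name load_documents and the column names text/ner are illustrative, not part of the provided files):

```python
# Sketch: join targets (if present) with the raw text and NER blobs per id.
# Assumes the directory layout described above; missing files yield "".
from pathlib import Path
import pandas as pd

def load_documents(csv_path: str, text_dir: str, ner_dir: str) -> pd.DataFrame:
    """Return one row per id from csv_path, with added 'text' and 'ner' columns."""
    df = pd.read_csv(csv_path)

    def read_blob(folder: str, doc_id) -> str:
        path = Path(folder) / f"{doc_id}.txt"
        return path.read_text(encoding="utf-8") if path.exists() else ""

    df["text"] = [read_blob(text_dir, i) for i in df["id"]]
    df["ner"] = [read_blob(ner_dir, i) for i in df["id"]]
    return df
```

The same helper works for both splits: pass train.csv with text/train and ner/train, or test.csv with the test folders.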

Data description
- Documents originate from S&P 500 companies’ sustainability reports and have been preprocessed into plain text. Each document also includes a corresponding bag of named entities extracted from the source report.
- Targets are four numeric scores: e_score, s_score, g_score, total_score.
- Train/test split preserves the target distribution via stratification on binned total_score.

Task
- Use the text and NER content to predict all four targets for the documents in test.csv. Approaches may include classical NLP (TF-IDF, topic modeling, n-grams), modern transformers (document embedding, long-sequence models), and multimodal fusion of text and NER features. Feature engineering is encouraged (e.g., sentiment, ESG lexicons, entity statistics, readability, temporal cues).
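One of the classical approaches mentioned above can be sketched as a TF-IDF pipeline feeding a linear regressor; a minimal baseline assuming scikit-learn, with illustrative hyperparameters (Ridge natively handles all four targets at once):

```python
# Sketch: TF-IDF n-gram features + multi-target ridge regression.
# max_features, ngram_range, and alpha are illustrative starting points.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

def build_baseline():
    """Pipeline mapping raw document strings to (N, 4) score predictions."""
    return make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), sublinear_tf=True),
        Ridge(alpha=1.0),  # fits one model per target column when y is 2-D
    )
```

Fitting on the training texts with a (N, 4) target matrix of e_score, s_score, g_score, total_score yields predictions of the same shape for the test documents.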

Evaluation
Submissions are evaluated using Mean Columnwise Root Mean Squared Error (MCRMSE) over the four targets. For N test items and K=4 targets, with y_true and y_pred matrices of shape (N, K):
1. Compute RMSE per column.
2. Average the four RMSE values.
Lower is better. The metric weights the four dimensions equally and keeps each column's error in the original score units, which makes it straightforward to interpret for multi-target regression.
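The two steps above translate directly into a few lines of NumPy:

```python
# MCRMSE as described: RMSE per target column, then the mean over columns.
import numpy as np

def mcrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Columnwise RMSE for (N, K) arrays of true and predicted scores."""
    rmse_per_col = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(np.mean(rmse_per_col))
```

A perfect submission scores 0; being off by a constant 1.0 on every target scores exactly 1.0.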

Submission format
A valid CSV must contain the following columns with exactly these names:
- id, e_score, s_score, g_score, total_score
The id column must contain exactly the ids from test.csv. All prediction values must be finite numeric values.
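A small writer that enforces this schema before saving; a sketch assuming pandas and NumPy (the function name make_submission and the output filename are illustrative):

```python
# Sketch: assemble a schema-valid submission from an (N, 4) prediction matrix
# aligned with the order of ids in test.csv. Raises on shape or NaN/inf issues.
import numpy as np
import pandas as pd

TARGETS = ["e_score", "s_score", "g_score", "total_score"]

def make_submission(test_ids, preds, path="submission.csv") -> pd.DataFrame:
    preds = np.asarray(preds, dtype=float)
    assert preds.shape == (len(test_ids), len(TARGETS)), "need one row per test id"
    assert np.isfinite(preds).all(), "all predictions must be finite"
    sub = pd.DataFrame(preds, columns=TARGETS)
    sub.insert(0, "id", list(test_ids))
    sub.to_csv(path, index=False)
    return sub
```

Checking finiteness at write time catches NaNs from failed model runs before they invalidate a submission.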

Train/test files
- Training: train.csv and the corresponding text/train and ner/train folders.
- Test: test.csv and the corresponding text/test and ner/test folders.

Rules and notes
- Do not attempt to reconstruct original filenames, tickers, or years from ids; they were anonymized.
- External data and pretraining are allowed subject to typical Kaggle rules; ensure any external source introduces no label leakage.
- You may use both text and NER folders; they contain complementary signals.
- Runtime is not restricted; prioritize modeling quality.
