Title: Reddit Subreddit Classification Challenge

Problem statement
You are given 1,000,000 Reddit comments drawn uniformly from 40 high-traffic subreddits (25,000 per subreddit). Each comment includes its raw text and two metadata fields. Your task is to build a model that accurately predicts the originating subreddit of each comment. This is a challenging multi-class text classification problem with significant lexical overlap across classes and noisy, multi-line comment bodies.

Deliverables
Participants must submit a probability distribution over the 40 subreddits for every test comment.

Files provided
- train.csv: Training data with labels.
- test.csv: Test data without labels.
- sample_submission.csv: Submission template with the required columns and valid example probability values.

Data schema
All CSV files use UTF-8 encoding and may contain multi-line fields (properly quoted). Columns:
- id: Unique identifier for each comment (string). Identifiers are stable across files, so the same id always refers to the same comment.
- body: The full comment text. May contain punctuation, emojis, URLs, and newlines.
- controversiality: Numeric feature (0 or 1). Reddit’s aggregated controversiality flag.
- score: Integer score (upvotes minus downvotes) at snapshot time.
- subreddit: Only present in train.csv. The target label; one of 40 subreddit names.
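Because bodies may contain embedded newlines, a quote-aware CSV parser is required. The sketch below uses an in-memory sample with the column names from the spec (the rows themselves are invented) to show that a quoted multi-line body parses as one record:

```python
import csv
import io

# Tiny in-memory sample mimicking the train.csv schema above.
# The second body spans two physical lines but is properly quoted.
sample = (
    'id,body,controversiality,score,subreddit\n'
    'c1,"Nice post!",0,12,aww\n'
    'c2,"Line one\nLine two",1,-3,politics\n'
)

# csv.DictReader honors quoting, so the embedded newline stays inside
# a single logical record instead of starting a new row.
rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows))          # two records, not three
print(rows[1]["body"])
```

The same behavior applies when reading the real files (e.g. `open("train.csv", newline="", encoding="utf-8")` passed to `csv.DictReader`, or `pandas.read_csv`, both of which handle quoted newlines).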

Task
- Learn from train.csv how text and metadata map to subreddits.
- Produce well-calibrated class probability predictions for every row in test.csv and save them to submission.csv, formatted consistently with the sample template.
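A minimal baseline for this task is TF-IDF features over the comment body feeding a linear classifier. The sketch below uses a toy invented corpus in place of train.csv; scikit-learn is assumed to be available:

```python
# TF-IDF + logistic regression baseline on a toy two-class corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy comments standing in for the body/subreddit columns.
texts = [
    "great goal by the striker", "what a save by the keeper",
    "this recipe needs more garlic", "bake at 180 for twenty minutes",
    "great goal and a clean sheet", "simmer the sauce with garlic",
]
labels = ["soccer", "soccer", "cooking", "cooking", "soccer", "cooking"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams + bigrams
    LogisticRegression(max_iter=1000),    # predict_proba gives class probabilities
)
model.fit(texts, labels)

# Probabilities over the classes, in the order given by model.classes_.
probs = model.predict_proba(["another goal for the striker"])
print(dict(zip(model.classes_, probs[0])))
```

On the real data the same pipeline applies unchanged with 40 classes; the metadata columns (controversiality, score) can be appended as extra features later.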

Submission format
- A single CSV with header: id,<subreddit_1>,<subreddit_2>,...,<subreddit_40>
- One row per id in test.csv.
- Each row must be a valid probability distribution over all 40 classes (non-negative, finite, and sums to 1 within numerical tolerance).
- The set of subreddit columns and their order must exactly match sample_submission.csv. Use the sample file as the authoritative reference for column names and ordering.
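The requirements above can be satisfied with the standard library. The sketch below uses three hypothetical class columns standing in for the 40 real ones (in practice, read the exact names and order from sample_submission.csv) and renormalizes raw model scores into valid distributions:

```python
import csv
import io

# Hypothetical class columns; in practice take these, in order,
# from the header of sample_submission.csv.
class_cols = ["askscience", "aww", "politics"]

# Raw (possibly unnormalized) scores per test id, e.g. model outputs.
raw = {
    "t1": [0.2, 0.5, 0.3],
    "t2": [1.0, 1.0, 2.0],  # not yet a distribution
}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id"] + class_cols)
for comment_id, scores in raw.items():
    total = sum(scores)
    probs = [s / total for s in scores]  # renormalize so the row sums to 1
    writer.writerow([comment_id] + [f"{p:.6f}" for p in probs])

submission = buf.getvalue()
print(submission)
```

Writing to an actual submission.csv only requires swapping the `StringIO` buffer for `open("submission.csv", "w", newline="")`.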

Evaluation metric
- Multiclass logarithmic loss (a.k.a. cross-entropy) averaged over all test examples.
- Let K = 40 be the number of classes, N the number of test rows, y_i the true class of row i, and p_i(y_i) the submitted probability for that class. The score is:
  LogLoss = -(1/N) * Σ_i log(max(ϵ, min(1, p_i(y_i)))), with ϵ = 1e-15 to avoid log(0).
- Lower is better. This metric is sensitive to both accuracy and calibration and is robust to class balance (the dataset is balanced by construction).
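The metric can be sketched in a few lines of plain Python (a reference sketch of the formula above, not the official scorer):

```python
import math

EPS = 1e-15  # clipping constant from the formula above

def multiclass_log_loss(true_idx, probs):
    """Mean negative log-probability assigned to the true class,
    with clipping to avoid log(0)."""
    total = 0.0
    for y, p in zip(true_idx, probs):
        total += -math.log(max(EPS, min(1.0, p[y])))
    return total / len(true_idx)

# Uniform predictions over K = 40 classes score exactly ln(40) ≈ 3.6889,
# a useful sanity baseline: any real model should beat this.
K = 40
loss_uniform = multiclass_log_loss([0], [[1.0 / K] * K])
print(loss_uniform)
```

Note the asymmetry the clipping creates: a confident correct prediction contributes nearly 0, while a confident wrong one is capped at -log(ϵ) ≈ 34.5 per row, so overconfident submissions are heavily penalized.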

Constraints and notes
- Multi-line comment bodies are quoted; parse them with a quote-aware CSV reader rather than splitting the file on newlines.
- The train and test splits are stratified by subreddit and deterministic, so local validation splits should mimic this to avoid leakage.
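A deterministic, per-class validation split of the kind described can be sketched with the standard library (the labels and the 80/20 ratio here are invented for illustration):

```python
import random
from collections import Counter, defaultdict

# Invented toy labels: 10 ids in each of three classes.
labels = ["aww"] * 10 + ["politics"] * 10 + ["science"] * 10
ids = [f"c{i}" for i in range(len(labels))]

# Group ids by class so each class is split at the same ratio.
by_class = defaultdict(list)
for comment_id, label in zip(ids, labels):
    by_class[label].append(comment_id)

rng = random.Random(42)  # fixed seed => the split is reproducible
train_ids, valid_ids = [], []
for label, members in sorted(by_class.items()):  # sorted for determinism
    members = members[:]
    rng.shuffle(members)
    cut = int(0.8 * len(members))  # 80% train, 20% validation per class
    train_ids.extend(members[:cut])
    valid_ids.extend(members[cut:])

print(len(train_ids), len(valid_ids))  # 24 6
```

The same effect can be had with `sklearn.model_selection.train_test_split(..., stratify=labels, random_state=42)`; the point is that every class keeps the same proportion on both sides of the split.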

Final data files
- train.csv
- test.csv
- sample_submission.csv

Good luck and have fun building robust NLP systems!