Title: Multilabel Prediction of Medicinal Properties of Plants

Problem statement
You are given a rich tabular/text dataset of plants with botanical, cultivation, edibility, hazard, range, and free‑text descriptions. The goal is to build a model that, for each plant record, predicts which medicinal properties apply to that plant. Each row may have multiple properties simultaneously (multilabel classification). This requires robust text processing, feature engineering, and model training on noisy, domain‑rich attributes.

Files
- train.csv: Training data with target column "Medicinal Properties" (a semicolon-separated list of properties per row) plus features such as scientific/common names, cultivation details, edibility, hazards, ranges, and more. Includes a unique id column "id".
- test.csv: Test data with the same feature columns and "id" but without the target column.
- sample_submission.csv: Example submission file. It contains the column "id" and one probability column per medicinal property label, with each label name sanitized into a column name of the form label_<LabelName> (for example, label_Antibacterial). Values must be probabilities in [0, 1].

Target
- Medicinal Properties (multilabel): 156 unique medicinal properties extracted from the dataset, such as Antibacterial, Astringent, Diuretic, Febrifuge, and Stomachic. The label set is fixed; it is defined by the label columns of sample_submission.csv.
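As an illustration of how the target column could be turned into training labels, the sketch below splits a semicolon-separated target string into a set and encodes it as a 0/1 vector. The helper names parse_labels and to_multi_hot are hypothetical, the three-label list is a made-up subset of the 156 labels, and ignoring whitespace around separators is an assumption about the data rather than a documented rule.

```python
def parse_labels(raw):
    """Split a semicolon-separated target string into a set of label names."""
    if not raw:
        return set()
    # Assumption: whitespace around each name is not significant.
    return {part.strip() for part in raw.split(";") if part.strip()}


def to_multi_hot(label_set, label_order):
    """Encode a label set as a 0/1 list aligned with label_order."""
    return [1 if name in label_set else 0 for name in label_order]


# Hypothetical three-label subset; the real order comes from sample_submission.csv.
label_order = ["Antibacterial", "Astringent", "Diuretic"]
vec = to_multi_hot(parse_labels("Diuretic; Antibacterial"), label_order)
print(vec)  # [1, 0, 1]
```

Because the parsed result is a set, the order in which labels appear in the target string has no effect on the encoded vector.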

Evaluation metric
- Submissions must output, for every test id and for every medicinal property label, a probability p in [0, 1].
- Scoring uses micro‑averaged log loss: the binary log loss is computed for every label decision (each label for each row) and averaged over all of them. Because every decision carries equal weight, the score balances positives and negatives and is not dominated by rare labels. It rewards calibrated probabilities and penalizes both false positives and false negatives. Lower is better.
- Columns must exactly match the sample_submission.csv (same order), and ids must match test.csv.
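For local validation, the metric can be approximated along the following lines. This is a minimal sketch, not the official scorer: the function name micro_log_loss is hypothetical, and clipping probabilities with eps=1e-15 is an assumption, since the exact clipping used for scoring is not stated here.

```python
import math


def micro_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss over every (row, label) pair.

    y_true: rows of 0/1 values; y_prob: rows of probabilities, same shape.
    """
    total = 0.0
    count = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            # Assumption: clip to avoid log(0); the official eps may differ.
            p = min(max(p, eps), 1.0 - eps)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    if count == 0:
        raise ValueError("no label decisions to score")
    return total / count


# Two rows, two labels, every prediction at 0.5: the loss equals ln(2).
score = micro_log_loss([[1, 0], [0, 1]], [[0.5, 0.5], [0.5, 0.5]])
```

Predicting 0.5 everywhere gives ln(2), roughly 0.693, which is a useful baseline to beat when checking that a model's probabilities carry information.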

Train/test split
- The data are split deterministically. Every label that appears in the test set occurs at least once in the training set, so the split preserves the global label space; the label distribution nonetheless remains challenging.

Submission format
- CSV with columns: id, followed by one column per label as in sample_submission.csv. Column names are sanitized as label_<LabelName>.
- Values must be probabilities in [0, 1]. Any missing/extra columns, wrong order, non‑finite values, or id mismatch will invalidate the submission.
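The format rules above can be checked before submitting with something like the sketch below. The function validate_submission is a hypothetical helper, not part of any provided tooling. It takes CSV text rather than file paths so the example is self-contained, and it compares ids as strings while ignoring row order, which is an assumption because the required row order is not specified here.

```python
import csv
import io
import math


def validate_submission(submission_csv, sample_csv, test_ids):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    sub_rows = list(csv.reader(io.StringIO(submission_csv)))
    sample_rows = list(csv.reader(io.StringIO(sample_csv)))
    if not sub_rows or not sample_rows:
        return ["empty CSV text"]
    header, body = sub_rows[0], sub_rows[1:]
    # Column names and their order must equal the sample header exactly.
    if header != sample_rows[0]:
        problems.append("columns differ from sample_submission.csv (names or order)")
    # Assumption: ids are compared as strings, ignoring row order.
    if sorted(row[0] for row in body if row) != sorted(str(i) for i in test_ids):
        problems.append("ids do not match test.csv")
    for row in body:
        if len(row) != len(header):
            problems.append(f"id {row[0] if row else '?'}: wrong number of values")
            continue
        for name, value in zip(header[1:], row[1:]):
            try:
                p = float(value)
            except ValueError:
                problems.append(f"id {row[0]}, {name}: not a number")
                continue
            # Rejects NaN, infinities, and values outside [0, 1].
            if not math.isfinite(p) or not 0.0 <= p <= 1.0:
                problems.append(f"id {row[0]}, {name}: {value!r} is not a probability in [0, 1]")
    return problems


sample = "id,label_Antibacterial,label_Astringent\n1,0.5,0.5\n2,0.5,0.5\n"
good = "id,label_Antibacterial,label_Astringent\n1,0.9,0.1\n2,0.2,0.3\n"
bad = "id,label_Astringent,label_Antibacterial\n1,1.5,nan\n"
good_problems = validate_submission(good, sample, ["1", "2"])
bad_problems = validate_submission(bad, sample, ["1", "2"])
```

In practice the two CSV strings would be read from the submission file and sample_submission.csv, and test_ids from the "id" column of test.csv.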

Important notes
- Do not attempt to infer labels from filenames or URLs; no external images are provided in the split.
- Text fields may include bracketed citations like [238]; treat them as plain text.
- The order of labels in the target string is arbitrary; treat it as a set.

Final deliverables
- train.csv, test.csv, sample_submission.csv

We look forward to your creative approaches to robust, calibrated multilabel prediction on botanical text/tabular data.