Competition: Multimodal Hate Speech Classification (Text + Image)

Problem statement
Build a model that detects the presence and type of hate speech in social media posts using both text and images. Each post includes the tweet text, OCR text extracted from the image, and the image itself. Your task is to predict one of six categories per post:
- NotHate
- Racist
- Sexist
- Homophobe
- Religion
- OtherHate

This is a challenging multimodal problem with subjective, noisy annotations aggregated from multiple annotators. Robust data processing, feature engineering, and modeling across modalities are key to strong performance.

Files you will receive
- train.csv: Training metadata with columns [id, image, tweet_text, ocr_text, label].
- test.csv: Test metadata with columns [id, image, tweet_text, ocr_text]. No labels.
- train_images/: Folder of training images. Filenames match the image column in train.csv.
- test_images/: Folder of test images. Filenames match the image column in test.csv.
- train_text/: Folder of OCR JSON files for training samples (one per id). Names match id with .json extension.
- test_text/: Folder of OCR JSON files for test samples (one per id). Names match id with .json extension.
- sample_submission.csv: A valid example submission with random labels.
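The layout above can be wired together with a small path helper. A minimal sketch, assuming the files sit under a `data/` root (the root directory and the example row are assumptions; only the column and folder names come from this description):

```python
import csv
from pathlib import Path

DATA_DIR = Path("data")  # assumed root; adjust to wherever the files were unpacked

def asset_paths(row, split="train"):
    """Map one metadata row to its image file and OCR JSON file.

    Images live in {split}_images/ under the name in the `image` column;
    OCR JSON lives in {split}_text/ under the `id` with a .json extension.
    """
    image_path = DATA_DIR / f"{split}_images" / row["image"]
    ocr_path = DATA_DIR / f"{split}_text" / f"{row['id']}.json"
    return image_path, ocr_path

# Hypothetical row shaped like a train.csv record:
row = {"id": "123", "image": "123.jpg", "tweet_text": "...", "ocr_text": ""}
img, ocr = asset_paths(row)
```

Reading train.csv with `csv.DictReader` yields rows in exactly this shape, so the helper plugs directly into a loading loop.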

Notes on the data
- Labels originate from three crowd annotators per post. The single label provided for supervision was resolved by majority vote with deterministic tie-breaking.
- OCR text is provided as a convenience feature extracted from images. It may be empty for some samples.
- Filenames are anonymized to avoid any leakage from original identifiers.
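For intuition, majority vote with a deterministic tie-break can be sketched as below. The fixed-label-order tie rule is an assumption for illustration only; the organizers' actual tie-breaking rule is not specified here:

```python
from collections import Counter

LABELS = ["NotHate", "Racist", "Sexist", "Homophobe", "Religion", "OtherHate"]

def resolve(votes):
    """Return the majority label among annotator votes.

    Ties are broken deterministically by fixed position in LABELS
    (an assumed convention, not necessarily the one used for this data).
    """
    counts = Counter(votes)
    best = max(counts.values())
    tied = [label for label in LABELS if counts.get(label, 0) == best]
    return tied[0]
```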

Target labels
The task is single-label multiclass classification over the six categories listed above. All categories appearing in the test set occur in the training set.

Submission format
- CSV file with header and two columns: [id, label]
- id: Must match exactly the id values from test.csv
- label: One of {NotHate, Racist, Sexist, Homophobe, Religion, OtherHate}
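A minimal writer for this format, using only the header and label strings stated above (the example ids are placeholders):

```python
import csv

VALID_LABELS = {"NotHate", "Racist", "Sexist", "Homophobe", "Religion", "OtherHate"}

def write_submission(ids, labels, path="submission.csv"):
    """Write a two-column submission CSV with the exact required header."""
    assert len(ids) == len(labels)
    # Label strings are case-sensitive; reject anything outside the six classes.
    assert all(label in VALID_LABELS for label in labels)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        writer.writerows(zip(ids, labels))

# Demo with placeholder ids; real ids must come from test.csv.
write_submission(["0001", "0002"], ["NotHate", "Racist"])
check = list(csv.reader(open("submission.csv")))
```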

Evaluation
- Macro-averaged F1 score across the six classes.
- For each class, compute F1 = 2 * (precision * recall) / (precision + recall). The final score is the unweighted mean of F1 over all classes.
- Predictions must specify a single label per test sample. Tie-breaking is up to the participant.
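The metric can be reproduced from scratch as follows. One convention is assumed: a class with zero predicted and zero true positives contributes F1 = 0 (this matches common practice, e.g. scikit-learn's default, but is not stated above):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over the given label set."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Degenerate classes (no support, no predictions) score 0 by assumption.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Because the mean is unweighted, rare classes count as much as NotHate, so per-class recall on minority labels matters more than raw accuracy.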

Recommended approaches (non-exhaustive)
- Text modeling: Normalize text, handle slang and typos, and leverage subword tokenizers, pretrained language models, and class-weighted losses.
- Image modeling: Fine-tune CNN/ViT backbones with data augmentation; consider CLIP-style multimodal encoders.
- Multimodal fusion: Early/late fusion, cross-attention, or co-training. Use the OCR text as an additional textual modality.
- Handling subjectivity/noise: Label smoothing, robust loss, ensembling, calibration, and abstention-aware decision rules.
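As one concrete starting point for the class-weighted loss mentioned above, per-class weights can be set inversely proportional to class frequency. This sketch uses the "balanced" formula n_samples / (n_classes * count) (an assumed convention; the toy label counts are illustrative):

```python
from collections import Counter

def inverse_frequency_weights(labels, classes):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumes every class in `classes` appears at least once in `labels`;
    absent classes would need explicit handling to avoid division by zero.
    """
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(classes) * counts[c]) for c in classes}

# Toy imbalanced distribution: 3 NotHate vs. 1 Racist.
weights = inverse_frequency_weights(
    ["NotHate", "NotHate", "NotHate", "Racist"], ["NotHate", "Racist"]
)
```

The resulting dictionary can be passed (as a tensor in class order) to a weighted cross-entropy loss so that minority hate classes are not drowned out by NotHate.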

Important constraints
- Do not use external resources that reveal labels for the provided samples.
- The metadata CSV files do not include any file paths; use the image and id names to locate assets within the provided folders.
- All evaluation is case-sensitive for label names. Use the exact category strings.
