UCF101 Action Recognition Challenge

Problem statement
Identify the human action depicted in each short video clip. Each clip belongs to exactly one of 101 action classes (e.g., PlayingGuitar, Kayaking, BasketballDunk). You are given a labeled training set of videos and an unlabeled test set. Your goal is to build an accurate action classifier that generalizes to the test set.

Data description
- train_videos/: video files for training (filenames are anonymized to avoid label leakage)
- test_videos/: video files for inference (filenames are anonymized)
- train.csv: mapping of training video ids to labels and file names
  - id: unique identifier for a video
  - file_name: the corresponding file name in train_videos/
  - label: the action class for this video (string; one of 101 classes)
- test.csv: mapping of test video ids to file names
  - id: unique identifier for a video in the test set
  - file_name: the corresponding file name in test_videos/
- sample_submission.csv: example of a valid submission file (random labels drawn from the label set). Use this as a template.
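Since file names are anonymized, everything routes through the CSVs. As a minimal sketch (assuming pandas is available; the function name `load_split` and the derived `path` column are illustrative, not part of the provided files), loading a split and attaching full video paths looks like:

```python
import pandas as pd
from pathlib import Path


def load_split(csv_path, video_dir):
    """Read an id -> file_name mapping CSV and attach a full path column.

    Works for train.csv (which also has a `label` column) and test.csv
    (which does not), per the column descriptions above.
    """
    df = pd.read_csv(csv_path)
    df["path"] = df["file_name"].map(lambda f: str(Path(video_dir) / f))
    return df
```

Train videos would then be opened via `df["path"]` rather than by guessing names in `train_videos/`.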

Important notes
- Video file names are randomized and do not contain label information. Always use the CSV files to obtain labels and file locations.
- Videos are standard UCF101 clips (trimmed action segments) with varied viewpoints, motion, and lighting. All test labels are part of the same 101-class label space present in the training data.

Task and rules for participants
- Train any machine learning model(s) to predict the action class for each test video id.
- The final submission must follow the required CSV format outlined below.

Submission format
- A single CSV file with two columns and a header:
  - id: test video id (exactly as in test.csv)
  - label: predicted class name for that id (must be one of the 101 labels used in train.csv)
- The set of ids in your file must match test.csv exactly (no missing, no extra, no duplicates). Label names are case-sensitive and must match the training labels exactly.
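A quick self-check before uploading can catch most format rejections. The sketch below (assuming pandas; `validate_submission` is an illustrative helper, not a provided script) asserts the rules above: correct columns, exact id set, no duplicates, and only known labels.

```python
import pandas as pd


def validate_submission(sub, test_ids, train_labels):
    """Assert the submission rules: header, exact id set, known labels."""
    assert list(sub.columns) == ["id", "label"], "header must be id,label"
    assert not sub["id"].duplicated().any(), "duplicate ids found"
    assert set(sub["id"]) == set(test_ids), "id set must match test.csv exactly"
    unknown = set(sub["label"]) - set(train_labels)
    assert not unknown, f"labels not present in train.csv: {sorted(unknown)}"


# Typical usage (paths per the data description):
# sub = pd.read_csv("submission.csv")
# test_ids = pd.read_csv("test.csv")["id"]
# train_labels = pd.read_csv("train.csv")["label"].unique()
# validate_submission(sub, test_ids, train_labels)
```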

Evaluation metric
- Primary metric: Macro-averaged F1 score across the 101 classes.
  - For each class, we compute the F1 score (the harmonic mean of precision and recall), treating the class as the positive class and all others as negative; the final score is the unweighted mean across all classes.
  - This choice emphasizes balanced performance across classes and is robust to mild class imbalance.
- Secondary metric (display only): Top-1 accuracy may be reported on the leaderboard for reference; it does not affect the primary ranking.
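The metric definition above can be sketched in plain Python (equivalent to `sklearn.metrics.f1_score(..., average="macro")`; the function below is an illustrative reference implementation, not the official scorer):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class one-vs-rest F1 scores.

    For each class c, precision and recall treat c as positive and all
    other classes as negative; absent counts yield 0 rather than NaN.
    """
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because each class contributes equally regardless of its frequency, gains on rare classes move macro-F1 as much as gains on common ones.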

Competition files
- train_videos/
- test_videos/
- train.csv
- test.csv
- sample_submission.csv

Getting started baseline ideas
- Decode a fixed number of uniformly sampled frames per video and fine-tune an ImageNet-pretrained 2D CNN with temporal pooling.
- Use a pretrained 3D CNN/Video Transformer (e.g., R(2+1)D, X3D, TimeSformer) with 8–32 frame clips and simple aggregation.
- Add optical flow or motion vectors for motion-aware features.
- Apply standard techniques like class-weighted loss, label smoothing, and test-time augmentation to improve macro-F1.
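For the first baseline idea, the core decoding step is picking a fixed number of frame indices spread evenly across a clip. A dependency-free sketch (the helper name `uniform_frame_indices` is illustrative; actual decoding would use a library such as OpenCV, decord, or torchvision):

```python
def uniform_frame_indices(num_frames, num_samples):
    """Return num_samples frame indices spread uniformly over a clip.

    Each index is taken from the center of one of num_samples equal
    temporal segments, so coverage is even regardless of clip length.
    """
    if num_frames <= 0:
        return []
    step = num_frames / num_samples
    return [min(int((i + 0.5) * step), num_frames - 1)
            for i in range(num_samples)]
```

The selected frames can then be stacked into a batch, passed through a pretrained 2D CNN, and temporally pooled (e.g., mean over the frame dimension) before the classification head.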

Good luck and have fun building robust video understanding systems!