Title: Predict Congestion Event Duration from Geo-Spatiotemporal Traffic Signals

Problem statement
Given a large-scale, countrywide dataset of traffic congestion events (February 2016–September 2022) with geo-spatiotemporal, traffic, and weather attributes, predict the duration (in minutes) of each congestion event using only information available when the event starts. This is a regression problem that calls for careful data processing, feature engineering (spatial, temporal, and weather-driven), and modeling that generalizes across locations, seasons, and road networks.

Target
- Duration_Minutes: computed as the difference (in minutes) between EndTime and StartTime for each congestion event. To prevent label leakage, EndTime is removed from the provided train/test features. Your task is to predict Duration_Minutes for the events in test.csv using all other available signals at StartTime.

Provided files
- train.csv: Feature table including the target column Duration_Minutes. EndTime is removed; all other raw columns are provided to maximize modeling flexibility.
- test.csv: Same columns as train.csv except the target Duration_Minutes is not included. EndTime is removed.
- sample_submission.csv: A file with two columns [ID, Duration_Minutes] containing random but validly formatted placeholder predictions for all test IDs. Use this as a template for your submissions.

Data schema (key columns)
- ID: Unique congestion event identifier.
- Start_Lat, Start_Lng: Event location.
- StartTime: Start timestamp of the congestion event.
- Distance(mi): Affected segment length.
- DelayFromTypicalTraffic(mins), DelayFromFreeFlowSpeed(mins), Congestion_Speed: Traffic/flow indicators.
- Description, Street, City, County, State, Country, ZipCode: Textual and administrative attributes.
- LocalTimeZone: Local time zone of the event (useful for deriving local hour-of-day, day-of-week, season).
- WeatherStation_AirportCode, WeatherTimeStamp, Temperature(F), WindChill(F), Humidity(%), Pressure(in), Visibility(mi), WindDir, WindSpeed(mph), Precipitation(in), Weather_Event, Weather_Conditions: Weather context near StartTime.
- Duration_Minutes (train only): Target variable.
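
As a minimal sketch of the local-time features mentioned for LocalTimeZone, the snippet below converts StartTime to each event's local time and extracts hour, day-of-week, and month. It assumes that StartTime is recorded in UTC and that LocalTimeZone holds IANA zone names; neither is confirmed by this description, so check both against the data first. The helper name add_local_time_features and the output column names are hypothetical.

```python
import pandas as pd


def add_local_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add local hour, day-of-week, and month derived from StartTime.

    Assumptions (not confirmed by the task description): StartTime is in UTC
    and LocalTimeZone contains IANA time zone names.
    """
    out = df.copy()
    # Malformed timestamps become NaT instead of raising.
    utc = pd.to_datetime(out["StartTime"], errors="coerce", utc=True)
    for col in ("local_hour", "local_dow", "local_month"):
        out[col] = float("nan")
    # Convert one zone at a time; rows with a missing zone keep NaN features.
    for tz, idx in out.groupby("LocalTimeZone").groups.items():
        local = utc.loc[idx].dt.tz_convert(tz)
        out.loc[idx, "local_hour"] = local.dt.hour
        out.loc[idx, "local_dow"] = local.dt.dayofweek  # Monday = 0
        out.loc[idx, "local_month"] = local.dt.month
    return out


# Tiny illustrative frame; the values are made up.
events = pd.DataFrame({
    "StartTime": ["2021-07-01 12:00:00", "2021-07-01 12:00:00", "not a date"],
    "LocalTimeZone": ["America/New_York", "America/Los_Angeles", "America/Chicago"],
})
features = add_local_time_features(events)
```

An unrecognized zone name would raise inside tz_convert, so it may be worth mapping or filtering the LocalTimeZone values beforehand.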


Train/test split
- The split is temporal to emulate realistic deployment: train includes events that start before January 1, 2022; test includes events that start in 2022.
- This split encourages models that capture temporal dynamics and generalize across time while minimizing leakage.
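
One way to mirror this split locally is to hold out the most recent part of train as a validation set. The sketch below is an illustration, not part of the official setup: the helper name temporal_holdout and the 2021-01-01 cutoff are hypothetical choices, and it assumes StartTime parses to timezone-naive timestamps.

```python
import pandas as pd


def temporal_holdout(train: pd.DataFrame, cutoff: str = "2021-01-01"):
    """Split train into (fit, valid) by StartTime, mimicking the train/test split.

    Rows whose StartTime cannot be parsed fall into neither part.
    """
    start = pd.to_datetime(train["StartTime"], errors="coerce")
    cutoff_ts = pd.Timestamp(cutoff)
    fit_part = train[start < cutoff_ts]
    valid_part = train[start >= cutoff_ts]
    return fit_part, valid_part


# Made-up rows for illustration only.
toy_train = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "StartTime": ["2019-05-01 08:00:00", "2020-12-31 23:59:59",
                  "2021-01-01 00:00:00", "2021-09-15 17:30:00"],
    "Duration_Minutes": [30.0, 45.0, 12.0, 90.0],
})
fit_part, valid_part = temporal_holdout(toy_train)
```

Validating on the latest period gives an error estimate that is closer to what a model will see on 2022 events than a random split would.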

Evaluation
- Metric: Root Mean Squared Logarithmic Error (RMSLE) on Duration_Minutes.
  Rationale: event durations are positive and long-tailed, and multiplicative (relative) errors are typically more relevant than additive ones. RMSLE penalizes relative rather than absolute error, so short and long events contribute proportionally; it is less sensitive to extreme durations than RMSE, and the log transform stabilizes variance.
- Submission format: a CSV with columns [ID, Duration_Minutes] for all rows in test.csv. IDs must match those in test.csv exactly, with no duplicates.
- Score computation: The leaderboard score is RMSLE between your predictions and the hidden ground truth durations for test events.
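
For local validation, the metric can be sketched as follows. This assumes the common RMSLE definition based on log(1 + x); the exact leaderboard implementation is not specified here. Clipping negative predictions to zero is also an assumption, made so that the logarithm is always defined.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root Mean Squared Logarithmic Error using log(1 + x).

    Negative predictions are clipped to 0 (an assumption, not a stated rule).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_error)))


# Doubling a short and a long duration gives errors of similar size,
# which is the "proportional" behavior described in the rationale.
short_error = rmsle([10.0], [20.0])
long_error = rmsle([100.0], [200.0])
```

A common companion choice is to train a regressor on np.log1p(Duration_Minutes) and map predictions back with np.expm1, so that the training loss lines up with this metric.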

Rules and important notes
- Use only the columns provided in train.csv and test.csv. EndTime is intentionally removed to prevent leakage.
- Do not derive features from the target in train or from any future information not present at or before StartTime.
- Robust preprocessing is essential: handle missing values, heterogeneous types, and any malformed entries.
- You are free to use any ML approach.
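
The preprocessing note above can be sketched as below: coerce malformed numeric entries to missing, record a missing-value flag, impute with a median, and normalize categorical text. The column subsets, the "Unknown" placeholder, median imputation, and the helper name clean_features are all illustrative assumptions, not requirements of the task.

```python
import pandas as pd

# Illustrative subsets of the schema; extend as needed.
NUMERIC_COLS = ["Temperature(F)", "Humidity(%)"]
CATEGORICAL_COLS = ["State", "Weather_Conditions"]


def clean_features(df: pd.DataFrame, medians: pd.Series = None):
    """Return (cleaned frame, medians used for imputation).

    Pass the medians computed on train when cleaning test, so both tables
    are imputed with the same values.
    """
    out = df.copy()
    for col in NUMERIC_COLS:
        # Malformed entries such as "n/a" become NaN instead of raising.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        out[col + "_missing"] = out[col].isna().astype("int8")
    if medians is None:
        medians = out[NUMERIC_COLS].median()
    out[NUMERIC_COLS] = out[NUMERIC_COLS].fillna(medians)
    for col in CATEGORICAL_COLS:
        out[col] = out[col].fillna("Unknown").astype(str).str.strip()
    return out, medians


# Made-up rows with typical problems: a malformed number, gaps, stray spaces.
raw = pd.DataFrame({
    "Temperature(F)": ["70", "n/a", "50"],
    "Humidity(%)": [40, None, 60],
    "State": ["CA", None, " NY "],
    "Weather_Conditions": ["Fair", "Rain", None],
})
cleaned, train_medians = clean_features(raw)
```

Tree-based models can often consume the missing flags directly; for linear models the imputed values usually matter more.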


File size and performance considerations
- The dataset is large. Efficient, streaming-friendly preprocessing and modeling pipelines are recommended.
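
As one hedged illustration of a streaming-friendly step, the sketch below profiles missing values by reading the CSV in chunks with pandas instead of loading it at once. The helper name missing_counts and the default chunk size are hypothetical; an in-memory buffer stands in for train.csv so the example runs on its own.

```python
import io

import pandas as pd


def missing_counts(source, chunksize: int = 100_000):
    """Return (row count, per-column missing-value counts) in one streaming pass."""
    total_rows = 0
    missing = None
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total_rows += len(chunk)
        counts = chunk.isna().sum()
        missing = counts if missing is None else missing.add(counts, fill_value=0)
    return total_rows, missing.astype(int)


# Stand-in for a real file path such as "train.csv"; the rows are made up.
toy_csv = io.StringIO(
    "ID,State,Temperature(F)\n"
    "1,CA,70\n"
    "2,,\n"
    "3,NY,\n"
)
n_rows, n_missing = missing_counts(toy_csv, chunksize=2)
```

The same loop shape works for other one-pass statistics, and passing usecols or dtype to read_csv can reduce memory further.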

Deliverables
- A submission CSV in the exact format [ID, Duration_Minutes].
- Reproducible code is encouraged: ensure deterministic preprocessing and modeling where applicable.
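
Before uploading, it can help to assemble the submission through a small check like the sketch below, which enforces the two-column layout, unique IDs, and finite values. The helper name build_submission is hypothetical, the ID values are made up, and clipping negative predictions to zero is an assumption rather than a stated rule.

```python
import numpy as np
import pandas as pd


def build_submission(test_ids, predictions) -> pd.DataFrame:
    """Assemble an [ID, Duration_Minutes] table and run basic sanity checks."""
    ids = pd.Series(list(test_ids), name="ID")
    preds = np.asarray(predictions, dtype=float)
    if len(ids) != len(preds):
        raise ValueError("Number of predictions does not match number of IDs.")
    if ids.duplicated().any():
        raise ValueError("Duplicate IDs found.")
    if not np.isfinite(preds).all():
        raise ValueError("Predictions contain NaN or infinite values.")
    # Durations cannot be negative; clip as a safeguard (assumption).
    return pd.DataFrame({"ID": ids, "Duration_Minutes": np.clip(preds, 0.0, None)})


submission = build_submission(["C-1", "C-2"], [12.5, -3.0])
# submission.to_csv("submission.csv", index=False)  # write without the index column
```

Comparing the resulting ID column against the IDs in test.csv is a further check worth adding once the real files are loaded.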
