[
  {
        "file_path": "aaronisomaisom3_canine-wellness-dataset-synthetic-10k-samples_class",
        "question_v1": "Please complete the task as described below without asking any follow-up questions or requesting additional information, and proceed under the assumption that all required information is provided. You have access to a training SQLite file named `train_v1.sqlite` that contains multiple tables sharing the common key column `row_id`, and you must join these tables on `row_id` to reconstruct the full training dataset. You are also provided with a metadata file `metadata.txt` that describes the original dataset and each of its columns. Your goal is to perform binary classification using this data and to predict the `Healthy` column for the validation set located in `val_v1.sqlite`. Unless stated otherwise, use default parameters for all steps, including model training and preprocessing.  \n\nBegin by reconstructing the full training dataset by joining all tables on `row_id`. Next, filter the data by enforcing the expected values for each feature while retaining any rows that have missing values in the relevant columns and excluding only those rows where a non-missing value violates the expected range. 
Specifically, ensure that the size classification based on breed is one of Medium, Small, or Large; whether the dog is currently on medication is Yes or No; the average number of hours of play per day is between 0.0 and 4.0; the weight of the dog in pounds is between 10.0 and 109.0; the age of the dog in years is between 1.0 and 13.0; the average daily walking distance is between 0.0 and 8.0; the owner\u2019s lifestyle or activity level is Active, None, Low, Moderate, or Very Active; the dog food brand or \u201cSpecial\u201d for home-cooked meals is one of Wellness, Special, Purina, Iams, Blue Buffalo, Royal Canin, Nutro, Pedigree, or Hill\u2019s Science; the sterilization status is Spayed, Neutered, or None; the biological sex of the dog is Male or Female; the dog breed is one of Australian Shepherd, Dachshund, Chihuahua, Siberian Husky, Boxer, Labrador Retriever, Bulldog, Rottweiler, German Shepherd, Golden Retriever, Poodle, Doberman, Great Dane, Beagle, or Yorkshire Terrier; the history of seizures is No or Yes; the average number of hours the dog sleeps per day is between 8.0 and 14.0; the number of veterinary visits per year is between 0.0 and 4.0; the average local temperature is between 30.0 and 100.0; the dog\u2019s average daily activity level is None, Very Active, Active, Moderate, or Low; whether other pets live in the same home is No or Yes; and the type of diet is Home cooked, Wet food, Special diet, or Hard food.  \n\nOnce the data is filtered, select only the following features in the specified order for training: type of diet; average number of hours of play per day; average daily walking distance; size classification based on breed; weight of the dog in pounds; sterilization status; dog breed; average number of hours the dog sleeps per day; dog\u2019s average daily activity level; owner\u2019s lifestyle or activity level; and biological sex of the dog.  
\n\nTreat the features average number of hours of play per day, average daily walking distance, weight of the dog in pounds, and average number of hours the dog sleeps per day as numeric features in the listed order. Treat the features type of diet, size classification based on breed, sterilization status, dog breed, dog\u2019s average daily activity level, owner\u2019s lifestyle or activity level, and biological sex of the dog as categorical features in the listed order.  \n\nHandle missing values by imputing the median for numeric features and imputing the most frequent category for categorical features. Preprocess the data by applying a standard scaler to the numeric features and one-hot encoding to the categorical features with handle_unknown set to ignore and sparse_output set to False.  \n\nTrain a single LinearSVC model using scikit-learn with random_state set to 62. Finally, make predictions on the validation set and save the results to a CSV file at `prediction.csv`, ensuring that the file contains the column `row_id` as provided in the original `val_v1.sqlite` and the corresponding predictions aligned with each `row_id` so that performance can be computed correctly.",
        "question_v2": "Please complete the task as described below without asking any follow-up questions or requesting additional information. Your task is to achieve good performance while balancing training time and accuracy in the sandbox. You are provided with a processed dataset, along with a metadata file `metadata.txt` that describes the original dataset and each of its columns. Load data in `train_v2.sqlite` as your full training set. Once that's done, using only CPU resources, train a classification model on this training data. Then use `val_v2.sqlite` to generate a prediction for `Healthy` for each `row_id`. The model should output a file named `prediction.csv`. The file must contain the column row_id (as provided in the original `val_v2.sqlite`) and the corresponding predictions. Each prediction should be aligned with its row_id so that performance can be computed correctly.",
        "available_tools": [
            "python_executor"
        ],
        "task": "classification",
        "usability_rating": 1.0,
        "lastUpdated": "2025-04-14T14:01:14.000Z",
        "needed_files_v1": [
            "train_v1.sqlite",
            "val_v1.sqlite",
            "metadata.txt"
        ],
        "needed_files_v2": [
            "train_v2.sqlite",
            "val_v2.sqlite",
            "metadata.txt"
        ],
        "sandbox_time": 2.4926462173461914
    },
    {
        "file_path": "abdelrahmannasrsleem_us-monthly-retail-and-food-services-sales_ts",
        "question_v1": "Please complete the task as described below without asking any follow-up questions or requesting additional information. Your task is to achieve good performance while balancing time and accuracy in the sandbox environment. Proceed under the assumption that all required information is provided. You are provided with a processed dataset, along with a metadata file `metadata.txt` that describes the original dataset and each of its columns. Load data in `train.csv` as your full training set. The validation file `val_v1.csv` contains future time points where all non-target features are provided, but the target column `value` is unknown. Your task is to predict the target column `value` at those validation time points using the provided features. Train your model(s) on the training data using only CPU resources. Then generate predictions for `value` in `val_v1.csv` for each `row_id`. The model should output a file named `prediction.csv`. The file must contain the column row_id (as provided in the original `val_v1.csv`) and the corresponding predictions. Each prediction should be aligned with its row_id so that performance can be computed correctly.",
        "question_v2": "Please complete the task as described below without asking any follow-up questions or requesting additional information. Your task is to achieve good performance while balancing time and accuracy in the sandbox environment. You are provided with a processed dataset, along with a metadata file `metadata.txt` that describes the original dataset and each of its columns. Load the file `train.csv` as your complete training data. This file contains the raw, non-resampled time series data. Once that's done, using only CPU resources, train time series analysis model(s) on the training data. Then, based on the file `val_v2.csv`, forecast the target column `value`. Generate predictions exactly for each row_id present in the validation data. Output a file named `prediction.csv`. The file must contain the column row_id (as provided in the original `val_v2.csv`) and the corresponding predictions. Each prediction should be aligned with its row_id so that performance can be computed correctly.",
        "contain_multiple_files": true,
        "available_tools": [
            "python_executor"
        ],
        "task": "time_series_analysis",
        "usability_rating": 1.0,
        "lastUpdated": "2025-03-16T08:46:17.013Z",
        "needed_files_v1": [
            "train.csv",
            "val_v1.csv",
            "metadata.txt"
        ],
        "needed_files_v2": [
            "train.csv",
            "val_v2.csv",
            "metadata.txt"
        ]
    },
    {
        "file_path": "abhiramasdf_indian-used-cars-dataset_reg",
        "question_v1": "Please complete the task as described below without asking any follow-up questions or requesting additional information. Proceed under the assumption that all required information is provided. You are given access to a training CSV file named `train_v1.csv`, which contains a single table, and you are also provided with a metadata file `metadata.txt` that describes the original dataset and each of its columns. Your task is to perform regression using this data and predict the target column `price` for the validation set located at `val_v1.csv`. Unless stated otherwise, you should use default parameters for all steps including model training and preprocessing.\n\nFirst, load the training file directly, being careful to treat only empty strings (\"\") as missing values. Next, filter the training dataset using the expected ranges defined below, but retain any rows that have missing values in the relevant columns. In other words, exclude only rows that violate an expected range with a known (non-missing) value; if a value is missing (null), keep the row. 
The filter must ensure that the body type is one of hatchback, suv, sedan, muv, wagon, or crossover; the number of kilometers driven is between 167 and 149792 inclusive; the exterior color is one of silver, red, grey, white, blue, black, brown, golden, beige, orange, green, bronze, maroon, violet, yellow, or purple; the manufacturing year is between 2011 and 2025 inclusive; the fuel type is one of petrol, cng, diesel, hybrid, petrol+cng, or Petrol; the transmission type is manual or automatic; the number of previous owners is between 1 and 3 inclusive; the registration year is between 2011 and 2025 inclusive; and the city is one of delhi-ncr, hyderabad, noida, gurgaon, delhi, ahmedabad, chennai, kolkata, bangalore, pune, mumbai, jaipur, lucknow, chandigarh, faridabad, or ghaziabad.\n\nSelect only the features listed in the following order for training the model: fuel type, transmission type, exterior color, manufacturing year, registration year, and number of kilometers driven. The following features should be treated as numeric features, also maintaining their listed order: manufacturing year, registration year, number of kilometers driven. The following features should be treated as categorical features, also maintaining their listed order: fuel type, transmission type, exterior color.\n\nHandle missing values by imputing numeric features using the most frequent value and categorical features using \"most_frequent\". Preprocess the data by applying a standard scaler to the numeric features and one-hot encoding to the categorical features with handle_unknown set to ignore and sparse_output set to False. Train a single DecisionTreeRegressor model using the scikit-learn package with random_state set to 22. Finally, make predictions on the validation set and save the results to a csv file at `prediction.csv`. 
The file must contain the column row_id (as provided in the original `val_v1.csv`) and the corresponding predictions, with each prediction aligned with its row_id so that performance can be computed correctly.",
        "question_v2": "Please complete the task as described below without asking any follow-up questions or requesting additional information. Your task is to achieve good performance while balancing training time and accuracy in the sandbox. You are provided with a processed dataset, along with a metadata file `metadata.txt` that describes the original dataset and each of its columns. Load data in `train_v2.csv` as your full training set. Once that's done, using only CPU resources, train a regression model on this training data. Then use `val_v2.csv` to generate a prediction for `price` for each `row_id`. The model should output a file named `prediction.csv`. The file must contain the column row_id (as provided in the original `val_v2.csv`) and the corresponding predictions. Each prediction should be aligned with its row_id so that performance can be computed correctly.",
        "available_tools": [
            "python_executor"
        ],
        "task": "regression",
        "usability_rating": 1.0,
        "lastUpdated": "2025-08-18T14:48:45.397Z",
        "needed_files_v1": [
            "train_v1.csv",
            "val_v1.csv",
            "metadata.txt"
        ],
        "needed_files_v2": [
            "train_v2.csv",
            "val_v2.csv",
            "metadata.txt"
        ],
        "sandbox_time": 0.853938102722168
    }
]