{"task":{"0":"Load JSON data from files under a given directory, parse the JSON content into pairs of (file_name, list_of_json_object), then flatten the JSON objects into one place (one list_of_json_object). Add a universal index column to each JSON object, batch them, then transpose the rows into columns, and finally store the shuffled data in memory.","1":"Load and process data from a tar archive containing multiple files. Some different extension type files are compressed using bz2 and archived in one tar file in an S3 bucket. Load these files, decompress and process them into strings, then group them by file type. Batch the groups, shuffle, then unbatch them. Finally, the processed groups should be zipped with a list of filetype labels and held in memory.","2":"Load text data from various compressed files (bz2, tar, zip) under the given directories into one datapipe of (filename, str line) pairs. Keep only lines containing a specific keyword, then concatenate the processed data with additional data from another source. Finally, create mini-batches of data by a maximum token count.","3":"Load compressed csv.bz2 data from files under a given root directory, and verify that the file hashes are correct before decompressing them. Then parse the CSV files into rows of dictionaries, filter the rows with a given function, batch and map the data into (index, value_sum) pairs (make use of the batch_mapper_fn function). Then split the pairs data into training and validation sets, and save them to disk, the file name should be \"train_{index}.json\" or \"valid_{index}.json\". Also return the training and validation data in memory.","4":"Load JSON data from compressed files (zip or rar) under a given directory, parse the JSON data, convert it to a DataFrame, and create mini-batches of the DataFrame. 
The data should be cycled through a specified number of times, prefetched, and finally held in memory."},"prompt":{"0":"from torchdata.datapipes.iter import *\nfrom typing import List, Dict, Any\n\ndef group_key_fn(json_obj: Dict[str, Any]) -> str:\n    # Define the key to group JSON objects\n    return json_obj.get(\"group_key\", \"default_group\")\n\ndef build_json_data_pipe(\n        file_dir: str=\".\/torchdata-programming-tasks\/task_2\",  # Directory containing JSON files\n        batch_size: int=16,  # Batch size\n        index_column_name: str=\"index\"  # Name of the index column\n    ):\n    \"\"\"\n    Load JSON data from files under a given directory, parse the JSON content into pairs of (file_name, list_of_json_object), then flatten the JSON objects into one place (one list_of_json_object). Add a universal index column to each JSON object, batch them, then transpose the rows into columns, and finally store the shuffled data in memory.\n    \"\"\"","1":"from torchdata.datapipes.iter import *\nfrom typing import List, Tuple\nfrom io import BytesIO\ndef process_data_fn(batch: List[Tuple[str, BytesIO]]) -> List[str]:\n    # Process the batch of data\n    return [(filename, bytestream.read().decode('utf-8')) for filename, bytestream in batch]\n\ndef group_key_fn(file: Tuple[str, str]) -> str:\n    filename = file[0]\n    # Group by file extension\n    return filename.split('.')[-1]\n\ndef build_data_pipeline(\n        s3_prefix: str=\"s3:\/\/torchdata-programming-tasks\/task_3\/sample_data.tar\",  # S3 prefix to list files\n        label_list: List[str]=[\"filetype1\", \"filetype2\", \"filetype3\"],  # Sequence of labels\n        batch_size:int=2\n    ):\n    \"\"\"\n    Load and process data from a tar archive containing multiple files. Some different extension type files are compressed using bz2 and archived in one tar file in an S3 bucket. Load these files, decompress and process them into strings, then group them by file type. 
Batch the groups, shuffle, then unbatch them. Finally, the processed groups should be zipped with a list of filetype labels and held in memory.\n    \"\"\"","2":"from torchdata.datapipes.iter import *\nfrom typing import List, Tuple\n\n\ndef filter_keyword_fn(line: str) -> bool:\n    # note the input is a line of string, not a byte or stream\n    return \"KEYWORD\" in line\n\n\ndef build_text_data_pipe(\n        tar_file_dir: str = \".\/torchdata-programming-tasks\/task_5\/tar\",  # Directory containing tar files\n        zip_file_dir: str = \".\/torchdata-programming-tasks\/task_5\/zip\",  # Directory containing zip files\n        bz2_file_dir: str = \".\/torchdata-programming-tasks\/task_5\/bz2\",  # Directory containing bz2 files\n        additional_data: List[str] = [(\"extra_file.txt\", \"extra_data_1\"), (\"extra_file.txt\", \"extra_data_2\")], # Additional data to concatenate\n        max_token_count: int = 5  # Maximum token count for mini-batches\n):\n    \"\"\"\n    Load text data from various compressed files (bz2, tar, zip) under the given directories into one datapipe of (filename, str line) pairs. Keep only lines containing a specific keyword, then concatenate the processed data with additional data from another source. 
Finally, create mini-batches of data by a maximum token count.\n    \"\"\"","3":"from torchdata.datapipes.iter import *\nfrom typing import Tuple, List, Dict\nimport os\n\ndef filter_fn(row: Dict[str, str]) -> bool:\n    # Define the filter function to filter out rows based on a specific condition\n    return int(row['value']) > 10\n\n\nimport itertools\nimport json\ncounter = itertools.count()\ndef batch_mapper_fn(batch: List[Dict[str, str]]) -> List[Tuple[str, Dict[str, str]]]:\n    # Turn a batch of dict rows into one (index, {'sum': sum_of_values}) tuple\n    index = next(counter)\n    content = json.dumps({\"sum\": sum(int(item['value']) for item in batch)})\n    return [(str(index) , content)]\n\nimport hashlib\ndef _get_hash_dict(root):\n    # just to generate the gt hash as input, do not use this function in the solution\n    hash_dict = {}\n    for file in os.listdir(root):\n        filepath = os.path.join(root, file)\n        abs_filepath = os.path.abspath(filepath)\n        with open(filepath, 'rb') as f:\n            hash_val = hashlib.sha256(f.read()).hexdigest()\n            hash_dict[filepath] = hash_val\n            hash_dict[abs_filepath] = hash_val\n    return hash_dict\ndef build_csv_data_pipe(\n        root: str=\".\/torchdata-programming-tasks\/task_6\",  # Directory containing CSV files\n        save_dir: str=\".\/torchdata-programming-tasks\/outputs\/task_6\",  # Directory to save processed data\n        gt_hash_dict: Dict[str, str] = _get_hash_dict(\".\/torchdata-programming-tasks\/task_6\"),  # Dictionary of file hashes\n        batch_size: int = 32,  # Batch size\n        train_split: float = 0.8,  # Proportion of data to use for training\n        seed: int = 42  # Random seed for splitting\n    ):\n    \"\"\"\n    Load compressed csv.bz2 data from files under a given root directory, and verify that the file hashes are correct before decompressing them. 
Then parse the CSV files into rows of dictionaries, filter the rows with a given function, batch and map the data into (index, value_sum) pairs (make use of the batch_mapper_fn function). Then split the pairs data into training and validation sets, and save them to disk, the file name should be \"train_{index}.json\" or \"valid_{index}.json\". Also return the training and validation data in memory.\n    \"\"\"","4":"from torchdata.datapipes.iter import *\nfrom typing import List, Tuple\nimport torch\n\ndef build_json_data_pipe(\n        root: str=\".\/torchdata-programming-tasks\/task_8\",  # Directory containing compressed files\n        batch_size: int=16,  # Batch size\n        cycle_count: int=2,  # Number of times to cycle through the data\n    ):\n    \"\"\"\n    Load JSON data from compressed files (zip or rar) under a given directory, parse the JSON data, convert it to a DataFrame, and create mini-batches of the DataFrame. The data should be cycled through a specified number of times, prefetched, and finally held in memory.\n    \"\"\""},"canonical_solution":{"0":"from torchdata.datapipes.iter import IterableWrapper, FSSpecFileLister, ShardExpander, FileOpener, JsonParser, FlatMapper, IndexAdder, Batcher, Rows2Columnar, Multiplexer, SampleMultiplexer, Shuffler, InMemoryCacheHolder\nfrom typing import List, Dict, Any\n\ndef group_key_fn(json_obj: Dict[str, Any]) -> str:\n    # Define the key to group JSON objects\n    return json_obj.get(\"group_key\", \"default_group\")\n\ndef build_json_data_pipe(\n        file_dir: str=\".\/torchdata-programming-tasks\/task_2\",  # Directory containing JSON files\n        batch_size: int=16,  # Batch size\n        index_column_name: str=\"index\"  # Name of the index column\n    ):\n    \"\"\"\n    Load JSON data from files under a given directory, parse the JSON content into pairs of (file_name, list_of_json_object), then flatten the JSON objects into one place (one list_of_json_object). 
Add a universal index column to each JSON object, batch them, then transpose the rows into columns, and finally store the shuffled data in memory.\n    \"\"\"\n    dp = FSSpecFileLister(root=IterableWrapper([file_dir]), masks=\"*.json\")  # List JSON files\n    dp = ShardExpander(dp)  # Expand incoming shard strings into shards\n    dp = FileOpener(dp, mode='rb')  # Open files\n    dp = JsonParser(dp)  # Parse JSON files (file_name, list_of_row_dicts)\n    dp = FlatMapper(dp, lambda x: x[1]) # Flatten the list of row dicts\n    dp = IndexAdder(dp, index_name=index_column_name)  # Add index to JSON objects\n    dp = Batcher(dp, batch_size=batch_size, drop_last=False)\n    dp = Rows2Columnar(dp)  # Convert rows to columnar format\n    dp = Multiplexer(dp)  # Multiplex data\n    dp = SampleMultiplexer({dp: 1.0})  # Sample data\n    dp = Shuffler(dp)  # Shuffle data\n    dp = InMemoryCacheHolder(dp) # store the data to memory\n    return dp\n","1":"from torchdata.datapipes.iter import BatchMapper, Batcher, Bz2FileLoader, Grouper, IterableWrapper, S3FileLister, S3FileLoader, TarArchiveLoader, UnBatcher, Zipper, Shuffler, InMemoryCacheHolder\nfrom typing import List, Tuple\nfrom io import BytesIO\ndef process_data_fn(batch: List[Tuple[str, BytesIO]]) -> List[str]:\n    # Process the batch of data\n    return [(filename, bytestream.read().decode('utf-8')) for filename, bytestream in batch]\n\ndef group_key_fn(file: Tuple[str, str]) -> str:\n    filename = file[0]\n    # Group by file extension\n    return filename.split('.')[-1]\n\ndef build_data_pipeline(\n        s3_prefix: str=\"s3:\/\/torchdata-programming-tasks\/task_3\/sample_data.tar\",  # S3 prefix to list files\n        label_list: List[str]=[\"filetype1\", \"filetype2\", \"filetype3\"],  # Sequence of labels\n        batch_size:int=2\n    ):\n    \"\"\"\n    Load and process data from a tar archive containing multiple files. 
Some different extension type files are compressed using bz2 and archived in one tar file in an S3 bucket. Load these files, decompress and process them into strings, then group them by file type. Batch the groups, shuffle, then unbatch them. Finally, the processed groups should be zipped with a list of filetype labels and held in memory.\n    \"\"\"\n    dp = IterableWrapper([s3_prefix])\n    dp = S3FileLister(dp)  # List files from S3\n    dp = S3FileLoader(dp)  # Load files from S3\n    dp = TarArchiveLoader(dp)  # Load tar archives\n    dp = Bz2FileLoader(dp)  # Decompress bz2 files\n    dp = BatchMapper(dp, fn=process_data_fn, batch_size=10)  # Process data in batches\n    dp = Grouper(dp, group_key_fn=group_key_fn)  # Group by file type\n    dp = Batcher(dp, batch_size=batch_size)  # Batch the grouped data\n    dp = Shuffler(dp)  # Shuffle the data\n    dp = UnBatcher(dp)  # Unbatch the data\n    label_dp = IterableWrapper(label_list)  # Create a sequence wrapper\n    dp = Zipper(dp, label_dp)  # Zip processed data with the sequence\n    dp = InMemoryCacheHolder(dp)  # Save data in memory\n    return dp","2":"from torchdata.datapipes.iter import Bz2FileLoader, LineReader, FSSpecFileOpener, FileLister, Concater, Filter, Flattener, MaxTokenBucketizer, Multiplexer, TarArchiveLoader, ZipArchiveLoader, IterableWrapper\nfrom typing import List, Tuple\n\n\ndef filter_keyword_fn(line: str) -> bool:\n    # note the input is a line of string, not a byte or stream\n    return \"KEYWORD\" in line\n\n\ndef build_text_data_pipe(\n        tar_file_dir: str = \".\/torchdata-programming-tasks\/task_5\/tar\",  # Directory containing tar files\n        zip_file_dir: str = \".\/torchdata-programming-tasks\/task_5\/zip\",  # Directory containing zip files\n        bz2_file_dir: str = \".\/torchdata-programming-tasks\/task_5\/bz2\",  # Directory containing bz2 files\n        additional_data: List[str] = [(\"extra_file.txt\", \"extra_data_1\"), (\"extra_file.txt\", 
\"extra_data_2\")],\n        # Additional data to concatenate\n        max_token_count: int = 5  # Maximum token count for mini-batches\n):\n    \"\"\"\n    Load text data from various compressed files (bz2, tar, zip) under the given directories into one datapipe of (filename, str line) pairs. Keep only lines containing a specific keyword, then concatenate the processed data with additional data from another source. Finally, create mini-batches of data by a maximum token count.\n    \"\"\"\n    # List files in the directory\n    tar_dp = FileLister(root=tar_file_dir, recursive=True)\n    zip_dp = FileLister(root=zip_file_dir, recursive=True)\n    bz2_dp = FileLister(root=bz2_file_dir, recursive=True)\n\n    # Open tar files\n    tar_dp = Filter(tar_dp, filter_fn=lambda x: x.endswith('.tar'))\n    tar_dp = FSSpecFileOpener(tar_dp, mode='rb')\n    tar_dp = TarArchiveLoader(tar_dp)\n    # Open bz2 files\n    bz2_dp = Filter(bz2_dp, filter_fn=lambda x: x.endswith('.bz2'))\n    bz2_dp = FSSpecFileOpener(bz2_dp, mode='rb')\n    bz2_dp = Bz2FileLoader(bz2_dp)\n    # Open zip files\n    zip_dp = Filter(zip_dp, filter_fn=lambda x: x.endswith('.zip'))\n    zip_dp = FSSpecFileOpener(zip_dp, mode='rb')\n    zip_dp = ZipArchiveLoader(zip_dp)\n    # Combine all file streams\n    dp = Multiplexer(bz2_dp, tar_dp, zip_dp)\n    # Read lines from files\n    dp = LineReader(dp, decode=True)\n    # Keep only lines containing the keyword\n    dp = Filter(dp, filter_fn=filter_keyword_fn, input_col=1)\n    # Flatten the data structure\n    dp = Flattener(dp)\n    # Concatenate with additional data\n    additional_dp = IterableWrapper(additional_data)\n    dp = Concater(dp, additional_dp)\n    # Create mini-batches with a maximum token count\n    dp = MaxTokenBucketizer(dp, max_token_count=max_token_count)\n\n    return dp","3":"from torchdata.datapipes.iter import BatchMapper, Bz2FileLoader, CSVDictParser, FileLister, Filter, HashChecker, InMemoryCacheHolder, IoPathSaver, RandomSplitter, 
FileOpener\nfrom typing import Tuple, List, Dict\nimport os\n\ndef filter_fn(row: Dict[str, str]) -> bool:\n    # Define the filter function to filter out rows based on a specific condition\n    return int(row['value']) > 10\n\n\nimport itertools\nimport json\ncounter = itertools.count()\ndef batch_mapper_fn(batch: List[Dict[str, str]]) -> List[Tuple[str, Dict[str, str]]]:\n    # Turn a batch of dict rows into one (index, {'sum': sum_of_values}) tuple\n    index = next(counter)\n    content = json.dumps({\"sum\": sum(int(item['value']) for item in batch)})\n    return [(str(index) , content)]\n\nimport hashlib\ndef _get_hash_dict(root):\n    # just to generate the gt hash as input, do not use this function in the solution\n    hash_dict = {}\n    for file in os.listdir(root):\n        filepath = os.path.join(root, file)\n        abs_filepath = os.path.abspath(filepath)\n        with open(filepath, 'rb') as f:\n            hash_val = hashlib.sha256(f.read()).hexdigest()\n            hash_dict[filepath] = hash_val\n            hash_dict[abs_filepath] = hash_val\n    return hash_dict\ndef build_csv_data_pipe(\n        root: str=\".\/torchdata-programming-tasks\/task_6\",  # Directory containing CSV files\n        save_dir: str=\".\/torchdata-programming-tasks\/outputs\/task_6\",  # Directory to save processed data\n        gt_hash_dict: Dict[str, str] = _get_hash_dict(\".\/torchdata-programming-tasks\/task_6\"),  # Dictionary of file hashes\n        batch_size: int = 32,  # Batch size\n        train_split: float = 0.8,  # Proportion of data to use for training\n        seed: int = 42  # Random seed for splitting\n    ):\n    \"\"\"\n    Load compressed csv.bz2 data from files under a given root directory, and verify that the file hashes are correct before decompressing them. Then parse the CSV files into rows of dictionaries, filter the rows with a given function, batch and map the data into (index, value_sum) pairs (make use of the batch_mapper_fn function). 
Then split the pairs data into training and validation sets, and save them to disk, the file name should be \"train_{index}.json\" or \"valid_{index}.json\". Also return the training and validation data in memory.\n    \"\"\"\n    dp = FileLister(root=root, masks=\"*.csv.bz2\", recursive=True)  # List CSV files\n    dp = FileOpener(dp, mode='rb')  # Open files\n    dp = HashChecker(dp, hash_dict=gt_hash_dict, hash_type='sha256', rewind=True)  # Check file hashes\n    dp = Bz2FileLoader(dp)  # Decompress bz2 files\n    dp = CSVDictParser(dp)  # Parse CSV data into dictionaries\n    dp = Filter(dp, filter_fn=filter_fn)  # Filter rows based on a specific condition\n    dp = BatchMapper(dp, fn=batch_mapper_fn, batch_size=batch_size)  # Apply a function to each batch\n    dp = RandomSplitter(dp, weights={\"train\": train_split, \"valid\": 1 - train_split}, seed=seed, total_length=len(list(dp)))  # Split data into training and validation sets\n    train_dp, valid_dp = dp[0], dp[1]\n    train_save_dp = IoPathSaver(train_dp, mode='w', filepath_fn=lambda x: f\"{save_dir}\/train_{x}.json\")  # Save training data\n    valid_save_dp = IoPathSaver(valid_dp, mode='w', filepath_fn=lambda x: f\"{save_dir}\/valid_{x}.json\")  # Save validation data\n    list(train_save_dp)\n    list(valid_save_dp)\n    train_dp = InMemoryCacheHolder(train_dp)  # Save training data in memory\n    valid_dp = InMemoryCacheHolder(valid_dp)  # Save validation data in memory\n    return train_dp, valid_dp","4":"from torchdata.datapipes.iter import FileLister, FileOpener, JsonParser, Batcher, Cycler, Prefetcher, InMemoryCacheHolder, Demultiplexer,ZipArchiveLoader, TarArchiveLoader, Concater\nfrom typing import List, Tuple\nimport torch\n\ndef build_json_data_pipe(\n        root: str=\".\/torchdata-programming-tasks\/task_8\",  # Directory containing compressed files\n        batch_size: int=16,  # Batch size\n        cycle_count: int=2,  # Number of times to cycle through the data\n    ):\n    \"\"\"\n    
Load JSON data from compressed files (zip or rar) under a given directory, parse the JSON data, convert it to a DataFrame, and create mini-batches of the DataFrame. The data should be cycled through a specified number of times, prefetched, and finally held in memory.\n    \"\"\"\n    dp = FileLister(root=root, masks=[\"*.zip\", \"*.tar\"], recursive=True)  # List compressed files\n    dp = FileOpener(dp, mode='b')  # Open files\n    # Demultiplex based on file extension\n    def classifier_fn(data: Tuple[str, bytes]) -> int:\n        if data[0].endswith('.zip'):\n            return 0\n        elif data[0].endswith('.tar'):\n            return 1\n        return None\n    zip_dp, tar_dp = Demultiplexer(dp, num_instances=2, classifier_fn=classifier_fn, drop_none=True)\n    zip_dp = ZipArchiveLoader(zip_dp)\n    tar_dp = TarArchiveLoader(tar_dp)\n    dp = Concater(zip_dp,tar_dp)\n    dp = JsonParser(dp)  # Parse JSON data\n    dp = Batcher(dp, batch_size=batch_size, drop_last=False)  # Create mini-batches\n    dp = Cycler(dp, count=cycle_count)  # Cycle through the data\n    dp = Prefetcher(dp, buffer_size=10)  # Prefetch data\n    dp = InMemoryCacheHolder(dp)  # Hold data in memory\n    return dp"},"test":{"0":"list_dp = list(build_json_data_pipe(file_dir=\".\/torchdata-programming-tasks\/task_2\", batch_size=16, index_column_name=\"index\"))\nassert len(list_dp) == 32, \"length of the data pipe is not correct\"\nassert isinstance(list_dp[0], dict), \"type of data pipe elements is not dict type\"\nassert 'index' in list_dp[0], \"index is not added to the data pipe elements\"\nassert len(list_dp[0]['index']) == 16 or len(list_dp[0]['index']) == 4, \"batch size of data pipe is not correct\"","1":"list_dp = list(build_data_pipeline(s3_prefix=\"s3:\/\/torchdata-programming-tasks\/task_3\/sample_data.tar\", label_list=[\"filetype1\", \"filetype2\", \"filetype3\"], batch_size=2))\nassert len(list_dp) == 3, \"length of the data pipe is not correct, the data is not grouped by 
file type correctly\"\nassert isinstance(list_dp[0], tuple) and len(list_dp[0])==2, \"type of data pipe elements format is not correct\"\nassert list_dp[0][0].startswith('filetype') if isinstance(list_dp[0][0], str) else list_dp[0][1].startswith('filetype'), \"label is not in the data pipe\"","2":"list_dp = list(build_text_data_pipe())\nassert len(list_dp) == 3, \"length of the data pipe is not correct\"\nassert isinstance(list_dp[0], list), \"data is not batched\"\nif isinstance(list_dp[0][0], tuple):\n    assert all([(\"KEYWORD\" in line or \"extra_data\" in line) for batch in list_dp for filename, line in\n                batch]), \"data is not filtered correctly\"\nelif isinstance(list_dp[0][0], str):\n    assert all([(\"KEYWORD\" in line or \"extra_data\" in line) for batch in list_dp for line in\n                batch]), \"data is not filtered correctly\"\nelse:\n    assert False, \"data pipe elements format is not correct\"","3":"import hashlib\ndef _get_hash_dict(root):\n    # just to generate the gt hash as input, do not use this function in the solution\n    hash_dict = {}\n    for file in os.listdir(root):\n        filepath = os.path.join(root, file)\n        abs_filepath = os.path.abspath(filepath)\n        with open(filepath, 'rb') as f:\n            hash_val = hashlib.sha256(f.read()).hexdigest()\n            hash_dict[filepath] = hash_val\n            hash_dict[abs_filepath] = hash_val\n    return hash_dict\nroot=\".\/torchdata-programming-tasks\/task_6\"\nsave_dir=\".\/torchdata-programming-tasks\/outputs\/task_6\"\ngt_hash_dict = _get_hash_dict(\".\/torchdata-programming-tasks\/task_6\")\ntrain_dp, valid_dp = build_csv_data_pipe(root=root, save_dir=save_dir, gt_hash_dict=gt_hash_dict, batch_size=32, train_split=0.8, seed=42)\nassert len(list(train_dp)) == 6, \"length of the training data pipe is not correct\"\nassert len(list(valid_dp)) == 2, \"length of the validation data pipe is not correct\"\nassert isinstance(list(train_dp)[0], tuple), \"type 
of training data pipe elements is not tuple type\"\nassert isinstance(list(valid_dp)[0], tuple), \"type of validation data pipe elements is not tuple type\"\nassert os.listdir(\".\/torchdata-programming-tasks\/outputs\/task_6\") != [], \"the processed data is not saved to disk\"","4":"list_dp = list(build_json_data_pipe(root=\".\/torchdata-programming-tasks\/task_8\", batch_size=4, cycle_count=2))\nassert len(list_dp) == 6, \"length of the data pipe is not correct\"\nassert len(list_dp[0]) == 4, \"batch size of data pipe is not correct\"\nassert isinstance(list_dp[0], torch.torch.utils.data.datapipes.datapipe.DataChunk), \"batch of data pipe is not generated correctly\""}}