Texylon: Dataset of Log-to-Description and Description-to-Log Generation for Text Analytics Tools

Published: 2024, Last Modified: 05 Jan 2026, Venue: JSAI-isAI 2024, License: CC BY-SA 4.0
Abstract: We propose two tasks for text analytics tools, description generation and log estimation, together with a dataset for solving them. Text analytics tools, popular in the digital humanities, provide researchers with various text-processing functions based on natural language processing (NLP). The first task, description generation from operational logs, helps these researchers write papers easily and accurately. The second task, log estimation from a paper, is the reverse and helps readers reproduce the analyses reported in the paper. For these tasks, we created a dataset consisting of descriptions and logs from text analytics experiments. Because our dataset is not large enough, we also propose several data augmentation methods: swapping the value of each configuration with another value, pseudo-labeling using BERT and NER, pseudo-labeling using BERT and T5, and a combination of these. The highest BLEU score in the description generation task is 36.98, and the F1 score in the log generation task is around 0.7.
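The first augmentation method named in the abstract, swapping a configuration value with another value, can be illustrated with a minimal sketch. The abstract does not give the log format, so everything here is an assumption made for illustration: the log is modeled as a flat dictionary, the keys `analysis` and `min_count` and the helper `swap_augment` are hypothetical, and the description is kept in sync by plain string replacement. This is not the authors' implementation.

```python
import random


def swap_augment(log, description, alternatives, rng):
    """Create new (log, description) pairs, each with one configuration value swapped.

    `log` is a hypothetical flat dict of configuration values, and `alternatives`
    maps each key to other values that key may take. Both are assumptions for
    this sketch, not the dataset's actual schema.
    """
    augmented = []
    for key, value in log.items():
        text_value = str(value)
        # Swap only values that appear verbatim in the description,
        # so the text and the log stay consistent after the swap.
        if text_value not in description:
            continue
        candidates = [v for v in alternatives.get(key, []) if v != value]
        if not candidates:
            continue
        new_value = rng.choice(candidates)
        new_log = {**log, key: new_value}
        # Naive replacement: it rewrites every occurrence of the old value,
        # so a real pipeline would need proper span alignment.
        new_description = description.replace(text_value, str(new_value))
        augmented.append((new_log, new_description))
    return augmented


# Hypothetical example data (not taken from the Texylon dataset).
log = {"analysis": "word frequency", "min_count": 5}
description = "We ran a word frequency analysis with a minimum count of 5."
alternatives = {
    "analysis": ["word frequency", "co-occurrence"],
    "min_count": [5, 10],
}

pairs = swap_augment(log, description, alternatives, random.Random(0))
for new_log, new_description in pairs:
    print(new_log, "|", new_description)
```

Each generated pair differs from the original in exactly one configuration value, which keeps the augmented description aligned with its log in both task directions.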