Introducing "Forecast Utterance" for Conversational Data Science

Published: 20 Mar 2024, Last Modified: 20 Mar 2024. Accepted by TMLR.
Abstract: Envision an intelligent agent capable of assisting users in conducting forecasting tasks through intuitive, natural conversations, without requiring in-depth knowledge of the underlying machine learning (ML) processes. A significant challenge for the agent in this endeavor is to accurately comprehend the user's prediction goals and, consequently, formulate precise ML tasks. In this paper, we take a pioneering step towards this ambitious goal by introducing a new concept called Forecast Utterance and then focus on the automatic and accurate interpretation of users' prediction goals from these utterances. Specifically, we frame the task as a slot-filling problem, where each slot corresponds to a specific aspect of the goal prediction task. We then employ two zero-shot methods for solving the slot-filling task, namely: 1) Entity Extraction (EE), and 2) Question-Answering (QA) techniques. Our experiments, evaluated with three meticulously crafted data sets, validate the viability of our ambitious goal and demonstrate the effectiveness of both EE and QA techniques in interpreting Forecast Utterances.
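To make the slot-filling framing concrete, the following is a minimal sketch of how a single forecast utterance could be posed as SQuAD-style question-answering records, one per slot. The slot names (`target_attribute`, `aggregation`, `filter_attribute`), the questions, and the example utterance are hypothetical illustrations and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical slots and questions (assumption: the paper's actual slot
# inventory is not listed on this page).
SLOT_QUESTIONS: Dict[str, str] = {
    "target_attribute": "What should be predicted?",
    "aggregation": "How should the target be aggregated?",
    "filter_attribute": "Which attribute restricts the data?",
}


def to_qa_records(utterance: str, gold: Dict[str, str]) -> List[dict]:
    """Build one SQuAD-style record per slot.

    Slots with no value in the utterance get empty answer lists,
    which is how unanswerable questions are commonly encoded.
    """
    records = []
    for slot, question in SLOT_QUESTIONS.items():
        answer = gold.get(slot, "")
        start = utterance.find(answer) if answer else -1
        found = start >= 0
        records.append({
            "id": slot,
            "context": utterance,
            "question": question,
            "answers": {
                "text": [answer] if found else [],
                "answer_start": [start] if found else [],
            },
        })
    return records


# Hypothetical utterance and gold labels for illustration only.
utterance = "Predict the average arrival delay for each airline"
gold = {"target_attribute": "arrival delay", "aggregation": "average"}
records = to_qa_records(utterance, gold)
for record in records:
    print(record["id"], record["answers"])
```

Under this framing, the Entity Extraction variant would instead tag the same spans token by token, while the QA variant asks one question per slot and reads the answer span from the utterance.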
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

# Revisions

We thank the action editors and the reviewers for their valuable feedback. We also thank them for considering our rebuttal and giving us the opportunity to revise the paper.

## Completed Action Items for Minor Revision

#### Action Item #1: Authors need to clarify the nature of the techniques. Are they really zero-shot learning?

We agree that the nature of the technique was unclear: calling our proposed technique zero-shot learning was misleading to readers. We have renamed our method to "Self-Supervised Learning with Synthetic Examples" throughout the paper.

#### Action Item #2: Is hand-crafted dataset needed for each specific domain? If so, how difficult is it to construct them. How does presenting them differently improve clarity of the paper?

Our methodology generates synthetic data for training the transformer-based language models. The hand-crafted dataset was created to evaluate our proposed method. We have added a discussion in Section 5.1 of the paper. This hand-crafted dataset demonstrates our method's ability to flexibly and effectively handle natural utterances and work with any dataset. For your convenience, we have included the revised part of the section here:

Our methodology stands out by dynamically processing any dataset provided by the user, instantly generating synthetic examples tailored to that dataset. Subsequently, it trains a transformer-based language model (LM), specifically designed to extract information relevant to the dataset directly from user utterances. This streamlined process ensures efficient and accurate retrieval of dataset-specific insights, catering to the immediate needs of users without the prerequisite of pre-established datasets. In the course of our research, we put our system to the test using three publicly available Kaggle datasets: Flight Delay (FD), Online Food Delivery Preferences (OD) and Student Performance (SP).
These datasets were chosen to cover a range of topics and data types, showcasing the versatility and robustness of our approach. To evaluate how well our method works, we created test sets for the three datasets using human volunteers with data science expertise, who generated utterances expressing forecasting goals. Each instance consists of a user utterance and the associated ground-truth slot-value labels. To minimize bias, three volunteers independently created and labeled the datasets.

## Other Comments

**Comment #1. Intro states “...each user may have unique data sets and datascience needs, it is unrealistic to assume that any training data is available to pre-train these conversational agents, making the supervised learning paradigm impractical.” Are there any stats or references that show whether user data science needs are diverse or not?**

We have added a reference to support our claim in the introduction.

**Comment #2. PeTEL seems like a schema of key-value pairs. Is there a fixed vocabulary or grammar syntax rules for this if calling it a language?**

We have renamed Prediction Task Expression Language to Prediction Task Expression Dictionary (PeTED), which aligns with its key-value nature.

**Comment #3. Section 4.2 mentions “Given the lack of a pre-existing training dataset tailored to each unique schema/domain, fine-tuning pre-trained models is unattainable.” But the methodology presented in the paper involves fine-tuning. So isn't this contradictory?**

We have renamed our method from "zero-shot approach" to "self-supervised learning with synthetic examples" to clear up the confusion about fine-tuning without training data. Details on fine-tuning a pre-trained language model are now included in Section 4.2 of our revised paper.

**Comment #4. Algorithm 3 in appendix does not mention how attributes are used to create the utterance in the keyword-to-user utterance task. Is this done for each attribute or a subset of attributes (i.e.
1 utterance for a randomly selected subset of attributes?). This is a crucial detail that's missing.**

We have included the details in Appendix A.1, which describes how attributes are used to create the utterance in the keyword-to-user utterance task.

**Comment #5. There are too many examples of ChatGPT’s output to qualitatively analyze the performance, but this isn’t of much relevance to the paper. Include more of the examples of the BERT Based models used. What do the generated/synthetic utterances look like? What does the CoNLL-like annotation dataset look like? What do the responses look like?**

In Appendix A.10 of the revised manuscript, we have included examples of user interaction with the BERT-based models. In Appendix A.1.1, we have included the annotation details for the **CoNLL-2003 standard** and the **SQuAD standard for Question Answering**, with examples.

**Comment #6. Almost all of A.1 is in Section 4.2. Please dedupe.**

In the revised paper, we have rewritten Appendix A.1 and removed its overlap with Section 4.2.
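As a rough illustration of what synthetic utterances with CoNLL-like annotation might look like, the sketch below fills hand-written templates with dataset attributes and emits one token per line with a BIO tag. This is not the procedure from Appendix A.1: the templates, the tag names (`AGG`, `ATTR`), the aggregation words, and the column names are all hypothetical, and the two-column layout is a simplification of the four-column CoNLL-2003 format.

```python
import random
from typing import List, Tuple

# Hypothetical templates; {agg} and {attr} mark where slot values go.
TEMPLATES = [
    "predict the {agg} {attr}",
    "I want to forecast the {agg} of {attr}",
]
# Hypothetical aggregation vocabulary.
AGGREGATIONS = ["average", "maximum", "total"]


def synthesize(attribute: str, rng: random.Random) -> List[Tuple[str, str]]:
    """Fill a random template and return (token, BIO-tag) pairs.

    Assumes `attribute` is a non-empty, whitespace-separated column name.
    """
    template = rng.choice(TEMPLATES)
    agg = rng.choice(AGGREGATIONS)
    pairs: List[Tuple[str, str]] = []
    for word in template.split():
        if word == "{agg}":
            pairs.append((agg, "B-AGG"))
        elif word == "{attr}":
            parts = attribute.split()
            pairs.append((parts[0], "B-ATTR"))  # first token opens the span
            pairs.extend((p, "I-ATTR") for p in parts[1:])  # rest continue it
        else:
            pairs.append((word, "O"))  # template filler is outside any slot
    return pairs


rng = random.Random(0)  # fixed seed so repeated runs give the same examples
# Hypothetical column names standing in for a user-provided dataset schema.
for column in ["arrival delay", "final grade"]:
    for token, tag in synthesize(column, rng):
        print(f"{token}\t{tag}")
    print()  # blank line separates sentences, as in CoNLL-style files
```

Because the labels come from the template positions rather than from human annotators, examples like these can be generated for any user-supplied schema, which is the property the response to Action Item #2 relies on.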
Supplementary Material: zip
Assigned Action Editor: ~Yang_Li2
Submission Number: 1548