examples = [
"""Question: What is the corresponding F-Measure score of the EAST method on IC15 dataset for Scene_Text_Detection task?\n\nThought: The question is asking some performance information about F-Measure score of the EAST method on IC15 dataset for Scene_Text_Detection task, we need to retrieve some useful information from the scirex database.\n\nAction: RetrieveScirex\n\nAction Input: {\"keyword\": \"F-Measure score of the EAST method on IC15 dataset for Scene_Text_Detection task\"}\n\nObservation: section : B. ICDAR 2013 To evaluate our method 's adaptability , we conduct experiments on ICDAR 2013 [ reference ] . ICDAR 2013 test dataset consists of 233 focused scene text images . The texts in the images are horizontal . As we can estimate both the axis - aligned box and the inclined box , we use the axis - aligned box as the output for ICDAR 2013 . We conduct experiments on Faster R - CNN model and R 2 CNN - 5 model trained in last section for ICDAR 2015 . Table 3 shows our results and the state - of - the - art results . Our approach could reach the result of F - measure 87.73 % . As the training data we used does not include single characters but single characters should be detected in ICDAR 2013 , we think our method could achieve even better results when single characters are used for training our model . To compare our method with the Faster R - CNN baseline , we also do a single - scale test in which the short side of the image is set to 720 pixels . In Table 3 , both Faster R - CNN and R 2 CNN - 720 adopt this testing scale . The result is that R 2 CNN - 720 is much better than the Faster R - CNN baseline ( F - measure : 83.16 % vs. 78.45 % ) . This means our design is also useful for horizontal text detection . Figure 6 shows some detection results on ICDAR 2013 . We can see R 2 CNN could detect horizontal focused scene texts well . The missed text in the figure is a single character . section : Test time . 
The test times in Comparisons with state - of - the - art . Table 2 shows the comparison of R 2 CNN with state - of - the - art results on ICDAR 2015 [ reference ] . Here , R 2 CNN refers to R 2 CNN - 5 with inclined NMS . We can see that our method can get competitive results of Recall 79.68 % , Precision 85.62 % and F - measure 82.54 % . As our approach can be considered as learning the inclined box based on the axis - aligned box , it can be easily adapted to other architectures , such as SSD [ reference ] and YOLO [ reference ] . [ reference ] 80.00 % 82.00 % 81.00 % EAST [ reference ] 78.33 % 83.27 % 80.72 % RRPN [ reference ] 82.17 % 73.23 % 77.44 % SegLink [ reference ] 76.80 % 73.10 % 75.00 % DMPNet [ reference ] 68.22 % 73.23 % 70.64 % CTPN [ reference ] 51.56 % 74.22 % 60.85 % MCLAB_FCN [ reference ] 43.09 % 70.81 % 53.58 % Figure 5 demonstrates some detection results of our R 2 CNN on ICDAR 2015 . We can see that our method can ...\nThought: The F-Measure score of EAST method on IC15 dataset for Scene_Text_Detection task is 82.54%.\n\nAction: RetrieveScirex\n\nAction Input: {\"keyword\": \"F-Measure score of the EAST method on IC15 dataset for Scene_Text_Detection task\"}\n\nObservation: section : B. ICDAR 2013 To evaluate our method 's adaptability , we conduct experiments on ICDAR 2013 [ reference ] . ICDAR 2013 test dataset consists of 233 focused scene text images . The texts in the images are horizontal . As we can estimate both the axis - aligned box and the inclined box , we use the axis - aligned box as the output for ICDAR 2013 . We conduct experiments on Faster R - CNN model and R 2 CNN - 5 model trained in last section for ICDAR 2015 . Table 3 shows our results and the state - of - the - art results . Our approach could reach the result of F - measure 87.73 % . 
As the training data we used does not include single characters but single characters should be detected in ICDAR 2013 , we think our method could achieve even better results when single characters are used for training our model . To compare our method with the Faster R - CNN baseline , we also do a single - scale test in which the short side of the image is set to 720 pixels . In Table 3 , both Faster R - CNN and R 2 CNN - 720 adopt this testing scale . The result is that R 2 CNN - 720 is much better than the Faster R - CNN baseline ( F - measure : 83.16 % vs. 78.45 % ) . This means our design is also useful for horizontal text detection . Figure 6 shows some detection results on ICDAR 2013 . We can see R 2 CNN could detect horizontal focused scene texts well . The missed text in the figure is a single character . section : Test time . The test times in Comparisons with state - of - the - art . Table 2 shows the comparison of R 2 CNN with state - of - the - art results on ICDAR 2015 [ reference ] . Here , R 2 CNN refers to R 2 CNN - 5 with inclined NMS . We can see that our method can get competitive results of Recall 79.68 % , Precision 85.62 % and F - measure 82.54 % . As our approach can be considered as learning the inclined box based on the axis - aligned box , it can be easily adapted to other architectures , such as SSD [ reference ] and YOLO [ reference ] . [ reference ] 80.00 % 82.00 % 81.00 % EAST [ reference ] 78.33 % 83.27 % 80.72 % RRPN [ reference ] 82.17 % 73.23 % 77.44 % SegLink [ reference ] 76.80 % 73.10 % 75.00 % DMPNet [ reference ] 68.22 % 73.23 % 70.64 % CTPN [ reference ] 51.56 % 74.22 % 60.85 % MCLAB_FCN [ reference ] 43.09 % 70.81 % 53.58 % Figure 5 demonstrates some detection results of our R 2 CNN on ICDAR 2015 . 
We can see that our method can ...\nThought: In Table 2 ( ICDAR 2015 ) , the EAST row lists Recall 78.33 % , Precision 83.27 % and F - measure 80.72 % , so the answer should be 80.72%.\n\nAction: Finish\n\nAction Input: {\"answer\": \"80.72%\"}\n\nObservation: Answer is CORRECT""",



"""Question: What is the corresponding Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task?\n\nThought: The question is asking some performance information about Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task, we need to retrieve some useful information from the scirex database.\n\nAction: RetrieveScirex\n\nAction Input: {\"keyword\": \"Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task\"}\n\nObservation: section : Part - of - Speech Tagging on WSJ We also performed experiments on the Penn Treebank with the usual split in train , development and test set . Table 3 shows the results of our model in comparison to the results reported in state - ofthe - art literature . Our model significantly outperforms these systems , with an absolute difference of 0.32 % in accuracy , which corresponds to a RRIE of 12 % .\nsection : Part - of - Speech Tagging Results In this section , we present the results of the application of our model to part - of - speech tagging . In our first experiment , we used our model in the setting of the CoNLL 2017 Shared Task to annotate words with XPOS 3 tags [ reference ] . We compare our results against the top systems of the CoNLL 2017 Shared Task . Table 2 contains the results of this task for the large treebanks . Because [ reference ] won the challenge for the majority of the languages , we first compare our results with the performance of their system . Our model outperforms [ reference ] in 32 of the 54 treebanks with 13 ties . These ties correspond mostly to languages where XPOS tagging anyhow obtains accuracies above 99 % . Our model tends to produce better results , especially for morphologically rich languages ( e.g. Slavic System Accuracy S\u00f8gaard ( 2011 ) 97.50 [ reference ] 97.64 [ reference ] . 97.44 [ reference ] 97.41 ours 97.96 Table 3 : Results on WSJ test set . 
languages ) , whereas [ reference ] showed higher performance in 10 languages in particular English , Greek , Brazilian Portuguese and Estonian .\nparagraph : Data & Evaluation . We conducted experiments on a number of different datasets : ( 1 ) the English Wall Street Journal ( WSJ ) part of the Penn Treebank with standard POS tagging splits ; ( 2 ) the English \u201c Treebank Union \u201d multi - domain corpus containing data from the OntoNotes corpus version 5 , the English Web Treebank , and the updated and corrected Question Treebank with identical setup to weiss - etAl:2015:ACL ; and ( 3 ) the CoNLL \u2019 09 multi - lingual shared task .\nThought: Table 3 lists an accuracy of 97.96 for our model on the WSJ test set , so the answer should be 97.96.\n\nAction: Finish\n\nAction Input: {\"answer\": \"97.96\"}\n\nObservation: Answer is CORRECT""",


"""Question: What is the corresponding Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task?\n\nThought: The question is asking some performance information about Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task, we need to retrieve some useful information from the scirex database.\n\nAction: RetrieveScirex\n\nAction Input: {\"keyword\": \"Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task\"}\n\nObservation: section : Part - of - Speech Tagging on WSJ We also performed experiments on the Penn Treebank with the usual split in train , development and test set . Table 3 shows the results of our model in comparison to the results reported in state - ofthe - art literature . Our model significantly outperforms these systems , with an absolute difference of 0.32 % in accuracy , which corresponds to a RRIE of 12 % .\nsection : Part - of - Speech Tagging Results In this section , we present the results of the application of our model to part - of - speech tagging . In our first experiment , we used our model in the setting of the CoNLL 2017 Shared Task to annotate words with XPOS 3 tags [ reference ] . We compare our results against the top systems of the CoNLL 2017 Shared Task . Table 2 contains the results of this task for the large treebanks . Because [ reference ] won the challenge for the majority of the languages , we first compare our results with the performance of their system . Our model outperforms [ reference ] in 32 of the 54 treebanks with 13 ties . These ties correspond mostly to languages where XPOS tagging anyhow obtains accuracies above 99 % . Our model tends to produce better results , especially for morphologically rich languages ( e.g. Slavic System Accuracy S\u00f8gaard ( 2011 ) 97.50 [ reference ] 97.64 [ reference ] . 97.44 [ reference ] 97.41 ours 97.96 Table 3 : Results on WSJ test set . 
languages ) , whereas [ reference ] showed higher performance in 10 languages in particular English , Greek , Brazilian Portuguese and Estonian .\nparagraph : Data & Evaluation . We conducted experiments on a number of different datasets : ( 1 ) the English Wall Street Journal ( WSJ ) part of the Penn Treebank with standard POS tagging splits ; ( 2 ) the English \u201c Treebank Union \u201d multi - domain corpus containing data from the OntoNotes corpus version 5 , the English Web Treebank , and the updated and corrected Question Treebank with identical setup to weiss - etAl:2015:ACL ; and ( 3 ) the CoNLL \u2019 09 multi - lingual shared task .\nThought: We have retrieved the information about Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task.\n\nAction: PythonInterpreter\n\nAction Input: {\"Python\": \"# solution in Python: \\n# Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task\\nimport sys\\nsys.path.append(\\'/home/tcl/git/Meta_BiLSTM\\')\\nfrom scirex import *\\nkey_value = 'Accuracy score of the Meta_BiLSTM method on Penn_Treebank dataset for Part-Of-Speech_Tagging task'\\nprint(Accuracy_score_of_the_Meta_BiLSTM_method_on_Penn_Treebank_dataset_for_Part-Of-Speech_Tagging_task(key_value))\"}\n\nObservation: argument error: Invalid \\escape: line 1 column 166 (char 165)\nThought: The Python call failed with an argument error , but the retrieved Table 3 already lists an accuracy of 97.96 for our model on the WSJ test set , so the answer should be 97.96.\n\nAction: Finish\n\nAction Input: {\"answer\": \"97.96\"}\n\nObservation: Answer is CORRECT""",


"""Question: What is the corresponding Accuracy score of the Sequential_forward_selection method on Cohn-Kanade dataset for Facial_Expression_Recognition task?\n\nThought: The question is asking some performance information about Accuracy score of the Sequential_forward_selection method on Cohn-Kanade dataset for Facial_Expression_Recognition task, we need to retrieve some useful information from the scirex database.\n\nAction: RetrieveScirex\n\nAction Input: {\"keyword\": \"Accuracy score of the Sequential_forward_selection method on Cohn-Kanade dataset for Facial_Expression_Recognition task\"}\n\nObservation: document : Greedy Search for Descriptive Spatial Face Features Facial expression recognition methods use a combination of geometric and appearance - based features . Spatial features are derived from displacements of facial landmarks , and carry geometric information . These features are either selected based on prior knowledge , or dimension - reduced from a large pool . In this study , we produce a large number of potential spatial features using two combinations of facial landmarks . Among these , we search for a descriptive subset of features using sequential forward selection . The chosen feature subset is used to classify facial expressions in the extended Cohn - Kanade dataset ( CK + ) , and delivered 88.7 % recognition accuracy without using any appearance - based features . IEEE Copyright Notice Copyright ( c ) 2016 IEEE Personal use of this material is permitted . Permission from IEEE must be obtained for all other uses , in any current or future media , including reprinting / republishing this material for advertising or promotional purposes , creating new collective works , for resale or redistribution to servers or lists , or reuse of any copyrighted component of this work in other works . C. Gacav , B. Benligiray , and C. Topal , \u201cGreedy search for descriptive spatial face features , \u201d in Proc . IEEE Int . 
Conf . Acoust . , Speech and Signal Process . ( ICASSP ) , 2017 , pp . 1497\u20131501 . section : Conclusion Geometric and appearance - based features tend to capture different representations of facial expressions , hence work well together . Consequently , improvements for either of these feature types will be beneficial for facial expression recognition systems . In this study , we proposed geometric features derived from landmark pairs , including many non - descriptive and redundant ones . Instead of using this feature vector directly or applying a dimension reduction method , we used sequential forward selection to find a descriptive subset . The selected spatial features yield 88.7 % recognition accuracy and surpass other purely geometric features in the literature . To obtain better results , the selection can be done in an extended feature set , including many additional geometric and appearance - based features . A feature selection algorithm that searches for a larger part of the feature subset space is also expected to improve our results . section : Introduction Facial expressions are important cues that support verbal communication . Analyzing individuals \u2019 psychological states and emotions by their facial expressions has become widespread in human behavior analysis and human \u2013 computer interaction studies . Automated computer vision methods that gather facial expression data allow these studies to be conducted more effortlessly . As the technology advances , vision systems will be able to sense subtle emotions and sentiments that humans can not . Geometric and appearance - based features are commonly used in facial expression recognition . 
In this study , we focus on spatial features , which are a type of geometric ...\nThought: As the approach achieves a recognition accuracy of 88.7 % on the Cohn-Kanade dataset, the answer should be 88.7%.\n\nAction: Finish\n\nAction Input: {\"answer\": \"88.7%\"}\n\nObservation: Answer is CORRECT""",


"""Question: What is the corresponding __Test_Accuracy score of the MT-DNN method on SNLI dataset for Natural_Language_Inference task?\n\nThought: The question is asking some performance information about __Test_Accuracy score of the MT-DNN method on SNLI dataset for Natural_Language_Inference task, we need to retrieve some useful information from the scirex database.\n\nAction: RetrieveScirex\n\nAction Input: {\"keyword\": \"__Test_Accuracy score of the MT-DNN method on SNLI dataset for Natural_Language_Inference task\"}\n\nObservation: section : Experiments We evaluate the proposed MT - DNN on three popular NLU benchmarks : GLUE , Stanford Natural Language Inference ( SNLI ) , and SciTail . We compare MT - DNN with existing state - of - the - art models including BERT and demonstrate the effectiveness of MTL for model fine - tuning using GLUE and domain adaptation using SNLI and SciTail .\nsubsection : SNLI and SciTail Results In Table 4 , we compare our adapted models , using all in - domain training samples , against several strong baselines including the best results reported in the leaderboards . We see that MT - DNN generates new state - of - the - art results on both datasets , pushing the benchmarks to 91.1 % on SNLI ( 1.0 % absolute improvement ) and 94.1 % on SciTail ( 5.8 % absolute improvement ) , respectively .\nsubsection : Training Details For duplicability , we release our code . All our models were strictly selected on the development set of the SNLI data and the in - domain development set of MultiNLI and were then tested on the corresponding test set . The main training details are as follows : the dimension of the hidden states of LSTMs and word embeddings are . The word embeddings are initialized by 300D GloVe 840B DBLP : conf / emnlp / PenningtonSM14 , and out - of - vocabulary words among them are initialized randomly . All word embeddings are updated during training . 
Adam DBLP : journals / corr / KingmaB14 is used for optimization with an initial learning rate of . The mini - batch size is set to . Note that the above hyperparameter settings are same as those used in the baseline ESIM DBLP : conf / acl / ChenZLWJI17 model . ESIM is a strong NLI baseline framework with the source code made available at https: // github.com / lukecq1231 / nli ( the ESIM core code has also been adapted to summarization DBLP : conf / ijcai / ChenZLWJ16 and question - answering tasks Zhang : qa:2017 ) . The trade - off for calculating co - attention in Equation ( [ reference ] ) is selected in based on the development set . When training TransE for WordNet , relations are represented with vectors of dimension .\nThought: MT-DNN achieves a __Test_Accuracy score of 91.1% on the SNLI dataset, so the answer should be 91.1.\n\nAction: Finish\n\nAction Input: {\"answer\": \"91.1\"}\n\nObservation: Answer is CORRECT""",


]