Running this reproduces the evaluations behind most of the tables in the paper. It requires Python 2 with NumPy, plus Java; it was tested on Python 2.7.12 and OpenJDK 1.8.0_232. The full run takes approximately 10 minutes.
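Each result below is printed as a `scores:` block of `metric: value` lines. A small helper (illustrative only, not part of the released code) can parse such a block into a dict for side-by-side comparison of runs:

```python
def parse_scores(text):
    """Parse 'Bleu_1: 0.52595'-style lines into a {metric: float} dict."""
    scores = {}
    for line in text.strip().splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            try:
                scores[key.strip()] = float(val)
            except ValueError:
                pass  # skip non-numeric lines such as the bare "scores:" header
    return scores

block = """Bleu_1: 0.54466
Bleu_2: 0.39158
Bleu_4: 0.23673"""
print(parse_scores(block))  # {'Bleu_1': 0.54466, 'Bleu_2': 0.39158, 'Bleu_4': 0.23673}
```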

Table 1 - Comparison with existing question generation methods on the test sets of SQuAD Split 1 and Split 2

Split 1

BertGen(Large) + ASGen

scores:

Bleu_1: 0.52595
Bleu_2: 0.37683
Bleu_3: 0.28959
Bleu_4: 0.22763
METEOR: 0.25292
ROUGE_L: 0.51224

UniLM + ASGen

scores:

Bleu_1: 0.54466
Bleu_2: 0.39158
Bleu_3: 0.30081
Bleu_4: 0.23673
METEOR: 0.25925
ROUGE_L: 0.52284

Split 2

BertGen(Large) + ASGen

scores:

Bleu_1: 0.52283
Bleu_2: 0.38413
Bleu_3: 0.30316
Bleu_4: 0.24588
METEOR: 0.25820
ROUGE_L: 0.53047

UniLM + ASGen

scores:

Bleu_1: 0.55356
Bleu_2: 0.40445
Bleu_3: 0.31578
Bleu_4: 0.25294
METEOR: 0.26662
ROUGE_L: 0.53282

Table 2 - Application of ASGen to other question generation models

Split 1

Zhao et al. + ASGen

scores:

Bleu_1: 0.42711
Bleu_2: 0.27264
Bleu_3: 0.19326
Bleu_4: 0.14180
METEOR: 0.19415
ROUGE_L: 0.42735

UniLM + ASGen

scores:

Bleu_1: 0.54466
Bleu_2: 0.39158
Bleu_3: 0.30081
Bleu_4: 0.23673
METEOR: 0.25925
ROUGE_L: 0.52284

Split 2

Zhao et al. + ASGen

scores:

Bleu_1: 0.45208
Bleu_2: 0.29714
Bleu_3: 0.21666
Bleu_4: 0.16405
METEOR: 0.20582
ROUGE_L: 0.44694

UniLM + ASGen

scores:

Bleu_1: 0.55356
Bleu_2: 0.40445
Bleu_3: 0.31578
Bleu_4: 0.25294
METEOR: 0.26662
ROUGE_L: 0.53282

Table 5 - Comparison with existing question generation methods on the test sets of MS MARCO and NewsQA

MS MARCO

Due to license restrictions of MS MARCO, we cannot distribute the contexts and ground-truth questions. See the file ./predictions/Table_3/msmarco_predictions/BertGen_Large_ASGen.txt for our model outputs.

NewsQA

Due to license restrictions of NewsQA, we cannot distribute the contexts and ground-truth questions. See the file ./predictions/Table_3/newsqa_predictions/BertGen_Large_ASGen.txt for our model outputs.

Table 4 - Ablation of pre-training methods, i.e., pre-training on NS, ASGen, and ASGen without conditioning on a given answer (woans), on the test sets of the SQuAD splits.

Test Wiki

Small-Wiki BertGen + ASGen woans

scores:

Bleu_1: 0.13302
Bleu_2: 0.09001
Bleu_3: 0.06708
Bleu_4: 0.05195
METEOR: 0.11323
ROUGE_L: 0.29557

Small-Wiki BertGen + ASGen

scores:

Bleu_1: 0.13340
Bleu_2: 0.09052
Bleu_3: 0.06759
Bleu_4: 0.05245
METEOR: 0.11348
ROUGE_L: 0.29629

Full-Wiki BertGen + ASGen

scores:

Bleu_1: 0.19631
Bleu_2: 0.13659
Bleu_3: 0.10408
Bleu_4: 0.08239
METEOR: 0.13482
ROUGE_L: 0.33134

Full-Wiki BertGen (Large) + ASGen

scores:

Bleu_1: 0.20027
Bleu_2: 0.13878
Bleu_3: 0.10551
Bleu_4: 0.08347
METEOR: 0.13556
ROUGE_L: 0.33038

Split 1

Small-Wiki BertGen

scores:

Bleu_1: 0.43993
Bleu_2: 0.28965
Bleu_3: 0.20611
Bleu_4: 0.14982
METEOR: 0.20698
ROUGE_L: 0.46069

Small-Wiki BertGen + NS

scores:

Bleu_1: 0.46851
Bleu_2: 0.32687
Bleu_3: 0.24605
Bleu_4: 0.19013
METEOR: 0.22470
ROUGE_L: 0.48508

Small-Wiki BertGen + ASGen woans

scores:

Bleu_1: 0.48389
Bleu_2: 0.34014
Bleu_3: 0.25759
Bleu_4: 0.19949
METEOR: 0.23278
ROUGE_L: 0.49309

Small-Wiki BertGen + ASGen

scores:

Bleu_1: 0.48851
Bleu_2: 0.34224
Bleu_3: 0.25891
Bleu_4: 0.20091
METEOR: 0.23334
ROUGE_L: 0.49143

Full-Wiki BertGen + NS

scores:

Bleu_1: 0.48651
Bleu_2: 0.34470
Bleu_3: 0.26335
Bleu_4: 0.20595
METEOR: 0.23598
ROUGE_L: 0.49477

Full-Wiki BertGen + ASGen

scores:

Bleu_1: 0.50590
Bleu_2: 0.36403
Bleu_3: 0.28070
Bleu_4: 0.22242
METEOR: 0.24604
ROUGE_L: 0.51102

Full-Wiki BertGen (Large) + ASGen

scores:

Bleu_1: 0.52595
Bleu_2: 0.37683
Bleu_3: 0.28959
Bleu_4: 0.22763
METEOR: 0.25292
ROUGE_L: 0.51224

Split 2

Small-Wiki BertGen

scores:

Bleu_1: 0.45890
Bleu_2: 0.30994
Bleu_3: 0.22647
Bleu_4: 0.17054
METEOR: 0.21871
ROUGE_L: 0.47907

Small-Wiki BertGen + NS

scores:

Bleu_1: 0.48338
Bleu_2: 0.34058
Bleu_3: 0.25875
Bleu_4: 0.20232
METEOR: 0.23359
ROUGE_L: 0.49360

Small-Wiki BertGen + ASGen woans

scores:

Bleu_1: 0.48341
Bleu_2: 0.34422
Bleu_3: 0.26500
Bleu_4: 0.21016
METEOR: 0.23545
ROUGE_L: 0.50218

Small-Wiki BertGen + ASGen

scores:

Bleu_1: 0.48659
Bleu_2: 0.34804
Bleu_3: 0.26885
Bleu_4: 0.21398
METEOR: 0.23795
ROUGE_L: 0.50523

Full-Wiki BertGen + NS

scores:

Bleu_1: 0.50439
Bleu_2: 0.36372
Bleu_3: 0.28256
Bleu_4: 0.22565
METEOR: 0.24740
ROUGE_L: 0.50857

Full-Wiki BertGen + ASGen

scores:

Bleu_1: 0.52012
Bleu_2: 0.38048
Bleu_3: 0.29908
Bleu_4: 0.24165
METEOR: 0.25659
ROUGE_L: 0.52325

Full-Wiki BertGen (Large) + ASGen

scores:

Bleu_1: 0.52283
Bleu_2: 0.38413
Bleu_3: 0.30316
Bleu_4: 0.24588
METEOR: 0.25820
ROUGE_L: 0.53047

Table 7 - Comparison of downstream MRC task EM/F1 scores after pre-training on synthetic data (Syn). The scores are obtained on the development sets of SQuAD-v1.1 and SQuAD-v2.0.

Dev v1.1

BERT Large + syn data

{"f1": 92.66872807457196, "exact_match": 86.27246925260171}

BERT WWM  + syn data

{"f1": 93.4482596293062, "exact_match": 87.4077578051088}

Dev v2.0

BERT Large + syn data

{
  "exact": 84.5026530784132,
  "f1": 87.41196733762554,
  "total": 11873,
  "HasAns_exact": 77.88461538461539,
  "HasAns_f1": 83.7115870782102,
  "HasAns_total": 5928,
  "NoAns_exact": 91.1017661900757,
  "NoAns_f1": 91.1017661900757,
  "NoAns_total": 5945,
  "best_exact": 84.5026530784132,
  "best_exact_thresh": -3.604159355163574,
  "best_f1": 87.41196733762536,
  "best_f1_thresh": -3.604159355163574
}
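As a sanity check (not part of the evaluation code), the overall "exact" score in the v2.0 breakdown above is the count-weighted average of the HasAns and NoAns subset scores, using the values reported for BERT Large + syn data:

```python
# Overall exact match is the count-weighted average of the
# answerable (HasAns) and unanswerable (NoAns) subsets.
has_exact, has_total = 77.88461538461539, 5928
no_exact, no_total = 91.1017661900757, 5945
total = has_total + no_total  # 11873

overall = (has_exact * has_total + no_exact * no_total) / total
print(round(overall, 4))  # 84.5027, matching the reported "exact" score
```

The same relation holds for the F1 fields and for the other v2.0 result blocks.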

BERT WWM

{
  "exact": 83.0876779247031,
  "f1": 85.94095618777116,
  "total": 11873,
  "HasAns_exact": 77.5472334682861,
  "HasAns_f1": 83.26197247257224,
  "HasAns_total": 5928,
  "NoAns_exact": 88.61227922624053,
  "NoAns_f1": 88.61227922624053,
  "NoAns_total": 5945,
  "best_exact": 83.0876779247031,
  "best_exact_thresh": -4.982823491096497,
  "best_f1": 85.94095618777067,
  "best_f1_thresh": -4.982823491096497
}

BERT WWM + syn data

{
  "exact": 85.53861703023667,
  "f1": 88.39391027314987,
  "total": 11873,
  "HasAns_exact": 80.78609986504723,
  "HasAns_f1": 86.50487460747469,
  "HasAns_total": 5928,
  "NoAns_exact": 90.2775441547519,
  "NoAns_f1": 90.2775441547519,
  "NoAns_total": 5945,
  "best_exact": 85.53861703023667,
  "best_exact_thresh": -1.261772632598877,
  "best_f1": 88.39391027314946,
  "best_f1_thresh": -1.1740407943725586
}

Appendix Table 9: Additional experiments on the effectiveness of AS on the test set of SQuAD Split 3.

Zhao et al. + ASGen

scores:

Bleu_1: 0.46437
Bleu_2: 0.31000
Bleu_3: 0.22928
Bleu_4: 0.17616
METEOR: 0.21201
ROUGE_L: 0.45834

Appendix Table 12: EM/F1 scores of BERT fine-tuned on the QUASAR-T dataset. The synthetic data used is generated by ASGen trained on SQuAD-v1.1.

Short Dev

BERT

{"f1": 78.61442359340914, "exact_match": 74.28917120387175}

BERT + SQuAD-v1.1

{"f1": 80.09823409097456, "exact_match": 76.4670296430732}

Short Test

BERT

{"f1": 77.79497033441206, "exact_match": 74.08980582524272}

BERT + SQuAD-v1.1

{"f1": 79.98950453023404, "exact_match": 76.51699029126213}

Long Dev

BERT

{"f1": 75.57422669922671, "exact_match": 72.13541666666667}

BERT + SQuAD-v1.1

{"f1": 77.3297414981819, "exact_match": 74.16666666666667}

Long Test

BERT

{"f1": 74.84569699708977, "exact_match": 72.05957883923986}

BERT + SQuAD-v1.1

{"f1": 76.51674126250406, "exact_match": 73.80585516178736}
