Abstract: Natural language (NL) price regression is the task of predicting prices from text.
Like other NL applications, NL price regression uses a pipeline that typically comprises four steps: preprocessing, tokenization, featurization and modeling.
Each step offers multiple options, spanning traditional and modern approaches, which yields many possible pipelines.
However, there is no work systematically comparing different combinations of these steps for NL regression.
We systematically generate hundreds of random valid pipeline configurations, including combinations not commonly studied, such as Transformer featurization paired with gradient-boosted tree modeling.
We then evaluate these pipelines on two real datasets.
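As a concrete illustration of one such uncommon combination, here is a minimal sketch of a pipeline pairing frozen BERT featurization with a gradient-boosted tree regressor. It assumes HuggingFace transformers and scikit-learn; `texts` and `prices` are hypothetical placeholders, not data from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import GradientBoostingRegressor

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical placeholder data for illustration only.
texts = ["vintage leather sofa, minor wear", "4K OLED TV, 55 inch"]
prices = [120.0, 899.0]

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Use the [CLS] token embedding as a fixed-length feature vector per text.
    features = encoder(**batch).last_hidden_state[:, 0, :].numpy()

model = GradientBoostingRegressor()
model.fit(features, prices)
```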
These experiments reveal several interesting aspects of pipeline construction:
i) BERT contextual featurization outperforms GloVe non-contextual featurization,
ii) BERT featurization needs to be fine-tuned to outperform bag of words, with implications for resource-constrained applications,
iii) the variance associated with choosing steps upstream of modeling is comparable to that of selecting the model, and
iv) vector embeddings (BERT and GloVe) perform worse than bag of words for GBDT models (see the sketch below).
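For comparison with the BERT pipeline sketched above, here is the bag-of-words pairing from finding iv, again a minimal sketch assuming scikit-learn and the same hypothetical placeholder data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical placeholder data for illustration only.
texts = ["vintage leather sofa, minor wear", "4K OLED TV, 55 inch"]
prices = [120.0, 899.0]

# Sparse token counts as features for the same GBDT regressor.
bow = CountVectorizer().fit_transform(texts).toarray()
model = GradientBoostingRegressor().fit(bow, prices)
```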
This study provides systematic evidence highlighting the need for holistic pipeline optimization for price regression.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Changes follow rebuttal comments.
New or altered text is in blue.
Substantive additions:
1. Addition of AOE-BERT as a featurizer
2. Addition of a Linear Kernel SVR as a model
Changes within figures and tables are not highlighted; as a guide:
1. Figure 1 has been updated
2. Tables and Figures in Section 5 have new numbers (due to alignment of pipelines between datasets). However, this has not changed any of the conclusions.
Minor spelling/grammar fixes and tightening of prose are also included.
Assigned Action Editor: ~Andriy_Mnih1
Submission Number: 4537