Abstract: Ingredients are a key piece of information in recipes, but they are difficult for computers to process because they are written as informal, user-generated text. To normalize ingredients, we can use a character-based encoder-decoder model that takes the character sequence of an ingredient as input and outputs its canonical form. However, such a model has two problems. First, it often generates unnatural sequences, that is, outputs that do not correspond to any existing canonical form. Second, the generated sequences are sometimes unrelated to the original ingredient. We therefore propose a two-step validation to generate better normalizations. In the first validation step, we use a trie to limit the normalization candidates to existing sequences. In the second validation step, we rerank the normalization candidates based on their similarity to the original ingredient. We conducted experiments on a corpus of approximately 10,000 pairs of ingredients and their canonical forms and showed that the proposed validation improves the performance of encoder-decoder models.
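The two validation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`build_trie`, `allowed_next`, `constrained_beam_search`, `rerank`), the `<eos>` marker, the beam-search setup, and the toy scorer that stands in for the encoder-decoder are all assumptions made for illustration. The abstract does not specify the similarity measure, so `difflib.SequenceMatcher` is used here purely as a placeholder.

```python
import math
from difflib import SequenceMatcher

END = "<eos>"  # hypothetical end marker; multi-character, so it never collides with a character key


def build_trie(canonical_forms):
    """Build a character trie (nested dicts) over the known canonical forms."""
    root = {}
    for form in canonical_forms:
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node[END] = {}  # mark that a canonical form ends here
    return root


def allowed_next(trie, prefix):
    """Return the symbols that may follow `prefix` according to the trie."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return set()
    return set(node)


def constrained_beam_search(source, score_next, trie, beam_size=3, max_len=30):
    """Step 1 (assumed form): beam search that only expands symbols the trie allows.

    `score_next(source, prefix)` is a stand-in for the decoder and must return
    a dict mapping each symbol to a log-probability.
    """
    beams, finished = [("", 0.0)], []
    for _ in range(max_len + 1):
        expanded = []
        for prefix, score in beams:
            log_probs = score_next(source, prefix)
            for sym in allowed_next(trie, prefix):
                new_score = score + log_probs.get(sym, -math.inf)
                if new_score == -math.inf:
                    continue
                if sym == END:
                    finished.append((prefix, new_score))
                else:
                    expanded.append((prefix + sym, new_score))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
        if not beams:
            break
    return [cand for cand, _ in sorted(finished, key=lambda f: f[1], reverse=True)]


def rerank(source, candidates):
    """Step 2 (assumed form): order candidates by string similarity to the input."""
    def similarity(cand):
        return SequenceMatcher(None, source, cand).ratio()
    return sorted(candidates, key=similarity, reverse=True)


def toy_score_next(source, prefix):
    """Toy scorer (not a trained model): favors symbols that occur in the source."""
    symbols = set("abcdefghijklmnopqrstuvwxyz ") | {END}
    return {s: (-0.5 if s in source else -2.0) for s in symbols}


canonical = ["tomato", "potato", "onion", "soy sauce"]
trie = build_trie(canonical)
source = "tomatoes (fresh)"
candidates = constrained_beam_search(source, toy_score_next, trie)
ranked = rerank(source, candidates)
print(ranked)
```

Because every expansion is checked against the trie, each finished candidate is guaranteed to be an existing canonical form, which addresses the first problem. The reranking step then addresses the second by preferring candidates that resemble the input string. A real system might combine the decoder score with the similarity score rather than sorting by similarity alone; that choice is left open here.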