Program Synthesis for Character Level Language Modeling

Pavol Bielik, Veselin Raychev, Martin Vechev

Nov 04, 2016 (modified: Mar 03, 2017) · ICLR 2017 conference submission
  • Abstract: We propose a statistical model for character-level language modeling and show that it is a good fit for both program source code and English text. The model is parameterized by a program from a domain-specific language (DSL) that can express non-trivial data dependencies. Learning proceeds in two phases: (i) we synthesize a program from the DSL, essentially learning a good representation of the data, and (ii) we estimate parameters from the training data by counting, as in simple n-gram language models. Our experiments show that the precision of our model is comparable to that of neural networks, while it retains key advantages of n-gram models such as fast query times and the ability to quickly add and remove training samples. Further, because the model is parameterized by a program, it can be manually inspected, understood, and updated, addressing a major shortcoming of neural networks. A toy sketch of this two-phase scheme is given after this list.
  • Conflicts: inf.ethz.ch
  • Authorids: pavol.bielik@inf.ethz.ch, veselin.raychev@inf.ethz.ch, martin.vechev@inf.ethz.ch
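For concreteness, here is a minimal Python sketch of the two-phase scheme described in the abstract. The toy DSL (just two context-extraction programs, `last_k` and `last_k_skip_space`), the candidate set, the add-one smoothing, and the held-out scoring are all illustrative assumptions, not the paper's actual DSL or synthesis procedure.

```python
# Minimal sketch of the two-phase learning scheme from the abstract.
# The DSL, candidate programs, and scoring below are illustrative
# assumptions, not the authors' actual DSL or synthesis algorithm.
from collections import defaultdict
import math

# Phase (i): a toy "DSL" -- each program maps the history of characters
# seen so far to a conditioning context (a string).
def last_k(k):
    return lambda history: history[-k:]

def last_k_skip_space(k):
    # A program expressing a non-trivial dependency: condition on the
    # last k non-space characters.
    return lambda history: ''.join(c for c in history if c != ' ')[-k:]

CANDIDATE_PROGRAMS = [last_k(1), last_k(2), last_k(3), last_k_skip_space(2)]

# Phase (ii): given a program, the parameters are plain conditional
# counts, exactly as in an n-gram model.
def train(program, text):
    counts = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(text):
        counts[program(text[:i])][ch] += 1
    return counts

def log_likelihood(program, counts, text, vocab_size):
    # Score a text under the counting model, with add-one smoothing.
    ll = 0.0
    for i, ch in enumerate(text):
        ctx = counts[program(text[:i])]
        total = sum(ctx.values())
        ll += math.log((ctx[ch] + 1) / (total + vocab_size))
    return ll

def synthesize(train_text, held_out):
    # "Synthesis" here is naive enumeration: pick the candidate program
    # whose counting model best explains held-out data.
    vocab_size = len(set(train_text) | set(held_out))
    scored = [(log_likelihood(p, train(p, train_text), held_out, vocab_size), p)
              for p in CANDIDATE_PROGRAMS]
    return max(scored, key=lambda t: t[0])[1]

if __name__ == "__main__":
    text = "the cat sat on the mat. the cat ate the rat."
    best = synthesize(text, "the cat sat.")
    model = train(best, text)
```

Even in this toy form, the sketch shows why the model shares the advantages of n-gram models claimed in the abstract: a query is a dictionary lookup, and adding or removing a training sample amounts to incrementing or decrementing counts; the synthesized program itself remains human-readable.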
