The use of English, Czech and French punctuation marks in reference, parallel and comparable web corpora: a question of methodology
Abstract: This paper analyses the frequency of six punctuation marks (the comma, period, colon, semicolon,
question mark and exclamation mark) in three languages (English, French and Czech) in three different
types of corpora — comparable web corpora, large monolingual general (reference) corpora
and parallel (translation) corpora. The aim of the analysis is to identify which type of corpus and
which methodology are the most suitable for contrastive research into punctuation. The data shows
that the frequency of individual punctuation marks is highly sensitive to text type. Web corpora,
which contain uncontrolled proportions of various text types, therefore cannot provide specific and
reliable information about the use of punctuation marks in a given language. We argue that, despite
their limitations in size and composition and the potential specific features of the language
of translation, parallel corpora used in combination with general (reference) corpora
provide the best data for such research.