Abstract: Comma-separated values (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives
rise to a surprisingly large number of ambiguities in its parsing and
interpretation. We summarize the state of the art in CSV parsers,
which typically make a linear series of parsing and interpretation
decisions, such that any wrong decision at an earlier stage can
negatively affect all downstream decisions. Since computation time
is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented multi-hypothesis
CSV parsing approach generates several concurrent hypotheses
about dialect, table structure, etc., and ranks these hypotheses based
on quality features of the resulting table. This approach makes it
possible to create an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in
infrastructure. The complex interactions between these decisions
are taken care of by searching the hypothesis space rather than by
having to program these many interactions in code. We show that
our approach leads to better parsing results than the state of the art
and facilitates the parsing of large corpora of heterogeneous CSV
files.
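
The ranking idea can be illustrated with a minimal sketch. It is not the parser described in this paper: it only enumerates dialect hypotheses (delimiter and quote character) with Python's standard `csv` module, and the quality features in `quality` (row-width consistency, column count, absence of stray quote characters) are assumptions chosen for illustration. The names `parse`, `quality`, and `rank_hypotheses` are hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import product

QUOTES = ('"', "'")


def parse(text, delimiter, quotechar):
    """Parse text under one dialect hypothesis; return a list of non-empty rows."""
    try:
        reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
        return [row for row in reader if row]
    except csv.Error:
        return []  # a hypothesis that cannot be parsed yields an empty table


def quality(table):
    """Toy quality score in [0, 1); higher is better (illustrative features only)."""
    if not table:
        return 0.0
    widths = [len(row) for row in table]
    width, count = Counter(widths).most_common(1)[0]
    consistency = count / len(widths)   # share of rows with the dominant width
    column_bonus = 1 - 1 / width        # 0 for a single column, grows with width
    cells = [cell for row in table for cell in row]
    dirty = sum(cell.startswith(QUOTES) or cell.endswith(QUOTES) for cell in cells)
    cleanliness = 1 - dirty / len(cells)  # penalize leftover quote characters
    return consistency * column_bonus * cleanliness


def rank_hypotheses(text, delimiters=",;\t|", quotechars=QUOTES):
    """Generate every (delimiter, quotechar) hypothesis and sort by quality."""
    scored = []
    for delimiter, quotechar in product(delimiters, quotechars):
        table = parse(text, delimiter, quotechar)
        scored.append((quality(table), delimiter, quotechar, table))
    return sorted(scored, key=lambda hypothesis: hypothesis[0], reverse=True)
```

Calling `rank_hypotheses(text)[0]` returns the top-ranked hypothesis as a `(score, delimiter, quotechar, table)` tuple. Adding a further decision, such as header detection, would mean adding another dimension to the hypothesis product and another feature to the score, rather than new branching logic between the steps.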