# Benchmarking top-k search on simple document search implementations

## Methodology

The benchmark setup is close to the one used in the ESA 2010 article
of Culpepper, Navarro, Puglisi and Turpin.

Explored dimensions:

  * text type
  * instance size (just adjust the `test_case.config` file for this)
  * index implementations
    - [Sadakane's method](src/doc_list_index_sada.hpp)
    - [Wavelet tree greedy traversal](src/doc_list_index_greedy.hpp)
    - [Wavelet tree quantile probing](src/doc_list_index_qprobing.hpp)
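
As a point of reference for what the indexes above compute, here is a minimal brute-force sketch of a top-k query. It assumes that documents are ranked by the number of occurrences of the pattern; the names `count_occurrences` and `naive_top_k` are hypothetical and are not part of this project or of sdsl. It is only a naive baseline and does not reflect how any of the listed indexes work internally.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (document ID, number of occurrences of the pattern in that document)
using Hit = std::pair<std::size_t, std::size_t>;

// Count the (possibly overlapping) occurrences of `pattern` in `doc`.
// An empty pattern is treated as having no occurrences (assumption).
std::size_t count_occurrences(const std::string& doc, const std::string& pattern) {
    if (pattern.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = doc.find(pattern); pos != std::string::npos;
         pos = doc.find(pattern, pos + 1)) {
        ++count;
    }
    return count;
}

// Return up to k hits, highest frequency first. Ties are broken by the
// smaller document ID (assumption). Documents without a match are skipped.
std::vector<Hit> naive_top_k(const std::vector<std::string>& docs,
                             const std::string& pattern, std::size_t k) {
    std::vector<Hit> hits;
    for (std::size_t id = 0; id < docs.size(); ++id) {
        const std::size_t freq = count_occurrences(docs[id], pattern);
        if (freq > 0) hits.emplace_back(id, freq);
    }
    std::sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });
    if (hits.size() > k) hits.resize(k);
    return hits;
}
```

Such a scan reads the whole collection for every query, so it is only practical for checking results on small inputs.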
 
## Directory structure
  * [bin](./bin): Contains the executables of the project.
    - `build_*` index build executables 
    - `gen_pattern*` executable to generate pattern sets
    - `query_*` index query executables
    - `size_of_*` generate size and space breakdowns
  * [dic](./dic): Contains dictionaries for integer inputs.
  * [indexes](./indexes): Contains the generated indexes.
  * [info](./info): Contains space breakdowns.
  * [pattern](./pattern): Contains the generated patterns.
  * [results](./results): Contains the results of the experiments.
  * [src](./src): Contains the source code of the benchmark.
  * [visualize](./visualize): Contains an `R` script that generates
                              a report.

## Test data
  * ENWIKISML was generated by 
    - downloading a dump of a prefix of the English Wikipedia (on the 5th of August 2013)
    - applying the `WikiExtractor.py` program, version 2.5,
      by Giuseppe Attardi and Antonio Fuschetto (University of Pisa)
    - removing the `<doc id="*">` tags and replacing each `</doc>`
      tag with `\1`
  * ENWIKIBIG was generated the same way, but from the complete
    English Wikipedia (retrieved on the 8th of July 2013).
  * The integer versions, ENWIKISMLINT and ENWIKIBIGINT, were
    generated from ENWIKISML and ENWIKIBIG by
    - applying the [stanford-parser.jar][SP] from the NLP group
      at Stanford to tokenize the input (options 
      `untokenizable=allKeep,normalizeParentheses=false,normalizeOtherBrackets=false`)
    - assigning the document separator the ID `1` and every other
      token its rank in descending order of frequency (starting at `2`)
    - storing the resulting sequence of integers in a bit-compressed
      `int_vector`
    - writing the generated dictionaries, which contain a
      `(word, ID, occurrences)` tuple on each line
  * PROTEINS is the concatenation of 143,244 human and mouse
    protein sequences from the Swiss-Prot database. 
  * Availability: The ENWIKIBIG files (character and integer versions)
    are available on request. The other files are downloaded
    automatically during the execution of the benchmark.
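
The ID assignment used for the integer collections can be sketched as follows. This is a minimal illustration and not the code that produced the files: the function names `assign_ids` and `encode` and the `separator` parameter are hypothetical, ties between equally frequent tokens are broken lexicographically (an assumption, since the rule is not stated above), and a plain `std::vector` stands in for the bit-compressed `int_vector`.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using TokenCount = std::pair<std::string, std::uint64_t>;

// Map every distinct token to an ID: the separator gets 1, all other
// tokens get 2, 3, ... in order of decreasing frequency.
std::map<std::string, std::uint64_t>
assign_ids(const std::vector<std::string>& tokens, const std::string& separator) {
    // Count the occurrences of every token except the separator.
    std::map<std::string, std::uint64_t> freq;
    for (const std::string& t : tokens) {
        if (t != separator) ++freq[t];
    }

    // std::map iterates in lexicographic order, so a stable sort by
    // decreasing frequency keeps ties in lexicographic order (assumption).
    std::vector<TokenCount> by_freq(freq.begin(), freq.end());
    std::stable_sort(by_freq.begin(), by_freq.end(),
                     [](const TokenCount& a, const TokenCount& b) {
                         return a.second > b.second;
                     });

    std::map<std::string, std::uint64_t> ids;
    ids[separator] = 1;
    std::uint64_t next_id = 2;
    for (const TokenCount& e : by_freq) ids[e.first] = next_id++;
    return ids;
}

// Replace every token by its ID. Throws std::out_of_range if a token
// is missing from `ids`.
std::vector<std::uint64_t>
encode(const std::vector<std::string>& tokens,
       const std::map<std::string, std::uint64_t>& ids) {
    std::vector<std::uint64_t> out;
    out.reserve(tokens.size());
    for (const std::string& t : tokens) out.push_back(ids.at(t));
    return out;
}
```

Giving the smallest IDs to the most frequent tokens keeps most values in the sequence small; the largest ID is the number of distinct tokens plus one, and it is what bounds the width needed per entry in a fixed-width, bit-compressed vector.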

## Prerequisites
  * For the visualization you need the following software:
    - [R][RPJ] with the package `tikzDevice`. You can install it,
      together with the `filehash` package, by calling
      `install.packages("filehash", repos="http://cran.r-project.org")`
      and
      `install.packages("tikzDevice", repos="http://R-Forge.R-project.org")`
      in `R`.
    - The compressors [xz][XZ] and [gzip][GZIP], which are used to get
      compression baselines.
    - [pdflatex][LT] to generate the PDF reports.

## Usage

  The command `make timing` downloads the small test cases, compiles the executables,
  builds the indexes, runs the queries, and generates a report. The
  benchmark ran for 5 minutes and 40 seconds (excluding the file downloads)
  and generated [this report on my machine][RES].
	
## Customization of the benchmark
  The project contains several configuration files:
 
  * [index.config](./index.config): Specify character
       based indexes' ID, sdsl-class and LaTeX-name for the report.
  * [index_int.config](./index_int.config): Specify word
       based indexes' ID, sdsl-class and LaTeX-name for the report. 
  * [test_case.config](./test_case.config): Specify character based collections' 
       ID, path, LaTeX-name for the report, and download URL.
  * [test_case_int.config](./test_case_int.config): Specify word based collections'
       ID, path, LaTeX-name for the report, and download URL.
  * [pattern_length.config](./pattern_length.config): Specify the
       lengths of the queried patterns for character based indexes.
  * [pattern_length_int.config](./pattern_length_int.config): Specify the
       lengths of the queried patterns for word based indexes.

[RPJ]: http://www.r-project.org/ "R"
[LT]: http://www.tug.org/applications/pdftex/ "pdflatex"
[XZ]: http://tukaani.org/xz/ "XZ Compressor"
[GZIP]: http://www.gnu.org/software/gzip/ "Gzip Compressor"
[SP]: http://nlp.stanford.edu/software/tokenizer.shtml
[RES]: https://github.com/simongog/simongog.github.com/raw/master/assets/images/doc_re_time.pdf "doc_re_time.pdf"
