# Overview

## Description

With nearly 1.4 billion people, India is the second-most populated country in the world. Yet Indian languages, like Hindi and Tamil, are underrepresented on the web. Popular Natural Language Understanding (NLU) models perform worse with Indian languages compared to English, the effects of which lead to subpar experiences in downstream web applications for Indian users. With more attention from the Kaggle community and your novel machine learning solutions, we can help Indian users make the most of the web.

Predicting answers to questions is a common NLU task, but not for Hindi and Tamil. Current progress on multilingual modeling requires a concentrated effort to generate high-quality datasets and modelling improvements. Additionally, for languages that are typically underrepresented in public datasets, it can be difficult to build trustworthy evaluations. We hope the dataset provided for this competition—and [additional datasets generated by participants](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/overview/sharing-datasets)—will enable future machine learning for Indian languages.

In this competition, your goal is to predict answers to real questions about Wikipedia articles. You will use chaii-1, a new question answering dataset with question-answer pairs. The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions written by native-speaking expert data annotators. You will be provided with a [baseline model](https://www.kaggle.com/deeplearning10/chaii-1-starter-notebook) and [inference code](https://www.kaggle.com/deeplearning10/chaii-1-inference) to build upon.

If successful, you'll improve upon the baseline performance of NLU models in Indian languages. The results could improve the web experience for many of the nearly 1.4 billion people of India. Additionally, you’ll contribute to multilingual NLP, which could be applied beyond the languages in this competition.

### Acknowledgments

**Google Research India** contributes fundamental advances in computer science and applies their research to big problems impacting India, Google, and communities around the world. The Natural Language Understanding group at Google Research India works specifically with ML to address the unique challenges in the Indian context (such as code mixing in Search, diversity of languages, dialects and accents in Assistant), learning from limited resources and advancing multilingual models.

**chaii ([Challenge in AI for India](https://events.withgoogle.com/chaii2021))** is a [Google Research India](https://research.google/teams/india-research-lab/) initiative created with the purpose of sparking AI applications to address some of the pressing problems in India and to find unique ways to address them. Starting with a focus on NLU, chaii hopes to make progress towards multilingual modelling, as language diversity is significantly underserved on the web. Google Research India is working on transformational approaches to healthcare, agriculture and education, and also improving apps and services such as search, assistant and payments, e.g., to deal with challenges arising out of the diversity of languages in India. We also acknowledge the support from the [AI4Bharat](https://indicnlp.ai4bharat.org/home/) Team at the Indian Institute of Technology Madras.

## Evaluation

The metric in this competition is the [word-level Jaccard score](https://en.wikipedia.org/wiki/Jaccard_index). A good description of Jaccard similarity for strings is [here](https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50).

A Python implementation based on the links above, and matched with the output of the C# implementation on the back end, is provided below.

```python
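def jaccard(str1, str2):
    # Lowercase, then split on whitespace only, so punctuation stays attached to words.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)  # words shared by both strings
    # Intersection size divided by union size.
    return float(len(c)) / (len(a) + len(b) - len(c))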
```

The formula for the overall metric, then, is:

$$
\text{score} = \frac{1}{n} \sum_{i=1}^n \text{jaccard}(gt_i, dt_i)
$$

where:
$n$ = number of documents

$\text{jaccard}$ = the function provided above

$gt_i$ = the ith ground truth

$dt_i$ = the ith prediction
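
As a rough sketch of how the overall score could be checked locally, the snippet below averages the per-row Jaccard score over paired ground truths and predictions. The helper name `mean_jaccard` and the English sample strings are assumptions made for illustration; they are not part of the competition code or data.

```python
def jaccard(str1, str2):
    # Same word-level Jaccard as the competition snippet.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


def mean_jaccard(ground_truths, predictions):
    """Average the Jaccard score over n (ground truth, prediction) pairs."""
    if len(ground_truths) != len(predictions):
        raise ValueError("ground_truths and predictions must have the same length")
    scores = [jaccard(gt, dt) for gt, dt in zip(ground_truths, predictions)]
    return sum(scores) / len(scores)


# Hypothetical strings, chosen only to keep the arithmetic easy to follow.
ground_truths = ["red fort", "15 august 1947"]
predictions = ["red fort", "august 1947"]

# Row 1: 2 shared words / 2 distinct words = 1.0
# Row 2: 2 shared words / 3 distinct words = 2/3
score = mean_jaccard(ground_truths, predictions)
```

With these two rows the mean is (1 + 2/3) / 2 = 5/6. Note that `jaccard` as written raises `ZeroDivisionError` when both strings are empty or whitespace-only, so a local evaluation loop may want to guard against that case.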

### Submission File

For each ID in the test set, you must predict the string that best answers the provided question based on the context. Note that the selected text must be quoted and complete to be scored correctly. Include punctuation and any other characters attached to a word, because the code above splits ONLY on whitespace. The file should contain a header and have the following format:

```
id,PredictionString
8c8ee6504,"1"
3163c22d0,"2 string"
66aae423b,"4 word 6"
722085a7b,"1"
etc.
```
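
One possible way to produce this layout is sketched below. It quotes every prediction and doubles any embedded quote characters, which is standard CSV escaping. The function name `format_submission` and the sample values are assumptions for illustration; a CSV library with suitable quoting options would work equally well.

```python
import csv
import io


def format_submission(predictions):
    """Render {id: answer} pairs as CSV text with every answer quoted."""
    lines = ["id,PredictionString"]
    for row_id, answer in predictions.items():
        escaped = answer.replace('"', '""')  # double embedded quotes
        lines.append(f'{row_id},"{escaped}"')
    return "\n".join(lines) + "\n"


# Hypothetical predictions keyed by test id.
example = format_submission({"8c8ee6504": "1", "3163c22d0": "2 string"})

# Round-trip through the csv module to confirm the text parses as intended.
rows = list(csv.DictReader(io.StringIO(example)))
```

In a notebook, the returned text would then be written to `submission.csv` using UTF-8 encoding.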

## Timeline

- **August 11, 2021** - Start Date.
- **November 8, 2021** - Entry Deadline. You must accept the competition rules before this date in order to compete.
- **November 8, 2021** - Team Merger Deadline. This is the last day participants may join or merge teams.
- **November 15, 2021** - Final Submission Deadline.

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

## Prizes

- 1st Place - USD$2,000
- 2nd Place - USD$2,000
- 3rd Place - USD$2,000
- 4th Place - USD$2,000
- 5th Place - USD$2,000

## Code Requirements

This is a Code Competition.

Submissions to this competition must be made through Notebooks. In order for the "Submit" button to be active after a commit, the following conditions must be met:

- CPU Notebook <= 5 hours run-time
- GPU Notebook <= 5 hours run-time
- Internet access disabled
- Freely & publicly available external data is allowed, including pre-trained models
- Submission file must be named `submission.csv`

Please see the [Code Competition FAQ](https://www.kaggle.com/docs/competitions#notebooks-only-FAQ) for more information on how to submit, and review the [code debugging doc](https://www.kaggle.com/code-competition-debugging) if you are encountering submission errors.

## Sharing Datasets

Participants are highly encouraged to supplement the dataset by creating their own additional datasets; we expect this will help improve NLU models.

- To be compliant with the competition rules on external data use, please ensure that any datasets you create, source or use are publicly and freely available to all participants in the competition. You may share your datasets on [this forum thread](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/264581) to ensure they are available for all participants to use.
- Please submit your data using [this form](https://forms.gle/P4ZdoDzyJ2iatzYo6) so that Google can release the dataset at the end of the contest.
- Data and baseline model provided by Google as well as data contributed by participants will be released as an open-source dataset at the end of the contest.
- Please take care not to violate the rights of any third party, and ensure the data does not contain lewd, obscene, pornographic, racist, sexist, or otherwise inappropriate content.
- We highly encourage you to create datasets from Wikipedia text.
- Participants may find tools such as [cdQA-Annotator](https://github.com/cdqa-suite/cdQA-annotator) useful for annotating data easily and consistently.

## Citation

Addison Howard, Deepak Nathani, divy thakkar, Julia Elliott, Partha Talukdar, Phil Culliton. (2021). chaii - Hindi and Tamil Question Answering. Kaggle. https://kaggle.com/competitions/chaii-hindi-and-tamil-question-answering

# Dataset Description

In this competition, you will be predicting the answers to questions in Hindi and Tamil. The answers are drawn directly (see the Evaluation page for details) from a limited context. We have provided a small number of samples to check your code with. There is also a hidden test set.

**All files should be encoded as UTF-8.**

## Files

- **train.csv** - the training set, containing context, questions, and answers. Also includes the start character of the answer for disambiguation.
- **test.csv** - the test set, containing context and questions.
- **sample_submission.csv** - a sample submission file in the correct format

## Columns

- `id` - a unique identifier
- `context` - the text of the Hindi/Tamil sample from which answers should be derived
- `question` - the question, in Hindi/Tamil
- `answer_text` (train only) - the answer to the question (manual annotation) (note: for test, this is what you are attempting to predict)
- `answer_start` (train only) - the starting character in `context` for the answer (determined using substring match during data preparation)
- `language` - whether the text in question is in Tamil or Hindi
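
The sketch below shows one way the training columns might be read and sanity-checked with Python's standard `csv` module. The two rows are a hypothetical stand-in for `train.csv`, and the lowercase `language` values, the ids, and the helper name `load_rows` are assumptions, not details confirmed by the dataset description.

```python
import csv
import io
from collections import Counter

TRAIN_COLUMNS = ["id", "context", "question", "answer_text", "answer_start", "language"]

# Hypothetical two-row sample; real contexts are much longer passages.
sample_csv = (
    "id,context,question,answer_text,answer_start,language\n"
    "a1,சென்னை தமிழ்நாட்டின் தலைநகரம் ஆகும்.,தமிழ்நாட்டின் தலைநகரம் எது?,சென்னை,0,tamil\n"
    "b2,भारत की राजधानी नई दिल्ली है।,भारत की राजधानी क्या है?,नई दिल्ली,16,hindi\n"
)


def load_rows(handle):
    """Read training rows, check the expected columns, and parse answer_start."""
    reader = csv.DictReader(handle)
    missing = set(TRAIN_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        row["answer_start"] = int(row["answer_start"])  # character offset into context
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(sample_csv))
per_language = Counter(row["language"] for row in rows)
```

For the real file, the same function could be given `open("train.csv", encoding="utf-8", newline="")` in place of the in-memory sample.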

## Data Annotation Details

The chaii 2021 dataset was prepared following a two-step process, as in [TydiQA](https://arxiv.org/pdf/2003.05002.pdf).

- In the **question elicitation** step, the annotators were shown snippets of Wikipedia text and asked to come up with interesting questions whose answers they would genuinely want to know. They were also asked to make sure the elicited question was **not** answerable from the snippet of wiki text shown. Annotators were asked to elicit questions which were likely to have precise, unambiguous answers.
- In the **answer labelling** step, for each question elicited in the previous step, the first Wikipedia page in the Google search results for that question was selected. For Hindi questions, the selection was restricted to Hindi Wikipedia documents, and similarly for Tamil. Annotators were then asked to select the answer for the question in the document. Annotators were asked to select the first valid answer in the document as the correct answer.
- Questions which were not answerable from the selected document were marked as non-answerable. These question-document pairs were not included in the chaii 2021 dataset.
- With (question, wiki_document, answer) now in place, the first substring occurrence of the answer in the wiki_document was automatically calculated and provided as `answer_start` in the dataset. Since this part was done automatically, some amount of inaccuracy is possible. This was included only for convenience, and participants may consider ignoring this offset during model development (or come up with their own mechanism for offset selection). Please note that during test, the model is only required to predict the answer string, and not its span offset.
- Answers in the training data were produced by **one** annotator, while those in the test data were produced by **three** annotators. Majority voting was then used to come up with the final answer. For test data with minor disagreements, a separate annotator pass was done to select the final answer. For both train and test answer labelling, sampling-based quality checks were carried out, and the answer accuracy was routinely observed to be quite high.
- In spite of all these multi-step checks, some amount of noise in the training data is likely. This is expected and meant to reflect real-world settings where slight noise in the training data may be unavoidable to achieve larger volumes of it. Moreover, this may also result in development of more robust methods which are more noise tolerant during training.
- Update: we ran a few random sampling-based quality checks on the datasets. Based on these checks, we found the Hindi train and test datasets to be 93.8% and 97.8% accurate, respectively. No issues were identified in the sampled Tamil train and test instances.
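
Because `answer_start` was derived automatically from the first substring match, participants who want to use it may wish to check it first. The sketch below is one possible check, not the organizers' procedure: the helper names and the English sample text are hypothetical, and it assumes that a "character" in the dataset corresponds to one Python string index (a Unicode code point).

```python
def first_occurrence(context, answer_text):
    """Return the index of the first exact occurrence of answer_text, or -1."""
    return context.find(answer_text)


def offset_is_consistent(context, answer_text, answer_start):
    """True when context contains answer_text exactly at answer_start."""
    end = answer_start + len(answer_text)
    return answer_start >= 0 and context[answer_start:end] == answer_text


def repair_offset(context, answer_text, answer_start):
    """Keep a consistent offset; otherwise fall back to the first occurrence."""
    if offset_is_consistent(context, answer_text, answer_start):
        return answer_start
    return first_occurrence(context, answer_text)


# Hypothetical context in which the answer string appears twice.
context = "The river rises in the hills. The river ends at the sea."
first = first_occurrence(context, "river")     # first match, at index 4
kept = repair_offset(context, "river", 34)     # second match is still a valid span
fixed = repair_offset(context, "river", 5)     # inconsistent offset, falls back to 4
absent = repair_offset(context, "lake", 0)     # answer not in context, gives -1
```

A result of `-1` flags rows where the annotated answer does not appear verbatim in the context; those rows could be dropped or handled with a custom offset-selection mechanism.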