# Difficulty-Based Active Learning for Classification and Regression Tasks

This repository contains the simulation code for the paper "Annotation Efficiency: Identifying Hard Samples via Blocked Sparse Linear Bandits".

The code implements or uses the following algorithms:

* BSLB: Blocked Sparse Linear Bandits
* C-BSLB: Corralling with Blocked Sparse Linear Bandits
* CORRAL: Corralling a Band of Bandit Algorithms (Agarwal et al., 2017)
* SEALS
* AnchorAL


The code is written in Python and uses the following libraries:

* PyTorch
* Hugging Face Transformers
* Scikit-learn
* Small-Text
* CVXPY

## Datasets

The code can be used with the following datasets:

* Image Classification:
  * Pascal VOC 2012: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A., 2012.
  * Visual Search Difficulty Data Set by Radu Tudor Ionescu, Bogdan Alexe, Marius Leordeanu, Marius Popescu, Dim Papadopoulos, Vittorio Ferrari: http://image-difficulty.herokuapp.com/ 
* Text Classification:
  * Stanford Sentiment Treebank (SST2) https://huggingface.co/datasets/stanfordnlp/sst2, Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).
* Recommendation:
  * Jester Jokes Dataset  https://eigentaste.berkeley.edu/dataset/ Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.
  * Goodbooks Reviews Dataset https://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/
  * MovieLens Dataset https://grouplens.org/datasets/movielens/100k/ F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
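
The MovieLens 100K release includes a `u.item` file, which is pipe-separated and Latin-1 encoded, with the movie id and title in the first two fields. As a minimal sketch of reading it (the helper name `parse_u_item` and the file path in the usage comment are assumptions, not part of this repository):

```python
import csv


def parse_u_item(lines):
    """Map movie id -> title from pipe-separated u.item rows."""
    titles = {}
    # QUOTE_NONE keeps any double quotes in titles as literal characters.
    for row in csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE):
        if row:  # skip blank lines
            titles[int(row[0])] = row[1]
    return titles


# Usage (the path is an assumption; adjust it to your data layout):
# with open("benchmark/u.item", encoding="latin-1", newline="") as f:
#     titles = parse_u_item(f)
```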

## Requirements

To run the code, install the required libraries:

`pip install torch transformers scikit-learn small-text cvxpy`
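
To confirm that the libraries are importable before opening the notebooks, a small check along these lines may help. It is only a sketch: the helper name `missing_modules` is hypothetical, and the import names assume the usual mapping from pip package to module (`scikit-learn` to `sklearn`, `small-text` to `small_text`).

```python
from importlib.util import find_spec

# Import names for the pip packages listed above (assumed mapping:
# scikit-learn -> sklearn, small-text -> small_text).
REQUIRED_MODULES = ["torch", "transformers", "sklearn", "small_text", "cvxpy"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```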



## Code Organization

The code is provided mainly as IPython notebooks for interactive use:

* `image_classification_main.ipynb`: This notebook contains code for the ablation study.
* `alg.py`: This file contains the implementations of CORRAL and the other algorithms.
* `movies.ipynb`: This notebook contains the code for the MovieLens dataset.
* `books.ipynb`: This notebook contains the code for the Goodbooks Reviews dataset.
* `jokes.ipynb`: This notebook contains the code for the Jester Jokes dataset.
* `simulated.ipynb`: This notebook contains the code for the synthetic dataset.
* `text_classification.ipynb`: This notebook contains the code for the text classification experiments.
* `image_extra.ipynb`: This notebook contains the code for loading the image embeddings and for other embedding-related image experiments.

## Notes

Ensure that the data directories are organized as follows:

* benchmark
    * books
    * jester
    * movielens
        * ml-100k
    * u.item
* image_classification 
    * VOC2012
    * task-input.csv (from the visual search difficulty dataset)
    * VSD_dataset.csv (from the visual search difficulty dataset)
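
The layout above can be checked before running the notebooks with a short script such as the following. This is a sketch under assumptions: the helper name `missing_data_paths` is hypothetical, and the paths are copied literally from the list above relative to an assumed root directory.

```python
from pathlib import Path

# Relative paths copied literally from the list above. If your copy keeps
# u.item inside movielens/ml-100k, adjust that entry accordingly.
EXPECTED_PATHS = [
    "benchmark/books",
    "benchmark/jester",
    "benchmark/movielens/ml-100k",
    "benchmark/u.item",
    "image_classification/VOC2012",
    "image_classification/task-input.csv",
    "image_classification/VSD_dataset.csv",
]


def missing_data_paths(root="."):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    for path in missing_data_paths():
        print("Missing: " + path)
```

Calling `missing_data_paths("/path/to/data")` returns an empty list when every expected file and directory is in place.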
