{
    "name": "tokenizer",
    "task_description": "\nYour goal is to implement the build_vocabulary method in the provided Tokenizer class. \nA tokenizer is an object that converts words to numerical IDs.\n\nThe objective of the build_vocabulary method is as follows:\n\n- The method's primary goal is to create two dictionaries: self.word_to_id and self.id_to_word.\n\n- self.word_to_id should map each unique word in the corpus to a unique numerical identifier (ID).\n\n- self.id_to_word is the reverse mapping, where each ID maps back to its word.\n\n- The method should only consider the most frequent words in the corpus, up to a limit specified by max_vocab_size.\n\n",
    "function_signature": "\nclass Tokenizer:\n    def __init__(self, max_vocab_size=200):\n        self.max_vocab_size = max_vocab_size\n        self.word_to_id = {}\n        self.id_to_word = {}\n\n    def tokenize(self, text):\n        # do not change\n        # Split text into words by spaces\n        return text.lower().split()\n\n    def build_vocabulary(self, corpus):\n        '''\n        corpus: a list of strings (each string is a sentence composed of words separated by spaces)\n        '''\n        # WRITE CODE HERE\n        return\n\n    def get_word_id(self, word):\n        # do not change\n        # Retrieve the ID of a word, return None if the word is not in the vocabulary\n        return self.word_to_id.get(word)\n\n    def get_word_by_id(self, word_id):\n        # do not change\n        # Retrieve a word by its ID, return None if the ID is not in the vocabulary\n        return self.id_to_word.get(word_id)\n",
    "unit_test": "\ndef test_tokenize():\n    tokenizer = Tokenizer()\n    assert tokenizer.tokenize(\"Hello world\") == [\"hello\", \"world\"], \"Tokenization failed\"\n\ndef test_build_vocabulary_and_get_word_id():\n    tokenizer = Tokenizer(max_vocab_size=2)\n    corpus = [\"hello world\", \"hello python\", \"hello world\"]\n    tokenizer.build_vocabulary(corpus)\n    \n    assert tokenizer.get_word_id(\"hello\") is not None, \"'hello' should be in the vocabulary\"\n    assert tokenizer.get_word_id(\"world\") is not None, \"'world' should be in the vocabulary\"\n    assert tokenizer.get_word_id(\"python\") is None, \"'python' should not be in the vocabulary due to max_vocab_size limit\"\n\ndef test_get_word_by_id():\n    tokenizer = Tokenizer(max_vocab_size=2)\n    corpus = [\"apple orange\", \"banana apple\", \"cherry banana\"]\n    tokenizer.build_vocabulary(corpus)\n    \n    apple_id = tokenizer.get_word_id(\"apple\")\n    assert tokenizer.get_word_by_id(apple_id) == \"apple\", \"ID lookup for 'apple' failed\"\n\n    # Assuming 'cherry' is not in the top 2 words and therefore has no ID\n    cherry_id = tokenizer.get_word_id(\"cherry\")\n    assert cherry_id is None, \"'cherry' should not have an ID\"\n    assert tokenizer.get_word_by_id(cherry_id) is None, \"ID lookup for a non-existent word should return None\"\n\n# Run the tests\ntest_tokenize()\ntest_build_vocabulary_and_get_word_id()\ntest_get_word_by_id()\n",
    "solution": "\nfrom collections import Counter\n\nclass Tokenizer:\n    def __init__(self, max_vocab_size=200):\n        self.max_vocab_size = max_vocab_size\n        self.word_to_id = {}\n        self.id_to_word = {}\n\n    def tokenize(self, text):\n        # Split text into words by spaces\n        return text.lower().split()\n\n    def build_vocabulary(self, corpus):\n        # Flatten the list of sentences into a list of words\n        all_words = [word for sentence in corpus for word in self.tokenize(sentence)]\n\n        # Count the frequency of each word\n        word_freq = Counter(all_words)\n\n        # Select the top 'max_vocab_size' words\n        most_common_words = word_freq.most_common(self.max_vocab_size)\n\n        # Assign an ID to each word\n        self.word_to_id = {word: idx for idx, (word, _) in enumerate(most_common_words)}\n        self.id_to_word = {idx: word for word, idx in self.word_to_id.items()}\n\n    def get_word_id(self, word):\n        # Retrieve the ID of a word, return None if the word is not in the vocabulary\n        return self.word_to_id.get(word)\n\n    def get_word_by_id(self, word_id):\n        # Retrieve a word by its ID, return None if the ID is not in the vocabulary\n        return self.id_to_word.get(word_id)\n",
    "type": "edit_code"
}