{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# Turorial 1: Getting Started with CycleResearcher (huggingface)",
   "id": "8f65eecda42c91eb"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "CycleResearcher can be easily implemented using the Hugging Face Transformers library, which provides a straightforward and widely-used interface for working with large language models. While this implementation might be slightly slower than VLLM, it offers excellent compatibility and easier setup for most users.",
   "id": "69b54f2115daf8f2"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": 5,
   "source": [
    "import json\n",
    "\n",
    "import torch\n",
    "from cycleresearch.utils import get_paper_from_generated_text\n",
    "from tqdm import tqdm, trange\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n"
   ],
   "id": "324a30824cc7cd57"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "\n",
    "\n",
    "## Loading Data and Model\n",
    "\n",
    "Let's start by loading our pre-prepared data and model. The input data should be in BibTeX format with abstracts. BibTeX is a reference management format widely used in academic publishing, making it an ideal choice for research paper generation.\n",
    "\n",
    "Example BibTeX format:\n",
    "```bibtex\n",
    "@article{example2024,\n",
    "    title = {Sample Paper Title},\n",
    "    author = {Author, A. and Author, B.},\n",
    "    journal = {Journal Name},\n",
    "    year = {2024},\n",
    "    abstract = {This is a sample abstract that would be used as input...}\n",
    "}\n",
    "```\n",
    "\n"
   ],
   "id": "d45d53f2b553e43f"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{\"role\": \"system\", \"content\": \"You are a research assistant AI tasked with generating a scientific paper based on provided literature. Follow these steps:\\n\\n1. Analyze the given References. \\n2. Identify gaps in existing research to establish the motivation for a new study.\\n3. Propose a main idea for a new research work.\\n4. Write the paper's main content in LaTeX format, including:\\n   - Title\\n   - Abstract\\n   - Introduction\\n   - Related Work\\n   - Methods/\\n5. Generate experimental setup details in JSON format to guide researchers.\\n6. After receiving experimental results in JSON format, analyze them.\\n7. Complete the paper by writing:\\n   - Results\\n   - Discussion\\n   - Conclusion\\n   - Contributions\\n\\nEnsure all content is original, academically rigorous, and follows standard scientific writing conventions.\"}, {\"role\": \"user\", \"content\": \"@article{liu2023visual,\\n  title={Visual instruction tuning},\\n  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},\\n  journal={arXiv preprint arXiv:2304.08485},\\n  year={2023}\\n}\\n\\n@article{joshi2017automatic,\\n  title={Automatic sarcasm detection: A survey},\\n  author={Joshi, Aditya and Bhattacharyya, Pushpak and Carman, Mark J},\\n  journal={ACM Computing Surveys (CSUR)},\\n  volume={50},\\n  number={5},\\n  pages={1--22},\\n  year={2017},\\n  publisher={ACM New York, NY, USA}\\n}\\n\\nAbstract: A common form of sarcasm on Twitter consists of a positive sentiment contrasted with a negative situation. For example, many sarcastic tweets include a positive sentiment, such as \\u201clove\\u201d or \\u201cenjoy\\u201d, followed by an expression that describes an undesirable activity or state (e.g., \\u201ctaking exams\\u201d or \\u201cbeing ignored\\u201d). We have developed a sarcasm recognizer to identify this type of sarcasm in tweets. 
We present a novel bootstrapping algorithm that automatically learns lists of positive sentiment phrases and negative situation phrases from sarcastic tweets. We show that identifying contrasting contexts using the phrases learned through bootstrapping yields improved recall for sarcasm recognition.\\n@inproceedings{riloff2013sarcasm,\\n  title={Sarcasm as contrast between a positive sentiment and negative situation},\\n  author={Riloff, Ellen and Qadir, Ashequl and Surve, Prafulla and De Silva, Lalindra and Gilbert, Nathan and Huang, Ruihong},\\n  booktitle={Proceedings of the 2013 conference on empirical methods in natural language processing},\\n  pages={704--714},\\n  year={2013}\\n}\\n\\nAbstract: Sarcasm is a sophisticated linguistic phenomenon to express the opposite of what one really means. With the rapid growth of social media, multimodal sarcastic tweets are widely posted on various social platforms. In multimodal context, sarcasm is no longer a pure linguistic phenomenon, and due to the nature of social media short text, the opposite is more often manifested via cross-modality expressions. Thus traditional text-based methods are insufficient to detect multimodal sarcasm. To reason with multimodal sarcastic tweets, in this paper, we propose a novel method for modeling cross-modality contrast in the associated context. Our method models both cross-modality contrast and semantic association by constructing the Decomposition and Relation Network (namely D&R Net). The decomposition network represents the commonality and discrepancy between image and text, and the relation network models the semantic association in cross-modality context. 
Experimental results on a public dataset demonstrate the effectiveness of our model in multimodal sarcasm detection.\\n@inproceedings{xu2020reasoning,\\n  title={Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association},\\n  author={Xu, Nan and Zeng, Zhixiong and Mao, Wenji},\\n  booktitle={Proceedings of the 58th annual meeting of the association for computational linguistics},\\n  pages={3777--3786},\\n  year={2020}\\n}\\n\\nAbstract: Sarcasm is a pervasive phenomenon in today\\u2019s social media platforms such as Twitter and Reddit. These platforms allow users to create multi-modal messages, including texts, images, and videos. Existing multi-modal sarcasm detection methods either simply concatenate the features from multi modalities or fuse the multi modalities information in a designed manner. However, they ignore the incongruity character in sarcastic utterance, which is often manifested between modalities or within modalities. Inspired by this, we propose a BERT architecture-based model, which concentrates on both intra and inter-modality incongruity for multi-modal sarcasm detection. To be specific, we are inspired by the idea of self-attention mechanism and design inter-modality attention to capturing inter-modality incongruity. In addition, the co-attention mechanism is applied to model the contradiction within the text. The incongruity information is then used for prediction. 
The experimental results demonstrate that our model achieves state-of-the-art performance on a public multi-modal sarcasm detection dataset.\\n@inproceedings{pan2020modeling,\\n  title={Modeling intra and inter-modality incongruity for multi-modal sarcasm detection},\\n  author={Pan, Hongliang and Lin, Zheng and Fu, Peng and Qi, Yatao and Wang, Weiping},\\n  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},\\n  pages={1383--1392},\\n  year={2020}\\n}\\n\\nAbstract: Sarcasm is a peculiar form of sentiment expression, where the surface sentiment differs from the implied sentiment. The detection of sarcasm in social media platforms has been applied in the past mainly to textual utterances where lexical indicators (such as interjections and intensifiers), linguistic markers, and contextual information (such as user profiles, or past conversations) were used to detect the sarcastic tone. However, modern social media platforms allow to create multimodal messages where audiovisual content is integrated with the text, making the analysis of a mode in isolation partial. In our work, we first study the relationship between the textual and visual aspects in multimodal posts from three major social media platforms, i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to quantify the extent to which images are perceived as necessary by human annotators. Moreover, we propose two different computational frameworks to detect sarcasm that integrate the textual and visual modalities. The first approach exploits visual semantics trained on an external dataset, and concatenates the semantics features with state-of-the-art textual features. The second method adapts a visual neural network initialized with parameters trained on ImageNet to multimodal sarcastic posts. 
Results show the positive effect of combining modalities for the detection of sarcasm across platforms and methods.\\n@inproceedings{schifanella2016detecting,\\n  title={Detecting sarcasm in multimodal social platforms},\\n  author={Schifanella, Rossano and De Juan, Paloma and Tetreault, Joel and Cao, Liangliang},\\n  booktitle={Proceedings of the 24th ACM international conference on Multimedia},\\n  pages={1136--1145},\\n  year={2016}\\n}\\n\\nAbstract: Sarcasm is a subtle form of language in which people express the opposite of what is implied. Previous works of sarcasm detection focused on texts. However, more and more social media platforms like Twitter allow users to create multi-modal messages, including texts, images, and videos. It is insufficient to detect sarcasm from multi-model messages based only on texts. In this paper, we focus on multi-modal sarcasm detection for tweets consisting of texts and images in Twitter. We treat text features, image features and image attributes as three modalities and propose a multi-modal hierarchical fusion model to address this task. Our model first extracts image features and attribute features, and then leverages attribute features and bidirectional LSTM network to extract text features. Features of three modalities are then reconstructed and fused into one feature vector for prediction. We create a multi-modal sarcasm detection dataset based on Twitter. 
Evaluation results on the dataset demonstrate the efficacy of our proposed model and the usefulness of the three modalities.\\n@inproceedings{cai2019multi,\\n  title={Multi-modal sarcasm detection in twitter with hierarchical fusion model},\\n  author={Cai, Yitao and Cai, Huiyu and Wan, Xiaojun},\\n  booktitle={Proceedings of the 57th annual meeting of the association for computational linguistics},\\n  pages={2506--2515},\\n  year={2019}\\n}\\n\\nAbstract: Sarcasm is a peculiar form and sophisticated linguistic act to express the incongruity of someone's implied sentiment expression, which is a pervasive phenomenon in social media platforms. Compared with sarcasm detection purely on texts, multi-modal sarcasm detection is more adapted to the rapidly growing social media platforms, where people are interested in creating multi-modal messages. When focusing on the multi-modal sarcasm detection for tweets consisting of texts and images on Twitter, the significant clue of improving the performance of multi-modal sarcasm detection evolves into how to determine the incongruity relations between texts and images. In this paper, we investigate multi-modal sarcasm detection from a novel perspective, so as to determine the sentiment inconsistencies within a certain modality and across different modalities by constructing heterogeneous in-modal and cross-modal graphs (InCrossMGs) for each multi-modal example. Based on it, we explore an interactive graph convolution network (GCN) structure to jointly and interactively learn the incongruity relations of in-modal and cross-modal graphs for determining the significant clues in sarcasm detection. 
Experimental results demonstrate that our proposed model achieves state-of-the-art performance in multi-modal sarcasm detection.\\n@inproceedings{liang2021multi,\\n  title={Multi-modal sarcasm detection with interactive in-modal and cross-modal graphs},\\n  author={Liang, Bin and Lou, Chenwei and Li, Xiang and Gui, Lin and Yang, Min and Xu, Ruifeng},\\n  booktitle={Proceedings of the 29th ACM international conference on multimedia},\\n  pages={4707--4715},\\n  year={2021}\\n}\\n\\n@inproceedings{liang2022multi,\\n  title={Multi-modal sarcasm detection via cross-modal graph convolutional network},\\n  author={Liang, Bin and Lou, Chenwei and Li, Xiang and Yang, Min and Gui, Lin and He, Yulan and Pei, Wenjie and Xu, Ruifeng},\\n  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},\\n  volume={1},\\n  pages={1767--1777},\\n  year={2022},\\n  organization={Association for Computational Linguistics}\\n}\\n\\nAbstract: Sarcasm is a linguistic phenomenon indicating a discrepancy between literal meanings and implied intentions. Due to its sophisticated nature, it is usually difficult to be detected from the text itself. As a result, multi-modal sarcasm detection has received more and more attention in both academia and industries. However, most existing techniques only modeled the atomic-level inconsistencies between the text input and its accompanying image, ignoring more complex compositions for both modalities. Moreover, they neglected the rich information contained in external knowledge, e.g., image captions. In this paper, we propose a novel hierarchical framework for sarcasm detection by exploring both the atomic-level congruity based on multi-head cross attentions and the composition-level congruity based on graph neural networks, where a post with low congruity can be identified as sarcasm. In addition, we exploit the effect of various knowledge resources for sarcasm detection. 
Evaluation results on a public multi-modal sarcasm detection dataset based on Twitter demonstrate the superiority of our proposed model.\\n@inproceedings{liu2022towards,\\n  title={Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement},\\n  author={Liu, Hui and Wang, Wenya and Li, Haoliang},\\n  booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},\\n  pages={4995--5006},\\n  year={2022}\\n}\\n\\n@inproceedings{qin2023mmsd2,\\n  title={MMSD2. 0: Towards a Reliable Multi-modal Sarcasm Detection System},\\n  author={Qin, Libo and Huang, Shijue and Chen, Qiguang and Cai, Chenran and Zhang, Yudi and Liang, Bin and Che, Wanxiang and Xu, Ruifeng},\\n  booktitle={Findings of the Association for Computational Linguistics: ACL 2023},\\n  pages={10834--10845},\\n  year={2023}\\n}\\n\\nAbstract: Multimodal sarcasm detection is an important research topic in natural language processing and multimedia computing, and benefits a wide range of applications in multiple domains. Most existing studies regard the incongruity between image and text as the indicative clue in identifying multimodal sarcasm. To capture cross-modal incongruity, previous methods rely on fixed architectures in network design, which restricts the model from dynamically adjusting to diverse image-text pairs. Inspired by routing-based dynamic network, we model the dynamic mechanism in multimodal sarcasm detection and propose the Dynamic Routing Transformer Network (DynRT-Net). Our method utilizes dynamic paths to activate different routing transformer modules with hierarchical co-attention adapting to cross-modal incongruity. Experimental results on a public dataset demonstrate the effectiveness of our method compared to the state-of-the-art methods. 
Our codes are available at https://github.com/TIAN-viola/DynRT.\\n@inproceedings{tian2023dynamic,\\n  title={Dynamic Routing Transformer Network for Multimodal Sarcasm Detection},\\n  author={Tian, Yuan and Xu, Nan and Zhang, Ruike and Mao, Wenji},\\n  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},\\n  pages={2468--2480},\\n  year={2023}\\n}\\n\\nAbstract: Sarcasm is important to sentiment analysis on social media. Sarcasm Target Identification (STI) deserves further study to understand sarcasm in depth. However, text lacking context or missing sarcasm target makes target identification very difficult. In this paper, we introduce multimodality to STI and present Multimodal Sarcasm Target Identification (MSTI) task. We propose a novel multi-scale cross-modality model that can simultaneously perform textual target labeling and visual target detection. In the model, we extract multi-scale visual features to enrich spatial information for different sized visual sarcasm targets. We design a set of convolution networks to unify multi-scale visual features with textual features for cross-modal attention learning, and correspondingly a set of transposed convolution networks to restore multi-scale visual information. The results show that visual clues can improve the performance of TSTI by a large margin, and VSTI achieves good accuracy.\\n@inproceedings{wang2022multimodal,\\n  title={Multimodal sarcasm target identification in tweets},\\n  author={Wang, Jiquan and Sun, Lin and Liu, Yi and Shao, Meizhi and Zheng, Zengwei},\\n  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},\\n  pages={8164--8175},\\n  year={2022}\\n}\\n\\nAbstract: We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. 
Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks including zero-shot captioning, visual or document visual question answering, and grounding. We demonstrate the Qwen-VL outperforms existing Large Vision Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL .\\n@article{bai2023qwen,\\n  title={Qwen-vl: A frontier large vision-language model with versatile abilities},\\n  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},\\n  journal={arXiv preprint arXiv:2308.12966},\\n  year={2023}\\n}\\n\\nAbstract: Sarcasm is sophisticated linguistic expression and is commonly observed on social media and e-commerce platforms. Failure to detect sarcastic expressions in natural language processing tasks, such as opinion mining and sentiment analysis, leads to poor model performance. Traditional approaches rely heavily on discrete handcrafted features and will incur enormous human costs. It was not until recent that scholars began to employ neural networks to address these limitations and have achieved new state-of-the-art performance. In this work, we propose a novel self-matching network to capture sentence \\u201dincongruity\\u201d information by exploring word-to-word interactions. In particular, we calculate the joint information in each word-to-word pair in the input sentence to build a self-matching attention vector, based on which we attend the sentence and build its representation vector. Such a network allows sentence to match within itself word by word and cater to the words of conflict sentiments. 
In addition, we incorporate a bi-directional LSTM network into our proposed network to retain compositional information. We concatenate incongruity information and compositional information through a Low-rank Bilinear Pooling method to control for potential information redundancy without losing discriminative power. Experiment results on publicly available datasets demonstrate that our model significantly outperforms extant baselines on standard evaluation metrics including precision, recall, F1 score and accuracy.\\n@inproceedings{xiong2019sarcasm,\\n  title={Sarcasm detection with self-matching networks and low-rank bilinear pooling},\\n  author={Xiong, Tao and Zhang, Peiran and Zhu, Hongbo and Yang, Yihui},\\n  booktitle={The world wide web conference},\\n  pages={2115--2124},\\n  year={2019}\\n}\\n\\nAbstract: Automatic sarcasm detection from text is an important classification task that can help identify the actual sentiment in user-generated data, such as reviews or tweets. Despite its usefulness, sarcasm detection remains a challenging task, due to a lack of any vocal intonation or facial gestures in textual data. To date, most of the approaches to addressing the problem have relied on hand-crafted affect features, or pre-trained models of non-contextual word embeddings, such as Word2vec. However, these models inherit limitations that render them inadequate for the task of sarcasm detection. In this paper, we propose two novel deep neural network models for sarcasm detection, namely ACE 1 and ACE 2. Given as input a text passage, the models predict whether it is sarcastic (or not). Our models extend the architecture of BERT by incorporating both affective and contextual features. To the best of our knowledge, this is the first attempt to directly alter BERT\\u2019s architecture and train it from scratch to build a sarcasm classifier. 
Extensive experiments on different datasets demonstrate that the proposed models outperform state-of-the-art models for sarcasm detection with significant margins.\\n@inproceedings{babanejad2020affective,\\n  title={Affective and contextual embedding for sarcasm detection},\\n  author={Babanejad, Nastaran and Davoudi, Heidar and An, Aijun and Papagelis, Manos},\\n  booktitle={Proceedings of the 28th international conference on computational linguistics},\\n  pages={225--243},\\n  year={2020}\\n}\\n\\nAbstract: Past work in computational sarcasm deals primarily with sarcasm detection. In this paper, we introduce a novel, related problem: sarcasm target identi\\ufb01cation ( i.e. , extracting the target of ridicule in a sarcastic sentence). As a benchmark, we introduce a new dataset for the task. This dataset is manually annotated for the sarcasm target in book snippets and tweets based on our formulation of the task. We then introduce an automatic approach for sarcasm target identi\\ufb01cation. It is based on a combination of two types of extractors: one based on rules, and another consisting of a statistical classi\\ufb01er. Our introductory approach establishes the viability of sarcasm target identi\\ufb01cation, and will serve as a baseline for future work.\\n@inproceedings{joshi2018sarcasm,\\n  title={Sarcasm target identification: Dataset and an introductory approach},\\n  author={Joshi, Aditya and Goel, Pranav and Bhattacharyya, Pushpak and Carman, Mark},\\n  booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},\\n  year={2018}\\n}\\n\\nAbstract: We present an overview of the 2019 ALTA shared task. This is the 10th of the series of shared tasks organised by ALTA since 2010. The task was to detect the target of sarcastic comments posted on social media. We intro- duce the task, describe the data and present the results of baselines and participants. 
This year\\u2019s shared task was particularly challenging and no participating systems improved the re- sults of our baseline.\\n@inproceedings{molla2019overview,\\n  title={Overview of the 2019 ALTA shared task: Sarcasm target identification},\\n  author={Molla, Diego and Joshi, Aditya},\\n  booktitle={Proceedings of the the 17th annual workshop of the Australasian language technology association},\\n  pages={192--196},\\n  year={2019}\\n}\\n\\nAbstract: In this paper we propose a deep learning framework for sarcasm target detection in predefined sarcastic texts. Identification of sarcasm targets can help in many core natural language processing tasks such as aspect based sentiment analysis, opinion mining etc. To begin with, we perform an empirical study of the socio-linguistic features and identify those that are statistically significant in indicating sarcasm targets (p-values in the range(0.05,0.001)). Finally, we present a deep-learning framework augmented with socio-linguistic features to detect sarcasm targets in sarcastic book-snippets and tweets.We achieve a huge improvement in the performance in terms of exact match and dice scores compared to the current state-of-the-art baseline.\\n@inproceedings{patro2019deep,\\n  title={A deep-learning framework to detect sarcasm targets},\\n  author={Patro, Jasabanta and Bansal, Srijan and Mukherjee, Animesh},\\n  booktitle={Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp)},\\n  pages={6336--6342},\\n  year={2019}\\n}\\n\\nAbstract: We describe our methods in trying to detect the target of sarcasm as part of ALTA 2019 shared task. We use combination of ensemble of clas- sifiers and a rule-based system. Our team ob- tained a Dice-Sorensen Coefficient score of 0.37150, which placed 2nd in the public leader- board. 
Despite no team beating the baseline score for the private dataset, we present our findings and also some of the challenges and future improvements which can be used in or- der to tackle the problem.\\n@inproceedings{parameswaran2019detecting,\\n  title={Detecting target of sarcasm using ensemble methods},\\n  author={Parameswaran, Pradeesh and Trotman, Andrew and Liesaputra, Veronica and Eyers, David},\\n  booktitle={Proceedings of the the 17th annual workshop of the Australasian language technology association},\\n  pages={197--203},\\n  year={2019}\\n}\\n\\nAbstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. 
At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.\\n@inproceedings{brown2020language,\\n  title={Language models are few-shot learners},\\n  author={Brown, Tom B and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others},\\n  booktitle={Proceedings of the 34th International Conference on Neural Information Processing Systems},\\n  pages={1877--1901},\\n  year={2020}\\n}\\n\\n@article{OpenAI2023GPT4TR,\\n  title={GPT-4 Technical Report},\\n  author={OpenAI},\\n  journal={ArXiv},\\n  year={2023},\\n  volume={abs/2303.08774},\\n  url={https://api.semanticscholar.org/CorpusID:257532815}\\n}\\n\\nAbstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. 
On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.\\n@article{chowdhery2022palm,\\n  title={Palm: Scaling language modeling with pathways},\\n  author={Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and others},\\n  journal={arXiv preprint arXiv:2204.02311},\\n  year={2022}\\n}\\n\\nAbstract: We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B\\u2019s architecture and training, and evaluate its performance. 
We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.\\n@inproceedings{black2022gpt,\\n  title={GPT-NeoX-20B: An Open-Source Autoregressive Language Model},\\n  author={Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and others},\\n  booktitle={Proceedings of BigScience Episode\\\\# 5--Workshop on Challenges \\\\& Perspectives in Creating Large Language Models},\\n  pages={95--136},\\n  year={2022}\\n}\\n\\nAbstract: We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4$\\\\times$RTX 3090 (24G) or 8$\\\\times$RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. 
The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at \\\\url{https://github.com/THUDM/GLM-130B/}.\\n@inproceedings{zeng2022glm,\\n  title={GLM-130B: An Open Bilingual Pre-trained Model},\\n  author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},\\n  booktitle={The Eleventh International Conference on Learning Representations},\\n  year={2022}\\n}\\n\\nAbstract: We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. 
We release all our models to the research community.\\n@article{touvron2023llama,\\n  title={Llama: Open and efficient foundation language models},\\n  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\\\\'e}e and Rozi{\\\\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and others},\\n  journal={arXiv preprint arXiv:2302.13971},\\n  year={2023}\\n}\\n\\n@article{touvron2023llama2,\\n  title={Llama 2: Open foundation and fine-tuned chat models},\\n  author={Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others},\\n  journal={arXiv preprint arXiv:2307.09288},\\n  year={2023}\\n}\\n\\n@article{yang2023dawn,\\n  title={The dawn of lmms: Preliminary explorations with gpt-4v (ision)},\\n  author={Yang, Zhengyuan and Li, Linjie and Lin, Kevin and Wang, Jianfeng and Lin, Chung-Ching and Liu, Zicheng and Wang, Lijuan},\\n  journal={arXiv preprint arXiv:2309.17421},\\n  volume={9},\\n  number={1},\\n  year={2023}\\n}\\n\\nAbstract: Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. 
Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.\\n@article{Dai2023InstructBLIPTG,\\n  title={InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning},\\n  author={Wenliang Dai and Junnan Li and Dongxu Li and Anthony Meng Huat Tiong and Junqi Zhao and Weisheng Wang and Boyang Albert Li and Pascale Fung and Steven C. H. 
Hoi},\\n  journal={ArXiv},\\n  year={2023},\\n  volume={abs/2305.06500},\\n  url={https://api.semanticscholar.org/CorpusID:258615266}\\n}\\n\\n@article{team2023gemini,\\n  title={Gemini: A family of highly capable multimodal models},\\n  author={Team, Gemini and Anil, Rohan and Borgeaud, Sebastian and Wu, Yonghui and Alayrac, Jean-Baptiste and Yu, Jiahui and Soricut, Radu and Schalkwyk, Johan and Dai, Andrew M and Hauth, Anja and others},\\n  journal={arXiv preprint arXiv:2312.11805},\\n  year={2023}\\n}\\n\\n@article{wang2023cogvlm,\\n  title={Cogvlm: Visual expert for pretrained language models},\\n  author={Wang, Weihan and Lv, Qingsong and Yu, Wenmeng and Hong, Wenyi and Qi, Ji and Wang, Yan and Ji, Junhui and Yang, Zhuoyi and Zhao, Lei and Song, Xixuan and others},\\n  journal={arXiv preprint arXiv:2311.03079},\\n  year={2023}\\n}\\n\\nAbstract: YOLO has become a central real-time object detection system for robotics, driverless cars\\n@article{terven2023comprehensive,\\n  title={A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond},\\n  author={Terven, Juan and Cordova-Esparza, Diana},\\n  journal={arXiv preprint arXiv:2304.00501},\\n  year={2023}\\n}\\n\\nAbstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. 
It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).\\n@inproceedings{devlin2019bert,\\n  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},\\n  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},\\n  booktitle={Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},\\n  pages={4171--4186},\\n  year={2019}\\n}\\n\\nAbstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. 
Code will be made available.\\n@inproceedings{he2017mask,\\n  title={Mask r-cnn},\\n  author={He, Kaiming and Gkioxari, Georgia and Doll{\\\\'a}r, Piotr and Girshick, Ross},\\n  booktitle={Proceedings of the IEEE international conference on computer vision},\\n  pages={2961--2969},\\n  year={2017}\\n}\\n\\n@inproceedings{minderer2022simple,\\n  title={Simple open-vocabulary object detection},\\n  author={Minderer, Matthias and Gritsenko, Alexey and Stone, Austin and Neumann, Maxim and Weissenborn, Dirk and Dosovitskiy, Alexey and Mahendran, Aravindh and Arnab, Anurag and Dehghani, Mostafa and Shen, Zhuoran and others},\\n  booktitle={European Conference on Computer Vision},\\n  pages={728--755},\\n  year={2022},\\n  organization={Springer}\\n}\\n\\nAbstract: In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean $26.1$ AP. 
Code will be available at \\\\url{https://github.com/IDEA-Research/GroundingDINO}.\\n@article{liu2023grounding,\\n  title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},\\n  author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},\\n  journal={arXiv preprint arXiv:2303.05499},\\n  year={2023}\\n}\\n\\nAbstract: Web 2.0 has led to the development and evolution of web-based communities and applications. These communities provide places for information sharing and collaboration. They also open t he door for inappropriate online activities, such as harassment, i n which some users post messages in a virtual community that are intention- ally offensive to other members of the community. It is a new and challenging task to detect online harassment; currently fe w systems attempt to solve this problem. In this paper, we use a supervised learning approach for dete ct- ing harassment. Our technique employs content features, sentiment features, and contextual features of documents. The experi mental results described herein show that our method achieves significant improvements over several baselines, including Term Frequency- Inverse Document Frequency (TFIDF) approaches. Identifica tion of online harassment is feasible when TFIDF is supplemented with sentiment and contextual feature attributes.\\n@article{yin2009detection,\\n  title={Detection of harassment on web 2.0},\\n  author={Yin, Dawei and Xue, Zhenzhen and Hong, Liangjie and Davison, Brian D and Kontostathis, April and Edwards, Lynne and others},\\n  journal={Proceedings of the Content Analysis in the WEB},\\n  volume={2},\\n  number={0},\\n  pages={1--7},\\n  year={2009},\\n  publisher={Madrid, Spain}\\n}\\n\\nAbstract: Sarcasm is a form of speech act in which the speakers convey their message in an implicit way. 
The inherently ambiguous nature of sarcasm sometimes makes it hard even for humans to decide whether an utterance is sarcastic or not. Recognition of sarcasm can benefit many sentiment analysis NLP applications, such as review summarization, dialogue systems and review ranking systems. \\n \\nIn this paper we experiment with semi-supervised sarcasm identification on two very different data sets: a collection of 5.9 million tweets collected from Twitter, and a collection of 66000 product reviews from Amazon. Using the Mechanical Turk we created a gold standard sample in which each sentence was tagged by 3 annotators, obtaining F-scores of 0.78 on the product reviews dataset and 0.83 on the Twitter dataset. We discuss the differences between the datasets and how the algorithm uses them (e.g., for the Amazon dataset the algorithm makes use of structured information). We also discuss the utility of Twitter #sarcasm hashtags for the task.\\n@inproceedings{davidov2010semi,\\n  title={Semi-supervised recognition of sarcasm in Twitter and Amazon},\\n  author={Davidov, Dmitry and Tsur, Oren and Rappoport, Ari},\\n  booktitle={Proceedings of the fourteenth conference on computational natural language learning},\\n  pages={107--116},\\n  year={2010}\\n}\\n\\nAbstract: There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. 
We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at this https URL\\n@article{bochkovskiy2020yolov4,\\n  title={Yolov4: Optimal speed and accuracy of object detection},\\n  author={Bochkovskiy, Alexey and Wang, Chien-Yao and Liao, Hong-Yuan Mark},\\n  journal={arXiv preprint arXiv:2004.10934},\\n  year={2020}\\n}\\n\\n@article{hu2023bad,\\n  title={Bad actor, good advisor: Exploring the role of large language models in fake news detection},\\n  author={Hu, Beizhe and Sheng, Qiang and Cao, Juan and Shi, Yuhui and Li, Yang and Wang, Danding and Qi, Peng},\\n  journal={arXiv preprint arXiv:2309.12247},\\n  year={2023}\\n}\\n\\n@inproceedings{lin2023beneath,\\n  title={Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models},\\n  author={Lin, Hongzhan and Luo, Ziyang and Ma, Jing and Chen, Long},\\n  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},\\n  year={2023}\\n}\\n\\n@inproceedings{wei2022chain,\\n  title={Chain-of-Thought Prompting Elicits Reasoning in Large Language Models},\\n  author={Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Xia, Fei and Chi, Ed H and Le, Quoc V and Zhou, Denny and others},\\n  booktitle={Advances in Neural Information Processing Systems},\\n  year={2022}\\n}\\n\\nAbstract: Recent studies have alarmed that many online hate speeches are implicit. With its subtle nature, the explainability of the detection of such hateful speech has been a challenging problem. In this work, we examine whether ChatGPT can be used for providing natural language explanations (NLEs) for implicit hateful speech detection. 
We design our prompt to elicit concise ChatGPT-generated NLEs and conduct user studies to evaluate their qualities by comparison with human-written NLEs. We discuss the potential and limitations of ChatGPT in the context of implicit hateful speech research.\\n@inproceedings{huang2023chatgpt,\\n  title={Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech},\\n  author={Huang, Fan and Kwak, Haewoon and An, Jisun},\\n  booktitle={Companion Proceedings of the ACM Web Conference 2023},\\n  pages={294--297},\\n  year={2023}\\n}\\n\\n@inproceedings{lin2014microsoft,\\n  title={Microsoft coco: Common objects in context},\\n  author={Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\\\\'a}r, Piotr and Zitnick, C Lawrence},\\n  booktitle={Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13},\\n  pages={740--755},\\n  year={2014},\\n  organization={Springer}\\n}\\n\\nAbstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new \\\"Colossal Clean Crawled Corpus\\\", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. 
To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.\\n@article{raffel2020exploring,\\n  title={Exploring the limits of transfer learning with a unified text-to-text transformer},\\n  author={Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J},\\n  journal={The Journal of Machine Learning Research},\\n  volume={21},\\n  number={1},\\n  pages={5485--5551},\\n  year={2020},\\n  publisher={JMLRORG}\\n}\\n\\nAbstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. 
The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.\\n@inproceedings{liu2021swin,\\n  title={Swin transformer: Hierarchical vision transformer using shifted windows},\\n  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},\\n  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},\\n  pages={10012--10022},\\n  year={2021}\\n}\\n\\nAbstract: We present DINO (\\\\textbf{D}ETR with \\\\textbf{I}mproved de\\\\textbf{N}oising anch\\\\textbf{O}r boxes), a state-of-the-art end-to-end object detector. % in this paper. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves $49.4$AP in $12$ epochs and $51.3$AP in $24$ epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of $\\\\textbf{+6.0}$\\\\textbf{AP} and $\\\\textbf{+2.7}$\\\\textbf{AP}, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \\\\texttt{val2017} ($\\\\textbf{63.2}$\\\\textbf{AP}) and \\\\texttt{test-dev} (\\\\textbf{$\\\\textbf{63.3}$AP}). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. 
Our code will be available at \\\\url{https://github.com/IDEACVR/DINO}.\\n@article{zhang2022dino,\\n  title={Dino: Detr with improved denoising anchor boxes for end-to-end object detection},\\n  author={Zhang, Hao and Li, Feng and Liu, Shilong and Zhang, Lei and Su, Hang and Zhu, Jun and Ni, Lionel M and Shum, Heung-Yeung},\\n  journal={arXiv preprint arXiv:2203.03605},\\n  year={2022}\\n}\\n\\nAbstract: Intersection over Union (IoU) is the most popular evaluation metric used in the object detection benchmarks. However, there is a gap between optimizing the commonly used distance losses for regressing the parameters of a bounding box and maximizing this metric value. The optimal objective for a metric is the metric itself. In the case of axis-aligned 2D bounding boxes, it can be shown that IoU can be directly used as a regression loss. However, IoU has a plateau making it infeasible to optimize in the case of non-overlapping bounding boxes. In this paper, we address the this weakness by introducing a generalized version of IoU as both a new loss and a new metric. By incorporating this generalized IoU ( GIoU) as a loss into the state-of-the art object detection frameworks, we show a consistent improvement on their performance using both the standard, IoU based, and new, GIoU based, performance measures on popular object detection benchmarks such as PASCAL VOC and MS COCO.\\n@inproceedings{rezatofighi2019generalized,\\n  title={Generalized intersection over union: A metric and a loss for bounding box regression},\\n  author={Rezatofighi, Hamid and Tsoi, Nathan and Gwak, JunYoung and Sadeghian, Amir and Reid, Ian and Savarese, Silvio},\\n  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},\\n  pages={658--666},\\n  year={2019}\\n}\\n\\nAbstract: Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. 
However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM\\n@article{wizardcoder,\\n  author       = {Ziyang Luo and\\n                  Can Xu and\\n                  Pu Zhao and\\n                  Qingfeng Sun and\\n                  Xiubo Geng and\\n                  Wenxiang Hu and\\n                  Chongyang Tao and\\n                  Jing Ma and\\n                  Qingwei Lin and\\n                  Daxin Jiang},\\n  title        = {WizardCoder: Empowering Code Large Language Models with Evol-Instruct},\\n  journal      = {CoRR},\\n  volume       = {abs/2306.08568},\\n  year         = {2023},\\n  url          = {https://doi.org/10.48550/arXiv.2306.08568},\\n  doi          = {10.48550/ARXIV.2306.08568},\\n  eprinttype    = {arXiv},\\n  eprint       = {2306.08568},\\n  timestamp    = {Sun, 18 Jun 2023 16:10:59 +0200},\\n  biburl       = {https://dblp.org/rec/journals/corr/abs-2306-08568.bib},\\n  bibsource    = {dblp computer science bibliography, https://dblp.org}\\n}\\n\\nAbstract: Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. 
In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90\\\\% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM\\n@article{wizardlm,\\n  author       = {Can Xu and\\n                  Qingfeng Sun and\\n                  Kai Zheng and\\n                  Xiubo Geng and\\n                  Pu Zhao and\\n                  Jiazhan Feng and\\n                  Chongyang Tao and\\n                  Daxin Jiang},\\n  title        = {WizardLM: Empowering Large Language Models to Follow Complex Instructions},\\n  journal      = {CoRR},\\n  volume       = {abs/2304.12244},\\n  year         = {2023},\\n  url          = {https://doi.org/10.48550/arXiv.2304.12244},\\n  doi          = {10.48550/ARXIV.2304.12244},\\n  eprinttype    = {arXiv},\\n  eprint       = {2304.12244},\\n  timestamp    = {Wed, 03 May 2023 14:12:58 +0200},\\n  biburl       = {https://dblp.org/rec/journals/corr/abs-2304-12244.bib},\\n  bibsource    = {dblp computer science bibliography, https://dblp.org}\\n}\\n\\n@inproceedings{lin2024explainable,\\n    title={Towards Explainable 
Harmful Meme Detection through Multimodal Debate between Large Language Models},\\n    author={Hongzhan Lin and Ziyang Luo and Wei Gao and Jing Ma and Bo Wang and Ruichao Yang},\\n    booktitle={The ACM Web Conference 2024},\\n    year={2024},\\n    address={Singapore},\\n}\\n\\nThe above content represents the relevant literature in this field. Please analyze it and provide the motivation and main idea. Then, provide the Title, Abstract, Introduction, Related Work, and Methods sections in LaTeX format.\"}]\n"
     ]
    }
   ],
   "execution_count": 8,
   "source": [
    "with open('./demo_data.json', 'r', encoding='utf-8') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "# Print sample data\n",
    "print(json.dumps(data[0]['messages'][:2]))\n",
    "\n",
    "# Initialize tokenizer and model\n",
    "model_name = 'WestlakeNLP/CycleResearcher-12B'\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_name,\n",
    "    torch_dtype=torch.float16,  # Use float16 for efficiency\n",
    "    device_map=\"auto\",  # Automatically handle model placement\n",
    "    max_memory={i: \"24GiB\" for i in range(torch.cuda.device_count())},  # Adjust memory per GPU\n",
    ")\n",
    "\n",
    "# Generation parameters (equivalent to vllm's SamplingParams)\n",
    "generation_config = {\n",
    "    \"max_length\": 19000,\n",
    "    \"temperature\": 0.1,\n",
    "    \"top_p\": 0.95,\n",
    "    \"pad_token_id\": None,\n",
    "    \"do_sample\": True,\n",
    "}\n"
   ],
   "id": "83778a417f67b45d"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Generating and Processing Output\n",
    "\n",
    "Finally, we can generate output from CycleResearcher by calling `llm.generate`. An important note about the output processing: the raw generated text needs to be formatted using the `get_paper_from_generated_text` function, which returns a dictionary containing structured paper components.\n",
    "\n",
    "\n",
    "\n",
    "The function handles all the necessary parsing and formatting, ensuring that the output maintains proper academic paper structure while being easily accessible for further processing or analysis:"
   ],
   "id": "9cbb76fa01715fcd"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "* item['latex'] contains the main body of the model-generated paper.\n",
    "* item['generated_text3'] contains the original complete text generated by the model\n",
    "* item['motivation'] contains the motivation section explaining why this research was conducted\n",
    "* item['idea'] contains the main idea section describing the core methodology and approach\n",
    "* item['interestingness'] contains the section explaining why this research is significant\n",
    "* item['feasibility'] contains the analysis of how practical and implementable the approach is\n",
    "* item['novelty'] contains the discussion of what makes this research unique and innovative\n",
    "* item['title'] contains the paper title extracted from LaTeX formatting\n",
    "* item['abstract'] contains the paper abstract summarizing the research\n",
    "* item['Experimental_Setup'] contains the experimental configuration details, either as JSON or text\n",
    "* item['Experimental_results'] contains the experimental outcomes and findings, either as JSON or text\n"
   ],
   "id": "a991081f486d76eb"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "results = []\n",
    "batch_size = 1  # Process one item at a time\n",
    "\n",
    "# Process data in batches\n",
    "for idx in trange(0, len(data), batch_size):\n",
    "    batch_data = data[idx:idx+batch_size]\n",
    "    \n",
    "    # Prepare prompts\n",
    "    prompts = []\n",
    "    for item in batch_data:\n",
    "        prompt = tokenizer.apply_chat_template(\n",
    "            item['messages'][:2],\n",
    "            tokenize=False,\n",
    "            add_generation_prompt=True\n",
    "        )\n",
    "        prompts.append(prompt)\n",
    "    \n",
    "    # Tokenize inputs\n",
    "    inputs = tokenizer(\n",
    "        prompts,\n",
    "        return_tensors=\"pt\",\n",
    "        padding=True,\n",
    "        truncation=True\n",
    "    ).to(model.device)\n",
    "    \n",
    "    # Generate text\n",
    "    with torch.no_grad():\n",
    "        outputs = model.generate(\n",
    "            **inputs,\n",
    "            **generation_config,\n",
    "            return_dict_in_generate=True,\n",
    "        )\n",
    "    \n",
    "    # Process outputs\n",
    "    for output_idx, generated_ids in enumerate(outputs.sequences):\n",
    "        # Decode the generated text\n",
    "        generated_text = tokenizer.decode(\n",
    "            generated_ids[inputs['input_ids'].shape[1]:],  # Skip input tokens\n",
    "            skip_special_tokens=True,\n",
    "            clean_up_tokenization_spaces=True\n",
    "        )\n",
    "        \n",
    "        # Store results\n",
    "        batch_data[output_idx]['generated_text'] = generated_text\n",
    "        results.append(batch_data[output_idx])\n",
    "\n",
    "# Process and print results\n",
    "for idx, result in enumerate(results):\n",
    "    item = get_paper_from_generated_text(result['generated_text'])\n",
    "    if item is not None:\n",
    "        print('Idx %d:\\n %s\\n' % (idx, item))"
   ],
   "id": "c21ef888dbfcc8b2"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "",
   "id": "48f21bdbeeb29de9"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
