# SciLitLLM: Adapting LLMs for Scientific Literature Understanding

**SciLitLLM** adapts a general large language model for effective scientific literature understanding. This repository contains all necessary code for the continual pre-training (CPT) and supervised fine-tuning (SFT) methods, which are key components of SciLitLLM.

## Overview

Scientific literature understanding is essential for extracting valuable insights and advancing scientific discovery. **SciLitLLM** specializes in this by integrating domain-specific knowledge and task-specific instruction-following abilities. The framework achieves this through:

- **Continual Pre-Training (CPT)**: Infusing domain knowledge from scientific corpora.

- **Supervised Fine-Tuning (SFT)**: Enhancing instruction-following using diverse scientific tasks.

## Repository Structure

Please refer to each subdirectory for details.

- **cpt/**: data-processing code for the CPT corpora.

- **sft/**: data-processing code for the SFT instructions.

## Create Your Domain-Specific Model

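1. Clone the repository and set up the environment:

   ```bash
   # Create and activate a dedicated conda environment (Python 3.10)
   conda create --name scilitllm python=3.10
   conda activate scilitllm

   # Install the dependencies (run from the repository root)
   pip install -r requirements.txt
   ```

2. Follow the instructions in the **cpt/** and **sft/** directories to prepare the training corpora.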
