# GitHub Data Synthesis Pipeline

A high-throughput data ingestion pipeline for scraping, filtering, and indexing GitHub repository and Pull Request data. Optimized for high network throughput and for HDD, SSD, and GPFS storage.

## Features

- **Simple and Flexible**: Single-node architecture, short (<100 lines) core logic per task
- **High Concurrency**: Configurable worker pools for online (`MAX_CONCURRENCY`) and offline (`OFFLINE_CONCURRENCY`) tasks
- **Automatic Retry**: Retry logic for API edge cases, rate limits, and server errors
- **Fast Reruns**: API reverse proxy caching enables fast reruns without checkpointing
- **Efficient Storage**: Parquet format with Snappy compression
- **Storage Optimized**: Sequential I/O with batching and file rotation for HDD; parallel I/O for SSD/GPFS
- **Throughput Monitoring**: Real-time throughput reporting every 30 seconds
- **Hybrid Go/Rust Architecture**: Go for I/O-bound tasks, Rust for CPU-bound tokenization (about 12x faster than the Go-only implementation)
- **Six-Task Pipeline**:
  - Task 1: Repository survey and filtering (producer-consumer pattern)
  - Task 2: Pull Request metadata ingestion (producer-consumer pattern)
  - Task 3: PR enrichment with file contents and commit history (batch processing)
  - Task 4: LLM enhancement (PR summary + commit message refinement)
  - Task 5: Render PRs to plaintext for LLM training (batch processing)
  - Task 6: High-performance tokenization via Rust workers (IPC-based)

## Prerequisites

- Go 1.25+
- Rust 1.80+ (for tokenizer worker)
- High-performance workstation (100+ cores recommended)
- HDD/SSD/GPFS with high capacity for data storage

## Installation

```bash
# Clone the repository
git clone https://github.com/Anonymous/daVinci-Dev.git
cd daVinci-Dev/Pipeline

# Install dependencies
go mod download

# Build everything (Go pipeline + Rust tokenizer worker)
make build

# Or build separately
make build-go      # Build Go pipeline only
make build-rust    # Build Rust tokenizer worker only
```

## Using the Hugging Face open-source snapshot (Task 5/6)

Quickstart (recommended workflow: download the HF dataset): see [`text_from_huggingface.md`](text_from_huggingface.md).

## Configuration

Configure the pipeline using the environment variables below. See [`config/config.go`](config/config.go) for more configuration options.

Because the pipeline is deliberately flexible, you can also customize its behavior beyond these variables by modifying the core logic in the `tasks/` directory.

### Network Settings
- `PROXY_BASE_URL`: Base URL of the reverse proxy (default: `http://localhost:8080`)
- `MAX_CONCURRENCY`: Maximum concurrent workers for online tasks 1-3 (default: `10`)
- `MAX_IDLE_CONNS`: Maximum idle connections (default: `10`)

### Storage Settings
- `DATA_DIR`: Root directory for all data (default: `/mnt/hdd/github_data`)
- `MAX_FILE_SIZE`: Maximum parquet file size in bytes (default: `524288000` = 500MB)

### Task 1 Settings
- `SINCE_ID`: Starting repository ID (default: `0`)
- `MIN_STARS`: Minimum stars for filtering (default: `5`)
- `TARGET_LANGUAGE`: Target programming language (default: `Python`)

### Task 2 Settings
- `MIN_PY_FILES`: Minimum Python files in PR (default: `1`)
- `MAX_PY_FILES`: Maximum Python files in PR (default: `5`)
- `MAX_TOTAL_FILES`: Maximum total files in PR (default: `20`)

### Task 3 Settings
- `PR_BATCH_SIZE`: PRs per batch for enrichment (default: `1000`)

### Task 4 Settings (LLM Enhancement)
- `LLM_BASE_URL`: OpenAI-compatible endpoint (e.g., vLLM/SGLang) (default: `http://localhost:8000`)
- `LLM_API_KEY`: API key (if required by your gateway)
- `LLM_MODEL`: Model name (default: `Qwen/Qwen2.5-Coder-32B-Instruct`)
- `LLM_CONCURRENCY`: Concurrent LLM requests (default: `64`)
- `LLM_TIMEOUT_SECONDS`: Request timeout in seconds (default: `120`)

### Task 5 & 6 Settings (Offline, CPU-intensive)
- `OFFLINE_CONCURRENCY`: Concurrent workers for offline tasks (default: `32`)
- `RUST_TOKENIZER_PATH`: Path to Rust tokenizer worker binary (default: `./tokenizer/rust_worker/target/release/tokenizer_worker`)
- `MAX_TOKENS`: Maximum tokens per PR (default: `32000`)
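
To illustrate how settings like these are typically read, here is a minimal sketch of environment parsing with defaults. It is not the code in `config/config.go`: the helper names `intOr` and `stringOr` and the fallback-on-invalid-value behavior are assumptions made for this example; only the variable names and defaults come from the lists above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc matches os.LookupEnv so a map-backed lookup can be used in tests.
type lookupFunc func(key string) (string, bool)

// intOr returns the integer value of key, or def when the variable is
// unset, empty, or not a valid integer (assumed behavior for this sketch).
func intOr(lookup lookupFunc, key string, def int) int {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return def
	}
	return n
}

// stringOr returns the value of key, or def when it is unset or empty.
func stringOr(lookup lookupFunc, key, def string) string {
	if raw, ok := lookup(key); ok && raw != "" {
		return raw
	}
	return def
}

func main() {
	// Defaults are the documented ones.
	fmt.Println("PROXY_BASE_URL  =", stringOr(os.LookupEnv, "PROXY_BASE_URL", "http://localhost:8080"))
	fmt.Println("MAX_CONCURRENCY =", intOr(os.LookupEnv, "MAX_CONCURRENCY", 10))
	fmt.Println("MIN_STARS       =", intOr(os.LookupEnv, "MIN_STARS", 5))
}
```

Passing the lookup function as a parameter keeps the helpers testable without touching the real process environment.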

## Usage

### Task 1: Repository Survey

Scans GitHub repositories, filters by language and stars, and outputs to Parquet files.

```bash
# Run task 1
./github-pipeline -task repos

# With custom configuration
PROXY_BASE_URL=http://192.168.1.100:8080 \
MAX_CONCURRENCY=15 \
MIN_STARS=10 \
./github-pipeline -task repos

# Note: No checkpoint/resume needed - API proxy caching makes reruns fast
```

**Output:**
- `raw_index/`: Raw repository index (ID + full_name)
- `filtered_repos/`: Filtered repositories meeting criteria

### Task 2: PR Metadata Ingestion

Reads filtered repositories and fetches merged PR metadata with file statistics.

```bash
# Run task 2
./github-pipeline -task prs

# Note: No checkpoint/resume needed - API proxy caching makes reruns fast
```

**Output:**
- `raw_prs/`: PR metadata with file statistics

### Task 3: PR Enrichment

Enriches filtered PRs with related issues, file contents, and commit history.

```bash
# Run task 3
./github-pipeline -task enrich
```

**Output:**
- `enriched_prs/`: Comprehensive PR data for training

### Task 4: LLM Enhancement

Generates an LLM-written PR summary and refines each commit message to improve downstream training quality.

```bash
# Run task 4
./github-pipeline -task llm_enhance
```

**Output:**
- `llm_enhanced_prs/`: Enriched PR data + `pr_summary` + `refined_message`

### Task 5: Render Text

Renders LLM-enhanced PRs to plaintext using a template.

```bash
# Run task 5
./github-pipeline -task render
```

**Output:**
- `rendered_text/`: Plaintext training data

### Task 6: Tokenization and Final Dataset

Tokenizes rendered PR text using high-performance Rust workers and produces the final training dataset.

**Prerequisites:**
```bash
# Download tokenizer model first (one-time setup)
python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-32B-Instruct')"

# Build Rust tokenizer worker
make build-rust
```

**Run:**
```bash
# Run task 6
./github-pipeline -task tokenize

# With custom worker count (e.g., 192-core system)
OFFLINE_CONCURRENCY=192 ./github-pipeline -task tokenize
```

**Performance:**
- **Throughput**: ~22M tokens/s with 128 Rust workers (12x faster than Go)
- **Architecture**: Go workers handle filtering and I/O, Rust workers handle tokenization via IPC
- **Scalability**: Linear scaling up to CPU core count

**Output:**
- `tokenized_dataset/`: Final training data with token IDs
- `token_stats/`: Statistics for all processed PRs

## GitHub API Proxy (required for Tasks 1-3)

The pipeline expects a GitHub API reverse proxy at `PROXY_BASE_URL`.

The pipeline expects the proxy to provide the following behaviors:
- Adding the API key in the `Authorization` header
- Forwarding GitHub REST API GET requests
- A special GraphQL endpoint: `GET /gql/pull_closing_issues/{owner}/{repo}/{pr_number}` (translated to a GitHub GraphQL request)
- `Retry-After` detection and status code remapping compatible with the pipeline client
- Caching (optional, but recommended for fast reruns; note that the pipeline does not query the same API twice in a single run)

A reference implementation is provided in [`cmd/github_api_proxy/main.go`](cmd/github_api_proxy/main.go).
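
As a rough illustration of the first two behaviors (attaching the token and forwarding GETs), here is a minimal sketch built on the standard `net/http/httputil` package. It is not the reference implementation: `newProxy` and `preview` are hypothetical names, the `Bearer` scheme is an assumption, and the GraphQL endpoint, `Retry-After` remapping, and caching are left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy returns a reverse proxy that forwards requests to upstream and
// attaches token as a bearer credential (assumed scheme).
func newProxy(upstream *url.URL, token string) *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Rewrite: func(pr *httputil.ProxyRequest) {
			pr.SetURL(upstream) // rewrites scheme, host, and the Host header
			pr.Out.Header.Set("Authorization", "Bearer "+token)
		},
	}
}

// preview applies the proxy's Rewrite hook to a GET for path without sending
// anything, and returns the outbound URL and Authorization header.
func preview(upstreamURL, path, token string) (string, string, error) {
	upstream, err := url.Parse(upstreamURL)
	if err != nil {
		return "", "", err
	}
	in, err := http.NewRequest(http.MethodGet, "http://localhost:8080"+path, nil)
	if err != nil {
		return "", "", err
	}
	pr := &httputil.ProxyRequest{In: in, Out: in.Clone(in.Context())}
	newProxy(upstream, token).Rewrite(pr)
	return pr.Out.URL.String(), pr.Out.Header.Get("Authorization"), nil
}

func main() {
	out, auth, err := preview("https://api.github.com", "/repos/golang/go", "ghp_xxx")
	if err != nil {
		panic(err)
	}
	fmt.Println("outbound URL:", out)
	fmt.Println("authorization:", auth)
	// To actually serve, pass the proxy to http.ListenAndServe, e.g.
	//   http.ListenAndServe(":8080", newProxy(upstream, token))
}
```

`preview` exists only so the rewrite can be inspected without network access; a real proxy would serve `newProxy(...)` directly.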

Usage:

```bash
export GITHUB_TOKEN=ghp_xxx

go run ./cmd/github_api_proxy -listen :8080

# then run the pipeline
export PROXY_BASE_URL=http://localhost:8080
./github-pipeline -task repos
```
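
On the client side, the `Retry-After` handling mentioned in the behavior list might look roughly like the sketch below. The function names, the fallback delay, and the set of retryable status codes are assumptions for illustration; the actual remapping agreed between the proxy and `client/github_client.go` is not documented here. Only the delay-in-seconds form of the header is handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// retryAfterSeconds parses the delay-seconds form of a Retry-After header.
// It returns fallback when the header is missing, negative, or not a number
// (for example when it carries an HTTP date instead).
func retryAfterSeconds(header string, fallback int) int {
	n, err := strconv.Atoi(strings.TrimSpace(header))
	if err != nil || n < 0 {
		return fallback
	}
	return n
}

// shouldRetry reports whether a status code is treated as transient.
// Assumption: 429 (rate limit) and 5xx (server errors) are retryable.
func shouldRetry(status int) bool {
	return status == 429 || status >= 500
}

func main() {
	for _, c := range []struct {
		status int
		header string
	}{
		{429, "60"},
		{503, ""},
		{404, ""},
	} {
		if shouldRetry(c.status) {
			fmt.Printf("status %d: retry after %ds\n", c.status, retryAfterSeconds(c.header, 5))
		} else {
			fmt.Printf("status %d: do not retry\n", c.status)
		}
	}
}
```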

## Data Directory Structure

```
/mnt/hdd/github_data/
├── raw_index/                   # Task 1 output (raw repo index)
│   ├── part-0001.parquet
│   └── part-0002.parquet
├── filtered_repos/              # Task 1 output (filtered repos)
│   ├── part-0001.parquet
│   └── part-0002.parquet
├── raw_prs/                     # Task 2 output
│   ├── part-0001.parquet
│   └── part-0002.parquet
├── enriched_prs/                # Task 3 output
│   ├── part-0001.parquet
│   └── part-0002.parquet
├── llm_enhanced_prs/            # Task 4 output
│   ├── part-0001.parquet
│   └── part-0002.parquet
├── rendered_text/               # Task 5 output
│   ├── part-0001.parquet
│   └── part-0002.parquet
├── tokenized_dataset/           # Task 6 output (training data)
│   ├── part-0001.parquet
│   └── part-0002.parquet
├── token_stats/                 # Task 6 output (statistics)
│   ├── part-0001.parquet
│   └── part-0002.parquet
└── failures.jsonl               # Error log
```
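
The `part-NNNN.parquet` naming together with `MAX_FILE_SIZE` implies size-based file rotation. The sketch below shows one way that bookkeeping could work; the `rotator` type and the rule of rotating before a batch would exceed the limit are assumptions for this example, not a description of the writers in `parquet/`.

```go
package main

import "fmt"

// partName formats a 1-based part number the way the tree above shows it.
func partName(n int) string {
	return fmt.Sprintf("part-%04d.parquet", n)
}

// rotator tracks how many bytes have gone into the current part file.
type rotator struct {
	maxBytes int64 // corresponds to MAX_FILE_SIZE
	part     int   // current part number, starting at 1
	written  int64 // bytes accounted to the current part
}

func newRotator(maxBytes int64) *rotator {
	return &rotator{maxBytes: maxBytes, part: 1}
}

// fileFor returns the file a batch of batchBytes should go to. It moves to
// the next part first if the batch would push the current one past maxBytes.
// A batch larger than maxBytes still gets written, into a part of its own.
func (r *rotator) fileFor(batchBytes int64) string {
	if r.written > 0 && r.written+batchBytes > r.maxBytes {
		r.part++
		r.written = 0
	}
	r.written += batchBytes
	return partName(r.part)
}

func main() {
	r := newRotator(524288000) // default MAX_FILE_SIZE
	for _, size := range []int64{300 << 20, 150 << 20, 100 << 20} {
		fmt.Printf("%d bytes -> %s\n", size, r.fileFor(size))
	}
}
```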

## Output Schemas

- See `models/` directory for detailed Parquet schemas for each task's output

## Filtering Rules

### Task 1: Repository Filtering
1. **Language**: Must match `TARGET_LANGUAGE` (default: `Python`)
2. **Stars**: Must have ≥ 5 stars (configurable)
3. **Status**: Must not be archived

### Task 2: PR Filtering
1. **Merge Status**: Must be merged (merged_at != null)
2. **Python Files**: Must have ≥ 1 and ≤ 5 Python files (configurable)
3. **Total Files**: Must have ≤ 20 total files (configurable)
4. **File Extensions**: Only counts Python (.py, .pyi, .pyx, .pyw) and documentation (.md, .rst) files
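
The Task 2 rules can be sketched as a small predicate. This is an illustration only, not the logic in `processPR()`: the `prInfo` and `prLimits` types are hypothetical, extensions are matched case-sensitively, and the total is taken over all changed files, which may differ from how rule 4 is applied in the real code.

```go
package main

import (
	"fmt"
	"path"
)

// prInfo is a hypothetical, trimmed-down PR record.
type prInfo struct {
	Merged bool     // true when merged_at is non-null
	Files  []string // paths of the files changed by the PR
}

// prLimits mirrors MIN_PY_FILES, MAX_PY_FILES, and MAX_TOTAL_FILES.
type prLimits struct {
	MinPy, MaxPy, MaxTotal int
}

var pythonExts = map[string]bool{".py": true, ".pyi": true, ".pyx": true, ".pyw": true}

// isPython reports whether name has one of the Python extensions listed above.
func isPython(name string) bool {
	return pythonExts[path.Ext(name)]
}

// keepPR applies the merge, total-file, and Python-file-count rules in order.
func keepPR(pr prInfo, lim prLimits) bool {
	if !pr.Merged {
		return false
	}
	if len(pr.Files) > lim.MaxTotal {
		return false
	}
	py := 0
	for _, f := range pr.Files {
		if isPython(f) {
			py++
		}
	}
	return py >= lim.MinPy && py <= lim.MaxPy
}

func main() {
	defaults := prLimits{MinPy: 1, MaxPy: 5, MaxTotal: 20}
	pr := prInfo{Merged: true, Files: []string{"src/app.py", "README.md"}}
	fmt.Println("keep:", keepPR(pr, defaults))
}
```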

### Task 6: Final Dataset Filtering
1. **Bot Authors**: Excluded (based on Task 2 metadata)
2. **Issue Comments**: Related issue must have ≤ 20 comments
3. **Token Length**: Must have ≤ 32,000 tokens (configurable)

## Performance Tuning

### Network Optimization (Tasks 1-3)
- Tune `MAX_CONCURRENCY` for each stage, depending on observed throughput in console logs

### Offline Task Optimization (Tasks 5-6)
- Adjust `OFFLINE_CONCURRENCY` based on CPU cores or observed throughput

### Storage Optimization
- Choose the parquet reader/writer (requires a source code change) based on storage type:
  - Small output files: Sequential I/O is enough
  - HDD: Sequential I/O
  - SSD/GPFS: Parallel I/O, tune reader/writer concurrency

## Graceful Shutdown

The pipeline handles `SIGINT` (Ctrl+C) and `SIGTERM` gracefully:
1. Stops accepting new work
2. Completes in-flight requests
3. Flushes all buffers to disk
4. Closes all file handles
5. Keeps log output during shutdown short (in most cases)

Note: No checkpoint saving needed - API proxy caching makes reruns fast
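
The shutdown sequence above follows a common Go pattern, sketched here under assumptions: `runWorkers` is a hypothetical helper, jobs are plain integers, and the flush/close steps are reduced to a comment. The real tasks may structure this differently.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
)

// runWorkers handles jobs until the channel is closed or stop fires.
// A job that has already been picked up always runs to completion.
func runWorkers(stop <-chan struct{}, jobs <-chan int, workers int, handle func(int)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// Check stop first so a pending signal wins over queued jobs.
				select {
				case <-stop:
					return
				default:
				}
				select {
				case <-stop:
					return
				case job, ok := <-jobs:
					if !ok {
						return
					}
					handle(job)
				}
			}
		}()
	}
	wg.Wait() // returns only after in-flight jobs have finished
}

func main() {
	// ctx is cancelled on SIGINT (Ctrl+C) or SIGTERM.
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer cancel()

	jobs := make(chan int, 8)
	for i := 0; i < 8; i++ {
		jobs <- i
	}
	close(jobs)

	var mu sync.Mutex
	done := 0
	runWorkers(ctx.Done(), jobs, 4, func(int) {
		mu.Lock()
		done++
		mu.Unlock()
	})

	// A real task would flush buffers and close file handles here.
	fmt.Println("processed", done, "jobs")
}
```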

## Monitoring

Monitor progress through console output with real-time throughput reporting:
```
[INFO] Starting Task 1: Repository Survey from ID 0 with 200 producers
[THROUGHPUT] Fetched: 50000 (16.7/s) | Filtered: 12000 (4.0/s) | API Requests: 55000 (18.3/s, +50) | Elapsed: 3000.0s
[INFO] Created new parquet file: filtered_repos/part-0001.parquet
[WARN] Rate limit hit. Sleeping for 60 seconds
```
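
A periodic reporter of this kind can be built from atomic counters and a ticker. The sketch below is an assumption about the general shape, not the pipeline's code: the `counters` type and function names are hypothetical, and the API request field (including its `+50` suffix) is omitted because its exact meaning is not documented here.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// counters are incremented by workers and read by the reporter.
type counters struct {
	fetched  atomic.Int64
	filtered atomic.Int64
}

// throughputLine formats cumulative counts with average rates since start.
func throughputLine(fetched, filtered int64, elapsedSecs float64) string {
	rate := func(n int64) float64 {
		if elapsedSecs <= 0 {
			return 0
		}
		return float64(n) / elapsedSecs
	}
	return fmt.Sprintf("[THROUGHPUT] Fetched: %d (%.1f/s) | Filtered: %d (%.1f/s) | Elapsed: %.1fs",
		fetched, rate(fetched), filtered, rate(filtered), elapsedSecs)
}

// report prints one line per tick until done is closed.
func report(c *counters, start time.Time, every time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case now := <-ticker.C:
			fmt.Println(throughputLine(c.fetched.Load(), c.filtered.Load(), now.Sub(start).Seconds()))
		}
	}
}

func main() {
	var c counters
	done := make(chan struct{})
	finished := make(chan struct{})
	start := time.Now()

	// The pipeline reports every 30 seconds; a short interval keeps the demo quick.
	go func() {
		report(&c, start, 20*time.Millisecond, done)
		close(finished)
	}()

	for i := 0; i < 1000; i++ {
		c.fetched.Add(1)
		if i%4 == 0 {
			c.filtered.Add(1)
		}
	}
	time.Sleep(50 * time.Millisecond)
	close(done)
	<-finished
}
```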

## Development

### Project Structure

- [`main.go`](main.go): entry point / task router
- [`config/`](config): configuration
  - [`config/config.go`](config/config.go): env var parsing + defaults
- [`client/`](client): external service clients
  - [`client/github_client.go`](client/github_client.go): GitHub API client (via reverse proxy)
  - [`client/llm_client.go`](client/llm_client.go): OpenAI-compatible LLM client
- [`models/`](models): parquet schemas / data models
  - [`models/models.go`](models/models.go)
- [`parquet/`](parquet): parquet readers/writers (sequential + parallel)
- [`diff/`](diff): patch parsing + translation utilities (used by rendering/tokenization)
- [`tasks/`](tasks): pipeline tasks
  - [`tasks/task1_repos.go`](tasks/task1_repos.go): repository survey
  - [`tasks/task2_prs.go`](tasks/task2_prs.go): PR ingestion
  - [`tasks/task3_enrich_pr.go`](tasks/task3_enrich_pr.go): PR enrichment
  - [`tasks/task4_llm_enhance.go`](tasks/task4_llm_enhance.go): LLM enhancement
  - [`tasks/task5_render_text.go`](tasks/task5_render_text.go): render training text
  - [`tasks/task6_tokenization.go`](tasks/task6_tokenization.go): tokenization + dataset preparation
- [`tokenizer/`](tokenizer): tokenizer integration
  - [`tokenizer/rust_client.go`](tokenizer/rust_client.go): IPC client for Rust worker
  - [`tokenizer/rust_worker/`](tokenizer/rust_worker): Rust tokenizer worker
    - [`tokenizer/rust_worker/Cargo.toml`](tokenizer/rust_worker/Cargo.toml)
    - [`tokenizer/rust_worker/src/main.rs`](tokenizer/rust_worker/src/main.rs)
    - [`tokenizer/rust_worker/README.md`](tokenizer/rust_worker/README.md)
- [`cmd/`](cmd): standalone utilities
- [`bench_data/`](bench_data): Rust benchmarks
- [`logger/`](logger): failure logging
  - [`logger/logger.go`](logger/logger.go)

### Adding New Filters

Edit filtering logic in:
- Task 1: `tasks/task1_repos.go` → `enrichAndFilter()`
- Task 2: `tasks/task2_prs.go` → `processPR()`

## Performance Benchmarks

### Task 6 Tokenization Performance

**Hardware**: 192-core server

**Go-only implementation** (deprecated):
- Throughput: ~1.8M tokens/s (266 PRs/s)
- Bottleneck: Go tokenizer library overhead

**Hybrid Go/Rust implementation** (current):
- Throughput: ~22M tokens/s (1,810 PRs/s)
- **12x performance improvement**
- Architecture: Go handles filtering/I/O, Rust handles tokenization
- IPC overhead: Negligible (batch-based communication)

**Scaling**: Linear up to CPU core count with thread-local tokenizers in Rust workers.

## Architecture Details

### Task 6: Hybrid Go/Rust Tokenization

```
┌─────────────────────────────────────────────────────────────┐
│                     Go Worker Pool (184)                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │  ...184  │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │             │             │             │           │
│       │ Filter PRs  │             │             │           │
│       │ Render Text │             │             │           │
│       └─────────────┴─────────────┴─────────────┘           │
│                          │                                   │
│         IPC: Length-Prefixed MessagePack (concurrent)        │
└──────────────────────────┼───────────────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────────────┐
│          Single Rust Tokenizer Worker Process                │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │         Internal Thread Pool (128 workers)          │    │
│  │                                                      │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────┐ │    │
│  │  │ Workers 1-16 │  │Workers 17-32 │  │ ... 128  │ │    │
│  │  │      ↓       │  │      ↓       │  │    ↓     │ │    │
│  │  │ Tokenizer 1  │  │ Tokenizer 2  │  │ Token 8  │ │    │
│  │  └──────────────┘  └──────────────┘  └──────────┘ │    │
│  │                                                      │    │
│  │  8 Tokenizer Groups (16 threads per tokenizer)     │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                               │
│  Throughput: 22M tokens/s (1,810 PRs/s)                     │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           │ IPC: MessagePack responses
┌──────────────────────────▼───────────────────────────────────┐
│                  Go Parquet Writers                          │
│              (Write tokenized data + stats)                  │
└──────────────────────────────────────────────────────────────┘
```

**Key Design Decisions:**
1. **Single Rust process**: Eliminates process spawning overhead (184 processes → 1)
2. **Internal thread pool**: 128 workers in 8 groups (16 threads per tokenizer)
3. **Length-prefixed MessagePack**: Solves buffer size limits, 2-3x faster than JSON
4. **Concurrent IPC**: 184 Go workers send batches concurrently to a single Rust process
5. **Optimal configuration**: 128 workers with 16 threads/tokenizer from empirical benchmarks
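
Length-prefixed framing (decision 3) can be sketched as follows. The 4-byte big-endian prefix, the `maxFrame` cap, and the function names are assumptions for illustration; the actual wire format is defined by `tokenizer/rust_client.go` and the Rust worker. The payload is treated as opaque bytes here, whereas in the pipeline it would be a MessagePack-encoded batch.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// maxFrame is a hypothetical safety cap on a single frame.
const maxFrame = 64 << 20

// writeFrame writes a 4-byte big-endian length followed by the payload.
func writeFrame(w io.Writer, payload []byte) error {
	if len(payload) > maxFrame {
		return fmt.Errorf("payload of %d bytes exceeds limit", len(payload))
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one frame written by writeFrame.
func readFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrame {
		return nil, fmt.Errorf("frame of %d bytes exceeds limit", n)
	}
	payload := make([]byte, n)
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

// roundTrip writes all payloads to an in-memory buffer and reads them back.
func roundTrip(payloads ...[]byte) ([][]byte, error) {
	var buf bytes.Buffer
	for _, p := range payloads {
		if err := writeFrame(&buf, p); err != nil {
			return nil, err
		}
	}
	out := make([][]byte, 0, len(payloads))
	for range payloads {
		p, err := readFrame(&buf)
		if err != nil {
			return nil, err
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	// Newlines inside a payload are harmless because nothing is parsed by line.
	frames, err := roundTrip([]byte("def f():\n    pass\n"), []byte("second batch"))
	if err != nil {
		panic(err)
	}
	for i, f := range frames {
		fmt.Printf("frame %d: %d bytes\n", i, len(f))
	}
}
```

When several goroutines share one pipe, each `writeFrame` call would need to be serialized (for example with a mutex) so that headers and payloads from different batches do not interleave.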

**Performance:**
- **IPC overhead**: <1% (binary protocol, no line parsing)
- **Memory efficiency**: 8 tokenizers vs 184 (23x reduction)
- **Throughput**: 22M tokens/s sustained

See [`tokenizer/rust_worker/README.md`](tokenizer/rust_worker/README.md) for detailed Rust worker documentation.
