---
name: research-collect
description: "[Read when prompt contains /research-collect]"
metadata:
  {
    "agent-runtime":
      {
        "emoji": "🔍",
      },
  }
---

# Literature Survey

**Don't ask permission. Just do it.**

**Workspace:** `$W` = working directory provided in the task parameter.

## Output Structure

```
$W/
├── survey/
│   ├── search_terms.json      # search-term list
│   └── report.md              # final report
├── papers/
│   ├── _downloads/            # raw downloads
│   ├── _meta/                 # per-paper metadata
│   │   └── {arxiv_id}.json
│   └── {direction}/           # post-clustering folders
├── repos/                     # reference code repos (Phase 3)
│   ├── {repo_name_1}/
│   └── {repo_name_2}/
└── prepare_res.md             # repo-selection report (Phase 3)
```

---

## Workflow

### Phase 1: Preparation

Make sure the workspace structure exists:

```bash
mkdir -p "$W/survey" "$W/papers/_downloads" "$W/papers/_meta"
```

Generate 4–8 search terms and save them to `$W/survey/search_terms.json`.

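The skill does not fix a schema for `search_terms.json`. As a minimal sketch, assuming the file is a flat JSON array of strings, the terms could be written like this (`save_search_terms` is a hypothetical helper, not part of the skill):

```python
import json
from pathlib import Path


def save_search_terms(workspace: str, terms: list[str]) -> Path:
    """Write the search terms to <workspace>/survey/search_terms.json.

    Assumption: the file is a flat JSON array of strings.
    """
    # Strip whitespace, drop blanks, and remove duplicates while keeping order.
    cleaned = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    if not 4 <= len(cleaned) <= 8:
        raise ValueError(f"expected 4-8 search terms, got {len(cleaned)}")
    out = Path(workspace) / "survey" / "search_terms.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(cleaned, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```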
---

### Phase 2: Incremental search → filter → download (loop)

**Repeat the steps below for each search term:**

#### 2.1 Search

```
arxiv_search({ query: "<term>", max_results: 30 })
```

#### 2.2 Filter immediately

Score the returned papers (1–5) **immediately** and keep only those scoring ≥4.

Scoring guide:

- 5: core paper, directly studies this topic.
- 4: related method or application.
- 3 and below: skip.

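The threshold rule can be sketched as below. The return shape of `arxiv_search` is not documented here, so the sketch assumes each result has already been turned into a dict carrying an `arxiv_id` and the assigned `score`; `keep_relevant` is a hypothetical name.

```python
def keep_relevant(scored_papers: list[dict], threshold: int = 4) -> list[dict]:
    """Return only the papers whose 1-5 relevance score meets the threshold."""
    kept = []
    for paper in scored_papers:
        score = paper["score"]
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for {paper.get('arxiv_id')}: {score}")
        if score >= threshold:
            kept.append(paper)
    return kept
```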
#### 2.3 Download useful papers

```
arxiv_download({
  arxiv_ids: ["<useful paper id>"],
  output_dir: "$W/papers/_downloads"
})
```

#### 2.4 Write metadata

For each downloaded paper, create a metadata file at `$W/papers/_meta/{arxiv_id}.json`:

```json
{
  "arxiv_id": "2401.12345",
  "title": "...",
  "abstract": "...",
  "score": 5,
  "source_term": "battery RUL prediction",
  "downloaded_at": "2024-01-15T10:00:00Z"
}
```

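One way to produce that file is sketched below. It assumes the paper is available as a dict with `arxiv_id`, `title`, and `abstract` keys; `write_meta` is a hypothetical helper, and the timestamp is generated in UTC to match the example's format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_meta(workspace: str, paper: dict, score: int, source_term: str) -> Path:
    """Create <workspace>/papers/_meta/{arxiv_id}.json for one downloaded paper."""
    meta = {
        "arxiv_id": paper["arxiv_id"],
        "title": paper["title"],
        "abstract": paper["abstract"],
        "score": score,
        "source_term": source_term,
        # UTC timestamp in the same shape as the example above.
        "downloaded_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    out = Path(workspace) / "papers" / "_meta" / f"{meta['arxiv_id']}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```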
**Finish one search term completely before moving to the next.** This keeps the context from being polluted by large search-result dumps.

---

### Phase 3: GitHub code search and reference-repo selection

**Goal:** give downstream skills (research-survey, research-plan, research-implement) reference open-source implementations to build on.

#### 3.1 Pick high-scoring papers

Read papers with score ≥4 from `$W/papers/_meta/` and pick the **top 5** most relevant.

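A rough sketch of this step, assuming the metadata files follow the schema from 2.4. Score is only a proxy for relevance, so the result is a candidate list that may still need a manual pass; `top_papers` is a hypothetical helper.

```python
import json
from pathlib import Path


def top_papers(workspace: str, k: int = 5, min_score: int = 4) -> list[dict]:
    """Load papers/_meta/*.json and return up to k papers, highest score first."""
    meta_dir = Path(workspace) / "papers" / "_meta"
    papers = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(meta_dir.glob("*.json"))
    ]
    eligible = [p for p in papers if p.get("score", 0) >= min_score]
    # sorted() is stable, so papers with equal scores keep their file-name order.
    return sorted(eligible, key=lambda p: p["score"], reverse=True)[:k]
```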
#### 3.2 Search for reference repos

For each chosen paper, search GitHub with combinations like:

- Paper title + "code" / "implementation"
- Core method name + author name
- Dataset name mentioned in the paper + task name

Use the `github_search` tool:

```javascript
github_search({
  query: "{paper_title} implementation",
  max_results: 10,
  sort: "stars",
  language: "python"
})
```

#### 3.3 Filter and clone

Evaluate the returned repos by:

- Star count (suggested >100).
- Code quality (has README, has requirements.txt, clear structure).
- Match with the paper.

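Only the star-count criterion is mechanical; code quality and match with the paper still need judgment. As a sketch of the mechanical part, assuming each search result is a dict with a `stars` field (the actual return shape of `github_search` is not documented here, and `shortlist` is a hypothetical name):

```python
def shortlist(candidates: list[dict], min_stars: int = 100, limit: int = 5) -> list[dict]:
    """Keep repos with more than min_stars stars, most-starred first, at most `limit`."""
    popular = [c for c in candidates if c.get("stars", 0) > min_stars]
    return sorted(popular, key=lambda c: c["stars"], reverse=True)[:limit]
```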
Pick **3–5** of the most relevant repos and clone them into `$W/repos/`:

```bash
mkdir -p "$W/repos"
cd "$W/repos"
git clone --depth 1 <repo_url>
```

#### 3.4 Write the selection report

Create `$W/prepare_res.md`:

```markdown
# Reference repo selection

| Repo | Paper | Stars | Why selected |
|------|-------|-------|--------------|
| repos/{repo_name} | {paper_title} (arxiv:{id}) | {N} | {reason} |

## Key files per repo

### {repo_name}
- **Model implementation**: `model/` or `models/`
- **Training script**: `train.py` or `main.py`
- **Data loading**: `data/` or `dataset.py`
- **Core file**: `{key file path}` — {description}
```

**If no relevant repo can be found**, note "no usable reference repo" in `prepare_res.md`; downstream skills will then run without a code mapping.

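The table portion of the report could be rendered along these lines. This is a sketch only: the row keys (`repo_name`, `paper_title`, `arxiv_id`, `stars`, `reason`) are assumptions taken from the template placeholders, and the "Key files per repo" section is left to be filled in by hand.

```python
def render_selection_table(rows: list[dict]) -> str:
    """Render the top of prepare_res.md from a list of selected repos."""
    lines = ["# Reference repo selection", ""]
    if not rows:
        # Fallback described above for the case where nothing usable was found.
        lines.append("no usable reference repo")
        return "\n".join(lines) + "\n"
    lines.append("| Repo | Paper | Stars | Why selected |")
    lines.append("|------|-------|-------|--------------|")
    for r in rows:
        # Escape pipes so a free-text reason cannot break the table layout.
        reason = r["reason"].replace("|", "\\|")
        lines.append(
            f"| repos/{r['repo_name']} | {r['paper_title']} (arxiv:{r['arxiv_id']}) "
            f"| {r['stars']} | {reason} |"
        )
    return "\n".join(lines) + "\n"
```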
---

### Phase 4: Cluster and organize

After all search terms and code searches are done:

#### 4.1 Read all metadata

```bash
ls "$W/papers/_meta/"
```

Read every `.json` file and aggregate the paper list.

#### 4.2 Cluster

Based on titles, abstracts, and source search terms, identify 3–6 research directions.

#### 4.3 Create folders and move

```bash
mkdir -p "$W/papers/data-driven"
mv "$W/papers/_downloads/2401.12345" "$W/papers/data-driven/"
```

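With many papers, the same move can be driven by a mapping from arXiv ID to direction. The sketch below assumes each download is stored under exactly its ID inside `_downloads/`, as in the `mv` example; `organize` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def organize(workspace: str, assignments: dict[str, str]) -> list[str]:
    """Move each download into papers/{direction}/.

    Returns the IDs that were not found in _downloads/ so they can be reviewed.
    """
    papers = Path(workspace) / "papers"
    missing = []
    for arxiv_id, direction in assignments.items():
        src = papers / "_downloads" / arxiv_id
        if not src.exists():
            missing.append(arxiv_id)
            continue
        dest_dir = papers / direction
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest_dir / src.name))
    return missing
```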
---

### Phase 5: Generate the report

Create `$W/survey/report.md`:

- Survey summary (number of search terms, papers, directions).
- Overview of each research direction.
- Top 10 papers.
- **Reference-repo summary** (cite `prepare_res.md`).
- Suggested reading order.

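The numbers for the survey summary can be derived from the workspace itself. A minimal sketch, assuming `search_terms.json` is a flat JSON array and that every non-underscore folder under `papers/` is a direction (`survey_counts` is a hypothetical helper):

```python
import json
from pathlib import Path


def survey_counts(workspace: str) -> dict:
    """Count search terms, directions, and papers per direction for the report."""
    root = Path(workspace)
    terms = json.loads((root / "survey" / "search_terms.json").read_text(encoding="utf-8"))
    directions = {
        d.name: sum(1 for _ in d.iterdir())
        for d in sorted((root / "papers").iterdir())
        # Folders starting with "_" (_downloads, _meta) are bookkeeping, not directions.
        if d.is_dir() and not d.name.startswith("_")
    }
    return {
        "search_terms": len(terms),
        "papers": sum(directions.values()),
        "directions": directions,
    }
```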
---

## Key design

| Principle | Description |
|-----------|-------------|
| **Incremental processing** | Each search term completes search → filter → download → metadata independently, avoiding context blow-up. |
| **Metadata-driven** | Clustering is based on `_meta/*.json`, not on a giant in-memory list. |
| **Folders as categories** | Cluster results live in `papers/{direction}/`, no extra JSON needed. |

## Tools

| Tool | Purpose |
|------|---------|
| `arxiv_search` | Search papers (no side effects). |
| `arxiv_download` | Download .tex/.pdf (`output_dir` must be an absolute path). |
| `github_search` | Search reference repos. |
