AI for researchers: turn your reading list into a queryable corpus
*May 13, 2026 · 12 min read*
Research is bottlenecked by retrieval. You read a hundred papers over a year; you remember a few dozen specific arguments and forget the rest. When you sit down to write a literature review, you re-Google papers you already read, re-skim them, and re-extract the same passages. The reading work didn't compound.
The fix is a queryable corpus — your reading, organized so an AI can answer questions against it on demand. Not a passive archive. A working dataset where each paper has its own page, your annotations are first-class, and Claude or ChatGPT can run "what arguments do I have for X" or "what sources mention Y" without you re-reading anything.
This is the playbook: the setup used by researchers, grad students, analysts, and writers who've moved their literature work into MindWiki. Most of it generalizes to any markdown-based knowledge base.
TL;DR
For each paper you read, create one markdown page with citation metadata in YAML frontmatter, your summary in your own words, key quotes (sparingly), and [[wikilinks]] to the concepts the paper touches. Connect Claude or ChatGPT to the vault via MCP and start asking. The result: you stop re-reading what you already understood and start writing from a corpus that grows in value every month.
Why the standard research workflow stops scaling
Most people start research with Zotero (or Mendeley, or EndNote) + a folder of PDFs + scattered margin notes. That stack works fine for one paper. It collapses around paper number 50:
- Zotero stores metadata, not your thoughts. You can find the paper. You can't find what you thought about it.
- PDF margin notes don't aggregate. A highlight in a PDF is useful when re-reading that PDF. It's invisible when you're trying to write a synthesis across papers.
- Highlights without summarization are write-only. Readwise exports your highlights as bullets. Bullets aren't claims, and they're not in your voice. You can't write from them.
- No graph. Two papers that argue the same point from opposite directions look unrelated until you've manually noticed.
- No AI surface. Claude can't read your Zotero library. ChatGPT can't see your PDFs. Whatever AI work you do is starting from scratch.
The cumulative cost of these limits is enormous: hours per literature review, decades of accumulated reading you'll never re-access.
The vault-based research setup
The setup that scales:
Folder layout
```
sources/      ← one page per source
  /papers/      optional sub-categorization
  /books/
  /talks/
slips/        ← atomic concept notes you've written
research/     ← project pages — literature reviews, writeups
people/       ← author pages (backlinks reveal what you've read of each)
capture/      ← raw highlights queued for processing
```

Three folders carry weight: `sources/` (one page per paper/book/talk), `slips/` (concept notes in your voice, linked from sources and from each other), and `research/` (project pages that link to both).
Source-page schema
Every source page has frontmatter that turns it into a queryable record:
```
---
title: "Predictive coding and the active inference framework"
authors: [Friston, K.]
year: 2010
journal: Nature Reviews Neuroscience
citation: "Friston, K. (2010). The free-energy principle: a unified brain theory?..."
doi: 10.1038/nrn2787
zotero_key: ABCD1234 # optional cross-reference to Zotero
area: sources
type: paper
status: read # one of: queue, skim, read
tags: [neuroscience, free-energy, predictive-coding]
created: 2026-03-14
---
# Predictive coding and the active inference framework
## Summary (in my own words)
Friston argues that the brain minimizes free energy by maintaining
predictive models and acting to confirm them. The core claim is that
perception and action share the same loss function.
## Key claims
- Brains minimize variational free energy.
- Perception is approximate Bayesian inference.
- Action minimizes prediction error by changing the world.
## Quotes (sparingly)
> "The brain ... infers the causes of its sensations" (p. 4)
## My thoughts
- Connects to [[hyperprior]] in interesting ways.
- The [[free-energy principle]] page should be the primary anchor.
- Compare to [[Marr's three levels]] — Friston argues against the
computational/algorithmic separation.
## Backlinks
- (auto-populated by MindWiki — every page that wikilinks here)
```

A few specific moves that pay off:
- `status` field — `queue` (haven't read), `skim` (read superficially), `read` (real engagement). Lets you filter the corpus (see the sketch after this list).
- `citation` as a single rendered string. The AI can pull this for citations in your writeups. (You can also keep BibTeX in Zotero and reference the key — but having the formatted citation on the page is what the AI can actually use.)
- Summary in your own words. This is the load-bearing field. The point is the writing, not the highlights.
- `## My thoughts` with wikilinks. Where the corpus starts to compound — you're linking the paper to your existing concept network.
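Outside the AI loop, the `status` field also makes the vault scriptable. Here's a minimal sketch that lists everything still in the reading queue, assuming the vault is synced to disk as plain markdown (the `~/vault` path is a placeholder) and that pages follow the frontmatter schema above; nothing in it is MindWiki-specific.

```python
# Sketch: list every source still in the reading queue.
# Assumes a local markdown copy of the vault at ~/vault (placeholder path)
# and the frontmatter schema shown above; not a MindWiki API.
from pathlib import Path

import yaml  # pip install pyyaml

VAULT = Path("~/vault").expanduser()

def read_frontmatter(path: Path) -> dict:
    """Return a page's YAML frontmatter as a dict, or {} if it has none."""
    text = path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return {}
    parts = text.split("---", 2)  # ['', frontmatter, body]
    if len(parts) < 3:
        return {}
    return yaml.safe_load(parts[1]) or {}

queue = []
for page in (VAULT / "sources").rglob("*.md"):
    meta = read_frontmatter(page)
    if meta.get("status") == "queue":
        queue.append((int(meta.get("year") or 0), meta.get("title") or page.stem))

# Oldest unread sources first.
for year, title in sorted(queue):
    print(f"{year}  {title}")
```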
Concept (slip) pages
For each non-trivial concept you've engaged with, write a slip in slips/. The slip is in your voice. It links to the sources that taught it to you. The sources link to the slip via wikilink. The graph grows.
This is the Zettelkasten move and it's why the corpus compounds. After a year you have 60 sources + 200 slips, and the slips are doing most of the heavy lifting when you write — because each slip is one assertion that you can defend, with the source pages right next to it.
(See: Zettelkasten in 2026 with AI.)
Project / writeup pages
research/literature-review-X.md is a project page. It links to sources and slips. When you sit down to write, you start at this page and walk the wikilinks. The AI helps when you ask "what slips do I have on X that aren't yet referenced in this review?"
Where AI earns its keep
After you've got a few dozen source pages and a few dozen slips, connect Claude or ChatGPT to the vault via MCP (see the setup guide). Useful prompts:
Synthesis
> "Summarize what I've read on predictive coding. Group by year and group by whether each source supports or critiques the core claim. Cite the source pages."
The AI calls `mindwiki_search("predictive coding")`, reads the top hits, and synthesizes. It cites real pages you can click open. If the synthesis is missing a paper, you'll notice instantly.
Gap analysis
> "Read my literature review on X and check whether every source I cite is actually in sources/. Flag the citations I haven't written source pages for."
The AI runs `mindwiki_read_page` on the review, parses the citation strings, and cross-references `sources/`.
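You can approximate the same cross-check offline. The sketch below assumes you cite sources from the review by `[[wikilink]]` and that the link text matches the source page's filename under `sources/`; both are assumptions about your own conventions, not MindWiki behavior, and wikilinks to slips will show up as false positives.

```python
# Sketch: flag wikilinked sources in a review that have no page under sources/.
# Assumes sources are cited by [[wikilink]] and that link text matches the
# source page's filename; the vault path and review filename are placeholders.
import re
from pathlib import Path

VAULT = Path("~/vault").expanduser()
review = (VAULT / "research" / "literature-review-X.md").read_text(encoding="utf-8")

# Every [[wikilink]] in the review, dropping any |display-text suffix.
cited = {m.split("|")[0].strip() for m in re.findall(r"\[\[([^\]]+)\]\]", review)}

# Every page that already exists under sources/ (filename without .md).
existing = {p.stem for p in (VAULT / "sources").rglob("*.md")}

for missing in sorted(cited - existing):
    print(f"no source page yet: [[{missing}]]")
```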
Identifying clusters
> "Look at my sources/papers/ and propose a clustering — which papers belong together based on the concepts they engage with?"
The AI calls `mindwiki_list_pages` with `area: sources`, reads each one, and proposes clusters. Tells you where you have one paper on a topic and where you have eight.
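A crude no-AI baseline for the same question is to group source pages by the `tags` they declare. It won't catch conceptual overlap the way reading full pages does, but it does show where you have one paper versus eight. A sketch, assuming the same frontmatter schema and a placeholder `~/vault` path:

```python
# Sketch: a tag-level view of where the corpus is thick or thin.
# Same assumptions as above: local ~/vault, frontmatter per the schema.
from collections import defaultdict
from pathlib import Path

import yaml  # pip install pyyaml

VAULT = Path("~/vault").expanduser()
by_tag = defaultdict(list)

for page in (VAULT / "sources" / "papers").rglob("*.md"):
    text = page.read_text(encoding="utf-8")
    if not text.startswith("---"):
        continue
    parts = text.split("---", 2)
    if len(parts) < 3:
        continue
    meta = yaml.safe_load(parts[1]) or {}
    for tag in meta.get("tags") or []:
        by_tag[tag].append(meta.get("title") or page.stem)

# Tags with one paper are gaps; tags with many are clusters worth a slip.
for tag, titles in sorted(by_tag.items(), key=lambda kv: -len(kv[1])):
    print(f"{tag}: {len(titles)} paper(s)")
```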
Writing-time retrieval
> "I'm about to write a section on X. What slips do I have on X, and what sources back them up?"
This is the workflow that makes literature reviews stop feeling like punishment. You're never starting from a blank page — you're pulling from the corpus and writing the connective prose.
Periodic queue review
For MindWiki Pro users, the Weekly Classifier reads recent captures and proposes which ones are worth promoting into proper source pages. Pattern Detection surfaces emergent clusters across your reading. Monthly Summary writes a month-end report of what you read and what shifted.
Where to source highlights
The pipeline that works:
- Readwise export for book highlights. Drop the `Books/` markdown export into `capture/` and process during weekly review. Don't migrate the whole archive; cherry-pick the highlights you'd actually re-read.
- Email forwarding for newsletters. Forward to `{username}@mindwiki.io` — it lands in `capture/` as a markdown page. Triage weekly.
- PDF annotations. Hardest. Two options: re-type the highlights you actually want into the source page, or use a tool like Hypothes.is and pipe its annotations into the vault via the API (a sketch of the API route follows this list). Most researchers find that re-typing the *important* highlights (not all of them) is the right friction — it's a forcing function for actually engaging with the source.
- Talks and podcasts. Transcribe via Otter or Apple's built-in transcription. Drop the transcript into `capture/`. Write a real source page during the next weekly review.
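For the Hypothes.is option, the sketch below pulls your recent annotations from its public search API and writes them into `capture/` as a single markdown page to triage. You need an API token from your Hypothes.is settings; the vault path, filename, and output layout are my assumptions, not a MindWiki feature.

```python
# Sketch: pull recent Hypothes.is annotations into capture/ for weekly triage.
# Uses the public search API (https://api.hypothes.is/api/search); you need a
# developer token from your Hypothes.is settings. Vault path, filename, and
# output layout are assumptions, not a MindWiki feature.
import os
from pathlib import Path

import requests  # pip install requests

TOKEN = os.environ["HYPOTHESIS_TOKEN"]
USER = "acct:yourname@hypothes.is"              # your Hypothes.is account
CAPTURE = Path("~/vault/capture").expanduser()  # placeholder path

resp = requests.get(
    "https://api.hypothes.is/api/search",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"user": USER, "limit": 50, "sort": "created", "order": "desc"},
    timeout=30,
)
resp.raise_for_status()

lines = ["# Hypothes.is annotations (to triage)", ""]
for row in resp.json().get("rows", []):
    quote = ""
    for target in row.get("target", []):
        for selector in target.get("selector", []):
            if selector.get("type") == "TextQuoteSelector":
                quote = selector.get("exact", "")
    lines.append(f"- > {quote}")
    if row.get("text"):                          # your note on the highlight
        lines.append(f"  {row['text']}")
    lines.append(f"  ({row.get('uri', '')})")

(CAPTURE / "hypothesis-annotations.md").write_text("\n".join(lines), encoding="utf-8")
```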
Honest about what MindWiki doesn't do
- Not a citation manager. Keep Zotero (or your tool of choice) as the system of record for bibliographic data. Cross-reference by DOI or Zotero key in the frontmatter.
- Not a PDF reader. Read PDFs in your favorite reader. Excerpt and summarize into the source page.
- Not a literature search engine. Use Google Scholar, Semantic Scholar, or Connected Papers to find sources. Bring the findings into the vault.
The split that works: external tools for discovery and bibliographic data, MindWiki for *your engagement* with the sources. That's the part that compounds.
A realistic ramp-up
Weeks 1–4: set up the vault, write source pages for the most-cited papers from your last project. Don't try to migrate ten years of reading.
Months 2–3: add new sources as you read them. Don't try to backfill. Write slips for the strongest concepts that emerge from the reading.
Month 4: connect Claude or ChatGPT. Start asking. You'll discover what's missing from the corpus by what the AI can't find.
Month 6: you have a working queryable corpus. Literature reviews stop feeling like you're starting from scratch.
Year 2: the corpus is one of the highest-value assets you've built. Future projects start with "what do I already know about this" producing real answers instead of vague recollections.
Why MindWiki specifically
- Markdown vault so the corpus is yours, portable, and survives any future tool change.
- Properties + table views for status filtering, year sorting, author grouping.
- Wikilinks + backlinks so source pages and slip pages cross-reference automatically.
- Hybrid search so half-remembered phrases pull up the right source.
- MCP server so Claude, ChatGPT, Codex, and Claude Desktop can read the corpus directly.
- Pro automations that handle the structural maintenance work no researcher wants to do manually.
If you're a researcher, grad student, analyst, or writer for whom reading is the main work, this is the architecture that scales for a decade.