Indexing Layers

A library of religious literature is not searchable by raw text alone. We build a stack of complementary indexes — each engineered to answer a different kind of question.

The problem with one big index

Every paragraph in our corpus is, by default, indexed by its own text. Keyword search finds passages that contain the user's query words; semantic search finds passages whose embedding is close to the query's embedding. That works well for direct lookups — "He is the best beloved of all things" finds the Hidden Words passage immediately.

But scriptural and scholarly texts are routinely searched in ways their own words don't anticipate. A user asks "why does prejudice get in the way of finding truth?" The Iqán's actual passage uses entirely different vocabulary — "He must so cleanse his heart that no remnant of either love or hate may linger therein, lest that love blindly incline him to error, or that hate repel him away from the truth." The semantic distance between the user's casual paraphrase and that 19th-century passage is large enough that the right paragraph never surfaces, even though it directly answers the question.

One index can't bridge that gap. So we layer.

Architecture: main index plus sidecars

The corpus has one main paragraph index, the canonical home for every paragraph, and a series of sidecar indexes, each storing a different derived signal. At query time, we search the main index and every sidecar in parallel, then merge the results using Reciprocal Rank Fusion. A paragraph that ranks well in multiple indexes scores higher than one that ranks in only one, so the merge favors passages that match both how the user asks and what the passage teaches.

Sidecars are additive. A new layer doesn't change the existing indexes; it just contributes another column of signal to the merge. We can ship a new layer for one corpus while leaving everything else alone.
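
The merge step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the production code: the constant k = 60 is the value commonly used for RRF and is an assumption here, and the index names and paragraph ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse per-index ranked lists of paragraph ids into one ranking.

    Each index contributes 1 / (k + rank) for every paragraph it returns,
    so a paragraph found by several indexes accumulates a higher score.
    k = 60 is an assumed default, not a value taken from the system.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, paragraph_id in enumerate(ranked_ids, start=1):
            scores[paragraph_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from the main index and one sidecar.
rankings = {
    "main": ["doc-a-12", "iqan-248", "doc-b-3"],
    "hype_questions": ["iqan-248", "doc-c-7"],
}
fused = reciprocal_rank_fusion(rankings)
```

Because each index only adds terms to the sum, a new sidecar is one more entry in the rankings dictionary and leaves the other indexes untouched.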

Layer 1 — Main paragraph index

What it stores: every paragraph's text plus a separate "context" field (a one-line resolution of pronouns and ambiguous references — see Layer 2). The full record is embedded for semantic search and tokenized for keyword matching.

What it answers: "Find me passages whose actual words or meaning match my query."

Example: Query "Bahá'u'lláh on consultation" → directly matches paragraphs containing "consultation" and "Bahá'u'lláh", and semantically matches paragraphs about "taking counsel together" even when those exact words don't appear.

Layer 2 — Disambiguation

Most paragraphs in religious literature contain pronouns and indirect references that, in isolation, are ambiguous. "He must purge his breast..." (who is "He"?) or "in this Day" (which day?). On their own, these phrases tell the embedding model nothing.

Before each paragraph is embedded, we run a disambiguation pass that produces a one-line context resolution: who or what each ambiguous reference points to, given the surrounding paragraphs. The resolution is stored in the paragraph's context field and embedded together with the paragraph text, so a paragraph that says "He" is now embedded knowing "He = the true seeker."

Example: Paragraph 248 of the Iqán reads "But, O my brother, when a true seeker determineth...". Disambiguation adds: seeker = the true seeker pursuing knowledge of God; my brother = generic addressee, the reader being instructed. The semantic match for "path of the spiritual seeker" now lights up reliably.

Value: ~15-20% of paragraphs are heavily anaphoric. Without disambiguation, those paragraphs are effectively invisible to paraphrase queries. Disambiguation isn't a separate sidecar — it lives inside the main index, augmenting each paragraph's embedding.
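
One way to picture this is as a record whose text and context line are joined into a single string before embedding. The sketch below assumes that reading; the ParagraphRecord type, the embedding_input helper, and the bracketed "[context: ...]" format are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ParagraphRecord:
    paragraph_id: str
    text: str
    context: str  # one-line resolution of pronouns and ambiguous references


def embedding_input(record: ParagraphRecord) -> str:
    """Build the string handed to the embedding model (assumed format)."""
    if not record.context:
        return record.text
    return f"{record.text}\n[context: {record.context}]"


record = ParagraphRecord(
    paragraph_id="iqan-248",
    text="But, O my brother, when a true seeker determineth...",
    context="seeker = the true seeker pursuing knowledge of God",
)
combined = embedding_input(record)
```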

Layer 3 — HyPE questions

HyPE — Hypothetical Paragraph Embeddings — flips the search problem on its head. Instead of asking "does this paragraph match the user's query?", we precompute "what questions does this paragraph answer?", embed those questions, and match against them at search time.

For each paragraph we generate exactly five questions covering five different registers — because real users search differently from real scholars:

Conversational — how a thoughtful friend would actually ask. "Why does prejudice keep us from seeing the truth?"
Topical concept — academic framing of the central idea. "Bahá'u'lláh on the prerequisites for spiritual perception."
Philosophical implication — the doctrinal stake or what follows from this teaching. "Does the Iqán teach that emotional bias is an epistemological barrier to truth?"
Cross-tradition / connection — broader debates, traditions, or fields the passage speaks to. "How does this prescription compare to Sufi fana or Christian apophatic tradition?"
Distinctive phrase — a striking phrase from the passage itself. "no remnant of either love or hate may linger therein"

Each question becomes a row in the hype_questions sidecar index, embedded and tagged with its source paragraph_id. When a user searches, the query is matched semantically against these rows. A query phrased like a real user question can match the question form directly, so the embedding no longer has to bridge the distance from informal modern English to 19th-century scriptural prose.

Value: Paraphrase, casual register, and conceptual queries land. The same paragraph becomes findable through five different "doors" (one suited to a casual user, one to a scholar, one to a comparative-religion student, and so on) without sacrificing the precision of the main paragraph index.
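
The lookup can be illustrated with a toy example. A bag-of-words vector stands in for a real embedding model here, which is a deliberate simplification; the row layout, the search_hype helper, and the second document id are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical sidecar rows: (paragraph_id, precomputed question).
hype_rows = [
    ("iqan-248", "Why does prejudice keep us from seeing the truth?"),
    ("iqan-248", "Bahá'u'lláh on the prerequisites for spiritual perception."),
    ("doc-2", "How should a community reach decisions through consultation?"),
]


def search_hype(query: str, rows: list[tuple[str, str]]) -> list[str]:
    """Rank paragraphs by their best-matching precomputed question."""
    query_vec = embed(query)
    best: dict[str, float] = {}
    for paragraph_id, question in rows:
        score = cosine(query_vec, embed(question))
        best[paragraph_id] = max(score, best.get(paragraph_id, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [paragraph_id for paragraph_id, _ in ranked]


results = search_hype("why does prejudice get in the way of finding truth?", hype_rows)
```

Each paragraph is scored by its single best question, so five questions give a paragraph five chances to match without counting it five times.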

Layer 4 — Doctrinal thesis (new)

Questions are useful for matching how readers ask. But they don't tell us what the passage actually teaches as a proposition. So we add a separate one-sentence doctrinal thesis per paragraph — a precise claim, not a question.

Example for paragraph 248 of the Iqán:

The true seeker of divine knowledge must first purify the heart of acquired learning, pride, and all attachments — including love and hate — and cultivate trust in God, patience, and silence, because inner purity is the prerequisite condition for spiritual perception and truth-seeking.

The thesis goes into the same sidecar as the questions but is flagged with is_thesis: 1. Three uses:

Indexable as its own search target — when a user asks "what does the Iqán teach about love and hate?", the thesis matches directly even when no specific question form does.
TL;DR for search results — the thesis displays alongside the paragraph, giving the reader an immediate sense of the passage's claim before they read the full text.
Crafter context for the chat assistant — when our chatbot quotes a paragraph in an answer, the thesis tells it precisely what the quoted passage is making the case for, so the framing prose around the quote stays accurate.
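
A minimal sketch of how thesis and question rows might share one sidecar follows. Only the is_thesis flag and paragraph_id come from the description above; the SidecarRow type, the thesis_for helper, and the shortened thesis text are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SidecarRow:
    paragraph_id: str
    text: str
    is_thesis: int  # 1 for the doctrinal thesis, 0 for a hypothetical question


def thesis_for(paragraph_id: str, rows: list[SidecarRow]) -> Optional[str]:
    """Return the thesis for a paragraph, e.g. to show beside a search result."""
    for row in rows:
        if row.paragraph_id == paragraph_id and row.is_thesis == 1:
            return row.text
    return None


rows = [
    SidecarRow("iqan-248", "Why does prejudice keep us from seeing the truth?", 0),
    SidecarRow("iqan-248", "The true seeker must first purify the heart...", 1),
]
summary = thesis_for("iqan-248", rows)
```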

How layers are generated: tiered models

Not every paragraph deserves the same effort. The Iqán's treatment of epistemology demands genuine philosophical reasoning to capture; an academic paper's introductory matter does not. We tier the corpus and route each tier to an appropriate generator:

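Tier	Source	Generator
1	Shoghi Effendi (authoritative interpreter)	Anthropic Claude Sonnet 4.6, via the Messages Batches API
2	True compilations (Lights of Guidance, etc.)	Anthropic Claude Sonnet 4.6, via the Messages Batches API
3	'Abdu'l-Bahá	Anthropic Claude Sonnet 4.6, via the Messages Batches API
4	Bahá'u'lláh	Anthropic Claude Sonnet 4.6, via the Messages Batches API
5	The Báb	Anthropic Claude Sonnet 4.6, via the Messages Batches API
6	Esslemont: Bahá'u'lláh and the New Era	Anthropic Claude Sonnet 4.6, via the Messages Batches API
7	Nabíl: The Dawn-Breakers	Anthropic Claude Sonnet 4.6, via the Messages Batches API
8	Other religions' doctrinal works (Quran, Gospels, Tanakh, Pali Canon, Vedas, Guru Granth Sahib, etc.)	Local Qwen3-32B (single-paragraph mode)
9	Everything else (secondary scholarship, administrative letters, etc.)	Local Qwen3-32B (single-paragraph mode)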

Tier classification happens at ingestion time: when a new document lands in the library, the classifier inspects its author and title and routes future paragraphs accordingly. Adding a new Shoghi Effendi document will route it through Sonnet on the next enrichment tick — automatically.
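
The routing logic might look roughly like the sketch below. The lookup tables, function names, and generator labels are hypothetical, and a real classifier would need broader matching than exact author and title strings.

```python
# Hypothetical lookup tables; the tier numbers follow the table above.
TITLE_TIERS = {
    "Lights of Guidance": 2,
    "Bahá'u'lláh and the New Era": 6,
    "The Dawn-Breakers": 7,
}
AUTHOR_TIERS = {
    "Shoghi Effendi": 1,
    "'Abdu'l-Bahá": 3,
    "Bahá'u'lláh": 4,
    "The Báb": 5,
}
OTHER_SCRIPTURE_TITLES = {"Quran", "Gospels", "Tanakh", "Pali Canon", "Vedas"}


def classify_tier(author: str, title: str) -> int:
    """Assign a tier from author and title; unmatched documents fall to tier 9."""
    if title in TITLE_TIERS:
        return TITLE_TIERS[title]
    if author in AUTHOR_TIERS:
        return AUTHOR_TIERS[author]
    if title in OTHER_SCRIPTURE_TITLES:
        return 8
    return 9


def generator_for(tier: int) -> str:
    """Tiers 1-7 go to the hosted model, tiers 8-9 to the local model."""
    return "claude-sonnet-batch" if tier <= 7 else "local-qwen3-32b"


tier = classify_tier("Shoghi Effendi", "A newly ingested letter")
generator = generator_for(tier)
```

Titles are checked before authors so that a named work such as The Dawn-Breakers lands in its own tier regardless of how the author field is spelled.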

The generation prompt

For tier 1-7 paragraphs (the highest-value Bahá'í primary doctrinal works), each paragraph is sent to Sonnet with a 5-paragraph context window (the target paragraph plus two before and two after) and the following prompt structure:

System prompt:

You are generating hypothetical questions and a doctrinal thesis for a passage from sacred or scholarly literature.

Document: "<title>" by <author>
Tradition: <religion> / <collection>
About: <description>

You will see 5 paragraphs. The TARGET is [P3]. Surrounding paragraphs are CONTEXT (for resolving pronouns and references in the TARGET) — do NOT generate questions about them.

<context>
[P1] <paragraph N-2>
[P2] <paragraph N-1>
[P3] (TARGET) <paragraph N>
[P4] <paragraph N+1>
[P5] <paragraph N+2>
</context>

User prompt:

For [P3] (the TARGET paragraph), produce TWO things:

PART 1 — A single-sentence DOCTRINAL THESIS stating what this paragraph actually teaches as a proposition (not a question). Specific to this paragraph's actual claim — not a generic restatement. 25-50 words.

PART 2 — Exactly 5 hypothetical questions covering these 5 registers (one each):
  1. Conversational — how a thoughtful friend would ask, casual register, 8-15 words
  2. Topical concept — academic framing of the central idea, 8-15 words
  3. Philosophical implication — the doctrinal stake or what follows from this teaching
  4. Cross-tradition / connection — broader debates, traditions, or fields this passage speaks to
  5. Distinctive phrase — a striking phrase from the passage someone might search literally

Use ONLY content from the TARGET paragraph. Surrounding paragraphs are for pronoun resolution only.

Output format (exactly):
THESIS: <thesis sentence>
Q1: <conversational>
Q2: <topical>
Q3: <philosophical>
Q4: <cross-tradition>
Q5: <distinctive phrase>

Nothing else.

The 5-paragraph context window resolves anaphora ("He" → "the true seeker") without diluting the model's attention. The five-register split forces the model to think about the passage from multiple search-realistic angles rather than producing five variations of one question. The THESIS:/Q1:...Q5: output schema is parsed deterministically into the database.
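
Both ends of this step can be sketched briefly: assembling the context block and parsing the reply. The function names are hypothetical, and leaving a slot empty when the target sits near the start or end of a document is an assumption, since the edge handling is not described here.

```python
import re


def context_window(paragraphs: list[str], n: int) -> str:
    """Format paragraph n (0-based) with two neighbours on each side.

    Slots that fall outside the document are left empty (an assumption).
    """
    lines = []
    for label, idx in enumerate(range(n - 2, n + 3), start=1):
        text = paragraphs[idx] if 0 <= idx < len(paragraphs) else ""
        marker = " (TARGET)" if idx == n else ""
        lines.append(f"[P{label}]{marker} {text}".rstrip())
    return "<context>\n" + "\n".join(lines) + "\n</context>"


FIELD_RE = re.compile(r"^(THESIS|Q[1-5]):\s*(.+)$")
EXPECTED_FIELDS = ["THESIS", "Q1", "Q2", "Q3", "Q4", "Q5"]


def parse_generation(output: str) -> dict[str, str]:
    """Parse the THESIS:/Q1:...Q5: reply; raise if any field is missing."""
    fields: dict[str, str] = {}
    for line in output.strip().splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    missing = [key for key in EXPECTED_FIELDS if key not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return fields


block = context_window(["a", "b", "c", "d", "e"], 2)
```

Rejecting a reply with a missing field, rather than storing a partial record, keeps malformed generations out of the sidecar so they can be retried.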

Coming layers

The architecture is deliberately additive. Each new layer adds a sidecar without touching existing ones. Three layers we plan to ship next:

Quoted authors

Every quotation embedded inside a paragraph is itself a citation chain — Bahá'u'lláh quoting the Qur'án; Shoghi Effendi quoting 'Abdu'l-Bahá; Esslemont quoting Bahá'u'lláh. We'll extract the quoted text, attribute it, and store it in a quoted_authors sidecar. Searches like "Bahá'u'lláh quoting the Imáms" or "Shoghi Effendi citing the Master" become first-class queries — currently they're nearly impossible to express.

Definitional statements

Many paragraphs contain a precise definition: "This is what is meant by...", "By X, We mean...". These are the canonical terminological loci of a tradition. A definitions sidecar will index those statements separately, keyed by the term being defined. A query for "definition of pious fear" resolves directly to where the term is defined, bypassing the dozens of paragraphs that merely use the term.
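
Since this layer is not built yet, the following is only a speculative sketch of how such statements might be detected. The two regular expressions mirror the phrasings quoted above; the function name, row shape, and sample sentences are made up for illustration, and a real extractor would likely need a model rather than patterns.

```python
import re

# Hypothetical patterns for the two phrasings quoted above.
DEFINITION_PATTERNS = [
    re.compile(r"\bBy (?P<term>[^,]+), (?:We|we) mean\b"),
    re.compile(r"\bThis is what is meant by (?P<term>[^.;:]+)"),
]


def extract_defined_terms(paragraph_id: str, text: str) -> list[tuple[str, str]]:
    """Return (term, paragraph_id) rows for each definitional phrase found."""
    rows = []
    for pattern in DEFINITION_PATTERNS:
        for match in pattern.finditer(text):
            term = match.group("term").strip().strip('"').lower()
            rows.append((term, paragraph_id))
    return rows


# Made-up sample text, not a quotation from any work in the corpus.
sample = "By steadfastness, we mean holding to a course. This is what is meant by detachment."
definition_rows = extract_defined_terms("p-1", sample)
```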

Graph entities

People, places, events, and concepts are already extracted at the document level. The next step is connecting them across documents and indexing those connections — so a query for "Mullá Ḥusayn's role in the Letters of the Living" doesn't just match paragraphs that mention him by name, but traces the relationships through the corpus. The entities sidecar holds the per-paragraph entity mentions; a separate graph layer holds the relationships between them.

Why this is principled

The layered architecture gives us three properties we want:

Decoupled. Each layer can be regenerated independently. Discovering that the question generator was producing weak output on the Iqán doesn't require re-indexing other layers; we regenerate just that one.
Interpretable. When a search result surprises us, we can see which layer matched it. Was this paragraph found because its text matched? Its disambiguation? One of its hypothetical questions? Its thesis? The merge log shows it.
Tiered effort. The most authoritative texts get the most expensive generators. Secondary literature gets cheaper ones. The architecture doesn't assume a single quality bar across millions of paragraphs.
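
The interpretability property can be illustrated with a small helper that reports which layers returned a given paragraph and at what rank. The function name, layer labels, and paragraph ids are hypothetical; the actual merge log format is not described here.

```python
def explain_match(paragraph_id: str, rankings: dict[str, list[str]]) -> dict[str, int]:
    """Return {layer name: 1-based rank} for each layer that found the paragraph."""
    return {
        layer: ranked_ids.index(paragraph_id) + 1
        for layer, ranked_ids in rankings.items()
        if paragraph_id in ranked_ids
    }


# Hypothetical per-layer results for one query.
layer_results = {
    "main": ["p-12", "p-248"],
    "hype_questions": ["p-248"],
    "thesis": ["p-9"],
}
matched_layers = explain_match("p-248", layer_results)
```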

Each new layer that ships is a permanent improvement. The architecture is the deliverable, not any single layer.