Search & Research Strategy
How SifterSearch makes 5.0 million passages — across 8 languages and 11 traditions — searchable with the nuance religious texts deserve
The Scale of the Problem
SifterSearch — the search and research intelligence behind OceanLibrary.com — indexes 46,552 religious and philosophical texts across three faith traditions: 39,750 Bahá'í documents, 388 Islamic texts (most in Arabic), and a growing collection of Buddhist and other writings. Across these traditions, the library contains roughly 5.0 million individual passages in 8 languages — each a window into centuries of human spiritual inquiry.
Making those passages genuinely searchable is harder than it sounds. Standard search technology was built for the web: short documents, contemporary language, authors who write to be found. Religious texts are the opposite. They are dense with allusion, written across millennia in translated language, and full of references that only make sense in context. A passage that uses "He" might mean God, a prophet, a historical figure, or an archetypal spiritual seeker — and the difference matters enormously to anyone doing serious study.
This document describes the research strategy SifterSearch uses to close that gap: a layered pipeline of AI-assisted enrichment that runs once per passage, at low cost, so that every search — no matter how it is phrased, or what language it is phrased in — can find what it is actually looking for.
Figure 1. The SifterSearch enhancement pipeline. Each passage passes through entity extraction, contextual disambiguation, and hypothetical question generation before entering the enriched search index.
Problem 1: Context Collapses When You Index at Passage Level
The Problem. Search systems work by breaking documents into small, independently retrievable passages. This is necessary — returning a 400-page book in response to every query is not useful. But it creates a fundamental problem: references that are perfectly clear inside a document become opaque when the passage is read alone.
"This teaching" — which teaching? "He declared" — who? "The station described above" — described where? In mystical texts, this problem is especially acute. The Seven Valleys of Bahá'u'lláh uses phrases like "this valley," "the wayfarer," and "that celestial city" throughout — phrases whose meaning builds across the work. A passage pulled out of sequence looks like poetry without a key. Search returns it; the researcher stares at it; the connection they needed is never made.
The Solution. For every passage in the library, an AI model reads the surrounding twenty or so paragraphs and generates a brief disambiguation note. This note resolves ambiguous references so they can be indexed alongside the passage text. The disambiguation handles not just pronouns but conceptual references: "This station" becomes "the station of absolute unity described in the Valley of True Poverty in the Seven Valleys." "The ancient covenant" becomes "the covenant between God and humanity renewed in each prophetic cycle, here referring specifically to the covenant of Bahá'u'lláh." These are not paraphrases — they are grounding notes that make the passage findable by someone who does not yet know the terminology.
Critically, the disambiguation is grounded only in the document text, never in general knowledge. This matters for specialized religious literature where general knowledge can mislead. The word "Covenant" means something very specific in Baha'i writings that a model trained on general internet text might conflate with Jewish or Christian usage. The constraint that disambiguation must cite only what the surrounding text actually says keeps the enrichment honest.
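The windowing and grounding constraint described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual prompt: the function names, the `radius` parameter, and the prompt wording are all assumptions made for the example. A radius of 10 gives about twenty surrounding paragraphs.

```python
from typing import List


def context_window(paragraphs: List[str], index: int, radius: int = 10) -> List[str]:
    """Return the target paragraph plus up to `radius` paragraphs on each side."""
    start = max(0, index - radius)
    end = min(len(paragraphs), index + radius + 1)
    return paragraphs[start:end]


def build_disambiguation_prompt(paragraphs: List[str], index: int, radius: int = 10) -> str:
    """Assemble a prompt that restricts the model to the surrounding document text."""
    context = "\n\n".join(context_window(paragraphs, index, radius))
    # Hypothetical wording: the point is the "context only" constraint.
    return (
        "Resolve ambiguous references in the TARGET passage.\n"
        "Use ONLY the CONTEXT below. Do not rely on outside knowledge.\n"
        "If the context does not settle a reference, leave it unresolved.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"TARGET:\n{paragraphs[index]}\n"
    )
```

The window is clipped at the start and end of a document, so the first and last passages simply receive less context.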
The Result. A researcher searching for "the stages of the mystical journey in Baha'i writings" surfaces passages from the Seven Valleys that, in their raw form, contain no such phrase — but whose disambiguated index entry does.
Before: He arrived at the city before dawn.
After: He [Mullá Husayn-i-Bushrú'í] arrived at the city [Shíráz, Iran] before dawn [May 23rd, 1844].
Figure 2. A single sentence becomes fully searchable. Someone researching "Mullá Husayn" or "Shíráz" or "1844" will now find this passage — impossible with the original text alone.
Problem 2: Users Speak Modern English; Texts Speak Ancient Arabic
The Problem. Users ask questions in everyday language. The texts answer in formal, archaic, or translated language that may have been rendered into English from Arabic, Persian, Sanskrit, or Hebrew centuries ago. A researcher who asks "What does the Baha'i Faith say about fairness?" may not think to search for 'adl (justice) or insaf (equity) — but those are the terms the texts use. Someone curious about "finding your purpose" may not know to search for "station" or "vocation" or "the purpose of creation."
The vocabulary mismatch between how people ask and how texts are written means that a great deal of relevant material is simply never surfaced. Standard keyword search misses it. Even semantic vector search — which understands concepts rather than just words — has limits when the conceptual vocabulary of the domain is so specialized.
The Solution. For each passage, the AI generates several questions that the passage would naturally answer. These questions are stored alongside the text and become part of what gets searched.
A passage about Ibn Khaldun's concept of 'asabiyyah might generate questions like: "What is Ibn Khaldun's theory of group solidarity?", "How does social cohesion lead to the rise and fall of civilizations?", or "What did medieval Islamic historians say about tribal bonds?"
When a user later searches for any of those questions, or something close to them, the passage is retrieved even if none of the query words appear in the original text. The domain gap is closed at indexing time rather than at search time. The approach is called Hypothetical Question Pre-Indexing, or HyPE. It bridges the gap between how people ask and how texts are written, and it adds no AI cost per search, because the questions are generated once and stored.
Beyond standard HyPE. Traditional HyPE generates factual retrieval questions — "what does this passage say about X?" SifterSearch extends this with two additional question types that are critical for philosophical and religious texts:
- Definitional questions — "What is the meaning of 'asabiyyah?" or "How does Bahá'u'lláh define justice?" Many passages contain implicit definitions of specialized terms without ever using the word "definition." Generating these questions makes every passage that defines or explains a concept findable by anyone asking "what does X mean?"
- Philosophical implication questions — "What are the implications of progressive revelation for religious authority?" or "How does this concept of the soul challenge materialism?" Religious texts are studied not just for what they say but for what they imply. A passage about the unity of the prophets has implications for interfaith dialogue, religious exclusivism, and theological pluralism — implications that a researcher might search for but that the passage never names explicitly. Generating questions about these implications makes the deep connections in philosophical texts discoverable.
These two additions — definitions and implications — transform HyPE from a vocabulary bridging technique into a genuine scholarly research tool. A scholar studying the concept of covenant across traditions can now find every passage that defines, explains, or carries implications for that concept, even when the word "covenant" never appears.
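One way to picture the stored record is sketched below. The class, field names, and toy matcher are assumptions for illustration, and the passage text is a made-up placeholder rather than a quotation. The generated questions come from the examples above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The three question types described above.
QUESTION_TYPES = ("factual", "definitional", "implication")


@dataclass
class EnrichedPassage:
    passage_id: str
    text: str
    questions: Dict[str, List[str]] = field(default_factory=dict)

    def searchable_text(self) -> str:
        """Concatenate the passage with its generated questions for indexing."""
        parts = [self.text]
        for qtype in QUESTION_TYPES:
            parts.extend(self.questions.get(qtype, []))
        return "\n".join(parts)


def keyword_match(query: str, passage: EnrichedPassage) -> bool:
    """Toy matcher: true when every query word occurs in the indexed text."""
    indexed = passage.searchable_text().lower()
    return all(word in indexed for word in query.lower().split())


passage = EnrichedPassage(
    passage_id="example-1",
    text="(placeholder passage discussing 'asabiyyah and the fate of dynasties)",
    questions={
        "factual": ["What is Ibn Khaldun's theory of group solidarity?"],
        "definitional": ["What is the meaning of 'asabiyyah?"],
    },
)
```

With the questions attached, a query for "group solidarity" matches this record even though neither word occurs in the passage text itself.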
The Result. Everyday questions like "What does Islam teach about gratitude?" now surface relevant passages from classical Arabic scholarship that would be invisible to any keyword search, and nearly invisible to users who don't know the technical terminology.
Figure 3. HyPE bridges the vocabulary gap between everyday query language and formal or archaic text. Generated questions are indexed once at processing time, adding zero cost per search.
Problem 3: The Arabic Library Is Invisible to English Searchers
The Problem. Of SifterSearch's 46,552 documents, 388 are Islamic texts — most of them in Arabic. This is a rich body of classical scholarship: hadith collections, tafsir (Quranic commentary), Sufi treatises, works of Islamic jurisprudence and philosophy. For an English-speaking researcher, these texts might as well not exist. Keyword search finds nothing. Even typing Arabic transliterations rarely works without knowing the precise scholarly convention.
The same barrier runs in reverse. An Arabic-speaking researcher exploring Baha'i theology or Buddhist philosophy faces tens of thousands of documents in languages they may not read. The library is multilingual; the search, until now, has not been.
The Solution. SifterSearch's semantic search uses 3,072-dimensional vector embeddings to represent the meaning of each passage rather than its words. The embedding model is multilingual, so the concept of divine unity lands in nearly the same region of vector space whether it appears as "the oneness of God" in English, "tawhid" (التوحيد) in Arabic, or "yegānegi" in Persian. A search in any language retrieves passages in all languages that share conceptual proximity.
Search "divine unity" in English and surface Arabic passages about التوحيد (tawhid). Search "detachment from the world" and find Persian Sufi poetry about زهد (zuhd) alongside English Baha'i texts about the same theme. The 3,072-dimensional embeddings capture meaning regardless of the language it is expressed in — making the full 8-language library searchable from a single query.
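The retrieval step itself is a nearest-neighbor search by cosine similarity, which never looks at the language of the text. The sketch below uses invented 3-dimensional vectors as stand-ins for real 3,072-dimensional embeddings. The passage ids and numbers are hypothetical and only show the mechanics.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query_vec: Sequence[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k passage ids closest to the query vector, best first."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Invented toy vectors; a real index would hold model-produced embeddings.
toy_index = {
    "en:oneness-of-god": [0.90, 0.10, 0.00],
    "ar:tawhid": [0.85, 0.15, 0.05],
    "en:detachment": [0.10, 0.90, 0.20],
}
query_vec = [0.88, 0.12, 0.02]  # stand-in for the embedding of "divine unity"
top_two = nearest(query_vec, toy_index)
```

In this toy setup the English and Arabic passages about divine unity rank together ahead of the unrelated passage, because ranking depends only on vector proximity.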
This is especially powerful for Islamic scholarship, where the Arabic originals and their English translations coexist in the library. A researcher can search in English, encounter the Arabic original, follow through to the translation, and move fluidly between both — without ever needing to know that a language boundary exists. The HyPE questions for Arabic passages are generated in English as well as Arabic, further strengthening the cross-language bridge at the keyword layer.
The Result. A student of comparative religion can explore how concepts like "the covenant," "the station of the soul," or "the nature of divine revelation" are treated across Baha'i, Islamic, and Buddhist traditions — in a single search, across eight languages — without specialist knowledge of any of those languages.
Problem 4: Entity References Are Tradition-Specific
The Problem. Before any disambiguation can happen at the passage level, the system needs to know who and what the document is about. "The Master" means 'Abdu'l-Bahá in Baha'i writings, but in a Sufi text it might refer to a spiritual teacher. "The Friend" is a recurring epithet for God in certain Persian poetry. "The Promised One" carries entirely different meanings in different traditions. A disambiguation engine that doesn't already know the cast of characters in a document will make mistakes.
The Solution. Before disambiguation begins on any document, the system runs a separate entity extraction pass over the full text. This produces a structured glossary of key entities: people mentioned by name or title, organizations and institutions, theological concepts that carry specific meaning in this tradition, and time periods. This glossary becomes the grounding for the disambiguation step — it is passed as context so the model already knows, when it encounters "the Master," which Master this document means. Entity extraction runs using a larger AI model (in the 32-billion-parameter range) to ensure accuracy on this foundational step, once per document rather than once per passage.
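A document-level glossary of this kind might look like the sketch below. The class and field names are assumptions for illustration and the real schema is not described here. The one entry shown uses the "the Master" example from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntityGlossary:
    """Document-level glossary passed as context to each disambiguation call."""
    document_title: str
    people: Dict[str, str] = field(default_factory=dict)    # title or epithet -> canonical name
    organizations: List[str] = field(default_factory=list)
    concepts: Dict[str, str] = field(default_factory=dict)  # term -> meaning in this tradition
    time_periods: List[str] = field(default_factory=list)

    def as_prompt_context(self) -> str:
        """Render the glossary as plain text for inclusion in a prompt."""
        lines = [f"Glossary for: {self.document_title}"]
        lines += [f'- "{alias}" refers to {name}' for alias, name in self.people.items()]
        lines += [f"- organization: {org}" for org in self.organizations]
        lines += [f'- "{term}" means {meaning}' for term, meaning in self.concepts.items()]
        lines += [f"- period: {period}" for period in self.time_periods]
        return "\n".join(lines)


glossary = EntityGlossary(
    document_title="(hypothetical Baha'i document)",
    people={"the Master": "'Abdu'l-Bahá"},
)
```

Because the glossary is built once per document, the same rendered text can be reused for every passage in that document.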
The Result. Disambiguation at the passage level is accurate because it starts from a document-level map of entities rather than general world knowledge. The enrichment is honest about what the text actually says.
Problem 5: Processing 5.0M Passages Costs Too Much
The Problem. Processing 5.0 million passages requires millions of AI API calls. Without optimization, the cost for disambiguation alone (using a model like Claude Sonnet) would be roughly $25,000 — and adding HyPE generation pushes the total past $42,000. This puts comprehensive enhancement out of reach for most projects.
The Solution. Two techniques combine to make this tractable: prompt caching and priority-based processing.
Prompt caching. When the same context appears at the start of multiple consecutive requests, the AI provider can cache it — billing cached tokens at roughly 10% of normal cost. When processing passages from the same document sequentially, the system prompt (document metadata, entity glossary, ~20 surrounding paragraphs) remains nearly identical between calls — only a few paragraphs slide in and out of the window. This achieves ~90% cache hit rates on input tokens, reducing input costs dramatically.
Output tokens (the disambiguation and question text) cannot be cached, so the overall savings depend on the ratio of input to output. For disambiguation (large input context, short output), caching saves roughly 70% of total cost. For HyPE (shorter input, more output), savings are closer to 50%. Across the full pipeline, prompt caching reduces costs by approximately 60–65%.
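The figures above can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the HyPE share is taken as the $17,000 difference between the two quoted totals (the text says the total is "past" $42,000, so this is a lower bound), and any surcharge a provider applies when writing to the cache is ignored.

```python
def input_cost_multiplier(hit_rate: float, cached_price_ratio: float) -> float:
    """Share of the uncached input bill still paid once caching is on."""
    return hit_rate * cached_price_ratio + (1.0 - hit_rate)


def blended_savings(stages):
    """stages: list of (uncached_cost, fractional_saving) pairs."""
    total = sum(cost for cost, _ in stages)
    saved = sum(cost * saving for cost, saving in stages)
    return saved / total


# 90% of input tokens hit the cache and are billed at 10% of the normal price.
input_multiplier = input_cost_multiplier(hit_rate=0.90, cached_price_ratio=0.10)

disambiguation_cost = 25_000          # quoted estimate, USD
hype_cost = 42_000 - 25_000           # assumed HyPE share, USD
overall = blended_savings([(disambiguation_cost, 0.70), (hype_cost, 0.50)])
remaining = 42_000 * (1 - overall)    # estimated bill after caching, USD
```

Under these assumptions input tokens cost about 19% of their uncached price, the blended saving is about 62%, and the remaining bill is about $16,000, which sits inside the 60–65% range quoted above.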
Priority-based processing. Not all 5.0 million passages need enhancement simultaneously. SifterSearch processes documents in order of authority and scholarly importance — core scripture first, then major commentary, then supporting texts. The dual-index architecture means un-enhanced texts remain fully searchable through the original index while enhancement proceeds in the background.
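A minimal sketch of the ordering rule, assuming each document carries a numeric authority tier. The tier numbering (1 for core scripture, 2 for major commentary, 3 for supporting texts) and the document ids are hypothetical.

```python
from typing import List, NamedTuple


class Doc(NamedTuple):
    doc_id: str
    tier: int       # hypothetical: 1 = core scripture, 2 = major commentary, 3 = supporting text
    enhanced: bool  # True once the document has been through the pipeline


def next_batch(docs: List[Doc], batch_size: int) -> List[Doc]:
    """Pick the highest-authority documents that have not been enhanced yet."""
    pending = [d for d in docs if not d.enhanced]
    return sorted(pending, key=lambda d: d.tier)[:batch_size]


library = [
    Doc("supporting-essay", 3, False),
    Doc("core-scripture-a", 1, True),
    Doc("major-commentary", 2, False),
    Doc("core-scripture-b", 1, False),
]
batch = next_batch(library, batch_size=2)
```

Already-enhanced documents are skipped, so the job can be stopped and resumed without reprocessing anything.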
Re-embedding with OpenAI (text-embedding-3-large) adds only ~$170 for the full library. Entity extraction and query-time classification can run on smaller, more cost-effective models than the Sonnet-class model assumed above for disambiguation.
The Result. What would cost about $42,000 naively can be done for roughly 35–40% of that through caching, and prioritization delivers the enhancement incrementally, with the most important texts improved first.
Figure 5. Prompt caching reduces costs by ~63%. Combined with priority processing (core texts first), comprehensive enhancement becomes tractable for independent projects.
Problem 6: The Index Improves Slowly or Not at All
The Problem. A naive approach to deploying any enhancement pipeline would wait until everything is processed before switching over — meaning users see no improvement for months, and the risk of a botched migration is high.
The Solution. The enhancement pipeline does not replace the existing search index — it builds a second, enriched index alongside it. The original Meilisearch index continues working unchanged, serving every search as it always has. As passages are processed through the disambiguation and question-generation pipeline, they are written into the enhanced index. The search system automatically uses the enhanced index for any passage that has been processed, falling back to the original for passages that have not yet been enriched.
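The fallback rule can be sketched per passage as below. Plain dictionaries stand in for the two search indexes, and the function name and `source` field are assumptions for the example, not the actual integration.

```python
from typing import Dict


def lookup(passage_id: str, enhanced_index: Dict[str, dict], original_index: Dict[str, dict]) -> dict:
    """Prefer the enriched record and fall back to the original one."""
    record = enhanced_index.get(passage_id)
    if record is not None:
        return {**record, "source": "enhanced"}
    return {**original_index[passage_id], "source": "original"}


original_index = {
    "p1": {"text": "He arrived at the city before dawn."},
    "p2": {"text": "(a passage not yet processed)"},
}
enhanced_index = {
    "p1": {
        "text": "He arrived at the city before dawn.",
        "note": "He = Mullá Husayn-i-Bushrú'í; the city = Shíráz, Iran",
    },
}
```

Every passage stays in the original index, so a lookup always succeeds whether or not enrichment has reached that passage yet.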
The Result. Improvements are delivered incrementally. Users see better results for processed passages immediately, without waiting for the entire library to be processed. There is no cutover, no migration risk, and no disruption to current users during the rollout.
Figure 4. The dual-index architecture runs both indexes in parallel. Every search uses the enhanced index where available, falling back to the original for passages not yet processed. Rollout is incremental with no disruption.
Search-Time Intelligence
In addition to the pre-indexed enrichment, SifterSearch applies several AI-assisted techniques at query time to further improve result quality.
Query Intent Classification
Each search query is analyzed by a local AI model to determine what kind of search will serve it best. A query that is clearly looking for a specific quote or passage ("exact phrase" intent) benefits from stronger keyword matching. A query exploring a theme or concept ("semantic" intent) benefits from stronger vector similarity. The system blends the two approaches dynamically based on this classification, rather than using a fixed ratio for every query.
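The blending step might look like the sketch below. The intent labels follow the text, while the weight values and the "mixed" fallback are invented for illustration. The classifier that produces the intent label is not shown.

```python
# Hypothetical semantic weights per intent; the real values are not given in the text.
SEMANTIC_RATIO = {"exact_phrase": 0.2, "semantic": 0.8, "mixed": 0.5}


def blended_score(keyword_score: float, vector_score: float, intent: str) -> float:
    """Weight keyword and vector scores according to the classified intent."""
    ratio = SEMANTIC_RATIO.get(intent, SEMANTIC_RATIO["mixed"])
    return (1.0 - ratio) * keyword_score + ratio * vector_score
```

A passage with a strong keyword hit and a weak vector match scores high for an "exact_phrase" query and low for a "semantic" one, which is the dynamic blend described above.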
Cross-Encoder Reranking
Initial retrieval is fast but imprecise — it returns a broad set of candidates, not a perfectly ranked list. A second AI model, called a cross-encoder, then re-evaluates each candidate by reading both the query and the passage together, scoring them as a pair rather than in isolation. This catches nuanced relevance that simpler scoring misses: a passage that uses completely different words but addresses exactly the same concept rises to the top.
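The two-stage shape can be sketched with a pluggable pair scorer. The `overlap_scorer` below is only a toy stand-in so the example runs without a model; a real cross-encoder would replace it with a learned relevance score for each (query, passage) pair. All names and candidate texts are hypothetical.

```python
from typing import Callable, List

PairScorer = Callable[[str, str], float]


def rerank(query: str, candidates: List[str], score_pair: PairScorer, top_k: int = 3) -> List[str]:
    """Score every (query, passage) pair jointly and return the best passages first."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def overlap_scorer(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


candidates = [
    "a passage about prayer",
    "a passage about justice and equity",
    "a note on equity",
]
reranked = rerank("justice and equity", candidates, overlap_scorer, top_k=2)
```

The key design point is that the scorer sees the query and the passage together, unlike the first-stage retrieval, which scores each passage against a precomputed representation.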
Document Authority Weighting
Primary scripture and authoritative texts are given a relevance boost over secondary commentary and peripheral works. A researcher looking for what "the Baha'i Faith says" about a topic should encounter Bahá'u'lláh's own words before encountering a pilgrim's second-hand recollection. The weighting is soft — commentary is not excluded — but it ensures that authoritative sources surface first when they are relevant.
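A soft boost of this kind can be expressed as a multiplier on the relevance score. The tier numbers and multiplier values below are invented for illustration.

```python
# Hypothetical multipliers: tier 1 = primary scripture, 3 = peripheral works.
AUTHORITY_BOOST = {1: 1.30, 2: 1.15, 3: 1.00}


def weighted_score(relevance: float, tier: int) -> float:
    """Apply a soft authority boost; unknown tiers get no boost."""
    return relevance * AUTHORITY_BOOST.get(tier, 1.0)
```

Because the boost multiplies rather than filters, scripture overtakes commentary of similar relevance, while clearly more relevant commentary can still rank first.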
Figure 6. The search-time pipeline. Query intent classification and cross-encoder reranking run on local models at zero API cost. AI summarization uses extraction-only prompting to keep latency and cost low.
Why This Matters for Religious Texts Specifically
The techniques described above are not unique to religious literature — they improve search quality across many domains. But religious and philosophical texts are where the gap between standard search and good search is widest, and where the consequences of getting it wrong are most visible.
These texts are dense with conceptual reference in ways that secular texts usually are not. A passage about "the Covenant" means something very different in Baha'i, Jewish, and Christian contexts — and the disambiguation approach, grounded in the document itself, resolves this rather than conflating it. Mystical texts like the Seven Valleys or Sufi poetry use metaphor as their primary mode of expression — "this station," "the path," "the valley" are not vague terms but precise references that carry the full weight of a spiritual tradition.
Religious texts also span millennia of scholarship. A researcher studying a concept may need to trace it from early scripture through medieval commentary through modern interpretation — across languages, traditions, and centuries of translation. The cross-language capability of SifterSearch's enhanced index, where conceptual connections are indexed rather than just word matches, makes that kind of research possible in ways that were not practical before. Searching "the nature of the soul" returns not just English passages but relevant Arabic, Persian, and Sanskrit material — the entire depth of humanity's reflection on the question, not just the portion that was translated into the searcher's language.
The goal throughout has been to treat these texts with the scholarly care they deserve: not to flatten their richness into keyword bags, but to index their meaning.
Open Source & Community
SifterSearch is the research engine behind OceanLibrary.com, an individual initiative developed for personal and academic research and offered freely to anyone who finds it useful. The research strategy described in this document is being implemented openly — the approach, the architecture, and the cost estimates are all shared here rather than kept proprietary.
The enhancement pipeline is actively being built. Supporters can follow progress and explore the library at OceanLibrary.com. Questions and feedback are welcome through the search interface itself — asking Sifter is the fastest way to explore what is already working and what still needs improvement.
Planned Sidecar Layers
The architecture is intentionally additive. Each search-quality improvement comes as a new sidecar: a new column on every paragraph, or a new index that joins to content.id. The four layers below are designed but not yet built; this section documents the intent so the eventual implementation matches the roadmap.
1. Graph Object — per-paragraph entity + quotation extraction
Every paragraph gets a structured JSON sidecar capturing what's in the paragraph as data, not as prose. Seven categories:
- People — proper names with role tags (speaker / subject / recorder / addressee). Canonical forms resolved (e.g., "the Guardian" → "Shoghi Effendi").
- Places & organizations — geographic and institutional references.
- Events with dates — "Declaration of the Báb (1844-05-23)", "Conference of Badasht (summer 1848)".
- Quotations with attribution chains — the verbatim quoted span, the speaker (canonicalized), the addressee, the recorder (for pilgrim notes / memoirs), the date, and the attribution strength (direct / reported / paraphrased / translated). This is the layer that makes scholarly use of secondary sources possible.
- Citations to other works — when this paragraph references the Aqdas or Qur'án 21:107, the citation is captured with locator and resolved to a corpus document when found.
- Concepts — open-vocabulary doctrinal/thematic tags. No closed taxonomy; deduplication happens at index time.
- Relations — minimal subject-predicate-object triples between entities ("'Abdu'l-Bahá visited London", "Mullá Ḥusayn was the first Letter of the Living").
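Since this layer is not yet built, the following is only one possible shape for the sidecar. The class names, field names, and key names are assumptions, and the sample values reuse examples from the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Quotation:
    text: str
    speaker: str                    # canonical name
    addressee: Optional[str] = None
    recorder: Optional[str] = None  # for pilgrim notes and memoirs
    date: Optional[str] = None
    strength: str = "reported"      # direct / reported / paraphrased / translated


@dataclass
class GraphObject:
    people: List[Dict[str, str]] = field(default_factory=list)
    places_orgs: List[str] = field(default_factory=list)
    events: List[Dict[str, str]] = field(default_factory=list)
    quotations: List[Quotation] = field(default_factory=list)
    citations: List[Dict[str, str]] = field(default_factory=list)
    concepts: List[str] = field(default_factory=list)
    relations: List[List[str]] = field(default_factory=list)  # [subject, predicate, object]


sidecar = GraphObject(
    people=[{"name": "Shoghi Effendi", "role": "subject", "surface_form": "the Guardian"}],
    events=[{"label": "Declaration of the Báb", "date": "1844-05-23"}],
    relations=[["'Abdu'l-Bahá", "visited", "London"]],
)
sidecar_json = json.dumps(asdict(sidecar), ensure_ascii=False, indent=2)
```

Serializing with `ensure_ascii=False` keeps diacritics such as "Báb" readable in the stored JSON.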
2. Unique-Quote Detection
A boolean sidecar (content.has_unique_quote) marking paragraphs that contain attributed quotations not present in our primary corpus. The scholarly value is novel material: a Shoghi Effendi statement preserved only in a pilgrim note, or an 'Abdu'l-Bahá talk recorded in a memoir but never published in Promulgation. These are research gold but currently invisible unless you happen to keyword-match the exact wording. The flag is computed once at extraction time by checking each quoted span against tier 1-4 paragraph embeddings; it flips to true only when at least one extracted quote has no near-duplicate in the canonical corpus.
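The check can be sketched as a thresholded nearest-neighbor test. The 0.92 similarity threshold and the function names are assumptions for illustration. A production version would query a vector index rather than loop over every canonical embedding.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def has_unique_quote(
    quote_vecs: List[Sequence[float]],
    canonical_vecs: List[Sequence[float]],
    threshold: float = 0.92,  # hypothetical near-duplicate cutoff
) -> bool:
    """True when at least one quoted span has no near-duplicate in the canonical set."""
    for quote in quote_vecs:
        best = max((cosine(quote, canon) for canon in canonical_vecs), default=0.0)
        if best < threshold:
            return True
    return False
```

A paragraph with no extracted quotations yields false, matching the rule that the flag flips only when an extracted quote lacks a canonical match.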
3. Speaker-Intent Search Routing
A small but important routing rule: when the user query names a primary figure as a speaker ("what did Shoghi Effendi say about X" / "find statements by 'Abdu'l-Bahá on Y"), the orchestrator bypasses the default authority-tier filter that drops secondary sources. Pilgrim notes and memoirs become first-class results for these queries and the crafter is trusted to attribute correctly per-paragraph. Queries containing the word unique additionally bias toward tier 5+ sources where novel quotations live.
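A rule-based version of this routing might look like the sketch below. The figure list, the two regular expressions, and the decision keys are assumptions for illustration. Applying the "unique" bias only to speaker queries is one reading of the rule above.

```python
import re
from typing import Dict

# Hypothetical list of primary figures; the real list is not given in the text.
PRIMARY_FIGURES = ("Shoghi Effendi", "'Abdu'l-Bahá", "Bahá'u'lláh")
_CANONICAL = {name.casefold(): name for name in PRIMARY_FIGURES}

# Two illustrative phrasings modeled on the example queries.
SPEAKER_PATTERNS = (
    re.compile(r"what did (?P<name>.+?) say\b", re.IGNORECASE),
    re.compile(r"statements by (?P<name>.+?) on\b", re.IGNORECASE),
)


def route(query: str) -> Dict[str, object]:
    """Decide whether a query names a primary figure as speaker."""
    decision: Dict[str, object] = {
        "speaker": None,
        "bypass_authority_filter": False,
        "bias_tier_5_plus": False,
    }
    for pattern in SPEAKER_PATTERNS:
        match = pattern.search(query)
        if match is None:
            continue
        speaker = _CANONICAL.get(match.group("name").strip().casefold())
        if speaker is not None:
            decision["speaker"] = speaker
            decision["bypass_authority_filter"] = True
            break
    # Assumed reading: the "unique" bias applies only on top of a speaker match.
    if decision["bypass_authority_filter"] and re.search(r"\bunique\b", query, re.IGNORECASE):
        decision["bias_tier_5_plus"] = True
    return decision
```

Queries that do not name a known figure fall through unchanged, so the default authority-tier filter still applies to them.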
4. Conversation Memory Layer
The orchestrator currently treats each turn as fresh. Multi-step research workflows require persistent session context, for example: "study chapter 2 of Dawn-Breakers, list the people mentioned" → "now do the same for documents mentioned in those chapters" → "for each person, give me a research summary". The chat_sessions table already exists from earlier work; the missing piece is the pipeline reading from it on each turn so referents like "those chapters" or "each person" resolve to concrete prior context rather than re-prompting the user.
These four layers compose: speaker-intent routing benefits from quotation-attribution data in the graph object; unique-quote detection is a derivative of the graph object's quotations field; memory ties everything together for serious research use.
Last updated: April 2026