Entity-Aware Search

How named entity extraction makes people, places, and events searchable — across spelling variants, titles, and the full sweep of a long work

The Problem Keyword Search Cannot Solve

Religious and historical texts are full of figures who appear under many names. The first person to believe in the Báb is called "Mullá Ḥusayn" in some passages, "the first Letter of the Living" in others, "Bábul-Báb" in still others, and simply "he" in dozens more where context supplies the referent. A keyword search for any one of these finds a fraction of the record. A search for all of them requires knowing every variant in advance — and even then misses the pronoun references entirely.

The same problem applies to roles. A researcher who wants to know what Ṭáhirih said is asking a different question from one who wants to know what was said about her. Keyword search cannot distinguish the two: both return every paragraph where her name appears. The distinction between speaker, subject, and addressee requires structured data that keyword indexes do not carry.

And across a long work like the Dawn-Breakers — 900 pages, 200+ named figures — tracking a minor character from their first appearance to their last requires either reading the whole book or accepting that the search will miss chapters where their name is spelled differently, abbreviated, or replaced by a title.

The Solution: An Entity Mentions Sidecar

SifterSearch runs each passage through an extraction pipeline that identifies named entities — people, places, organizations, events — and resolves them to canonical identities. Every resolved mention is stored in a sidecar index (entity_mentions_idx) alongside the canonical entity ID, the paragraph it appears in, the role the entity plays in that paragraph (speaker, subject, addressee, narrator), and authority metadata inherited from the source document.
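To make the shape of a sidecar record concrete, here is a minimal sketch of what one resolved mention might look like. The field names, the paragraph ID format, and the tier name "secondary_history" are assumptions for illustration; only the four roles and the entity ID are taken from this article.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class Role(str, Enum):
    """The four roles named in the article."""
    SPEAKER = "speaker"
    SUBJECT = "subject"
    ADDRESSEE = "addressee"
    NARRATOR = "narrator"


@dataclass(frozen=True)
class EntityMention:
    entity_id: int        # canonical entity the mention resolved to
    paragraph_id: str     # paragraph containing the mention (hypothetical ID format)
    surface_form: str     # the text as written in the passage
    role: Role            # role the entity plays in this paragraph
    authority_tier: str   # inherited from the source document (hypothetical value below)


mention = EntityMention(
    entity_id=1222917,
    paragraph_id="dawn-breakers:p0412",
    surface_form="the first Letter of the Living",
    role=Role.SUBJECT,
    authority_tier="secondary_history",
)

# A plain dict is the form a search index would typically ingest.
record = asdict(mention)
```

Keeping the surface form on the record alongside the canonical ID lets a result show the reader how the passage actually names the figure.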

This means that "Mullá Ḥusayn," "the first Letter of the Living," "Bábul-Báb," and the pronoun "he" in a passage whose context makes the referent clear all map to a single entity ID. A search that resolves to that entity finds all mentions — every paragraph in the library, across all documents, regardless of how the name was rendered.

Keyword search: "Mullá Ḥusayn"

Finds paragraphs containing that exact transliteration.

Misses title variants: "first Letter of the Living", "Bábul-Báb".

Misses diacritic variants: "Mulla Husain".

Misses pronoun references: "he" (when the referent is contextual).

Misses non-English passages: Arabic/Persian passages using the original form.

Entity search: Mullá Ḥusayn (entity #1222917)

Finds every paragraph where the extraction pipeline resolved any mention to this entity — regardless of how the name was written, which title was used, or which language the passage is in.

All name variants matched

Title references matched

Contextual pronouns matched

Role metadata available (speaker / subject)

Entity resolution collapses all surface forms of a name into one canonical identity, then retrieves every mention in the library against that identity.

Four Capabilities This Unlocks

1. Complete coverage of a figure across a long work

The Dawn-Breakers alone contains 2,052 resolved entity mentions. Before entity extraction, a researcher tracking a figure through the narrative had to rely on the index in the back of the book — if one existed, if it was complete, and if the spelling they searched matched the transliteration in the index. Now a single query retrieves every mention, in sequence, with the surrounding passage text.

2. Role-filtered search

Each mention is tagged with the role the entity plays: speaker (this entity said these words), subject (this entity is being described), addressee (these words were directed to this entity), or narrator (this entity is recounting something). This makes it possible to ask "what did X say?" rather than "where does X appear?" — a distinction that is impossible to make with keyword search.
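A role-filtered query can be pictured as a filter expression over the mention records. The sketch below builds such an expression in the `attribute = value AND ...` form that Meilisearch filters use. The attribute names `entity_id`, `role`, and `document` are assumptions, not confirmed field names, and in Meilisearch they would need to be declared as filterable attributes on the index.

```python
from typing import Optional


def mention_filter(entity_id: int,
                   role: Optional[str] = None,
                   document: Optional[str] = None) -> str:
    """Build a filter expression for the mentions index.

    Attribute names are hypothetical. Values are assumed not to contain
    double quotes; a production version would escape them.
    """
    clauses = [f"entity_id = {entity_id}"]
    if role is not None:
        clauses.append(f'role = "{role}"')
    if document is not None:
        clauses.append(f'document = "{document}"')
    return " AND ".join(clauses)


# "What did X say?" versus "where does X appear?"
said_by = mention_filter(1222917, role="speaker")
appears = mention_filter(1222917)
```

The only difference between the two questions is one extra clause, which is why the role has to be stored on every mention at index time.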

3. Authority-weighted retrieval

Every mention inherits the authority tier of its source document. A statement attributed to the Báb in the Dawn-Breakers (a secondary historical source) is ranked differently from the same statement found in primary scripture. Researchers studying what a figure actually wrote versus what was reported about them can filter accordingly.
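One simple way to picture authority weighting is a sort that puts higher tiers first and breaks ties by relevance. This is a sketch under stated assumptions, not the ranking SifterSearch actually uses: the tier names "primary_scripture" and "revealed" appear in this article, while "secondary_history", the relative order of the tiers, the scores, and the paragraph IDs are illustrative.

```python
# Hypothetical tier ordering: a lower rank sorts first.
TIER_RANK = {"primary_scripture": 0, "revealed": 1, "secondary_history": 2}


def rank_by_authority(hits):
    """Sort hits by authority tier, then by descending relevance score."""
    unknown = len(TIER_RANK)  # unrecognized tiers sort last
    return sorted(
        hits,
        key=lambda h: (TIER_RANK.get(h["authority_tier"], unknown), -h["score"]),
    )


def only_tiers(hits, tiers):
    """Keep only hits whose source document belongs to one of the given tiers."""
    allowed = set(tiers)
    return [h for h in hits if h["authority_tier"] in allowed]


hits = [
    {"paragraph_id": "db-0412", "authority_tier": "secondary_history", "score": 0.91},
    {"paragraph_id": "sw-0031", "authority_tier": "primary_scripture", "score": 0.74},
    {"paragraph_id": "sw-0107", "authority_tier": "primary_scripture", "score": 0.88},
]

ranked = rank_by_authority(hits)
primary_only = only_tiers(hits, ["primary_scripture", "revealed"])
```

A strict tier-first sort is the bluntest option; a weighted blend of tier and relevance would be a softer alternative.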

4. Cross-document entity continuity

The same entity ID appears across all indexed documents. A query for Mullá Ḥusayn returns his appearances in the Dawn-Breakers, in God Passes By, in pilgrim notes, and in Memorials of the Faithful — without the researcher needing to know which documents to search or how the name was rendered in each.

Example Queries

Each link below demonstrates a question the entity-aware search handles that keyword search cannot answer reliably. Try them — the difference is most visible on questions about minor figures, variant names, or specific speech acts.

Variant resolution

What is known about the first Letter of the Living?

"First Letter of the Living," "Mullá Ḥusayn," and "Bábul-Báb" all resolve to the same entity. Results include passages that use none of these phrases but whose contextual referent the extraction pipeline identified.

Variant resolution

Find all mentions of Ṭáhirih in the Dawn-Breakers

Catches "Qurratu'l-'Ayn," "the poetess of Qazvín," "the fair and immortal heroine," "she," and every other surface form resolved to this entity across all 900 pages.

Role-filtered

What did Ṭáhirih say at the conference of Badasht?

Filters for role=speaker on Ṭáhirih's entity, returning only the paragraphs where she speaks — not the many more where she is discussed.

Role-filtered

Compile direct statements by the Báb in the Dawn-Breakers

Filters for role=speaker on the Báb's entity, scoped to the Dawn-Breakers. This separates the Báb's own words from Nabíl's descriptions of him, a distinction that matters deeply for anyone doing doctrinal research.

Character tracking

What happened to Quddús after the siege of Shaykh Ṭabarsí?

Tracks a figure through later chapters of a long work — finding mentions in passages where the name may not appear but the entity was resolved from context.

Character tracking

Trace Mullá Ḥusayn from his first meeting with the Báb to his martyrdom

Returns mentions in narrative order across the full arc of the Dawn-Breakers — spanning passages that use the canonical name, his titles, and contextual pronouns.

Cross-document

How does Shoghi Effendi describe Mullá Ḥusayn compared to Nabíl?

The same entity ID appears in God Passes By (Shoghi Effendi) and the Dawn-Breakers (Nabíl). Authority tier metadata allows the response to attribute each passage to its source and rank them accordingly.

Cross-document

Find every primary-source reference to the Declaration of the Báb

The Declaration (May 23, 1844) is an event entity. Restricting results to the authority tiers primary_scripture and revealed returns only passages from authoritative sources and excludes secondary histories, however accurate.

Assembly

Who was present at the Declaration of the Báb on May 23, 1844?

Filtering entity mentions to the paragraphs where that event entity appears surfaces everyone present that night, including figures mentioned only once, in passing, whom a keyword search for their names would be unlikely to find.

Assembly

List every Letter of the Living and what is known about each

The eighteen Letters of the Living are an entity set. Each has their own entity ID with associated mentions across the library. This query assembles a structured answer that would take hours to compile manually.

How It Works

The extraction pipeline runs once per passage, offline, before the passage enters the search index. A language model reads each paragraph with surrounding context and identifies named entities, their canonical forms, and their roles. The results are stored in the entity_mentions table in the main database, then synced to entity_mentions_idx in Meilisearch — a dedicated index that sits alongside the primary paragraphs index.
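The index-time flow described above (extract, store in the database, then sync to the search index) can be sketched with an in-memory SQLite table. Everything here is illustrative: the extractor is a hard-coded stand-in for the language-model call, the column names and the `synced` flag are assumptions, and the final push to Meilisearch is left out so the example stays self-contained.

```python
import sqlite3


def extract_mentions(paragraph_id, text):
    """Stand-in for the language-model step; returns canned demo output.

    A real pipeline would send the paragraph plus surrounding context to a model.
    """
    if "first Letter of the Living" in text:
        return [(1222917, paragraph_id, "the first Letter of the Living", "subject")]
    return []


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE entity_mentions ("
    " entity_id INTEGER, paragraph_id TEXT, surface_form TEXT,"
    " role TEXT, synced INTEGER DEFAULT 0)"
)


def index_paragraph(paragraph_id, text):
    """Run extraction once for a paragraph and store the resolved mentions."""
    rows = extract_mentions(paragraph_id, text)
    db.executemany(
        "INSERT INTO entity_mentions (entity_id, paragraph_id, surface_form, role)"
        " VALUES (?, ?, ?, ?)",
        rows,
    )
    db.commit()
    return len(rows)


def pending_sync_batch():
    """Return unsynced mentions as index documents and mark them as synced."""
    columns = ("id", "entity_id", "paragraph_id", "surface_form", "role")
    cursor = db.execute(
        "SELECT rowid, entity_id, paragraph_id, surface_form, role"
        " FROM entity_mentions WHERE synced = 0"
    )
    docs = [dict(zip(columns, row)) for row in cursor.fetchall()]
    db.executemany(
        "UPDATE entity_mentions SET synced = 1 WHERE rowid = ?",
        [(doc["id"],) for doc in docs],
    )
    db.commit()
    return docs


stored = index_paragraph("p1", "He was the first Letter of the Living.")
batch = pending_sync_batch()   # documents ready to send to entity_mentions_idx
```

Treating the database as the source of truth and the search index as a synced copy means the sidecar index can be rebuilt without rerunning extraction.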

At search time, when a query names a recognizable entity, the orchestrator resolves the name to an entity ID (via the alias table, which maps every known surface form), then performs a filtered search against entity_mentions_idx in parallel with the standard paragraph search. Results from both indexes are merged and reranked before the response is generated.
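The search-time flow can be sketched as two concurrent lookups followed by a merge. The two search functions below are stubs returning canned hits, and the merge (deduplicate by paragraph, keep the best score, sort) is a placeholder for the real reranker, whose details this article does not describe. When no entity ID was resolved, only the paragraph search runs.

```python
from concurrent.futures import ThreadPoolExecutor


def search_paragraphs(query):
    """Stub for the standard paragraph search (canned results)."""
    return [{"paragraph_id": "p1", "score": 0.62},
            {"paragraph_id": "p2", "score": 0.40}]


def search_mentions(entity_id):
    """Stub for a filtered search against the entity mentions index."""
    return [{"paragraph_id": "p2", "score": 0.90},
            {"paragraph_id": "p3", "score": 0.75}]


def merged_search(query, entity_id=None):
    """Query both indexes in parallel, then merge and order the hits."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        paragraph_future = pool.submit(search_paragraphs, query)
        mention_future = (
            pool.submit(search_mentions, entity_id) if entity_id is not None else None
        )
        hits = list(paragraph_future.result())
        if mention_future is not None:
            hits += mention_future.result()

    # Placeholder rerank: keep the best-scoring hit per paragraph.
    best = {}
    for hit in hits:
        pid = hit["paragraph_id"]
        if pid not in best or hit["score"] > best[pid]["score"]:
            best[pid] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


with_entity = merged_search("Mullá Ḥusayn", entity_id=1222917)
fallback = merged_search("an unrecognized name")
```

Scores from two different indexes are not directly comparable, which is one reason a dedicated reranking step is needed in practice rather than the naive max-score merge used here.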

The alias table is what makes variant resolution work: it stores every known surface form — canonical name, honorific titles, shortened forms, diacritic variants, and resolved contextual references — mapped to a single entity ID with a confidence score. A query for "Mulla Husain" (without diacritics) resolves to the same entity as "Mullá Ḥusayn" because both appear in the alias table pointing to the same ID.
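A minimal sketch of alias lookup, assuming an in-memory dictionary in place of the real alias table. The `fold` helper, the confidence values, and the threshold are illustrative assumptions. Folding strips diacritics and case, so "Mulla Husayn" matches "Mullá Ḥusayn" automatically, while a different spelling such as "Mulla Husain" still needs its own alias entry, as the paragraph above describes.

```python
import unicodedata


def fold(name):
    """Normalize a surface form: strip diacritics, fold case, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


# folded surface form -> (entity_id, confidence); confidence values are made up.
ALIASES = {}


def add_alias(surface_form, entity_id, confidence):
    ALIASES[fold(surface_form)] = (entity_id, confidence)


for surface_form, confidence in [
    ("Mullá Ḥusayn", 1.0),
    ("Mulla Husain", 0.95),
    ("the first Letter of the Living", 0.9),
    ("Bábul-Báb", 0.9),
]:
    add_alias(surface_form, 1222917, confidence)


def resolve(query_name, min_confidence=0.5):
    """Return the entity ID for a name, or None if unknown or low confidence."""
    match = ALIASES.get(fold(query_name))
    if match is None or match[1] < min_confidence:
        return None
    return match[0]
```

For example, `resolve("Mulla Husain")` and `resolve("MULLÁ ḤUSAYN")` both return 1222917, while an unknown name returns None and the query would fall back to the standard paragraph search.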

Entity extraction runs once at index time (top). At search time (bottom), name resolution and the parallel index search add little latency, because the entity index is already built and no extraction runs at query time.

Current Coverage

Entity extraction is currently complete for the core Bahá'í corpus, including the full text of the Dawn-Breakers (2,052 resolved mentions across 900 pages), God Passes By, Memorials of the Faithful, and the major collected writings of Bahá'u'lláh and 'Abdu'l-Bahá. The pipeline is running continuously; coverage across the broader library expands each week.

The entity graph for Islamic, Buddhist, and other traditions is in early stages. Questions about figures in those traditions will fall back to the standard paragraph search if no entity match is found — the result is still useful, just without the completeness that entity resolution provides.

Last updated: May 2026