Search Quality

504 test fixtures across five categories. Scores update when the battery is run and the JSON committed. Fixture set adapted from David Hunt's ocean-search-testing framework.

How to read these scores
Pass Rate
% of tests where the expected document appeared in the top 10. For tests with an expected doc ID, that's the primary gate. For text-only tests, the result text must contain the expected phrases.
MRR
Mean Reciprocal Rank: the average of 1÷rank across all tests. 1.0 means every expected document was found at position 1; 0.5 is the score when, for example, every expected document sits at position 2. Higher is better.
Text precision
% of results where the specific paragraph contained the expected phrases. Tracked separately — it's a paragraph-quality signal, not a pass/fail gate. Low text precision means we're finding the right book but returning the wrong paragraph.
Not found
The expected document didn't appear in the top 10 at all. These are the hardest failures requiring index or ranking improvements.
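As a rough illustration of the two headline metrics defined above, the sketch below computes pass rate and MRR from a list of per-test ranks. The result shape (`rank` as a 1-based position, or `null` when not found) and the `summarize` name are hypothetical, and it assumes a not-found test contributes 0 to the MRR sum. The phrase check used for text-only tests is not modeled.

```javascript
// Hypothetical result shape: rank is the 1-based position of the expected
// document, or null when it did not appear in the top 10.
function summarize(results, topK = 10) {
  let passed = 0;
  let reciprocalSum = 0;
  for (const { rank } of results) {
    const found = rank !== null && rank >= 1 && rank <= topK;
    if (found) {
      passed += 1;
      reciprocalSum += 1 / rank;
    }
    // Assumption: a test that is not found adds 0 to the reciprocal sum.
  }
  const total = results.length;
  return {
    passRate: total === 0 ? 0 : passed / total,
    mrr: total === 0 ? 0 : reciprocalSum / total,
  };
}

// Made-up ranks for four tests: positions 1, 2, 4, and one not found.
const sample = [{ rank: 1 }, { rank: 2 }, { rank: 4 }, { rank: null }];
const summary = summarize(sample);
console.log(summary.passRate); // 0.75
console.log(summary.mrr);      // (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```

Because MRR averages over all tests, a single not-found result pulls the score down even when every other test lands at rank 1.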
Latest run: 168 tests · Jun 1, 2026, 04:11 PM
Pass Rate: 87 (146/168)
MRR: 62 (0.622)
Rank 1: 51 (86 exact)
Phrase Match: 87 (146/168)
Performance: 0 (6.5s p50)

Score History

9 runs · 2026-05-31 → 2026-06-01
Date | Pass% | MRR | p50
Jun 1, 2026 | 79% | 0.833 | 3.8s
Jun 1, 2026 | 81% | 0.820 | 3.9s
Jun 1, 2026 | 75% | 0.821 | 3.5s
Jun 1, 2026 | 71% | 0.798 | 4.5s
May 31, 2026 | 56% | 0.656 | 5.9s
May 31, 2026 | 50% | 0.673 | 6.3s
May 31, 2026 | 54% | 0.634 | 6.0s
May 31, 2026 | 56% | 0.690 | 6.4s
May 31, 2026 | 2% | 0.019 | 77ms

By Category

Phrase Match: 87% (146/168)

Rank Distribution

Rank 1: 86
Rank 2–3: 31
Rank 4–10: 29
Not found: 22

Failure Breakdown

Failures (22)
Not found in top 10: 13
Timed out: 9

Failing Tests

Query | Cat | Rank | Top hit | Reason
Adam Eve | Phrase Match | NF | Various — Douay-Rheims Bible | not found
Babylonian Exile | Phrase Match | NF | Various — Deuterocanonical Books of the Bible - Ap | not found
Brihadaranyaka Upanishad | Phrase Match | NF | Vyāsa — The Mahabharata 14 | not found
Chochmah | Phrase Match | NF | Bhagat Kabir, Bhagat Farid, Bhagat Namdev, and others — Guru Granth Sahib - Bhagat Bani (Hymns o | not found
commandments of Jesus | Phrase Match | NF | Matthew — The Gospel of Matthew - ευαγγέλιο του Μα | not found
 | Phrase Match | NF |  | timeout
 | Phrase Match | NF |  | timeout
Daniel | Phrase Match | NF | Matthew — The Gospel of Matthew - ευαγγέλιο του Μα | not found
devotion | Phrase Match | NF | Confucius — The Great Learning | not found
Diaspora | Phrase Match | NF | Various Sikh Gurus — Guru Granth Sahib - Raag Asa | not found
dietary practices | Phrase Match | NF | Unknown — Sutra Collection (B) | not found
eternal life | Phrase Match | NF | Various — Douay-Rheims Bible | not found
ever-advancing | Phrase Match | NF | Bahá’u’lláh — Gleanings from the Writings of Bahá’u’ll | not found
 | Phrase Match | NF |  | timeout
 | Phrase Match | NF |  | timeout
 | Phrase Match | NF |  | timeout
 | Phrase Match | NF |  | timeout
 | Phrase Match | NF |  | timeout
Isa | Phrase Match | NF | Bahá’u’lláh — Epistle to the Son of the Wolf | not found
 | Phrase Match | NF |  | timeout
 | Phrase Match | NF |  | timeout
Zhong | Phrase Match | NF | Laozi, tr. Chang Chung-yuan — Tao Te Ching (tr. Chang Chung-yuan) | not found

Change Log

Interventions applied to improve search quality. Updated manually when a change ships.

Date | Change | Result
2026-05-31 | Initial search quality battery established (52 core + 491 ocean fixtures) | Core: 58%, Ocean: baseline pending (Meilisearch sync queue clearing)

Analysis & Improvement Strategy

Generated by claude-opus-4-7 on June 1, 2026. Re-run with the --analyze flag to update.

Ocean (504 fixtures)

Pass rate has climbed to 71% (37/52) with MRR 0.798 and zero timeouts or 5xx errors. The infrastructure is healthy, and authority-ranking (100%) and cross-tradition (100%) are solved. The dominant failure mode is now text mismatch (10 of 15 failures): the right *kind* of document is retrieved at rank 1, but the wrong paragraph is returned. The remaining 5 failures are entity-aware queries that miss the top 10 entirely.

Strengths

  • Infrastructure fully stabilized — Zero timeouts, zero HTTP errors, zero network errors across all 52 fixtures. p50 of 4.5s and p95 of 9.7s are workable (p95 down from ~25s in the prior run).
  • Authority and cross-tradition routing solid — authority-ranking is at 100% (4/4) and cross-tradition at 100% (5/5). The authority weighting and multi-tradition retrieval logic are correctly tuned and need no further work.
  • Correct document retrieval even on failures — Almost every concept/phrase failure has top_hit_author or top_hit_title from the correct tradition or author (e.g. Bahá'u'lláh for tablet-of-ahmad, Muhammad for islamic-hadith-patience). Retrieval is finding the right corpus region; only paragraph-level selection is wrong.

Critical Gaps

  • Paragraph-level reranking is the bottleneck — 10 of 15 failures are 'text mismatch' at rank 1, meaning hybrid search picks the correct book but the wrong passage within it. Authority weighting may even be hurting here by pulling toward high-authority paragraphs that don't contain the target tokens. Examples: tablet-of-ahmad-nightingale (rank 1, wrong tablet by same author); islamic-daily-prayers (rank 1, wrong sura); christian-beatitudes-meek (rank 1, Apocrypha instead of Matthew 5); hindu-bhagavad-gita-action (rank 1, Mahabharata 6 instead of Gita).
  • Entity-aware queries still miss top 10 — 5 entity queries return rank=null. Top hits drift to generic Bahá'u'lláh works or Shoghi Effendi histories rather than passages naming the specific entity (Mullá Husayn, the Báb's declaration, Letters of the Living). Affected tests (all rank=null): entity-mulla-husayn-first-believer, entity-declaration-bab-1844, entity-letters-of-living, entity-bab-martyrdom, shoghi-effendi-progressive-revelation.
  • Canonical-text queries routed to wrong canonical book — Famous phrases match the *theme* via semantic similarity but land in adjacent canonical books. Golden Rule and Beatitudes go to Apocrypha; Bhagavad Gita verses go to other Mahabharata books; Islamic ritual queries go to Quranic tafsir rather than the ritual-prescriptive verse. Examples: christian-golden-rule → Apocrypha; christian-beatitudes-meek → Apocrypha; hindu-bhagavad-gita-action → Mahabharata 6; islamic-zakat-charity → Sura II rather than zakat-prescriptive verse.

Action Plan

#1 Add token-overlap paragraph reranker before authority weighting (high impact, medium effort)

10 of 15 failures are wrong-paragraph-in-right-book. Add a second-stage reranker that scores each candidate paragraph on query-token coverage (exact lemma hits, rare-term IDF, bigram overlap) and blend it with the current authority×match score. This directly attacks the dominant failure class without touching retrieval.

Implementation: In api/lib/authority.js (or new api/lib/paragraph-rerank.js), after hybridSearch returns the top ~50 candidates, compute per-paragraph: (a) fraction of query content tokens present, (b) sum of IDF for matched rare tokens, (c) tightest-window proximity score. Combine as final_score = 0.45*lexical_overlap + 0.30*semantic + 0.15*authority + 0.10*match_quality. Gate behind RERANK_V2=1 env var so it can be A/B'd.
Success: phrase-match 67%→85%+, concept-match 68%→80%+, overall 71%→82%+
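A minimal sketch of the blend described in #1, under stated assumptions: only signal (a), query-token coverage, is implemented as `lexical_overlap`; rare-term IDF and proximity are left out. The candidate fields (`text`, `semantic`, `authority`, `matchQuality`, all normalized to 0..1), the stopword list, the function names, and the sample paragraphs are hypothetical and not taken from api/lib.

```javascript
// Tiny illustrative stopword list; a real one would be larger.
const STOPWORDS = new Set(["the", "of", "a", "an", "and", "are", "is", "to", "in", "upon"]);

// Lowercase, split on anything that is not a letter or digit, drop stopwords.
function contentTokens(text) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((t) => t.length > 0 && !STOPWORDS.has(t));
}

// Fraction of distinct query content tokens that appear in the paragraph.
function lexicalOverlap(query, paragraph) {
  const queryTokens = new Set(contentTokens(query));
  if (queryTokens.size === 0) return 0;
  const paragraphTokens = new Set(contentTokens(paragraph));
  let hits = 0;
  for (const t of queryTokens) {
    if (paragraphTokens.has(t)) hits += 1;
  }
  return hits / queryTokens.size;
}

// Blend with the weights proposed in the plan (they sum to 1.0).
function rerank(query, candidates) {
  return candidates
    .map((c) => ({
      ...c,
      finalScore:
        0.45 * lexicalOverlap(query, c.text) +
        0.30 * c.semantic +
        0.15 * c.authority +
        0.10 * c.matchQuality,
    }))
    .sort((a, b) => b.finalScore - a.finalScore);
}

// Hypothetical candidates: a thematically close paragraph vs. the literal verse.
const ranked = rerank("Blessed are the meek", [
  { id: "apocrypha-p1", text: "The humble shall be exalted.", semantic: 0.9, authority: 0.8, matchQuality: 0.5 },
  { id: "matthew-5-5", text: "Blessed are the meek: for they shall inherit the earth.", semantic: 0.8, authority: 1.0, matchQuality: 0.5 },
]);
console.log(ranked[0].id); // "matthew-5-5"
```

In this toy case the semantically stronger paragraph loses because it covers none of the query's content tokens, which is the behavior the plan is after.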
#2 Build entity dictionary with canonical-passage anchors (high impact, medium effort)

All 5 rank=null failures are entity queries. A static entity table mapping named entities (Mullá Husayn, Letters of the Living, Báb, Declaration 1844, Tabriz martyrdom) to canonical passage IDs in God Passes By, Dawn-Breakers, etc., gives a deterministic retrieval boost when the query mentions the entity.

Implementation: Create api/data/entities.json with entries like {name, aliases[], canonical_paragraph_ids[], boost}. In api/lib/search.js hybridSearch, after query normalization, detect entity mentions via case-insensitive alias match and either (a) inject canonical_paragraph_ids into the candidate pool with a +1.5 score bump, or (b) add an OR clause to the Meilisearch query targeting those IDs. Start with ~30 high-value Bahá'í entities covered by the failing tests.
Success: entity-aware 56%→90%+, overall 71%→80%+
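Option (a) from #2 could look roughly like the sketch below. The entries, paragraph IDs, and function names are placeholders rather than real index data, and the diacritic stripping goes one step beyond the plain case-insensitive match the plan describes, so treat it as an assumption.

```javascript
// Hypothetical entries; the paragraph IDs are placeholders, not real index IDs.
const ENTITIES = [
  { name: "Mullá Husayn", aliases: ["Mulla Husayn", "Mullá Ḥusayn"], canonical_paragraph_ids: ["example-p1"], boost: 1.5 },
  { name: "Letters of the Living", aliases: [], canonical_paragraph_ids: ["example-p2", "example-p3"], boost: 1.5 },
];

// Lowercase and strip combining marks so "Mullá" and "mulla" compare equal.
function normalize(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
}

// Return every entity whose name or alias occurs in the query.
function detectEntities(query, entities = ENTITIES) {
  const q = normalize(query);
  return entities.filter((e) =>
    [e.name, ...e.aliases].some((alias) => q.includes(normalize(alias)))
  );
}

// Inject canonical paragraphs into the pool (or bump them if already present).
function applyEntityBoost(query, candidates, entities = ENTITIES) {
  const byId = new Map(candidates.map((c) => [c.id, { ...c }]));
  for (const entity of detectEntities(query, entities)) {
    for (const id of entity.canonical_paragraph_ids) {
      const entry = byId.get(id) ?? { id, score: 0 };
      entry.score += entity.boost;
      byId.set(id, entry);
    }
  }
  return [...byId.values()].sort((a, b) => b.score - a.score);
}

const boosted = applyEntityBoost("Who was mulla husayn?", [{ id: "generic-p9", score: 1.0 }]);
console.log(boosted.map((c) => c.id)); // [ 'example-p1', 'generic-p9' ]
```

Plain substring matching is the simplest option but can over-match short aliases (for instance "Báb" inside a longer word), so a word-boundary check would probably be needed before relying on it.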
#3 Boost BM25/phrase weight for queries with rare canonical tokens (medium impact, low effort)

Queries like 'nightingale of paradise singeth upon the twigs' and 'Blessed are the meek' contain rare lexical signatures that BM25 should pin immediately but semantic embeddings dilute. Detect high-IDF query tokens and shift the hybrid weight toward keyword for those queries.

Implementation: In api/lib/search.js hybridSearch, compute mean IDF of query content tokens against the corpus stats. If mean IDF > threshold (or any token IDF > X), set keyword_weight from default (e.g. 0.5) to 0.75 and enable Meilisearch phrase matching with quotes around top-2 rarest bigrams. Also enable matchingStrategy='all' for these queries.
Success: phrase-match 67%→80%+, especially tablet-of-ahmad-nightingale, christian-beatitudes-meek, hindu-bhagavad-gita-action
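The IDF gate in #3 might be sketched as follows. The threshold value, the smoothed IDF variant, the document-frequency numbers, and the function names are all assumptions made for illustration; the plan leaves the threshold open. Given the semanticRatio definition elsewhere on this page (0 = keyword, 1 = semantic), the returned keyword weight would presumably map to semanticRatio = 1 − keywordWeight.

```javascript
// Smoothed inverse document frequency: ln(N / (1 + df)).
// docFreq maps a token to the number of paragraphs containing it.
function idf(token, docFreq, totalDocs) {
  const df = docFreq.get(token) ?? 0;
  return Math.log(totalDocs / (1 + df));
}

// Shift toward keyword matching when the query's tokens are rare on average.
function chooseKeywordWeight(tokens, docFreq, totalDocs, opts = {}) {
  const { meanThreshold = 4.0, defaultWeight = 0.5, boostedWeight = 0.75 } = opts;
  if (tokens.length === 0) return defaultWeight;
  const meanIdf =
    tokens.reduce((sum, t) => sum + idf(t, docFreq, totalDocs), 0) / tokens.length;
  return meanIdf > meanThreshold ? boostedWeight : defaultWeight;
}

// Made-up corpus statistics for illustration only.
const docFreq = new Map([["nightingale", 12], ["paradise", 900], ["love", 40000]]);
const totalDocs = 100000;

const rareWeight = chooseKeywordWeight(["nightingale", "paradise"], docFreq, totalDocs);
const commonWeight = chooseKeywordWeight(["love"], docFreq, totalDocs);
console.log(rareWeight, 1 - rareWeight);     // 0.75 0.25
console.log(commonWeight, 1 - commonWeight); // 0.5 0.5
```

The "any token IDF > X" variant from the plan would replace the mean with a max over the tokens; both are cheap enough to compute per query.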
#4 Penalize Apocrypha/secondary-canon when query matches primary canon (medium impact, low effort)

Two Christian queries (Golden Rule, Beatitudes) and several Islamic/Hindu queries land in deuterocanonical or commentary works rather than the primary scripture. Add a soft penalty when a query semantically matches a known primary-canon passage but the top hit is from a secondary/commentary source.

Implementation: Tag documents with canonical_tier (gospel/quran/gita = 0, apocrypha/tafsir/commentary = 1, history = 2). In paragraph reranker, apply -0.15 score adjustment per tier-step when query content suggests primary scripture (detected via tradition-specific keyword lists: 'Blessed are', 'do unto', 'Krishna said', 'salat', 'zakat').
Success: christian-golden-rule, christian-beatitudes-meek, hindu-bhagavad-gita-action, islamic-zakat-charity all pass
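The tier penalty in #4 can be illustrated with a short sketch. The cue list reuses the keywords named above; the candidate shape (`score`, `canonical_tier`), the function names, and the sample scores are hypothetical.

```javascript
const TIER_PENALTY = 0.15; // score deducted per tier step, as proposed in the plan

// Cue phrases that suggest the query targets primary scripture.
const PRIMARY_CUES = ["blessed are", "do unto", "krishna said", "salat", "zakat"];

function suggestsPrimaryScripture(query) {
  const q = query.toLowerCase();
  return PRIMARY_CUES.some((cue) => q.includes(cue));
}

// canonical_tier: 0 = primary scripture, 1 = apocrypha/tafsir/commentary, 2 = history.
function applyTierPenalty(query, candidates) {
  const penalize = suggestsPrimaryScripture(query);
  return candidates
    .map((c) => ({
      ...c,
      score: penalize ? c.score - TIER_PENALTY * c.canonical_tier : c.score,
    }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical scores: the Apocrypha paragraph starts slightly ahead.
const adjusted = applyTierPenalty("Blessed are the meek", [
  { id: "apocrypha-p1", score: 0.80, canonical_tier: 1 },
  { id: "matthew-5-5", score: 0.75, canonical_tier: 0 },
]);
console.log(adjusted[0].id); // "matthew-5-5"
```

Queries without a cue phrase pass through with their scores unchanged, so the penalty stays soft and scoped to the cases the plan targets.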
#5 Add eval-mode bypass and per-stage timing metrics (low impact, low effort)

Latency is now acceptable, but the 9.7s p95 still leaves room for improvement. Carrying the prior recommendation forward: expose per-stage timings (intent, research, craft, rerank) in response headers and add ?mode=raw to skip jafar-pipeline entirely. This makes future tuning measurable and unblocks per-stage budgets.

Implementation: In api/lib/jafar-pipeline.js, wrap each stage with performance.now() and emit X-Stage-Timings header. Add mode=raw query param that calls hybridSearch directly and returns without intent/craft. Wire test runner to record these timings per fixture.
Success: p95 latency visibility; potential 30-40% latency reduction in eval runs
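One possible shape for the timing wrapper in #5 is sketched below. The `createStageTimer`, `formatStageTimings`, and `runPipeline` names are hypothetical, the stages and `hybridSearch` are passed in rather than imported from the real pipeline, and the `name;dur=ms` header layout is borrowed from the standard Server-Timing header as an assumption; the plan does not specify a format for X-Stage-Timings.

```javascript
// Pure formatter: turns recorded timings into a header value.
function formatStageTimings(timings) {
  return timings.map(({ stage, ms }) => `${stage};dur=${ms.toFixed(1)}`).join(", ");
}

// Collects per-stage durations; the clock is injectable for testing.
function createStageTimer(now = () => performance.now()) {
  const timings = [];
  return {
    // Runs fn (sync or async), records elapsed ms, and passes the result through.
    async time(stage, fn) {
      const start = now();
      try {
        return await fn();
      } finally {
        timings.push({ stage, ms: now() - start });
      }
    },
    header: () => formatStageTimings(timings),
  };
}

// Hypothetical pipeline runner: mode "raw" skips every stage except search.
async function runPipeline(query, { mode, stages, hybridSearch }) {
  const timer = createStageTimer();
  if (mode === "raw") {
    const results = await timer.time("search", () => hybridSearch(query));
    return { results, headers: { "X-Stage-Timings": timer.header() } };
  }
  let value = query;
  for (const [name, fn] of Object.entries(stages)) {
    const input = value;
    value = await timer.time(name, () => fn(input));
  }
  return { results: value, headers: { "X-Stage-Timings": timer.header() } };
}

console.log(formatStageTimings([{ stage: "intent", ms: 12.34 }, { stage: "rerank", ms: 5 }]));
// intent;dur=12.3, rerank;dur=5.0
```

Recording the timing in a `finally` block means a stage that throws still shows up in the header, which is useful when a slow stage is also the one that fails.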

Notes on Previous Attempts

The previous plan was on target: P1 (bypass pipeline / timeouts) eliminated the timeout class entirely (0 timeouts now vs 6 before), and overall pass rate moved 56%→71%. P3 (paragraph-level reranking) was correctly identified as the next big lever and is now the #1 priority — the failure pattern has crystallized around 'right book, wrong paragraph,' confirming that diagnosis. P4 (entity index) also remains valid and is reprioritized to #2 since entity-aware is the lowest-scoring category (56%). No dead ends; the restraint shown in run 1 (not tuning against noise) continues to pay dividends. P5 from the prior run (stage-level metrics) is still unimplemented and carried forward — cheap and worth doing before further tuning.

Search Tuning Parameters

These are the primary knobs controlling retrieval, ranking, and authority weighting. Changing any of these affects scores without requiring a reindex.

Retrieval (Meilisearch)

Parameter | Value | Env var | Effect
semanticRatio | 0.5 |  | Blend of BM25 keyword (0) vs. 512-dim semantic (1). 0.5 = equal weight.
overFetch multiplier |  | AUTHORITY_RERANK_MULTIPLIER | Fetch this many × limit before reranking. With authority as tiebreaker (not override), 5× keeps reranking within the genuinely relevant result set.
TOP_K | 10 |  | Results returned to caller. Battery tests pass/fail within this window.

Authority Scoring

score = relevance × olMultiplier × (1 + boost × (authority − 5) / 5)

Parameter | Value | Env var | Effect
authorityBoost | 0.2 | AUTHORITY_BOOST | Tiebreaker weight. auth=10 gets +20% vs auth=5; auth=1 gets −16%. A poor match from a canonical source cannot beat a strong match from a secondary source.
olSourceMultiplier | 1.2× | OL_SOURCE_MULTIPLIER | OceanLibrary hits get a 1.2× nudge to compensate for archaic language scoring lower in BM25. Not enough to override a significantly better non-OL match.
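The scoring formula above can be checked with a few lines of code. This is a sketch under the assumption that olMultiplier is 1.2 for OceanLibrary hits and 1.0 otherwise; the function name and option names are hypothetical and not the actual exports of api/lib/authority.js.

```javascript
// score = relevance × olMultiplier × (1 + boost × (authority − 5) / 5)
function authorityScore(relevance, authority, { boost = 0.2, olMultiplier = 1.0 } = {}) {
  return relevance * olMultiplier * (1 + (boost * (authority - 5)) / 5);
}

// Effect of authority alone at relevance 1.0:
console.log(authorityScore(1, 10).toFixed(2)); // "1.20"  (+20% vs. auth=5)
console.log(authorityScore(1, 5).toFixed(2));  // "1.00"
console.log(authorityScore(1, 1).toFixed(2));  // "0.84"  (−16% vs. auth=5)

// Tiebreaker, not override: a weak canonical match stays below a strong secondary match.
const weakCanonical = authorityScore(0.5, 10);  // about 0.60
const strongSecondary = authorityScore(0.9, 5); // 0.90
console.log(weakCanonical < strongSecondary);   // true
```

With boost = 0.2 the multiplier only spans 0.8 (authority 0) to 1.2 (authority 10), so authority can reorder near-ties but cannot rescue a match whose relevance is far lower.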

Multi-Index RRF Fusion

RRF score = Σ weight / (K + rank) across indexes

Index | Weight | Description
paragraphs (main) | 1.0 | Primary text+context hybrid search. Base signal.
HyPE questions | 1.5 | Hypothetical question match. Highest signal when queries are question-like. Boosts relevant doc sections above pure keyword rank.
entity mentions | 1.0 | Named-entity sidecar. Active when resolved entity IDs are present (entity-aware search).
RRF K constant | 60 | Rank fusion smoothing. Higher = flatter curve, less winner-take-all.
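To make the fusion formula concrete, here is a small sketch of weighted reciprocal rank fusion using the weights and K from the table. The input shape (one ranked list of paragraph IDs per index), the `rrfFuse` name, and the sample IDs are hypothetical.

```javascript
// RRF score = Σ weight / (K + rank), summed over every index that returned the id.
function rrfFuse(rankedLists, k = 60) {
  const scores = new Map();
  for (const { weight, ids } of rankedLists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

const fused = rrfFuse([
  { weight: 1.0, ids: ["p1", "p2", "p3"] }, // paragraphs (main)
  { weight: 1.5, ids: ["p2", "p4"] },       // HyPE questions
  { weight: 1.0, ids: ["p3"] },             // entity mentions
]);
console.log(fused.map((r) => r.id)); // [ 'p2', 'p3', 'p4', 'p1' ]
```

Here p1 tops the main index but ends up last, because every other paragraph either appears in two indexes or ranks high in the more heavily weighted HyPE index. With K = 60 the gap between adjacent ranks is small, so agreement across indexes counts for more than position within one.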

Authority Tiers

Scores 0–10 set per document in api/lib/authority.js. Same author in different container types → different authority (e.g. primary scripture vs. extracted quote in secondary source).

Score | Source type | Examples
10 | Primary scripture | Quran, Bible, Aqdas, Íqán, Hidden Words, Seven Valleys, Bhagavad Gita, Analects
8–9 | Core canonical writings | Shoghi Effendi letters, ʿAbdu'l-Bahá talks & tablets, authenticated compilations
5–7 | Authoritative secondary | UHJ letters, approved histories, classic commentaries
1–4 | Supplemental / general | Modern scholarship, devotional works, encyclopedias
−0.5 | Extracted quote penalty | Bahá'u'lláh quoted in Shoghi Effendi = 9.5 (not 10); the canonical source wins when indexed

Running the Battery

# Unified suite (504 fixtures)
node tests/quality/score-search.mjs --write-report

# With AI analysis
node tests/quality/score-search.mjs --write-report --analyze

# Single category
node tests/quality/score-search.mjs --category=phrase-match

Writes tests/quality/results-latest.json. Commit the file to update this page. The --analyze flag calls claude-opus-4-7 to generate the improvement strategy. See the scoring model guide for details on how results are evaluated.