Search Quality
504 test fixtures across five categories. Scores update when the battery is run and the JSON committed. Fixture set adapted from David Hunt's ocean-search-testing framework.
- Pass Rate
- % of tests where the expected document appeared in the top 10. For tests with an expected doc ID, that's the primary gate. For text-only tests, the result text must contain the expected phrases.
- MRR
- Mean Reciprocal Rank: the average of 1÷rank across all tests. 1.0 means every expected document was found at position 1; 0.5 is the score if every expected document lands at position 2. Higher is better.
- Text precision
- % of results where the specific paragraph contained the expected phrases. Tracked separately — it's a paragraph-quality signal, not a pass/fail gate. Low text precision means we're finding the right book but returning the wrong paragraph.
- Not found
- The expected document didn't appear in the top 10 at all. These are the hardest failures requiring index or ranking improvements.
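The pass-rate and MRR definitions above can be illustrated with a small sketch. It assumes a hypothetical result shape of `{ rank }` (1-based position of the expected document, or `null` when it is not in the top 10) and assumes that not-found tests contribute 0 to the reciprocal sum, which is the usual MRR convention but is not confirmed by this page. The phrase check for text-only tests is left out. This is not the code in `score-search.mjs`.

```javascript
// Minimal sketch: pass rate, MRR, and not-found count from a list of ranks.
// The { rank } shape and the function name are assumptions for illustration.
function summarize(results, topK = 10) {
  let passed = 0;
  let reciprocalSum = 0;
  for (const { rank } of results) {
    const found = rank !== null && rank <= topK;
    if (found) {
      passed += 1;
      reciprocalSum += 1 / rank; // rank 1 adds 1.0, rank 2 adds 0.5, ...
    }
    // Not-found tests add nothing to the reciprocal sum (assumed convention).
  }
  const n = results.length;
  return {
    passRate: n === 0 ? 0 : passed / n,
    mrr: n === 0 ? 0 : reciprocalSum / n,
    notFound: n - passed,
  };
}

// Four tests: found at ranks 1, 2, and 4, plus one not found.
const sample = [{ rank: 1 }, { rank: 2 }, { rank: null }, { rank: 4 }];
console.log(summarize(sample));
```

For the sample input, three of four tests pass and the reciprocal ranks sum to 1.75, so MRR can sit well below the pass rate when hits land low in the top 10.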
Score History
| Date | Pass% | MRR | p50 |
|---|---|---|---|
| Jun 1, 2026 | 79% | 0.833 | 3.8s |
| Jun 1, 2026 | 81% | 0.820 | 3.9s |
| Jun 1, 2026 | 75% | 0.821 | 3.5s |
| Jun 1, 2026 | 71% | 0.798 | 4.5s |
| May 31, 2026 | 56% | 0.656 | 5.9s |
| May 31, 2026 | 50% | 0.673 | 6.3s |
| May 31, 2026 | 54% | 0.634 | 6.0s |
| May 31, 2026 | 56% | 0.690 | 6.4s |
| May 31, 2026 | 2% | 0.019 | 77ms |
By Category
Rank Distribution
Failure Breakdown
Failing Tests
| Query | Cat | Rank | Top hit | Reason |
|---|---|---|---|---|
| Adam Eve | Phrase Match | NF | Various — Douay-Rheims Bible | not found |
| Babylonian Exile | Phrase Match | NF | Various — Deuterocanonical Books of the Bible - Ap | not found |
| Brihadaranyaka Upanishad | Phrase Match | NF | Vyāsa — The Mahabharata 14 | not found |
| Chochmah | Phrase Match | NF | Bhagat Kabir, Bhagat Farid, Bhagat Namdev, and others — Guru Granth Sahib - Bhagat Bani (Hymns o | not found |
| commandments of Jesus | Phrase Match | NF | Matthew — The Gospel of Matthew - ευαγγέλιο του Μα | not found |
| | Phrase Match | NF | — | timeout |
| | Phrase Match | NF | — | timeout |
| Daniel | Phrase Match | NF | Matthew — The Gospel of Matthew - ευαγγέλιο του Μα | not found |
| devotion | Phrase Match | NF | Confucius — The Great Learning | not found |
| Diaspora | Phrase Match | NF | Various Sikh Gurus — Guru Granth Sahib - Raag Asa | not found |
| dietary practices | Phrase Match | NF | Unknown — Sutra Collection (B) | not found |
| eternal life | Phrase Match | NF | Various — Douay-Rheims Bible | not found |
| ever-advancing | Phrase Match | NF | Bahá’u’lláh — Gleanings from the Writings of Bahá’u’ll | not found |
| | Phrase Match | NF | — | timeout |
| | Phrase Match | NF | — | timeout |
| | Phrase Match | NF | — | timeout |
| | Phrase Match | NF | — | timeout |
| | Phrase Match | NF | — | timeout |
| Isa | Phrase Match | NF | Bahá’u’lláh — Epistle to the Son of the Wolf | not found |
| | Phrase Match | NF | — | timeout |
| | Phrase Match | NF | — | timeout |
| Zhong | Phrase Match | NF | Laozi, tr. Chang Chung-yuan — Tao Te Ching (tr. Chang Chung-yuan) | not found |
Change Log
Interventions applied to improve search quality. Updated manually when a change ships.
| Date | Change | Result |
|---|---|---|
| 2026-05-31 | Initial search quality battery established (52 core + 491 ocean fixtures) | Core: 58%, Ocean: baseline pending (Meilisearch sync queue clearing) |
Analysis & Improvement Strategy
Generated by claude-opus-4-7 on June 1, 2026.
Re-run with --analyze flag to update.
Pass rate has climbed to 71% (37/52) with MRR 0.798 and zero timeouts/5xx — the infra is healthy and authority-ranking (100%) and cross-tradition (100%) are solved. The dominant failure mode is now text-mismatch (10/15 failures): the right *kind* of document is retrieved at rank 1 but the wrong paragraph, plus 5 entity-aware queries that miss the top 10 entirely.
Strengths
- Infrastructure fully stabilized — Zero timeouts, zero HTTP errors, zero network errors across all 52 fixtures. p50 of 4.5s and p95 of 9.7s are workable (p95 down from ~25s in the prior run).
- Authority and cross-tradition routing solid — authority-ranking 100% (4/4) and cross-tradition 100% (5/5) — the authority weighting and multi-tradition retrieval logic are correctly tuned and need no further work.
- Correct document retrieval even on failures — Almost every concept/phrase failure has top_hit_author or top_hit_title from the correct tradition or author (e.g. Bahá'u'lláh for tablet-of-ahmad, Muhammad for islamic-hadith-patience). Retrieval is finding the right corpus region; only paragraph-level selection is wrong.
Critical Gaps
- Paragraph-level reranking is the bottleneck: 10 of 15 failures are 'text mismatch' at rank 1, meaning hybrid search picks the correct book but the wrong passage within it. Authority weighting may even be hurting here by pulling toward high-authority paragraphs that don't contain the target tokens. Examples, all at rank 1: tablet-of-ahmad-nightingale (wrong tablet by the same author), islamic-daily-prayers (wrong sura), christian-beatitudes-meek (Apocrypha instead of Matthew 5), hindu-bhagavad-gita-action (Mahabharata 6 instead of the Gita).
- Entity-aware queries still miss the top 10: 5 entity queries return rank=null. Top hits drift to generic Bahá'u'lláh works or Shoghi Effendi histories rather than passages naming the specific entity (Mullá Husayn, the Báb's declaration, Letters of the Living). Affected fixtures, all rank=null: entity-mulla-husayn-first-believer, entity-declaration-bab-1844, entity-letters-of-living, entity-bab-martyrdom, shoghi-effendi-progressive-revelation.
- Canonical-text queries are routed to the wrong canonical book: famous phrases match the *theme* via semantic similarity but land in adjacent canonical books. Golden Rule and Beatitudes go to Apocrypha; Bhagavad Gita verses go to other Mahabharata books; Islamic ritual queries go to Quranic tafsir rather than the ritual-prescriptive verse. Examples: christian-golden-rule → Apocrypha; christian-beatitudes-meek → Apocrypha; hindu-bhagavad-gita-action → Mahabharata 6; islamic-zakat-charity → Sura II rather than the zakat-prescriptive verse.
Action Plan
10 of 15 failures are wrong-paragraph-in-right-book. Add a second-stage reranker that scores each candidate paragraph on query-token coverage (exact lemma hits, rare-term IDF, bigram overlap) and blend it with the current authority×match score. This directly attacks the dominant failure class without touching retrieval.
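As a rough illustration of the second-stage reranker proposed above, the sketch below scores each candidate paragraph on IDF-weighted query-token coverage plus bigram overlap, then blends that with the existing score. Everything here is an assumption made for the example: the candidate shape `{ id, score, text }`, the 0.7/0.3 and `alpha` weights, the base score being normalized to 0..1, and plain lowercased tokens standing in for lemmas.

```javascript
// Lowercased letter/number runs; a real version would lemmatize.
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
}

function bigrams(tokens) {
  const out = [];
  for (let i = 0; i + 1 < tokens.length; i++) {
    out.push(tokens[i] + " " + tokens[i + 1]);
  }
  return out;
}

// idf is a Map<token, weight>; tokens missing from it get defaultIdf.
function coverageScore(query, paragraph, idf = new Map(), defaultIdf = 1) {
  const qTokens = tokenize(query);
  if (qTokens.length === 0) return 0;
  const pTokenList = tokenize(paragraph);
  const pTokens = new Set(pTokenList);
  const pBigrams = new Set(bigrams(pTokenList));

  let hit = 0;
  let total = 0;
  for (const t of new Set(qTokens)) {
    const w = idf.get(t) ?? defaultIdf;
    total += w;
    if (pTokens.has(t)) hit += w;
  }
  const tokenCoverage = total === 0 ? 0 : hit / total;

  const qBigrams = bigrams(qTokens);
  const bigramOverlap = qBigrams.length === 0
    ? 0
    : qBigrams.filter((b) => pBigrams.has(b)).length / qBigrams.length;

  return 0.7 * tokenCoverage + 0.3 * bigramOverlap; // illustrative weights
}

// Blend the first-stage score with coverage; alpha is a hypothetical knob.
function rerank(query, candidates, alpha = 0.5) {
  return candidates
    .map((c) => ({
      ...c,
      finalScore: (1 - alpha) * c.score + alpha * coverageScore(query, c.text),
    }))
    .sort((a, b) => b.finalScore - a.finalScore);
}

// Made-up candidates: the first has the higher first-stage score but
// covers fewer query tokens than the second.
const reranked = rerank("Blessed are the meek", [
  { id: "other-book", score: 0.9, text: "a passage that mentions the meek only in passing" },
  { id: "matthew-5", score: 0.8, text: "Blessed are the meek: for they shall inherit the earth" },
]);
console.log(reranked.map((c) => c.id));
```

Because the blend only reorders candidates that retrieval already returned, it leaves the retrieval stage untouched, which is the property the plan asks for.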
All 5 rank=null failures are entity queries. A static entity table mapping named entities (Mullá Husayn, Letters of the Living, Báb, Declaration 1844, Tabriz martyrdom) to canonical passage IDs in God Passes By, Dawn-Breakers, etc., gives a deterministic retrieval boost when the query mentions the entity.
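A static entity table of this kind might look like the sketch below. The aliases come from the entities named above, but the passage IDs are placeholders, and the function names, boost size, and candidate shape are all hypothetical; the real table would need the actual passage IDs from the index.

```javascript
// Strip diacritics and lowercase so "Mullá Husayn" matches "mulla husayn".
function foldDiacritics(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
}

// Alias -> canonical passage IDs. The IDs here are placeholders only.
const ENTITY_TABLE = new Map([
  ["mulla husayn", ["PLACEHOLDER_PASSAGE_A"]],
  ["letters of the living", ["PLACEHOLDER_PASSAGE_B"]],
]);

// Returns Map<passageId, boost> for every entity alias found in the query.
function entityBoosts(query, table = ENTITY_TABLE, boost = 0.25) {
  const q = foldDiacritics(query);
  const boosts = new Map();
  for (const [alias, passageIds] of table) {
    if (q.includes(alias)) {
      for (const id of passageIds) boosts.set(id, boost);
    }
  }
  return boosts;
}

// Multiply the score of boosted candidates, then re-sort.
function applyBoosts(candidates, boosts) {
  return candidates
    .map((c) => ({ ...c, score: c.score * (1 + (boosts.get(c.id) ?? 0)) }))
    .sort((a, b) => b.score - a.score);
}

const boosts = entityBoosts("Who was Mullá Husayn?");
console.log([...boosts.entries()]);
```

A multiplicative boost only helps passages already in the candidate set; if the mapped passages are missing from retrieval entirely, they would have to be injected as candidates instead.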
Queries like 'nightingale of paradise singeth upon the twigs' and 'Blessed are the meek' contain rare lexical signatures that BM25 should pin immediately but semantic embeddings dilute. Detect high-IDF query tokens and shift the hybrid weight toward keyword for those queries.
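One way to express the keyword shift is a per-query choice of `semanticRatio`, sketched below. The thresholds, the lowered ratio of 0.2, and the IDF numbers are made up for the example; treating tokens that are absent from the IDF table as non-rare is also just an assumption. Only the 0 = keyword, 1 = semantic convention and the 0.5 default come from this page.

```javascript
// Pick a semanticRatio per query: lean toward keyword matching when the
// query carries enough rare (high-IDF) tokens. All thresholds are hypothetical.
function chooseSemanticRatio(
  queryTokens,
  idf,
  { baseRatio = 0.5, keywordRatio = 0.2, idfThreshold = 6, minRareTokens = 2 } = {}
) {
  // Tokens not present in the IDF table count as idf 0 (not rare) here.
  const rare = queryTokens.filter((t) => (idf.get(t) ?? 0) >= idfThreshold);
  return rare.length >= minRareTokens ? keywordRatio : baseRatio;
}

// Illustrative IDF values, not measured from the corpus.
const exampleIdf = new Map([
  ["nightingale", 8.1],
  ["twigs", 9.0],
  ["paradise", 5.2],
  ["the", 0.1],
  ["of", 0.1],
  ["devotion", 3.0],
]);

const lexicalQuery = ["nightingale", "of", "paradise", "singeth", "upon", "the", "twigs"];
console.log(chooseSemanticRatio(lexicalQuery, exampleIdf)); // two rare tokens, keyword-leaning
console.log(chooseSemanticRatio(["devotion"], exampleIdf)); // no rare tokens, default blend
```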
Two Christian queries (Golden Rule, Beatitudes) and several Islamic/Hindu queries land in deuterocanonical or commentary works rather than the primary scripture. Add a soft penalty when a query semantically matches a known primary-canon passage but the top hit is from a secondary/commentary source.
Latency is now acceptable but p95 9.7s still has headroom. Carrying the prior recommendation forward: expose per-stage timings (intent, research, craft, rerank) in response headers and add ?mode=raw to skip jafar-pipeline entirely. This makes future tuning measurable and unblocks per-stage budgets.
Notes on Previous Attempts
The previous plan was on target: P1 (bypass pipeline / timeouts) eliminated the timeout class entirely (0 timeouts now vs 6 before), and overall pass rate moved 56%→71%. P3 (paragraph-level reranking) was correctly identified as the next big lever and is now the #1 priority — the failure pattern has crystallized around 'right book, wrong paragraph,' confirming that diagnosis. P4 (entity index) also remains valid and is reprioritized to #2 since entity-aware is the lowest-scoring category (56%). No dead ends; the restraint shown in run 1 (not tuning against noise) continues to pay dividends. P5 from the prior run (stage-level metrics) is still unimplemented and carried forward — cheap and worth doing before further tuning.
Search Tuning Parameters
These are the primary knobs controlling retrieval, ranking, and authority weighting. Changing any of these affects scores without requiring a reindex.
Retrieval (Meilisearch)
| Parameter | Value | Env var | Effect |
|---|---|---|---|
| semanticRatio | 0.5 | — | Blend of BM25 keyword (0) vs. 512-dim semantic (1). 0.5 = equal weight. |
| overFetch multiplier | 5× | AUTHORITY_RERANK_MULTIPLIER | Fetch this many × limit before reranking. With authority as tiebreaker (not override), 5× keeps reranking within the genuinely relevant result set. |
| TOP_K | 10 | — | Results returned to caller. Battery tests pass/fail within this window. |
Authority Scoring
score = relevance × olMultiplier × (1 + boost × (authority − 5) / 5)
| Parameter | Value | Env var | Effect |
|---|---|---|---|
| authorityBoost | 0.2 | AUTHORITY_BOOST | Tiebreaker weight. auth=10 gets +20% vs auth=5; auth=1 gets −16%. A poor match from a canonical source cannot beat a strong match from a secondary source. |
| olSourceMultiplier | 1.2× | OL_SOURCE_MULTIPLIER | OceanLibrary hits get a 1.2× nudge to compensate for archaic language scoring lower in BM25. Not enough to override a significantly better non-OL match. |
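The scoring formula and the table values can be checked with a direct transcription. The function name and argument shape are assumptions for illustration, not the code in `api/lib/authority.js`.

```javascript
// score = relevance × olMultiplier × (1 + boost × (authority − 5) / 5)
// Defaults mirror the table: authorityBoost 0.2, no OceanLibrary multiplier.
function authorityScore(relevance, authority, { boost = 0.2, olMultiplier = 1.0 } = {}) {
  return relevance * olMultiplier * (1 + (boost * (authority - 5)) / 5);
}

// auth=10 scales by 1.2 (+20%), auth=5 is neutral, auth=1 scales by 0.84 (−16%).
console.log(authorityScore(1, 10), authorityScore(1, 5), authorityScore(1, 1));

// A weak match (0.5) from an auth=10 OceanLibrary source, even with the 1.2×
// multiplier, still scores below a strong match (0.9) from an auth=5 source.
console.log(
  authorityScore(0.5, 10, { olMultiplier: 1.2 }),
  authorityScore(0.9, 5)
);
```

With these defaults, the most the two multipliers can add together is a factor of 1.2 × 1.2 = 1.44, so authority reorders close matches but cannot rescue a candidate whose relevance is much lower.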
Multi-Index RRF Fusion
RRF score = Σ weight / (K + rank) across indexes
| Index | Weight | Description |
|---|---|---|
| paragraphs (main) | 1.0 | Primary text+context hybrid search. Base signal. |
| HyPE questions | 1.5 | Hypothetical question match — highest signal when queries are question-like. Boosts relevant doc sections above pure keyword rank. |
| entity mentions | 1.0 | Named-entity sidecar. Active when resolved entity IDs are present (entity-aware search). |
| RRF K constant | 60 | Rank fusion smoothing. Higher = flatter curve, less winner-take-all. |
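The weighted RRF formula can be sketched as follows, using the weights and K from the table. The index keys (`paragraphs`, `hype`, `entities`) and the input shape, one best-first array of document IDs per index, are hypothetical names chosen for the example.

```javascript
const RRF_K = 60;
const INDEX_WEIGHTS = { paragraphs: 1.0, hype: 1.5, entities: 1.0 };

// rankedLists: { indexName: [docId, ...] } with each array ordered best-first.
// Each appearance adds weight / (K + rank) to the document's fused score.
function fuse(rankedLists, weights = INDEX_WEIGHTS, k = RRF_K) {
  const scores = new Map();
  for (const [index, ids] of Object.entries(rankedLists)) {
    const weight = weights[index] ?? 0; // unknown indexes contribute nothing
    ids.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

const fused = fuse({
  paragraphs: ["p1", "p2", "p3"],
  hype: ["p2", "p4"],
  entities: [], // no resolved entity IDs for this query
});
console.log(fused.map((r) => r.id));
```

In this example `p2` ends up first because it appears in both lists, and `p4` overtakes `p1` because a rank-2 HyPE hit at weight 1.5 outscores a rank-1 paragraph hit at weight 1.0.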
Authority Tiers
Scores 0–10 set per document in api/lib/authority.js. Same author in different container types → different authority (e.g. primary scripture vs. extracted quote in secondary source).
| Score | Source type | Examples |
|---|---|---|
| 10 | Primary scripture | Quran, Bible, Aqdas, Íqán, Hidden Words, Seven Valleys, Bhagavad Gita, Analects |
| 8–9 | Core canonical writings | Shoghi Effendi letters, ʿAbdu'l-Bahá talks & tablets, authenticated compilations |
| 5–7 | Authoritative secondary | UHJ letters, approved histories, classic commentaries |
| 1–4 | Supplemental / general | Modern scholarship, devotional works, encyclopedias |
| −0.5 | Extracted quote penalty | Bahá'u'lláh quoted in Shoghi Effendi = 9.5 (not 10), so the canonical source wins when indexed |
Running the Battery
# Unified suite (504 fixtures)
node tests/quality/score-search.mjs --write-report
# With AI analysis
node tests/quality/score-search.mjs --write-report --analyze
# Single category
node tests/quality/score-search.mjs --category=phrase-match
Writes tests/quality/results-latest.json. Commit the file to update this page.
The --analyze flag calls claude-opus-4-7 to generate the improvement strategy.
See the scoring model guide for details on how results are evaluated.