Search Quality

504 test fixtures across five categories. Scores update when the battery is run and the JSON committed. Fixture set adapted from David Hunt's ocean-search-testing framework.

How to read these scores

Pass Rate: % of tests where the expected document appeared in the top 10. For tests with an expected doc ID, that's the primary gate. For text-only tests, the result text must contain the expected phrases.
MRR: Mean Reciprocal Rank, the average of 1÷rank across all tests. 1.0 = everything found at position 1; 0.5 is the score when everything is found at position 2. Higher is better.
Text precision: % of results where the specific paragraph contained the expected phrases. Tracked separately — it's a paragraph-quality signal, not a pass/fail gate. Low text precision means we're finding the right book but returning the wrong paragraph.
Not found: The expected document didn't appear in the top 10 at all. These are the hardest failures and require index or ranking improvements.
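
The two headline metrics above can be computed from per-test ranks roughly as follows. This is a minimal sketch, not the battery's actual scoring code: the `rank` field, the use of `null` for a not-found test, and counting a not-found test as a reciprocal rank of 0 are all assumptions.

```javascript
// Hypothetical per-test result shape: rank is the 1-based position of the
// expected document, or null when it did not appear in the top 10.
const TOP_K = 10;

// Fraction of tests whose expected document landed inside the top-K window.
function passRate(results) {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.rank !== null && r.rank <= TOP_K).length;
  return passed / results.length;
}

// Mean Reciprocal Rank: average of 1/rank, with not-found counted as 0 (assumption).
function meanReciprocalRank(results) {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => sum + (r.rank === null ? 0 : 1 / r.rank), 0);
  return total / results.length;
}

const sample = [{ rank: 1 }, { rank: 2 }, { rank: 4 }, { rank: null }];
console.log(passRate(sample));           // 0.75
console.log(meanReciprocalRank(sample)); // (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```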

Search Quality 168 tests Jun 1, 2026, 04:11 PM

Summary cards: Pass Rate, MRR, Rank 1, Phrase Match, Performance

Score History

9 runs · 2026-05-31 → 2026-06-01

Date	Pass%	MRR	p50
Jun 1, 2026	79%	0.833	3.8s
Jun 1, 2026	81%	0.820	3.9s
Jun 1, 2026	75%	0.821	3.5s
Jun 1, 2026	71%	0.798	4.5s
May 31, 2026	56%	0.656	5.9s
May 31, 2026	50%	0.673	6.3s
May 31, 2026	54%	0.634	6.0s
May 31, 2026	56%	0.690	6.4s
May 31, 2026	2%	0.019	77ms

By Category

Phrase Match: 87% (146/168)

Rank Distribution

Buckets: Rank 1, Rank 2–3, Rank 4–10, Not found

Failure Breakdown

Failures (22): 13 not found in top 10, 9 timed out

Failing Tests

Query	Cat	Rank	Top hit	Reason
Adam Eve	Phrase Match	NF	Various — Douay-Rheims Bible	not found
Babylonian Exile	Phrase Match	NF	Various — Deuterocanonical Books of the Bible - Ap	not found
Brihadaranyaka Upanishad	Phrase Match	NF	Vyāsa — The Mahabharata 14	not found
Chochmah	Phrase Match	NF	Bhagat Kabir, Bhagat Farid, Bhagat Namdev, and others — Guru Granth Sahib - Bhagat Bani (Hymns o	not found
commandments of Jesus	Phrase Match	NF	Matthew — The Gospel of Matthew - ευαγγέλιο του Μα	not found
	Phrase Match	NF	—	timeout
	Phrase Match	NF	—	timeout
Daniel	Phrase Match	NF	Matthew — The Gospel of Matthew - ευαγγέλιο του Μα	not found
devotion	Phrase Match	NF	Confucius — The Great Learning	not found
Diaspora	Phrase Match	NF	Various Sikh Gurus — Guru Granth Sahib - Raag Asa	not found
dietary practices	Phrase Match	NF	Unknown — Sutra Collection (B)	not found
eternal life	Phrase Match	NF	Various — Douay-Rheims Bible	not found
ever-advancing	Phrase Match	NF	Bahá’u’lláh — Gleanings from the Writings of Bahá’u’ll	not found
	Phrase Match	NF	—	timeout
	Phrase Match	NF	—	timeout
	Phrase Match	NF	—	timeout
	Phrase Match	NF	—	timeout
	Phrase Match	NF	—	timeout
Isa	Phrase Match	NF	Bahá’u’lláh — Epistle to the Son of the Wolf	not found
	Phrase Match	NF	—	timeout
	Phrase Match	NF	—	timeout
Zhong	Phrase Match	NF	Laozi, tr. Chang Chung-yuan — Tao Te Ching (tr. Chang Chung-yuan)	not found

Change Log

Interventions applied to improve search quality. Updated manually when a change ships.

Date	Change	Result
2026-05-31	Initial search quality battery established (52 core + 491 ocean fixtures)	Core: 58%, Ocean: baseline pending (Meilisearch sync queue clearing)

Analysis & Improvement Strategy

Generated by claude-opus-4-7 on June 1, 2026. Re-run with --analyze flag to update.

Ocean (504 fixtures)

Pass rate has climbed to 71% (37/52) with MRR 0.798 and zero timeouts/5xx — the infra is healthy and authority-ranking (100%) and cross-tradition (100%) are solved. The dominant failure mode is now text-mismatch (10/15 failures): the right *kind* of document is retrieved at rank 1 but the wrong paragraph, plus 5 entity-aware queries that miss the top 10 entirely.

Strengths

Infrastructure fully stabilized: zero timeouts, zero HTTP errors, and zero network errors across all 52 fixtures. A p50 of 4.5s and a p95 of 9.7s are workable (p95 is down from ~25s in the prior run).
Authority and cross-tradition routing solid: authority-ranking is at 100% (4/4) and cross-tradition at 100% (5/5). The authority weighting and multi-tradition retrieval logic are correctly tuned and need no further work.
Correct document retrieval even on failures: almost every concept/phrase failure has a top_hit_author or top_hit_title from the correct tradition or author (e.g. Bahá'u'lláh for tablet-of-ahmad, Muhammad for islamic-hadith-patience). Retrieval is finding the right corpus region; only paragraph-level selection is wrong.

Critical Gaps

Paragraph-level reranking is the bottleneck: 10 of 15 failures are 'text mismatch' at rank 1, meaning hybrid search picks the correct book but the wrong passage within it. Authority weighting may even be hurting here by pulling toward high-authority paragraphs that don't contain the target tokens. Examples: tablet-of-ahmad-nightingale (rank 1, wrong tablet by same author); islamic-daily-prayers (rank 1, wrong sura); christian-beatitudes-meek (rank 1, Apocrypha instead of Matthew 5); hindu-bhagavad-gita-action (rank 1, Mahabharata 6 instead of Gita).
Entity-aware queries still miss top 10: 5 entity queries return rank=null. Top hits drift to generic Bahá'u'lláh works or Shoghi Effendi histories rather than passages naming the specific entity (Mullá Husayn, the Báb's declaration, Letters of the Living). Affected tests (all rank=null): entity-mulla-husayn-first-believer, entity-declaration-bab-1844, entity-letters-of-living, entity-bab-martyrdom, shoghi-effendi-progressive-revelation.
Canonical-text queries routed to wrong canonical book: famous phrases match the *theme* via semantic similarity but land in adjacent canonical books. Golden Rule and Beatitudes go to Apocrypha; Bhagavad Gita verses go to other Mahabharata books; Islamic ritual queries go to Quranic tafsir rather than the ritual-prescriptive verse. Examples: christian-golden-rule → Apocrypha; christian-beatitudes-meek → Apocrypha; hindu-bhagavad-gita-action → Mahabharata 6; islamic-zakat-charity → Sura II rather than the zakat-prescriptive verse.

Action Plan

#1 Add token-overlap paragraph reranker before authority weighting (high impact, medium effort)

10 of 15 failures are wrong-paragraph-in-right-book. Add a second-stage reranker that scores each candidate paragraph on query-token coverage (exact lemma hits, rare-term IDF, bigram overlap) and blend it with the current authority×match score. This directly attacks the dominant failure class without touching retrieval.

Implementation: In api/lib/authority.js (or new api/lib/paragraph-rerank.js), after hybridSearch returns the top ~50 candidates, compute per-paragraph: (a) fraction of query content tokens present, (b) sum of IDF for matched rare tokens, (c) tightest-window proximity score. Combine as final_score = 0.45*lexical_overlap + 0.30*semantic + 0.15*authority + 0.10*match_quality. Gate behind RERANK_V2=1 env var so it can be A/B'd.

Success: phrase-match 67%→85%+, concept-match 68%→80%+, overall 71%→82%+
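
The blend proposed in #1 could be prototyped along these lines. This is a simplified sketch under stated assumptions: only signal (a), query-token coverage, stands in for lexical_overlap (the IDF and proximity terms are omitted); the candidate fields `text`, `semantic`, `authority`, and `matchQuality` are hypothetical and assumed to be normalized to 0..1; the stopword list and sample paragraphs are placeholders.

```javascript
// Tiny placeholder stopword list; a real one would be tradition- and language-aware.
const STOPWORDS = new Set(["the", "of", "and", "a", "an", "in", "to", "is", "are", "upon"]);

// Lowercase, split on non-letter/non-digit characters, drop stopwords.
function contentTokens(text) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter((t) => t.length > 0 && !STOPWORDS.has(t));
}

// Signal (a): fraction of distinct query content tokens present in the paragraph.
function lexicalOverlap(query, paragraphText) {
  const queryTokens = [...new Set(contentTokens(query))];
  if (queryTokens.length === 0) return 0;
  const paragraphTokens = new Set(contentTokens(paragraphText));
  const hits = queryTokens.filter((t) => paragraphTokens.has(t)).length;
  return hits / queryTokens.length;
}

// Blend with the weights from the action plan and sort best-first.
function rerank(query, candidates) {
  return candidates
    .map((c) => ({
      ...c,
      finalScore:
        0.45 * lexicalOverlap(query, c.text) +
        0.30 * c.semantic +
        0.15 * c.authority +
        0.10 * c.matchQuality,
    }))
    .sort((a, b) => b.finalScore - a.finalScore);
}

const ranked = rerank("blessed are the meek", [
  { id: "p1", text: "placeholder paragraph about wisdom and patience", semantic: 0.9, authority: 1.0, matchQuality: 0.5 },
  { id: "p2", text: "Blessed are the meek: for they shall inherit the earth", semantic: 0.8, authority: 1.0, matchQuality: 0.5 },
]);
console.log(ranked.map((c) => c.id)); // p2 ranks first despite the lower semantic score
```

With these sample numbers, p1 scores 0.47 and p2 scores 0.89, so the paragraph containing the query tokens wins even though its semantic score is lower.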

#2 Build entity dictionary with canonical-passage anchors (high impact, medium effort)

All 5 rank=null failures are entity queries. A static entity table mapping named entities (Mullá Husayn, Letters of the Living, Báb, Declaration 1844, Tabriz martyrdom) to canonical passage IDs in God Passes By, Dawn-Breakers, etc., gives a deterministic retrieval boost when the query mentions the entity.

Implementation: Create api/data/entities.json with entries like {name, aliases[], canonical_paragraph_ids[], boost}. In api/lib/search.js hybridSearch, after query normalization, detect entity mentions via case-insensitive alias match and either (a) inject canonical_paragraph_ids into the candidate pool with a +1.5 score bump, or (b) add an OR clause to the Meilisearch query targeting those IDs. Start with ~30 high-value Bahá'í entities covered by the failing tests.

Success: entity-aware 56%→90%+, overall 71%→80%+
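
Option (a) from #2 might look something like the sketch below. The entries follow the proposed {name, aliases[], canonical_paragraph_ids[], boost} shape, but the paragraph IDs are placeholders rather than real index IDs, and the candidate shape `{id, score}` and both helper names are hypothetical.

```javascript
// Placeholder entity table; in the proposal this would live in api/data/entities.json.
const ENTITIES = [
  { name: "Mullá Husayn", aliases: ["mulla husayn", "mullá husayn"], canonical_paragraph_ids: ["para-0001"], boost: 1.5 },
  { name: "Letters of the Living", aliases: ["letters of the living"], canonical_paragraph_ids: ["para-0002", "para-0003"], boost: 1.5 },
];

// Case-insensitive alias match against the (already normalized) query string.
function detectEntities(query, entities) {
  const normalized = query.normalize("NFC").toLowerCase();
  return entities.filter((e) =>
    e.aliases.some((alias) => normalized.includes(alias.normalize("NFC").toLowerCase()))
  );
}

// Inject canonical paragraphs into the candidate pool, bumping any already present.
function injectCanonical(candidates, matchedEntities) {
  const byId = new Map(candidates.map((c) => [c.id, { ...c }]));
  for (const entity of matchedEntities) {
    for (const id of entity.canonical_paragraph_ids) {
      const existing = byId.get(id) ?? { id, score: 0 };
      existing.score += entity.boost;
      byId.set(id, existing);
    }
  }
  return [...byId.values()].sort((a, b) => b.score - a.score);
}

const pool = injectCanonical(
  [{ id: "para-0900", score: 0.8 }],
  detectEntities("Who was Mulla Husayn?", ENTITIES)
);
console.log(pool.map((c) => c.id)); // the injected canonical paragraph now outranks para-0900
```

Substring matching is the simplest thing that works for a ~30-entry table; short aliases would need word-boundary checks to avoid false positives.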

#3 Boost BM25/phrase weight for queries with rare canonical tokens (medium impact, low effort)

Queries like 'nightingale of paradise singeth upon the twigs' and 'Blessed are the meek' contain rare lexical signatures that BM25 should pin immediately but semantic embeddings dilute. Detect high-IDF query tokens and shift the hybrid weight toward keyword for those queries.

Implementation: In api/lib/search.js hybridSearch, compute mean IDF of query content tokens against the corpus stats. If mean IDF > threshold (or any token IDF > X), set keyword_weight from default (e.g. 0.5) to 0.75 and enable Meilisearch phrase matching with quotes around top-2 rarest bigrams. Also enable matchingStrategy='all' for these queries.

Success: phrase-match 67%→80%+, especially tablet-of-ahmad-nightingale, christian-beatitudes-meek, hindu-bhagavad-gita-action
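
The weight switch in #3 can be sketched as below. Everything numeric here is an assumption: the corpus statistics are invented for illustration, the IDF variant is ln(N / (1 + df)), and the threshold of 4.0 is a placeholder for the unspecified threshold in the plan. The conversion to semanticRatio assumes keyword_weight is simply 1 minus semanticRatio, per the 0 = keyword, 1 = semantic convention in the tuning table.

```javascript
// Invented corpus statistics: paragraph count and per-token document frequency.
const CORPUS = {
  totalDocs: 100000,
  docFreq: { nightingale: 40, paradise: 3000, twigs: 12, love: 45000, god: 60000 },
};

// Inverse document frequency; unseen tokens get df = 0 and therefore the highest IDF.
function idf(token, corpus) {
  const df = corpus.docFreq[token] ?? 0;
  return Math.log(corpus.totalDocs / (1 + df));
}

// Shift toward keyword matching when the query's mean IDF exceeds the threshold.
function keywordWeightFor(queryTokens, corpus, { threshold = 4.0, base = 0.5, boosted = 0.75 } = {}) {
  if (queryTokens.length === 0) return base;
  const meanIdf = queryTokens.reduce((sum, t) => sum + idf(t, corpus), 0) / queryTokens.length;
  return meanIdf > threshold ? boosted : base;
}

// Assumed mapping onto the hybrid-search blend parameter.
function semanticRatioFor(keywordWeight) {
  return 1 - keywordWeight;
}

const rare = keywordWeightFor(["nightingale", "paradise", "twigs"], CORPUS);
const common = keywordWeightFor(["love", "god"], CORPUS);
console.log(rare, semanticRatioFor(rare));     // 0.75 0.25
console.log(common, semanticRatioFor(common)); // 0.5 0.5
```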

#4 Penalize Apocrypha/secondary-canon when query matches primary canon (medium impact, low effort)

Two Christian queries (Golden Rule, Beatitudes) and several Islamic/Hindu queries land in deuterocanonical or commentary works rather than the primary scripture. Add a soft penalty when a query semantically matches a known primary-canon passage but the top hit is from a secondary/commentary source.

Implementation: Tag documents with canonical_tier (gospel/quran/gita = 0, apocrypha/tafsir/commentary = 1, history = 2). In paragraph reranker, apply -0.15 score adjustment per tier-step when query content suggests primary scripture (detected via tradition-specific keyword lists: 'Blessed are', 'do unto', 'Krishna said', 'salat', 'zakat').

Success: christian-golden-rule, christian-beatitudes-meek, hindu-bhagavad-gita-action, islamic-zakat-charity all pass
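
A minimal sketch of the tier penalty in #4, assuming each candidate already carries a `canonical_tier` (0, 1, or 2) and a `score` field; both the candidate shape and the helper names are hypothetical. The cue list is the one given in the plan, and the sample scores are invented.

```javascript
const TIER_PENALTY = 0.15; // score adjustment per tier step, from the plan
const PRIMARY_CUES = ["blessed are", "do unto", "krishna said", "salat", "zakat"];

// Crude detector: does the query contain any primary-scripture cue phrase?
function suggestsPrimaryScripture(query) {
  const q = query.toLowerCase();
  return PRIMARY_CUES.some((cue) => q.includes(cue));
}

// Subtract 0.15 per tier step, but only when the query looks like primary scripture.
function applyTierPenalty(query, candidates) {
  if (!suggestsPrimaryScripture(query)) return candidates;
  return candidates
    .map((c) => ({ ...c, score: c.score - TIER_PENALTY * c.canonical_tier }))
    .sort((a, b) => b.score - a.score);
}

const candidates = [
  { id: "apocrypha-para", score: 0.80, canonical_tier: 1 },
  { id: "matthew-para", score: 0.75, canonical_tier: 0 },
];
const adjusted = applyTierPenalty("Blessed are the meek", candidates);
console.log(adjusted.map((c) => c.id)); // matthew-para moves ahead of apocrypha-para
```

Queries without a cue phrase pass through untouched, which keeps the penalty from demoting commentary when commentary is what the user asked for.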

#5 Add eval-mode bypass and per-stage timing metrics (low impact, low effort)

Latency is now acceptable, but the 9.7s p95 still leaves room for improvement. Carrying the prior recommendation forward: expose per-stage timings (intent, research, craft, rerank) in response headers and add ?mode=raw to skip jafar-pipeline entirely. This makes future tuning measurable and unblocks per-stage budgets.

Implementation: In api/lib/jafar-pipeline.js, wrap each stage with performance.now() and emit X-Stage-Timings header. Add mode=raw query param that calls hybridSearch directly and returns without intent/craft. Wire test runner to record these timings per fixture.

Success: p95 latency visibility; potential 30-40% latency reduction in eval runs
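
The per-stage timing described in #5 could be wrapped roughly as follows. This is a sketch only: `timeStage`, `formatStageTimings`, and `runPipeline` are hypothetical names, the stage bodies are stand-ins for the real ones in jafar-pipeline.js, and the `name=12.3ms` header format is an assumption. It relies on the global `performance` object available in current Node versions.

```javascript
// Run one pipeline stage and record its wall-clock duration in milliseconds,
// even if the stage throws.
async function timeStage(timings, name, fn) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    timings[name] = performance.now() - start;
  }
}

// Serialize timings for an X-Stage-Timings header (format is an assumption).
function formatStageTimings(timings) {
  return Object.entries(timings)
    .map(([name, ms]) => `${name}=${ms.toFixed(1)}ms`)
    .join(", ");
}

// Stand-in pipeline: each stage is a placeholder for the real implementation.
async function runPipeline(query) {
  const timings = {};
  const intent = await timeStage(timings, "intent", async () => ({ query }));
  const results = await timeStage(timings, "research", async () => [intent.query]);
  return { results, timings, header: formatStageTimings(timings) };
}

runPipeline("eternal life").then(({ header }) => console.log(header));
```

A test runner can read the same `timings` object per fixture instead of parsing the header back.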

Notes on Previous Attempts

The previous plan was on target: P1 (bypass pipeline / timeouts) eliminated the timeout class entirely (0 timeouts now vs 6 before), and overall pass rate moved 56%→71%. P3 (paragraph-level reranking) was correctly identified as the next big lever and is now the #1 priority — the failure pattern has crystallized around 'right book, wrong paragraph,' confirming that diagnosis. P4 (entity index) also remains valid and is reprioritized to #2 since entity-aware is the lowest-scoring category (56%). No dead ends; the restraint shown in run 1 (not tuning against noise) continues to pay dividends. P5 from the prior run (stage-level metrics) is still unimplemented and carried forward — cheap and worth doing before further tuning.

Search Tuning Parameters

These are the primary knobs controlling retrieval, ranking, and authority weighting. Changing any of these affects scores without requiring a reindex.

Retrieval (Meilisearch)

Parameter	Value	Env var	Effect
`semanticRatio`	0.5	—	Blend of BM25 keyword (0) vs. 512-dim semantic (1). 0.5 = equal weight.
`overFetch multiplier`	5×	`AUTHORITY_RERANK_MULTIPLIER`	Fetch this many × limit before reranking. With authority as tiebreaker (not override), 5× keeps reranking within the genuinely relevant result set.
`TOP_K`	10	—	Results returned to caller. Battery tests pass/fail within this window.

Authority Scoring

score = relevance × olMultiplier × (1 + boost × (authority − 5) / 5)

Parameter	Value	Env var	Effect
`authorityBoost`	0.2	`AUTHORITY_BOOST`	Tiebreaker weight. auth=10 gets +20% vs auth=5; auth=1 gets −16%. A poor match from a canonical source cannot beat a strong match from a secondary source.
`olSourceMultiplier`	1.2×	`OL_SOURCE_MULTIPLIER`	OceanLibrary hits get a 1.2× nudge to compensate for archaic language scoring lower in BM25. Not enough to override a significantly better non-OL match.
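
The authority formula and the percentages in the table can be checked with a few lines. This is an illustrative sketch rather than the code in api/lib/authority.js; the function name and option names are hypothetical, and it assumes olMultiplier is 1 for non-OceanLibrary sources.

```javascript
// score = relevance × olMultiplier × (1 + boost × (authority − 5) / 5)
function authorityScore(relevance, authority, { boost = 0.2, isOceanLibrary = false, olMultiplier = 1.2 } = {}) {
  const ol = isOceanLibrary ? olMultiplier : 1; // assumed neutral for non-OL sources
  return relevance * ol * (1 + (boost * (authority - 5)) / 5);
}

console.log(authorityScore(1, 10)); // ≈ 1.20, i.e. +20% vs. authority 5
console.log(authorityScore(1, 5));  // 1, the neutral point
console.log(authorityScore(1, 1));  // ≈ 0.84, i.e. −16%

// A weak match from a canonical source vs. a strong match from a secondary one:
const weakCanonical = authorityScore(0.5, 10);  // ≈ 0.60
const strongSecondary = authorityScore(0.9, 5); // 0.90
console.log(weakCanonical < strongSecondary);   // true
```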

Multi-Index RRF Fusion

RRF score = Σ weight / (K + rank) across indexes

Index	Weight	Description
paragraphs (main)	1.0	Primary text+context hybrid search. Base signal.
HyPE questions	1.5	Hypothetical question match — highest signal when queries are question-like. Boosts relevant doc sections above pure keyword rank.
entity mentions	1.0	Named-entity sidecar. Active when resolved entity IDs are present (entity-aware search).
RRF K constant	60	Rank fusion smoothing. Higher = flatter curve, less winner-take-all.
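
Weighted reciprocal rank fusion with the weights and K from the table can be sketched as follows. The index keys (`paragraphs`, `hype`, `entities`) and the function name are hypothetical labels for the three indexes; ranks are assumed to start at 1, and the document IDs are placeholders.

```javascript
const RRF_K = 60;
const INDEX_WEIGHTS = { paragraphs: 1.0, hype: 1.5, entities: 1.0 };

// rankedLists maps an index name to doc IDs ordered best-first.
// Each appearance contributes weight / (K + rank) to that document's fused score.
function rrfFuse(rankedLists, weights = INDEX_WEIGHTS, k = RRF_K) {
  const scores = new Map();
  for (const [index, ids] of Object.entries(rankedLists)) {
    const weight = weights[index] ?? 1.0;
    ids.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

const fused = rrfFuse({
  paragraphs: ["a", "b", "c"],
  hype: ["b", "a"],
});
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'c' ]
```

Here "a" leads the main index, but "b" leads the HyPE index, and the 1.5 weight is enough to put "b" on top (1/62 + 1.5/61 versus 1/61 + 1.5/62).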

Authority Tiers

Scores 0–10 set per document in api/lib/authority.js. Same author in different container types → different authority (e.g. primary scripture vs. extracted quote in secondary source).

Score	Source type	Examples
10	Primary scripture	Quran, Bible, Aqdas, Íqán, Hidden Words, Seven Valleys, Bhagavad Gita, Analects
8–9	Core canonical writings	Shoghi Effendi letters, ʿAbdu'l-Bahá talks & tablets, authenticated compilations
5–7	Authoritative secondary	UHJ letters, approved histories, classic commentaries
1–4	Supplemental / general	Modern scholarship, devotional works, encyclopedias
−0.5	Extracted quote penalty	Bahá'u'lláh quoted in Shoghi Effendi = 9.5 (not 10); the canonical source wins when indexed

Running the Battery

# Unified suite (504 fixtures)
node tests/quality/score-search.mjs --write-report

# With AI analysis
node tests/quality/score-search.mjs --write-report --analyze

# Single category
node tests/quality/score-search.mjs --category=phrase-match

Writes tests/quality/results-latest.json. Commit the file to update this page. The --analyze flag calls claude-opus-4-7 to generate the improvement strategy. See the scoring model guide for details on how results are evaluated.