Translation Pipeline
How SifterSearch turns Arabic and Persian manuscripts into scholar-quality, phrase-aligned English translations — and makes both originals and translations searchable together
Figure 1. The SifterSearch translation pipeline. Each document passes through OCR cleanup, three-pass AI segmentation, JAFAR root-word analysis, and phrase-level translation — producing both a side-by-side display and a cross-language search index.
The Challenge
SifterSearch — the research engine behind OceanLibrary.com — indexes more than 46,552 texts across 12 faith traditions. Among them are hundreds of Arabic and Persian documents: classical Islamic jurisprudence, Quranic commentary, Sufi treatises, early Baha'i tablets, and handwritten manuscripts that have never been widely available to English-speaking readers.
These texts are inaccessible in multiple overlapping ways. Some arrive as poorly OCR'd PDFs with garbled characters and missing diacritical marks. Others are handwritten manuscripts photographed rather than typed. Even those with clean text pose a structural problem: Arabic and Persian lack the word-spacing conventions that make sentence segmentation trivial in European languages. And the translations that do exist are often inconsistent in terminology, out of date, or simply absent for the texts that matter most.
The translation pipeline described here addresses each of these problems in sequence: clean the text, segment it intelligently, analyze it at the root-word level, and translate with terminology consistency that standard AI translation does not provide.
OCR and Document Cleanup
Before any linguistic processing can begin, the document must be readable. The ingestion pipeline uses Tesseract for initial OCR extraction, followed by an AI correction pass that catches the character errors Tesseract introduces when processing Arabic script — misread letters, fused words, dropped diacritical marks.
Diacritical marks are a particular concern for Baha'i texts, which use a specialized transliteration system for proper names. Rendering errors turn Bahá'u'lláh into Baha'u'llah and 'Abdu'l-Bahá into Abdu'l-Baha — destroying the precision of the transliteration and making names harder to match across documents. The cleanup pass restores these marks using document-specific glossaries and pattern matching.
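The glossary-based part of this restoration can be sketched as follows. This is a minimal illustration, not SifterSearch's actual code: the glossary contents and the function names (`strip_marks`, `build_pattern`, `restore_diacritics`) are assumptions, and a production glossary would be document-specific as described above.

```python
import re
import unicodedata

# Hypothetical glossary of canonical transliterations (illustrative only).
GLOSSARY = ["Bahá'u'lláh", "'Abdu'l-Bahá"]


def strip_marks(text: str) -> str:
    """Remove combining accents so that 'Bahá' and 'Baha' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_pattern(canonical: str) -> re.Pattern:
    """Match the accent-stripped name, with or without a leading apostrophe."""
    bare = strip_marks(canonical).lstrip("'")
    return re.compile(r"(?<![\w'])'?" + re.escape(bare) + r"(?![\w'])")


def restore_diacritics(text: str, glossary=GLOSSARY) -> str:
    """Replace each degraded name with its canonical glossary form."""
    for canonical in glossary:
        text = build_pattern(canonical).sub(canonical, text)
    return text


print(restore_diacritics("Baha'u'llah wrote to Abdu'l-Baha."))
```

Names that already carry their marks are left untouched, because the pattern only matches the accent-stripped spelling.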
Structural analysis runs alongside character correction: the system identifies headings, paragraph breaks, page numbers, footnotes, and running headers. The output is clean Markdown with the document structure preserved — section hierarchy intact, footnotes linked rather than interspersed with body text. This structured Markdown is what enters the segmentation step.
Arabic and Persian Segmentation
European-language text segmentation is largely a solved problem. Sentence boundaries are marked by punctuation; paragraph boundaries by blank lines or indentation. Arabic and Persian texts — especially classical manuscripts — often arrive as continuous prose blocks with minimal punctuation, no blank lines, and verse and sentence boundaries implied rather than marked.
The segmenter detects this condition by measuring punctuation density: if a block of text lacks sentence-ending marks at the frequency expected for its length, it triggers the AI segmentation path. Language detection runs first, distinguishing Arabic from Persian by the presence of Farsi-specific characters (پ چ ژ گ ی) — necessary because the two languages share most of their script but have different grammatical structures.
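A rough sketch of both checks, under stated assumptions: the threshold of 0.5 sentence-ending marks per 100 characters is invented for illustration (the actual value is not given here), and the function names are hypothetical.

```python
import re

# Sentence-ending marks: Latin ones plus the Arabic question mark (U+061F)
# and the Arabic full stop (U+06D4).
SENTENCE_END = re.compile("[.!?\u061F\u06D4]")

# The letters listed above: peh, tcheh, jeh, gaf, and Farsi yeh.
PERSIAN_ONLY = set("\u067E\u0686\u0698\u06AF\u06CC")


def needs_ai_segmentation(block: str, min_marks_per_100_chars: float = 0.5) -> bool:
    """True when sentence-ending marks are rarer than the assumed threshold."""
    if not block.strip():
        return False
    marks = len(SENTENCE_END.findall(block))
    return marks / len(block) * 100 < min_marks_per_100_chars


def detect_language(block: str) -> str:
    """Return 'fa' if any Persian-specific letter occurs, otherwise 'ar'."""
    return "fa" if any(ch in PERSIAN_ONLY for ch in block) else "ar"
```

A single-character test like this is deliberately cheap; it only has to choose between two candidate languages, not identify a language from scratch.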
The AI segmentation path works in three passes, from the smallest units upward:
- Phrases. The model identifies natural phrase breaks within continuous text, the smallest meaningful units. Compound verb phrases, prepositional phrases, and relative clauses become distinct segments.
- Sentences. Phrases are grouped into complete sentences. The model understands Arabic grammar well enough to recognize where a sentence ends even without a terminal punctuation mark.
- Paragraphs. Sentences are grouped into thematic paragraphs, the unit that becomes independently retrievable in the search index. Each paragraph is small enough to be precise, large enough to carry meaning.
Figure 2. Three-pass segmentation for Arabic and Persian text. The model (Qwen 72B) suggests break positions only — it never generates or modifies text. Text integrity is verified after every pass.
An important constraint governs the entire segmentation process: the AI model suggests break positions, but never rewrites. Every character in the output must appear in the input. This is verified programmatically — if the reconstructed text does not match the original character-for-character, the segmentation is rejected and retried. Scholarly texts cannot be paraphrased during processing; the original must survive intact.
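The reject-and-retry rule can be sketched in a few lines. Here `propose_segments` is a hypothetical callable standing in for the model call, and the retry limit of three is an assumption.

```python
from typing import Callable


def verify_integrity(original: str, segments: list[str]) -> bool:
    """Character-for-character check: the segments must rebuild the input."""
    return "".join(segments) == original


def segment_with_retry(
    text: str,
    propose_segments: Callable[[str], list[str]],
    max_attempts: int = 3,
) -> list[str]:
    """Accept a proposed segmentation only if it preserves the text exactly."""
    for _ in range(max_attempts):
        segments = propose_segments(text)
        if verify_integrity(text, segments):
            return segments
    raise ValueError("no proposed segmentation preserved the original text")
```

Because the check is a plain string comparison, any model output that drops, adds, or alters even one character is rejected, whatever the cause.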
The segmented paragraphs become the fundamental unit of indexing. Each paragraph receives its own embedding, its own search index entry, and — after the translation step — its own phrase-level translation. Segmentation quality directly determines the precision of everything that follows.
Translation Strategy with CTAI.info
Standard AI translation of Arabic and Persian religious texts has a well-known failure mode: terminological inconsistency. The Arabic word tawakkul might be translated as "reliance," "trust," "surrender," or "dependence" depending on which passage it appears in — and which of those renderings is correct depends on context that casual translation does not capture. In a library of hundreds of texts, inconsistent term translation makes cross-text research unreliable.
SifterSearch's translation strategy addresses this through integration with CTAI.info (Comprehensive Transliteration and Analysis of Islamic texts), which provides JAFAR reports — JSON Analysis of Foreign Arabic Roots — for Arabic and Persian text.
A JAFAR report breaks each word in an Arabic or Persian passage into its triliteral root (the three-letter base from which the word derives), its morphological form (verb, noun, participle, and the grammatical pattern applied), and its semantic field (the cluster of related meanings the root carries). This root-level analysis is what makes consistent, context-aware translation possible: the same root translated the same way, every time, across every document in the library.
The root-word analysis that JAFAR provides enables three translation capabilities that general-purpose AI translation does not offer:
- Terminological consistency. When the same Arabic root appears in different documents across the library, it receives the same English rendering — unless context provides clear reason to diverge, in which case the divergence is documented. A researcher tracing a concept across texts can trust that the same English term reflects the same Arabic original.
- Disambiguation by root. Arabic words that look similar on the surface often derive from different triliteral roots with distinct meaning fields. Root analysis catches these distinctions where transliteration alone would not.
- Scholar-quality nuance. Many technical terms in Islamic philosophy and Sufi literature carry layered meanings that casual translation collapses. The semantic field annotation in JAFAR output preserves this layering so the translation can capture precision that the reader — even without Arabic — can rely on.
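The first capability, a default rendering with documented divergences, might be modeled as below. The real JAFAR schema is not shown in this document, so the `RootEntry` fields and the `RenderingRegistry` class are assumptions made for illustration. The registry is keyed on root plus morphological form, also an assumption, so that different derivations of one root can carry different renderings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RootEntry:
    """Hypothetical shape of one analyzed word; not the real JAFAR schema."""
    surface: str           # transliterated word as it appears in the text
    root: str              # triliteral root, e.g. "w-k-l"
    form: str              # morphological form
    semantic_field: tuple  # related meanings carried by the root


class RenderingRegistry:
    """Keeps one default English rendering per (root, form) pair."""

    def __init__(self) -> None:
        self._defaults: dict = {}
        self.divergences: list = []

    def render(self, entry: RootEntry, proposed: str,
               reason: Optional[str] = None) -> str:
        key = (entry.root, entry.form)
        default = self._defaults.setdefault(key, proposed)
        if proposed != default and reason:
            # A divergence is allowed only with a recorded justification.
            self.divergences.append((entry.root, proposed, reason))
            return proposed
        return default


registry = RenderingRegistry()
tawakkul = RootEntry("tawakkul", "w-k-l", "Form V verbal noun",
                     ("trust", "reliance", "entrusting"))
print(registry.render(tawakkul, "reliance on God"))
print(registry.render(tawakkul, "trust"))
```

The second call returns the established default rather than the new proposal, because no reason for diverging was supplied.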
Figure 3. A JAFAR analysis of tawakkul. The triliteral root, morphological form, and semantic field together determine the consistent English rendering used across every document in the library.
Side-by-Side Display
Translated text is stored at the phrase level in the translation_segments column of the content table, with each phrase segment paired with its Arabic or Persian original. The frontend renders these in alignment: the left column shows the original script (right-to-left), the right column shows the English translation, synchronized phrase by phrase.
This phrase-level alignment serves a purpose beyond readability. A researcher who wants to verify a translation can click any English phrase and see exactly which Arabic original it corresponds to. A scholar who reads Arabic can spot a translation choice they disagree with and immediately identify the source phrase. The alignment makes the translation accountable in a way that paragraph-level or document-level translation is not.
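One way to picture the stored alignment is as an ordered list of original/translation pairs. The JSON layout below is an assumption, since the actual format of the translation_segments column is not specified here, and the helper names are hypothetical.

```python
import json

# Assumed layout of one translation_segments value: a JSON list of
# {"original", "translation"} pairs in reading order (illustrative only).
SAMPLE = json.dumps(
    [
        {"original": "بسم الله", "translation": "In the name of God"},
        {"original": "الرحمن الرحيم",
         "translation": "the Most Gracious, the Most Merciful"},
    ],
    ensure_ascii=False,
)


def aligned_rows(raw: str) -> list[tuple[str, str]]:
    """Return (original, translation) pairs for side-by-side rendering."""
    return [(seg["original"], seg["translation"]) for seg in json.loads(raw)]


def source_for(raw: str, phrase_index: int) -> str:
    """Look up the original phrase behind a clicked English phrase."""
    return aligned_rows(raw)[phrase_index][0]


for original, translation in aligned_rows(SAMPLE):
    print(f"{translation}  <-  {original}")
```

Storing pairs rather than two parallel texts means the click-through lookup is a simple index into the list.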
Figure 4. Phrase-level side-by-side display. Each Arabic phrase aligns with its English translation. Click-to-reveal JAFAR root analysis is coming soon.
Cross-Language Search Integration
Once a document has been segmented and translated, both its original-language passages and their English translations are embedded in the same 3,072-dimensional semantic space as every other document in the library. This is where the translation pipeline connects to the broader research strategy.
Because the embedding model operates on meaning rather than words, the Arabic phrase and its English translation occupy nearby positions in the vector space. A researcher searching in English surfaces not only English passages but the Arabic originals — and vice versa. The language boundary effectively disappears at the search layer.
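The retrieval step can be illustrated with cosine similarity over a shared index. The three-dimensional vectors below are toy values invented for the example (real embeddings have 3,072 dimensions), and the passage IDs and `search` function are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy index: (passage id, language, embedding). Values are made up.
INDEX = [
    ("p1", "ar", [0.90, 0.10, 0.00]),     # Arabic original
    ("p1-en", "en", [0.88, 0.12, 0.02]),  # its English translation
    ("p2", "en", [0.00, 0.20, 0.95]),     # unrelated English passage
]


def search(query_vec: list[float], index=INDEX, top_k: int = 2):
    """Rank passages by similarity to the query, ignoring language."""
    scored = [(cosine(query_vec, vec), pid, lang) for pid, lang, vec in index]
    return sorted(scored, reverse=True)[:top_k]


for score, pid, lang in search([1.0, 0.0, 0.0]):
    print(f"{pid} ({lang}): {score:.3f}")
```

Nothing in the ranking looks at the language field, which is why an original and its translation surface together when their vectors are close.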
Search "reliance on God" in English and surface Arabic passages about تَوَكُّل (tawakkul) — because the English translation in the index uses that precise term consistently. Search using Arabic terminology and surface the English Baha'i texts that discuss the same concept. HyPE questions for Arabic passages are generated in English as well, creating an additional vocabulary bridge at the keyword layer.
Terminological consistency is what makes this reliable, and it depends on JAFAR analysis. Because tawakkul is rendered as "reliance on God" throughout the library, an English search for that phrase retrieves the Arabic texts in which the root appears, not only those where a translation model happened to choose those words on a given run. The consistency is load-bearing for cross-language search quality.
Pipeline Status
OCR ingestion, document cleanup, and the three-pass segmentation system are in production, running locally on vLLM with prefix caching for cost efficiency. CTAI.info is a completed, working project, and its JAFAR API is live and available for root-word analysis. The infrastructure for storing and serving phrase-level translations (the translation and translation_segments columns, admin routes for triggering translation jobs, and the email notification system for completed jobs) is already in place.
Integration of the CTAI.info translation API into the automated pipeline is underway.
The side-by-side display and the click-to-JAFAR feature are UI work that will follow as translations are generated at scale. The cross-language semantic search described above works today for documents that are already indexed — the translation pipeline will extend that capability to hundreds of Arabic and Persian texts that currently return no results for English queries.
Last updated: March 2026