Transcriber Agent

Role: Audio/Video to Markdown Transcription Specialist
File: api/agents/agent-transcriber.js

Overview

The Transcriber agent converts audio and video content into searchable Markdown documents using OpenAI’s Whisper model. It accepts direct file uploads and URL downloads (including YouTube and Vimeo) and produces clean, well-formatted transcripts suitable for indexing in the library.

Core Capabilities

1. Multi-Source Input

Direct audio/video file uploads (MP3, MP4, WAV, FLAC, etc.)
YouTube and Vimeo URL downloads via yt-dlp
Direct URL downloads for other audio sources
Podcast RSS feed URLs (future)

2. Whisper Transcription

Development mode: OpenAI Whisper API (simpler setup)
Production mode: Local Whisper installation (faster, no API costs)
Supports multiple languages with auto-detection
Segment-level timestamps for navigation

3. Markdown Formatting

AI-powered transcript cleanup and formatting
Section headings based on topic shifts
Speaker detection and labeling
Proper punctuation and paragraph breaks
YAML frontmatter with metadata

4. Sacred Text Handling

Correct spelling of religious terms
Proper diacriticals (Bahá’u’lláh, ‘Abdu’l-Bahá)
Preservation of speaker’s voice and style
[inaudible] markers for unclear portions
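
The spelling corrections listed above are described as part of the AI formatting step. As a rough illustration only, a deterministic post-pass could catch the most common unaccented spellings. The `TERM_CORRECTIONS` table and `normalizeTerms` helper below are hypothetical names, not part of the agent, and the patterns cover just the two terms mentioned in this section.

```javascript
// Hypothetical correction table: plain-ASCII spellings that a speech model
// might emit, mapped to the diacritical forms named in this document.
const TERM_CORRECTIONS = [
  [/\bBaha'?u'?llah\b/gi, 'Bahá’u’lláh'],
  [/\bAbdu'?l[-\s]?Baha\b/gi, '‘Abdu’l-Bahá'],
];

// Apply every correction in order and return the corrected text.
function normalizeTerms(text) {
  return TERM_CORRECTIONS.reduce(
    (out, [pattern, replacement]) => out.replace(pattern, replacement),
    text,
  );
}
```

A pass like this would run on the raw transcript before or after AI formatting; text that contains none of the listed spellings is returned unchanged.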

Architecture

Audio/Video Source
       │
       ▼
┌─────────────────────┐
│  Download Media     │ ◄── YouTube, Vimeo, direct URLs
│  (yt-dlp/fetch)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Transcribe Audio   │ ◄── Local Whisper or OpenAI API
│  (whisper/API)      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Format Transcript  │ ◄── AI cleanup and structuring
│  (LLM)              │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Save Markdown      │ ──► data/transcriptions/
└─────────────────────┘
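
The four stages in the diagram can be sketched as a single async function. This is a minimal illustration, not the agent's actual code: `runPipeline` and the injected `stages` object (`download`, `transcribe`, `format`, `save`) are hypothetical names, chosen so the flow can be exercised with stubs instead of yt-dlp, Whisper, or an LLM.

```javascript
// Hypothetical pipeline runner mirroring the diagram above.
// Each stage is injected, so real tools can be swapped for test stubs.
async function runPipeline(source, stages, options = {}) {
  const { download, transcribe, format, save } = stages;

  // 1. Download Media: resolve the source to a local file plus metadata.
  const media = await download(source);

  // 2. Transcribe Audio: produce raw text and timestamped segments.
  const { text, segments } = await transcribe(media.path, options);

  // 3. Format Transcript: AI cleanup, skipped when options.format === false.
  const markdown =
    options.format === false ? text : await format(text, media.metadata);

  // 4. Save Markdown: persist the document and report where it went.
  const outputPath = await save(markdown, media.metadata);

  return {
    markdown,
    rawText: text,
    metadata: media.metadata,
    outputPath,
    segments,
  };
}
```

The returned object has the same shape as the result shown under Usage Examples, which keeps the sketch easy to compare against the real agent.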

Supported Formats

Format	Extension	Notes
MP3	.mp3	Most common audio format
MP4	.mp4	Video with audio extraction
WAV	.wav	High-quality uncompressed
FLAC	.flac	Lossless compression
WebM	.webm	Web video format
M4A	.m4a	Apple audio format
OGG	.ogg	Open source format
MPEG	.mpeg, .mpga	Various MPEG formats
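
A quick way to reject unsupported inputs early is to check the extension against the table above. The helper name `isSupportedMedia` is an assumption for illustration; the extension list is copied from the table.

```javascript
// Extensions taken from the Supported Formats table.
const SUPPORTED_EXTENSIONS = new Set([
  '.mp3', '.mp4', '.wav', '.flac', '.webm', '.m4a', '.ogg', '.mpeg', '.mpga',
]);

// Hypothetical helper: true when a file name or URL ends in a supported extension.
function isSupportedMedia(fileName) {
  // Drop any query string or fragment, then look at the text after the last dot.
  const clean = fileName.split(/[?#]/)[0];
  const dot = clean.lastIndexOf('.');
  if (dot === -1) return false;
  return SUPPORTED_EXTENSIONS.has(clean.slice(dot).toLowerCase());
}
```

Page URLs such as YouTube links carry no media extension, so a check like this would only apply to local files and direct media URLs.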

Usage Examples

Transcribe from URL

import { TranscriberAgent } from './api/agents/agent-transcriber.js';

const transcriber = new TranscriberAgent();

// Transcribe a YouTube video
const result = await transcriber.transcribe(
  'https://www.youtube.com/watch?v=VIDEO_ID',
  {
    language: 'en',
    whisperModel: 'medium',
    format: true  // AI formatting
  }
);

// Returns:
// {
//   markdown: "---\ntitle: ...\n---\n\n# Talk Title\n\n...",
//   rawText: "original whisper output...",
//   metadata: { title, duration, uploader, url },
//   outputPath: "data/transcriptions/Talk-Title.md",
//   segments: [{ start, end, text }, ...]
// }
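
The `segments` array in the result above can drive timestamp navigation. The sketch below assumes `start` is expressed in seconds, as Whisper reports it; `toTimestamp` and `segmentsToLines` are hypothetical helper names.

```javascript
// Convert a number of seconds to an HH:MM:SS string.
function toTimestamp(totalSeconds) {
  const s = Math.floor(totalSeconds);
  const pad = (n) => String(n).padStart(2, '0');
  return `${pad(Math.floor(s / 3600))}:${pad(Math.floor((s % 3600) / 60))}:${pad(s % 60)}`;
}

// Render each segment as "[HH:MM:SS] text" using its start time.
function segmentsToLines(segments) {
  return segments.map((seg) => `[${toTimestamp(seg.start)}] ${seg.text.trim()}`);
}
```

For example, `segmentsToLines(result.segments).join('\n')` would give a plain timestamped listing that can sit next to the formatted Markdown.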

Transcribe Local File

// Transcribe a local audio file
const result = await transcriber.transcribe(
  '/path/to/lecture.mp3',
  {
    language: 'en',
    format: true
  }
);

Skip AI Formatting

// Get raw transcript without AI cleanup
const result = await transcriber.transcribe(url, {
  format: false  // Skip AI formatting
});

Custom Output Path

const result = await transcriber.transcribe(url, {
  outputPath: './library/talks/my-lecture.md'
});
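
When no `outputPath` is given, the earlier example shows a result such as `data/transcriptions/Talk-Title.md`. One plausible way to derive that path from a title is sketched below. The function name `defaultOutputPath` and the exact slug rules are assumptions; the agent's real naming logic may differ.

```javascript
// Hypothetical default path builder: "Talk Title" -> "data/transcriptions/Talk-Title.md".
function defaultOutputPath(title, dir = 'data/transcriptions') {
  const slug = title
    .normalize('NFKD')          // split accented letters into base letter + mark
    .replace(/[^\w\s-]/g, '')   // drop combining marks and punctuation
    .trim()
    .replace(/\s+/g, '-');      // collapse whitespace runs into hyphens
  return `${dir}/${slug || 'untitled'}.md`;
}
```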

Output Format

Markdown with Frontmatter

---
title: "The Power of Prayer"
speaker: "Dr. John Smith"
date: "2024-03-15"
source: "https://youtube.com/watch?v=..."
duration: "01:23:45"
---

# The Power of Prayer

## Introduction

Welcome everyone. Today we're going to explore the transformative
power of prayer across different spiritual traditions...

## Historical Context

The practice of prayer dates back to the earliest recorded
human civilizations...

## Practical Application

When we approach prayer with **sincerity** and **humility**,
we open ourselves to divine guidance...

[Recording ends]
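
The frontmatter block in the sample above could be generated from a metadata object along these lines. This is a minimal sketch: `buildFrontmatter` is a hypothetical name, the field list is taken from the sample, and `JSON.stringify` is used for quoting because JSON strings are valid double-quoted YAML scalars.

```javascript
// Hypothetical helper: emit YAML frontmatter for the fields shown in the sample.
function buildFrontmatter(meta) {
  const keys = ['title', 'speaker', 'date', 'source', 'duration'];
  const lines = keys
    // Skip fields that are missing so no empty keys are emitted.
    .filter((k) => meta[k] !== undefined && meta[k] !== null)
    // JSON.stringify adds the quotes and escapes embedded quotes or backslashes.
    .map((k) => `${k}: ${JSON.stringify(String(meta[k]))}`);
  return ['---', ...lines, '---', ''].join('\n');
}
```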

Configuration

Option	Default	Description
`service`	quality	AI service tier for formatting
`temperature`	0.3	Low sampling temperature for consistent, accurate output
`maxTokens`	4000	Max tokens for formatting response
`whisperModel`	medium	Whisper model (tiny, base, small, medium, large)
`language`	en	Language code or ‘auto’ for detection
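
The defaults in the table can be combined with per-call options roughly as follows. `DEFAULTS` mirrors the table; `resolveOptions` and its validation of `whisperModel` are illustrative assumptions, not the agent's documented behavior.

```javascript
// Defaults copied from the Configuration table.
const DEFAULTS = Object.freeze({
  service: 'quality',
  temperature: 0.3,
  maxTokens: 4000,
  whisperModel: 'medium',
  language: 'en',
});

// Model names listed in the table.
const WHISPER_MODELS = ['tiny', 'base', 'small', 'medium', 'large'];

// Hypothetical helper: overlay caller options on the defaults and validate the model name.
function resolveOptions(overrides = {}) {
  const options = { ...DEFAULTS, ...overrides };
  if (!WHISPER_MODELS.includes(options.whisperModel)) {
    throw new RangeError(`Unknown whisperModel: ${options.whisperModel}`);
  }
  return options;
}
```

Calling `resolveOptions({ language: 'auto' })` keeps every other default and only switches language detection on.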

Environment Variables

# Transcription output directory
TRANSCRIPTION_DIR=./data/transcriptions

# For OpenAI Whisper API (dev mode)
OPENAI_API_KEY=sk-...

# DEV_MODE selects the Whisper backend (set exactly one value)
DEV_MODE=true    # Uses the OpenAI Whisper API
# DEV_MODE=false # Uses local Whisper
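
How the agent reads these variables is not shown here, but the backend choice could look something like the sketch below. `selectWhisperBackend` and the returned labels are hypothetical, and treating an unset `DEV_MODE` as local mode is an assumption.

```javascript
// Hypothetical backend selector driven by the environment variables above.
function selectWhisperBackend(env) {
  // Anything other than the string "true" (case-insensitive) counts as local mode.
  const devMode = String(env.DEV_MODE ?? '').toLowerCase() === 'true';
  if (devMode && !env.OPENAI_API_KEY) {
    throw new Error('DEV_MODE=true requires OPENAI_API_KEY');
  }
  return devMode ? 'openai-api' : 'local-whisper';
}
```

In a Node.js process this would typically be called as `selectWhisperBackend(process.env)`.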

Dependencies

For Development (API Mode)

OpenAI API key with access to Whisper

For Production (Local Mode)

Whisper installed locally
FFmpeg for audio processing
yt-dlp for video downloads

# Install local Whisper
pip install openai-whisper

# Install yt-dlp
brew install yt-dlp  # macOS
pip install yt-dlp   # or via pip

# Install FFmpeg
brew install ffmpeg  # macOS

API Limits

Service	Limit	Notes
OpenAI Whisper API	25 MB max file size	Split larger files
Local Whisper	No limit	Constrained by RAM/GPU
YouTube	Varies	yt-dlp handles most restrictions
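
To decide whether a file needs splitting before it goes to the API, a size check along these lines may help. It is only a planning sketch: `planChunks` is a hypothetical name, 25 MB is conservatively treated as 25,000,000 bytes, and the result is a lower bound, since real splitting would be done by duration (for example with FFmpeg) rather than by raw bytes.

```javascript
// Assumption: interpret the 25 MB API limit conservatively as 25,000,000 bytes.
const MAX_API_BYTES = 25 * 1000 * 1000;

// Hypothetical planner: minimum number of parts needed to stay under the limit.
function planChunks(fileSizeBytes, maxBytes = MAX_API_BYTES) {
  if (fileSizeBytes <= maxBytes) return 1; // fits in a single request
  return Math.ceil(fileSizeBytes / maxBytes);
}
```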

Integration with Sifter

The Transcriber can be invoked through Sifter for natural language requests:

User: "Transcribe this talk: https://youtube.com/watch?v=..."

Sifter: I'll have the Transcriber process that video for you.
[Invokes TranscriberAgent.transcribe()]
The transcript has been saved to the library. Would you like me
to search for specific topics within it?

Integration with Librarian

Transcribed documents can be automatically queued for library ingestion:

// Transcribe and queue for library
const transcript = await transcriber.transcribe(url);

// Queue with Librarian for indexing
await librarian.queueDocument('transcription', {
  content: transcript.markdown,
  metadata: transcript.metadata,
  sourcePath: transcript.outputPath
});

Future Enhancements

Batch transcription from playlist/channel URLs
Speaker diarization (who spoke when)
Real-time streaming transcription
Podcast RSS feed auto-import
Translation of non-English transcripts
Audio chapter/segment extraction
Integration with narration for read-back verification