Audio/Video to Text

Transcriber Agent

Audio and video to Markdown transcription agent using Whisper AI

Role: Audio/Video to Markdown Transcription Specialist
File: api/agents/agent-transcriber.js

Overview

The Transcriber agent converts audio and video content into searchable Markdown documents using OpenAI’s Whisper model. It handles direct file uploads, URL downloads (including YouTube and Vimeo), and produces clean, well-formatted transcripts suitable for indexing in the library.

Core Capabilities

1. Multi-Source Input

  • Direct audio/video file uploads (MP3, MP4, WAV, FLAC, etc.)
  • YouTube and Vimeo URL downloads via yt-dlp
  • Direct URL downloads for other audio sources
  • Podcast RSS feed URLs (future)
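The routing between these input types can be sketched as a small classifier. This is only an illustration: `classifySource`, the host list, and the returned labels are hypothetical names, not taken from agent-transcriber.js.

```javascript
// Hypothetical helper: decide how a source should be fetched.
// The host list and the return labels are illustrative assumptions.
const YT_DLP_HOSTS = ['youtube.com', 'youtu.be', 'vimeo.com'];

function classifySource(source) {
  let url;
  try {
    url = new URL(source);
  } catch {
    return 'local-file'; // not parseable as a URL, so treat it as a file path
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return 'local-file'; // e.g. a Windows drive letter parsed as a scheme
  }
  const host = url.hostname.replace(/^www\./, '');
  const usesYtDlp = YT_DLP_HOSTS.some(
    (h) => host === h || host.endsWith('.' + h)
  );
  return usesYtDlp ? 'yt-dlp' : 'direct-download';
}
```

Matching on the hostname rather than on a substring of the whole URL avoids misrouting links that merely mention a video site in their path or query string.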

2. Whisper Transcription

  • Development mode: OpenAI Whisper API (simpler setup)
  • Production mode: Local Whisper installation (faster, no API costs)
  • Supports multiple languages with auto-detection
  • Segment-level timestamps for navigation

3. Markdown Formatting

  • AI-powered transcript cleanup and formatting
  • Section headings based on topic shifts
  • Speaker detection and labeling
  • Proper punctuation and paragraph breaks
  • YAML frontmatter with metadata

4. Sacred Text Handling

  • Correct spelling of religious terms
  • Proper diacriticals (Bahá’u’lláh, ‘Abdu’l-Bahá)
  • Preservation of speaker’s voice and style
  • [inaudible] markers for unclear portions

Architecture

Audio/Video Source
           │
           ▼
┌─────────────────────┐
│  Download Media     │ ◄── YouTube, Vimeo, direct URLs
│  (yt-dlp/fetch)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Transcribe Audio   │ ◄── Local Whisper or OpenAI API
│  (whisper/API)      │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Format Transcript  │ ◄── AI cleanup and structuring
│  (LLM)              │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Save Markdown      │ ──► data/transcriptions/
└─────────────────────┘
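The four stages above can be read as a simple sequential pipeline. The sketch below is an assumption about the control flow, not the agent's actual code: `runPipeline` and the injected `stages` object are hypothetical, and returning the raw text as `markdown` when formatting is skipped is a guess.

```javascript
// Sketch of the four-stage flow. Stage functions are injected so the
// example stays self-contained; their names and shapes are hypothetical.
async function runPipeline(source, stages, options = {}) {
  // 1. Download: resolves to { audioPath, metadata }
  const media = await stages.download(source);
  // 2. Transcribe: resolves to { text, segments }
  const raw = await stages.transcribe(media.audioPath, options);
  // 3. Format: skipped when options.format === false (assumed behaviour)
  const markdown = options.format === false
    ? raw.text
    : await stages.format(raw.text, media.metadata);
  // 4. Save: resolves to the path of the written Markdown file
  const outputPath = await stages.save(markdown, media.metadata);
  return {
    markdown,
    rawText: raw.text,
    metadata: media.metadata,
    outputPath,
    segments: raw.segments,
  };
}
```

Passing the stages in as functions keeps each step replaceable, for example swapping the API transcriber for a local one without touching the rest of the flow.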

Supported Formats

Format   Extension       Notes
MP3      .mp3            Most common audio format
MP4      .mp4            Video with audio extraction
WAV      .wav            High-quality uncompressed
FLAC     .flac           Lossless compression
WebM     .webm           Web video format
M4A      .m4a            Apple audio format
OGG      .ogg            Open source format
MPEG     .mpeg, .mpga    Various MPEG formats
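A pre-flight check against the extensions listed above might look like the following. The helper name `isSupportedMedia` is hypothetical; the real agent may validate input differently, for instance by probing the file with FFmpeg.

```javascript
// Extensions copied from the Supported Formats table above.
const SUPPORTED_EXTENSIONS = new Set([
  '.mp3', '.mp4', '.wav', '.flac', '.webm', '.m4a', '.ogg', '.mpeg', '.mpga',
]);

// Hypothetical helper: true when the file name ends in a supported extension.
function isSupportedMedia(fileName) {
  const dot = fileName.lastIndexOf('.');
  if (dot === -1) return false; // no extension at all
  return SUPPORTED_EXTENSIONS.has(fileName.slice(dot).toLowerCase());
}
```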

Usage Examples

Transcribe from URL

import { TranscriberAgent } from './api/agents/agent-transcriber.js';

const transcriber = new TranscriberAgent();

// Transcribe a YouTube video
const result = await transcriber.transcribe(
  'https://www.youtube.com/watch?v=VIDEO_ID',
  {
    language: 'en',
    whisperModel: 'medium',
    format: true  // AI formatting
  }
);

// Returns:
// {
//   markdown: "---\ntitle: ...\n---\n\n# Talk Title\n\n...",
//   rawText: "original whisper output...",
//   metadata: { title, duration, uploader, url },
//   outputPath: "data/transcriptions/Talk-Title.md",
//   segments: [{ start, end, text }, ...]
// }
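The `segments` array is what makes timestamp navigation possible. As a rough sketch, assuming `start` and `end` are offsets in seconds (an assumption; the source does not state the unit), the segments could be rendered as timestamped lines. Both helper names are hypothetical.

```javascript
// Convert a number of seconds into HH:MM:SS, dropping fractional seconds.
function formatTimestamp(seconds) {
  const total = Math.floor(seconds);
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  const s = total % 60;
  return [h, m, s].map((n) => String(n).padStart(2, '0')).join(':');
}

// Render each segment as "[HH:MM:SS] text", using the segment start time.
function segmentsToLines(segments) {
  return segments.map(
    (seg) => `[${formatTimestamp(seg.start)}] ${seg.text.trim()}`
  );
}
```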

Transcribe Local File

// Transcribe a local audio file
const result = await transcriber.transcribe(
  '/path/to/lecture.mp3',
  {
    language: 'en',
    format: true
  }
);

Skip AI Formatting

// Get raw transcript without AI cleanup
const result = await transcriber.transcribe(url, {
  format: false  // Skip AI formatting
});

Custom Output Path

const result = await transcriber.transcribe(url, {
  outputPath: './library/talks/my-lecture.md'
});
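When no `outputPath` is given, the earlier example shows a result such as data/transcriptions/Talk-Title.md. The naming rule is not documented here, so the following is only a guess at how a title could be turned into such a path; `defaultOutputPath` is a hypothetical helper.

```javascript
// Hypothetical: derive a file path from a title, e.g. "Talk Title" -> "Talk-Title.md".
function defaultOutputPath(title, dir = 'data/transcriptions') {
  const slug = title
    .normalize('NFKD')                  // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')    // drop the combining marks
    .replace(/[^A-Za-z0-9]+/g, '-')     // collapse everything else into hyphens
    .replace(/^-+|-+$/g, '');           // trim leading and trailing hyphens
  return `${dir}/${slug || 'untitled'}.md`;
}
```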

Output Format

Markdown with Frontmatter

---
title: "The Power of Prayer"
speaker: "Dr. John Smith"
date: "2024-03-15"
source: "https://youtube.com/watch?v=..."
duration: "01:23:45"
---

# The Power of Prayer

## Introduction

Welcome everyone. Today we're going to explore the transformative
power of prayer across different spiritual traditions...

## Historical Context

The practice of prayer dates back to the earliest recorded
human civilizations...

## Practical Application

When we approach prayer with **sincerity** and **humility**,
we open ourselves to divine guidance...

[Recording ends]
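The frontmatter block shown above could be generated from the transcript metadata along these lines. This is a minimal sketch: `buildFrontmatter` is a hypothetical helper, and the field list simply mirrors the sample. It relies on the fact that a JSON string is also a valid YAML double-quoted scalar.

```javascript
// Hypothetical: build the YAML frontmatter shown in the sample output.
function buildFrontmatter(meta) {
  const fields = ['title', 'speaker', 'date', 'source', 'duration'];
  const lines = fields
    .filter((key) => meta[key] !== undefined && meta[key] !== null)
    // JSON.stringify adds the double quotes and escapes embedded quotes.
    .map((key) => `${key}: ${JSON.stringify(String(meta[key]))}`);
  return ['---', ...lines, '---', ''].join('\n');
}
```

Fields that are missing from the metadata are omitted rather than written as empty strings, so a transcript with no known speaker does not end up with a misleading `speaker: ""` entry.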

Configuration

Option        Default   Description
service       quality   AI service tier for formatting
temperature   0.3       Low temp for accurate transcription
maxTokens     4000      Max tokens for formatting response
whisperModel  medium    Whisper model (tiny, base, small, medium, large)
language      en        Language code or ‘auto’ for detection
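One way to apply these defaults is to merge caller-supplied options over them and validate the result. The sketch below is illustrative only; `DEFAULTS`, `WHISPER_MODELS`, and `resolveOptions` are hypothetical names, and the agent may handle invalid values differently.

```javascript
// Defaults copied from the Configuration table above.
const DEFAULTS = Object.freeze({
  service: 'quality',
  temperature: 0.3,
  maxTokens: 4000,
  whisperModel: 'medium',
  language: 'en',
});

const WHISPER_MODELS = ['tiny', 'base', 'small', 'medium', 'large'];

// Hypothetical: caller overrides win over defaults; unknown models are rejected.
function resolveOptions(overrides = {}) {
  const options = { ...DEFAULTS, ...overrides };
  if (!WHISPER_MODELS.includes(options.whisperModel)) {
    throw new RangeError(`Unknown whisperModel: ${options.whisperModel}`);
  }
  return options;
}
```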

Environment Variables

# Transcription output directory
TRANSCRIPTION_DIR=./data/transcriptions

# For OpenAI Whisper API (dev mode)
OPENAI_API_KEY=sk-...

# DEV_MODE selects the Whisper backend (set one value only)
DEV_MODE=true     # Uses OpenAI API
# DEV_MODE=false  # Uses local Whisper
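The backend choice driven by these variables can be sketched as follows. `selectBackend` and its return shape are hypothetical, and treating an unset `DEV_MODE` as `false` is an assumption rather than documented behaviour.

```javascript
// Hypothetical: pick the Whisper backend from an environment object.
// Pass process.env in real use; a plain object keeps the sketch testable.
function selectBackend(env) {
  // Assumption: anything other than the string "true" means local mode.
  const devMode = String(env.DEV_MODE ?? '').toLowerCase() === 'true';
  if (devMode && !env.OPENAI_API_KEY) {
    throw new Error('DEV_MODE=true requires OPENAI_API_KEY');
  }
  return {
    backend: devMode ? 'openai-api' : 'local-whisper',
    outputDir: env.TRANSCRIPTION_DIR || './data/transcriptions',
  };
}
```

Failing early when the API key is missing surfaces a configuration problem before any media has been downloaded.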

Dependencies

For Development (API Mode)

  • OpenAI API key with access to Whisper

For Production (Local Mode)

# Install local Whisper
pip install openai-whisper

# Install yt-dlp
brew install yt-dlp  # macOS
pip install yt-dlp   # or via pip

# Install FFmpeg
brew install ffmpeg  # macOS

API Limits

Service              Limit            Notes
OpenAI Whisper API   25 MB max file   Split larger files
Local Whisper        No limit         Constrained by RAM/GPU
YouTube              Varies           yt-dlp handles most restrictions
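For the API limit, splitting can be planned before any audio is cut. The sketch below assumes "25MB" means 25 MiB and that the file has a roughly constant bitrate, so equal time slices yield roughly equal byte sizes; with variable-bitrate audio a safety margin would be needed. `planChunks` is a hypothetical helper, and the actual cutting (for example with FFmpeg) is not shown.

```javascript
const MAX_API_BYTES = 25 * 1024 * 1024; // assumes "25MB" means 25 MiB

// Hypothetical: return time ranges (in seconds) that each fit under maxBytes,
// assuming bytes are spread evenly over the duration.
function planChunks(fileBytes, durationSeconds, maxBytes = MAX_API_BYTES) {
  if (fileBytes <= maxBytes) {
    return [{ start: 0, end: durationSeconds }];
  }
  const count = Math.ceil(fileBytes / maxBytes);
  const chunkSeconds = durationSeconds / count;
  return Array.from({ length: count }, (_, i) => ({
    start: i * chunkSeconds,
    end: Math.min((i + 1) * chunkSeconds, durationSeconds),
  }));
}
```

Segment timestamps from later chunks would then need the chunk's `start` added back so they stay relative to the full recording.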

Integration with Sifter

The Transcriber can be invoked through Sifter for natural language requests:

User: "Transcribe this talk: https://youtube.com/watch?v=..."

Sifter: I'll have the Transcriber process that video for you.
[Invokes TranscriberAgent.transcribe()]
The transcript has been saved to the library. Would you like me
to search for specific topics within it?

Integration with Librarian

Transcribed documents can be automatically queued for library ingestion:

// Transcribe and queue for library
const transcript = await transcriber.transcribe(url);

// Queue with Librarian for indexing (assumes a Librarian agent instance named `librarian` is in scope)
await librarian.queueDocument('transcription', {
  content: transcript.markdown,
  metadata: transcript.metadata,
  sourcePath: transcript.outputPath
});

Future Enhancements

  • Batch transcription from playlist/channel URLs
  • Speaker diarization (who spoke when)
  • Real-time streaming transcription
  • Podcast RSS feed auto-import
  • Translation of non-English transcripts
  • Audio chapter/segment extraction
  • Integration with narration for read-back verification