Case Study 03

Multi-Model AI Video Localization Pipeline

Four AI services orchestrated into one coherent pipeline. Source video in, localized derivative video out — for under $1.

Role: Solo Developer · Timeline: 3 months production · Status: Active deployment

The business problem

A content arbitrage agency wanted to take publicly available English-language videos and create localized variations for different markets — different languages, different audience targeting, different content niches. The constraint: straight translation creates copyright issues (you can't just translate someone's content and re-upload), so they needed something that produced derivative content rather than translations.

The use case was a custom commercial requirement, not a market product: no existing tool could do this. The client wanted to test a content scaling hypothesis: could AI make source material sufficiently unique, at enough quality and volume, to make arbitrage economically viable?

What made this hard

The naive approach — translate + re-voice + re-upload — fails for two reasons:

The system needed to produce videos with:

Architecture: two-stage pipeline

Stage 1: library population

The system first builds a searchable visual library:

Architectural decision worth noting: the previous implementation stored embeddings in Firebase, which was expensive and unnecessary. I migrated everything to a locally-hosted Qdrant instance, eliminating the recurring database costs entirely. Only embedding creation costs money now — storage and retrieval are free.

Stage 2: new video generation

When the client wants to produce a new video:

Key technical decisions

Why Vertex AI for embeddings

OpenAI didn't expose API access to the specific video embedding model needed at the time. Local alternatives would be expensive to run. Vertex AI offered the best cost-quality balance for production use.

Why self-hosted Whisper

At scale, API costs for transcription become significant. Self-hosting Whisper on a local GPU eliminated the recurring transcription expense entirely.

Why multi-language TTS via reseller provider

Instead of subscribing directly to a TTS provider with rigid, plan-based API limits, I used a pay-as-you-go reseller. Same output quality, no subscription lock-in, easier cost scaling.

Why Qdrant locally

Hosting the vector database on a local server eliminated recurring cloud database costs. The full library lived on a single home server (10th-gen Intel i5 + GTX 1070).

Cost engineering

Per-video cost breakdown for a 20-minute output:

Total per video: under $1, even for long-form 20-minute content.

This is the kind of cost structure that makes content arbitrage economically viable at scale — manually localizing a 20-minute video would take a designer/editor 8–15 hours of work.

Production setup

Challenges solved

1. Embedding quality for visual matching

Initial implementation produced poor semantic matches — new audio about Topic X would get visually unrelated footage. Solution: augmented each video segment's embedding with a JSON content description, dramatically improving match relevance.
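The case study does not say how the JSON description was combined with the embedding. As one possible reading, the sketch below blends vector similarity with word overlap between the narration text and each segment's description. All names (`rank_segments`, `text_weight`, the `scene`/`objects` fields) are hypothetical, and the pure-Python cosine search stands in for the real Vertex AI embeddings and Qdrant lookup.

```python
import json
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tokens(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def description_overlap(query_text, description_json):
    """Jaccard overlap between the query words and the description's values."""
    description = json.loads(description_json)
    desc_tokens = tokens(" ".join(str(v) for v in description.values()))
    query_tokens = tokens(query_text)
    if not query_tokens or not desc_tokens:
        return 0.0
    return len(query_tokens & desc_tokens) / len(query_tokens | desc_tokens)


def rank_segments(query_vec, query_text, segments, text_weight=0.5):
    """Return segment ids, best match first.

    text_weight is a hypothetical knob: 0 uses only the vectors,
    1 uses only the description overlap.
    """
    scored = []
    for seg in segments:
        visual = cosine(query_vec, seg["vector"])
        textual = description_overlap(query_text, seg["description"])
        scored.append(((1 - text_weight) * visual + text_weight * textual, seg["id"]))
    return [seg_id for _, seg_id in sorted(scored, reverse=True)]
```

With `text_weight=0` the ranking falls back to vector similarity alone, which makes it easy to compare matches with and without the description signal.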

2. Pacing and rhythm in assembled videos

Auto-assembled videos initially looked unnatural: segments were too short (under 1.5s) or too long (over 15s), and character cuts landed at awkward moments. I built constraints into the assembly logic: minimum and maximum segment durations, no repeated segments within close proximity, and audio level normalization.
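A minimal sketch of how duration and repeat constraints like these might be enforced, assuming each narration slot arrives with a best-first list of candidate segments. The 1.5s and 15s bounds come from the text above; `Segment`, `pick_segments`, and the `repeat_window` default are hypothetical, and audio normalization is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    segment_id: str
    duration: float  # seconds


def pick_segments(slots, min_dur=1.5, max_dur=15.0, repeat_window=3):
    """Pick one segment per slot while respecting pacing constraints.

    slots: list of candidate lists, each ranked best match first.
    Returns a list of (segment_id, used_duration) tuples.
    """
    timeline = []
    for candidates in slots:
        # Ids used in the last `repeat_window` picks may not be reused yet.
        if repeat_window > 0:
            recent = {seg_id for seg_id, _ in timeline[-repeat_window:]}
        else:
            recent = set()
        for seg in candidates:
            if seg.duration < min_dur:
                continue  # too short to read as a deliberate shot
            if seg.segment_id in recent:
                continue  # would repeat too soon
            # Overlong segments are trimmed to the maximum duration.
            timeline.append((seg.segment_id, min(seg.duration, max_dur)))
            break
        # If no candidate qualifies, the slot is skipped in this sketch.
    return timeline
```

Trimming overlong segments instead of rejecting them is a design choice made here for simplicity; a real assembler might prefer to split them or pick a sub-clip.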

3. Migration from Firebase to local Qdrant

The inherited architecture stored embeddings in Firebase at a recurring cost. I migrated the entire pipeline to a locally hosted Qdrant instance, eliminating the ongoing database expense.

4. Whisper translation quality

Standard Whisper translations sometimes produced awkward output. Added Gemini as a rewriting layer that improved both meaning preservation and natural language flow in the target language.
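A rewriting layer of this kind can be sketched as a thin wrapper around any text-in, text-out model call. The prompt wording, the `rewrite_segments` name, and the `llm` callable are assumptions for illustration; the actual Gemini prompt and client code are not shown in this case study.

```python
from typing import Callable, List

# Hypothetical prompt; the real wording used in production is not known.
PROMPT_TEMPLATE = (
    "Rewrite the following text in natural, fluent {language}. "
    "Preserve the meaning; do not add or remove facts.\n\n"
    "Text:\n{text}"
)


def rewrite_segments(
    segments: List[str], language: str, llm: Callable[[str], str]
) -> List[str]:
    """Pass each transcript segment through an LLM rewriting step.

    llm is any callable that takes a prompt and returns the model's reply,
    so the model client can be swapped or stubbed out in tests.
    """
    rewritten = []
    for text in segments:
        if not text.strip():
            rewritten.append(text)  # nothing to rewrite
            continue
        prompt = PROMPT_TEMPLATE.format(language=language, text=text)
        reply = llm(prompt).strip()
        # Fall back to the original text if the model returns nothing.
        rewritten.append(reply or text)
    return rewritten
```

Keeping the rewrite per segment preserves a one-to-one mapping with the transcript, which should make later alignment with the synthesized audio simpler.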

Outcome

< $1 per 20-minute video produced
4+ AI services orchestrated
~2/hour throughput for 20-min outputs
3 months of active production use

End-to-end automation — client submits a URL via Telegram, receives ready-to-publish video without intermediate manual steps. Sub-$1 cost per 20-minute output. Scalable architecture designed to expand across multiple GPU instances and category-specific libraries.

Tech stack

Language: Python
Video Processing: FFmpeg (GPU-accelerated)
Embeddings: Vertex AI
Transcription: Whisper Large (self-hosted)
LLM Rewriting: Gemini
Voice Synthesis: Multi-language TTS
Vector Database: Qdrant (self-hosted)
Interface: Telegram Bot

What this demonstrates

Got a similar problem?

I'd love to hear about what you're building.

david@chystyi.dev →