Separating Signal from Noise on YouTube with One Claude Call

I watch a lot of YouTube to learn things, and I kept noticing the same waste: a twenty-minute video has maybe six minutes of real content buried inside fourteen minutes of “smash that subscribe,” a sponsor read for a VPN, a recap of the last video, and manufactured urgency about why you need to keep watching. The signal is in there. You just have to mine it out of the noise.

So I built a tool that does the mining for you. Paste a video or a channel URL, and it returns a tight briefing: a two-to-four sentence summary of only the substantive parts, the key insights, the detailed points, a verdict on whether the thing is worth watching in full, and an honest audit trail of what it threw away as noise. I called it Signal/Noise.

This is the first of a handful of write-ups on the AI systems I have been building. I am starting with the smallest one on purpose, because sometimes the best use of a good model is the most boring one.

Open Table of contents

The whole product is one prompt
Why this needs a backend at all
Parsing JSON out of a language model, defensively
The transcript chain, and YouTube’s bot wall
Logging without ever blocking a request
Shifting the cost to the user
A desktop app without Electron
What I took away from it

The whole product is one prompt

It is tempting to over-engineer something like this. Chunk the transcript, embed it, run a multi-turn refinement loop, stream tokens, build a RAG pipeline. I did none of that. The entire intelligence of the app is a single Claude call with a system prompt that defines, in plain English, what counts as noise and what counts as value.

That system prompt is the product positioning, written as an instruction:

Noise is sponsor reads, self-promotion, channel and CTA plugs (like, subscribe, hit the bell), filler and verbal padding, manufactured hype or urgency, repetition, off-topic tangents, and engagement-bait that carries no information.

Value is concrete claims, data, mechanisms, arguments, frameworks, predictions with reasoning, specific examples, and actionable takeaways.

The model is asked to read the whole transcript at once and return structured JSON: a tldr, an integer value_score from 0 to 100 (the percentage of the video that is genuinely worth your time), key_insights, detailed_points, an explicit noise audit of what got sidelined, and a verdict of watch / skim / skip. One call, max_tokens around 4000, default model Claude Sonnet 4.6. No streaming, no follow-up turns, no state.

There is real freedom in deciding the LLM call should be a pure function. Transcript in, briefing out. Everything else in the app is just plumbing around that function, and the plumbing turned out to be the interesting engineering.

Why this needs a backend at all

My first instinct was to make it a pure frontend app. You cannot. The browser blocks you on two fronts:

YouTube’s video-listing endpoints are CORS-blocked, so client JavaScript cannot enumerate a channel’s recent uploads.
Transcript fetching is a server-side affair.

So the architecture is a minimal Flask backend (about 575 lines) plus a single-page frontend (about 476 lines). The backend resolves URLs, fetches transcripts, calls Claude, and optionally logs to Postgres. The frontend handles the API key, the library of past briefings, and rendering the report. That is the entire system.

Parsing JSON out of a language model, defensively

Models love to wrap JSON in markdown fences, or add a sentence before the object. If you call json.loads on the raw response you will get burned. The parser strips fences first, then falls back to grabbing everything between the first { and the last }:

def parse_json(text):
    text = re.sub(r"^```(?:json)?|```$", "", text.strip(), flags=re.M).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        a, b = text.find("{"), text.rfind("}")
        if a >= 0 and b > a:
            return json.loads(text[a:b + 1])
        raise

Small, unglamorous, and it is the difference between a tool that works on every video and one that randomly 500s when the model gets chatty.

The transcript chain, and YouTube’s bot wall

This is where most of the real work went. Getting a transcript reliably is harder than it sounds, and it gets harder the moment you deploy to the cloud.

The fetch logic is a three-tier fallback:

youtube-transcript-api for official transcripts, trying English variants ["en", "en-US", "en-GB"]. Fastest path, works fine from a home IP.
yt-dlp auto-subtitles if that fails. It downloads the .vtt captions, strips the timing tags and bracketed annotations like [music], and deduplicates repeated lines.
A hosted transcript API (Supadata) as the last resort.

Why three tiers? Because of a problem you only discover after you deploy: YouTube blocks datacenter IPs. From your laptop, tier one just works. From a Render or Railway box, YouTube returns a 429 and asks you to “confirm you’re not a bot,” and your nice working app falls over in production. The two honest fixes are a hosted transcript service that has solved the IP problem for you (Supadata has a free tier of roughly 100 transcripts a month, no card required), or routing every YouTube call through a residential proxy like Webshare. The app supports both and cascades automatically. This is the kind of thing no architecture diagram warns you about; you learn it from a production log.

Listing a channel’s videos uses yt-dlp in a deliberately lightweight way, extract_flat so it never downloads anything, just enumerates the most recent N uploads:

opts = {
    "quiet": True, "no_warnings": True,
    "extract_flat": "in_playlist",   # list, don't download
    "playlistend": n,
    "skip_download": True,
}

Long videos get capped at about 120k characters (roughly 30k tokens) with a “transcript truncated” marker, to keep cost and latency bounded.

Logging without ever blocking a request

I wanted analytics (how many analyses, success rate, distinct users, average signal score) without slowing down a single user request. Synchronous logging would mean every request waits on a database write, and if the database is slow or down, the whole app is slow or down.

So logging is fully non-blocking. A background daemon thread drains a bounded queue:

Every log call enqueues and returns immediately.
A worker thread writes to Postgres on its own time.
If the database is slow, the queue absorbs it. If the queue fills (cap 1000), new items are dropped silently rather than blocking the request.
If the database is down entirely, the app does not care. It keeps serving.

The user-facing path never waits on the database. Ever. It is a simple pattern, a queue and a daemon thread, and it is the right default for any web app where logging is nice-to-have rather than load-bearing.

The schema is careful about one thing in particular: it never stores a user’s API key. It stores a key_hash, the first 12 characters of a SHA-256 of the key, which is enough to count distinct users without ever holding their secret.

CREATE TABLE events (
  id BIGSERIAL PRIMARY KEY,
  created_at TIMESTAMPTZ DEFAULT now(),
  kind TEXT, status TEXT,           -- resolve|analyze, ok|error
  video_id TEXT, title TEXT, url TEXT,
  value_score INTEGER, transcript_chars INTEGER, model TEXT,
  key_hash TEXT,                    -- sha256 prefix, never the key
  duration_ms INTEGER, ip TEXT, user_agent TEXT,
  error TEXT, result JSONB          -- the full briefing
);

Shifting the cost to the user

If you put a public LLM tool on the internet with your own API key wired in, you are funding everyone’s usage and you will wake up to a surprise bill. The model here flips that: nobody’s key lives on the server.

Each visitor pastes their own Anthropic key into the browser. It is stored in localStorage, displayed masked (sk-an...▪▪▪▪), and sent per request in an X-Anthropic-Key header. The server uses it for that one call and forgets it. The only trace is the one-way hash for counting users. The visitor pays for their own analyses, which means the tool can be genuinely public without bankrupting me. The frontend also keeps a local library of up to 80 past briefings, so re-opening a video you already analysed is instant and free, with a “re-run a fresh briefing” button when you want to bust the cache.

There is even a small bit of intentional UX honesty in the loader, which cycles through five messages every few seconds (“Separating signal from noise…”, “Surfacing the claims that matter…”) and tells you straight out that “reading it properly takes a few moments, worth the wait.” The analysis takes 30 to 60 seconds. Rather than hide that, the loader makes the wait feel deliberate.

A desktop app without Electron

I wanted a version friends could just install, not run from a terminal. The reflex answer is Electron, which means shipping a whole Chromium. I did not want that. Instead the desktop build is the same Flask app, packaged with PyInstaller, fronted by a tiny launcher:

def _pick_port():
    if os.environ.get("PORT"):
        return int(os.environ["PORT"])
    for p in (8765, 8766, 8767):          # prefer a stable, bookmarkable port
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", p)); s.close(); return p
        except OSError:
            s.close()
    return _free_port()                    # fall back to any free port

The launcher finds a port (preferring fixed ones so the browser tab stays bookmarkable across restarts), starts the Flask server in a background thread, opens the browser to 127.0.0.1, and puts a small icon in the macOS menu bar with Open and Quit via the rumps library. build_app.sh runs PyInstaller and wraps the result in a drag-to-Applications DMG. The whole thing is bound to localhost only, so it never listens on the network. It feels like a native app and it is a fraction of the size of an Electron build. The same codebase deploys to the cloud with one line in a Procfile:

gunicorn app:app --workers 1 --threads 8 --timeout 120 --bind 0.0.0.0:$PORT

Three parameters there matter more than they look. One worker to avoid contention, eight threads because Claude calls block and you want concurrency, and a 120-second timeout because gunicorn’s 30-second default would kill a long analysis dead.

What I took away from it

Sometimes the LLM call should be a pure function. Transcript in, structured briefing out, no state. Resisting the urge to build a pipeline kept the whole thing legible and cheap.
The prompt is the product. The entire value proposition lives in nineteen lines that define noise versus value. Get that right and the rest is plumbing.
Production teaches you things local never will. The datacenter-IP block is invisible until you deploy, and the fix (fallback transcript sources) became a core feature rather than a patch.
Shift the cost to where the value lands. Bring-your-own-key turned a tool that would have drained my budget into one I can leave running in public.

The through-line, which I will keep coming back to in the bigger systems I write about next, is that the model is rarely the hard part. The harness, the plumbing, and the honest handling of edge cases are where the real work lives.