A guide to the Journalist Tool content monitoring pipeline
Last updated March 2026
Journalist Tool is an AI-powered content monitoring and analysis pipeline for beat reporters. It watches RSS feeds, YouTube channels, podcasts, websites, and forums for new content; acquires and cleans transcripts; runs structured AI analysis; maintains a living, self-improving knowledge base; and synthesizes daily intelligence briefings with cross-source themes, signal alerts, and story angles.
Watches RSS feeds, YouTube channels, podcasts, websites, and forums. 7-tier transcript acquisition with a free-first strategy.
Claude Haiku extracts entities, signals, claims, quotes, sentiment, and topics from every piece of content.
Self-improving Cortex with entity profiles, glossary, ASR corrections, and contradiction detection.
Claude Sonnet synthesizes cross-source Signal Radar briefs with trend alerts and story angles.
The pipeline runs as 8 independent Railway cron services sharing the same Docker image. Each service handles one concern and runs on its own schedule. Failures in one service don't affect the others.
Check Sources
Poll RSS, YouTube, web, and forum sources for new content. Circuit breaker skips unhealthy sources.
Transcribe
7-tier acquisition: raw text, web scrape, YouTube subtitles, RSS tags, mirrors, Groq Whisper, YouTube audio.
Enrich
Haiku + RAG corrects ASR errors, normalizes jargon, extracts unknown terms. Webhooks dispatched.
Analyze
Haiku produces structured JSON: summary, entities, signals, claims, quotes, sentiment, topics.
Cortex Learn
Promote terms, refine definitions, detect duplicates, merge aliases, generate training questions.
Synthesize
Sonnet clusters stories across sources, generates the daily Signal Radar brief, renders it as HTML, and sends it via Resend. Pipeline complete.
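The 7-tier acquisition in the Transcribe stage can be sketched as an ordered, free-first fallback chain. The function shape and tier callables below are illustrative stand-ins, not the pipeline's actual API:

```python
from typing import Callable, Optional

def acquire_transcript(
    item: dict,
    tiers: list[Callable[[dict], Optional[str]]],
) -> Optional[str]:
    """Try each tier in order (cheapest first); first non-empty transcript wins."""
    for tier in tiers:
        try:
            text = tier(item)
        except Exception:
            continue  # a failing tier falls through to the next one
        if text and text.strip():
            return text
    return None  # all 7 tiers exhausted; item stays pending
```

Cheap tiers (raw text, web scrape, subtitles) sit at the front of the list, so paid ASR only runs when everything free has failed.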
Three deployment targets work together. Railway runs the Python pipeline on a cron schedule. Supabase hosts the PostgreSQL database and Edge Functions. Vercel serves this dashboard and the Train Cortex game.
Railway runs 8 independent cron services from the same Docker image (python:3.12-slim + ffmpeg + uv). Each service is dispatched by the RUN_MODE env var: source-checker (*/15), transcribe-worker (*/30), analyze-worker (*/30), cortex-applier (*/30), re-enricher (every 2h), cortex-learner (every 4h), embedder (daily 3am), and content-monitor (daily synthesis + email). No long-running server.
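The one-image, eight-services pattern boils down to a dispatch on RUN_MODE at container start. A minimal sketch of such an entrypoint, with stub workers in place of the real pipeline functions:

```python
import os
import sys

def _stub(name: str):
    """Placeholder worker; the real entrypoint would call the pipeline's own functions."""
    def run() -> None:
        print(f"running {name}")
    return run

# One handler per Railway cron service, keyed by the RUN_MODE env var.
HANDLERS = {
    "source-checker": _stub("source-checker"),
    "transcribe-worker": _stub("transcribe-worker"),
    "analyze-worker": _stub("analyze-worker"),
    "cortex-applier": _stub("cortex-applier"),
    "re-enricher": _stub("re-enricher"),
    "cortex-learner": _stub("cortex-learner"),
    "embedder": _stub("embedder"),
    "content-monitor": _stub("content-monitor"),
}

def main() -> int:
    mode = os.environ.get("RUN_MODE", "")
    handler = HANDLERS.get(mode)
    if handler is None:
        print(f"unknown RUN_MODE: {mode!r}", file=sys.stderr)
        return 1
    handler()
    return 0
```

Because each service is a short-lived cron job rather than a server, a crash in one mode never takes down the others.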
Supabase provides PostgreSQL with pgvector for embeddings, plus Edge Functions for the Train Cortex API. 22 tables, 7 RPCs, HNSW indexes for vector similarity search.
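Conceptually, each vector search embeds the query and ranks stored chunks by cosine distance, which pgvector's HNSW index approximates at scale. A pure-Python exact-ranking sketch of what the index computes (not the actual SQL path):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Same semantics as pgvector's <=> operator: 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3) -> list[str]:
    # Exact nearest-neighbor ranking; an HNSW index returns approximately
    # this ordering in sublinear time instead of scanning every chunk.
    return sorted(chunks, key=lambda cid: cosine_distance(query, chunks[cid]))[:k]
```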
Vercel hosts this Next.js dashboard. Server components read directly from Supabase — no API layer in between. Also hosts the Train Cortex game as static HTML.
At-a-glance stats: total items, active sources, unhealthy sources, latest brief. Preview of the most recent brief and the last 10 discovered items.
When to use: Check system health and see what's new.
Filterable, sortable table of all discovered content. Filter by status (pending, transcript_acquired, analyzed, failed) and source. Each item links to its full analysis.
When to use: Browse content, find specific items, check processing status.
All monitored sources with health indicators. Inline controls to pause, resume, edit, reset health, or delete. Add new sources via the form.
When to use: Manage what the tool monitors. Diagnose source failures.
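These health controls pair with the circuit breaker mentioned in the pipeline. One plausible policy, sketched as an assumption rather than the tool's actual implementation: skip a source after a run of consecutive failures until a success or a manual reset:

```python
class SourceHealth:
    """Hypothetical circuit breaker: trip after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def healthy(self) -> bool:
        # The source checker polls only healthy sources.
        return self.consecutive_failures < self.threshold

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    def reset(self) -> None:
        # Mirrors the dashboard's "reset health" control.
        self.consecutive_failures = 0
```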
Beat configuration hub. Each beat has a glossary editor, entity manager, signal taxonomy editor, KB browsers, auto-mined suggestion review, and a seed-from-text tool.
When to use: Configure domain knowledge. Review and train the KB.
Daily Signal Radar briefs. Each brief contains an executive summary, convergence threads with cross-source quotes and claims, signal alerts, trend alerts, and story angles.
When to use: Read your morning intelligence briefing.
Machine intelligence tracking: rumors, leaks, announcements, and predictions extracted from content. Filter by entity, type, or confidence level.
When to use: Track accountability: who said what, and were they right?
Two search modes: item search (title + analysis text) and transcript search (full-text segment search with highlighted results and timestamps).
When to use: Find what anyone said about any topic across all sources.
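Highlighted segment search could look like this minimal in-memory sketch; the segment shape is an assumption, and the tool's real search runs in the database rather than in Python:

```python
import re

def search_segments(
    segments: list[tuple[float, str]],  # assumed shape: (timestamp_seconds, text)
    query: str,
) -> list[tuple[float, str]]:
    """Return matching segments with the query wrapped in markdown bold."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for timestamp, text in segments:
        if pattern.search(text):
            highlighted = pattern.sub(lambda m: f"**{m.group(0)}**", text)
            hits.append((timestamp, highlighted))
    return hits
```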
Cortex is a living, self-improving domain knowledge system. It learns from every transcript the pipeline processes, building a richer understanding of the beat over time.
Canonical registry of people, companies, and organizations. Auto-synthesized profiles from accumulated mentions. Alias detection and entity merging.
Living dictionary of domain-specific terms. Definitions refined from accumulated evidence. Version history tracks how definitions evolve.
Categorized signal types (acquisitions, launches, rumors, etc.) with descriptions and examples. Used by analysis to tag content with detected signals.
Key claims extracted from content with confidence scores. Predictions, rumors, and factual assertions tracked for accountability.
Known speech-to-text errors mapped to correct forms. Fed back into enrichment to improve transcript quality over time.
Detects when new information conflicts with existing KB entries. Flags for human review with side-by-side comparison.
Gamified human-in-the-loop training. Tinder-style swipe cards for disambiguating entities, verifying ASR corrections, and reviewing definitions.
Dashboard metrics: glossary breakdown, linked terms, refinement status, ASR correction counts, contradiction counts, pending questions.
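ASR corrections feed back into enrichment as simple wrong-form to correct-form mappings. A sketch of how such a table might be applied during transcript cleanup; the entries and function are hypothetical, not the Cortex's actual data:

```python
import re

# Hypothetical corrections; the real entries live in the Cortex database.
ASR_CORRECTIONS = {
    "pin ball": "pinball",
    "stern pin ball": "Stern Pinball",
}

def apply_asr_corrections(text: str, corrections: dict[str, str]) -> str:
    # Apply longer patterns first so multi-word fixes win over their substrings.
    for wrong in sorted(corrections, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        text = pattern.sub(corrections[wrong], text)
    return text
```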
Here's how to get the most out of the Journalist Tool day-to-day.
Point the tool at RSS feeds, YouTube channels, podcasts, websites, or forums you want to monitor. Set check intervals per source.
Eight independent Railway cron services run automatically on staggered schedules. They check sources (every 15min), transcribe and enrich (every 30min), analyze (every 30min), apply KB answers, re-enrich stale transcripts, run Cortex learning, embed chunks, and synthesize daily briefs.
Check /briefs each morning for the Signal Radar — cross-source themes, signal alerts, trend alerts, and key quotes.
Use the Claims Explorer to browse predictions, rumors, and factual claims. Filter by entity, type, or confidence to track accountability.
Use /search for full-text and semantic search across all transcripts and analyses. Find what anyone said about any topic.
Visit Train Cortex to answer KB questions (merge terms, verify ASR, disambiguate entities). This improves enrichment and analysis quality over time.
Check /intel for machine intelligence tracking — rumors, leaks, and announcements with source corroboration.
Flywheel effect: The more sources you add and the more you train the Cortex, the smarter the system gets. Each pipeline run builds on previous knowledge. ASR corrections improve transcripts. Glossary terms improve analysis. Entity profiles improve briefs.
| Concept | Definition |
|---|---|
| Source | A content feed to monitor: RSS, YouTube channel, podcast, website, or forum. Each source has a type, URL, check interval, and health tracking. |
| Item | A single piece of content discovered from a source. Progresses through statuses: pending, transcript_acquired, analyzed, failed. |
| Transcript | The text content of an item. Acquired via 7 tiers (free first, paid fallback). Raw transcript is cleaned during enrichment. |
| Analysis | Structured AI output for an item: summary, entities, topics, signals, claims, quotes, sentiment. Produced by Claude Haiku. |
| Signal | A categorized indicator detected in content: acquisition rumor, product launch, executive move, financial result, etc. |
| Claim | A specific assertion extracted from content with a confidence score. Predictions, rumors, and facts tracked for accountability. |
| Brief | A daily synthesis report generated by Claude Sonnet. Contains cross-source themes, signal alerts, trend alerts, and story angles. |
| Beat | A subject area / domain (e.g., pinball industry). Contains the system prompt, glossary, entities, signal taxonomy, and webhook config. |
| Cortex | The living knowledge base. Includes entity profiles, glossary, ASR corrections, contradictions, and training questions. Self-improves with each pipeline run. |
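The item lifecycle from the table can be modeled as a small state machine; the transition map below is inferred from the statuses listed and is an assumption about which moves are legal:

```python
from enum import Enum

class ItemStatus(str, Enum):
    PENDING = "pending"
    TRANSCRIPT_ACQUIRED = "transcript_acquired"
    ANALYZED = "analyzed"
    FAILED = "failed"

# Inferred transitions: any stage may fail; otherwise items only move forward.
TRANSITIONS = {
    ItemStatus.PENDING: {ItemStatus.TRANSCRIPT_ACQUIRED, ItemStatus.FAILED},
    ItemStatus.TRANSCRIPT_ACQUIRED: {ItemStatus.ANALYZED, ItemStatus.FAILED},
    ItemStatus.ANALYZED: set(),   # terminal
    ItemStatus.FAILED: set(),     # terminal
}

def advance(current: ItemStatus, new: ItemStatus) -> ItemStatus:
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```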
Covers pipeline, architecture, dashboard pages, Cortex, and workflow.