Add FastAPI backend with file-backed jobs, persistent storage, and comparison endpoints #2

Open. Copilot wants to merge 2 commits into main from copilot/add-fastapi-backend

Copilot AI commented Apr 14, 2026

Replaces the Streamlit-only architecture with an API-first backend using FastAPI + file-backed persistence under data/. No Redis/Celery — uses BackgroundTasks for simplicity with a clear upgrade path.

Core changes

  • src/ingestion/transcribe.py — module-level whisper.load_model() replaced with a thread-safe lazy loader (get_whisper_model()); import whisper itself is also deferred so the server starts fast
  • src/utils.py (new) — video_id_from_url() canonicalises YouTube video IDs across watch/youtu.be/shorts/embed URL forms
  • src/jobs.py (new) — file-backed job store persisting to data/jobs/{job_id}.json; job IDs are UUID4-validated before any file I/O to prevent path traversal
  • src/storage.py (new) — gzip-compressed persistence for transcripts, per-model summaries, and FAISS indexes; video_id and model_key are regex-validated before use in paths
  • app/main.py (new) — FastAPI app with:
    • POST /api/v1/ingest/youtube → enqueues job, returns {job_id} (202)
    • GET /api/v1/job/{job_id} → polls job status
    • GET /api/v1/summary/{video_id}?model=bart|t5 → returns stored summary + comparison_available flag
    • GET /api/v1/compare/{video_id} → returns both summaries with compression/timing metrics
    • GET /health
  • tests/test_api.py (new) — 9 TestClient tests; heavy ML deps (faiss, transformers, torch, whisper) stubbed via sys.modules so the suite runs without any models installed
  • requirements.txt — added fastapi, uvicorn[standard], python-multipart, httpx; removed duplicate and unused entries

Usage

pip install fastapi "uvicorn[standard]" python-multipart
uvicorn app.main:app --reload

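# Enqueue a job (uvicorn listens on http://127.0.0.1:8000 by default)
curl -X POST http://localhost:8000/api/v1/ingest/youtube \
  -H "Content-Type: application/json" \
  -d '{"url":"https://www.youtube.com/watch?v=VIDEO_ID","detail_level":"medium","model":"bart"}'
# → {"job_id": "..."}

# Poll until status == "done" (replace {job_id} with the returned ID)
curl http://localhost:8000/api/v1/job/{job_id}

# Fetch summary or compare both models (replace {video_id} with the video's ID)
curl "http://localhost:8000/api/v1/summary/{video_id}?model=bart"
curl "http://localhost:8000/api/v1/compare/{video_id}"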

Request fields (detail_level, model) use Literal type constraints; job errors are logged server-side only and not exposed in full to callers.
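
For scripted use, the "poll until done" step might look like the following sketch. It is framework-free: `fetch_status` is a hypothetical callable that returns the job JSON for a given ID (for example, a thin wrapper around a GET to /api/v1/job/{job_id}). The `"done"` status comes from the description above; the `"failed"` status name is an assumption and may differ in the real job store.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 600.0,
    failed_status: str = "failed",  # assumption: the real status name may differ
) -> dict:
    """Poll fetch_status(job_id) until the job is done, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        status = job.get("status")
        if status == "done":
            return job
        if status == failed_status:
            raise RuntimeError(f"job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not done after {timeout} s")
        time.sleep(interval)
```

Passing the fetch function in keeps the loop independent of any particular HTTP client and makes it easy to exercise without a running server.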

Original prompt

Add a FastAPI backend with a file-backed background job runner, persistent summary storage, and comparison endpoints for the TranscriptIQ project. This PR implements the backend pieces needed to replace the Streamlit UI with an API-first approach (React frontend will be added later). It keeps Redis/caching out of scope and uses a file-backed store under data/ so the work is runnable locally.

Relevant image (data retrieval flow): [image1]

Summary of changes to add (each file will be created or modified as indicated):

  1. Modify: src/ingestion/transcribe.py
  • Replace module-level Whisper model load with a thread-safe lazy loader (get_whisper_model()).
  • Provide transcribe_audio(audio_path) that uses the lazy loader.
  • Rationale: avoid heavy model loads at FastAPI import-time; allows server to start fast.
  2. Add: src/jobs.py
  • File-backed job store with create_job(), update_job(), get_job().
  • Jobs are saved to data/jobs/{job_id}.json and include status, progress, result, error, timestamps.
  • Rationale: simple job status persistence for BackgroundTasks.
  3. Add: src/storage.py
  • Helpers to save/load compressed transcripts and summaries under data/{video_id}/
    • save_transcript(video_id, text, meta)
    • load_transcript(video_id)
    • save_summary(video_id, model_key, summary_text, metrics)
    • load_summary(video_id, model_key)
    • summary_exists(video_id, model_key)
  • Persistence helpers for FAISS index and chunks under data/indexes/{video_id}.index and {video_id}_chunks.json:
    • save_index(video_id, index, chunks)
    • load_index_and_chunks(video_id)
  • Use gzip compression for transcript and summaries to reduce disk footprint.
  4. Add: app/main.py (FastAPI app)
  • Endpoints:
    • POST /api/v1/ingest/youtube -> accepts JSON { url, detail_level, model } and enqueues a background job; returns { job_id }
    • GET /api/v1/job/{job_id} -> returns job status JSON
    • GET /api/v1/summary/{video_id}?model=bart|t5 -> returns stored summary and comparison_available flag
    • GET /api/v1/compare/{video_id} -> returns both summaries and lightweight comparison metrics
    • GET /health -> simple health check
  • Background job runner (run_youtube_job) executes process_youtube_pipeline(), saves transcript, saves BART summary (pipeline returns it), optionally runs T5 if requested, builds FAISS index via build_vector_store(), persists index and chunk metadata, and updates job status.
  • The API uses the existing summarize_text and build_vector_store functions and stores results via src/storage.py.
  5. Add: tests/test_api.py
  • Minimal tests using FastAPI TestClient.
  • Mocks heavy functions (process_youtube_pipeline, build_vector_store) to assert endpoints return job_id and status endpoints exist.
  6. Minor helper: ensure src/utils.video_id_from_url exists (if not already) to canonicalize YouTube video ids used as storage keys. If already present, reuse it.

  7. Requirements note (developer should add these if missing):

  • fastapi
  • uvicorn[standard]
  • python-multipart (for future upload endpoint)
  • pytest-asyncio (optional for async tests)

Behavioral notes & constraints

  • No Redis/Celery in this PR — BackgroundTasks are used for simplicity. Add Celery later for resilience.
  • process_youtube_pipeline currently returns (text, source, summary, metrics) where summary is BART result; run T5 only if requested.
  • This PR persists artifacts under data/ so they survive restarts. The FAISS index is saved to data/indexes via faiss.write_index.
  • The job runner will capture exceptions and write traceback to the job file for debugging.
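
The job-store behaviour described in these notes could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of src/jobs.py: the `jobs_dir` parameter, the `run_job` helper, and the `"queued"`, `"running"`, and `"failed"` status names are hypothetical; only `"done"`, the record fields, and the data/jobs/{job_id}.json layout come from the text.

```python
import json
import time
import traceback
import uuid
from pathlib import Path
from typing import Callable, Optional


def _job_path(jobs_dir: Path, job_id: str) -> Path:
    # Accept only canonical UUID4 strings so the ID can never escape jobs_dir.
    parsed = uuid.UUID(job_id, version=4)  # raises ValueError on malformed input
    if str(parsed) != job_id:
        raise ValueError(f"not a canonical UUID4: {job_id!r}")
    return jobs_dir / f"{job_id}.json"


def _save(jobs_dir: Path, job_id: str, record: dict) -> None:
    jobs_dir.mkdir(parents=True, exist_ok=True)
    _job_path(jobs_dir, job_id).write_text(json.dumps(record, indent=2), encoding="utf-8")


def create_job(jobs_dir: Path) -> str:
    job_id = str(uuid.uuid4())
    now = time.time()
    _save(jobs_dir, job_id, {
        "job_id": job_id, "status": "queued", "progress": None,
        "result": None, "error": None, "created_at": now, "updated_at": now,
    })
    return job_id


def get_job(jobs_dir: Path, job_id: str) -> Optional[dict]:
    path = _job_path(jobs_dir, job_id)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def update_job(jobs_dir: Path, job_id: str, **fields) -> None:
    record = get_job(jobs_dir, job_id)
    if record is None:
        raise KeyError(job_id)
    record.update(fields)
    record["updated_at"] = time.time()
    _save(jobs_dir, job_id, record)


def run_job(jobs_dir: Path, job_id: str, work: Callable[[], dict]) -> None:
    """Run work() and record either its result or the full traceback."""
    update_job(jobs_dir, job_id, status="running")
    try:
        result = work()
    except Exception:
        update_job(jobs_dir, job_id, status="failed", error=traceback.format_exc())
    else:
        update_job(jobs_dir, job_id, status="done", result=result)
```

Catching the exception inside the runner matters because a background task has no caller to report to; the job file is the only place a poller can learn that the work failed.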

Testing and verification steps (for reviewer)

  1. Create branch: add/fastapi-backend
  2. Add the files below and modify src/ingestion/transcribe.py as specified.
  3. Install new deps: pip install fastapi "uvicorn[standard]" python-multipart
  4. Start server: uvicorn app.main:app --reload
  5. POST to /api/v1/ingest/youtube with JSON { "url": "https://www.youtube.com/watch?v=VIDEOID", "detail_level": "medium", "model": "bart" } -> receives job_id (202)
  6. Poll GET /api/v1/job/{job_id} until status == done (BackgroundTasks run in-process for demo). Confirm data/{video_id}/ contains transcript and summary files and data/indexes/ contains index files when available.
  7. GET /api/v1/summary/{video_id}?model=bart returns the stored summary and comparison_available flag.
  8. If both summaries exist, GET /api/v1/compare/{video_id} returns both and comparison metrics.

Why this PR is valuable now

  • Provides a clean backend API to replace Streamlit with React later.
  • Keeps changes scoped and reversible; uses file storage to avoid introducing Redis or external services during initial migration.
  • Enables further improvements: Redis cache, Celery worker, React frontend, Dockerization.

No files should be merged to main by the tool — please create this PR on a feature branch and do not merge. The PR should include the changed and new files listed above and a clear PR description (this problem statement can be used)...

This pull request was created from Copilot chat.

@shravan606756 self-requested a review April 14, 2026 19:03
Copilot AI changed the title from "[WIP] Add FastAPI backend for TranscriptIQ project" to "Add FastAPI backend with file-backed jobs, persistent storage, and comparison endpoints" Apr 14, 2026
@shravan606756 marked this pull request as ready for review April 20, 2026 13:41
Copilot AI review requested due to automatic review settings April 20, 2026 13:41

Copilot AI left a comment

Pull request overview

Adds an API-first backend to TranscriptIQ using FastAPI, introducing file-backed persistence for jobs and artifacts under data/ to support asynchronous ingestion, stored summaries, and model comparison without Redis/Celery.

Changes:

  • Introduces a FastAPI app with ingestion/job polling/summary/compare/health endpoints.
  • Adds file-backed persistence helpers for jobs (data/jobs/*.json) and artifacts (gzip transcripts/summaries + FAISS index/chunks).
  • Updates Whisper transcription to use a thread-safe lazy model loader to avoid import-time model loading.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| app/main.py | Defines the FastAPI app, request models, background job runner, and REST endpoints. |
| src/jobs.py | Implements a file-backed job store for BackgroundTasks with UUID4 job IDs. |
| src/storage.py | Adds gzip JSON persistence for transcripts/summaries and save/load helpers for FAISS index + chunks. |
| src/utils.py | Adds YouTube URL parsing helper to canonicalize video_id storage keys. |
| src/ingestion/transcribe.py | Switches Whisper model initialization to a thread-safe lazy loader. |
| tests/test_api.py | Adds TestClient-based API tests with heavy ML deps stubbed via sys.modules. |
| requirements.txt | Adds FastAPI/uvicorn/python-multipart/httpx and removes duplicates. |

Comment thread tests/test_api.py
Comment on lines +14 to +15
import pytest


Copilot AI Apr 20, 2026

pytest is imported but not used in this test module. It’s safe to drop the import (the tmp_path / monkeypatch fixtures still work without importing pytest).

Suggested change
- import pytest

Comment thread app/main.py
Comment on lines +122 to +137
text, source, bart_summary, bart_metrics = process_youtube_pipeline(
    url, detail_level
)

# --- Step 2: persist transcript ----------------------------------------
update_job(job_id, progress="Saving transcript…")
save_transcript(video_id, text, meta={"source": source, "url": url})

# --- Step 3: BART summary (returned by pipeline) ----------------------
update_job(job_id, progress="Saving BART summary…")
save_summary(video_id, "bart-large-cnn", bart_summary, bart_metrics)

# --- Step 3b: T5 summary (on request) ---------------------------------
requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")
run_t5 = model in ("t5", "both")
if run_t5 and not summary_exists(video_id, "t5-base"):

Copilot AI Apr 20, 2026

The model request field allows "t5", but run_youtube_job() always runs process_youtube_pipeline() and saves a BART summary regardless (and only conditionally adds T5). If the API contract is that model=t5 means “only T5”, this is a behavior mismatch; if BART is always produced, consider renaming the field (e.g., extra_model / compare_with) or documenting/encoding it so callers can't request an unsupported mode.

Comment thread app/main.py
Comment on lines +212 to +218
model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")
data = load_summary(video_id, model_key)
if data is None:
    raise HTTPException(status_code=404, detail="Summary not found")

other_key = "t5-base" if model_key == "bart-large-cnn" else "bart-large-cnn"
comparison_available = summary_exists(video_id, other_key)

Copilot AI Apr 20, 2026

load_summary() can raise ValueError for an unsafe video_id (via src/storage._validate_id). In this handler that would bubble up as a 500 instead of a client error. Consider catching ValueError around load_summary/summary_exists and returning an HTTP 400 with a clear message (same for compare_summaries).

Comment thread app/main.py
Comment on lines +234 to +236
bart_data = load_summary(video_id, "bart-large-cnn")
t5_data = load_summary(video_id, "t5-base")


Copilot AI Apr 20, 2026

compare_summaries() calls load_summary(video_id, ...) directly; src/storage validates IDs and will raise ValueError for malformed/unsafe video_id, which would return a 500. Consider catching ValueError and returning 400 so invalid IDs don't become server errors.

Comment thread src/utils.py
Comment on lines +17 to +34
parsed = urlparse(url)

# youtu.be short links
if parsed.netloc in ("youtu.be", "www.youtu.be"):
    vid = parsed.path.lstrip("/").split("/")[0]
    if vid:
        return vid

# Standard watch URL (?v=...)
if parsed.path == "/watch":
    qs = parse_qs(parsed.query)
    if "v" in qs:
        return qs["v"][0]

# /shorts/<id>, /embed/<id>, /v/<id>
m = re.match(r"^/(?:shorts|embed|v)/([A-Za-z0-9_-]+)", parsed.path)
if m:
    return m.group(1)

Copilot AI Apr 20, 2026

video_id_from_url() will accept non-YouTube hosts for /watch?v=... and /shorts|embed|v/... paths, and it returns the extracted segment without validating it against an allowed character/length set. This means URLs like https://example.com/watch?v=... pass request validation and can later fail in storage (or create inconsistent IDs). Consider (1) validating parsed.netloc against known YouTube domains for all URL forms, and (2) validating the extracted ID (e.g., ^[A-Za-z0-9_-]{11}$ or the same safe regex used by src/storage.py).
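
One way the two suggestions could be combined is sketched below. It is an illustration only: the host allow-list, the decision to return `None` for anything unrecognised, and the use of the 11-character pattern (rather than the regex in src/storage.py, which is not shown here) are all assumptions.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumed allow-list; the real project may want more (or fewer) hosts.
_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
_SHORT_HOSTS = {"youtu.be", "www.youtu.be"}
_VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def video_id_from_url(url: str) -> Optional[str]:
    """Return the video ID for a recognised YouTube URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    candidate = None
    if host in _SHORT_HOSTS:
        # youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # /watch?v=<id>
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            # /shorts/<id>, /embed/<id>, /v/<id>
            m = re.match(r"^/(?:shorts|embed|v)/([^/]+)", parsed.path)
            candidate = m.group(1) if m else None

    # Validate the extracted segment before it is ever used as a storage key.
    if candidate and _VIDEO_ID_RE.fullmatch(candidate):
        return candidate
    return None
```

With this shape, a URL on an unrelated host or an ID of the wrong length is rejected at request time instead of surfacing later as a storage error.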

Comment thread src/jobs.py
Comment on lines +109 to +110
with open(path, "w", encoding="utf-8") as fh:
    json.dump(record, fh, indent=2)

Copilot AI Apr 20, 2026

Job file writes in _save() are not atomic. Since get_job_status can read the JSON while a background task is updating it, readers can intermittently hit a partially-written file and fail with JSONDecodeError (500s). Consider writing to a temp file in the same directory and using os.replace() for an atomic rename (and optionally add a simple file lock) to avoid torn reads.

Suggested change
- with open(path, "w", encoding="utf-8") as fh:
-     json.dump(record, fh, indent=2)
+ temp_path = path.with_name(f".{path.name}.{uuid.uuid4().hex}.tmp")
+ try:
+     with open(temp_path, "w", encoding="utf-8") as fh:
+         json.dump(record, fh, indent=2)
+         fh.flush()
+         os.fsync(fh.fileno())
+     os.replace(temp_path, path)
+ finally:
+     if temp_path.exists():
+         temp_path.unlink()

Comment thread src/jobs.py
"""

import json
import os

Copilot AI Apr 20, 2026

os is imported but unused in this module. Removing it avoids dead imports and keeps the module tidy (and helps if stricter linting is enabled later).

Suggested change
- import os

Comment thread app/main.py
save_summary(video_id, "bart-large-cnn", bart_summary, bart_metrics)

# --- Step 3b: T5 summary (on request) ---------------------------------
requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")

Copilot AI Apr 20, 2026

requested_model_key is computed but never used. This is dead code and can be removed, or (if intended) used to drive which summaries are generated/saved to match the requested model value.

Suggested change
- requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")

Comment thread app/main.py
Comment on lines +203 to +213
@app.get("/api/v1/summary/{video_id}", tags=["summaries"])
def get_summary(video_id: str, model: str = "bart"):
    """
    Return the stored summary for *video_id*.

    Query params
    ------------
    model : "bart" (default) | "t5"
    """
    model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")
    data = load_summary(video_id, model_key)

Copilot AI Apr 20, 2026

get_summary() treats the model query param as an arbitrary str and falls back to BART for unknown values (_MODEL_KEY_MAP.get(model, "bart-large-cnn")). This can silently mask client errors (e.g., model=foo returns BART). Consider constraining model to Literal["bart","t5"] (or an Enum) so FastAPI returns 422 for invalid values, matching the documented contract.
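
The difference between the lenient lookup and a strict one can be shown without FastAPI. In the sketch below, `ModelName` and `resolve_model_key` are hypothetical names, and the contents of `_MODEL_KEY_MAP` are inferred from the excerpt rather than copied from app/main.py. In the real handler, annotating the query parameter with the same `Literal` type would let FastAPI perform this check during request validation.

```python
from typing import Literal, get_args

ModelName = Literal["bart", "t5"]

# Assumed mapping, inferred from the keys used in the reviewed excerpt.
_MODEL_KEY_MAP = {"bart": "bart-large-cnn", "t5": "t5-base"}


def resolve_model_key(model: str) -> str:
    """Map a public model name to its storage key, rejecting unknown names."""
    if model not in get_args(ModelName):
        allowed = ", ".join(get_args(ModelName))
        raise ValueError(f"unknown model {model!r}; expected one of: {allowed}")
    return _MODEL_KEY_MAP[model]


# Lenient lookup, as in the reviewed code: a typo silently becomes BART.
lenient = _MODEL_KEY_MAP.get("foo", "bart-large-cnn")
```

Here `lenient` ends up as the BART key even though `"foo"` was requested, while `resolve_model_key("foo")` raises, which is the behaviour a client with a typo in its query string would want to see.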

Comment thread src/storage.py
Comment on lines +106 to +107
safe_key = model_key.replace("/", "_").replace("\\", "_")
return _DATA_DIR / video_id / f"summary_{safe_key}.json.gz"

Copilot AI Apr 20, 2026

_summary_path() validates model_key with _SAFE_MODEL_RE, which currently forbids / and \, but then safe_key = model_key.replace("/", "_").replace("\\", "_") is still applied. This sanitization is currently unreachable and can be removed, or the regex can be relaxed to allow / (and rely on the replacement) if you want to support HF-style names like org/model safely.

Suggested change
- safe_key = model_key.replace("/", "_").replace("\\", "_")
- return _DATA_DIR / video_id / f"summary_{safe_key}.json.gz"
+ return _DATA_DIR / video_id / f"summary_{model_key}.json.gz"
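
For context, validate-then-build path handling combined with gzip JSON persistence might look like the sketch below. It is not the code in src/storage.py: the two regex patterns, the explicit `data_dir` parameter, and the stored JSON shape are assumptions made so the example is self-contained.

```python
import gzip
import json
import re
from pathlib import Path
from typing import Optional

# Hypothetical patterns; both exclude "/" and "\\" so no replacement step is needed.
_SAFE_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")
_SAFE_MODEL_RE = re.compile(r"[A-Za-z0-9._-]{1,64}")


def summary_path(data_dir: Path, video_id: str, model_key: str) -> Path:
    """Build the summary path only after both components pass validation."""
    if not _SAFE_ID_RE.fullmatch(video_id):
        raise ValueError(f"unsafe video_id: {video_id!r}")
    if not _SAFE_MODEL_RE.fullmatch(model_key):
        raise ValueError(f"unsafe model_key: {model_key!r}")
    return data_dir / video_id / f"summary_{model_key}.json.gz"


def save_summary(data_dir: Path, video_id: str, model_key: str,
                 summary_text: str, metrics: dict) -> None:
    path = summary_path(data_dir, video_id, model_key)
    path.parent.mkdir(parents=True, exist_ok=True)
    # "wt" opens the gzip stream in text mode so json.dump can write str.
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump({"summary": summary_text, "metrics": metrics}, fh)


def load_summary(data_dir: Path, video_id: str, model_key: str) -> Optional[dict]:
    path = summary_path(data_dir, video_id, model_key)
    if not path.exists():
        return None
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Since validation rejects separators outright, an HF-style name such as org/model raises `ValueError` here; supporting such names would mean relaxing the pattern and keeping a replacement step, as the comment above notes.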
