Add FastAPI backend with file-backed jobs, persistent storage, and comparison endpoints by Copilot · Pull Request #2 · shravan606756/TranscriptIQ

Copilot · 2026-04-14T18:53:05Z

Replaces the Streamlit-only architecture with an API-first backend using FastAPI + file-backed persistence under data/. No Redis/Celery — uses BackgroundTasks for simplicity with a clear upgrade path.

Core changes

src/ingestion/transcribe.py — module-level whisper.load_model() replaced with a thread-safe lazy loader (get_whisper_model()); import whisper itself is also deferred so the server starts fast
src/utils.py (new) — video_id_from_url() canonicalises YouTube video IDs across watch/youtu.be/shorts/embed URL forms
src/jobs.py (new) — file-backed job store persisting to data/jobs/{job_id}.json; job IDs are UUID4-validated before any file I/O to prevent path traversal
src/storage.py (new) — gzip-compressed persistence for transcripts, per-model summaries, and FAISS indexes; video_id and model_key are regex-validated before use in paths
app/main.py (new) — FastAPI app with:
- POST /api/v1/ingest/youtube → enqueues job, returns {job_id} (202)
- GET /api/v1/job/{job_id} → polls job status
- GET /api/v1/summary/{video_id}?model=bart|t5 → returns stored summary + comparison_available flag
- GET /api/v1/compare/{video_id} → returns both summaries with compression/timing metrics
- GET /health
tests/test_api.py (new) — 9 TestClient tests; heavy ML deps (faiss, transformers, torch, whisper) stubbed via sys.modules so the suite runs without any models installed
requirements.txt — added fastapi, uvicorn[standard], python-multipart, httpx; removed duplicate and unused entries

Usage

pip install fastapi "uvicorn[standard]" python-multipart
uvicorn app.main:app --reload

# Enqueue a job
curl -X POST /api/v1/ingest/youtube \
  -H "Content-Type: application/json" \
  -d '{"url":"https://www.youtube.com/watch?v=VIDEO_ID","detail_level":"medium","model":"bart"}'
# → {"job_id": "..."}

# Poll until status == "done"
curl /api/v1/job/{job_id}

# Fetch summary or compare both models
curl "/api/v1/summary/{video_id}?model=bart"
curl "/api/v1/compare/{video_id}"

Request fields (detail_level, model) use Literal type constraints; job errors are logged server-side only and not exposed in full to callers.

Original prompt

Add a FastAPI backend with a file-backed background job runner, persistent summary storage, and comparison endpoints for the TranscriptIQ project. This PR implements the backend pieces needed to replace the Streamlit UI with an API-first approach (React frontend will be added later). It keeps Redis/caching out of scope and uses a file-backed store under data/ so the work is runnable locally.

Relevant image (data retrieval flow):

Summary of changes to add (each file will be created or modified as indicated):

Modify: src/ingestion/transcribe.py

Replace module-level Whisper model load with a thread-safe lazy loader (get_whisper_model()).
Provide transcribe_audio(audio_path) that uses the lazy loader.
Rationale: avoid heavy model loads at FastAPI import-time; allows server to start fast.

Add: src/jobs.py

File-backed job store with create_job(), update_job(), get_job().
Jobs are saved to data/jobs/{job_id}.json and include status, progress, result, error, timestamps.
Rationale: simple job status persistence for BackgroundTasks.

Add: src/storage.py

Helpers to save/load compressed transcripts and summaries under data/{video_id}/
- save_transcript(video_id, text, meta)
- load_transcript(video_id)
- save_summary(video_id, model_key, summary_text, metrics)
- load_summary(video_id, model_key)
- summary_exists(video_id, model_key)
Persistence helpers for FAISS index and chunks under data/indexes/{video_id}.index and {video_id}_chunks.json:
- save_index(video_id, index, chunks)
- load_index_and_chunks(video_id)
Use gzip compression for transcript and summaries to reduce disk footprint.

Add: app/main.py (FastAPI app)

Endpoints:
- POST /api/v1/ingest/youtube -> accepts JSON { url, detail_level, model } and enqueues a background job; returns { job_id }
- GET /api/v1/job/{job_id} -> returns job status JSON
- GET /api/v1/summary/{video_id}?model=bart|t5 -> returns stored summary and comparison_available flag
- GET /api/v1/compare/{video_id} -> returns both summaries and lightweight comparison metrics
- GET /health -> simple health check
Background job runner (run_youtube_job) executes process_youtube_pipeline(), saves transcript, saves BART summary (pipeline returns it), optionally runs T5 if requested, builds FAISS index via build_vector_store(), persists index and chunk metadata, and updates job status.
The API uses the existing summarize_text and build_vector_store functions and stores results via src/storage.py.

Add: tests/test_api.py

Minimal tests using FastAPI TestClient.
Mocks heavy functions (process_youtube_pipeline, build_vector_store) to assert endpoints return job_id and status endpoints exist.

Minor helper: ensure src/utils.video_id_from_url exists (if not already) to canonicalize YouTube video ids used as storage keys. If already present, reuse it.
Requirements note (developer should add these if missing):

fastapi
uvicorn[standard]
python-multipart (for future upload endpoint)
pytest-asyncio (optional for async tests)

Behavioral notes & constraints

No Redis/Celery in this PR — BackgroundTasks are used for simplicity. Add Celery later for resilience.
process_youtube_pipeline currently returns (text, source, summary, metrics) where summary is BART result; run T5 only if requested.
This PR persists artifacts under data/ so they survive restarts. The FAISS index is saved to data/indexes via faiss.write_index.
The job runner will capture exceptions and write traceback to the job file for debugging.

Testing and verification steps (for reviewer)

Create branch: add/fastapi-backend
Add the files below and modify src/ingestion/transcribe.py as specified.
Install new deps: pip install fastapi "uvicorn[standard]" python-multipart
Start server: uvicorn app.main:app --reload
POST to /api/v1/ingest/youtube with JSON { "url": "https://www.youtube.com/watch?v=VIDEOID", "detail_level": "medium", "model": "bart" } -> receives job_id (202)
Poll GET /api/v1/job/{job_id} until status == done (BackgroundTasks run in-process for demo). Confirm data/{video_id}/ contains transcript and summary files and data/indexes/ contains index files when available.
GET /api/v1/summary/{video_id}?model=bart returns the stored summary and comparison_available flag.
If both summaries exist, GET /api/v1/compare/{video_id} returns both and comparison metrics.

Why this PR is valuable now

Provides a clean backend API to replace Streamlit with React later.
Keeps changes scoped and reversible; uses file storage to avoid introducing Redis or external services during initial migration.
Enables further improvements: Redis cache, Celery worker, React frontend, Dockerization.

No files should be merged to main by the tool — please create this PR on a feature branch and do not merge. The PR should include the changed and new files listed above and a clear PR description (this problem statement can be used)...

This pull request was created from Copilot chat.

Agent-Logs-Url: https://github.com/shravan606756/TranscriptIQ/sessions/e749aaa4-58b9-4033-ac39-82f7022e82ab Co-authored-by: shravan606756 <185698838+shravan606756@users.noreply.github.com>

Copilot

Pull request overview

Adds an API-first backend to TranscriptIQ using FastAPI, introducing file-backed persistence for jobs and artifacts under data/ to support asynchronous ingestion, stored summaries, and model comparison without Redis/Celery.

Changes:

Introduces a FastAPI app with ingestion/job polling/summary/compare/health endpoints.
Adds file-backed persistence helpers for jobs (data/jobs/*.json) and artifacts (gzip transcripts/summaries + FAISS index/chunks).
Updates Whisper transcription to use a thread-safe lazy model loader to avoid import-time model loading.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
`app/main.py`	Defines the FastAPI app, request models, background job runner, and REST endpoints.
`src/jobs.py`	Implements a file-backed job store for BackgroundTasks with UUID4 job IDs.
`src/storage.py`	Adds gzip JSON persistence for transcripts/summaries and save/load helpers for FAISS index + chunks.
`src/utils.py`	Adds YouTube URL parsing helper to canonicalize `video_id` storage keys.
`src/ingestion/transcribe.py`	Switches Whisper model initialization to a thread-safe lazy loader.
`tests/test_api.py`	Adds TestClient-based API tests with heavy ML deps stubbed via `sys.modules`.
`requirements.txt`	Adds FastAPI/uvicorn/python-multipart/httpx and removes duplicates.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-20T13:48:45Z

+import pytest
+


pytest is imported but not used in this test module. It’s safe to drop the import (the tmp_path / monkeypatch fixtures still work without importing pytest).

Suggested change

import pytest

Copilot · 2026-04-20T13:48:46Z

+        text, source, bart_summary, bart_metrics = process_youtube_pipeline(
+            url, detail_level
+        )
+
+        # --- Step 2: persist transcript ----------------------------------------
+        update_job(job_id, progress="Saving transcript…")
+        save_transcript(video_id, text, meta={"source": source, "url": url})
+
+        # --- Step 3: BART summary (returned by pipeline) ----------------------
+        update_job(job_id, progress="Saving BART summary…")
+        save_summary(video_id, "bart-large-cnn", bart_summary, bart_metrics)
+
+        # --- Step 3b: T5 summary (on request) ---------------------------------
+        requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")
+        run_t5 = model in ("t5", "both")
+        if run_t5 and not summary_exists(video_id, "t5-base"):


The model request field allows "t5", but run_youtube_job() always runs process_youtube_pipeline() and saves a BART summary regardless (and only conditionally adds T5). If the API contract is that model=t5 means “only T5”, this is a behavior mismatch; if BART is always produced, consider renaming the field (e.g., extra_model / compare_with) or documenting/encoding it so callers can't request an unsupported mode.

Copilot · 2026-04-20T13:48:46Z

+    model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")
+    data = load_summary(video_id, model_key)
+    if data is None:
+        raise HTTPException(status_code=404, detail="Summary not found")
+
+    other_key = "t5-base" if model_key == "bart-large-cnn" else "bart-large-cnn"
+    comparison_available = summary_exists(video_id, other_key)


load_summary() can raise ValueError for an unsafe video_id (via src/storage._validate_id). In this handler that would bubble up as a 500 instead of a client error. Consider catching ValueError around load_summary/summary_exists and returning an HTTP 400 with a clear message (same for compare_summaries).

Copilot · 2026-04-20T13:48:46Z

+    bart_data = load_summary(video_id, "bart-large-cnn")
+    t5_data = load_summary(video_id, "t5-base")
+


compare_summaries() calls load_summary(video_id, ...) directly; src/storage validates IDs and will raise ValueError for malformed/unsafe video_id, which would return a 500. Consider catching ValueError and returning 400 so invalid IDs don't become server errors.

Copilot · 2026-04-20T13:48:47Z

+    parsed = urlparse(url)
+
+    # youtu.be short links
+    if parsed.netloc in ("youtu.be", "www.youtu.be"):
+        vid = parsed.path.lstrip("/").split("/")[0]
+        if vid:
+            return vid
+
+    # Standard watch URL  (?v=...)
+    if parsed.path == "/watch":
+        qs = parse_qs(parsed.query)
+        if "v" in qs:
+            return qs["v"][0]
+
+    # /shorts/<id>, /embed/<id>, /v/<id>
+    m = re.match(r"^/(?:shorts|embed|v)/([A-Za-z0-9_-]+)", parsed.path)
+    if m:
+        return m.group(1)


video_id_from_url() will accept non-YouTube hosts for /watch?v=... and /shorts|embed|v/... paths, and it returns the extracted segment without validating it against an allowed character/length set. This means URLs like https://example.com/watch?v=... pass request validation and can later fail in storage (or create inconsistent IDs). Consider (1) validating parsed.netloc against known YouTube domains for all URL forms, and (2) validating the extracted ID (e.g., ^[A-Za-z0-9_-]{11}$ or the same safe regex used by src/storage.py).

Copilot · 2026-04-20T13:48:47Z

+    with open(path, "w", encoding="utf-8") as fh:
+        json.dump(record, fh, indent=2)


Job file writes in _save() are not atomic. Since get_job_status can read the JSON while a background task is updating it, readers can intermittently hit a partially-written file and fail with JSONDecodeError (500s). Consider writing to a temp file in the same directory and using os.replace() for an atomic rename (and optionally add a simple file lock) to avoid torn reads.

Suggested change

with open(path, "w", encoding="utf-8") as fh:

json.dump(record, fh, indent=2)

temp_path = path.with_name(f".{path.name}.{uuid.uuid4().hex}.tmp")

try:

with open(temp_path, "w", encoding="utf-8") as fh:

json.dump(record, fh, indent=2)

fh.flush()

os.fsync(fh.fileno())

os.replace(temp_path, path)

finally:

if temp_path.exists():

temp_path.unlink()

Copilot · 2026-04-20T13:48:47Z

+"""
+
+import json
+import os


os is imported but unused in this module. Removing it avoids dead imports and keeps the module tidy (and helps if stricter linting is enabled later).

Suggested change

import os

Copilot · 2026-04-20T13:48:48Z

+        save_summary(video_id, "bart-large-cnn", bart_summary, bart_metrics)
+
+        # --- Step 3b: T5 summary (on request) ---------------------------------
+        requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")


requested_model_key is computed but never used. This is dead code and can be removed, or (if intended) used to drive which summaries are generated/saved to match the requested model value.

Suggested change

requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")

Copilot · 2026-04-20T13:48:48Z

+@app.get("/api/v1/summary/{video_id}", tags=["summaries"])
+def get_summary(video_id: str, model: str = "bart"):
+    """
+    Return the stored summary for *video_id*.
+
+    Query params
+    ------------
+    model : "bart" (default) | "t5"
+    """
+    model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn")
+    data = load_summary(video_id, model_key)


get_summary() treats the model query param as an arbitrary str and falls back to BART for unknown values (_MODEL_KEY_MAP.get(model, "bart-large-cnn")). This can silently mask client errors (e.g., model=foo returns BART). Consider constraining model to Literal["bart","t5"] (or an Enum) so FastAPI returns 422 for invalid values, matching the documented contract.

Copilot · 2026-04-20T13:48:48Z

+    safe_key = model_key.replace("/", "_").replace("\\", "_")
+    return _DATA_DIR / video_id / f"summary_{safe_key}.json.gz"


_summary_path() validates model_key with _SAFE_MODEL_RE, which currently forbids / and \, but then safe_key = model_key.replace("/", "_").replace("\\", "_") is still applied. This sanitization is currently unreachable and can be removed, or the regex can be relaxed to allow / (and rely on the replacement) if you want to support HF-style names like org/model safely.

Suggested change

safe_key = model_key.replace("/", "_").replace("\\", "_")

return _DATA_DIR / video_id / f"summary_{safe_key}.json.gz"

return _DATA_DIR / video_id / f"summary_{model_key}.json.gz"

Initial plan

998fa90

Copilot AI assigned Copilot and shravan606756 Apr 14, 2026

Copilot started work on behalf of shravan606756 April 14, 2026 18:53 View session

Add FastAPI backend: lazy Whisper, jobs, storage, main API, and tests

3f08b8a

Agent-Logs-Url: https://github.com/shravan606756/TranscriptIQ/sessions/e749aaa4-58b9-4033-ac39-82f7022e82ab Co-authored-by: shravan606756 <185698838+shravan606756@users.noreply.github.com>

shravan606756 unassigned Copilot Apr 14, 2026

shravan606756 self-requested a review April 14, 2026 19:03

Copilot AI changed the title ~~[WIP] Add FastAPI backend for TranscriptIQ project~~ Add FastAPI backend with file-backed jobs, persistent storage, and comparison endpoints Apr 14, 2026

Copilot finished work on behalf of shravan606756 April 14, 2026 19:03

shravan606756 marked this pull request as ready for review April 20, 2026 13:41

Copilot AI review requested due to automatic review settings April 20, 2026 13:41

Copilot started reviewing on behalf of shravan606756 April 20, 2026 13:41 View session

Copilot AI reviewed Apr 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add FastAPI backend with file-backed jobs, persistent storage, and comparison endpoints#2

Add FastAPI backend with file-backed jobs, persistent storage, and comparison endpoints#2
Copilot wants to merge 2 commits into
mainfrom
copilot/add-fastapi-backend

Copilot AI commented Apr 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		bart_data = load_summary(video_id, "bart-large-cnn")
		t5_data = load_summary(video_id, "t5-base")

		with open(path, "w", encoding="utf-8") as fh:
		json.dump(record, fh, indent=2)

-    with open(path, "w", encoding="utf-8") as fh:
-        json.dump(record, fh, indent=2)
+    temp_path = path.with_name(f".{path.name}.{uuid.uuid4().hex}.tmp")
+    try:
+        with open(temp_path, "w", encoding="utf-8") as fh:
+            json.dump(record, fh, indent=2)
+            fh.flush()
+            os.fsync(fh.fileno())
+        os.replace(temp_path, path)
+    finally:
+        if temp_path.exists():
+            temp_path.unlink()

		safe_key = model_key.replace("/", "_").replace("\\", "_")
		return _DATA_DIR / video_id / f"summary_{safe_key}.json.gz"

Conversation

Copilot AI commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Core changes

Usage

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Apr 14, 2026 •

edited

Loading