Add FastAPI backend with file-backed jobs, persistent storage, and comparison endpoints#2
Add FastAPI backend with file-backed jobs, persistent storage, and comparison endpoints#2Copilot wants to merge 2 commits into
Conversation
Agent-Logs-Url: https://github.com/shravan606756/TranscriptIQ/sessions/e749aaa4-58b9-4033-ac39-82f7022e82ab Co-authored-by: shravan606756 <185698838+shravan606756@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds an API-first backend to TranscriptIQ using FastAPI, introducing file-backed persistence for jobs and artifacts under data/ to support asynchronous ingestion, stored summaries, and model comparison without Redis/Celery.
Changes:
- Introduces a FastAPI app with ingestion/job polling/summary/compare/health endpoints.
- Adds file-backed persistence helpers for jobs (
data/jobs/*.json) and artifacts (gzip transcripts/summaries + FAISS index/chunks). - Updates Whisper transcription to use a thread-safe lazy model loader to avoid import-time model loading.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
app/main.py |
Defines the FastAPI app, request models, background job runner, and REST endpoints. |
src/jobs.py |
Implements a file-backed job store for BackgroundTasks with UUID4 job IDs. |
src/storage.py |
Adds gzip JSON persistence for transcripts/summaries and save/load helpers for FAISS index + chunks. |
src/utils.py |
Adds YouTube URL parsing helper to canonicalize video_id storage keys. |
src/ingestion/transcribe.py |
Switches Whisper model initialization to a thread-safe lazy loader. |
tests/test_api.py |
Adds TestClient-based API tests with heavy ML deps stubbed via sys.modules. |
requirements.txt |
Adds FastAPI/uvicorn/python-multipart/httpx and removes duplicates. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| import pytest | ||
|
|
There was a problem hiding this comment.
pytest is imported but not used in this test module. It’s safe to drop the import (the tmp_path / monkeypatch fixtures still work without importing pytest).
| import pytest |
| text, source, bart_summary, bart_metrics = process_youtube_pipeline( | ||
| url, detail_level | ||
| ) | ||
|
|
||
| # --- Step 2: persist transcript ---------------------------------------- | ||
| update_job(job_id, progress="Saving transcript…") | ||
| save_transcript(video_id, text, meta={"source": source, "url": url}) | ||
|
|
||
| # --- Step 3: BART summary (returned by pipeline) ---------------------- | ||
| update_job(job_id, progress="Saving BART summary…") | ||
| save_summary(video_id, "bart-large-cnn", bart_summary, bart_metrics) | ||
|
|
||
| # --- Step 3b: T5 summary (on request) --------------------------------- | ||
| requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn") | ||
| run_t5 = model in ("t5", "both") | ||
| if run_t5 and not summary_exists(video_id, "t5-base"): |
There was a problem hiding this comment.
The model request field allows "t5", but run_youtube_job() always runs process_youtube_pipeline() and saves a BART summary regardless (and only conditionally adds T5). If the API contract is that model=t5 means “only T5”, this is a behavior mismatch; if BART is always produced, consider renaming the field (e.g., extra_model / compare_with) or documenting/encoding it so callers can't request an unsupported mode.
| model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn") | ||
| data = load_summary(video_id, model_key) | ||
| if data is None: | ||
| raise HTTPException(status_code=404, detail="Summary not found") | ||
|
|
||
| other_key = "t5-base" if model_key == "bart-large-cnn" else "bart-large-cnn" | ||
| comparison_available = summary_exists(video_id, other_key) |
There was a problem hiding this comment.
load_summary() can raise ValueError for an unsafe video_id (via src/storage._validate_id). In this handler that would bubble up as a 500 instead of a client error. Consider catching ValueError around load_summary/summary_exists and returning an HTTP 400 with a clear message (same for compare_summaries).
| bart_data = load_summary(video_id, "bart-large-cnn") | ||
| t5_data = load_summary(video_id, "t5-base") | ||
|
|
There was a problem hiding this comment.
compare_summaries() calls load_summary(video_id, ...) directly; src/storage validates IDs and will raise ValueError for malformed/unsafe video_id, which would return a 500. Consider catching ValueError and returning 400 so invalid IDs don't become server errors.
| parsed = urlparse(url) | ||
|
|
||
| # youtu.be short links | ||
| if parsed.netloc in ("youtu.be", "www.youtu.be"): | ||
| vid = parsed.path.lstrip("/").split("/")[0] | ||
| if vid: | ||
| return vid | ||
|
|
||
| # Standard watch URL (?v=...) | ||
| if parsed.path == "/watch": | ||
| qs = parse_qs(parsed.query) | ||
| if "v" in qs: | ||
| return qs["v"][0] | ||
|
|
||
| # /shorts/<id>, /embed/<id>, /v/<id> | ||
| m = re.match(r"^/(?:shorts|embed|v)/([A-Za-z0-9_-]+)", parsed.path) | ||
| if m: | ||
| return m.group(1) |
There was a problem hiding this comment.
video_id_from_url() will accept non-YouTube hosts for /watch?v=... and /shorts|embed|v/... paths, and it returns the extracted segment without validating it against an allowed character/length set. This means URLs like https://example.com/watch?v=... pass request validation and can later fail in storage (or create inconsistent IDs). Consider (1) validating parsed.netloc against known YouTube domains for all URL forms, and (2) validating the extracted ID (e.g., ^[A-Za-z0-9_-]{11}$ or the same safe regex used by src/storage.py).
| with open(path, "w", encoding="utf-8") as fh: | ||
| json.dump(record, fh, indent=2) |
There was a problem hiding this comment.
Job file writes in _save() are not atomic. Since get_job_status can read the JSON while a background task is updating it, readers can intermittently hit a partially-written file and fail with JSONDecodeError (500s). Consider writing to a temp file in the same directory and using os.replace() for an atomic rename (and optionally add a simple file lock) to avoid torn reads.
| with open(path, "w", encoding="utf-8") as fh: | |
| json.dump(record, fh, indent=2) | |
| temp_path = path.with_name(f".{path.name}.{uuid.uuid4().hex}.tmp") | |
| try: | |
| with open(temp_path, "w", encoding="utf-8") as fh: | |
| json.dump(record, fh, indent=2) | |
| fh.flush() | |
| os.fsync(fh.fileno()) | |
| os.replace(temp_path, path) | |
| finally: | |
| if temp_path.exists(): | |
| temp_path.unlink() |
| """ | ||
|
|
||
| import json | ||
| import os |
There was a problem hiding this comment.
os is imported but unused in this module. Removing it avoids dead imports and keeps the module tidy (and helps if stricter linting is enabled later).
| import os |
| save_summary(video_id, "bart-large-cnn", bart_summary, bart_metrics) | ||
|
|
||
| # --- Step 3b: T5 summary (on request) --------------------------------- | ||
| requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn") |
There was a problem hiding this comment.
requested_model_key is computed but never used. This is dead code and can be removed, or (if intended) used to drive which summaries are generated/saved to match the requested model value.
| requested_model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn") |
| @app.get("/api/v1/summary/{video_id}", tags=["summaries"]) | ||
| def get_summary(video_id: str, model: str = "bart"): | ||
| """ | ||
| Return the stored summary for *video_id*. | ||
|
|
||
| Query params | ||
| ------------ | ||
| model : "bart" (default) | "t5" | ||
| """ | ||
| model_key = _MODEL_KEY_MAP.get(model, "bart-large-cnn") | ||
| data = load_summary(video_id, model_key) |
There was a problem hiding this comment.
get_summary() treats the model query param as an arbitrary str and falls back to BART for unknown values (_MODEL_KEY_MAP.get(model, "bart-large-cnn")). This can silently mask client errors (e.g., model=foo returns BART). Consider constraining model to Literal["bart","t5"] (or an Enum) so FastAPI returns 422 for invalid values, matching the documented contract.
| safe_key = model_key.replace("/", "_").replace("\\", "_") | ||
| return _DATA_DIR / video_id / f"summary_{safe_key}.json.gz" |
There was a problem hiding this comment.
_summary_path() validates model_key with _SAFE_MODEL_RE, which currently forbids / and \, but then safe_key = model_key.replace("/", "_").replace("\\", "_") is still applied. This sanitization is currently unreachable and can be removed, or the regex can be relaxed to allow / (and rely on the replacement) if you want to support HF-style names like org/model safely.
| safe_key = model_key.replace("/", "_").replace("\\", "_") | |
| return _DATA_DIR / video_id / f"summary_{safe_key}.json.gz" | |
| return _DATA_DIR / video_id / f"summary_{model_key}.json.gz" |
Replaces the Streamlit-only architecture with an API-first backend using FastAPI + file-backed persistence under
data/. No Redis/Celery — usesBackgroundTasksfor simplicity with a clear upgrade path.Core changes
src/ingestion/transcribe.py— module-levelwhisper.load_model()replaced with a thread-safe lazy loader (get_whisper_model());import whisperitself is also deferred so the server starts fastsrc/utils.py(new) —video_id_from_url()canonicalises YouTube video IDs across watch/youtu.be/shorts/embed URL formssrc/jobs.py(new) — file-backed job store persisting todata/jobs/{job_id}.json; job IDs are UUID4-validated before any file I/O to prevent path traversalsrc/storage.py(new) — gzip-compressed persistence for transcripts, per-model summaries, and FAISS indexes;video_idandmodel_keyare regex-validated before use in pathsapp/main.py(new) — FastAPI app with:POST /api/v1/ingest/youtube→ enqueues job, returns{job_id}(202)GET /api/v1/job/{job_id}→ polls job statusGET /api/v1/summary/{video_id}?model=bart|t5→ returns stored summary +comparison_availableflagGET /api/v1/compare/{video_id}→ returns both summaries with compression/timing metricsGET /healthtests/test_api.py(new) — 9TestClienttests; heavy ML deps (faiss,transformers,torch,whisper) stubbed viasys.modulesso the suite runs without any models installedrequirements.txt— addedfastapi,uvicorn[standard],python-multipart,httpx; removed duplicate and unused entriesUsage
Request fields (
detail_level,model) useLiteraltype constraints; job errors are logged server-side only and not exposed in full to callers.Original prompt
Add a FastAPI backend with a file-backed background job runner, persistent summary storage, and comparison endpoints for the TranscriptIQ project. This PR implements the backend pieces needed to replace the Streamlit UI with an API-first approach (React frontend will be added later). It keeps Redis/caching out of scope and uses a file-backed store under data/ so the work is runnable locally.
Relevant image (data retrieval flow):

Summary of changes to add (each file will be created or modified as indicated):
Minor helper: ensure src/utils.video_id_from_url exists (if not already) to canonicalize YouTube video ids used as storage keys. If already present, reuse it.
Requirements note (developer should add these if missing):
Behavioral notes & constraints
Testing and verification steps (for reviewer)
Why this PR is valuable now
No files should be merged to main by the tool — please create this PR on a feature branch and do not merge. The PR should include the changed and new files listed above and a clear PR description (this problem statement can be used)...
This pull request was created from Copilot chat.