[DO NOT MERGE] Investigate pocket vetoes (#51) #52
jeffreyksmithjr wants to merge 1 commit into `main`
Conversation
Two-pass analysis over the bot_detection DuckDB to see whether counting stale open PRs as rejections improves merge_rate signal quality against suspended-vs-active ground truth.

Pass 1 used age-since-created_at as the staleness proxy (forced by the DuckDB schema). Negative result: CV AUC flat, Cohen's d worse.

Pass 2 re-fetched updatedAt for every DB-OPEN PR (4,069 of 19,502 still currently open; the other 79% have been closed or merged since the snapshot and are by definition non-stale) and recomputed staleness as idle time. Much gentler reclassification — only 459 authors shift by >0.10 under idle_universal vs 1,836 under age_universal — but signal quality is still flat (CV AUC 0.5489 vs baseline 0.5494).

Conclusion: idle time is the right signal, but pocket vetoes don't catch a failure mode the current bad-egg model misses. No scoring change recommended on detection-accuracy grounds. The interpretability case for the change is separately defensible but out of scope here.
🥚 Diet Egg: HIGH · TrustScore: 83%
Code Review
This pull request investigates the "pocket veto" effect, analyzing whether treating stale open pull requests as rejections improves the signal for identifying suspended accounts. It adds scripts for fetching PR activity and performing statistical analysis, alongside the generated findings. Feedback suggests optimizing database interactions in the fetching script to avoid redundant connections and N+1 queries, and adding a check for empty data to prevent a potential KeyError during DataFrame processing.
```python
con = duckdb.connect(str(DB_PATH), read_only=True)
repo_counts = con.execute("""
    SELECT repo, COUNT(*) AS n
    FROM prs WHERE state='OPEN'
    GROUP BY repo ORDER BY n DESC
""").fetchall()
con.close()
```
```python
db_open_keys: dict[str, set[int]] = {}
con = duckdb.connect(str(DB_PATH), read_only=True)
for (repo,) in con.execute(
    "SELECT DISTINCT repo FROM prs WHERE state='OPEN'"
).fetchall():
    numbers = con.execute(
        "SELECT number FROM prs WHERE state='OPEN' AND repo=?", [repo]
    ).fetchall()
    db_open_keys[repo] = {n[0] for n in numbers}
con.close()
```
The current implementation opens and closes the database connection twice and uses an N+1 query pattern to populate db_open_keys. This can be optimized by using a single connection and a single query to fetch all open PR numbers, then grouping them in memory. This reduces overhead and improves script performance.
```python
con = duckdb.connect(str(DB_PATH), read_only=True)
repo_counts = con.execute("""
    SELECT repo, COUNT(*) AS n
    FROM prs WHERE state='OPEN'
    GROUP BY repo ORDER BY n DESC
""").fetchall()

db_open_keys: dict[str, set[int]] = {}
all_open = con.execute("SELECT repo, number FROM prs WHERE state='OPEN'").fetchall()
for repo, num in all_open:
    db_open_keys.setdefault(repo, set()).add(num)
con.close()
```

```python
df = pd.DataFrame(rows)
df["updated_at"] = pd.to_datetime(df["updated_at"])
df["fetch_now"] = pd.to_datetime(df["fetch_now"])
```
If no relevant PR activity is found (i.e., the rows list is empty), pd.DataFrame(rows) will create a DataFrame without the expected columns. This will cause the subsequent pd.to_datetime calls to raise a KeyError. Adding a guard clause to handle the empty case will make the script more robust.
Suggested change:

```diff
-df = pd.DataFrame(rows)
-df["updated_at"] = pd.to_datetime(df["updated_at"])
-df["fetch_now"] = pd.to_datetime(df["fetch_now"])
+if not rows:
+    print("No relevant PR activity found.")
+    return
+df = pd.DataFrame(rows)
+df["updated_at"] = pd.to_datetime(df["updated_at"])
+df["fetch_now"] = pd.to_datetime(df["fetch_now"])
```
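The failure mode the reviewer describes is easy to reproduce: `pd.DataFrame` built from an empty list has no columns at all, so column access raises `KeyError` rather than returning an empty Series. A minimal demonstration (standalone, not code from the PR):

```python
import pandas as pd

# A DataFrame built from an empty rows list has no columns...
df = pd.DataFrame([])
print(list(df.columns))  # []

# ...so selecting an expected column raises KeyError before
# pd.to_datetime is ever reached.
try:
    df["updated_at"]
except KeyError as exc:
    print(f"KeyError: {exc}")
```

Guarding on `if not rows:` before constructing the DataFrame, as the suggestion does, avoids this entirely.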
Summary
Investigation for #51: does counting stale open PRs as implicit rejections improve the `merge_rate` signal? Ran two passes over the bot_detection DuckDB. Short answer: no, not for detection accuracy. Not merging this; it is an evidence trail for closing the issue.

What was done

- Pass 1 used age since `created_at` as the staleness proxy (forced by the DuckDB schema, which has no `updatedAt`). Computed three merge-rate definitions over all 31,296 authors: the v3 baseline, `age_universal` (90d), and `age_per_repo` (using the DB's per-repo `stale_threshold_days`). Ran 5-fold CV on a 2-feature logistic (merge_rate + log1p(median_additions)) against the 675-suspended / 30,554-active labeled subset.
- Pass 2 re-fetched `updatedAt` for every DB-OPEN PR via GraphQL (91 repos, one paginated call each, cached to `open_pr_activity.parquet`) and added two idle-time variants. This addressed the main objection to Pass 1: under age-based staleness, a legitimate contributor like `harupy` (2,771 merged PRs, active account) gets punished for open PRs that are in active discussion. Idle-time correctly spares them.

Results
The refinement worked mechanically. Idle-time reclassifies 4x fewer authors than age-based, exactly the gentler behavior we wanted. But none of the variants beat the v3 baseline on CV AUC or Cohen's d, and differences are within noise.
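The two proxies differ only in which timestamp idleness is measured from, which is why idle-time reclassifies so many fewer authors. A minimal sketch of the distinction (column names `created_at`/`updated_at` are assumed from the writeup, not taken from the scripts):

```python
import pandas as pd

def stale_flags(df: pd.DataFrame, now: pd.Timestamp, threshold_days: int = 90):
    """Two staleness proxies for an OPEN PR.

    Pass 1 could only use age since creation (the schema had no updatedAt);
    Pass 2 uses idle time since last activity.
    """
    age_stale = (now - df["created_at"]).dt.days > threshold_days
    idle_stale = (now - df["updated_at"]).dt.days > threshold_days
    return age_stale, idle_stale

# An old PR that is still in active discussion: age-stale but not idle-stale.
prs = pd.DataFrame({
    "created_at": pd.to_datetime(["2024-01-01"]),
    "updated_at": pd.to_datetime(["2024-11-01"]),
})
age, idle = stale_flags(prs, pd.Timestamp("2024-12-01"))
```

The `harupy` case above is exactly the row in this sketch: age-based staleness flags it, idle-based staleness does not.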
Why it doesn't work
Suspended high-volume accounts in this dataset tend to either (a) be bots that v3 already scores low, or (b) have no open PRs at all (e.g. `amd-jmacaran`: 105 PRs, merge_rate=1.000 under every definition). Pocket vetoes aren't catching a failure mode the current model misses. They're just redistributing mass among accounts the model already classifies adequately.

A striking side-finding from Pass 2: 79% of DB-OPEN PRs have since been closed or merged (15,433 of 19,502). The age-based proxy treated all 19,502 as stale; idle-time correctly treats those 15,433 as non-stale. The age-based result was partly an artifact of the crude proxy, and the "is idle-time better" question was worth answering even though the final verdict didn't change.
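The redistribution point can be made concrete. The exact formula here is my reading of the writeup, not code from the PR: the pocket-veto variants can only add stale open PRs to the denominator, so an author with no open PRs scores identically under every definition.

```python
def merge_rate(merged: int, closed: int, stale_open: int = 0) -> float:
    """Merged fraction of 'decided' PRs. The pocket-veto variants count
    stale open PRs as implicit rejections, i.e. extra denominator mass.
    (Hypothetical reconstruction of the definitions in the writeup.)"""
    decided = merged + closed + stale_open
    return merged / decided if decided else 0.0

# An account like the one above (105 PRs, all merged, none open) is
# untouched by any staleness definition: there is nothing to reclassify.
baseline = merge_rate(105, 0)
with_veto = merge_rate(105, 0, stale_open=0)
```

Both calls return 1.0, which is why the variants only move scores for authors who actually have stale open PRs.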
Caveat worth recording for the future
For the narrow detection task, this is settled; don't ship. For `merge_rate` as a human-facing interpretable signal, `idle_per_repo` is still the most defensible definition. It correctly handles the drive-by-with-ignored-PR case (currently scored as merge_rate=1.000, which reads as "always accepted" but isn't true). That's an interpretability case, not an ML case, and a separate conversation.

Files
- `experiments/bot_detection/scripts/pocket_veto_analysis.py`: analysis script
- `experiments/bot_detection/scripts/fetch_open_pr_activity.py`: one-shot GraphQL refetch for `updatedAt`
- `experiments/bot_detection/data/open_pr_activity.parquet`: cached refetch output (58KB)
- `experiments/bot_detection/data/results/pocket_veto_analysis.json`: machine-readable results
- `experiments/bot_detection/pocket_veto_findings.md`: human-readable writeup

No changes to `src/good_egg/`.
- `uv run python experiments/bot_detection/scripts/pocket_veto_analysis.py` reproduces the reported numbers
- `uv run ruff check experiments/bot_detection/scripts/` passes