[DO NOT MERGE] Investigate pocket vetoes (#51)#52

Closed
jeffreyksmithjr wants to merge 1 commit into main from pocket-vetos

Conversation


@jeffreyksmithjr jeffreyksmithjr commented Apr 12, 2026

Summary

Investigation for #51: does counting stale open PRs as implicit rejections improve the merge_rate signal? Ran two passes over the bot_detection DuckDB. Short answer: no, not for detection accuracy. Not merging this; it is an evidence trail for closing the issue.

What was done

  • Pass 1: age-since-created_at as the staleness proxy (forced by the DuckDB schema, which has no updatedAt). Computed three merge-rate definitions over all 31,296 authors: v3 baseline, age_universal (90d), age_per_repo (using the DB's per-repo stale_threshold_days). Ran 5-fold CV on a 2-feature logistic (merge_rate + log1p(median_additions)) against the 675-suspended / 30,554-active labeled subset.
  • Pass 2: re-fetched updatedAt for every DB-OPEN PR via GraphQL (91 repos, one paginated call each, cached to open_pr_activity.parquet) and added two idle-time variants. This addressed the main objection to Pass 1: under age-based staleness, a legitimate contributor like harupy (2,771 merged PRs, active account) gets punished for open PRs that are in active discussion. Idle-time correctly spares them.
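The two staleness definitions compared above can be sketched as follows. This is a minimal illustration, not the actual script: the column names (author, state, created_at, updated_at, merged) are hypothetical stand-ins for the real DuckDB schema, and the real analysis lives in pocket_veto_analysis.py.

```python
import pandas as pd

def merge_rate_with_vetoes(prs: pd.DataFrame, now: pd.Timestamp,
                           threshold_days: float = 90.0,
                           staleness: str = "idle") -> pd.Series:
    """Per-author merge rate where stale open PRs count as implicit rejections.

    staleness="age" uses time since created_at (the Pass 1 proxy);
    staleness="idle" uses time since updated_at (the Pass 2 refinement).
    """
    prs = prs.copy()
    anchor = prs["created_at"] if staleness == "age" else prs["updated_at"]
    days = (now - anchor).dt.total_seconds() / 86400.0
    stale_open = (prs["state"] == "OPEN") & (days > threshold_days)
    # Stale open PRs enter the denominator as rejections; fresh open PRs
    # are excluded entirely, matching the v3 baseline's handling.
    prs = prs[(prs["state"] != "OPEN") | stale_open]
    grouped = prs.groupby("author")
    return grouped["merged"].sum() / grouped.size()
```

The key behavioral difference: an open PR in active discussion has a recent updated_at, so idle-time spares it, while age-since-created_at punishes it regardless.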

Results

| Definition | CV AUC | Cohen's d | Authors shifted >0.10 |
|---|---|---|---|
| v3 baseline | 0.5494 | 0.096 | n/a |
| age_universal (90d) | 0.5488 | 0.081 | 1,836 |
| age_per_repo | 0.5486 | 0.073 | 2,111 |
| idle_universal (90d) | 0.5489 | 0.087 | 459 |
| idle_per_repo | 0.5488 | 0.085 | 529 |

The refinement worked mechanically. Idle-time reclassifies 4x fewer authors than age-based, exactly the gentler behavior we wanted. But none of the variants beat the v3 baseline on CV AUC or Cohen's d, and differences are within noise.
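For reference, the Cohen's d column is the standard effect size between the suspended and active groups' merge rates. A minimal sketch using the pooled-standard-deviation formulation (assuming that is the variant the script uses; the exact formulation is not stated above):

```python
import numpy as np

def cohens_d(suspended: np.ndarray, active: np.ndarray) -> float:
    """Effect size between suspended and active authors' merge rates,
    using the pooled standard deviation."""
    n1, n2 = len(suspended), len(active)
    v1, v2 = suspended.var(ddof=1), active.var(ddof=1)
    pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return float((suspended.mean() - active.mean()) / pooled)
```

Values near 0.1, as in the table, indicate the two groups' merge-rate distributions overlap almost completely, which is consistent with the flat CV AUC.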

Why it doesn't work

Suspended high-volume accounts in this dataset tend to either (a) be bots that v3 already scores low, or (b) have no open PRs at all (e.g. amd-jmacaran: 105 PRs, merge_rate=1.000 under every definition). Pocket vetoes aren't catching a failure mode the current model misses. They're just redistributing mass among accounts the model already classifies adequately.

A striking side-finding from Pass 2: 79% of DB-OPEN PRs have since been closed or merged (15,433 of 19,502). The age-based proxy treated all 19,502 as stale; idle-time correctly treats those 15,433 as non-stale. The age-based result was partly an artifact of the crude proxy, and the "is idle-time better" question was worth answering even though the final verdict didn't change.
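The Pass 2 reclassification boils down to a join of the snapshot's OPEN PRs against the refetched live state: anything closed or merged since the snapshot is non-stale by definition, and only the still-open remainder is tested against the idle-time threshold. A minimal sketch with illustrative column names:

```python
import pandas as pd

def split_db_open(snapshot_open: pd.DataFrame, refetched: pd.DataFrame):
    """Join DB-OPEN PRs against their refetched live state.

    Returns (still_open, since_resolved); only still_open goes on to the
    idle-time staleness test. Columns (repo, number, state) are
    illustrative, not the real schema.
    """
    joined = snapshot_open.merge(refetched, on=["repo", "number"], how="left")
    still_open = joined[joined["state"] == "OPEN"]
    since_resolved = joined[joined["state"] != "OPEN"]
    return still_open, since_resolved
```

On the numbers above, this split would leave 4,069 of 19,502 DB-OPEN PRs in the still_open bucket.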

Caveat worth recording for the future

For the narrow detection task, this is settled: don't ship. For merge_rate as a human-facing interpretable signal, idle_per_repo is still the most defensible definition. It correctly handles the drive-by contributor with an ignored open PR, who is currently scored as merge_rate=1.000, which reads as "always accepted" but isn't true. That's an interpretability case, not an ML case, and a separate conversation.

Files

  • experiments/bot_detection/scripts/pocket_veto_analysis.py: analysis script
  • experiments/bot_detection/scripts/fetch_open_pr_activity.py: one-shot GraphQL refetch for updatedAt
  • experiments/bot_detection/data/open_pr_activity.parquet: cached refetch output (58KB)
  • experiments/bot_detection/data/results/pocket_veto_analysis.json: machine-readable results
  • experiments/bot_detection/pocket_veto_findings.md: human-readable writeup

No changes to src/good_egg/.

Test plan

  • N/A, investigation only, no production code changes
  • uv run python experiments/bot_detection/scripts/pocket_veto_analysis.py reproduces the reported numbers
  • uv run ruff check experiments/bot_detection/scripts/ passes

Two-pass analysis over the bot_detection DuckDB to see whether counting
stale open PRs as rejections improves merge_rate signal quality against
suspended-vs-active ground truth.

Pass 1 used age-since-created_at as the staleness proxy (forced by the
DuckDB schema). Negative result: CV AUC flat, Cohen's d worse.

Pass 2 re-fetched updatedAt for every DB-OPEN PR (4,069 still currently
open of 19,502; the other 79% have been closed or merged since the
snapshot and are by definition non-stale) and recomputed staleness as
idle-time. Much gentler reclassification — only 459 authors shift by
>0.10 under idle_universal vs 1,836 under age_universal — but signal
quality is still flat (CV AUC 0.5489 vs baseline 0.5494).

Conclusion: idle-time is the right signal, but pocket vetoes don't
catch a failure mode the current bad-egg model misses. No scoring
change recommended on detection-accuracy grounds. Interpretability
case for the change is separately defensible but out of scope here.
@github-actions

🥚 Diet Egg: HIGH Trust

Score: 83%

Score Breakdown

| Component | Value |
|---|---|
| Merge Rate | 83% (48/58 PRs) |

Top Contributions

| Repository | PRs | Language | Stars |
|---|---|---|---|
| 2ndSetAI/good-egg | 21 | Python | 29 |
| jeffreyksmithjr/verskyt | 9 | Python | 3 |
| jeffreyksmithjr/galapagos_nao | 7 | Elixir | 21 |
| aws-samples/aws-big-data-blog | 3 | Java | 894 |
| pytorch/pytorch.github.io | 2 | HTML | 279 |
| elixir-nx/nx | 1 | Elixir | 2874 |
| numpy/numpy | 1 | Python | 31836 |
| melissawm/open-source-ai-contribution-policies | 1 | N/A | 155 |
| nerves-project/nerves_examples | 1 | Elixir | 404 |
| kilimchoi/engineering-blogs | 1 | Ruby | 37735 |


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request investigates the "pocket veto" effect: whether treating stale open pull requests as implicit rejections improves the signal for identifying suspended accounts. It adds scripts for fetching PR activity and running the statistical analysis, along with the generated findings. Feedback suggests consolidating database connections in the fetching script to avoid redundant connections and an N+1 query pattern, and guarding against empty data to prevent a potential KeyError during DataFrame processing.

Comment on lines +101 to +118
```python
con = duckdb.connect(str(DB_PATH), read_only=True)
repo_counts = con.execute("""
    SELECT repo, COUNT(*) AS n
    FROM prs WHERE state='OPEN'
    GROUP BY repo ORDER BY n DESC
""").fetchall()
con.close()

db_open_keys: dict[str, set[int]] = {}
con = duckdb.connect(str(DB_PATH), read_only=True)
for (repo,) in con.execute(
    "SELECT DISTINCT repo FROM prs WHERE state='OPEN'"
).fetchall():
    numbers = con.execute(
        "SELECT number FROM prs WHERE state='OPEN' AND repo=?", [repo]
    ).fetchall()
    db_open_keys[repo] = {n[0] for n in numbers}
con.close()
```


Severity: medium

The current implementation opens and closes the database connection twice and uses an N+1 query pattern to populate db_open_keys. This can be optimized by using a single connection and a single query to fetch all open PR numbers, then grouping them in memory. This reduces overhead and improves script performance.

```python
con = duckdb.connect(str(DB_PATH), read_only=True)
repo_counts = con.execute("""
    SELECT repo, COUNT(*) AS n
    FROM prs WHERE state='OPEN'
    GROUP BY repo ORDER BY n DESC
""").fetchall()

db_open_keys: dict[str, set[int]] = {}
all_open = con.execute("SELECT repo, number FROM prs WHERE state='OPEN'").fetchall()
for repo, num in all_open:
    db_open_keys.setdefault(repo, set()).add(num)
con.close()
```

Comment on lines +162 to +164
```python
df = pd.DataFrame(rows)
df["updated_at"] = pd.to_datetime(df["updated_at"])
df["fetch_now"] = pd.to_datetime(df["fetch_now"])
```


Severity: medium

If no relevant PR activity is found (i.e., the rows list is empty), pd.DataFrame(rows) will create a DataFrame without the expected columns. This will cause the subsequent pd.to_datetime calls to raise a KeyError. Adding a guard clause to handle the empty case will make the script more robust.

Suggested change:

```python
if not rows:
    print("No relevant PR activity found.")
    return
df = pd.DataFrame(rows)
df["updated_at"] = pd.to_datetime(df["updated_at"])
df["fetch_now"] = pd.to_datetime(df["fetch_now"])
```
