Make the inner Jaccard loop cheaper

- See TidyObsidian/find-duplicate-blocks.py

For each candidate, you recompute `jaccard(tokens, cb["tokens"])` by doing `&` and `|` on Python sets, so every comparison builds two temporary sets just to take their lengths.

Change:

- Pre-store `len(tokens)` alongside `tokens` inside `canonical_blocks`, so you compute union size as `len_a + len_b - inter` and avoid building a full union set.

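- If you keep tokens in a sorted, deduplicated list instead of a set, you can compute the intersection size with a two-pointer walk, which allocates no intermediate set and can be more cache-friendly at this scale. Set `&` runs in C in CPython, so benchmark the walk before switching.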
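A minimal sketch of the first change, assuming `canonical_blocks` is a list of dicts with a `"tokens"` set. The `"n_tokens"` key, the function names, and the sample data are hypothetical and not taken from `find-duplicate-blocks.py`; returning `0.0` for two empty sets is also an assumption about the existing convention.

```python
def jaccard_sets(a: set, b: set) -> float:
    """Baseline: builds both the intersection set and the union set."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def jaccard_with_lengths(a: set, len_a: int, b: set, len_b: int) -> float:
    """Same score, but the union size comes from inclusion-exclusion."""
    inter = len(a & b)             # still one temporary set
    union = len_a + len_b - inter  # no union set is built
    return inter / union if union else 0.0


# Hypothetical block layout: store the token count once, next to the tokens.
canonical_blocks = [
    {"tokens": {"alpha", "beta", "gamma"}},
    {"tokens": {"beta", "gamma", "delta", "epsilon"}},
]
for cb in canonical_blocks:
    cb["n_tokens"] = len(cb["tokens"])

tokens = {"beta", "gamma", "zeta"}
n_tokens = len(tokens)  # computed once per candidate, not once per comparison

scores = [
    jaccard_with_lengths(tokens, n_tokens, cb["tokens"], cb["n_tokens"])
    for cb in canonical_blocks
]
```

Since `len()` on a set is already O(1), the saving here comes from skipping the union allocation, not from caching the length itself.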

Both reduce per-comparison overhead without changing behavior: the Jaccard score for any pair of blocks stays the same.
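The sorted-list variant could look roughly like the following. This is a sketch under the assumption that each block's tokens are stored as a sorted list with no duplicates; the function names and sample data are illustrative, not the script's own. Whether it beats set intersection in pure Python depends on the data, so it is worth timing on real blocks.

```python
def sorted_intersection_size(a: list, b: list) -> int:
    """Count common elements of two sorted, duplicate-free lists."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b; advance a
        else:
            j += 1  # b[j] cannot appear later in a; advance b
    return count


def jaccard_sorted(a: list, b: list) -> float:
    """Jaccard similarity from the two-pointer intersection size."""
    inter = sorted_intersection_size(a, b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0  # 0.0 for two empty lists is an assumption


# Sorting a set yields the sorted, deduplicated list the walk requires.
a = sorted({"beta", "gamma", "zeta"})
b = sorted({"beta", "gamma", "delta", "epsilon"})
score = jaccard_sorted(a, b)
```

The walk is only correct if both lists are sorted with the same ordering and contain no duplicates, so the conversion from sets should happen once when a block is stored, not inside the comparison loop.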


Make the inner Jaccard loop cheaper #43
