- See TidyObsidian/find-duplicate-blocks.py
For each candidate, you recompute `jaccard(tokens, cb["tokens"])` by doing `&` and `|` on Python sets.
Change:
- Pre-store `len(tokens)` alongside `tokens` inside `canonical_blocks`, so you compute union size as `len_a + len_b - inter` and avoid building a full union set.
- If you keep tokens in a sorted list instead of a set, you can do an intersection with a two-pointer walk, which can be faster and more cache-friendly at this scale. Benchmark it first, since set `&` runs in C under CPython while a hand-written walk runs as Python bytecode.
Both reduce per-comparison overhead without changing behavior.
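
A minimal sketch of both changes, not the actual code in the script. The function names and the `n_tokens` key mentioned below are hypothetical, and returning 0.0 when both inputs are empty is an assumption that should be matched to whatever the existing `jaccard` does.

```python
def jaccard_precomputed(a: set, len_a: int, b: set, len_b: int) -> float:
    """Jaccard similarity using stored set sizes, so no union set is built."""
    inter = len(a & b)
    union = len_a + len_b - inter  # |A ∪ B| = |A| + |B| - |A ∩ B|
    # Assumption: two empty inputs score 0.0; match the existing jaccard().
    return inter / union if union else 0.0


def intersection_size_sorted(a: list, b: list) -> int:
    """Count shared items of two sorted, duplicate-free lists (two-pointer walk)."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def jaccard_sorted(a: list, b: list) -> float:
    """Jaccard similarity for tokens kept as sorted, duplicate-free lists."""
    inter = intersection_size_sorted(a, b)
    union = len(a) + len(b) - inter
    # Same empty-input assumption as jaccard_precomputed().
    return inter / union if union else 0.0
```

With the first variant, each `canonical_blocks` entry could carry something like `cb["n_tokens"] = len(cb["tokens"])`, written once when the block is stored. For the second, build each list once with `sorted(set(tokens))` so the walk sees no duplicates.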