- See TidyObsidian/find-duplicate-blocks.py
Use a hash-based pick: sort tokens by a stable hash (e.g., hash(token) or sha1(token)), then take the first N. This approximates MinHash and tends to distribute blocks more evenly across the token space.
This change shrinks candidate_indices per block, so the inner Jaccard loop runs far fewer times.
Use a hash-based pick: sort tokens by a stable hash (e.g., hash(token) or sha1(token)), then take the first N. This approximates MinHash and tends to distribute blocks more evenly across the token space.
This change shrinks
candidate_indicesper block, so the inner Jaccard loop runs far fewer times.