- See TidyObsidian/find-duplicate-blocks.py
This script takes 30–40 minutes to process 22K files. Currently, only reading and block extraction are parallel; step 5 runs in a single process.
A simple structural improvement:
- Split all_blocks into chunks and run the “for each block, find candidate indices, check Jaccard, append or merge” logic in worker processes.
- Have each worker build its own local canonical_blocks/token_index, then merge the results at the end (e.g., by re-running a cheaper global dedup on the worker outputs).
This is more work to implement, but if 30–40 minutes is mostly CPU, using all cores in step 5 can cut that substantially.
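The chunk-and-merge approach above could be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names (`jaccard`, `dedup_chunk`, `parallel_dedup`), the whitespace tokenization, and the greedy keep-or-skip strategy are all assumptions standing in for the real candidate-index lookup and merge logic.

```python
from multiprocessing import Pool

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup_chunk(blocks, threshold=0.8):
    """Greedy dedup within one chunk: keep a block only if no
    already-kept block is Jaccard-similar above the threshold.
    (The real script would consult a token_index here instead of
    scanning all kept blocks.)"""
    canonical = []  # list of (text, token_set)
    for text in blocks:
        tokens = frozenset(text.split())  # placeholder tokenizer
        if not any(jaccard(tokens, kept) >= threshold for _, kept in canonical):
            canonical.append((text, tokens))
    return canonical

def parallel_dedup(blocks, workers=4, threshold=0.8):
    """Dedup chunks in worker processes, then merge with a final
    global pass over the (much smaller) set of per-chunk survivors
    to catch cross-chunk duplicates."""
    chunks = [blocks[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        results = pool.starmap(dedup_chunk, [(c, threshold) for c in chunks])
    survivors = [text for chunk in results for text, _ in chunk]
    return [text for text, _ in dedup_chunk(survivors, threshold)]
```

One caveat with this merge strategy: the final pass is single-process again, but it runs over only the deduplicated survivors rather than all blocks, so it is far cheaper than the original step 5.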