- See TidyObsidian/find-duplicate-blocks.py
This script takes 30–40 minutes to process 22K files. Currently, only reading and block extraction are parallel; step 5 runs in a single process.
A simple structural improvement:
- Split all_blocks into chunks and run the “for each block, find candidate indices, check Jaccard, append or merge” logic in worker processes.
- Have each worker build its own local canonical_blocks/token_index, then merge the results at the end (e.g., by re-running a cheaper global dedup on the worker outputs).
This is more work to implement, but if 30–40 minutes is mostly CPU, using all cores in step 5 can cut that substantially.
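The chunk-and-merge approach above could be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names (`jaccard`, `dedup_chunk`, `parallel_dedup`), the whitespace tokenization, and the greedy keep-or-skip strategy are all assumptions standing in for the real candidate-index lookup and merge logic.

```python
from multiprocessing import Pool

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup_chunk(blocks, threshold=0.8):
    """Greedy dedup within one chunk: keep a block only if no
    already-kept block is Jaccard-similar above the threshold.
    (The real script would consult a token_index here instead of
    scanning all kept blocks.)"""
    canonical = []  # list of (text, token_set)
    for text in blocks:
        tokens = frozenset(text.split())  # placeholder tokenizer
        if not any(jaccard(tokens, kept) >= threshold for _, kept in canonical):
            canonical.append((text, tokens))
    return canonical

def parallel_dedup(blocks, workers=4, threshold=0.8):
    """Dedup chunks in worker processes, then merge with a final
    global pass over the (much smaller) set of per-chunk survivors
    to catch cross-chunk duplicates."""
    chunks = [blocks[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        results = pool.starmap(dedup_chunk, [(c, threshold) for c in chunks])
    survivors = [text for chunk in results for text, _ in chunk]
    return [text for text, _ in dedup_chunk(survivors, threshold)]
```

One caveat with this merge strategy: the final pass is single-process again, but it runs over only the deduplicated survivors rather than all blocks, so it is far cheaper than the original step 5.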