Skip to content

feat: per-locus CPU-time profiling (thread CPU time + harness)#15

Merged
wdecoster merged 1 commit into
mainfrom
feat/cpu-time-profiling
Jun 14, 2026
Merged

feat: per-locus CPU-time profiling (thread CPU time + harness)#15
wdecoster merged 1 commit into
mainfrom
feat/cpu-time-profiling

Conversation

@wdecoster

Copy link
Copy Markdown
Owner

Adds per-locus profiling instrumentation and an analysis harness to find which loci are consistently slow and why.

Why

The debug-only ;TIME= INFO field measured wall-clock time, which is inflated whenever other processes contend for the machine — making per-locus timings unreliable across runs.

What

  • Thread CPU time. TIME= now reports thread CPU time via repeats::thread_cpu_time() (clock_gettime(CLOCK_THREAD_CPUTIME_ID), hence the libc dep). This only advances while the thread is on-CPU, so it's immune to other-process noise. Valid here because each locus is genotyped start-to-finish on a single rayon worker, so the thread-CPU delta around it is exactly that locus's cost. Stays correct under multi-threading too.
    • RepeatInterval.created is now a Duration; new cpu_elapsed() / time_field() helpers collapse the four duplicated timing blocks in vcf.rs.
  • misc/profile_loci.py. Runs STRdust over N replicates, drops --warmup cold-cache runs, and aggregates per-locus median / MAD / CV CPU time, joining each locus with VCF features (depth, allele length, stdev, cluster count, outliers, QUICKREF). Prints the consistently-slowest loci and Spearman feature correlations. --from-vcfs re-aggregates existing debug VCFs without rerunning.

Findings (chr22:20–21 Mb, 1607 loci, 4 replicates)

  • Measurement is reliable: slow loci show low CV (~0.06–0.15) across replicates.
  • Per-locus VCF features are weak predictors of CPU time (strongest is repeat-length stdev, Spearman ~0.30; depth/length ~0.18).
  • The real driver is genomic context: slow loci cluster in the 22q11.2 segmental-duplication block (~3× slower at equal depth), where reads multi-map and minimap2 spends the time on chaining/alignment candidates.

So this is a diagnostic tool, not itself a speedup — but it cleanly ruled out feature-based heuristics and the read-slicing direction.

Notes

  • Branches off the 0.19.0 chore already on main.
  • libc was already in the lockfile transitively; the direct dep adds a single line.

Measure per-locus genotyping cost by thread CPU time instead of wall clock,
so timings are not inflated when other processes contend for the machine.
Each locus is genotyped start-to-finish on a single rayon worker, so the
thread-CPU delta around it is that locus's CPU cost.

- repeats.rs: add thread_cpu_time() (clock_gettime(CLOCK_THREAD_CPUTIME_ID),
  hence the libc dep), store it in RepeatInterval.created (now a Duration),
  and add cpu_elapsed()/time_field() helpers.
- vcf.rs: the debug-only ;TIME= INFO field now reports thread CPU seconds;
  the four duplicated timing blocks collapse to repeat.time_field(args.debug).
- misc/profile_loci.py: run STRdust over N replicates, drop warmup runs,
  aggregate per-locus median/MAD/CV CPU time and correlate with VCF features
  (depth, allele length, stdev, cluster count, ...) to find consistently slow
  loci.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wdecoster wdecoster merged commit 08c4a06 into main Jun 14, 2026
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant