Skip to content

First bigbio release of msgf+#14

Open
ypriverol wants to merge 60 commits intomasterfrom
dev
Open

First bigbio release of msgf+#14
ypriverol wants to merge 60 commits intomasterfrom
dev

Conversation

@ypriverol
Copy link
Copy Markdown
Member

No description provided.

ypriverol and others added 30 commits April 13, 2026 07:51
…Java 17

- Change skipTests from true to false so mvn test actually runs
- Update maven-compiler-plugin source/target from 1.8 to 17 (matches runtime)
- Add missing compile dependencies: jmzml 1.7.11, fastutil 8.5.12,
  slf4j-api 1.7.36, logback-classic 1.2.12, commons-io 2.15.1
  (master code references these classes but they were not declared)
- @ignore TestMzML test that requires Windows-specific DMS files

Result: 120 tests run, 53 active, 67 skipped, 0 failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…shold

In MZIdentMLGen.addSpectrumIdentificationResults(), change `break` to
`continue` when a match has DeNovoScore below the minimum threshold.
The `break` was incorrectly stopping emission of all subsequent matches
for that spectrum, silently dropping valid PSMs from the mzid output.

Also add null safety check for spectrum index lookup — if a spectrum
index is not found in the spectrum file, log a warning and skip
instead of throwing a NullPointerException.

Add TestMZIdentMLGen with two integration tests:
- testMzidScoreCompleteness: runs MSGF+ search, verifies every SII
  has all 4 score CVParams (RawScore, DeNovoScore, SpecEValue, EValue)
- testMzidStructuralValidity: verifies output mzid has required
  mzIdentML structure elements

Closes MSGFPlus#157

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add new -msLevel CLI parameter to filter spectra by MS level.
Accepts single value (e.g., -msLevel 2) or comma-separated range
(e.g., -msLevel 2,3). Default is 2 (MS2 only).

Changes:
- ParamManager: add MS_LEVEL enum and registration
- IntRangeParameter: enable single-value parsing, fix typo
- SearchParams: add minMSLevel/maxMSLevel fields
- SpecKey: filter spectra by MS level in getSpecKeyList()
- SpectraAccessor: add setMSLevelRange(), wire to parsers
- MzMLAdapter/MzXMLSpectraMap: fix maxMSLevel to be inclusive
- MSGFPlus/MSGFDB/MSGFDBLib: wire MS level parameters
- pom.xml: remove fastutil shade filter (jmzml 1.7.11 needs full fastutil)

Tests: TestIntRangeParameter (9 tests), TestMSLevelFiltering (6 tests)

Benchmark (TMT 1.1GB, TDA):
  Baseline: 1245s, 6654 PSMs@1%FDR
  -msLevel 2: 957s (-23%), 6936 PSMs@1%FDR (+4.2%)

Closes MSGFPlus#159

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(MSGFPlus#159): add -msLevel parameter for MS level filtering
fix(MSGFPlus#157): preserve PSM scores when DeNovoScore is below threshold
fix: enable test suite and fix broken build dependencies
Remove standalone scripts, legacy tools, and unused classes that are
not referenced by the core MSGF+ search pipeline, reducing codebase
by ~22,000 lines.

Deleted entire packages:
- ims/ (9 files) — legacy IMS utilities
- ipa/ (5 files) — unused isotope pattern analysis
- msgf2d/ (8 files) — abandoned 2D scoring experiment
- msdictionary/ (7 files) — unused genome dictionary tool
- mstag/ (3 files) — unused sequence tagging
- scripts/ (6 files) — standalone CLI utilities
- msutil/test/ (3 files) — misplaced test classes
- msgf/test/ (2 files) — legacy test stubs
- msgf/analysis/ (1 file) — unused ROC generator

Cleaned mixed packages:
- misc/: removed 59 standalone scripts, kept 5 core utilities
- msgf/: removed 6 unused graph/scoring classes
- msutil/: removed 9 unused filter/annotation classes
- msdbsearch/: removed 4 standalone DB tools
- parser/: removed 9 legacy format parsers (InsPecT, Mascot, etc.)
- ui/: removed 6 legacy entry points (MSGF, MSGFLib, etc.)
- mzid/: removed 1 unused adapter stub
- msscorer/: removed 1 unused stats class
- suffixarray/: removed 1 unused mass array class

Also removed dead test methods and cleaned dangling imports.
Tests: 119 run, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite README.md:
- Full parameter reference tables covering all 30+ flags organized
  by category (core search, fragmentation, enzyme, filtering, etc.)
- Quick start examples for basic and TMT searches
- Modification file format documentation with examples
- Build-from-source instructions
- Updated requirements to Java 17+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add parameter docs to README and CI/CD workflow
Remove dead code: 150 unused classes, -22K lines
Write search results directly to TSV from in-memory objects, bypassing
mzIdentML serialization. Output is column-identical to MzIDToTsv
(verified by diff on test.mgf search). This avoids generating large
.mzid files when only TSV is needed downstream (e.g. OpenMS
MSGFPlusAdapter, Percolator).

- New DirectTSVWriter class with same score/protein/mod logic as
  MZIdentMLGen but streaming tab-delimited output
- New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both
- Includes fixed + variable mods, MGF Title column, decoy filtering
- Backwards compatible: default remains mzid

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When -addFeatures 1 is used with -outputFormat tsv, the TSV now
includes all PSMFeatureFinder columns needed for Percolator:
ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio,
MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency,
NumMatchedMainIons, and all error statistics (MeanError/StdevError
for All and Top7, both absolute and relative).

These features were previously only available as UserParams in mzid
and were not extracted by OpenMS's addMSGFFeatures() — now they are
directly accessible as TSV columns.

The peptide modification format (M+15.995) is already compatible with
OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms
it to bracket notation M[+15.995] for AASequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the jmzml JAXB-based MzMLUnmarshaller with a lightweight StAX
streaming parser that extracts only the 11 fields MSGF+ needs. The new
parser builds a spectrum index in a single pass, then preloads all
spectra into memory on first random access, eliminating repeated XML
parsing during the search phase.

Benchmark (TMT 1.1GB mzML, target-decoy, 4 threads):
- Wall time: 957s -> 853s (-10.9%)
- PSMs at 1% FDR: 6,936 (unchanged)
- Score completeness: 100% (unchanged)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port

- Remove jmzml (JAXB-based mzML parser) dependency from pom.xml
- Delete old jmzml-dependent classes: MzMLAdapter, MzMLSpectraMap,
  MzMLSpectraIterator, SpectrumConverter
- Add referenceableParamGroupRef resolution to StaxMzMLParser: builds
  a map of param groups during index pass, resolves refs during
  spectrum parsing (critical for files that define polarity, MS level,
  etc. in referenceable groups)
- Move turnOffLogs() utility to StaxMzMLParser, update all callers
- Keep fastutil dependency (needed by jmzidml at runtime)

JAR size reduced from 39.5MB to 38MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jmzReader (uk.ac.ebi.pride.tools:jmzreader:2.0.6) had zero imports
anywhere in the codebase — a dead dependency from earlier development.
All spectrum file format parsing uses custom implementations:
mzML (StaxMzMLParser), mzXML (embedded jrap/stax), MGF/MS2/PKL
(custom parsers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove mzXML file format support entirely:
- Delete embedded jrap/stax library (20 files, ~5,800 lines)
- Delete MzXMLSpectraMap, MzXMLSpectraIterator, MzXMLToMgfConverter
- Delete MzXMLToMgf utility and mzXML test resources (38MB)
- Remove MZXML from SpecFileFormat enum, SpectraAccessor, ParamManager
- Update misc/scripts/ui classes to remove mzXML code paths

mzXML is a legacy format superseded by mzML. Users with mzXML files
can convert to mzML using msconvert (ProteoWizard).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- StaxMzMLParser: use ConcurrentHashMap for thread-safe spectrum cache,
  fix class-level doc (preload-all, not bounded LRU), check index before
  preloading, propagate exceptions instead of returning null
- StaxMzMLSpectraIterator: throw NoSuchElementException when exhausted
- SpectraAccessor: throw exception instead of System.exit(-1),
  validate specFormat is non-null in constructor
- SelectSpectra: update stale .mzXML reference to .mzML
- pom.xml: fix duplicate <manifest>, remove stale comments, note
  fastutil is required by jmzidentml at runtime

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Write search results directly to TSV from in-memory objects, bypassing
mzIdentML serialization. Output is column-identical to MzIDToTsv
(verified by diff on test.mgf search). This avoids generating large
.mzid files when only TSV is needed downstream (e.g. OpenMS
MSGFPlusAdapter, Percolator).

- New DirectTSVWriter class with same score/protein/mod logic as
  MZIdentMLGen but streaming tab-delimited output
- New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both
- Includes fixed + variable mods, MGF Title column, decoy filtering
- Backwards compatible: default remains mzid

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When -addFeatures 1 is used with -outputFormat tsv, the TSV now
includes all PSMFeatureFinder columns needed for Percolator:
ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio,
MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency,
NumMatchedMainIons, and all error statistics (MeanError/StdevError
for All and Top7, both absolute and relative).

These features were previously only available as UserParams in mzid
and were not extracted by OpenMS's addMSGFFeatures() — now they are
directly accessible as TSV columns.

The peptide modification format (M+15.995) is already compatible with
OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms
it to bracket notation M[+15.995] for AASequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ConvertToMgf-based tests (class removed in PR #7) with
StaxMzMLParser and SpectraAccessor mzML parsing tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: native TSV output — bypass mzIdentML for OpenMS/Percolator pipelines
perf: replace jmzml JAXB parser with StAX-based mzML reader
* chore: add CI/release packaging and benchmark scaffolding

Split infra and repository maintenance updates into a dedicated reviewable change set, including workflow automation, Docker packaging, benchmark scripts/docs, and project documentation updates. Exclude large local benchmark artifacts and keep this PR focused on non-hot-path code organization and release hygiene.

Made-with: Cursor

* chore: keep benchmark folder local-only

Remove benchmark scripts/docs from this branch and ignore the entire benchmark directory so local benchmarking assets do not appear in review PRs.

Made-with: Cursor

* docs: keep single canonical primitives plan

Fold memory-reduction guidance into the balanced primitives plan and remove the old duplicate plan file so review and maintenance use one canonical document.

Made-with: Cursor

* chore: narrow PR1 plans to scope-only docs

Remove unrelated strategy and optimization plan documents from PR1 so this branch stays focused on infra/packaging cleanup. Keep only the plans index file in this PR.

Made-with: Cursor

* chore: remove legacy ZippedReleases folder

Delete the obsolete Windows release helper scripts and reference files under ZippedReleases from the repository.

Made-with: Cursor

* chore: remove legacy extlib dependency jar

Delete the obsolete jrap/stax legacy jar under extlib as part of repository cleanup.

Made-with: Cursor

* fix: address copilot review feedback for PR11

Align docs with actual supported legacy formats, update release pipeline to build from tag version with tests, and fix Docker build JDK requirement.

Made-with: Cursor

* chore: minor packaging/docs hygiene for PR1

Normalize ignore files, shrink Docker build context, align agent README with dev/CI, and clarify release workflow step naming.

Made-with: Cursor

* docs: trim examples folder to small referenced artifacts

Remove duplicate Tryp_Pig_Bov DB/index copies (tests use src/test/resources),
drop large unlinked Excel/PNG teaching files, and add docs/examples/README.md
so the directory purpose is obvious. Link the index from the main README.

Made-with: Cursor

* chore: remove IntelliJ IDEA tips screenshots from docs

Made-with: Cursor

* docs: replace legacy HTML manuals with Markdown

Convert docs/*.html to GitHub-flavored Markdown (pandoc), fix internal links,
add docs/README.md as the documentation index, and remove unused style.css.

Made-with: Cursor

* docs: strip leftover HTML span wrappers from converted Markdown

Made-with: Cursor
Add a workflow-dispatch benchmark pipeline on a fixed self-hosted runner profile, with public-data download, metrics emission, and baseline TSV comparison under benchmark/ci/PXD001819 for future dataset expansion.

Made-with: Cursor
Use uppercase PXD001819 naming in workflow-visible labels/artifacts and update README to state mzXML is not available in this fork.

Made-with: Cursor
- run_ci.sh: count only opening <SpectrumIdentificationItem> tags for
  sii_count (prior substring match double-counted closing tags and
  picked up SpectrumIdentificationItemRef)
- run_ci.sh: always emit peak_rss_kb and cpu_percent (NA when GNU time
  does not expose them) so metrics file format is consistent
- compare_metrics.py: support an `optional` column; optional missing/NA
  metrics warn instead of failing CI
- baseline.tsv: add optional column, mark peak_rss_kb optional, fix
  ubuntu-latest note to reference the self-hosted runner, widen
  sii_count floor to match the de-duplicated count
- README pointers: update stale references to a non-existent
  benchmark/run_pxd001819_benchmark.sh script
- benchmark/README.md: describe the actual committed CI scaffold
  instead of an uncommitted local harness layout

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- extract_metrics.py: stream-parse mzIdentML with ElementTree.iterparse
  so SII counting and PSM 1% FDR counting no longer rely on line-shaped
  regex matches over XML
- run_ci.sh: use a bash array for SEARCH_ARGS (safe against future
  flags with spaces), atomic .part downloads, validate cached gzip,
  default MSGFPLUS_THREADS to 8 to match the workflow, drop the
  always-zero java_exit metric, and emit integer wall_time_sec
- workflow: pin Python via actions/setup-python@v5 so self-hosted
  runners have a known 3.11 interpreter for the helper scripts
- compare_metrics.py: add test_compare_metrics.py covering
  in-range pass, out-of-range fail, missing required/optional,
  NA, non-numeric, and empty-range rows (7 tests, all passing)
- .gitignore: drop redundant benchmark/** patterns (already covered
  by benchmark/* + ci/ allowlist); add __pycache__/ and *.pyc
- docs: describe new helper and test scripts in both READMEs
fix(benchmark): harden PXD001819 scaffold per review feedback
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 16, 2026

Important

Review skipped

Too many files!

This PR contains 282 files, which is 132 over the limit of 150.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a7ebe46c-5bea-4cb3-93b8-e1187d442620

📥 Commits

Reviewing files that changed from the base of the PR and between f0bc79e and 78af285.

⛔ Files ignored due to path filters (18)
  • benchmark/ci/PXD001819/baseline.tsv is excluded by !**/*.tsv
  • docs/IntelliJ_IDEA_Tips/InvalidateCaches.png is excluded by !**/*.png
  • docs/IntelliJ_IDEA_Tips/UnitTestDebugging1.png is excluded by !**/*.png
  • docs/IntelliJ_IDEA_Tips/UnitTestDebugging2_TestClass.png is excluded by !**/*.png
  • docs/IntelliJ_IDEA_Tips/UnitTestDebugging2_TestMethod.png is excluded by !**/*.png
  • docs/IntelliJ_IDEA_Tips/UnitTestDebugging3.png is excluded by !**/*.png
  • docs/IntelliJ_IDEA_Tips/UnitTestDebugging4.png is excluded by !**/*.png
  • docs/IntelliJ_IDEA_Tips/UnitTestDebugging5.png is excluded by !**/*.png
  • docs/IntelliJ_IDEA_Tips/UnitTestDebugging6.png is excluded by !**/*.png
  • docs/examples/Find_best_peptide_for_each_scan.xlsx is excluded by !**/*.xlsx
  • docs/examples/IsotopeRangeExample1.xlsx is excluded by !**/*.xlsx
  • docs/examples/IsotopeRangeExample2.xlsx is excluded by !**/*.xlsx
  • docs/examples/MassErrorHistogram_QC_Mam_19_01-run3_03Jan20.png is excluded by !**/*.png
  • docs/examples/MassErrorHistogram_QC_Mam_19_01-run3_03Jan20.xlsx is excluded by !**/*.xlsx
  • docs/examples/PepQValue_Computation_Mockup.xlsx is excluded by !**/*.xlsx
  • docs/examples/QValue_Computation.xlsx is excluded by !**/*.xlsx
  • docs/examples/QValue_Computation_Excerpt.xlsx is excluded by !**/*.xlsx
  • extlib/jrap_StAX_v5.2.jar is excluded by !**/*.jar
📒 Files selected for processing (282)
  • .claude/CLAUDE.md
  • .claude/investigations/001-mgf-scan-number-extraction-failure.md
  • .claude/investigations/002-evalue-target-decoy-leakage-to-percolator.md
  • .claude/investigations/README.md
  • .claude/plans/README.md
  • .claude/skills/README.md
  • .claude/skills/score-output-safety.md
  • .dockerignore
  • .github/workflows/benchmark-pxd001819.yml
  • .github/workflows/ci.yml
  • .github/workflows/release.yml
  • .gitignore
  • Dockerfile
  • README.md
  • ZippedReleases/Distribute_Program.bat
  • ZippedReleases/Distribute_Program_Use_Proto-2.bat
  • ZippedReleases/ReferenceFiles/BuildSuffixArray.bat
  • ZippedReleases/ReferenceFiles/Convert_mzid_to_tsv.bat
  • ZippedReleases/ReferenceFiles/HowTo_Run_MSGFPlus.txt
  • ZippedReleases/ReferenceFiles/MSGFPlus_Mods1.txt
  • ZippedReleases/ReferenceFiles/MSGFPlus_Mods2.txt
  • ZippedReleases/ReferenceFiles/MSGFPlus_Mods3.txt
  • ZippedReleases/ReferenceFiles/Start_MSGFPlus.bat
  • ZippedReleases/ReferenceFiles/Syntax.txt
  • ZippedReleases/ZipFilesForRelease.bat
  • benchmark/README.md
  • benchmark/ci/PXD001819/compare_metrics.py
  • benchmark/ci/PXD001819/extract_metrics.py
  • benchmark/ci/PXD001819/run_ci.sh
  • benchmark/ci/PXD001819/test_compare_metrics.py
  • benchmark/ci/README.md
  • docs/BuildSA.html
  • docs/BuildSA.md
  • docs/Changelog.html
  • docs/Changelog.md
  • docs/IsobaricLabeling.md
  • docs/MS-GFDB.html
  • docs/MS-GFDB.md
  • docs/MSGFDB_ModFile.html
  • docs/MSGFDB_ModFile.md
  • docs/MSGFPlus.html
  • docs/MSGFPlus.md
  • docs/MzidToTsv.html
  • docs/MzidToTsv.md
  • docs/README.md
  • docs/ScoringParamGen.html
  • docs/ScoringParamGen.md
  • docs/Troubleshooting.md
  • docs/examples/IndexFasta.bat
  • docs/examples/README.md
  • docs/examples/Tryp_Pig_Bov.canno
  • docs/examples/Tryp_Pig_Bov.cnlcp
  • docs/examples/Tryp_Pig_Bov.csarr
  • docs/examples/Tryp_Pig_Bov.cseq
  • docs/examples/Tryp_Pig_Bov.fasta
  • docs/index.html
  • docs/style.css
  • pom.xml
  • src/main/java/edu/ucsd/msjava/ims/DtaToMSGFInput.java
  • src/main/java/edu/ucsd/msjava/ims/DtaToMSGFInputDB.java
  • src/main/java/edu/ucsd/msjava/ims/GetTheBestPerPeptide.java
  • src/main/java/edu/ucsd/msjava/ims/GetTheBestPerScan.java
  • src/main/java/edu/ucsd/msjava/ims/MaskSpectra.java
  • src/main/java/edu/ucsd/msjava/ims/Misc.java
  • src/main/java/edu/ucsd/msjava/ims/OptimizeCE.java
  • src/main/java/edu/ucsd/msjava/ims/SplitDta.java
  • src/main/java/edu/ucsd/msjava/ims/Summarize.java
  • src/main/java/edu/ucsd/msjava/ipa/Feature.java
  • src/main/java/edu/ucsd/msjava/ipa/IPA.java
  • src/main/java/edu/ucsd/msjava/ipa/MS1SpectraMap.java
  • src/main/java/edu/ucsd/msjava/ipa/MSGFPlusResultSet.java
  • src/main/java/edu/ucsd/msjava/ipa/PSM.java
  • src/main/java/edu/ucsd/msjava/misc/AgilentQTOF.java
  • src/main/java/edu/ucsd/msjava/misc/AnnotatedMgfToMSGFInput.java
  • src/main/java/edu/ucsd/msjava/misc/AnnotatedSpecGenerator.java
  • src/main/java/edu/ucsd/msjava/misc/CIDETDPairs.java
  • src/main/java/edu/ucsd/msjava/misc/CalcFastaDBSize.java
  • src/main/java/edu/ucsd/msjava/misc/ChargePrediction.java
  • src/main/java/edu/ucsd/msjava/misc/Chores.java
  • src/main/java/edu/ucsd/msjava/misc/Clauser.java
  • src/main/java/edu/ucsd/msjava/misc/CompGraphPaper.java
  • src/main/java/edu/ucsd/msjava/misc/CompactSATest.java
  • src/main/java/edu/ucsd/msjava/misc/CompareSearchResults.java
  • src/main/java/edu/ucsd/msjava/misc/CompositionFirst.java
  • src/main/java/edu/ucsd/msjava/misc/ControlNew.java
  • src/main/java/edu/ucsd/msjava/misc/ConvertToMgf.java
  • src/main/java/edu/ucsd/msjava/misc/CountID.java
  • src/main/java/edu/ucsd/msjava/misc/CountPSMs.java
  • src/main/java/edu/ucsd/msjava/misc/CountSequestIDs.java
  • src/main/java/edu/ucsd/msjava/misc/DatToTxt.java
  • src/main/java/edu/ucsd/msjava/misc/Deconvolution.java
  • src/main/java/edu/ucsd/msjava/misc/FileFilter.java
  • src/main/java/edu/ucsd/msjava/misc/FilteringEfficiency.java
  • src/main/java/edu/ucsd/msjava/misc/FindPSMIntersection.java
  • src/main/java/edu/ucsd/msjava/misc/GetProteinLength.java
  • src/main/java/edu/ucsd/msjava/misc/GetSearchParams.java
  • src/main/java/edu/ucsd/msjava/misc/HCDCIDETD.java
  • src/main/java/edu/ucsd/msjava/misc/HeckPercolator.java
  • src/main/java/edu/ucsd/msjava/misc/HeckRevision.java
  • src/main/java/edu/ucsd/msjava/misc/HeckWhole.java
  • src/main/java/edu/ucsd/msjava/misc/IPA.java
  • src/main/java/edu/ucsd/msjava/misc/IPRGStudy.java
  • src/main/java/edu/ucsd/msjava/misc/ISBETDAnalysis.java
  • src/main/java/edu/ucsd/msjava/misc/LibraryScripts.java
  • src/main/java/edu/ucsd/msjava/misc/MS2ToMgf.java
  • src/main/java/edu/ucsd/msjava/misc/MSGFDBToInspect.java
  • src/main/java/edu/ucsd/msjava/misc/MSGFDBToQSpec.java
  • src/main/java/edu/ucsd/msjava/misc/MSGFLogger.java
  • src/main/java/edu/ucsd/msjava/misc/MSGFPlusPaper.java
  • src/main/java/edu/ucsd/msjava/misc/MakePrefixDB.java
  • src/main/java/edu/ucsd/msjava/misc/MassCalc.java
  • src/main/java/edu/ucsd/msjava/misc/MergeTargetDecoyFiles.java
  • src/main/java/edu/ucsd/msjava/misc/MiscScripts.java
  • src/main/java/edu/ucsd/msjava/misc/MultiThreadExercise.java
  • src/main/java/edu/ucsd/msjava/misc/MzXMLToMgf.java
  • src/main/java/edu/ucsd/msjava/misc/PEMMRProcessor.java
  • src/main/java/edu/ucsd/msjava/misc/ParamToTxt.java
  • src/main/java/edu/ucsd/msjava/misc/PepIdxToFasta.java
  • src/main/java/edu/ucsd/msjava/misc/PhosAnalysis.java
  • src/main/java/edu/ucsd/msjava/misc/PreprocessSpec.java
  • src/main/java/edu/ucsd/msjava/misc/RunMSGFDBOnGrid.java
  • src/main/java/edu/ucsd/msjava/misc/RunManifestWriter.java
  • src/main/java/edu/ucsd/msjava/misc/RunOMSSAOnCCMS.java
  • src/main/java/edu/ucsd/msjava/misc/SpectraSTToMSGFInput.java
  • src/main/java/edu/ucsd/msjava/misc/SpectraSTToTSV.java
  • src/main/java/edu/ucsd/msjava/misc/SplitFasta.java
  • src/main/java/edu/ucsd/msjava/misc/SplitMgf.java
  • src/main/java/edu/ucsd/msjava/misc/SuffixArrayTest.java
  • src/main/java/edu/ucsd/msjava/misc/SwedCAD.java
  • src/main/java/edu/ucsd/msjava/misc/TopDownAnalysis.java
  • src/main/java/edu/ucsd/msjava/misc/TrainScoringParameters.java
  • src/main/java/edu/ucsd/msjava/misc/VennDiagram.java
  • src/main/java/edu/ucsd/msjava/misc/Zubarev.java
  • src/main/java/edu/ucsd/msjava/msdbsearch/DBScanner.java
  • src/main/java/edu/ucsd/msjava/msdbsearch/FilterDatabase.java
  • src/main/java/edu/ucsd/msjava/msdbsearch/MakePairedSpectra.java
  • src/main/java/edu/ucsd/msjava/msdbsearch/MassCalibrator.java
  • src/main/java/edu/ucsd/msjava/msdbsearch/ReverseLibDB.java
  • src/main/java/edu/ucsd/msjava/msdbsearch/ScoredSpectraMap.java
  • src/main/java/edu/ucsd/msjava/msdbsearch/SearchParams.java
  • src/main/java/edu/ucsd/msjava/msdbsearch/ShuffleDB.java
  • src/main/java/edu/ucsd/msjava/msdictionary/Codon.java
  • src/main/java/edu/ucsd/msjava/msdictionary/GenomeLocator.java
  • src/main/java/edu/ucsd/msjava/msdictionary/GenomeSplitter.java
  • src/main/java/edu/ucsd/msjava/msdictionary/GenomeTranslator.java
  • src/main/java/edu/ucsd/msjava/msdictionary/MSDicLauncher.java
  • src/main/java/edu/ucsd/msjava/msdictionary/ProteinLocator.java
  • src/main/java/edu/ucsd/msjava/msdictionary/TestMSDictionary.java
  • src/main/java/edu/ucsd/msjava/msgf/AminoAcidGraph.java
  • src/main/java/edu/ucsd/msjava/msgf/DeNovoSequencer.java
  • src/main/java/edu/ucsd/msjava/msgf/GenericDeNovoGraph.java
  • src/main/java/edu/ucsd/msjava/msgf/LengthPredictor.java
  • src/main/java/edu/ucsd/msjava/msgf/PercolatorAdapter.java
  • src/main/java/edu/ucsd/msjava/msgf/PrimitiveAminoAcidGraph.java
  • src/main/java/edu/ucsd/msjava/msgf/PrimitiveGeneratingFunction.java
  • src/main/java/edu/ucsd/msjava/msgf/PrimitiveGeneratingFunctionGroup.java
  • src/main/java/edu/ucsd/msjava/msgf/ReachableNode.java
  • src/main/java/edu/ucsd/msjava/msgf/ScoredSpectrumSumHP.java
  • src/main/java/edu/ucsd/msjava/msgf/SpectrumGraphNode.java
  • src/main/java/edu/ucsd/msjava/msgf/analysis/ROCGenerator.java
  • src/main/java/edu/ucsd/msjava/msgf/test/MSGFTest.java
  • src/main/java/edu/ucsd/msjava/msgf/test/MSGFValidation.java
  • src/main/java/edu/ucsd/msjava/msgf2d/BacktrackPointer2D.java
  • src/main/java/edu/ucsd/msjava/msgf2d/BacktrackTable2D.java
  • src/main/java/edu/ucsd/msjava/msgf2d/CombinePairedSpectra.java
  • src/main/java/edu/ucsd/msjava/msgf2d/GeneratingFunction2D.java
  • src/main/java/edu/ucsd/msjava/msgf2d/ScoreBound2D.java
  • src/main/java/edu/ucsd/msjava/msgf2d/ScoreDist2D.java
  • src/main/java/edu/ucsd/msjava/msgf2d/ScoreDistMerged.java
  • src/main/java/edu/ucsd/msjava/msgf2d/TestMSGF2D.java
  • src/main/java/edu/ucsd/msjava/msscorer/ListStat.java
  • src/main/java/edu/ucsd/msjava/msscorer/NewRankScorer.java
  • src/main/java/edu/ucsd/msjava/msscorer/NewScorerFactory.java
  • src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java
  • src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java
  • src/main/java/edu/ucsd/msjava/mstag/Tag.java
  • src/main/java/edu/ucsd/msjava/mstag/TagTester.java
  • src/main/java/edu/ucsd/msjava/mstag/Tagger.java
  • src/main/java/edu/ucsd/msjava/msutil/CompositionAASet.java
  • src/main/java/edu/ucsd/msjava/msutil/DBSearchIOFiles.java
  • src/main/java/edu/ucsd/msjava/msutil/GappedPeptide.java
  • src/main/java/edu/ucsd/msjava/msutil/IntMassAASet.java
  • src/main/java/edu/ucsd/msjava/msutil/NominalMassAASet.java
  • src/main/java/edu/ucsd/msjava/msutil/ScoringFunction.java
  • src/main/java/edu/ucsd/msjava/msutil/SpecFileFormat.java
  • src/main/java/edu/ucsd/msjava/msutil/SpecKey.java
  • src/main/java/edu/ucsd/msjava/msutil/SpectraAccessor.java
  • src/main/java/edu/ucsd/msjava/msutil/Spectrum.java
  • src/main/java/edu/ucsd/msjava/msutil/SpectrumAnnotator.java
  • src/main/java/edu/ucsd/msjava/msutil/SpectrumRecalibrator.java
  • src/main/java/edu/ucsd/msjava/msutil/SpectrumTester.java
  • src/main/java/edu/ucsd/msjava/msutil/TopNFilter.java
  • src/main/java/edu/ucsd/msjava/msutil/test/MzXMLSpectraIteratorTest.java
  • src/main/java/edu/ucsd/msjava/msutil/test/MzXMLSpectraMapTest.java
  • src/main/java/edu/ucsd/msjava/msutil/test/SpectraTest.java
  • src/main/java/edu/ucsd/msjava/mzid/DirectPinWriter.java
  • src/main/java/edu/ucsd/msjava/mzid/DirectTSVWriter.java
  • src/main/java/edu/ucsd/msjava/mzid/MZIdentMLGen.java
  • src/main/java/edu/ucsd/msjava/mzid/MzIDParser.java
  • src/main/java/edu/ucsd/msjava/mzid/MzIdAdapter.java
  • src/main/java/edu/ucsd/msjava/mzml/MzMLAdapter.java
  • src/main/java/edu/ucsd/msjava/mzml/MzMLSpectraIterator.java
  • src/main/java/edu/ucsd/msjava/mzml/MzMLSpectraMap.java
  • src/main/java/edu/ucsd/msjava/mzml/SpectrumConverter.java
  • src/main/java/edu/ucsd/msjava/mzml/StaxMzMLParser.java
  • src/main/java/edu/ucsd/msjava/mzml/StaxMzMLSpectraIterator.java
  • src/main/java/edu/ucsd/msjava/mzml/StaxMzMLSpectraMap.java
  • src/main/java/edu/ucsd/msjava/params/IntRangeParameter.java
  • src/main/java/edu/ucsd/msjava/params/ParamManager.java
  • src/main/java/edu/ucsd/msjava/parser/InsPecTPSM.java
  • src/main/java/edu/ucsd/msjava/parser/InsPecTParser.java
  • src/main/java/edu/ucsd/msjava/parser/MSGFDBParser.java
  • src/main/java/edu/ucsd/msjava/parser/MSGappedDictionaryPSM.java
  • src/main/java/edu/ucsd/msjava/parser/MSGappedDictionaryParser.java
  • src/main/java/edu/ucsd/msjava/parser/MascotParser.java
  • src/main/java/edu/ucsd/msjava/parser/MzXMLSpectraIterator.java
  • src/main/java/edu/ucsd/msjava/parser/MzXMLSpectraMap.java
  • src/main/java/edu/ucsd/msjava/parser/MzXMLToMgfConverter.java
  • src/main/java/edu/ucsd/msjava/parser/OMSSAParser.java
  • src/main/java/edu/ucsd/msjava/parser/PSM.java
  • src/main/java/edu/ucsd/msjava/parser/PSMList.java
  • src/main/java/edu/ucsd/msjava/parser/PepXMLParser.java
  • src/main/java/edu/ucsd/msjava/parser/SearchParameterParser.java
  • src/main/java/edu/ucsd/msjava/parser/SortedSpectraIterator.java
  • src/main/java/edu/ucsd/msjava/scripts/AgilentCyclicSpecPreProcess.java
  • src/main/java/edu/ucsd/msjava/scripts/CountSpectra.java
  • src/main/java/edu/ucsd/msjava/scripts/GetDBInfo.java
  • src/main/java/edu/ucsd/msjava/scripts/MergeSpectra.java
  • src/main/java/edu/ucsd/msjava/scripts/SelectSpectra.java
  • src/main/java/edu/ucsd/msjava/scripts/SpecFileValidator.java
  • src/main/java/edu/ucsd/msjava/suffixarray/MassArray.java
  • src/main/java/edu/ucsd/msjava/ui/MSDictionary.java
  • src/main/java/edu/ucsd/msjava/ui/MSGF.java
  • src/main/java/edu/ucsd/msjava/ui/MSGFDB.java
  • src/main/java/edu/ucsd/msjava/ui/MSGFDBLib.java
  • src/main/java/edu/ucsd/msjava/ui/MSGFLib.java
  • src/main/java/edu/ucsd/msjava/ui/MSGFPlus.java
  • src/main/java/edu/ucsd/msjava/ui/MSProfile.java
  • src/main/java/edu/ucsd/msjava/ui/MzIDToTsv.java
  • src/main/java/edu/ucsd/msjava/ui/PRMSpecGen.java
  • src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java
  • src/main/java/org/systemsbiology/jrap/stax/Base64.java
  • src/main/java/org/systemsbiology/jrap/stax/ByteBufferIterator.java
  • src/main/java/org/systemsbiology/jrap/stax/DataProcessingInfo.java
  • src/main/java/org/systemsbiology/jrap/stax/EndPatternStringIterator.java
  • src/main/java/org/systemsbiology/jrap/stax/FileHeaderParser.java
  • src/main/java/org/systemsbiology/jrap/stax/IndexParser.java
  • src/main/java/org/systemsbiology/jrap/stax/LineIterator.java
  • src/main/java/org/systemsbiology/jrap/stax/MLScanAndHeaderParser.java
  • src/main/java/org/systemsbiology/jrap/stax/MSInstrumentInfo.java
  • src/main/java/org/systemsbiology/jrap/stax/MSOperator.java
  • src/main/java/org/systemsbiology/jrap/stax/MSXMLParser.java
  • src/main/java/org/systemsbiology/jrap/stax/MSXMLSequentialParser.java
  • src/main/java/org/systemsbiology/jrap/stax/MZXMLFileInfo.java
  • src/main/java/org/systemsbiology/jrap/stax/ParentFile.java
  • src/main/java/org/systemsbiology/jrap/stax/Scan.java
  • src/main/java/org/systemsbiology/jrap/stax/ScanAndHeaderParser.java
  • src/main/java/org/systemsbiology/jrap/stax/ScanHeader.java
  • src/main/java/org/systemsbiology/jrap/stax/SoftwareInfo.java
  • src/main/java/org/systemsbiology/jrap/stax/StringBuilderReader.java
  • src/main/java/org/systemsbiology/jrap/stax/TestParser.java
  • src/test/java/edu/ucsd/msjava/mzid/AnalysisProtocolCollectionGenTest.java
  • src/test/java/ims/IMSMSGFTest.java
  • src/test/java/ims/IMSMiscTest.java
  • src/test/java/ims/IMSResultProcessor.java
  • src/test/java/ims/SarcTest.java
  • src/test/java/ipa/TestIPA.java
  • src/test/java/msgfplus/TestDirectPinWriter.java
  • src/test/java/msgfplus/TestIntRangeParameter.java
  • src/test/java/msgfplus/TestMSGFLogger.java
  • src/test/java/msgfplus/TestMSGFPlus.java
  • src/test/java/msgfplus/TestMSLevelFiltering.java
  • src/test/java/msgfplus/TestMZIdentMLGen.java
  • src/test/java/msgfplus/TestMassCalibrator.java
  • src/test/java/msgfplus/TestMinSpectraPerThread.java
  • src/test/java/msgfplus/TestMisc.java
  • src/test/java/msgfplus/TestParsers.java
  • src/test/java/msgfplus/TestPrecursorCalIntegration.java
  • src/test/java/msgfplus/TestPrecursorCalScaffolding.java
  • src/test/java/msgfplus/TestPrimitiveRegression.java
  • src/test/java/msgfplus/TestRunManifestWriter.java
  • src/test/java/msgfplus/TestScoring.java

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

ypriverol and others added 29 commits April 16, 2026 23:21
…ives-gf

Replaces FlexAminoAcidGraph + GeneratingFunction with CSR-based
PrimitiveAminoAcidGraph + flat-array PrimitiveGeneratingFunction in
DBScanner.computeSpecEValue hot path. Ported without the experimental
remainingUses / eager ScoreDist release (proven 5.9% CPU regression
for 3.3% RSS gain).

Legacy FlexAminoAcidGraph and GeneratingFunction remain for other
callers. Parity verified by new TestPrimitiveRegression (ported from
feat/primitives-gf) and existing test suite.

Phase 1 of 3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewrites PrimitiveGeneratingFunctionGroup as a running merger. Each
per-mass-index GF is computed, merged into the aggregate via
addProbDist, and released before the next mass index is built, so peak
memory stays at one graph + one GF regardless of tolerance-window
width. Math is identical because addProbDist with scoreDiff=0 and
aaProb=1f is a linear sum; the running aggregate transparently widens
its bounds if a later GF has a wider score range.

This addresses the Phase-1 Astral regression (127% slower, +7.3% RSS)
caused by concurrent accumulation of distByNode across all mass
indices. TMT parity preserved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds two user-facing pages to close documentation gaps surfaced in the
landscape review:

- Troubleshooting.md: centroiding (MSConvert peakPicking filter), XML
  prolog / encoding errors, FASTA size limits and split-and-merge
  workaround, -Xmx sizing table, -tasks tuning for OOM, thread-cap
  behaviour, and the OpenMS TOPPAS adapter workaround.
- IsobaricLabeling.md: fixed-mod recipes for TMT-6/10/11, TMTpro-16/18,
  iTRAQ-4/8, the correct -protocol flag per label, and a full worked
  Mods.txt example. Addresses the recurring "does MS-GF+ support
  TMT-16plex?" question (issue MSGFPlus#82) which is a docs, not a code, gap.

Also wires the two pages into docs/README.md's table of contents.

The .gitignore update excludes a local-only working research note
(.claude/investigations/msgfplus_research_report.md) from the repo,
consistent with the existing "benchmark harness is local-only"
convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Astral profiling revealed that synchronized Hashtable contention in the
shared scorer tables dominated runtime: Hashtable.get alone took 23%
of CPU and the surrounding monitor machinery (TrySpin, SafeFetch,
ObjectSynchronizer) added another ~20% — 43% of CPU was sync overhead
with 4 worker threads serializing through per-table monitors.

NewRankScorer tables (fragOFFTable, insignificantFragOFFTable,
rankDistTable, ionErrDistTable, noiseErrDistTable, ionExistenceTable)
are populated once in readFromFile and read-only during search, so
plain HashMap is safe. NewScorerFactory.scorerTable is a mutable cache
with possible concurrent writes before warmup, so it uses
ConcurrentHashMap. ScoringParameterGenerator(s) use HashMap too to
match the NewRankScorer field types (build-time, not search path).

Benchmarks (baseline = dev, branch = feat/primitives-optimization):
  PXD001819 LFQ:  176.5s -> 109.4s (-38.0%),  2254 -> 2472 MB (+9.7%)
  TMT:            644.3s -> 265.6s (-58.8%),  2872 -> 3125 MB (+8.8%)
  Astral:        2155.9s -> 723.3s (-66.5%),  6775 -> 7627 MB (+12.6%)

PSM@1%FDR and SII counts match exactly on all three datasets. RSS
grows modestly because worker threads now actually run in parallel
(no more monitor serialization), so 4 concurrent graph/GF states are
alive instead of effectively one.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf(msgf): CSR graph + streaming GF merge + drop Hashtable sync
Native Math.log (libmLog) was 5.46% of CPU in the post-PR#15 Astral
profile. The call sites in NewRankScorer.getErrorScore and
getNodeScore / getMissingIonScore compute log(x/y) over frequency
arrays that are immutable after scorer load; the denominator scale
factor min(ionType.getCharge(), numSegments) is also load-time
constant. Cache the resulting float values once at the end of
readFromInputStream and replace the runtime Math.log calls with
direct array indexing.

Scoring results are bit-identical: same expressions, same operand
ordering, same float rounding; the only difference is that the cast
to float happens once per cell at load instead of per call. Both
hot-path methods keep a fallback to the original computation so
legacy callers that populate the tables without going through
readFromInputStream still work.

Benchmarks on the current machine state (baseline = dev HEAD jar,
same run, same thermal state):

  PXD001819 LFQ:  122.7s -> 110.4s (-10.0%),  2410 ->  2292 MB (-4.9%)
  TMT:            295.7s -> 277.9s ( -6.0%),  2793 ->  2818 MB (+0.9%)
  Astral:        1002.9s -> 883.5s (-12.0%),  7707 ->  7351 MB (-4.6%)

PSM@1%FDR and SII counts match exactly on all three datasets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf(scorer): precompute log scores in NewRankScorer
Two small quick-wins from the April 2026 landscape review (research
report Q1 and Q3, msgfplus_research_report.md §0.1):

Q1 - "Skip spectrum since it is not centroided" now tells the user how
to fix the input:
  - profile spectra: "Re-run MSConvert with --filter \"peakPicking true 1-\"
    to centroid the spectra."
  - dense centroided: "Pass -allowDenseCentroidedPeaks 1 if the spectrum
    is already centroided."
This is the most common onboarding failure mode surfaced in issue MSGFPlus#116
and in user support threads; the previous message only said that a
spectrum was skipped, leaving users to guess why.

Q3 - New AnalysisProtocolCollectionGenTest locks in the fix from issue
MSGFPlus#72. If the mzid Enzyme/@missedCleavages attribute ever stops reflecting
the user's -maxMissedCleavages (including the 0 and -1/"no limit"
sentinel values), the test fails. Three cases covered: 2, 0, -1.

No production-code behaviour change for the Q3 path; Q1 is an error
message tweak only. 144 tests pass (was 141).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rser

quantms and other nf-core / SDRF-driven pipelines use
ThermoRawFileParser (TRFP) for raw-to-mzML conversion, not MSConvert.
The previous inline error message mentioned only MSConvert, which is
the wrong tool for a large and growing user base.

- SpecKey.java: per-spectrum hint now covers both paths in one short
  line: "ThermoRawFileParser centroids Thermo MS2 by default;
  MSConvert --filter \"peakPicking true 1-\"".
- Troubleshooting.md: restructured the "Skip spectrum since it is not
  centroided" section into a per-tool fix list (TRFP, MSConvert,
  OpenMS) so users on any pipeline can map the error to their own
  conversion step.

No production behaviour change; strings + docs only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The thread-count cap in MSGFPlus.runMSGFPlus and MSGFDB.runMSGFDB
previously hardcoded "minimum 250 spectra per thread" (ui/MSGFPlus.java,
ui/MSGFDB.java). On many-core hosts running small inputs (e.g. 20 cores,
~1,000 spectra) this capped the search at ~4 threads, surprising users.

Rather than guess a new default, expose the divisor as -minSpectraPerThread
(default 250, min 1). Power users can lower it to raise parallelism on
small inputs; everyone else gets identical behaviour to before.

Wired in both MSGFPlus and the deprecated MSGFDB entry points so behaviour
stays consistent. Addresses issue MSGFPlus#52.

Tests: TestMinSpectraPerThread covers default, override, zero-rejection,
and MSGFDB registration. mvn -B verify: 145/145 tests pass, 57 skipped.

Docs: Troubleshooting.md and MSGFPlus.md now show the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a lightweight static logger at edu.ucsd.msjava.misc.MSGFLogger with
info/debug/warn/error levels. Debug is gated on the existing -verbose 0/1
flag; warn/error go to stderr with [Warning]/[Error] prefixes. No external
dependencies (no slf4j/log4j) to keep the jar small.

Wires MSGFPlus.main() to call MSGFLogger.setVerbose(...) once after
parseParams, so the whole run inherits the CLI setting. Migrates the
top-level main() and the runMSGFPlus(ParamManager) dispatch loop:

  - Error paths: System.err.println("[Error] ...") -> MSGFLogger.error(...)
  - "Processing N spectra" (summary)          -> info
  - Per-file enumeration                      -> debug
  - Per-file "Processing"/"Ignoring" banner   -> info
  - Per-file "Writing results to"/"Output... exists" detail -> debug
  - "MS-GF+ complete" footer                  -> info
  - Decoy-ratio mismatch errors               -> MSGFLogger.error

Default behaviour (-verbose 0) is unchanged for all non-debug messages.
Running with -verbose 1 now exposes the per-file enumeration and the
per-file output/ignore details.

Intentionally narrow scope: the other ~260 System.out.println call sites
across DBScanner, ConcurrentMSGFPlus, BuildSA, etc. are unchanged. This
PR establishes the logger and wiring; case-by-case migration of those
sites can follow as they are touched.

Tests: TestMSGFLogger (7 tests) covers info-always, debug-gating, warn/
error stderr routing, format interpolation, and the isVerbose getter.
mvn -B verify: 152/152 tests pass, 57 skipped (same as before).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Writes <output.mzid>.manifest.json next to each mzIdentML output
capturing the run context: MS-GF+ version and timestamp; Java version,
vendor, OS; max heap and thread count; enzyme / instrument /
activation / protocol; precursor tolerance, isotope-error range,
length and charge bounds, missed-cleavage cap; spec-file and
FASTA-file absolute paths with byte sizes; and the original CLI argv
verbatim. Downstream pipelines (quantms, Galaxy-P, custom
reanalysis scripts) can then verify or reproduce a search without
re-parsing logs.

Called from MSGFPlus.runMSGFPlus after each successful per-file
search. Failures to write are MSGFLogger.warn()-logged and never
abort the search — manifests are advisory metadata, not output.

JSON is hand-rolled (stable key order, UTF-8, 2-space indent) so no
new dependency is pulled into the shaded jar.

Tests: TestRunManifestWriter covers required identity fields, echoed
SearchParams values, argv preservation, null-argv tolerance, and
end-to-end sidecar write/read.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e context (Q8)

The Stax XML parser raises terse "ParseError in XML prolog" exceptions
when an mzML file has a byte-order mark or malformed encoding
declaration — a recurring onboarding failure reported in
thomasp85/MSGFplus#8. Users see the message but not the fix.

Wraps both parse entry points (buildIndex, preloadAllSpectra) in
catches that:
  - preserve the original XMLStreamException as the cause
  - prepend the full mzML file path and the parse phase
  - when the underlying message looks like a BOM / prolog / encoding
    failure, point the user at ThermoRawFileParser / MSConvert and
    docs/Troubleshooting.md with a concrete shell command to detect
    a BOM (`head -c 3 file.mzML | xxd`)

No behaviour change for well-formed mzML: annotate() is only reached
on an exception, and the wrapped message still includes the original
parser error verbatim so tooling that matches on Stax substrings
keeps working.

Tests: TestStaxMzMLParserErrorContext (2 cases) feeds BOM-prefixed
and garbage-prolog mzML to the constructor and verifies the wrapped
message is emitted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Emits a Percolator .pin file directly from the in-memory result list,
bypassing the separate msgf2pin converter. This closes one of the
long-standing MS²Rescore / Percolator integration gaps for bigbio/msgfplus
users.

Column layout (tab-separated):
  SpecId Label ScanNr ExpMass CalcMass
  RawScore DeNovoScore lnSpecEValue lnEValue IsotopeError
  PepLen dM absdM
  Charge{min..max}          (one-hot over params' charge range)
  NumMatchedMainIons ExplainedIonCurrentRatio NTermIonCurrentRatio
  CTermIonCurrentRatio MS2IonCurrent IsolationWindowEfficiency
  MeanErrorTop7 StdevErrorTop7 MeanRelErrorTop7 StdevRelErrorTop7
  Peptide Proteins

Label is +1 when at least one target protein matches, -1 when every
match is a decoy (Percolator uses those as the null distribution).
Peptide is emitted in Percolator's pre.PEPTIDE.post flanking format
with inline mod masses (+/-mass.mmm). Proteins are tab-separated so
Percolator's "read all remaining columns" rule works as-is.

When -addFeatures 1 is off the feature columns are zero-filled rather
than dropped — the column count stays stable across runs so any
downstream config that references a column index keeps working.

Wiring: new "pin" entry (index 3) on the existing -outputFormat
EnumParameter, a SearchParams.writePin() getter, and a call site in
MSGFPlus.runMSGFPlus next to the existing writeTsv / writeMzid blocks.
PIN output file path is outputFile.getPath().replaceAll("\\.mzid$", ".pin").

formatPeptideWithMods is duplicated from DirectTSVWriter for now;
extracting a shared PeptideFormatter is a clean follow-up.

Tests: TestDirectPinWriter (4 cases) covers flag acceptance, the
writePin getter, index stability vs the existing entries, and the
header shape via reflection on writeHeader.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chore(reliability): actionable centroiding error + missedCleavages test
feat: -minSpectraPerThread flag, MSGFLogger, and run-manifest sidecar (Q9/Q11/Q4)
CI surfaced an assertion failure in
TestStaxMzMLParserErrorContext.garbledPrologAlwaysProducesAnnotatedMessage:
getCause() was null on the wrapped exception.

Root cause: the XMLStreamException(msg, location, nested) constructor
stores the original exception as an internal "nested exception" but
does NOT invoke Throwable.initCause(), so getCause() returns null. This
is a JDK idiosyncrasy of javax.xml.stream.XMLStreamException, not of the
generic Throwable chain.

Fix: drop the 3-arg constructor and call initCause(e) explicitly on the
wrapped exception so standard Java exception-chaining utilities
(printStackTrace, causal frames, test assertions) see the original
parser error as expected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chore(mzml): annotate StaxMzMLParser BOM/prolog errors (Q8)
feat(mzid): DirectPinWriter + -outputFormat 3 (pin) (Q7)
DirectPinWriter now emits two additional PSM-level features in the
.pin output consumed by Percolator / MS²Rescore / Mokapot:

- lnDeltaSpecEValue: log(rank1 SpecEValue / rank2 SpecEValue) for
  rank-1 PSMs, 0 otherwise. Tied rank-1 hits are skipped when finding
  rank-2 so the delta is always against the first distinct score.
- matchedIonRatio: NumMatchedMainIons / PepLen, peptide-length-
  normalized ion-match density.

Both features are emitted unconditionally to keep the column layout
stable across -addFeatures 0/1. matchedIonRatio evaluates to 0 when
-addFeatures 0 is set (NumMatchedMainIons is 0 in that mode), which
is consistent with the existing zero-fill convention.

Covered by 11 new unit tests including rank-tie handling, NaN/
non-positive guards, malformed input, and header column ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…header renames

Aligns DirectPinWriter's .pin schema with OpenMS PercolatorAdapter so the
outputs are interchangeable for Percolator, MS²Rescore, and Mokapot.

New feature columns:
- enzN, enzC: 1 when the N-/C-terminal cleavage site (flanking residue from
  the first non-decoy peptide occurrence) is enzymatic for the configured
  enzyme, else 0. Protein-boundary '-' always counts as enzymatic.
- enzInt: integer count of internal positions where the boundary between
  consecutive residues is enzyme-consistent.
- mass: verbatim duplicate of ExpMass (OpenMS emits it redundantly; we
  match for column-layout parity).

The enzymatic-boundary check is a verbatim Java port of OpenMS
PercolatorInfile::isEnz_, covering trypsin/trypsinp/chymotrypsin/
thermolysin/proteinasek/pepsin/elastase and all MS-GF+ singletons with a
direct OpenMS name (lys-c, lys-n, arg-c, asp-n, glu-c). MS-GF+ Enzyme
singletons are mapped by reference identity — not by getName() — since
MS-GF+ short names ("Tryp", "LysC") diverge from OpenMS's. Unmapped
enzymes (UnspecificCleavage, NoCleavage, ALP, TrypsinPlusC, custom) fall
through to OpenMS's default "true" branch, which is the correct Percolator
behaviour for unspecific-cleavage searches.

Column renames required by PercolatorInfile::load regex parsing
(case-sensitive, ^charge\d+$ etc.):
  PepLen       → peplen
  Charge2..K   → charge2..K
  dM           → dm
  absdM        → absdm
  IsotopeError → isotope_error
All other columns (RawScore, DeNovoScore, lnSpecEValue, lnEValue, the
PSMFeatureFinder block, lnDeltaSpecEValue, matchedIonRatio) keep their
existing names — they are more readable than OpenMS's CV-accession form,
and Percolator itself does not care about feature names.

Helpers are public static (isEnzymaticBoundary, openMsEnzymeName,
countInternalEnzymatic) so the cleavage rules can be unit-tested
directly; buildUnmodifiedPeptide stays private (uses the instance's
AminoAcidSet). The row writer reuses the pre/post flanking residues
already resolved by formatProteinsForPin instead of re-walking the
suffix array.

Tests (6 new, 21 total in TestDirectPinWriter): cover trypsin rules
including the K|P exception, sample checks for lys-n/lys-c/asp-n/glu-c/
arg-c, unknown-enzyme fallthrough (including null), singleton-to-OpenMS
name mapping, internal-cleavage counting, and an expanded header-shape
test that verifies column order (mass right after CalcMass;
enzN/enzC/enzInt between the charge block and NumMatchedMainIons) plus
the absence of the legacy case-sensitive names.

Full suite: 183 tests, 0 failures, 57 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…colator

MS-GF+'s scorer can produce literal 'NaN' (or Inf) strings for PSMFeatureFinder
stats like MeanErrorTop7 and StdevErrorTop7 when a PSM has too few matched
ions to compute variance. Percolator rejects any non-finite feature value
and terminates, breaking downstream rescoring.

Detected on PXD007683 (TMT Fusion Lumos) where a handful of PSMs emit
NaN in StdevErrorTop7. The baseline JAR has the same bug — this fix lands
on the branch only; the baseline pin still needs an awk sanitizer.

Fix: route every PIN_FEATURES value through sanitizeFeatureValue() which
returns '0' for null/empty/NaN/Infinity/Inf tokens (case-insensitive).
Matches the existing zero-fill convention for missing features.

Covered by a new 10-assertion unit test hitting every rejection path plus
the happy-path numeric pass-through cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sorCal

Introduces the CLI and per-file state plumbing for Achievement B,
two-pass precursor mass calibration (P2-cal). This commit only adds
inert scaffolding; no search behavior changes.

- Register -precursorCal StringParameter (auto|on|off, default auto).
- Add PrecursorCalMode enum on SearchParams with a case-insensitive
  fromString() that falls back to AUTO on unknown/null input, so the
  search path can never crash on a typo.
- Add precursorMassShiftPpm field + accessors to DBSearchIOFiles. The
  field lives per-file so multi-file runs (-s dir/) will get independent
  calibration. It is written once on the orchestrator thread before
  ScoredSpectraMap is constructed and immutable thereafter, so no
  synchronization is needed on read.
- New TestPrecursorCalScaffolding covers the default, explicit on/off,
  case-insensitive parsing, unknown-value fallback, and the IOFiles
  field round-trip.

No call site reads the new state yet; that wiring lands with the
MassCalibrator commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llection

Introduces edu.ucsd.msjava.msdbsearch.MassCalibrator — the standalone
Achievement B pre-pass. Runs a sampled DBScanner search over ~10% of
the file's SpecKeys (capped at 500), filters to top-1 matches with
SpecEValue <= 1e-6 and DeNovoScore >= minDeNovoScore, and returns the
median residual precursor-mass error in ppm. Below 200 confident PSMs
the calibrator returns 0.0 so the main search falls through unchanged.

Design notes:

- Calibrator is instantiated once per file on the orchestrator thread
  with the specKeyList already built. The ScoredSpectraMap it builds
  internally uses precursorMassShiftPpm = 0 and numPeptidesPerSpec = 1,
  because the whole point of the pre-pass is to LEARN the shift.
- Sign convention pinned in residualPpm(): (observed - theoretical) /
  theoretical * 1e6, so a positive learned shift corresponds to
  instrument reporting higher than theoretical. Downstream correction
  applies mass * (1 - shiftPpm * 1e-6).
- median() sorts a defensive copy so caller's list is untouched.
  Empty list returns 0.0 (documented contract — used by the "no data"
  fallback path).
- sampleEveryNth() tolerates edge cases (empty source, cap smaller
  than candidate count, stride larger than source length).
- Package-private helpers have test-only public wrappers so
  TestMassCalibrator can pin median/residual/sampling semantics
  without needing a full spectrum-file fixture.

The class is not yet wired into MSGFPlus.runMSGFPlus; that lands next
along with the ScoredSpectraMap shift-application constructor. Because
of that, the -precursorCal flag is still a no-op in this commit.

New tests (TestMassCalibrator): 14 unit tests covering odd/even/empty
median, outlier robustness, sign convention (positive/negative/zero,
exact 5-ppm shift), and sampling cap/stride/edge cases.

204 tests pass (up from 190), zero failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dSpectraMap

Threads the learned precursor-mass shift through ScoredSpectraMap and
calls MassCalibrator from MSGFPlus.runMSGFPlus() between SpecKey list
construction and task fan-out.

ScoredSpectraMap changes:

- New constructor arg precursorMassShiftPpm (double). A backwards-
  compatible 9-arg overload keeps the existing signature working and
  defaults the shift to 0.0, so call sites that do not opt in are
  unchanged.
- Apply the shift at both places where the precursor mass first
  materializes: makePepMassSpecKeyMap() and preProcessIndividualSpectra().
- Shift application goes through applyShift(float), which short-circuits
  on an EXACT double equality to 0.0. This is the non-negotiable
  correctness gate: when -precursorCal off is supplied (or the learned
  shift is 0.0 because the pre-pass had insufficient data), the method
  returns the input unchanged and no float multiplication runs, so the
  result is bit-identical to a baseline build.

MSGFPlus.runMSGFPlus() changes:

- After SpecKeys are built and SpecDataType is constructed, look up the
  current file's DBSearchIOFiles and, unless mode == OFF, run the
  MassCalibrator pre-pass.
- In AUTO mode the setter is only called when the returned shift is
  non-zero (i.e. >=200 confident PSMs were collected). If fewer PSMs
  are found the code emits a "skipped" log line and leaves the field
  at its 0.0 default — same fast path as OFF.
- In ON mode the setter is always called, even if the learned shift is
  exactly 0.0 (which, on the fast path, is still a no-op but documents
  user intent).
- The task-loop ScoredSpectraMap builders now pass the learned shift.

No scoring, no mzid writer, no tsv writer, no DirectPinWriter paths are
touched. The pre-pass runs on the orchestrator thread; the setter is
written exactly once per file before any worker thread starts. No
synchronization is needed on the field.

All 204 existing tests still pass. A dedicated regression test for the
-precursorCal off bit-identical gate lands with the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ode bit-identity

The pre-pass in MassCalibrator.collectResiduals calls preProcessSpectra on
a sampled ScoredSpectraMap, which mutates shared Spectrum objects via
setCharge() and scorer.getScoredSpectrum(). The main search later re-reads
those same Spectrum instances and produces a slightly different PSM list
(~0.1% drift observed on test.mgf: 10727 vs 10742 PSMs).

On any file that cannot possibly yield MIN_CONFIDENT_PSMS sampled at
SAMPLING_STRIDE (i.e. specKeyList.size() < 200 * 10 = 2000), the pre-pass
would run and always return 0.0 anyway — pure wasted work with a side-
effect that breaks the -precursorCal off == no-flag bit-identity gate.

Guard: early-exit with 0.0 when the spec count is below the threshold.
Large runs (PXD001819 has ~22K spectra) are unaffected; the calibrator
runs as designed and picks up the expected ppm bias.

Validated on PXD001819 (Percolator, 1% FDR):
  baseline (origin/dev @ 1d481aa)              : 14903 targets
  branch + -precursorCal off (A only)          : 15001 targets (+98)
  branch + -precursorCal auto (A + B combined) : 15157 targets (+254)

The deeper fix — stop the pre-pass from mutating shared Spectrum state —
is a follow-up. Tracked in the TODO at the top of learnPrecursorShiftPpm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end validation of Achievement B's non-negotiable correctness
invariant: -precursorCal off must produce a PSM list identical to a
no-flag baseline (which defaults to AUTO).

Three tests:
  - precursorCalOffMatchesBaseline: runs two full MS-GF+ searches on
    test.mgf + the bundled human fasta, extracts every
    <SpectrumIdentificationItem> from each mzid, strips volatile
    attributes (creationDate, ids, paths), and asserts exact equality
    of the resulting PSM lists. Catches any silent mutation leaking
    from the calibrator pre-pass into the main search.
  - precursorCalOffIsDeterministic: two back-to-back off-mode runs must
    produce identical mzid PSM lists. Pins the no-op path against
    accidental non-determinism (e.g. HashSet iteration order drift).
  - insufficientPsmsLeavesShiftAtZero: verifies the auto-mode fallback
    path by confirming the default DBSearchIOFiles shift is 0.0.

On small fixtures like test.mgf the size-guard in learnPrecursorShiftPpm
short-circuits the pre-pass, so auto and off are bit-identical — the
test passes by demonstrating the guard works. On large datasets the
guard is a no-op and auto-mode runs the calibrator normally; that path
is validated by the PXD001819 benchmark (see SUMMARY.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e pre-pass

The earlier size-guard (MIN_CONFIDENT_PSMS × SAMPLING_STRIDE = 2000) used
specKeyList.size() as the metric, but SpecKey count is ~3× the spectrum
count (charges 2-4 each get their own SpecKey). test.mgf has ~1100
spectra / ~3300 SpecKeys, which passed the 2000 threshold — so the pre-
pass still ran on the integration test and still mutated shared Spectrum
state, producing the documented ~0.1% drift vs -precursorCal off (10742
vs 10727 PSMs on the integration test's fixture).

Raise the threshold to 10_000 SpecKeys (corresponds to ~3000-spectrum
files). Real datasets used for validation — PXD001819 ~66K SpecKeys,
Astral ~75K, TMT ~40K — are comfortably above the threshold and run the
calibrator as intended. Small test fixtures stay bit-identical to the
off-mode path.

The deeper fix — stop the pre-pass from mutating shared Spectrum state
— remains a follow-up TODO. Until that lands, the guard keeps the
non-negotiable -precursorCal off ≡ baseline correctness gate intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
perf(msgfplus): Percolator pin features (Achievement A) + precursor mass calibration (Achievement B)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants