Open
60 commits
191ba71
fix: enable test suite, fix missing dependencies, update compiler to …
ypriverol Apr 13, 2026
d7a8b61
fix(#157): preserve PSM scores when DeNovoScore is below threshold
ypriverol Apr 13, 2026
4ba9e0e
feat(#159): add -msLevel parameter for MS level filtering
ypriverol Apr 13, 2026
65c2592
Merge pull request #5 from bigbio/feature/159-ms-level-filtering
ypriverol Apr 13, 2026
5f0ced1
Merge pull request #4 from bigbio/fix/157-mzid-missing-psm-scores
ypriverol Apr 13, 2026
4f74816
Merge pull request #3 from bigbio/fix/test-infrastructure
ypriverol Apr 13, 2026
c400668
refactor: remove dead code — 150 unused classes across 10 packages
ypriverol Apr 13, 2026
79e1779
docs: modernize README with full parameter reference
ypriverol Apr 13, 2026
3909fe8
Merge pull request #8 from bigbio/feature/ci-readme-cleanup
ypriverol Apr 14, 2026
e955869
Merge pull request #7 from bigbio/refactor/dead-code-removal
ypriverol Apr 14, 2026
ae86a1a
feat: add direct TSV output (-outputFormat tsv|mzid|both)
ypriverol Apr 14, 2026
a0a36bb
feat: include Percolator features in direct TSV output
ypriverol Apr 14, 2026
09d62e2
perf: replace jmzml JAXB parser with StAX-based mzML reader
ypriverol Apr 13, 2026
31cb92c
refactor: remove jmzml dependency, add referenceableParamGroupRef sup…
ypriverol Apr 13, 2026
dbe981d
chore: remove unused jmzReader dependency
ypriverol Apr 13, 2026
17b202d
refactor: remove mzXML support and jrap/stax library
ypriverol Apr 13, 2026
55daec4
Update src/main/java/edu/ucsd/msjava/mzml/StaxMzMLParser.java
ypriverol Apr 14, 2026
f2a9075
fix: address Copilot review, clean up pom.xml
ypriverol Apr 14, 2026
5332df4
feat: add direct TSV output (-outputFormat tsv|mzid|both)
ypriverol Apr 14, 2026
65088eb
feat: include Percolator features in direct TSV output
ypriverol Apr 14, 2026
3eb2966
fix: update TestParsers after dead-code removal rebase
ypriverol Apr 14, 2026
1f00a2b
Merge pull request #9 from bigbio/feature/native-tsv-output
ypriverol Apr 15, 2026
d4c1f9c
Merge pull request #6 from bigbio/feature/stax-mzml-reader
ypriverol Apr 15, 2026
87e7e75
chore: split infra/packaging updates into reviewable PR (#11)
ypriverol Apr 16, 2026
d7ebe6a
feat: add dataset-scoped PXD001819 benchmark CI scaffold
ypriverol Apr 16, 2026
b3f2e98
chore: align benchmark naming and mzXML messaging
ypriverol Apr 16, 2026
ea0de94
docs: drop PXD001819 plan file; point READMEs at CI docs
ypriverol Apr 16, 2026
cf6275d
fix(benchmark): address Copilot review on PXD001819 CI scaffold
timosachsenberg Apr 16, 2026
ed14765
fix(benchmark): harden PXD001819 scaffold per review feedback
claude Apr 16, 2026
3c47109
Merge pull request #13 from bigbio/claude/review-msgfplus-pr-12-YfoTI
ypriverol Apr 16, 2026
032d088
Merge pull request #12 from bigbio/benchmark
ypriverol Apr 16, 2026
b1a1498
feat(msgf): port primitive CSR graph + flat-array GF from feat/primit…
ypriverol Apr 16, 2026
9d07047
perf(msgf): stream mass-index GF merging to drop peak memory
ypriverol Apr 17, 2026
757506c
docs: add troubleshooting guide and isobaric-labeling recipes
ypriverol Apr 17, 2026
8442f2c
perf(scorer): drop java.util.Hashtable for HashMap/ConcurrentHashMap
ypriverol Apr 17, 2026
e597230
Merge pull request #15 from bigbio/feat/primitives-optimization
ypriverol Apr 17, 2026
cafbf73
perf(scorer): cache per-scorer log tables to avoid runtime Math.log
ypriverol Apr 17, 2026
9cdae16
Merge pull request #16 from bigbio/perf/precompute-log-scores
ypriverol Apr 17, 2026
b433305
chore(reliability): actionable centroiding error + missedCleavages test
ypriverol Apr 17, 2026
7e3a69d
chore(reliability): broaden centroiding hint to cover ThermoRawFilePa…
ypriverol Apr 17, 2026
928a4f4
feat(msgf): add -minSpectraPerThread flag to override thread-cap divisor
ypriverol Apr 17, 2026
97431ce
feat(misc): add MSGFLogger, wire verbose flag into MSGFPlus entry point
ypriverol Apr 17, 2026
095100e
feat(misc): add RunManifestWriter sidecar for run reproducibility
ypriverol Apr 17, 2026
db73197
chore(mzml): annotate StaxMzMLParser BOM/prolog errors with actionabl…
ypriverol Apr 17, 2026
779eec2
feat(mzid): DirectPinWriter + -outputFormat 3 (pin) (Q7)
ypriverol Apr 17, 2026
fb76029
Merge pull request #17 from bigbio/chore/reliability-quick-wins
ypriverol Apr 17, 2026
1bd9ff2
Merge pull request #18 from bigbio/feat/logger-and-run-manifest
ypriverol Apr 17, 2026
03e6e5a
fix(mzml): explicit initCause on annotated XMLStreamException
ypriverol Apr 18, 2026
911b070
Merge pull request #19 from bigbio/feat/stax-error-context
ypriverol Apr 18, 2026
1d481aa
Merge pull request #20 from bigbio/feat/direct-pin-writer
ypriverol Apr 18, 2026
a098715
feat(pin): add lnDeltaSpecEValue and matchedIonRatio Percolator features
ypriverol Apr 18, 2026
b86d65d
feat(pin): OpenMS PercolatorAdapter parity — enzN/enzC/enzInt/mass + …
ypriverol Apr 18, 2026
b020f0e
fix(pin): sanitize NaN/Infinity feature values before emitting to Per…
ypriverol Apr 19, 2026
f3cb45e
feat(mass-cal): SearchParams + DBSearchIOFiles scaffolding for precur…
ypriverol Apr 18, 2026
0bc40cc
feat(mass-cal): MassCalibrator class with DBScanner-based residual co…
ypriverol Apr 18, 2026
98d0e8f
feat(mass-cal): wire MassCalibrator into MSGFPlus.runMSGFPlus + Score…
ypriverol Apr 18, 2026
6279905
fix(mass-cal): size-guard in learnPrecursorShiftPpm to preserve off-m…
ypriverol Apr 18, 2026
e7f7f1c
test(mass-cal): integration test for -precursorCal off bit-identity gate
ypriverol Apr 18, 2026
0a29486
fix(mass-cal): raise size-guard threshold so test.mgf doesn't trip th…
ypriverol Apr 19, 2026
78af285
Merge pull request #22 from bigbio/feat/msgfplus-perf-ab
ypriverol Apr 19, 2026
35 changes: 35 additions & 0 deletions .claude/CLAUDE.md
@@ -0,0 +1,35 @@
# MS-GF+ Project — Claude Context

## Overview

MS-GF+ is a mass spectrometry database search tool for peptide identification.
The codebase is Java (Maven build). Benchmark harness scripts are local-only (not committed).

## Branch

Primary integration branch: `dev`

## Key Directories

- `src/main/java/edu/ucsd/msjava/` — core Java source
- `msdbsearch/` — database search engine (DBScanner, ScoredSpectraMap)
- `msutil/` — spectrum utilities (SpecKey, SpecKeyResult, SpectrumMetadata)
- `mzid/` — mzIdentML output generation
- `mzml/` — mzML parser (StaxMzMLParser — streaming rewrite)
- `parser/` — input file parsers (MgfSpectrumParser, etc.)
- `ui/` — CLI entry points (MSGFPlus, MSGFDB)
- The local benchmark harness and scripts are intentionally kept out-of-tree; no `benchmark/` directory is committed
- `src/test/` — unit tests

## Build

```bash
mvn -B verify
```

## Conventions

- Java 17+
- Maven for dependency management
- mzIdentML (`.mzid`) as primary output format
- TSV export via DirectTSVWriter
91 changes: 91 additions & 0 deletions .claude/investigations/001-mgf-scan-number-extraction-failure.md
@@ -0,0 +1,91 @@
# Investigation 001: MGF Scan Number Extraction Failure

**Status:** OPEN
**Date observed:** 2026-04-15
**Severity:** Medium — functional (spectra still searched, but scan numbers missing in output)
**Branch:** `feature/streaming-mzml-parser`

## What Was Observed

When running the baseline benchmark against MGF files, MS-GF+ emits repeated warnings:

```
Unable to extract the scan number from the title: id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05.mzML;controllerType=0
Expected format is DatasetName.ScanStart.ScanEnd.Charge
```

The warning appeared for every spectrum in the MGF file (`test.mgf`), suggesting
the entire file uses a TITLE format that the parser cannot handle.

## Where It Was Observed

- **Run:** Baseline benchmark (`baseline/MSGFPlus.jar`, v2026.03.25)
- **Input:** `test.mgf` — MGF file with TITLE lines in PRIDE/ProteomeXchange format
- **Database:** `human-uniprot-contaminants.revCat.fasta`

## Relevant Code

### `MgfSpectrumParser.extractScanRangeFromTitle()` — the parser

```
src/main/java/edu/ucsd/msjava/parser/MgfSpectrumParser.java:278-316
```

The method splits the title on `.` and expects:
- `token.length > 3` → `DatasetName.ScanStart.ScanEnd.Charge`
- `token.length == 3 && title.endsWith(".")` → `DatasetName.ScanStart.ScanEnd.`

The PRIDE-format title `id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05.mzML;controllerType=0`
splits to `["id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05", "mzML;controllerType=0"]`
(only 2 tokens), so it falls through to the `else` branch and emits the warning.
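
The token-count check described above can be reproduced in isolation. This is a minimal sketch that mirrors the two conditions quoted from the investigation; it is not the actual `MgfSpectrumParser` code, and the class name, method name, and the `MyDataset.1234.1234.2` title are made up for illustration.

```java
import java.util.Arrays;

/** Illustrative only: mirrors the token-count check described above. */
final class TitleTokenDemo {

    /** True if the title has one of the two dot-delimited shapes the parser expects. */
    static boolean matchesExpectedFormat(String title) {
        // String.split drops trailing empty strings, so "A.1.2." yields 3 tokens.
        String[] token = title.split("\\.");
        return token.length > 3 || (token.length == 3 && title.endsWith("."));
    }

    public static void main(String[] args) {
        String pride = "id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05.mzML;controllerType=0";
        String standard = "MyDataset.1234.1234.2"; // hypothetical title in the expected format

        System.out.println(Arrays.toString(pride.split("\\.")));
        System.out.println("PRIDE title accepted:    " + matchesExpectedFormat(pride));
        System.out.println("standard title accepted: " + matchesExpectedFormat(standard));
    }
}
```

The PRIDE-style title contains a single `.`, so it can never reach three tokens regardless of what follows the semicolons.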

### `MgfSpectrumParser.warnScanNotFoundInTitle()` — the warning

```
src/main/java/edu/ucsd/msjava/parser/MgfSpectrumParser.java:384-392
```

The warning is printed at most `MAX_SCAN_MISSING_WARNINGS` times; further occurrences are only counted, not printed.
The final total is printed at `SpecKey.java:139`.
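
For readers unfamiliar with the pattern, a capped warning counter might look roughly like the sketch below. The class, its field names, and the cap passed to the constructor are assumptions for illustration; the real value of `MAX_SCAN_MISSING_WARNINGS` and the real method body are not shown in this investigation.

```java
/** Hypothetical sketch of a capped warning counter; names and cap value are assumptions. */
final class CappedWarning {
    private final int maxPrinted; // stands in for MAX_SCAN_MISSING_WARNINGS
    private int total;            // every occurrence, printed or not

    CappedWarning(int maxPrinted) {
        this.maxPrinted = maxPrinted;
    }

    /** Records one occurrence; returns true only if the message was actually printed. */
    boolean warn(String message) {
        total++;
        if (total <= maxPrinted) {
            System.err.println(message);
            return true;
        }
        return false; // counted silently
    }

    /** Total number of occurrences, for a summary line at the end of the run. */
    int total() {
        return total;
    }
}
```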

## Hypotheses

1. **Title format mismatch (most likely):** The MGF file uses a PRIDE/ProteomeXchange
`TITLE` format that encodes the source file reference and controller info with
semicolons, not the `Dataset.Start.End.Charge` convention. The parser has no
fallback for alternative formats.

2. **Possible alternative scan encodings in TITLE:** Some MGF generators embed scan
numbers as `scan=NNNN` or `scans=NNNN` within the TITLE string. The parser
doesn't attempt to extract these.

3. **`index=` fallback:** When scan extraction fails, the spectrum gets assigned
`index=N` as its ID (from `specIndexMap`). This means the mzIdentML output
will reference spectra by index rather than native scan number, which may
affect downstream tools that expect scan-based references.

## Impact

- **Search results:** Not affected — MS-GF+ still searches the spectra correctly.
- **Output traceability:** Degraded — mzIdentML references use index instead of
native scan IDs, making it harder to trace PSMs back to the raw data.
- **Benchmark:** May cause metric discrepancies if downstream scripts parse scan
numbers from the mzIdentML output.

## Potential Fixes

1. Add a regex-based fallback in `extractScanRangeFromTitle()` that detects patterns such as:
   - `scan=(\d+)` or `scans=(\d+)`
   - `spectrum=(\d+)`
   - `index=(\d+)`
2. Support Thermo nativeID-style TITLE parsing: extract the scan from
   `controllerType=0 controllerNumber=1 scan=NNNN` if present.
3. Allow users to specify a scan-number extraction regex via a CLI parameter.
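
A minimal sketch of the first fix, assuming the fallback only runs after the dot-delimited parse has failed. The class and method names are hypothetical, and only the `scan=`/`scans=` pattern is covered; whether `spectrum=` or `index=` should be treated as scan numbers is left open here.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical fallback: pull a scan number out of key=value text in a TITLE line. */
final class ScanFallback {
    // "scan=" or "scans=" not preceded by a letter, followed by at most 9 digits
    // (so Integer.parseInt cannot overflow) and no further digit.
    private static final Pattern SCAN_KEY =
            Pattern.compile("(?<![A-Za-z])[Ss]cans?=(\\d{1,9})(?!\\d)");

    static OptionalInt scanFromKeyValue(String title) {
        Matcher m = SCAN_KEY.matcher(title);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(scanFromKeyValue("controllerType=0 controllerNumber=1 scan=4521"));
        System.out.println(scanFromKeyValue(
                "id=PXD002047;TCGA-AA-A02O-01A-23_W_VU_20130205_A0218_10A_R_FR05.mzML;controllerType=0"));
    }
}
```

Note that the title quoted in this investigation contains no `scan=` key at all, so this fallback alone would not recover a scan number for it; the first item under Next Steps (inspecting the full TITLE line) still applies.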

## Next Steps

- [ ] Examine the actual MGF file to see the full TITLE line format
- [ ] Check if `scan=` or similar key-value pairs are embedded in the TITLE
- [ ] Review how other tools (MaxQuant, Comet, X!Tandem) handle non-standard TITLE formats
- [ ] Decide on backward-compatible fix approach
- [ ] Add unit test covering PRIDE-format TITLE strings
@@ -0,0 +1,205 @@
# Investigation 002: E-value Leaks Target/Decoy Information to Percolator

**Status:** OPEN
**Date reported:** 2026-04-15
**Severity:** HIGH — affects FDR estimation for all downstream rescoring tools
**Source:** EuBIC-MS Symposium 04/2026, Copenhagen — Henry Emanuel Weber, Ruhr-Universität Bochum (Jun.-Prof. Julien Urchueguía group)
**Slide screenshot:** `assets/Screenshot_2026-04-15_at_13.23.09-*.png`

## What Was Observed

When MS-GF+ results are passed to rescoring tools (Percolator, MS2Rescore, Oktoberfest),
the target and decoy score distributions become **completely separated** — 100% separation.
This does NOT happen with Comet results on the same data.

The presenter found that **removing the E-value (MS:1002053) from the MS-GF+ features
fixed the problem**, confirming that the E-value is the source of information leakage.

Key observations from the slide:
- **Comet + TDA/Percolator/MS2Rescore/Oktoberfest:** Normal overlapping distributions
- **MS-GF+ + TDA:** Normal overlapping distributions (E-value not used as feature)
- **MS-GF+ + Percolator/MS2Rescore/Oktoberfest:** Perfect separation (E-value used as feature)

## The Mechanism

### How MS-GF+ computes the E-value

The E-value is computed as:

```
E-value = SpecEValue × numDistinctPeptides
```

See `MZIdentMLGen.java:347`:
```java
double eValue = specEValue * numPeptides;
```

Where:
- **SpecEValue** (`MS:1002052`) = spectral-level E-value from the generating function
(computed per spectrum, independent of target/decoy status)
- **numDistinctPeptides** = count of distinct peptide sequences of the matched length
in the **entire** concatenated target-decoy database
(from `CompactSuffixArray.getNumDistinctPeptides()`)
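
As a worked illustration of the formula, the sketch below applies a per-length multiplier to a fixed SpecEValue. The counts are made-up numbers chosen for easy arithmetic, and the class and method names are hypothetical; real counts come from the suffix array.

```java
/** Illustrative only: E-value as SpecEValue times a per-length peptide count. */
final class EValueSketch {

    /** eValue = specEValue * numDistinctPeptides[peptideLength]. */
    static double eValue(double specEValue, int[] numDistinctPeptides, int peptideLength) {
        return specEValue * numDistinctPeptides[peptideLength];
    }

    public static void main(String[] args) {
        // Made-up counts indexed by peptide length (not real database statistics).
        int[] counts = new int[12];
        counts[8] = 1_000_000;
        counts[11] = 4_000_000;

        double specEValue = 1e-10;
        // Same SpecEValue, different lengths: the E-values differ by the count ratio.
        System.out.println("length 8:  " + eValue(specEValue, counts, 8));
        System.out.println("length 11: " + eValue(specEValue, counts, 11));
    }
}
```

Within one peptide length the E-value is just a rescaled SpecEValue; across lengths the scale factor changes, which is the only extra information the E-value carries in this formula.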

### Why it leaks

The `numDistinctPeptides` multiplier is derived from the suffix array built over the
**concatenated target+decoy database** (`-tda 1` mode). The count includes both target
and decoy peptides.

Note, however, that `numDistinctPeptides` is looked up by **peptide length** only
(see `CompactSuffixArray.java:138-140`):

```java
public int getNumDistinctPeptides(int length) {
    return numDistinctPeptides[length];
}
```

This is the same multiplier for targets and decoys of the same length, so the
E-value itself doesn't directly encode target/decoy status. The leakage likely
comes from a subtler mechanism:

**Hypothesis 1: Database-size asymmetry**
When `-tda 1` is used, MS-GF+ generates reversed decoys internally. The number
of distinct peptides at each length may differ slightly between the target and
decoy halves. Since the E-value uses the combined count, it implicitly encodes
information about the database composition. Percolator, being a machine learning
model, can learn to exploit even tiny systematic differences.

**Hypothesis 2: Score distribution coupling**
The generating function that produces SpecEValue is computed using score
distributions that are calibrated on the full database. If the score distribution
shape differs systematically between target and decoy hits (which it does — true
matches exist only for targets), the SpecEValue already carries some target/decoy
signal that gets amplified by the numPeptides multiplier.

**Hypothesis 3: Q-value propagation**
The Q-value (`MS:1002054`) is explicitly computed from TDA and directly encodes
target/decoy ranking. If Q-value is also passed to Percolator alongside E-value,
the combined features create a perfect classifier. However, the presenter
specifically identified E-value (not Q-value) as the problematic score.

**Hypothesis 4: E-value scale differences**
SpecEValue is a per-spectrum probability; E-value is SpecEValue × database_size.
Since all peptides (target and decoy) use the same `numDistinctPeptides[length]`,
the E-value is a monotonic transform of SpecEValue for peptides of the same
length. But across different lengths, the scaling differs, and Percolator could
learn length-dependent patterns that correlate with target/decoy status.

## Relevant Code

### E-value computation

- `MZIdentMLGen.java:345-347` — `eValue = specEValue * numPeptides`
- `DirectTSVWriter.java:138-141` — same computation for TSV output
- `DBScanner.java:853-854` — same computation for MSGFDB output
- `MSGFDBResultGenerator.java:92-104` — `getPValue()` and `getEValue()` static methods

### numDistinctPeptides lookup

- `CompactSuffixArray.java:138-140` — `getNumDistinctPeptides(length)`
- `CompactSuffixArray.java:196-228` — counting logic over suffix array
- `SuffixArrayForMSGFDB.java:43-46` — wrapper

### Scores written to mzIdentML

- `MS:1002049` — RawScore (integer, safe)
- `MS:1002050` — DeNovoScore (integer, safe)
- `MS:1002052` — SpecEValue (spectral E-value, probably safe)
- `MS:1002053` — EValue (database E-value, **LEAKS**)
- `MS:1002054` — QValue (from TDA, **inherently encodes T/D**)

## Impact

- **All rescoring workflows are affected:** Any tool that uses MS-GF+ E-value as a
feature (Percolator, MS2Rescore, Oktoberfest) will produce artificially inflated
identification rates
- **Published results may be affected:** Studies using MS-GF+ → Percolator pipelines
may report overly optimistic PSM counts
- **FDR estimates are unreliable:** The 100% target/decoy separation means FDR
cannot be meaningfully estimated

## Which Scores Leak?

### Safe scores (no target/decoy information)
| CV Accession | Name | Why safe |
|-------------|-------------|----------|
| MS:1002049 | RawScore | Integer score from generating function, per-spectrum |
| MS:1002050 | DeNovoScore | Integer de novo score, per-spectrum |
| MS:1002052 | SpecEValue | Spectral E-value from generating function, per-spectrum. No TDA dependency. |

### Unsafe scores (leak target/decoy information)
| CV Accession | Name | Why it leaks |
|-------------|------------|--------------|
| MS:1002053 | EValue | `SpecEValue × numDistinctPeptides` — database-size multiplier may introduce asymmetry. Confirmed as the leak source by the presenter. |
| MS:1002054 | QValue | **Directly computed from TDA** via `TargetDecoyAnalysis.getPSMQValue()` — it IS the target/decoy separation. Passing this to Percolator is giving it the answer key. |
| MS:1002055 | PepQValue | Same as QValue but at peptide level. Also directly from TDA. |

### Q-value is categorically worse than E-value

The Q-value (`MS:1002054`) is computed by `TargetDecoyAnalysis.getFDRMap()` which:
1. Separates PSMs into target and decoy lists (by protein prefix, e.g. `XXX_`)
2. Sorts both by score
3. Walks down the ranked list computing `FDR = decoyCount / targetCount`
4. Converts FDRs to Q-values (monotonic minimum)
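
The four steps above can be sketched as follows. This is a simplified illustration, not the `TargetDecoyAnalysis` implementation: it walks a single merged, score-sorted list instead of two separate lists, assumes a higher score is better, and uses hypothetical names (`Psm`, `qValues`).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified target-decoy q-value computation; names and score direction are assumptions. */
final class QValueSketch {

    record Psm(double score, boolean decoy) {}

    /** Returns q-values aligned with the PSMs sorted by descending score. */
    static double[] qValues(List<Psm> psms) {
        List<Psm> sorted = new ArrayList<>(psms);
        sorted.sort(Comparator.comparingDouble((Psm p) -> p.score()).reversed());

        // Walk down the ranked list: FDR = decoyCount / targetCount at each rank.
        double[] q = new double[sorted.size()];
        int targets = 0;
        int decoys = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (sorted.get(i).decoy()) decoys++; else targets++;
            q[i] = targets == 0 ? 1.0 : Math.min(1.0, (double) decoys / targets);
        }

        // q-value = smallest FDR at this rank or any lower-scoring rank.
        double running = 1.0;
        for (int i = q.length - 1; i >= 0; i--) {
            running = Math.min(running, q[i]);
            q[i] = running;
        }
        return q;
    }
}
```

Every value returned here depends on the target/decoy labels of the surrounding PSMs, which is why a q-value cannot be treated as a label-free feature.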

This is a **direct encoding** of target vs decoy status. If Percolator receives
QValue as a feature, it can trivially reconstruct whether a PSM is target or
decoy — far more directly than the E-value leakage. The EValue leakage is subtle
(the presenter had to investigate to find it); QValue leakage is by definition.

In practice, most rescoring tools (Percolator, MS2Rescore) likely skip QValue
because it's already an FDR estimate. But EValue looks like a "normal" search
engine score and gets picked up as a feature — which is why the EValue leak
is the one that actually manifests.

## Proposed Fix: Only Output SpecEValue (Omit EValue and QValue)

When the downstream workflow is `MS-GF+ → Percolator/rescoring tool → FDR`,
MS-GF+ does not need to output its own EValue or QValue, because the rescoring
tool computes its own FDR.

### What to change
1. **Stop writing EValue (MS:1002053) to mzIdentML** — or make it optional via CLI flag
2. **Stop writing QValue (MS:1002054) and PepQValue (MS:1002055)** — same treatment
3. **Keep SpecEValue (MS:1002052)** — this is the per-spectrum score, safe for rescoring
4. **Keep RawScore (MS:1002049) and DeNovoScore (MS:1002050)** — integer scores, safe

### Where to change
- `MZIdentMLGen.java:346-421` — mzIdentML output (remove/gate EValue, QValue, PepQValue CV params)
- `DirectTSVWriter.java:140-208` — TSV output (same)
- `DBScanner.java:853` — MSGFDB TSV output (same)
- `MSGFPlus.java` / `MSGFDB.java` — add CLI flag (e.g. `--no-evalue` or `--percolator-safe`)

### Impact on MSGFPlusAdapter (OpenMS)
The OpenMS `MSGFPlusAdapter` extracts scores from MS-GF+ mzIdentML output. If we
stop outputting EValue by default, the adapter needs to be updated to use SpecEValue
instead. This should be coordinated with the OpenMS team, or we add a CLI flag
so existing workflows keep working.

### Backward compatibility
- Add a flag like `-rescoring 1` that omits EValue/QValue from output
- Default behavior unchanged (EValue/QValue still written) for backward compat
- Document clearly that `-rescoring 1` should be used when piping to Percolator
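
One way the gating could look is sketched below. The `-rescoring 1` flag does not exist yet, so the `rescoringMode` parameter, the class, and the method are all hypothetical; only the CV accessions are taken from this document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: choose which score CV params to emit for one PSM. */
final class ScoreOutputGate {

    /**
     * rescoringMode stands in for the proposed (not yet implemented) -rescoring 1 flag.
     * When false, the output is the same set of six scores written today.
     */
    static Map<String, String> scoreParams(int rawScore, int deNovoScore, double specEValue,
                                           double eValue, double qValue, double pepQValue,
                                           boolean rescoringMode) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("MS:1002049", Integer.toString(rawScore));    // RawScore
        params.put("MS:1002050", Integer.toString(deNovoScore)); // DeNovoScore
        params.put("MS:1002052", Double.toString(specEValue));   // SpecEValue
        if (!rescoringMode) {
            params.put("MS:1002053", Double.toString(eValue));    // EValue
            params.put("MS:1002054", Double.toString(qValue));    // QValue
            params.put("MS:1002055", Double.toString(pepQValue)); // PepQValue
        }
        return params;
    }
}
```

Keeping the decision in one place like this would let the mzIdentML and TSV writers share the same rule instead of each re-implementing the flag check.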

## Next Steps

- [ ] Reproduce the issue: run MS-GF+ on a benchmark dataset, feed to Percolator,
plot target/decoy distributions with and without E-value
- [ ] Contact Henry Emanuel Weber / Julien Urchueguía group for their test dataset
and exact Percolator configuration
- [ ] Analyze whether SpecEValue alone also leaks (likely not, but should verify)
- [ ] Check if the leakage magnitude depends on database size (small DB = more leakage?)
- [ ] Review what scores MS2Rescore/Percolator extract from MS-GF+ mzIdentML by default
- [ ] Implement `-rescoring 1` CLI flag to omit EValue/QValue/PepQValue from output
- [ ] Coordinate with OpenMS team on MSGFPlusAdapter changes (use SpecEValue instead of EValue)
- [x] Add skill documentation (done; see `.claude/skills/score-output-safety.md`)

## References

- Slide: "Target and decoy distributions" — EuBIC-MS Symposium 04/2026, Copenhagen
- Presenter: Henry Emanuel Weber, Medical Bioinformatics, Ruhr-Universität Bochum
- Group: Jun.-Prof. Julien Urchueguía
- Talk: "Leveling the playing field" (slide 9)
10 changes: 10 additions & 0 deletions .claude/investigations/README.md
@@ -0,0 +1,10 @@
# Investigations

Tracked issues, bugs, and behaviors that need further analysis.

Each investigation should document:
1. **What was observed** — error messages, unexpected behavior
2. **Where it was observed** — which run, dataset, configuration
3. **Relevant code** — source files and line numbers
4. **Hypotheses** — potential root causes
5. **Status** — open / in-progress / resolved
7 changes: 7 additions & 0 deletions .claude/plans/README.md
@@ -0,0 +1,7 @@
# Plans

Implementation plans and design documents for MS-GF+ features and improvements.

Each plan should be a separate markdown file named descriptively, e.g.:
- `streaming-mzml-parser.md`
- `mgf-scan-number-parsing.md`
8 changes: 8 additions & 0 deletions .claude/skills/README.md
@@ -0,0 +1,8 @@
# Skills

Project-specific skills for AI agents working on MS-GF+.

Skills encode domain knowledge and repeatable workflows, e.g.:
- Running benchmarks
- Building and testing the JAR
- Interpreting mzIdentML output