Normalize malformed hydrate ingredient names to canonical ' x N H2O'#60

Merged: realmarcin merged 1 commit into main from followup/normalize-hydrate-names on Jun 14, 2026

@realmarcin

Contributor

What

Fixes parse-artifact hydrate preferred_terms (84 occurrences across 24 files) by mapping each to the corpus's dominant clean sibling ( x N H2O) sharing its separator-stripped key.

Targeted (genuinely malformed)

  • Concatenated, no separator: CaCl22H2O → CaCl2 x 2 H2O, CoSO47H2O, MnCl24H2O, Na2MoO42H2O, NiCl26H2O, FeCl36H2O.
  • x without spaces: MgSO4x7H2O → MgSO4 x 7 H2O, CaCl2x2H2O, CaCl2 x2H2O.
  • Rare dot-operator unicode (U+22C5 ⋅ / U+2219 ∙): FeSO4⋅7H2O / MgSO4∙7H2O → ' x ' form.
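
A minimal sketch of what a "separator-stripped key" could look like for the three malformed families above. The function name `stripped_key`, the constant `DOT_SEPARATORS`, and the exact separator set are assumptions for illustration; the script behind this PR is not shown here and may differ.

```python
import re

# Separator characters named in this PR. Whether the actual script used
# exactly this set is an assumption.
DOT_SEPARATORS = "\u22c5\u2219\u00b7\u30fb"  # U+22C5, U+2219, U+00B7, U+30FB


def stripped_key(name: str) -> str:
    """Return a hydrate name with whitespace and separators removed."""
    # Remove all whitespace and dot-like separators.
    compact = re.sub(f"[\\s{DOT_SEPARATORS}]", "", name)
    # Drop an 'x' only when it sits directly before the trailing water count.
    return re.sub(r"x(?=\d+H2O$)", "", compact)
```

Under this sketch, `CaCl22H2O`, `CaCl2x2H2O`, `CaCl2 x2H2O` and `CaCl2 x 2 H2O` all collapse to the same key, which is what lets a malformed name find its clean sibling. The key alone cannot tell where the salt formula ends and the water count begins, so it is only useful for matching against names that already exist in the corpus.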

Deliberately left untouched

  • Established · (U+00B7) and ・ (U+30FB) middot notation: a widely used style (hundreds of occurrences), not a parse artifact. Unifying all separators would be a ~7k-row cosmetic churn with no grounding benefit.
  • Malformed forms with no matching-hydrate clean sibling (CuCl2⋅4H2O, Na2Sx6H2O, Ni2SO4 ⋅ 6H2O) — left as-is rather than mapped to a different hydrate.

Grounding (term.id) is unaffected; mapping is corpus-driven (only maps to an existing canonical of the same hydrate). linkml-validate: 24/24 clean.
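
The corpus-driven rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used for this PR: `build_mapping`, `key`, `CLEAN` and `MIDDOTS` are hypothetical names, and the input is assumed to be a flat list of preferred_term strings collected from the files.

```python
import re
from collections import Counter, defaultdict

# Canonical form targeted by this PR: '<salt> x <N> H2O'.
CLEAN = re.compile(r"^\S+ x \d+ H2O$")
# Established middot styles that are deliberately left untouched.
MIDDOTS = ("\u00b7", "\u30fb")


def key(name: str) -> str:
    """Separator-stripped key (hypothetical helper)."""
    compact = re.sub("[\\s\u22c5\u2219]", "", name)
    return re.sub(r"x(?=\d+H2O$)", "", compact)


def build_mapping(names):
    """Map each malformed name to the most frequent clean sibling, if any."""
    counts = Counter(names)
    clean_by_key = defaultdict(Counter)
    for name, n in counts.items():
        if CLEAN.match(name):
            clean_by_key[key(name)][name] += n

    mapping = {}
    for name in counts:
        # Skip names that are already canonical or use middot notation.
        if CLEAN.match(name) or any(m in name for m in MIDDOTS):
            continue
        siblings = clean_by_key.get(key(name))
        if siblings:  # only map to an existing canonical of the same hydrate
            mapping[name] = siblings.most_common(1)[0][0]
    return mapping
```

A name such as `CuCl2⋅4H2O` gets no entry when the corpus has no `CuCl2 x 4 H2O`, so it stays as-is instead of being rewritten to a different hydrate. Because only the preferred_term string would change, term.id grounding is not touched by a mapping of this kind.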

🤖 Generated with Claude Code

Fix parse-artifact hydrate preferred_terms (84 occurrences, 24 files) by mapping
each to the corpus's dominant clean sibling sharing its separator-stripped key:
  - concatenated, no separator:  CaCl22H2O -> CaCl2 x 2 H2O, CoSO47H2O, MnCl24H2O
  - 'x' without spaces:          MgSO4x7H2O -> MgSO4 x 7 H2O, CaCl2x2H2O
  - rare dot-operator unicode:   FeSO4⋅7H2O / MgSO4∙7H2O -> ' x ' form
Established '·' (U+00B7) / '・' (U+30FB) middot notation is left untouched (it is
a widely-used style, not a parse artifact). Forms with no matching-hydrate clean
sibling (e.g. CuCl2⋅4H2O, Na2Sx6H2O) are left as-is rather than mapped to a
different hydrate. Grounding (term.id) is unaffected; linkml-validate clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@realmarcin realmarcin merged commit 34ed455 into main Jun 14, 2026
3 checks passed
@realmarcin realmarcin deleted the followup/normalize-hydrate-names branch June 14, 2026 07:30