coding-chemist/Curie

Curie

A citation-backed RAG pipeline for NMR structure elucidation — retrieves first, reasons second, and shows its work.

Curie — how a single NMR file becomes a candidate structure with per-peak provenance

Live demo: available on request — DM me on LinkedIn and I'll send the access link.


Why Curie exists

I have a Master's in Chemistry. The first time I sat through an NMR structure-elucidation session (really sat through it), I struggled. Reading peaks, mapping shifts to fragments, working backwards to a structure. It was hard. Not the kind of hard you fake your way through.

I transitioned into AI after that. But the Chemistry never quite left.

I've been a mentor at RJSF since 2021. In 2025, at one of the Chemistry Research Drive sessions, a professor was walking the group through NMR structure elucidation — peak by peak, candidate by candidate, slowly building toward a structure with reasoning every chemist in the room could follow. And it hit me:

This can be automated with an LLM. I know it's hard. But it's possible.

That was the spark. I don't claim Curie is perfect. I claim it's scalable — and that each iteration makes it better. The pull keeps growing, because I happen to sit in a rare intersection: domain knowledge in both spaces.

The name is half-joke, half-tribute. Marie Curie identified unknown substances by their fingerprints — radioactive emissions then, NMR peaks now. And curious is the only honest word for why I keep going.


What Curie does

You give it a 1H + 13C NMR spectrum and a molecular formula hint. It returns:

  • A ranked list of candidate structures with confidence scores
  • Per-peak interpretation — which atom in the structure each peak comes from, grounded in retrieved analogues with citations
  • Interactive RDKit visualisation — hover a peak, the corresponding atom lights up; hover an atom, its peaks light up
  • A "why this, not that" explanation — ruled-out candidates with reasons
  • Ambiguity warnings — when the signal alone isn't enough and 2D NMR is needed

The whole point is that Curie shows its work. No black-box "your molecule is X." Every conclusion is traceable.
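One way to picture "traceable" is as a data contract: every peak assignment carries the ID of the retrieved analogue that backs it, and anything citing an ID that retrieval never returned gets flagged. The sketch below is illustrative only. The `PeakAssignment` fields, the `ungrounded` helper, and the candidate IDs are hypothetical names, not Curie's actual schema, and the shift values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeakAssignment:
    """Hypothetical record for one interpreted peak."""
    shift_ppm: float          # observed chemical shift
    nucleus: str              # "1H" or "13C"
    atom_indices: tuple       # atoms in the candidate structure this peak maps to
    cited_candidate_id: str   # retrieved analogue that backs the claim


def ungrounded(assignments, retrieved_ids):
    """Return assignments that cite a candidate retrieval never returned."""
    allowed = set(retrieved_ids)
    return [a for a in assignments if a.cited_candidate_id not in allowed]


# Placeholder values for illustration, not real pipeline output.
assignments = [
    PeakAssignment(2.60, "1H", (8,), "cand-001"),
    PeakAssignment(197.8, "13C", (7,), "cand-999"),  # cites an ID that was not retrieved
]
bad = ungrounded(assignments, retrieved_ids=["cand-001", "cand-002"])
print([a.cited_candidate_id for a in bad])
```

A check of this shape is cheap to run on every response, so an ungrounded claim can be dropped or surfaced as a warning before it reaches the UI.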


How it works

Retrieve first. Reason second.

That's the entire discipline. The LLM never invents a structure — every claim it makes is grounded in a candidate that Layer 1 actually retrieved, then double-checked by an RDKit substructure agent before Layer 3 closes the loop with forward prediction.

| Stage | What happens | Why it matters |
| --- | --- | --- |
| 1 · FAISS retrieval | Embed query peaks → top-K candidates from textbook NMR corpus | LLMs hallucinate molecules. Vector retrieval pins the reasoning to known chemistry. |
| 2 · Grounded peak interpretation | Per-peak LLM reasoning → fragment mapping with provenance citations | Every conclusion traces back to a retrieved analogue. No black-box verdicts. |
| Agent · RDKit substructure check | Validates each fragment claim against the molecule's bonds and topology | Catches LLM outputs that sound chemically reasonable but aren't. |
| 3 · Forward NMR prediction | Simulate spectra for top-3 → compare to input → match/mismatch verdict | Closes the loop. Final confidence is empirical, not just retrieval similarity. |
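The compare step in stage 3 can be sketched as a peak-matching problem: pair each observed shift with the nearest predicted shift, and call it a mismatch if anything is left unpaired. This is a minimal sketch under stated assumptions, not Curie's scoring code. The greedy pairing, the 2.0 ppm tolerance, and the function names are all hypothetical, and the shift lists are placeholders.

```python
def match_peaks(observed, predicted, tol_ppm):
    """Greedily pair each observed shift with the nearest unused predicted shift."""
    unused = sorted(predicted)
    pairs, unmatched = [], []
    for obs in sorted(observed):
        best = min(unused, key=lambda p: abs(p - obs), default=None)
        if best is not None and abs(best - obs) <= tol_ppm:
            pairs.append((obs, best))
            unused.remove(best)
        else:
            unmatched.append(obs)
    return pairs, unmatched


def verdict(observed, predicted, tol_ppm=2.0):
    """Return ("match", mean absolute error) or ("mismatch", None)."""
    pairs, unmatched = match_peaks(observed, predicted, tol_ppm)
    if not pairs or unmatched or len(pairs) != len(predicted):
        return "mismatch", None
    mae = sum(abs(o - p) for o, p in pairs) / len(pairs)
    return "match", mae


# Placeholder 13C shift lists for illustration.
observed = [26.5, 128.3, 128.6, 133.1, 137.1, 197.9]
predicted = [26.0, 128.0, 129.0, 133.0, 137.0, 198.5]
status, mae = verdict(observed, predicted)
print(status, mae)
```

Greedy pairing is the simplest choice and can mis-pair crowded regions; an optimal assignment would be a reasonable upgrade if that shows up in practice.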

Benchmarked at 60% top-1 · 100% top-5 on the textbook validation set. The 60% is intentional, not a ceiling — Curie is calibrated to not over-rank close analogues as exact matches, because in pharma research the gap between a 95% structural match and a 100% one can mean a different molecule entirely.
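For readers unfamiliar with the metric, top-k accuracy counts a query as a hit when the true structure appears anywhere in the first k ranked candidates. The sketch below shows the calculation on made-up toy data; the labels and the resulting numbers are not the Curie benchmark.

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Fraction of queries whose true label appears in the first k candidates."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)


# Toy rankings, not benchmark data: one ranked candidate list per query.
ranked = [["A", "B"], ["C", "A"], ["B", "C"], ["D", "A"]]
truths = ["A", "A", "C", "D"]

top1 = top_k_accuracy(ranked, truths, k=1)
top2 = top_k_accuracy(ranked, truths, k=2)
print(top1, top2)
```

A gap between top-1 and top-5 means the right answer is usually retrieved but not always ranked first, which is the behaviour described above.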


Scope (v0.1) — what Curie does, and doesn't yet

Curie does step 2 of NMR elucidation — substructure inference + analog retrieval. It does not do combinatorial structure assembly.

| Case | What you get |
| --- | --- |
| A · Known compound | Exact match returned with full provenance |
| B · Close analog | Top-ranked analog + per-peak grounding + confidence |
| C · Novel scaffold | Substructure profile + "needs 2D NMR (HSQC/HMBC)" guidance |

I'd rather have a tool that's honest about case C than one that hallucinates a structure to look impressive.
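The three cases could be separated by how close the best retrieved analogue is. The sketch below assumes a similarity score in [0, 1] and two cut-offs; the README does not say how Curie draws these lines, so `classify_case`, the thresholds 0.98 and 0.80, and the `OUTPUTS` mapping are hypothetical.

```python
# Hypothetical mapping from case label to what the user receives.
OUTPUTS = {
    "A": "exact match with full provenance",
    "B": "top-ranked analog with per-peak grounding and confidence",
    "C": "substructure profile plus 2D NMR (HSQC/HMBC) guidance",
}


def classify_case(best_similarity, exact=0.98, analog=0.80):
    """Route a query by its best retrieval similarity (thresholds are assumptions)."""
    if best_similarity >= exact:
        return "A"
    if best_similarity >= analog:
        return "B"
    return "C"


for score in (0.99, 0.85, 0.40):
    case = classify_case(score)
    print(score, case, OUTPUTS[case])
```

The useful property is that case C is an explicit branch with its own output, so a weak retrieval result cannot silently fall through to a structure verdict.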


Architecture

Curie — system architecture across client, backend, and LLM provider layers

Three layers, two deployments:

| Layer | What it does | Where it runs |
| --- | --- | --- |
| Client | React + Vite + Tailwind + RDKit-JS: interactive spectrum and structure viewer | Vercel |
| Backend | FastAPI + FAISS index + LangChain ChatService + RDKit substructure agent + SSE event bus | Hugging Face Spaces |
| LLM Providers | Google Gemini 2.0 Flash (primary) · Groq Llama (fallback via LangChain with_fallbacks) | External APIs |

Why a fallback chain, not a single provider

Curie's reasoning is the kind of work a researcher would re-run if it failed once. So fallback isn't "redundancy theatre" — it's the difference between a usable tool and one that 503s in front of a recruiter. LangChain's with_fallbacks lets the primary be the best model (Gemini 2.0 Flash for free-tier reasoning quality) while keeping a fast Groq backup if Gemini hits a quota wall.
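The fallback pattern itself is small enough to show without any provider SDK. The sketch below is a plain-Python stand-in for the idea, not Curie's ChatService and not LangChain code: `with_fallback`, `flaky_primary`, and `backup` are hypothetical names, and the quota error is simulated. In LangChain the equivalent wiring is a call of the form `primary.with_fallbacks([backup])` on a runnable.

```python
def with_fallback(primary, *fallbacks):
    """Wrap providers so each one is tried in order until one succeeds."""
    def call(prompt):
        last_error = None
        for provider in (primary, *fallbacks):
            try:
                return provider(prompt)
            except Exception as err:  # a real service would catch narrower errors
                last_error = err
        raise RuntimeError("all providers failed") from last_error
    return call


def flaky_primary(prompt):
    # Simulated quota wall on the primary provider.
    raise TimeoutError("quota exceeded")


def backup(prompt):
    return f"backup answered: {prompt}"


ask = with_fallback(flaky_primary, backup)
answer = ask("interpret peak at 7.26 ppm")
print(answer)
```

The caller sees one function and one answer; which provider produced it is an implementation detail that can still be logged for the audit trail.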

Why FAISS, not pure LLM elucidation

An LLM alone, given peaks, will hallucinate a structure. With FAISS retrieval anchoring it to known analogs, the LLM is constrained to reason about what's actually plausible — not invent. This is the difference between a tool a chemist trusts and one they laugh at.
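To make the retrieval step concrete, here is a minimal sketch that turns a 13C peak list into a fixed-length vector and ranks a corpus by cosine similarity. Everything in it is an assumption for illustration: the binned-histogram embedding, the 0 to 220 ppm range, the brute-force search (standing in for a FAISS index), and the toy corpus entries are not taken from Curie.

```python
import math


def embed(shifts_ppm, lo=0.0, hi=220.0, bins=22):
    """Histogram the shifts into fixed-width bins and L2-normalise the result."""
    vec = [0.0] * bins
    width = (hi - lo) / bins
    for s in shifts_ppm:
        idx = int((s - lo) / width)
        vec[min(max(idx, 0), bins - 1)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_k(query_shifts, corpus, k):
    """Brute-force cosine-similarity ranking; a vector index would replace this loop."""
    q = embed(query_shifts)
    scored = [
        (sum(a * b for a, b in zip(q, embed(peaks))), name)
        for name, peaks in corpus.items()
    ]
    scored.sort(reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:k]]


# Toy corpus with made-up shift lists, for illustration only.
corpus = {
    "toy-ketone": [26.5, 128.3, 128.6, 133.1, 137.1, 197.9],
    "toy-alkane": [14.1, 22.7, 31.9],
    "toy-ester": [51.5, 128.4, 130.2, 166.9],
}
hits = top_k([26.5, 128.3, 128.6, 133.1, 137.1, 197.9], corpus, k=2)
print(hits)
```

Whatever the real embedding looks like, the contract is the same: the LLM only ever sees the names that come back from this call.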


What a session looks like

1. Enter peaks (or load a preset)

Curie — NMR to Structure input page with 1H/13C peak entry and example presets

Drop a .csv, .xlsx, or .jdx file, or type peaks in directly. Four tier-1 presets are wired up (Ibuprofen, Acetophenone, Vanillin, 4-MeO-cinnamate) so anyone can see the pipeline run end-to-end without their own data.
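For the .csv path, a peak list can be read with the standard library alone. The sketch below assumes a two-column layout with `nucleus` and `shift_ppm` headers; that layout, the `parse_peaks` name, and the sample rows are hypothetical, since the README does not specify the file format Curie expects.

```python
import csv
import io


def parse_peaks(csv_text):
    """Group shifts by nucleus, sorted downfield to upfield."""
    peaks = {"1H": [], "13C": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        nucleus = row["nucleus"].strip()
        if nucleus not in peaks:
            raise ValueError(f"unsupported nucleus: {nucleus!r}")
        peaks[nucleus].append(float(row["shift_ppm"]))
    for shifts in peaks.values():
        shifts.sort(reverse=True)
    return peaks


# Placeholder rows for illustration.
sample = """nucleus,shift_ppm
1H,2.60
1H,7.95
13C,26.5
13C,197.9
"""
peaks = parse_peaks(sample)
print(peaks)
```

Rejecting unknown nuclei up front keeps a malformed upload from reaching the retrieval step.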

2. Watch the 3-layer pipeline execute

Curie — live pipeline progress: Features done, Retrieval running, Interpret/Scoring/Prediction queued

The architecture from the diagram above, made visible: Features → Retrieval → Interpret → Scoring → Prediction → Complete. Streamed over Server-Sent Events so you watch the reasoning happen, not just the final answer. The little "E geometry" tag is the RDKit substructure agent surfacing a constraint it already found from the peak list.
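On the wire, a Server-Sent Events stream is plain text: each event is an `event:` line and a `data:` line followed by a blank line. The sketch below formats one event per pipeline stage using the stage names above. The payload fields and function names are hypothetical, and the real event bus may emit a different shape.

```python
import json

# Stage names taken from the progress view described above.
STAGES = ["features", "retrieval", "interpret", "scoring", "prediction", "complete"]


def sse_event(event, payload):
    """Serialise one event in the text/event-stream wire format."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def pipeline_events():
    """Yield one 'stage' event per pipeline stage, in order."""
    for index, stage in enumerate(STAGES, start=1):
        yield sse_event("stage", {"stage": stage, "index": index, "total": len(STAGES)})


events = list(pipeline_events())
print(events[0], end="")
```

In a FastAPI backend, a generator like this would typically be returned through a streaming response with the `text/event-stream` media type, so the browser's EventSource can consume it as stages complete.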

3. Final structure with the peak list anchored

Curie — final elucidated structure with 1H and 13C peak shifts listed alongside

The chosen structure renders with RDKit, with both 1H and 13C peak shifts laid out on either side as a chemist would read them. Molecular formula, MW, and degree of unsaturation surface at the top — the metadata you'd want before trusting the result.
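The degree of unsaturation shown in that header follows directly from the molecular formula: DoU = (2C + 2 + N - H - X) / 2, where X counts halogens and oxygen does not contribute. The sketch below is a standalone illustration of that arithmetic, not Curie's implementation, and it only handles simple formulas without brackets or charges.

```python
import re

HALOGENS = ("F", "Cl", "Br", "I")


def degree_of_unsaturation(formula):
    """Rings plus pi bonds implied by a simple molecular formula such as C13H18O2."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    c = counts.get("C", 0)
    h = counts.get("H", 0)
    n = counts.get("N", 0)
    x = sum(counts.get(hal, 0) for hal in HALOGENS)
    return (2 * c + 2 + n - h - x) / 2


print(degree_of_unsaturation("C13H18O2"))  # ibuprofen
print(degree_of_unsaturation("C8H8O3"))    # vanillin
```

Both presets come out at 5, which for each molecule is one aromatic ring (4) plus one C=O (1). That is the kind of sanity check worth doing before trusting a proposed structure.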

4. Hover Para-Aromatic Ring — the benzene atoms light up

Curie — Para-Aromatic Ring fragment highlighted in purple on the benzene ring

5. Hover Trans Vinyl (E) — the vinyl bond lights up

Curie — Trans Vinyl fragment highlighted in teal on the vinyl bond between aromatic ring and chloro-isopropyl group

These two views together are the interaction that makes Curie useful instead of just impressive. Hover a fragment label such as Para-Aromatic Ring or Trans Vinyl (E): the matching atoms in the structure glow, and the responsible peaks light up. Reverse it: hover a peak on the side rails, and the responsible atoms highlight. It's the explanation a chemist would draw on a whiteboard, made interactive and reproducible.


Tech Stack

| Layer | Stack |
| --- | --- |
| Frontend | React · Vite · Tailwind · RDKit-JS |
| Backend | Python · FastAPI · FAISS · LangChain · RDKit · pdfplumber |
| LLMs | Google Gemini 2.0 Flash (primary) · Groq Llama 3.3 70B (fallback) |
| Streaming | Server-Sent Events (SSE) for live pipeline updates |
| Deployment | Vercel (frontend) · Hugging Face Spaces (backend) |

Project structure

curie/
├── src/                React UI — file upload + interactive spectrum/structure viewer
├── backend/
│   ├── app/
│   │   ├── core/       FAISS retrieval, event bus (SSE)
│   │   ├── services/   ChatService (LangChain with_fallbacks), Layer 2/3 logic
│   │   ├── routes/     /api/v1/elucidate, /api/v1/molecule, /api/v1/stream
│   │   └── models/     Pydantic schemas
│   ├── prompts/        Layer 2 grounded-reasoning prompt templates
│   ├── scripts/        Data ingestion, embedding builds
│   └── Dockerfile      HF Spaces deployment
└── assets/             Architecture + flow illustrations + screenshots

Run locally

Frontend

npm install
npm run dev

Backend

cd backend
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env              # add GOOGLE_API_KEY (primary) + GROQ_API_KEY (fallback)
python run.py

Known limitations (v0.1)

  • Novel scaffolds outside the corpus indexed in FAISS can still produce confidently wrong structures. For inputs the index hasn't seen, the pipeline currently skips the intermediate Layer 1 / Layer 2 views and jumps to a structure verdict, which loses the provenance signal that makes Curie trustworthy on known compounds.
  • No mass-spec or IR cross-validation yet. Single-spectroscopy reasoning has inherent ambiguity at this scope — Curie can't break a tie the spectrum itself can't break.
  • Top-1 accuracy ≈ 60% on textbook compounds, 100% top-5. Treat Curie as a retrieval-grounded hypothesis generator, not an oracle.

The "What's next" items below directly address the first two limitations.

What's next

I'm still digging. The pull keeps growing.

  • Combinatorial structure assembly — case C (novel scaffolds) gets a real candidate set, not just guidance
  • 2D NMR (HSQC, HMBC) ingestion — collapses ambiguity that 1D alone can't resolve
  • Wider compound library — current FAISS index is textbook-scope; production targets pharma-scale
  • Reasoning trace export — researchers want the full audit log as a downloadable PDF for IP records

This isn't a finished tool. It's a tool that knows what it doesn't know yet — and that knows where to grow.


Author

Sindhuja Sivaraman · MSc Chemistry · MS Data Science → Senior Engineer, AI/ML, HTC Global Services
Portfolio · GitHub

I know it's hard. But it's possible.

About

NMR structure elucidation via 3-layer RAG — FAISS retrieval → grounded LLM peak interpretation → forward NMR prediction, with RDKit substructure validation.
