NeuroGraph Agent

Agentic Biomedical Research Intelligence for Alzheimer's Disease

NeuroGraph Agent is a research intelligence prototype that combines public biomedical data, literature mining, clinical trial analysis, target-disease evidence integration, research gap scoring, knowledge graph construction, and evidence cards to explore hypothesis-generating Alzheimer's disease research signals.

⚠️ Disclaimer: This is a research intelligence prototype. All outputs are computationally derived and hypothesis-generating only. This is not clinical decision support and must not inform clinical practice, drug development, or patient care. All findings require expert biomedical validation.

Live Demo

Launch the Streamlit Dashboard →

Explore the dashboard to:

Browse and filter 499 scored Alzheimer's disease targets
Identify potential research gap candidates such as TREM2 and CD33
Compare literature and clinical trial text signals side by side
Inspect individual target evidence cards with full provenance
Explore the knowledge graph and theme landscape
Download processed data, reports, and evidence briefs

1. Overview

NeuroGraph Agent is a multi-step research intelligence pipeline built to address a practical challenge in biomedical research: integrating fragmented evidence sources into a structured, navigable layer of insight.

The system ingests data from three public biomedical APIs — Open Targets for disease-target association evidence, Semantic Scholar for literature metadata, and ClinicalTrials.gov for trial registrations — and applies a sequence of text mining, statistical scoring, and graph construction methods to produce structured research signals across 499 Alzheimer's disease-associated targets.

What makes NeuroGraph Agent different from a standard data analysis project is its end-to-end integration:

Raw API data is cleaned, validated, and enriched through a reproducible 10-step pipeline
NMF topic models extract thematic structure from both literature and trial corpora independently
All 499 targets are matched against paper and trial text using whole-word regex tokenisation
Research gap scores are computed as a weighted composite of five evidence signals
A NetworkX knowledge graph connects disease, targets, papers, trials, and research themes
Each target receives an individual evidence card — a structured, cautious research brief generated purely from pipeline outputs, with no LLM involvement
A 7-page Streamlit dashboard makes all outputs interactive, filterable, and downloadable

Alzheimer's disease was selected as the first domain because it is one of the most studied, yet underfunded, neurodegenerative conditions globally. It also has a large and growing target landscape in Open Targets and a rich body of publicly available literature and trial data.

2. Problem Statement

Biomedical researchers face a large and fragmented evidence landscape:

Clinical trial registrations are stored separately from the literature that motivates them
Target-disease association evidence from curated databases is disconnected from trial activity
Literature volume is too large to review comprehensively without computational assistance
Trial registrations often describe interventions by drug name or mechanism, not by molecular target symbol — making it difficult to assess target-level trial coverage directly
Identifying where evidence is accumulating but translation may be limited requires cross-source signal integration

NeuroGraph Agent is a prototype that structures this evidence into a navigable intelligence layer, combining source integration, text mining, scoring, graph construction, and interactive visualisation into a single end-to-end system.

It does not replace expert review. It is designed to surface structured hypothesis-generating signals that can direct expert attention more efficiently.

3. Key Features

Feature	Description
Multi-source API integration	Open Targets, Semantic Scholar, ClinicalTrials.gov
Literature mining	TF-IDF vectorisation + NMF topic extraction (6 topics)
Clinical trial analysis	TF-IDF + NMF topic extraction (6 topics) on trial protocol text
Full target mention matching	Whole-word regex matching across all 499 targets and all 478 papers + 1,000 trials
Research gap scoring	Weighted composite of 5 normalised signals per target
Knowledge graph	NetworkX DiGraph: 812 nodes, 829 edges, 6 node types, 6 edge types
Evidence cards	499 individual target research briefs, generated programmatically
Streamlit dashboard	7-page interactive dashboard with filters, charts, and download buttons
Downloadable outputs	CSVs, GraphML, GEXF, JSON, markdown reports
Ethical guardrails	Cautious language, no clinical claims, explicit disclaimers throughout

4. Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        NeuroGraph Agent Pipeline                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Step 1 │ Project Setup & Configuration                             │
│                                                                     │
│  Step 2 │ API Data Collection                                       │
│         │  ├── Open Targets  →  499 Alzheimer's disease targets     │
│         │  ├── Semantic Scholar  →  478 literature records          │
│         │  └── ClinicalTrials.gov  →  1,000 trial registrations     │
│                                                                     │
│  Step 3 │ Data Cleaning & Validation                                │
│         │  ├── Literature: text fields, deduplication, year filter  │
│         │  ├── Trials: status, phase, protocol text assembly        │
│         │  └── Targets: association scores, symbol validation       │
│                                                                     │
│  Step 4 │ Exploratory Data Analysis                                 │
│         │  └── 42 output figures, EDA summary report                │
│                                                                     │
│  Step 5 │ Text Mining: TF-IDF + NMF Topic Extraction               │
│         │  ├── Literature corpus  →  6 NMF topics                   │
│         │  └── Trial corpus  →  6 NMF topics                        │
│                                                                     │
│  Step 5.5│ Full Target Mention Matching                             │
│         │  └── Whole-word regex across all 499 × 1,478 records      │
│                                                                     │
│  Step 6 │ Research Gap Scoring                                      │
│         │  └── 5-signal composite score, tier classification        │
│                                                                     │
│  Step 7 │ Knowledge Graph Construction                              │
│         │  └── NetworkX DiGraph: 812 nodes, 829 edges               │
│                                                                     │
│  Step 8 │ Streamlit Dashboard                                       │
│         │  └── 7-page interactive dashboard                         │
│                                                                     │
│  Step 9 │ Evidence Card Generation                                  │
│         │  └── 499 individual target research briefs                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

5. Data Sources

Source	What Was Collected	Volume
Open Targets	Disease-target association scores, gene symbols, approved names, biotypes, datasource scores for Alzheimer's disease (MONDO:0004975)	499 targets
Semantic Scholar	Paper title, abstract, year, citation count, venue, fields of study	478 papers
ClinicalTrials.gov	Trial title, brief summary, study status, phase, conditions, interventions, eligibility criteria, outcomes	1,000 registrations

All data was retrieved via public APIs. No proprietary data was used. Raw files are stored locally and not redistributed.

6. Methodology

6.1 Text Mining: TF-IDF + NMF

TF-IDF (Term Frequency–Inverse Document Frequency) was applied independently to the literature corpus (title + abstract) and to the trial corpus (protocol text). Non-negative Matrix Factorisation (NMF) with k = 6 topics was applied to each TF-IDF matrix, yielding statistically derived thematic groupings. Topic labels were assigned manually based on top NMF terms.

Literature topics identified: Amyloid/Tau Pathology; Neuroinflammation & Immune; Care, Lifestyle & QoL; Cognitive & Neuropsychological; Drug/Therapeutic Interventions; and mixed overlapping themes.

Trial topics identified: Care, Lifestyle & QoL; Amyloid/Tau Pathology; Neuroinflammation & Immune (multiple clusters); and Other/Mixed.

6.2 Target Mention Matching

Each of the 499 gene symbols was matched against every paper abstract and every trial protocol text using whole-word regex tokenisation. This produced per-target counts of:

literature_mentions — symbol occurrences across papers
trial_mentions — symbol occurrences across trial text
literature_documents_mentioned — distinct papers containing the symbol
trial_documents_mentioned — distinct trials containing the symbol

Symbols with short or ambiguous names were flagged with a symbol_ambiguity_flag (low / moderate) to indicate elevated false-positive risk.
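
The matching and flagging described above can be sketched as follows. This is an illustrative version only: the case-sensitive pattern, the helper names `count_symbol_mentions` and `ambiguity_flag`, and the length-based flag rule are assumptions, not the pipeline's confirmed implementation.

```python
import re


def count_symbol_mentions(symbol, documents):
    """Return (total occurrences, number of documents with at least one occurrence)."""
    # Whole-word match: the symbol must not touch another letter or digit on either side.
    pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(symbol)}(?![A-Za-z0-9])")
    total = 0
    documents_mentioned = 0
    for text in documents:
        hits = len(pattern.findall(text))
        total += hits
        if hits:
            documents_mentioned += 1
    return total, documents_mentioned


def ambiguity_flag(symbol, short_length=3):
    """Hypothetical rule: treat very short symbols as having elevated false-positive risk."""
    return "moderate" if len(symbol) <= short_length else "low"


# Invented example sentences, not records from the corpus.
docs = [
    "TREM2 variants alter microglial responses; TREM2 is expressed on myeloid cells.",
    "Soluble sTREM2 was measured in cerebrospinal fluid.",
    "APP processing and TREM2-dependent signalling were examined.",
]
for symbol in ("TREM2", "APP", "CD33"):
    mentions, n_docs = count_symbol_mentions(symbol, docs)
    print(symbol, mentions, n_docs, ambiguity_flag(symbol))
```

Note that `sTREM2` is not counted as a `TREM2` mention under this pattern, while `TREM2-dependent` is. Whether such cases should match is a design choice that affects both false positives and false negatives.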

6.3 Research Gap Scoring

Each target received a composite gap_score (0–1) derived from five normalised signals:

Signal Component	Description
Evidence strength	Normalised Open Targets association score
Literature signal	Normalised literature mention count
Trial signal	Normalised trial mention count (inverted — fewer = higher gap)
Underexploration	Literature-to-evidence coverage ratio
Lit-to-trial gap	Normalised difference between literature and trial mention counts

A conservative_gap_score additionally penalises targets with a moderate or high ambiguity flag. Targets are assigned to one of four interpretation tiers:

Tier	Count	Description
Potential research gap candidate	19	Literature signal present; comparatively low trial-text signal
Established or trial-covered target	8	Strong trial-text presence; useful as comparator
No direct text match in this corpus	471	Symbol not detected in retrieved corpus
Lower-priority signal	1	Weak evidence and mention signals
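
A weighted composite of this kind could be sketched as below. The equal weights, the min-max normalisation, the multiplicative ambiguity penalty, and all numeric inputs are placeholders chosen for illustration; the actual weights and penalty used by the pipeline are not stated in this README.

```python
def min_max(values):
    """Scale a list of numbers to the 0-1 range (all zeros if the values are constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical equal weights over the five signals named in the table above.
HYPOTHETICAL_WEIGHTS = {
    "evidence_strength": 0.2,
    "literature_signal": 0.2,
    "trial_signal_inverted": 0.2,
    "underexploration": 0.2,
    "lit_to_trial_gap": 0.2,
}


def gap_score(signals, weights=HYPOTHETICAL_WEIGHTS):
    """Weighted average of already-normalised signals, so the result stays in 0-1."""
    total_weight = sum(weights.values())
    return sum(weights[name] * signals[name] for name in weights) / total_weight


def conservative_gap_score(score, ambiguity_flag, penalty=0.8):
    """Down-weight the score for ambiguous symbols (the penalty factor is an assumption)."""
    if ambiguity_flag in ("moderate", "high"):
        return score * penalty
    return score


# Inverting the trial signal: fewer trial mentions gives a higher gap contribution.
trial_mentions = [0, 5, 20]  # invented counts for three placeholder targets
trial_inverted = [1.0 - x for x in min_max(trial_mentions)]

example_signals = {
    "evidence_strength": 0.8,
    "literature_signal": 0.6,
    "trial_signal_inverted": trial_inverted[0],
    "underexploration": 0.5,
    "lit_to_trial_gap": 0.6,
}
score = gap_score(example_signals)
print(round(score, 3), round(conservative_gap_score(score, "moderate"), 3))
```

Dividing by the total weight keeps the composite in the 0-1 range even if the weights are later changed so that they no longer sum to 1.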

6.4 Knowledge Graph Construction

A directed NetworkX graph was constructed with:

Node types (812 total):

Disease (1): Alzheimer's disease
Target (499): scored gene targets
Paper (150): literature records with direct symbol matches
Trial (150): trial registrations with direct symbol matches
LiteratureTheme (6): NMF topic nodes for the literature corpus
TrialTheme (6): NMF topic nodes for the trial corpus

Edge types (829 total):

disease_target_evidence — disease → target via Open Targets association
target_mentioned_in_paper — target → paper via whole-word symbol match
target_mentioned_in_trial — target → trial via whole-word symbol match
paper_has_literature_theme — paper → NMF literature topic
trial_has_theme — trial → NMF trial topic
theme_overlap — LiteratureTheme ↔ TrialTheme via shared top terms

The graph is exported in GraphML, GEXF, and JSON formats for downstream analysis in tools such as Gephi or D3.js.
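
To make the schema concrete, the following sketch builds a six-node toy graph with one node per type and exports it in the three formats. The node identifiers, attribute names (`node_type`, `edge_type`), and file names are assumptions for this example; the bidirectional theme_overlap relation is represented here as a pair of directed edges.

```python
import json
import tempfile
from pathlib import Path

import networkx as nx
from networkx.readwrite import json_graph


def build_toy_graph():
    """Build a tiny DiGraph with one node per node type and every edge type."""
    graph = nx.DiGraph()
    nodes = [
        ("disease:alzheimers", "Disease"),
        ("target:TREM2", "Target"),
        ("paper:example-1", "Paper"),  # placeholder record, not a real paper
        ("trial:example-1", "Trial"),  # placeholder record, not a real trial
        ("lit_theme:0", "LiteratureTheme"),
        ("trial_theme:0", "TrialTheme"),
    ]
    for node_id, node_type in nodes:
        graph.add_node(node_id, node_type=node_type)
    edges = [
        ("disease:alzheimers", "target:TREM2", "disease_target_evidence"),
        ("target:TREM2", "paper:example-1", "target_mentioned_in_paper"),
        ("target:TREM2", "trial:example-1", "target_mentioned_in_trial"),
        ("paper:example-1", "lit_theme:0", "paper_has_literature_theme"),
        ("trial:example-1", "trial_theme:0", "trial_has_theme"),
        ("lit_theme:0", "trial_theme:0", "theme_overlap"),
        ("trial_theme:0", "lit_theme:0", "theme_overlap"),
    ]
    for source, target, edge_type in edges:
        graph.add_edge(source, target, edge_type=edge_type)
    return graph


def export_graph(graph, out_dir):
    """Write GraphML, GEXF, and node-link JSON files; return the file names written."""
    out_dir = Path(out_dir)
    nx.write_graphml(graph, str(out_dir / "toy_graph.graphml"))
    nx.write_gexf(graph, str(out_dir / "toy_graph.gexf"))
    data = json_graph.node_link_data(graph)
    (out_dir / "toy_graph.json").write_text(json.dumps(data, indent=2))
    return sorted(p.name for p in out_dir.iterdir())


toy_graph = build_toy_graph()
with tempfile.TemporaryDirectory() as tmp:
    exported = export_graph(toy_graph, tmp)
print(toy_graph.number_of_nodes(), toy_graph.number_of_edges(), exported)
```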

6.5 Evidence Card Generation

Each of the 499 targets received a structured, programmatically generated evidence card containing:

Target identity (symbol, name, biotype, IDs)
Evidence strength classification (from Open Targets association score)
Literature signal classification (from mention counts)
Trial text signal classification (from mention counts)
Dominant research themes (from matched paper/trial topic assignments)
A rule-based narrative explaining why the target may be of interest
Matched paper and trial records with metadata
Limitations and suggested next validation steps

Evidence cards are generated entirely from pipeline outputs, with no LLM involvement. All language is cautious and explicitly hypothesis-generating.
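
A rule-based card of this kind might be assembled roughly as follows. The classification thresholds, field names, wording, and the placeholder record are all hypothetical; they only illustrate how fixed rules and templates can produce cautious text without an LLM.

```python
def classify_mentions(count):
    """Map a mention count to a signal label (thresholds are hypothetical)."""
    if count == 0:
        return "no direct text match"
    if count < 5:
        return "limited"
    return "notable"


def render_evidence_card(record):
    """Build a plain-text evidence card from one target's pipeline outputs."""
    lit = record["literature_mentions"]
    trial = record["trial_mentions"]
    lines = [
        f"Evidence card: {record['symbol']}",
        f"Approved name: {record['approved_name']}",
        f"Literature signal: {classify_mentions(lit)} ({lit} mentions)",
        f"Trial text signal: {classify_mentions(trial)} ({trial} mentions)",
    ]
    # Rule-based narrative: the wording depends only on the comparison below.
    if lit > trial:
        lines.append(
            "Why it may be of interest: literature mentions exceed trial-text "
            "mentions in this corpus. This is a hypothesis-generating signal only."
        )
    else:
        lines.append(
            "Why it may be of interest: no literature-over-trial signal was "
            "detected in this corpus."
        )
    lines.append("Limitations: lexical matching only; expert validation is required.")
    return "\n".join(lines)


# Placeholder record with invented values, not a real pipeline row.
example_record = {
    "symbol": "EXAMPLE1",
    "approved_name": "placeholder target",
    "literature_mentions": 6,
    "trial_mentions": 0,
}
card = render_evidence_card(example_record)
print(card)
```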

7. Results Summary

Metric	Value
Targets scored	499
Potential research gap candidates	19
Established / trial-covered comparators	8
Papers analysed	478
Clinical trials analysed	1,000
NMF topics extracted	6 literature + 6 trial
Knowledge graph nodes	812
Knowledge graph edges	829
Evidence cards generated	499
Output figures	42
Markdown reports	8

Top Gap Candidates (by gap score)

Rank	Symbol	Approved Name	Gap Score	Literature Mentions
1	TREM2	Triggering receptor expressed on myeloid cells 2	0.748	56
6	CD33	CD33 molecule	—	6
8	CDK5	Cyclin dependent kinase 5	—	2
9	ADAM10	ADAM metallopeptidase domain 10	—	1
10	CSF1R	Colony stimulating factor 1 receptor	—	1

All gap scores are hypothesis-generating signals, not therapeutic recommendations. Literature and trial mention counts reflect this corpus sample only.

Established Comparator Targets

For context, well-studied targets such as APOE (70 literature mentions, 21 trial mentions) and APP (25 literature mentions, 76 trial mentions) serve as calibration references — demonstrating what a target with high evidence coverage looks like in this pipeline.

8. Dashboard Pages

Page	Content
Overview	Key metrics, pipeline summary, top gap candidates, score distribution charts
Gap Candidates	Filterable table of all 499 scored targets; filter by interpretation, symbol ambiguity, gap score, and mention counts
Target Explorer	4-tab view per target: Evidence Card, Matching Papers, Matching Trials, Research Brief
Evidence Cards	Browse all 499 evidence cards; filter by interpretation and evidence strength; signal profile chart
Themes	NMF topic sizes, top terms, paper/trial browsers for each literature and trial theme
Knowledge Graph	Graph metrics, static subgraph figures, interactive target edge explorer split by edge type
Reports & Downloads	Download all processed CSVs, graph files (GraphML / GEXF / JSON), and markdown reports

9. Output Files

outputs/
├── evidence_cards/          # 499 individual markdown evidence card files
│   ├── TREM2_evidence_card.md
│   ├── APOE_evidence_card.md
│   └── ... (499 total)
├── figures/                 # 42 PNG output figures
│   ├── eda_01_*.png         # EDA figures
│   ├── gap_01_*.png         # Gap scoring figures
│   └── kg_01_*.png          # Knowledge graph figures
├── graphs/                  # Knowledge graph exports
│   ├── neurograph_knowledge_graph.graphml
│   ├── neurograph_knowledge_graph.gexf
│   └── neurograph_knowledge_graph.json
└── reports/                 # Markdown analysis reports
    ├── eda_summary.md
    ├── text_mining_summary.md
    ├── gap_scoring_summary.md
    ├── knowledge_graph_summary.md
    ├── evidence_card_generation_summary.md
    └── top_gap_candidate_briefs.md    # 100KB research brief pack

Data/processed/
├── research_gap_scores.csv             # Full 499-target scored table
├── strong_text_supported_gap_candidates.csv
├── target_text_mentions_full.csv       # Raw mention counts per target
├── target_evidence_cards.csv           # 499-row evidence card table
├── knowledge_graph_nodes.csv           # 812 nodes with attributes
├── knowledge_graph_edges.csv           # 829 edges with attributes
├── literature_topics.csv               # 6 NMF literature topics
├── clinical_trial_topics.csv           # 6 NMF trial topics
├── literature_topic_assignments.csv    # Per-paper topic scores
└── clinical_trial_topic_assignments.csv

10. Technology Stack

Component	Technology
Language	Python 3.10+
Data processing	pandas, numpy
Text mining	scikit-learn (TF-IDF, NMF)
Knowledge graph	NetworkX
Visualisation	matplotlib
Dashboard	Streamlit
API access	requests
Graph export	NetworkX (GraphML, GEXF, JSON)
Environment	Standard Python virtual environment

No proprietary libraries, paid APIs, or LLM calls are used anywhere in the pipeline. All processing is local, deterministic, and fully reproducible from the original API data.

11. CI & Testing

This project includes a lightweight GitHub Actions CI workflow at .github/workflows/ci.yml that runs automatically on every push and pull request to main.

What the workflow validates:

Step	Check
File existence	9 required files must be committed (processed CSVs, reports, `app.py`)
Import check	`pandas`, `numpy`, `scikit-learn`, `networkx`, `streamlit` must import without error
Syntax check	`app.py`, all `src/processing/*.py`, all `src/dashboard/*.py` compiled with `py_compile`
Project structure tests	68 pytest tests covering CSV shape, required columns, gap score range, node types, edge types, and forbidden phrase guardrails in evidence cards

This workflow does not call external biomedical APIs or rerun the full data pipeline. It is designed as a lightweight reproducibility and project-health check.
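
Two of the checks listed in the table, the gap score range and the forbidden phrase guardrail, could look roughly like the sketch below. The phrase list and helper names are hypothetical; the project's actual tests live in tests/test_project_structure.py and may differ.

```python
# Hypothetical phrase list; the project's real guardrail list is not shown in this README.
FORBIDDEN_PHRASES = ("proven treatment", "clinically validated", "recommended therapy")


def find_forbidden_phrases(text, phrases=FORBIDDEN_PHRASES):
    """Return the forbidden phrases that appear in the text (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in phrases if phrase in lowered]


def gap_scores_in_range(scores):
    """Check that every gap score lies in the closed interval 0-1."""
    return all(0.0 <= score <= 1.0 for score in scores)


# Invented card text and scores, used only to exercise the helpers.
safe_card = "This is a hypothesis-generating signal and requires expert validation."
unsafe_card = "This target is a Proven Treatment candidate."

print(find_forbidden_phrases(safe_card))
print(find_forbidden_phrases(unsafe_card))
print(gap_scores_in_range([0.0, 0.748, 1.0]))
```

In a pytest suite, each helper would typically be wrapped in a test function that loads the committed evidence cards or CSVs and asserts on the result.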

How to Run Tests Locally

pip install -r requirements.txt
pytest tests/test_project_structure.py -v
python -m py_compile app.py

Running pytest -v will execute all 68 project structure tests. Tests complete in under 2 seconds and require no internet connection or API keys.

12. Ethical Disclaimer

NeuroGraph Agent is built with explicit ethical guardrails:

Not clinical decision support. This system must not be used to inform clinical practice, patient care, or drug development decisions.
No therapeutic claims. A high gap score does not mean a target is therapeutically valid, safe, or ready for development.
No absence claims. Low trial-text signal does not prove that no relevant trials exist — trial registrations may describe interventions by drug name rather than molecular target symbol.
Incomplete corpus. The literature corpus covers 478 papers; the trial corpus covers 1,000 registrations. These are representative samples, not exhaustive reviews.
Symbol ambiguity. Gene symbol matching is lexical and may produce false positives for short or ambiguous symbols.
Expert review required. All outputs require review by qualified biomedical researchers or clinicians before any finding can inform research decisions.
Hypothesis-generating only. Every output in this system is a structured signal for expert-directed investigation, not a conclusion.

These guardrails are enforced in the evidence card text, dashboard language, and all generated reports.

13. About

NeuroGraph Agent was designed and built as an end-to-end research intelligence portfolio project demonstrating:

Multi-source public API integration and data engineering
NLP and unsupervised learning applied to biomedical text
Systematic scoring and ranking of targets across multiple evidence dimensions
Knowledge graph construction and traversal
Programmatic evidence card generation
Interactive Streamlit dashboard development
Responsible AI: cautious language, explicit limitations, and ethical guardrails throughout

The project was built in Python, without proprietary tools, paid APIs, or LLM assistance in evidence generation.
