sjarmak/scix-agent

SciX Agent

Agent-navigable knowledge layer on the full NASA ADS/SciX corpus. Transforms 32.4M scientific papers and 299M citation edges into infrastructure that AI agents can navigate programmatically via a 15-tool MCP server.

What This Does

Rather than returning only ranked result lists, the system exposes the structural topology of science -- citation graphs, research communities, and multi-model embeddings -- through Model Context Protocol (MCP) tools.

Key capabilities:

  • Hybrid search: INDUS dense embeddings (domain-specific scientific similarity, full 32.4M corpus) + BM25-style lexical ranking (PostgreSQL ts_rank_cd, which is cover-density ranking rather than true BM25; optional ParadeDB pg_search for BM25 proper), fused via Reciprocal Rank Fusion
  • Graph intelligence: PageRank, HITS, Leiden community detection at multiple resolutions on the full citation graph
  • Entity extraction: LLM-based extraction of methods, datasets, instruments from abstracts and full text
  • Session state: Working sets that let agents accumulate and reason over papers across a research session
  • Full-text search: Body text ingested for 46% of papers (14.9M of 32.4M) with GIN-indexed tsvector
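The fusion step named in the hybrid search bullet can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the code in search.py: the function name `rrf_fuse`, the constant `k=60` (the value commonly used in the RRF literature), and the sample IDs are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed default, not a value taken from this repository.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the dense and lexical retrievers.
dense = ["2019A", "2020B", "2018C"]
lexical = ["2020B", "2021D", "2019A"]
fused = rrf_fuse([dense, lexical])
print(fused)
```

Because RRF uses only ranks, the dense and lexical scores never need to be normalized against each other; a document that ranks well in both lists rises above one that tops a single list.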

Architecture

Single PostgreSQL 16 instance with pgvector 0.8.2. No separate search engine or vector database.

| Dimension | Value |
| --- | --- |
| Papers | 32.4M (1800--2026) |
| With abstracts | 23.3M (72%) |
| With full text | 14.9M ingested (46%) |
| Citation edges | 299.3M |
| Edge resolution | 99.6% |
| Embeddings | INDUS 768d (32.4M, full corpus) |

Discipline coverage (a paper may belong to more than one collection):

| Collection | Papers |
| --- | --- |
| Physics | 17.1M |
| Earth science | 13.1M |
| General | 5.8M |
| Astronomy | 3.0M |

Content coverage:

| Field | Coverage |
| --- | --- |
| Title | >99% |
| Affiliations | 96% |
| DOI | 87% |
| Abstract | 72% |
| Cited papers | 54% |
| Keywords | 49% |
| References | 40% |
| Full text | 46% ingested (14.9M of 32.4M) |

Project Structure

src/scix/                     -- Python package (76 modules)
  mcp_server.py               -- MCP stdio server (15 tools)
  mcp_server_http.py          -- MCP HTTP/streamable transport
  search.py                   -- Hybrid search with RRF fusion
  db.py                       -- DB helpers (connection pool, IndexManager, IngestLog)
  ingest.py                   -- JSONL -> PostgreSQL via COPY
  field_mapping.py            -- ADS/SciX JSONL -> SQL field mapping + transforms
  embed.py                    -- INDUS embedding pipeline (SPECTER2/nomic as 20K pilots only)
  graph_metrics.py            -- PageRank, HITS, community detection
  extract.py                  -- LLM entity extraction
  session.py                  -- Agent working set management
  sources/                    -- OpenAlex, ar5iv, S2 source modules
  jit/                        -- JIT entity resolution (cache, router, NER)
  eval/                       -- Retrieval evaluation framework
scripts/                      -- CLI tools (114 scripts)
  ingest.py                   -- Corpus ingestion CLI
  embed_fast.py               -- GPU embedding pipeline
  harvest_*.py                -- External data harvesters
  eval_*.py                   -- Evaluation scripts
  link_*.py                   -- Entity linking scripts
  setup_db.sh                 -- Idempotent database creation
migrations/                   -- Numbered SQL migrations (001..054)
schema.sql                    -- Consolidated PostgreSQL schema (generated)
tests/                        -- pytest suite (153 test files)
docs/                         -- Documentation
  ADR/                        -- Architecture decision records
  DEPLOYMENT.md               -- Operator deploy guide (k8s + compose)
  ADS_INTEGRATION.md          -- BeeHive placement + corpus-sync options
  UPGRADING.md                -- Migration + semver tag contract
  runbooks/                   -- Operational runbooks
  figures/                    -- Data visualizations
deploy/                       -- Container + k8s deployment manifests
  Dockerfile                  -- Hardened multi-stage build
  k8s/                        -- ADS AWS cluster manifests
  compose/                    -- Backoffice docker-compose variant

Setup

Prerequisites

  • Python 3.11+
  • PostgreSQL 16 with pgvector 0.8.2
  • (Optional) NVIDIA GPU for embedding pipeline

Installation

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# For embedding pipeline
pip install -e ".[embed]"

# For MCP server
pip install -e ".[mcp]"

# For graph analytics
pip install -e ".[graph]"

Environment

cp .env.example .env
# Edit .env with your credentials:
#   ADS_API_KEY=<your NASA ADS/SciX API key>
#   SCIX_DSN=dbname=scix
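How the project loads .env is not documented here, so the following is only a sketch of reading such a file by hand. The helper name `parse_env` and the sample key value are assumptions. It shows one detail that matters for this file: the value of SCIX_DSN itself contains an `=`, so each line must be split on the first `=` only.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Hypothetical helper for illustration; not the project's loader.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # partition splits on the first '=' so "dbname=scix" stays intact.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


sample = "# credentials\nADS_API_KEY=example-key\nSCIX_DSN=dbname=scix\n"
config = parse_env(sample)
print(config["SCIX_DSN"])
```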

Database

# Create database and apply schema
scripts/setup_db.sh

# Ingest ADS/SciX metadata
python scripts/ingest.py ads_metadata_by_year_picard/
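The project structure describes ingestion as JSONL -> PostgreSQL via COPY. As a rough sketch of what that conversion involves, the snippet below turns JSONL records into rows in PostgreSQL's COPY text format (tab-delimited, `\N` for NULL, backslash escapes). The function names and the columns `bibcode`, `title`, and `year` are illustrative assumptions; the real mapping lives in field_mapping.py and is not reproduced here. Array-valued fields such as author lists would need extra handling that this sketch omits.

```python
import json


def copy_escape(value):
    """Render one scalar value in PostgreSQL COPY text format."""
    if value is None:
        return r"\N"  # COPY's default NULL marker
    text = str(value)
    # Escape the backslash first so later escapes are not doubled.
    for raw, esc in (("\\", "\\\\"), ("\t", "\\t"), ("\n", "\\n"), ("\r", "\\r")):
        text = text.replace(raw, esc)
    return text


def jsonl_to_copy_rows(lines, columns):
    """Yield one tab-delimited COPY row per non-empty JSONL line."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        yield "\t".join(copy_escape(record.get(col)) for col in columns) + "\n"


# Hypothetical records; the column names are assumptions.
jsonl = [
    '{"bibcode": "X1", "title": "A\\tB", "year": 2020}',
    '{"bibcode": "X2", "title": null}',
]
rows = list(jsonl_to_copy_rows(jsonl, ["bibcode", "title", "year"]))
print(rows)
```

Rows in this form can be streamed to the server in bulk, which is typically much faster than issuing one INSERT per record.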

Running the MCP Server

python -m scix.mcp_server
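Once the server is running, a client invokes tools with JSON-RPC 2.0 `tools/call` requests, as defined by the Model Context Protocol. The sketch below only builds such a request for the search tool listed under MCP Tools. The argument names `query` and `mode` are assumptions, since this README does not list each tool's parameter schema. A real client must also complete the MCP `initialize` handshake first.

```python
import json


def tool_call_message(request_id, tool, arguments):
    """Build a newline-terminated MCP tools/call request for stdio transport."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport sends one JSON message per line; json.dumps
    # escapes any newline inside string values, so the payload stays on one line.
    return json.dumps(message, separators=(",", ":")) + "\n"


# Argument names are hypothetical; check the server's tool schema.
line = tool_call_message(
    1, "search", {"query": "exoplanet atmospheres", "mode": "hybrid"}
)
print(line, end="")
```

In practice an MCP client library handles this framing; the raw message is shown only to make the wire format concrete.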

Testing

# Set test DSN to avoid hitting production database
export SCIX_TEST_DSN="dbname=scix_test"

# Run all tests
pytest

# Run only unit tests (no database required)
pytest -m "not integration"

The default DSN (dbname=scix) points at the production database with 32M papers. Integration tests that write data require SCIX_TEST_DSN to be set.
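A guard along the following lines would make the rule above explicit. It is a sketch, not the project's test fixture: the function name `resolve_test_dsn` is hypothetical, and the extra check that rejects a test DSN equal to the production default is an added assumption.

```python
PRODUCTION_DSN = "dbname=scix"  # default DSN named in this README


def resolve_test_dsn(env):
    """Return the DSN for write tests, or raise if it is missing or unsafe."""
    dsn = env.get("SCIX_TEST_DSN", "").strip()
    if not dsn:
        raise RuntimeError("SCIX_TEST_DSN is not set; refusing to run write tests")
    if dsn == PRODUCTION_DSN:
        # Assumed extra safety check, not stated in the README.
        raise RuntimeError("SCIX_TEST_DSN points at the production database")
    return dsn


print(resolve_test_dsn({"SCIX_TEST_DSN": "dbname=scix_test"}))
```

Passing `os.environ` as `env` applies the check to the real process environment.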

MCP Tools

Search & Discovery

| Tool | Description |
| --- | --- |
| search | Search the corpus by natural-language query. Modes: hybrid (INDUS + BM25 via RRF), semantic, or keyword. |
| concept_search | Retrieve papers tagged with a Unified Astronomy Thesaurus (UAT) concept, with optional expansion to descendant concepts. |
| facet_counts | Distribution of paper counts grouped by year, doctype, arxiv_class, database, bibgroup, or property. |

Paper Access

| Tool | Description |
| --- | --- |
| get_paper | Full metadata for a paper by bibcode: title, abstract, authors, affiliations, keywords, citation counts. Optionally includes linked entities. |
| read_paper | Read inside a paper's full-text body. section='methods' / role='method' reads from papers_fulltext.sections JSONB (14.4M papers / 96.2%) when available, falling back to flat-body heuristic parsing. Also supports in-paper keyword search via search_query. |

Citation Graph

| Tool | Description |
| --- | --- |
| citation_traverse | Walk the citation graph: forward (papers that cite it), backward (references), or both. Each returned edge is annotated with intent (method / background / result_comparison) when covered by citation_contexts. Also supports mode='chain' for shortest-path search between two papers. |
| citation_similarity | Find structurally related papers via co-citation (cited together) or bibliographic coupling (shared references). |
| cited_by_intent | Filter incoming citations to a target by their structural-citation intent (method / background / result_comparison). Answers "which papers used X as their method?" or "which papers compared their results to X?" via the citation_contexts.intent classification. |
| temporal_evolution | Citations-per-year for a paper, or publications-per-year for a search query. |
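The two similarity measures behind citation_similarity are easy to state on a toy edge list. The sketch below uses in-memory sets rather than the database, and all function names and paper IDs are illustrative assumptions. Co-citation counts the papers that cite both A and B; bibliographic coupling counts the references A and B share.

```python
from collections import defaultdict


def build_index(edges):
    """Index (citing, cited) pairs in both directions."""
    refs = defaultdict(set)      # paper -> papers it cites
    cited_by = defaultdict(set)  # paper -> papers that cite it
    for citing, cited in edges:
        refs[citing].add(cited)
        cited_by[cited].add(citing)
    return refs, cited_by


def co_citation(a, b, cited_by):
    """Number of papers that cite both a and b."""
    return len(cited_by[a] & cited_by[b])


def bibliographic_coupling(a, b, refs):
    """Number of references shared by a and b."""
    return len(refs[a] & refs[b])


# Toy graph with made-up IDs: P3 and P4 both cite P1 and P2.
edges = [
    ("P3", "P1"), ("P3", "P2"),
    ("P4", "P1"), ("P4", "P2"),
    ("P5", "P1"),
    ("P1", "P0"), ("P2", "P0"),
]
refs, cited_by = build_index(edges)
print(co_citation("P1", "P2", cited_by), bibliographic_coupling("P3", "P4", refs))
```

Co-citation can keep growing as new papers cite the pair, whereas bibliographic coupling is fixed once both reference lists are published.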

Entities

| Tool | Description |
| --- | --- |
| entity | Cross-discipline entity lookup across 13 vocabularies (gene, software, mission, organism, target, observable, chemical, location, taxon, plus methods/datasets/instruments/materials; ~9M entities). action='resolve' maps free text to canonical entities; action='papers' returns papers tagged with an entity via document_entities (57.7M paper-entity links). Each result row includes precision_estimate + precision_band from the dbl.3 NER quality profile. |
| entity_context | Full profile of a known entity by ID: canonical name, type, external identifiers, aliases, related entities, paper count. |

Provenance & Replication

| Tool | Description |
| --- | --- |
| claim_blame | Trace a natural-language claim to its earliest non-retracted origin paper by walking reverse references over citation contexts. Returns the origin bibcode plus a Hop chain with intent and intent_weight. |
| find_replications | Enumerate forward citations to a target paper, each annotated with citation intent and an inferred replication relation (replicates / refutes / qualifies / partial / unknown). |

Graph Analytics & Session

| Tool | Description |
| --- | --- |
| graph_context | Citation-graph analytics for a paper: PageRank, HITS hub/authority, community membership at coarse/medium/fine resolution. Optionally returns sibling papers in the same community. |
| find_gaps | Surface papers in unexplored communities that cite papers you already inspected. Reads from implicit session state across get_paper calls. Optional query parameter auto-seeds the working set via concept_search for single-call workflows. |
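The idea behind find_gaps can be illustrated on toy data. This is a sketch under stated assumptions, not the server's implementation: the working set is modeled as a plain set of paper IDs, community labels as a dict, and "unexplored" as "no paper in the working set carries that community label". All names and IDs are made up.

```python
def find_gaps(working_set, edges, community):
    """Papers citing the working set from communities not yet represented in it."""
    seen = {community[p] for p in working_set if p in community}
    gaps = set()
    for citing, cited in edges:
        # Only consider new papers that cite something already inspected.
        if cited not in working_set or citing in working_set:
            continue
        label = community.get(citing)
        if label is not None and label not in seen:
            gaps.add(citing)
    return sorted(gaps)


working_set = {"A", "B"}  # papers already inspected in this session
community = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 3}
edges = [("C", "A"), ("D", "A"), ("E", "B"), ("E", "Z")]  # (citing, cited)
gaps = find_gaps(working_set, edges, community)
print(gaps)
```

Here C is skipped because it sits in the community already explored, while D and E surface as entry points into communities 2 and 3.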

Performance

MCP tool latency at 32M papers:

| Operation | Latency |
| --- | --- |
| Semantic search (HNSW) | p95 < 10ms |
| Hybrid search (3-signal RRF) | p95 < 200ms |
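For readers reproducing these figures, p95 is the latency below which 95% of requests complete. The README does not say how its numbers were measured, so the sketch below assumes the simple nearest-rank definition and uses synthetic samples.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples (ms)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


latencies = [float(i) for i in range(1, 21)]  # synthetic: 1..20 ms
print(p95(latencies))
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small samples, so it helps to state which one is used when comparing runs.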

License

Apache License 2.0. See LICENSE for details.

NASA ADS/SciX metadata is subject to the ADS terms of service.
