SciX Agent

Agent-navigable knowledge layer on the full NASA ADS/SciX corpus. Transforms 32.4M scientific papers and 299M citation edges into infrastructure that AI agents can navigate programmatically via a 15-tool MCP server.

What This Does

Instead of returning ranked lists, the system exposes the structural topology of science -- citation graphs, research communities, and multi-model embeddings -- through Model Context Protocol tools.

Key capabilities:

Hybrid search: INDUS dense embeddings (domain-specific scientific similarity, full 32.4M corpus) + lexical ranking (PostgreSQL ts_rank_cd cover-density ranking by default; BM25 via optional ParadeDB pg_search), fused via Reciprocal Rank Fusion
Graph intelligence: PageRank, HITS, Leiden community detection at multiple resolutions on the full citation graph
Entity extraction: LLM-based extraction of methods, datasets, instruments from abstracts and full text
Session state: Working sets that let agents accumulate and reason over papers across a research session
Full-text search: Body text ingested for 46% of papers (14.9M of 32.4M) with GIN-indexed tsvector
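
As an illustration of the fusion step named under hybrid search, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It is not taken from `search.py`: the function name `rrf_fuse`, the constant `k=60` (the conventional default), and the paper IDs are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is the conventional default, not a value
    confirmed for this project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical paper IDs, for illustration only.
dense = ["paperA", "paperB", "paperC"]
lexical = ["paperB", "paperD", "paperA"]
fused = rrf_fuse([dense, lexical])
print(fused)
```

Because RRF uses only ranks, the dense and lexical scores never need to be normalized onto a common scale before fusion.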

Architecture

Single PostgreSQL 16 instance with pgvector 0.8.2. No separate search engine or vector database.

Dimension	Value
Papers	32.4M (1800--2026)
With abstracts	23.3M (72%)
With full text	14.9M ingested (46%)
Citation edges	299.3M
Edge resolution	99.6%
Embeddings	INDUS 768d (32.4M, full corpus)

Discipline coverage (papers may belong to multiple):

Collection	Papers
Physics	17.1M
Earth science	13.1M
General	5.8M
Astronomy	3.0M

Content coverage:

Field	Coverage
Title	>99%
Affiliations	96%
DOI	87%
Abstract	72%
Cited papers	54%
Keywords	49%
References	40%
Full text	46% ingested (14.9M of 32.4M)

Project Structure

src/scix/                     -- Python package (76 modules)
  mcp_server.py               -- MCP stdio server (15 tools)
  mcp_server_http.py          -- MCP HTTP/streamable transport
  search.py                   -- Hybrid search with RRF fusion
  db.py                       -- DB helpers (connection pool, IndexManager, IngestLog)
  ingest.py                   -- JSONL -> PostgreSQL via COPY
  field_mapping.py            -- ADS/SciX JSONL -> SQL field mapping + transforms
  embed.py                    -- INDUS embedding pipeline (SPECTER2/nomic as 20K pilots only)
  graph_metrics.py            -- PageRank, HITS, community detection
  extract.py                  -- LLM entity extraction
  session.py                  -- Agent working set management
  sources/                    -- OpenAlex, ar5iv, S2 source modules
  jit/                        -- JIT entity resolution (cache, router, NER)
  eval/                       -- Retrieval evaluation framework
scripts/                      -- CLI tools (114 scripts)
  ingest.py                   -- Corpus ingestion CLI
  embed_fast.py               -- GPU embedding pipeline
  harvest_*.py                -- External data harvesters
  eval_*.py                   -- Evaluation scripts
  link_*.py                   -- Entity linking scripts
  setup_db.sh                 -- Idempotent database creation
migrations/                   -- Numbered SQL migrations (001..054)
schema.sql                    -- Consolidated PostgreSQL schema (generated)
tests/                        -- pytest suite (153 test files)
docs/                         -- Documentation
  ADR/                        -- Architecture decision records
  DEPLOYMENT.md               -- Operator deploy guide (k8s + compose)
  ADS_INTEGRATION.md          -- BeeHive placement + corpus-sync options
  UPGRADING.md                -- Migration + semver tag contract
  runbooks/                   -- Operational runbooks
  figures/                    -- Data visualizations
deploy/                       -- Container + k8s deployment manifests
  Dockerfile                  -- Hardened multi-stage build
  k8s/                        -- ADS AWS cluster manifests
  compose/                    -- Backoffice docker-compose variant

Setup

Prerequisites

Python 3.11+
PostgreSQL 16 with pgvector 0.8.2
(Optional) NVIDIA GPU for embedding pipeline

Installation

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# For embedding pipeline
pip install -e ".[embed]"

# For MCP server
pip install -e ".[mcp]"

# For graph analytics
pip install -e ".[graph]"

Environment

cp .env.example .env
# Edit .env with your credentials:
#   ADS_API_KEY=<your NASA ADS/SciX API key>
#   SCIX_DSN=dbname=scix

Database

# Create database and apply schema
scripts/setup_db.sh

# Ingest ADS/SciX metadata
python scripts/ingest.py ads_metadata_by_year_picard/

Running the MCP Server

python -m scix.mcp_server

Testing

# Set test DSN to avoid hitting production database
export SCIX_TEST_DSN="dbname=scix_test"

# Run all tests
pytest

# Run only unit tests (no database required)
pytest -m "not integration"

The default DSN (dbname=scix) points at the production database (32.4M papers), so integration tests that write data require SCIX_TEST_DSN to be set.
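
The DSN rule above can be expressed as a small guard. This is a sketch only: `resolve_test_dsn` is a hypothetical helper, not a function from the `scix` package, and the real test fixtures may enforce the rule differently.

```python
import os

DEFAULT_DSN = "dbname=scix"  # production default, as stated in this README


def resolve_test_dsn(env=None):
    """Return the DSN that write tests should use, or raise if it is unsafe.

    Hypothetical helper for illustration; not part of the scix package.
    """
    env = os.environ if env is None else env
    dsn = env.get("SCIX_TEST_DSN", "").strip()
    if not dsn:
        raise RuntimeError("SCIX_TEST_DSN is not set; refusing to run write tests")
    if dsn == DEFAULT_DSN:
        raise RuntimeError("SCIX_TEST_DSN points at the production database")
    return dsn


print(resolve_test_dsn({"SCIX_TEST_DSN": "dbname=scix_test"}))
```

Failing fast when the variable is missing is safer than silently falling back to the default DSN.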

MCP Tools

Search & Discovery

Tool	Description
`search`	Search the corpus by natural-language query. Modes: hybrid (INDUS + BM25 via RRF), semantic, or keyword.
`concept_search`	Retrieve papers tagged with a Unified Astronomy Thesaurus (UAT) concept, with optional expansion to descendant concepts.
`facet_counts`	Distribution of paper counts grouped by year, doctype, arxiv_class, database, bibgroup, or property.
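
For orientation, the sketch below builds the JSON-RPC 2.0 `tools/call` message that an MCP client would send to invoke `search`. The argument names `query` and `mode` are assumptions for illustration; the authoritative names are whatever the server reports in the tool's input schema. A real session also needs the MCP `initialize` handshake first, which is omitted here.

```python
import json


def tool_call_request(request_id, tool, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request as used by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# "query" and "mode" are assumed argument names; check the tool's input schema.
request = tool_call_request(
    1, "search", {"query": "exoplanet atmospheres", "mode": "hybrid"}
)
line = json.dumps(request)  # the stdio transport sends one JSON message per line
print(line)
```

In practice an MCP client library handles this framing; the raw message is shown only to make the tool-call shape concrete.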

Paper Access

Tool	Description
`get_paper`	Full metadata for a paper by bibcode: title, abstract, authors, affiliations, keywords, citation counts. Optionally includes linked entities.
`read_paper`	Read inside a paper's full-text body. `section='methods'` / `role='method'` reads from `papers_fulltext.sections` JSONB (14.4M papers / 96.2%) when available, falling back to flat-body heuristic parsing. Also supports in-paper keyword search via `search_query`.

Citation Graph

Tool	Description
`citation_traverse`	Walk the citation graph: forward (papers that cite it), backward (references), or both. Each returned edge is annotated with `intent` (method / background / result_comparison) when covered by `citation_contexts`. Also supports `mode='chain'` for shortest-path search between two papers.
`citation_similarity`	Find structurally related papers via co-citation (cited together) or bibliographic coupling (shared references).
`cited_by_intent`	Filter incoming citations to a target by their structural-citation intent (method / background / result_comparison) — answers "which papers used X as their method?" or "which papers compared their results to X?" via the `citation_contexts.intent` classification.
`temporal_evolution`	Citations-per-year for a paper, or publications-per-year for a search query.
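
The two relatedness measures behind `citation_similarity` can be illustrated on a toy graph. This is a sketch of the general definitions, not the project's SQL or graph code; the paper IDs and function names are hypothetical.

```python
from collections import Counter

# Toy graph: paper -> set of papers it references (hypothetical IDs).
references = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "P3": {"B", "C"},
}


def bibliographic_coupling(refs, target):
    """Count the references each other paper shares with `target`."""
    shared = Counter()
    for paper, cited in refs.items():
        if paper != target:
            overlap = len(cited & refs[target])
            if overlap:
                shared[paper] = overlap
    return shared


def co_citation(refs, target):
    """Count how often each paper is cited alongside `target`."""
    together = Counter()
    for cited in refs.values():
        if target in cited:
            together.update(cited - {target})
    return together


print(bibliographic_coupling(references, "P1"))
print(co_citation(references, "A"))
```

Bibliographic coupling is fixed once a paper is published, since it depends only on the reference list, while co-citation counts keep growing as new papers cite the pair.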

Entities

Tool	Description
`entity`	Cross-discipline entity lookup across 13 vocabularies (gene, software, mission, organism, target, observable, chemical, location, taxon, plus methods/datasets/instruments/materials; ~9M entities). `action='resolve'` maps free text to canonical entities; `action='papers'` returns papers tagged with an entity via `document_entities` (57.7M paper-entity links). Each result row includes `precision_estimate` + `precision_band` from the dbl.3 NER quality profile.
`entity_context`	Full profile of a known entity by ID: canonical name, type, external identifiers, aliases, related entities, paper count.

Provenance & Replication

Tool	Description
`claim_blame`	Trace a natural-language claim to its earliest non-retracted origin paper by walking reverse references over citation contexts. Returns the origin bibcode plus a Hop chain with intent and intent_weight.
`find_replications`	Enumerate forward citations to a target paper, each annotated with citation intent and an inferred replication relation (replicates / refutes / qualifies / partial / unknown).

Graph Analytics & Session

Tool	Description
`graph_context`	Citation-graph analytics for a paper: PageRank, HITS hub/authority, community membership at coarse/medium/fine resolution. Optionally returns sibling papers in the same community.
`find_gaps`	Surface papers in unexplored communities that cite papers you already inspected. Reads from implicit session state across `get_paper` calls. Optional `query` parameter auto-seeds the working set via `concept_search` for single-call workflows.
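
The idea behind `find_gaps` can be sketched in a few lines: given the papers already inspected, keep the citing papers whose community has not yet been visited. The data structures and the function below are assumptions for illustration, not the server's implementation, which reads session state and community assignments from the database.

```python
def find_gaps(working_set, cites, community):
    """Return papers citing the working set from communities not yet visited."""
    seen_communities = {community[p] for p in working_set}
    gaps = set()
    for citing, cited in cites:  # each edge means: `citing` cites `cited`
        if cited in working_set and citing not in working_set:
            if community[citing] not in seen_communities:
                gaps.add(citing)
    return sorted(gaps)


# Hypothetical papers, community labels, and citation edges.
community = {"A": 1, "B": 1, "C": 2, "D": 1, "E": 3}
cites = [("C", "A"), ("D", "A"), ("E", "B"), ("E", "C")]
print(find_gaps({"A", "B"}, cites, community))
```

Here `D` is excluded because it sits in the same community as the inspected papers, while `C` and `E` come from communities the session has not touched.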

Performance

MCP tool latency at 32M papers:

Operation	Latency
Semantic search (HNSW)	p95 < 10ms
Hybrid search (3-signal RRF)	p95 < 200ms

License

Apache License 2.0. See LICENSE for details.

NASA ADS/SciX metadata is subject to the ADS terms of service.
