ARF — Advanced Retrieval Framework

A zero-dependency retrieval pipeline toolkit. Plug in your own vector search, embedding model, LLM, ML model, and database — ARF provides the routing algorithms, feature engineering, rephrase-graph caching, and score blending.

pip install advanced-rag-framework

What ARF Does

Most RAG pipelines send every candidate to an expensive LLM for reranking. ARF avoids most of those calls with a multi-stage filtering pipeline called R-Flow:

Query
  → Cache graph walk (free — returns instantly if seen before)
  → Vector search (your provider)
  → Threshold + gap filter (free — drops obvious junk)
  → MLP triage (free, <5ms — accept/reject/uncertain)
  → LLM verification ($$$ — only for the ~20% uncertain candidates)
  → Answer with summaries

Each stage filters candidates so the next stage does less work. Only the uncertain ~20% ever reach the LLM.

Quick Start

from arf import Pipeline, DocumentConfig, Triage

pipeline = Pipeline(
    doc_config=DocumentConfig(title_field="title", text_fields=["text"]),
    triage=Triage(min_score=0.65, accept_threshold=0.85, verify_threshold=0.70),
    search_fn=my_search,       # (embedding, top_k) → [(dict, float)]
    embed_fn=my_embed,         # (text) → [float]
)

results = pipeline.run("how does caching work?")

That's it. Two required functions. Everything else is optional.
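
For readers wondering what the two callables might look like, here is a minimal sketch that assumes only the signatures given in the comments above. The toy corpus, the bag-of-words embedding, and the `_cosine` helper are illustrative stand-ins and not part of ARF. A real setup would call an embedding model and a vector database instead.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real search_fn would query a vector database.
DOCS = [
    {"title": "Caching", "text": "how the cache stores query results"},
    {"title": "Routing", "text": "how candidates are routed between stages"},
]
VOCAB = sorted({word for doc in DOCS for word in doc["text"].split()})


def my_embed(text):
    """(text) -> [float]: bag-of-words counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def my_search(embedding, top_k):
    """(embedding, top_k) -> [(dict, float)], best match first."""
    scored = [(doc, _cosine(embedding, my_embed(doc["text"]))) for doc in DOCS]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


hits = my_search(my_embed("how does the cache work"), top_k=1)
print(hits[0][0]["title"])  # Caching
```

Any pair of functions with these shapes should be usable in the same slots.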

Full Pipeline

from arf import Pipeline, DocumentConfig, Triage
from arf.trainer import load_reranker

pipeline = Pipeline(
    doc_config=DocumentConfig(
        title_field="title",
        text_fields=["text", "summary"],
        children_fields=["sections", "clauses"],
        hierarchy=["title", "chapter", "section"],
    ),
    triage=Triage(
        min_score=0.65,
        accept_threshold=0.85,
        verify_threshold=0.70,
        gap=0.20,
    ),

    # Required
    search_fn=my_search,           # any vector DB
    embed_fn=my_embed,             # any embedding model

    # Scoring (optional)
    predict_fn=load_reranker("model.joblib"),  # trained MLP
    llm_fn=my_llm_verify,         # any LLM

    # Cache (optional)
    cache_lookup=my_cache_get,     # any cache backend
    cache_store=my_cache_set,

    # Preprocessing (optional)
    preprocess_fn=my_clean,        # translate, normalize, etc.
    moderate_fn=my_moderate,       # content safety
    rephrase_fn=my_rephrase,       # retry with rephrased query

    # Hierarchy (optional)
    resolve_fn=my_get_parent,      # walk up document tree
    summarize_fn=my_summarize,     # generate answer
)

results = pipeline.run("what is due process?", top_k=5)
# [{"document": Document, "score": 0.94, "context": [...], "summary": "..."}, ...]

Components

ARF is made up of seven independent modules. Use them together or individually.

Document — DB-agnostic data model

from arf import Document, DocumentConfig

config = DocumentConfig(
    title_field="name",
    text_fields=["body", "content"],
    children_fields=["subsections"],
    hierarchy=["category", "name"],
)

doc = Document.from_dict({"name": "Guide", "body": "...", "category": "Medical"}, config)
# doc.depth = 2, doc.path = "Medical / Guide"

Works with any database. MongoDB, PostgreSQL, DynamoDB, Pinecone, FAISS — just map your fields.
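
The `depth` and `path` values in the comment above can be reproduced with a few lines of plain Python. This is a sketch under an assumption: that depth counts the hierarchy fields present in the record and that path joins their values with " / ". The function name `hierarchy_metadata` is hypothetical, and ARF's own computation may differ.

```python
def hierarchy_metadata(record, hierarchy, separator=" / "):
    """Return (depth, path) from the hierarchy fields present in a record."""
    parts = [str(record[field]) for field in hierarchy if record.get(field)]
    return len(parts), separator.join(parts)


record = {"name": "Guide", "body": "...", "category": "Medical"}
depth, path = hierarchy_metadata(record, hierarchy=["category", "name"])
print(depth, path)  # 2 Medical / Guide
```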

Features — 15-feature extraction

from arf import FeatureExtractor

extractor = FeatureExtractor(config)
features = extractor.extract_features(query="...", document={...}, semantic_score=0.85)
vector = extractor.to_vector(features)  # [0.85, 4.2, 0, 0, ...]

Feature	Description
`semantic_score`	Raw cosine similarity from vector search
`bm25_score`	Term-frequency relevance approximation
`alias_match`	Whether query matches a document alias
`keyword_match`	Whether query matches via keyword pattern
`domain_type`	Encoded domain identifier
`document_length`	Log-scaled character count
`query_length`	Query character count
`section_depth`	Depth in document hierarchy
`embedding_cosine_similarity`	Direct embedding cosine similarity
`match_type`	0=none, 1=partial, 2=exact
`score_gap_from_top`	Gap from highest-scored document
`query_term_coverage`	Fraction of query terms in document
`title_similarity`	Jaccard similarity between query and title
`has_nested_content`	Whether document has children
`bias_adjustment`	Configurable per-document bias
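
To make a few of these rows concrete, the sketch below computes three of them from raw strings. The tokenizer, the use of `log1p` for log scaling, and the function bodies are assumptions made for illustration. Only the feature names and one-line descriptions come from the table.

```python
import math
import re


def _terms(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_term_coverage(query, document_text):
    """Fraction of distinct query terms that appear in the document."""
    query_terms = _terms(query)
    if not query_terms:
        return 0.0
    return len(query_terms & _terms(document_text)) / len(query_terms)


def title_similarity(query, title):
    """Jaccard similarity: |intersection| / |union| of the two term sets."""
    a, b = _terms(query), _terms(title)
    return len(a & b) / len(a | b) if a | b else 0.0


def document_length(document_text):
    """Log-scaled character count; log1p keeps empty text at 0.0."""
    return math.log1p(len(document_text))


coverage = query_term_coverage("spicy noodle soup", "A spicy soup with herbs")
similarity = title_similarity("due process", "Due Process Clause")
```

Here `coverage` is 2/3 (two of three query terms appear) and `similarity` is 2/3 (two shared terms out of three distinct terms).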

Triage — threshold + gap + zone routing

from arf import Triage

triage = Triage(min_score=0.65, accept_threshold=0.85, verify_threshold=0.70, gap=0.20)
result = triage.classify(candidates)
# result.accepted, result.needs_review, result.rejected
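
One way to picture threshold, gap, and zone routing is the sketch below. The routing rules are an assumed reading of the four parameters, not taken from ARF's source, and the names `Zones` and `route_by_score` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Zones:
    accepted: list = field(default_factory=list)
    needs_review: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def route_by_score(candidates, min_score=0.65, accept_threshold=0.85,
                   verify_threshold=0.70, gap=0.20):
    """Route (document, score) pairs into three zones.

    Assumed rules:
      1. reject scores below min_score,
      2. reject scores more than `gap` below the best score,
      3. accept survivors at or above accept_threshold,
      4. send survivors at or above verify_threshold to review,
      5. reject the rest.
    """
    zones = Zones()
    if not candidates:
        return zones
    top = max(score for _, score in candidates)
    for doc, score in candidates:
        if score < min_score or top - score > gap:
            zones.rejected.append((doc, score))
        elif score >= accept_threshold:
            zones.accepted.append((doc, score))
        elif score >= verify_threshold:
            zones.needs_review.append((doc, score))
        else:
            zones.rejected.append((doc, score))
    return zones


zones = route_by_score([("a", 0.91), ("b", 0.78), ("c", 0.66), ("d", 0.40)])
```

With these inputs, "a" is accepted, "b" needs review, "c" is dropped by the gap rule (0.25 below the top score), and "d" falls below `min_score`.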

QueryGraph — rephrase chain walk

from arf import follow_rephrase_chain

result = follow_rephrase_chain("due process clause", lookup_fn=my_db_lookup, max_hops=3)
# result.hit, result.cached_results, result.path, result.loop_detected

Walks a directed graph of query→rephrase edges with loop detection and early exit on cache hit. Storage-agnostic — you provide the lookup_fn.
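
The walk described above can be sketched in a few lines. This is an illustration under assumptions: `lookup_fn` is taken to return either `None` or a dict with optional `"results"` and `"rephrase"` keys, which is a guess at the shape and not ARF's documented contract. The names `WalkResult` and `walk_rephrase_chain` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WalkResult:
    hit: bool = False
    cached_results: list = field(default_factory=list)
    path: list = field(default_factory=list)
    loop_detected: bool = False


def walk_rephrase_chain(query, lookup_fn, max_hops=3):
    """Follow query -> rephrase edges until a hit, a dead end, a loop, or max_hops."""
    result = WalkResult()
    seen = set()
    current = query
    for _ in range(max_hops + 1):  # the starting node plus up to max_hops edges
        if current in seen:
            result.loop_detected = True
            break
        seen.add(current)
        result.path.append(current)
        node = lookup_fn(current)  # assumed: {"results": [...], "rephrase": str} or None
        if node is None:
            break
        if node.get("results"):
            result.hit = True  # early exit on cache hit
            result.cached_results = node["results"]
            break
        current = node.get("rephrase")
        if current is None:
            break
    return result


graph = {
    "due process clause": {"rephrase": "what is due process?"},
    "what is due process?": {"results": ["doc-1"]},
}
found = walk_rephrase_chain("due process clause", lookup_fn=graph.get)
print(found.hit, found.path)
```

Because the lookup is just a callable, a plain dict's `.get` is enough for testing, and a database query can be swapped in later.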

ScoreParser — LLM output parsing + multiplier blending

from arf import extract_score, multiplier, adjust_score

extract_score('{"score": 7}')           # → 7
extract_score("Score: 8")               # → 8
multiplier(8)                           # → 1.39
adjust_score(0.72, "Score: 8")          # → min(0.72 * 1.39, 1.0)

Parses messy LLM output (JSON, bare numbers, "Score: N" lines) into a 0-9 score, converts to a multiplier, and blends with the retrieval score.
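
As a rough illustration of the parsing and blending steps, the sketch below handles the three input shapes listed above and applies the `min(score * multiplier, 1.0)` cap from the example. The names `parse_score` and `blend` are hypothetical. The mapping from a 0-9 score to a multiplier is not documented here, so the multiplier is passed in rather than derived.

```python
import json
import re


def parse_score(raw):
    """Return an int in 0..9 parsed from LLM output, or None if nothing usable is found."""
    text = raw.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict) and "score" in data:
        candidate = data["score"]
    elif isinstance(data, (int, float)) and not isinstance(data, bool):
        candidate = data
    else:
        # Prefer an explicit "Score: N" line, then fall back to the first bare number.
        match = (re.search(r"score\s*[:=]\s*(\d+)", text, re.IGNORECASE)
                 or re.search(r"\b(\d+)\b", text))
        candidate = match.group(1) if match else None
    try:
        value = int(candidate)
    except (TypeError, ValueError, OverflowError):
        return None
    return value if 0 <= value <= 9 else None


def blend(retrieval_score, factor):
    """Scale the retrieval score by a multiplier and cap the result at 1.0."""
    return min(retrieval_score * factor, 1.0)


print(parse_score('{"score": 7}'), parse_score("Score: 8"), blend(0.72, 1.39))
```

Note that 0.72 × 1.39 is slightly above 1.0, so the blended score in the example above is capped at 1.0.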

Trainer — MLP training

from arf.trainer import train_reranker, load_reranker

# Train
metrics = train_reranker(X, y, architecture=(64, 32, 16), save_path="model.joblib")

# Load as a predict_fn for Pipeline
predict_fn = load_reranker("model.joblib")

Requires pip install advanced-rag-framework[ml] (numpy + scikit-learn).

Ingest — document ingestion helper

from arf import ingest_documents, DocumentConfig

result = ingest_documents(
    documents,
    config=DocumentConfig(title_field="title", text_fields=["text"]),
    embed_fn=my_embed,     # your embedding function
    store_fn=my_store,     # your DB write function
)
# result.processed, result.skipped, result.errors

Validates documents, computes hierarchy metadata (depth, path), generates embeddings for parent and children, and stores via your function.
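
The overall flow (validate, embed, store, count outcomes) might look like the sketch below. It is a simplified stand-in, not ARF's implementation: the validation rule, the `IngestReport` shape (with `errors` as a list), and the function name `ingest` are assumptions, and hierarchy metadata and child embeddings are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class IngestReport:
    processed: int = 0
    skipped: int = 0
    errors: list = field(default_factory=list)


def ingest(documents, title_field, text_fields, embed_fn, store_fn):
    """Validate, embed, and store each record; count outcomes instead of raising."""
    report = IngestReport()
    for index, record in enumerate(documents):
        text = " ".join(str(record[f]) for f in text_fields if record.get(f))
        if not record.get(title_field) or not text:
            report.skipped += 1  # assumed rule: a title and some text are required
            continue
        try:
            store_fn({**record, "embedding": embed_fn(text)})
            report.processed += 1
        except Exception as exc:  # keep going so one bad record does not stop the batch
            report.errors.append((index, str(exc)))
    return report


stored = []
report = ingest(
    [{"title": "Pho", "text": "beef noodle soup"}, {"title": "", "text": "untitled"}],
    title_field="title",
    text_fields=["text"],
    embed_fn=lambda text: [float(len(text))],  # placeholder embedding
    store_fn=stored.append,
)
print(report.processed, report.skipped, report.errors)  # 1 1 []
```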

Bring Your Own Everything

Slot	What you provide	Examples
`search_fn`	Vector search	FAISS, Pinecone, Weaviate, Qdrant, MongoDB Atlas, pgvector
`embed_fn`	Embeddings	OpenAI, Voyage AI, Cohere, sentence-transformers, Ollama
`predict_fn`	ML model	scikit-learn, XGBoost, PyTorch, any callable
`llm_fn`	LLM verification	OpenAI, Anthropic, Ollama, Llama.cpp, any API
`cache_lookup` / `cache_store`	Cache	Redis, MongoDB, SQLite, in-memory dict
`resolve_fn`	Parent lookup	Any database query
`summarize_fn`	Answer generation	Any LLM
`store_fn` (ingest)	Document storage	Any database write

Installation

# Core (zero dependencies)
pip install advanced-rag-framework

# With MLP training support (numpy + scikit-learn)
pip install advanced-rag-framework[ml]

Sample Project

See sample-project/ for a complete working example using:

- FAISS for vector search
- Voyage AI for embeddings
- OpenAI for LLM verification
- A cooking recipe dataset (non-legal, 46 recipes from 15 cuisines)

python sample-project/ingest.py                          # Embed recipes into FAISS
python sample-project/train.py                           # Train MLP reranker
python sample-project/query.py "spicy noodle soup"       # Full pipeline query

R-Flow Pipeline

The core innovation — each stage filters candidates so the next stage does less work:

                    ┌──────────────────────┐
                    │   Vector Search      │
                    │  (your provider)     │
                    └──────────┬───────────┘
                               │ candidates with scores
                    ┌──────────▼───────────┐
                    │  Threshold + Gap     │
                    │  Filter (~60% cut)   │
                    └──────────┬───────────┘
                               │ survivors
                    ┌──────────▼───────────┐
                    │  Feature Extraction  │
                    │  (15 features)       │
                    └──────────┬───────────┘
                               │ feature vectors
                    ┌──────────▼───────────┐
                    │   MLP Reranker       │
                    │  (<5ms, $0.00)       │
                    └──────────┬───────────┘
                        ┌──────┼──────┐
                   p≥0.6│  0.4<p<0.6  │p≤0.4
                        │      │      │
                   Accept   ┌──▼──┐  Reject
                   (free)   │ LLM │  (free)
                            │(20%)│
                            └──┬──┘
                          Accept/Reject
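
The three-way split at the bottom of the diagram can be expressed as a small routing function. The thresholds 0.6 and 0.4 come from the diagram. The function name `route_by_probability` and the bucket names are illustrative, not ARF's API.

```python
def route_by_probability(scored, accept_at=0.6, reject_at=0.4):
    """Split (candidate, probability) pairs into accept / llm / reject buckets."""
    buckets = {"accept": [], "llm": [], "reject": []}
    for candidate, p in scored:
        if p >= accept_at:
            buckets["accept"].append(candidate)
        elif p <= reject_at:
            buckets["reject"].append(candidate)
        else:
            buckets["llm"].append(candidate)  # only these incur LLM cost
    return buckets


buckets = route_by_probability([("a", 0.92), ("b", 0.55), ("c", 0.40), ("d", 0.10)])
print(buckets)  # {'accept': ['a'], 'llm': ['b'], 'reject': ['c', 'd']}
```

Only the middle band is forwarded to the LLM, so narrowing the gap between the two thresholds trades verification coverage for lower cost.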

Development

git clone https://github.com/jager47X/ARF.git
cd ARF
pip install -e ".[dev]"

# Run library tests
pytest tests/test_arf/ -v

# Lint
ruff check arf/ tests/test_arf/

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes with tests
4. Submit a pull request

License

MIT License — see LICENSE for details.
