Gutenberg OPDS 2.0 and Full-Text Search

A PostgreSQL-backed full-text search engine for Project Gutenberg and the new OPDS 2.0 catalog.

Recording.2025-12-29.171834.mp4

Video of DB and Search Engine Interface running on a server with 1GB ram and 1 core CPU. Search returns results with FULL metadata in around 20-30ms in most cases even on minimal hardware.

Requirements

Python Dependencies

python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt

PostgreSQL Extensions

The materialized view requires these extensions:

CREATE EXTENSION IF NOT EXISTS pg_trgm;    -- Fuzzy/trigram search *THE PROJECT GUTENBERG TRGM INSTALL MAY BE BROKEN AND NEED BE COMPLETELY MANUALLY REMOVED AND RE-INSTALLED

Database Setup

Import the Project Gutenberg database

Build the materialized view:

psql -U postgres -d gutendb -f mv_books_dc.sql

Materialized View: mv_books_dc

Denormalized view containing all book metadata for fast searching.

Columns

Column	Type	Description
book_id	int	Primary key
title	text	Book title
downloads	int	Download count (last 30 days)
release_date	date	Publication date (denormalized for fast sorting)
lang_codes	text[]	Array of language codes
copyrighted	int	0 = public domain, 1 = copyrighted
is_audio	bool	True if audiobook
locc_codes	text[]	Library of Congress classification codes
creator_ids	int[]	Creator IDs aligned to creator_names/roles
creator_names	text[]	Creator names aligned to creator_ids/roles
creator_roles	text[]	Creator roles aligned to creator_ids/names
subject_ids	int[]	Subject IDs aligned to subject_names
subject_names	text[]	Subject names aligned to subject_ids
bookshelf_ids	int[]	Bookshelf IDs aligned to bookshelf_names
bookshelf_names	text[]	Bookshelf names aligned to bookshelf_ids
dcmitypes	text[]	DCMI type labels (defaults to Text)
publisher	text	Publisher raw string
summary	text[]	MARC 520 summary entries
credits	text[]	MARC 508 credits entries
reading_level	text	MARC 908 reading level
coverpage	text[]	MARC 901 cover URLs
format_filenames	text[]	File URLs aligned to format_* arrays
format_filetypes	text[]	Filetype codes aligned to format_* arrays
format_hr_filetypes	text[]	Human-readable filetype labels
format_mediatypes	text[]	Media types aligned to format_* arrays
format_extents	bigint[]	File sizes aligned to format_* arrays

Indexes

Type	Columns	Use Case
B-tree	downloads, release_date, copyrighted, is_audio, author birth/death years	Sorting, equality filters
GIN tsvector	tsvec	Full-text search
GiST trigram	book_text	Fuzzy/typo-tolerant search
GIN array	lang_codes	Array containment

Refresh

The MV must be refreshed periodically to reflect database changes:

SELECT refresh_mv_books_dc();

Or via cron/systemd timer:

psql -U postgres -d gutendb -c "SELECT refresh_mv_books_dc();"

FullTextSearch API

Basic Usage

from FullTextSearch import FullTextSearch, SearchType, OrderBy, Crosswalk

fts = FullTextSearch()

# Simple search
result = fts.execute(fts.query().search("Shakespeare")[1, 28])
print(result["total"], result["results"])

# Get single book
book = fts.get(1342)

# Get multiple books
books = fts.get_many([1342, 84, 11])

# Count only
count = fts.count(fts.query().search("Shakespeare"))

Search Types

Type	Operator	Index	Speed	Use Case
FTS	`@@` websearch_to_tsquery	GIN tsvector	Fast	Stemmed word matching, supports operators
FUZZY	`<%` word_similarity	GiST trigram	Slower	Typo tolerance ("Shakspeare" matches "Shakespeare")

FTS (Full-Text Search):

Uses PostgreSQL websearch_to_tsquery
Supports boolean operators (see below)
Default sort: Downloads
Best for: Precise searches, exact word matching, speed

Fuzzy Search:

Uses trigram similarity for typo tolerance
Default sort: Downloads
Best for: User input with potential misspellings

from FullTextSearch import SearchType

# FTS (default)
q.search("Shakespeare")

# Fuzzy
q.search("Shakspeare", SearchType.FUZZY)

FTS Search Operators

FTS mode supports PostgreSQL websearch_to_tsquery syntax:

Operator	Example	Meaning
`"phrase"`	`"to be or not to be"`	Exact phrase match
`word1 word2`	`shakespeare and hamlet`	AND (both words required)
`word1 or word2`	`romeo or juliet`	OR (either word)
`-word`	`shakespeare -hamlet`	NOT (exclude word)

Search Fields

Fuzzy search currently targets only title/authors/subjects/bookshelves combined.

The fts targets Gutenbergs books public table tsvector field.

Filter Methods

from FullTextSearch import LanguageCode, LoccClass, FileType, Encoding

# By ID
q.etext(1342)                    # Single book
q.etexts([1342, 84, 11])         # Multiple books

# By downloads
q.downloads_gte(1000)            # >= 1000 downloads
q.downloads_lte(100)             # <= 100 downloads

# By copyright
q.public_domain()                # copyrighted = 0
q.copyrighted()                  # copyrighted = 1

# By language 
q.lang(LanguageCode.EN)          # English
q.lang(LanguageCode.DE)          # German

# By format
q.text_only()                    # Exclude audiobooks
q.audiobook()                    # Only audiobooks

# By author dates
q.author_born_after(1900)
q.author_born_before(1700)
q.author_died_after(1950)
q.author_died_before(1800)

# By release date (
q.released_after("2020-01-01")
q.released_before("2000-01-01")

# By classification 
q.locc(LoccClass.P)              # LoCC top-level class (prefix match)
q.locc("PS")                     # Optional: narrower LoCC prefix match

# By ID 
q.author_id(53)                  # Mark Twain
q.subject_id(1)                  # Uses mn_books_subjects
q.bookshelf_id(68)               # Uses mn_books_bookshelves
q.contributor_role("Illustrator")

# Custom SQL
q.where("COALESCE(array_length(creator_ids, 1), 0) > :n", n=2)

Chaining

All methods return self for chaining:

from FullTextSearch import OrderBy, LanguageCode

result = fts.execute(
    fts.query()
    .search("Adventure")
    .lang(LanguageCode.EN)
    .public_domain()
    .downloads_gte(100)
    .order_by(OrderBy.DOWNLOADS)
    [1, 28]
)

Ordering

from FullTextSearch import OrderBy, SortDirection

q.order_by(OrderBy.RELEVANCE)                    # ts_rank_cd (FTS) or word_similarity (Fuzzy)
q.order_by(OrderBy.DOWNLOADS)                   
q.order_by(OrderBy.TITLE)
q.order_by(OrderBy.AUTHOR)
q.order_by(OrderBy.RELEASE_DATE)                
q.order_by(OrderBy.RANDOM)
q.order_by(OrderBy.DOWNLOADS, SortDirection.ASC) # Explicit direction

Pagination

q[1, 28]   # Page 1, 28 results per page (default)
q[3, 50]   # Page 3, 50 results per page
q[2]       # Page 2, default 28 per page

Output Formats (Crosswalks)

from FullTextSearch import Crosswalk

fts.query(Crosswalk.PG)     # Project Gutenberg API format (default)
fts.query(Crosswalk.OPDS)   # OPDS 2.0 publication format
fts.query(Crosswalk.CUSTOM) # Custom transformer (see below)

Custom Crosswalks

def my_format(row):
    return {
        "id": row.book_id,
        "name": row.title,
        "author": row.all_authors,
    }

fts.set_custom_transformer(my_format)
result = fts.execute(fts.query(Crosswalk.CUSTOM).search("Shakespeare"))

Result Format

{
    "results": [...],      # List of books in crosswalk format
    "page": 1,             # Current page
    "page_size": 28,       # Results per page
    "total": 1234,         # Total matching books
    "total_pages": 45      # Total pages
}

OPDS 2.0 Server

What is OPDS?

OPDS (Open Publication Distribution System) is a standard for distributing digital publications. OPDS 2.0 uses JSON feeds that e-reader apps (like Thorium, FBReader, etc.) can browse to discover and download books.

Running the Server

python OPDS.py

Server runs at http://127.0.0.1:8080/opds/

OPDS Catalog Structure

Endpoints

Root Catalog

GET /opds/

Returns homepage with:

Navigation links (Search Fuzzy, Search FTS, Browse by LoCC, Bookshelf, Subjects Most Popular, Recently Added, Random)
Groups: All curated bookshelves with 20 sample publications each

Search

Query Parameters:

Parameter	Values	Default	Description
query	Search text	(empty)	Search query (supports operators in FTS mode)
field	`keyword`, `fts_keyword`, `fuzzy_keyword`, `title`, `author`, etc.	`keyword` (fuzzy)	Search field and type
lang	Language code (en, de, fr, etc.)	(any)	Filter by language
copyrighted	`true`, `false`	(any)	Filter by copyright status
audiobook	`true`, `false`	(any)	Filter by format
locc	LoCC code (A, B, PS, etc.)	(any)	Filter by Library of Congress classification
sort	`downloads`, `relevance`, `title`, `author`, `release_date`, `random`	`relevance` (FTS) or `downloads` (Fuzzy)	Sort order
sort_order	`asc`, `desc`	(default per field)	Sort direction
page	Page number	1	Pagination
limit	1-100	28	Results per page

Example searches:

/opds/search?query=shakespeare - Fuzzy search
/opds/search?query=shakespeare&field=fts_keyword - FTS search
/opds/search?query="romeo and juliet" - Exact phrase (FTS)
/opds/search?query=twain or clemens - OR operator (FTS)
/opds/search?query=science -fiction - Exclude word (FTS)

Browse by LoCC (Subject Classification)

GET /opds/loccs
GET /opds/loccs?parent=P
GET /opds/loccs?parent=PS&query=novel

Hierarchical navigation through Library of Congress Classification:

Without parent: Shows top-level classes (A, B, C, D, E, F, G, H, J, K, L, M, N, P, Q, R, S, T, U, V, Z)
With parent: Shows subclasses or books (if leaf node) *ONLY GOES TWO LAYERS SO FOR EXAMPLE ALL CODES STARTING A WILL BE LISTED UNDER loccs?parent=P
Supports search and filtering within any LoCC class
Shows Top Subjects facet when viewing books

Bookshelves

GET /opds/bookshelves
GET /opds/bookshelves?id=644
GET /opds/bookshelves?id=644&query=adventure

Curated bookshelves organized by category.

Subjects

GET /opds/subjects
GET /opds/subjects?id=1
GET /opds/subjects?id=1&query=novel

Lists top 50 or so subjects (by book count).

LOCC

GET /opds/genres
GET /opds/genres?code=P
GET /opds/genres?code=P&query=poetry

Allows users to explore the LOCC Heirarchy.

Facets (Dynamic Filters)

Facets appear on publication pages (search results, browse results) and provide dynamic filtering options:

Sort By:

Most Popular (downloads)
Relevance
Title (A-Z)
Author (A-Z)
Random

Top Subjects in Results:

Appears when query or filters are active
Shows up to 15 most common subjects in current results
Clicking a subject navigates to that subject page (facet disappears)

Broad Genre:

Library of Congress top-level classes
Only shown in main search (not in LoCC browse)

Copyright Status:

Any / Public Domain / Copyrighted

Format:

Any / Text / Audiobook

Language:

Any / English / German / French / etc.

Navigation Behavior

Search: Preserves search type (Fuzzy/FTS) and context (bookshelf, subject, LoCC)
Back navigation: All pages include rel="start" link to homepage
Up navigation: Parent pages include rel="up" link
Search template: Uses {?query} syntax for OPDS-compliant search URIs

Testing

python test.py

Runs all search types, filters, and crosswalks with timing output.

Architecture

flowchart TB
    subgraph Server["CherryPy OPDS Server"]
        TP[Thread Pool]
    end

    subgraph Stateless["FullTextSearch (stateless)"]
        FTS[FullTextSearch]
        ENGINE[SQLAlchemy Engine]
        FACTORY[Session Factory]
    end

    subgraph Stateful["Per-Request Objects"]
        SQ[SearchQuery]
        SESS[Session]
    end

    subgraph Pool["SQLAlchemy Pool"]
        C1[Connection 1]
        C2[Connection 2]
        CN[Connection N]
    end

    subgraph DB["PostgreSQL"]
        MV[Materialized View]
        SB[Shared Buffers]
    end

    TP --> FTS
    FTS -->|"create_engine()"| ENGINE
    ENGINE -->|"bind"| FACTORY
    FTS -->|".query()"| SQ
    SQ -->|".search().lang()[1,28]"| SQ
    FTS -->|".execute(q)"| FACTORY
    FACTORY -->|"with Session()"| SESS
    SESS -->|"checkout"| Pool
    Pool --> DB

Typical usage:

fts = FullTextSearch()                          # Creates engine + session factory
q = fts.query().search("Shakespeare")[1, 28]    # Builds SearchQuery (no DB hit)
result = fts.execute(q)                         # Session checks out connection, runs query

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
opds		opds
search		search
tables		tables
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

zachjesus/pg-search-opds

Folders and files

Latest commit

History

Repository files navigation