A PostgreSQL-backed full-text search engine for Project Gutenberg and the new OPDS 2.0 catalog.
Recording.2025-12-29.171834.mp4
Video of DB and Search Engine Interface running on a server with 1GB ram and 1 core CPU. Search returns results with FULL metadata in around 20-30ms in most cases even on minimal hardware.
python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txtThe materialized view requires these extensions:
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- Fuzzy/trigram search *THE PROJECT GUTENBERG TRGM INSTALL MAY BE BROKEN AND NEED BE COMPLETELY MANUALLY REMOVED AND RE-INSTALLED- Import the Project Gutenberg database
- Build the materialized view:
psql -U postgres -d gutendb -f mv_books_dc.sql
Denormalized view containing all book metadata for fast searching.
| Column | Type | Description |
|---|---|---|
| book_id | int | Primary key |
| title | text | Book title |
| downloads | int | Download count (last 30 days) |
| release_date | date | Publication date (denormalized for fast sorting) |
| lang_codes | text[] | Array of language codes |
| copyrighted | int | 0 = public domain, 1 = copyrighted |
| is_audio | bool | True if audiobook |
| locc_codes | text[] | Library of Congress classification codes |
| creator_ids | int[] | Creator IDs aligned to creator_names/roles |
| creator_names | text[] | Creator names aligned to creator_ids/roles |
| creator_roles | text[] | Creator roles aligned to creator_ids/names |
| subject_ids | int[] | Subject IDs aligned to subject_names |
| subject_names | text[] | Subject names aligned to subject_ids |
| bookshelf_ids | int[] | Bookshelf IDs aligned to bookshelf_names |
| bookshelf_names | text[] | Bookshelf names aligned to bookshelf_ids |
| dcmitypes | text[] | DCMI type labels (defaults to Text) |
| publisher | text | Publisher raw string |
| summary | text[] | MARC 520 summary entries |
| credits | text[] | MARC 508 credits entries |
| reading_level | text | MARC 908 reading level |
| coverpage | text[] | MARC 901 cover URLs |
| format_filenames | text[] | File URLs aligned to format_* arrays |
| format_filetypes | text[] | Filetype codes aligned to format_* arrays |
| format_hr_filetypes | text[] | Human-readable filetype labels |
| format_mediatypes | text[] | Media types aligned to format_* arrays |
| format_extents | bigint[] | File sizes aligned to format_* arrays |
| Type | Columns | Use Case |
|---|---|---|
| B-tree | downloads, release_date, copyrighted, is_audio, author birth/death years | Sorting, equality filters |
| GIN tsvector | tsvec | Full-text search |
| GiST trigram | book_text | Fuzzy/typo-tolerant search |
| GIN array | lang_codes | Array containment |
The MV must be refreshed periodically to reflect database changes:
SELECT refresh_mv_books_dc();Or via cron/systemd timer:
psql -U postgres -d gutendb -c "SELECT refresh_mv_books_dc();"from FullTextSearch import FullTextSearch, SearchType, OrderBy, Crosswalk
fts = FullTextSearch()
# Simple search
result = fts.execute(fts.query().search("Shakespeare")[1, 28])
print(result["total"], result["results"])
# Get single book
book = fts.get(1342)
# Get multiple books
books = fts.get_many([1342, 84, 11])
# Count only
count = fts.count(fts.query().search("Shakespeare"))| Type | Operator | Index | Speed | Use Case |
|---|---|---|---|---|
| FTS | @@ websearch_to_tsquery |
GIN tsvector | Fast | Stemmed word matching, supports operators |
| FUZZY | <% word_similarity |
GiST trigram | Slower | Typo tolerance ("Shakspeare" matches "Shakespeare") |
FTS (Full-Text Search):
- Uses PostgreSQL
websearch_to_tsquery - Supports boolean operators (see below)
- Default sort: Downloads
- Best for: Precise searches, exact word matching, speed
Fuzzy Search:
- Uses trigram similarity for typo tolerance
- Default sort: Downloads
- Best for: User input with potential misspellings
from FullTextSearch import SearchType
# FTS (default)
q.search("Shakespeare")
# Fuzzy
q.search("Shakspeare", SearchType.FUZZY)FTS mode supports PostgreSQL websearch_to_tsquery syntax:
| Operator | Example | Meaning |
|---|---|---|
"phrase" |
"to be or not to be" |
Exact phrase match |
word1 word2 |
shakespeare and hamlet |
AND (both words required) |
word1 or word2 |
romeo or juliet |
OR (either word) |
-word |
shakespeare -hamlet |
NOT (exclude word) |
Fuzzy search currently targets only title/authors/subjects/bookshelves combined.
The fts targets Gutenbergs books public table tsvector field.
from FullTextSearch import LanguageCode, LoccClass, FileType, Encoding
# By ID
q.etext(1342) # Single book
q.etexts([1342, 84, 11]) # Multiple books
# By downloads
q.downloads_gte(1000) # >= 1000 downloads
q.downloads_lte(100) # <= 100 downloads
# By copyright
q.public_domain() # copyrighted = 0
q.copyrighted() # copyrighted = 1
# By language
q.lang(LanguageCode.EN) # English
q.lang(LanguageCode.DE) # German
# By format
q.text_only() # Exclude audiobooks
q.audiobook() # Only audiobooks
# By author dates
q.author_born_after(1900)
q.author_born_before(1700)
q.author_died_after(1950)
q.author_died_before(1800)
# By release date (
q.released_after("2020-01-01")
q.released_before("2000-01-01")
# By classification
q.locc(LoccClass.P) # LoCC top-level class (prefix match)
q.locc("PS") # Optional: narrower LoCC prefix match
# By ID
q.author_id(53) # Mark Twain
q.subject_id(1) # Uses mn_books_subjects
q.bookshelf_id(68) # Uses mn_books_bookshelves
q.contributor_role("Illustrator")
# Custom SQL
q.where("COALESCE(array_length(creator_ids, 1), 0) > :n", n=2)All methods return self for chaining:
from FullTextSearch import OrderBy, LanguageCode
result = fts.execute(
fts.query()
.search("Adventure")
.lang(LanguageCode.EN)
.public_domain()
.downloads_gte(100)
.order_by(OrderBy.DOWNLOADS)
[1, 28]
)from FullTextSearch import OrderBy, SortDirection
q.order_by(OrderBy.RELEVANCE) # ts_rank_cd (FTS) or word_similarity (Fuzzy)
q.order_by(OrderBy.DOWNLOADS)
q.order_by(OrderBy.TITLE)
q.order_by(OrderBy.AUTHOR)
q.order_by(OrderBy.RELEASE_DATE)
q.order_by(OrderBy.RANDOM)
q.order_by(OrderBy.DOWNLOADS, SortDirection.ASC) # Explicit directionq[1, 28] # Page 1, 28 results per page (default)
q[3, 50] # Page 3, 50 results per page
q[2] # Page 2, default 28 per pagefrom FullTextSearch import Crosswalk
fts.query(Crosswalk.PG) # Project Gutenberg API format (default)
fts.query(Crosswalk.OPDS) # OPDS 2.0 publication format
fts.query(Crosswalk.CUSTOM) # Custom transformer (see below)def my_format(row):
return {
"id": row.book_id,
"name": row.title,
"author": row.all_authors,
}
fts.set_custom_transformer(my_format)
result = fts.execute(fts.query(Crosswalk.CUSTOM).search("Shakespeare")){
"results": [...], # List of books in crosswalk format
"page": 1, # Current page
"page_size": 28, # Results per page
"total": 1234, # Total matching books
"total_pages": 45 # Total pages
}OPDS (Open Publication Distribution System) is a standard for distributing digital publications. OPDS 2.0 uses JSON feeds that e-reader apps (like Thorium, FBReader, etc.) can browse to discover and download books.
python OPDS.pyServer runs at http://127.0.0.1:8080/opds/
GET /opds/
Returns homepage with:
- Navigation links (Search Fuzzy, Search FTS, Browse by LoCC, Bookshelf, Subjects Most Popular, Recently Added, Random)
- Groups: All curated bookshelves with 20 sample publications each
Query Parameters:
| Parameter | Values | Default | Description |
|---|---|---|---|
| query | Search text | (empty) | Search query (supports operators in FTS mode) |
| field | keyword, fts_keyword, fuzzy_keyword, title, author, etc. |
keyword (fuzzy) |
Search field and type |
| lang | Language code (en, de, fr, etc.) | (any) | Filter by language |
| copyrighted | true, false |
(any) | Filter by copyright status |
| audiobook | true, false |
(any) | Filter by format |
| locc | LoCC code (A, B, PS, etc.) | (any) | Filter by Library of Congress classification |
| sort | downloads, relevance, title, author, release_date, random |
relevance (FTS) or downloads (Fuzzy) |
Sort order |
| sort_order | asc, desc |
(default per field) | Sort direction |
| page | Page number | 1 | Pagination |
| limit | 1-100 | 28 | Results per page |
Example searches:
/opds/search?query=shakespeare- Fuzzy search/opds/search?query=shakespeare&field=fts_keyword- FTS search/opds/search?query="romeo and juliet"- Exact phrase (FTS)/opds/search?query=twain or clemens- OR operator (FTS)/opds/search?query=science -fiction- Exclude word (FTS)
GET /opds/loccs
GET /opds/loccs?parent=P
GET /opds/loccs?parent=PS&query=novel
Hierarchical navigation through Library of Congress Classification:
- Without
parent: Shows top-level classes (A, B, C, D, E, F, G, H, J, K, L, M, N, P, Q, R, S, T, U, V, Z) - With
parent: Shows subclasses or books (if leaf node) *ONLY GOES TWO LAYERS SO FOR EXAMPLE ALL CODES STARTING A WILL BE LISTED UNDER loccs?parent=P - Supports search and filtering within any LoCC class
- Shows Top Subjects facet when viewing books
GET /opds/bookshelves
GET /opds/bookshelves?id=644
GET /opds/bookshelves?id=644&query=adventure
Curated bookshelves organized by category.
GET /opds/subjects
GET /opds/subjects?id=1
GET /opds/subjects?id=1&query=novel
Lists top 50 or so subjects (by book count).
GET /opds/genres
GET /opds/genres?code=P
GET /opds/genres?code=P&query=poetry
Allows users to explore the LOCC Heirarchy.
Facets appear on publication pages (search results, browse results) and provide dynamic filtering options:
Sort By:
- Most Popular (downloads)
- Relevance
- Title (A-Z)
- Author (A-Z)
- Random
Top Subjects in Results:
- Appears when query or filters are active
- Shows up to 15 most common subjects in current results
- Clicking a subject navigates to that subject page (facet disappears)
Broad Genre:
- Library of Congress top-level classes
- Only shown in main search (not in LoCC browse)
Copyright Status:
- Any / Public Domain / Copyrighted
Format:
- Any / Text / Audiobook
Language:
- Any / English / German / French / etc.
- Search: Preserves search type (Fuzzy/FTS) and context (bookshelf, subject, LoCC)
- Back navigation: All pages include
rel="start"link to homepage - Up navigation: Parent pages include
rel="up"link - Search template: Uses
{?query}syntax for OPDS-compliant search URIs
python test.pyRuns all search types, filters, and crosswalks with timing output.
flowchart TB
subgraph Server["CherryPy OPDS Server"]
TP[Thread Pool]
end
subgraph Stateless["FullTextSearch (stateless)"]
FTS[FullTextSearch]
ENGINE[SQLAlchemy Engine]
FACTORY[Session Factory]
end
subgraph Stateful["Per-Request Objects"]
SQ[SearchQuery]
SESS[Session]
end
subgraph Pool["SQLAlchemy Pool"]
C1[Connection 1]
C2[Connection 2]
CN[Connection N]
end
subgraph DB["PostgreSQL"]
MV[Materialized View]
SB[Shared Buffers]
end
TP --> FTS
FTS -->|"create_engine()"| ENGINE
ENGINE -->|"bind"| FACTORY
FTS -->|".query()"| SQ
SQ -->|".search().lang()[1,28]"| SQ
FTS -->|".execute(q)"| FACTORY
FACTORY -->|"with Session()"| SESS
SESS -->|"checkout"| Pool
Pool --> DB
Typical usage:
fts = FullTextSearch() # Creates engine + session factory
q = fts.query().search("Shakespeare")[1, 28] # Builds SearchQuery (no DB hit)
result = fts.execute(q) # Session checks out connection, runs query