fix: author field not extracted, stored, or counted by henosch · Pull Request #72 · PhialsBasement/LibreCrawl

henosch · 2026-05-27T17:02:38Z

Problem

The author field was never correctly populated, persisted, or counted in the E-E-A-T plugin — regardless of what HTML the page contained.

Three root causes were identified:

1. `<link rel="author">` was never extracted (`src/core/seo_extractor.py`)

The extractor only handled `<meta name="author" content="...">`. Many sites use the alternative `<link rel="author" href="...">` format (including multi-value rel attributes such as `rel="nofollow author"`), which was silently ignored.

Fix: Added a fallback lookup for `<link rel="author">` after the meta-tag loop. `<meta name="author">` still takes precedence.
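
The precedence rule above can be sketched as follows. This is only an illustration, not the code from `src/core/seo_extractor.py`: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without dependencies, the names `AuthorParser` and `extract_author` are hypothetical, and it assumes the `href` of the `<link>` element is what ends up in the author field.

```python
from html.parser import HTMLParser


class AuthorParser(HTMLParser):
    """Hypothetical helper: records the first meta author and the first link author."""

    def __init__(self):
        super().__init__()
        self.meta_author = None
        self.link_author = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "author":
            content = attr.get("content", "").strip()
            if content and self.meta_author is None:
                self.meta_author = content
        elif tag == "link":
            # rel is a space-separated token list, so rel="nofollow author" matches too.
            rel_tokens = attr.get("rel", "").lower().split()
            href = attr.get("href", "").strip()
            if "author" in rel_tokens and href and self.link_author is None:
                self.link_author = href


def extract_author(html):
    """Return the meta author if present, else the link author (assumed: its href), else None."""
    parser = AuthorParser()
    parser.feed(html)
    return parser.meta_author or parser.link_author


if __name__ == "__main__":
    print(extract_author('<link rel="nofollow author" href="https://example.com/about">'))
```

Splitting `rel` into tokens is what makes multi-value attributes match; a plain string comparison against `"author"` would miss `rel="nofollow author"`.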

2. `author` was never saved to the database (`src/crawl_db.py`)

The crawled_urls table had no author column, and save_url_batch() did not include it in the INSERT. After any DB reload (saved crawl, resume, etc.) the field was always NULL.

The same was true for keywords, generator, and theme_color.

Fix:

- Added the four columns to the CREATE TABLE schema
- Added ALTER TABLE migrations so existing databases are updated automatically
- Added all four fields to the save_url_batch() INSERT
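
A minimal sketch of the migration and insert steps, assuming the database is SQLite (the PR text does not name the engine). The single-column starting schema, the `migrate` function, and the simplified `save_url_batch` signature are hypothetical stand-ins rather than the actual code in `src/crawl_db.py`.

```python
import sqlite3

# The four columns named in the PR description.
NEW_COLUMNS = ("author", "keywords", "generator", "theme_color")


def migrate(conn):
    """Add any missing column to crawled_urls; safe to call repeatedly."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(crawled_urls)")}
    for column in NEW_COLUMNS:
        if column not in existing:
            conn.execute(f"ALTER TABLE crawled_urls ADD COLUMN {column} TEXT")
    conn.commit()


def save_url_batch(conn, rows):
    """Simplified stand-in: insert a batch of result dicts including the new fields."""
    conn.executemany(
        "INSERT INTO crawled_urls (url, author, keywords, generator, theme_color) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (r["url"], r.get("author"), r.get("keywords"),
             r.get("generator"), r.get("theme_color"))
            for r in rows
        ],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crawled_urls (url TEXT)")  # pre-migration schema (assumed)
    migrate(conn)
    save_url_batch(conn, [{"url": "https://example.com", "author": "Jane"}])
    print(conn.execute("SELECT url, author FROM crawled_urls").fetchall())
```

Checking the existing columns before each `ALTER TABLE` keeps the migration idempotent, so it can run on every startup against both old and new databases.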

3. E-E-A-T plugin checked the wrong field name (`web/static/plugins/e-e-a-t.js`)

The plugin checked url.meta_author — a field that has never existed in the data model. The actual field name is url.author. This caused pagesWithAuthor to always be 0.

A second issue: for DB-loaded crawls where url.author may still be null (existing data), url.meta_tags.author is a valid fallback since meta_tags is serialised as JSON and always persisted.

Fix: Changed the check to `url.author || (url.meta_tags && url.meta_tags.author) || (url.og_tags && url.og_tags.author)`

Summary of changes

| File | Change |
| --- | --- |
| `src/core/seo_extractor.py` | Extract author from `<link rel="author">` as fallback |
| `src/crawl_db.py` | Add `author`, `keywords`, `generator`, `theme_color` columns + migrations + INSERT |
| `web/static/plugins/e-e-a-t.js` | Fix field name `meta_author` → `author`, add `meta_tags.author` fallback |

🤖 Generated with Claude Code

…e="author">

Previously only <meta name="author" content="..."> was handled. Many sites use <link rel="author" href="..."> which was silently ignored, causing the author field to always be empty for those pages.

- Add fallback extraction from <link rel='author'> after meta tags loop
- <meta name="author"> still takes precedence if both are present
- Uses BeautifulSoup's list-aware rel matching, so multi-value rel attributes like rel="nofollow author" are handled correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The backend stores the extracted author in result['author'], not result['meta_author']. The wrong field name caused pagesWithAuthor to always be 0 in the E-E-A-T plugin, regardless of page content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…a-t detection

Three root causes for author always showing 0:

1. crawl_db.py: 'author', 'keywords', 'generator', 'theme_color' were missing from both the CREATE TABLE schema and the save_url_batch INSERT. After a DB reload these fields were always undefined/null.
   → Added the 4 columns to the schema, added ALTER TABLE migrations for existing databases, and included them in the batch INSERT.
2. e-e-a-t.js: no fallback for DB-loaded crawls where url.author may be null but url.meta_tags.author is still populated (meta_tags IS saved as JSON).
   → Added url.meta_tags.author as a fallback check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a lightweight translation system to LibreCrawl:

- web/static/js/i18n.js: translation engine. Uses data-i18n / data-i18n-placeholder / data-i18n-title attributes. Language preference is persisted in localStorage. Default language: German (de).
- web/static/locales/de.json: ~200 German strings
- web/static/locales/en.json: ~200 English strings (fallback)
- login.html, register.html, index.html: All user-visible text marked with data-i18n attributes. DE/EN toggle button added to header (main app) and top-right corner (login/register pages). JS validation and button state messages also use i18n.t().

- Duplication threshold help text
- Issue exclusion description, Reset button, info box with bullet list
- Custom CSS description and CSS Tips info box

Uses data-i18n-html for blocks containing HTML markup.

henosch and others added 11 commits May 27, 2026 07:14

Run container with host user

e997b23

Persist data in Docker volume

a5ead72

Add account management script

0e02a3a

Promote verified users from guest tier

f702c7a

Expose guest login toggle in Compose

3d79619

fix(i18n): translate remaining settings help texts and info boxes

3bacdac

docs: add German README (README.de.md)

31c4023

henosch wants to merge 11 commits into PhialsBasement:main from henosch:main
