fix: author field not extracted, stored, or counted #72

Open
henosch wants to merge 11 commits into PhialsBasement:main from henosch:main

henosch commented May 27, 2026


Problem

The author field was never correctly populated, persisted, or counted in the E-E-A-T plugin — regardless of what HTML the page contained.

Three root causes were identified:


1. <link rel="author"> was never extracted (src/core/seo_extractor.py)

The extractor only handled <meta name="author" content="...">. Many sites use the alternative <link rel="author" href="..."> format (including multi-value rel like rel="nofollow author"), which was silently ignored.

Fix: Added a fallback lookup for <link rel='author'> after the meta-tag loop. <meta name="author"> still takes precedence.
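
The precedence rule can be sketched as follows. This is a minimal illustration, not the code from src/core/seo_extractor.py: the commit message says the extractor uses BeautifulSoup, whereas this sketch swaps in the standard library's html.parser to stay dependency-free. The names AuthorParser and extract_author are hypothetical, and returning the href value for the link case is an assumption.

```python
from html.parser import HTMLParser


class AuthorParser(HTMLParser):
    """Collects author hints from <meta name="author"> and <link rel="author">."""

    def __init__(self):
        super().__init__()
        self.meta_author = None  # from <meta name="author" content="...">
        self.link_author = None  # from <link rel="... author ..." href="...">

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (valueless attributes), so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "author":
            content = attr.get("content", "").strip()
            if self.meta_author is None and content:
                self.meta_author = content
        elif tag == "link" and self.link_author is None:
            # rel is a space-separated token list, so split it before matching;
            # this is what makes rel="nofollow author" work.
            rel_tokens = attr.get("rel", "").lower().split()
            href = attr.get("href", "").strip()
            if "author" in rel_tokens and href:
                self.link_author = href


def extract_author(html):
    """Return the meta author if present, else the link author, else None."""
    parser = AuthorParser()
    parser.feed(html)
    # <meta name="author"> takes precedence regardless of document order.
    return parser.meta_author or parser.link_author
```

Splitting rel into tokens rather than comparing the whole attribute string is the detail that matters here: an exact comparison against "author" would miss multi-value attributes.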


2. author was never saved to the database (src/crawl_db.py)

The crawled_urls table had no author column, and save_url_batch() did not include it in the INSERT. After any DB reload (saved crawl, resume, etc.) the field was always NULL.

The same was true for keywords, generator, and theme_color.

Fix:

  • Added the four columns to the CREATE TABLE schema
  • Added ALTER TABLE migrations so existing databases are updated automatically
  • Added all four fields to the save_url_batch() INSERT
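
One way the migration and insert steps could look with the standard sqlite3 module is sketched below. This is an illustration under assumptions, not the code from src/crawl_db.py: the function names migrate_author_columns and save_url_batch_sketch, the TEXT column type, and the url column are all hypothetical; only the table name crawled_urls and the four column names come from the description above.

```python
import sqlite3

# Column names taken from the PR description; everything else is assumed.
NEW_COLUMNS = ("author", "keywords", "generator", "theme_color")


def migrate_author_columns(conn):
    """Add any missing columns to an existing crawled_urls table."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(crawled_urls)")}
    for column in NEW_COLUMNS:
        if column not in existing:
            # Column names come from the fixed tuple above, never from user input.
            conn.execute(f"ALTER TABLE crawled_urls ADD COLUMN {column} TEXT")
    conn.commit()


def save_url_batch_sketch(conn, rows):
    """Insert rows, writing NULL for any of the four fields a row lacks."""
    conn.executemany(
        "INSERT INTO crawled_urls (url, author, keywords, generator, theme_color) "
        "VALUES (:url, :author, :keywords, :generator, :theme_color)",
        [{"url": r["url"], **{c: r.get(c) for c in NEW_COLUMNS}} for r in rows],
    )
    conn.commit()
```

Checking the existing columns first keeps the migration idempotent, so it can run on every startup without failing on databases that were already upgraded.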

3. E-E-A-T plugin checked the wrong field name (web/static/plugins/e-e-a-t.js)

The plugin checked url.meta_author — a field that has never existed in the data model. The actual field name is url.author. This caused pagesWithAuthor to always be 0.

A second issue affects DB-loaded crawls: rows saved before this change still have url.author set to null. For those rows, url.meta_tags.author is a valid fallback, because meta_tags is serialised as JSON and always persisted.

Fix: Changed the check to url.author || (url.meta_tags && url.meta_tags.author) || (url.og_tags && url.og_tags.author)


Summary of changes

| File | Change |
| --- | --- |
| src/core/seo_extractor.py | Extract author from <link rel="author"> as fallback |
| src/crawl_db.py | Add author, keywords, generator, theme_color columns + migrations + INSERT |
| web/static/plugins/e-e-a-t.js | Fix field name meta_author → author, add meta_tags.author fallback |

🤖 Generated with Claude Code

henosch and others added 11 commits May 27, 2026 07:14
…e="author">

Previously only <meta name="author" content="..."> was handled.
Many sites use <link rel="author" href="..."> which was silently ignored,
causing the author field to always be empty for those pages.

- Add fallback extraction from <link rel='author'> after meta tags loop
- <meta name="author"> still takes precedence if both are present
- Uses BeautifulSoup's list-aware rel matching, so multi-value
  rel attributes like rel="nofollow author" are handled correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The backend stores the extracted author in result['author'], not
result['meta_author']. The wrong field name caused pagesWithAuthor
to always be 0 in the E-E-A-T plugin, regardless of page content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…a-t detection

Three root causes for author always showing 0:

1. crawl_db.py: 'author', 'keywords', 'generator', 'theme_color' were missing
   from both the CREATE TABLE schema and the save_url_batch INSERT.
   After a DB reload these fields were always undefined/null.
   → Added the 4 columns to the schema, added ALTER TABLE migrations for
     existing databases, and included them in the batch INSERT.

2. e-e-a-t.js: no fallback for DB-loaded crawls where url.author may be null
   but url.meta_tags.author is still populated (meta_tags IS saved as JSON).
   → Added url.meta_tags.author as a fallback check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a lightweight translation system to LibreCrawl:

- web/static/js/i18n.js — translation engine
  Uses data-i18n / data-i18n-placeholder / data-i18n-title attributes.
  Language preference is persisted in localStorage.
  Default language: German (de).

- web/static/locales/de.json — ~200 German strings
- web/static/locales/en.json — ~200 English strings (fallback)

- login.html, register.html, index.html
  All user-visible text marked with data-i18n attributes.
  DE/EN toggle button added to header (main app) and
  top-right corner (login/register pages).
  JS validation and button state messages also use i18n.t().
- Duplication threshold help text
- Issue exclusion description, Reset button, info box with bullet list
- Custom CSS description and CSS Tips info box
Uses data-i18n-html for blocks containing HTML markup.