fix: author field not extracted, stored, or counted#72
Open
henosch wants to merge 11 commits into
…e="author">

Previously only <meta name="author" content="..."> was handled. Many sites use <link rel="author" href="...">, which was silently ignored, causing the author field to always be empty for those pages.

- Add fallback extraction from <link rel='author'> after meta tags loop
- <meta name="author"> still takes precedence if both are present
- Uses BeautifulSoup's list-aware rel matching, so multi-value rel attributes like rel="nofollow author" are handled correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
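The precedence rule described in this commit can be sketched as follows. This is a minimal illustration, not the code from `src/core/seo_extractor.py`: it swaps BeautifulSoup for the standard-library `html.parser` so that it runs without dependencies, and the names `AuthorExtractor` and `extract_author` are hypothetical. It also assumes the `href` of the `<link>` tag is the value stored as the author.

```python
from html.parser import HTMLParser


class AuthorExtractor(HTMLParser):
    """Hypothetical helper: collects author hints from <meta> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.meta_author = None  # from <meta name="author" content="...">
        self.link_author = None  # from <link rel="author" href="...">

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; normalise them to "".
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "author":
            if self.meta_author is None and attr.get("content"):
                self.meta_author = attr["content"].strip()
        elif tag == "link":
            # rel is a space-separated token list, so split before matching.
            rel_tokens = attr.get("rel", "").lower().split()
            if "author" in rel_tokens and self.link_author is None and attr.get("href"):
                # Assumption: the href is what gets stored as the author value.
                self.link_author = attr["href"].strip()


def extract_author(html):
    """Return the meta author if present, otherwise the link author, else None."""
    parser = AuthorExtractor()
    parser.feed(html)
    return parser.meta_author or parser.link_author
```

Splitting `rel` into tokens is what makes `rel="nofollow author"` match. A plain string comparison against `"author"` would miss it.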
The backend stores the extracted author in result['author'], not result['meta_author']. The wrong field name caused pagesWithAuthor to always be 0 in the E-E-A-T plugin, regardless of page content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
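To see why the count was stuck at 0, here is a small sketch of the same lookup against backend-style result dictionaries. The helper `count_pages_with_author` and the sample data are made up for illustration. The real counting happens in the JavaScript plugin, so this only mirrors the logic in Python.

```python
def count_pages_with_author(results, key="author"):
    """Count results whose value under `key` is present and non-empty."""
    return sum(1 for result in results if result.get(key))


# Hypothetical crawl results shaped like the backend's result dicts.
results = [
    {"url": "https://example.com/", "author": "Jane Doe"},
    {"url": "https://example.com/about", "author": ""},
    {"url": "https://example.com/blog", "author": "John Roe"},
]

wrong_count = count_pages_with_author(results, key="meta_author")  # key never set
right_count = count_pages_with_author(results, key="author")
```

Because no result ever carries a `meta_author` key, the first lookup finds nothing for every page, whatever the pages contain.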
…a-t detection
Three root causes for author always showing 0:
1. crawl_db.py: 'author', 'keywords', 'generator', 'theme_color' were missing
from both the CREATE TABLE schema and the save_url_batch INSERT.
After a DB reload these fields were always undefined/null.
→ Added the 4 columns to the schema, added ALTER TABLE migrations for
existing databases, and included them in the batch INSERT.
2. e-e-a-t.js: no fallback for DB-loaded crawls where url.author may be null
but url.meta_tags.author is still populated (meta_tags IS saved as JSON).
→ Added url.meta_tags.author as a fallback check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
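The two fixes in this commit can be sketched together using the standard-library `sqlite3` module. This is an illustration under assumptions, not the code from `crawl_db.py`: the `TEXT` column type, the reduced two-column starting schema, and the helper names `migrate` and `effective_author` are all hypothetical. Only the table name `crawled_urls` and the four column names come from the description.

```python
import json
import sqlite3

NEW_COLUMNS = ("author", "keywords", "generator", "theme_color")


def migrate(conn):
    """Add any missing columns; safe to run repeatedly on the same database."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(crawled_urls)")}
    for column in NEW_COLUMNS:
        if column not in existing:
            # Column names come from the fixed tuple above, never from user input.
            conn.execute(f"ALTER TABLE crawled_urls ADD COLUMN {column} TEXT")
    conn.commit()


def effective_author(row):
    """Prefer the author column, then fall back to the meta_tags JSON."""
    meta_tags = json.loads(row["meta_tags"] or "{}")
    return row["author"] or meta_tags.get("author")


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row

# Simulate a database created before the fix: no author column yet.
conn.execute("CREATE TABLE crawled_urls (url TEXT PRIMARY KEY, meta_tags TEXT)")
conn.execute(
    "INSERT INTO crawled_urls (url, meta_tags) VALUES (?, ?)",
    ("https://example.com/old", json.dumps({"author": "Jane Doe"})),
)

migrate(conn)
migrate(conn)  # second run is a no-op because the columns now exist

# A row saved after the fix includes the new fields in the INSERT.
conn.execute(
    "INSERT INTO crawled_urls (url, meta_tags, author, keywords, generator, theme_color) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("https://example.com/new", "{}", "John Roe", "seo", "Hugo", "#ffffff"),
)

rows = conn.execute("SELECT * FROM crawled_urls ORDER BY url").fetchall()
authors = [effective_author(row) for row in rows]
columns = {row[1] for row in conn.execute("PRAGMA table_info(crawled_urls)")}
```

The pre-existing row keeps a `NULL` author column after migration, which is why the plugin still needs the `meta_tags` fallback for older crawls.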
Adds a lightweight translation system to LibreCrawl:

- web/static/js/i18n.js: translation engine
  Uses data-i18n / data-i18n-placeholder / data-i18n-title attributes.
  Language preference is persisted in localStorage.
  Default language: German (de).
- web/static/locales/de.json: ~200 German strings
- web/static/locales/en.json: ~200 English strings (fallback)
- login.html, register.html, index.html
  All user-visible text marked with data-i18n attributes.
  DE/EN toggle button added to header (main app) and top-right corner (login/register pages).
  JS validation and button state messages also use i18n.t().
- Duplication threshold help text
- Issue exclusion description, Reset button, info box with bullet list
- Custom CSS description and CSS Tips info box

Uses data-i18n-html for blocks containing HTML markup.
## Problem

The `author` field was never correctly populated, persisted, or counted in the E-E-A-T plugin, regardless of what HTML the page contained.

Three root causes were identified:

### 1. `<link rel="author">` was never extracted (`src/core/seo_extractor.py`)

The extractor only handled `<meta name="author" content="...">`. Many sites use the alternative `<link rel="author" href="...">` format (including multi-value rel like `rel="nofollow author"`), which was silently ignored.

**Fix:** Added a fallback lookup for `<link rel='author'>` after the meta-tag loop. `<meta name="author">` still takes precedence.

### 2. `author` was never saved to the database (`src/crawl_db.py`)

The `crawled_urls` table had no `author` column, and `save_url_batch()` did not include it in the INSERT. After any DB reload (saved crawl, resume, etc.) the field was always `NULL`.

The same was true for `keywords`, `generator`, and `theme_color`.

**Fix:**

- Added the four columns to the `CREATE TABLE` schema
- Added `ALTER TABLE` migrations so existing databases are updated automatically
- Included the columns in the `save_url_batch()` INSERT

### 3. E-E-A-T plugin checked the wrong field name (`web/static/plugins/e-e-a-t.js`)

The plugin checked `url.meta_author`, a field that has never existed in the data model. The actual field name is `url.author`. This caused `pagesWithAuthor` to always be `0`.

A second issue: for DB-loaded crawls where `url.author` may still be `null` (existing data), `url.meta_tags.author` is a valid fallback since `meta_tags` is serialised as JSON and always persisted.

**Fix:** Changed the check to `url.author || (url.meta_tags && url.meta_tags.author) || (url.og_tags && url.og_tags.author)`

## Summary of changes

| File | Change |
| --- | --- |
| `src/core/seo_extractor.py` | Extract `<link rel="author">` as fallback |
| `src/crawl_db.py` | `author`, `keywords`, `generator`, `theme_color` columns + migrations + INSERT |
| `web/static/plugins/e-e-a-t.js` | `meta_author` → `author`, add `meta_tags.author` fallback |

🤖 Generated with Claude Code