A tool for processing and analyzing text content from various sources, including books and news articles.
- Text metadata extraction
- EPUB file conversion
- News scraping and summarization
- Front page summaries
- Daily news digests
- S3 integration for file storage
- `process_files`: Upload and process files from the input folder
- `convert_epub`: Convert EPUB files to processable format
- `scrape_news`: Collect news from configured sources
- `front_page_summary`: Generate summaries of scraped news
- `daily_news_summary`: Create consolidated daily news digests
- `good_morning`: Run a complete news collection and summary workflow
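As a rough sketch of how these commands might compose (the function names mirror the commands above, but the bodies here are stand-in stubs, not the project's real implementations):

```python
# Hypothetical sketch: function names mirror the commands above,
# but these bodies are stubs, not the project's real implementations.
def scrape_news():
    """Collect news from configured sources (stubbed with fixed data)."""
    return ["story one", "story two"]

def front_page_summary(articles):
    """Summarize the scraped news (stubbed: just counts stories)."""
    return f"front page: {len(articles)} stories"

def daily_news_summary(summaries):
    """Consolidate summaries into one daily digest (stubbed)."""
    return "\n".join(summaries)

def good_morning():
    """Complete workflow: scrape, summarize, build the daily digest."""
    articles = scrape_news()
    return daily_news_summary([front_page_summary(articles)])
```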
The project includes a script for automated news scraping and summarization:
`main.py`: Runs the news scraping and front page summary functions on a schedule
- Executes immediately when started
- Automatically runs every 12 hours
- Can be run in the background as a service
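The schedule described above (run once immediately, then every 12 hours) can be sketched with a plain loop; `run_on_schedule` and `max_runs` are illustrative names, not part of the project:

```python
import time

TWELVE_HOURS = 12 * 60 * 60  # interval in seconds

def run_on_schedule(job, interval=TWELVE_HOURS, max_runs=None):
    """Run `job` immediately, then again every `interval` seconds.

    `max_runs` caps the number of runs (None = run forever), which also
    makes the loop testable without waiting on real sleeps.
    """
    runs = 0
    while True:
        job()  # first run happens immediately, before any sleep
        runs += 1
        if max_runs is not None and runs >= max_runs:
            return runs
        time.sleep(interval)
```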
To start the automated news scraping:
# Run in the foreground
./main.py
# Run in the background
nohup ./main.py > news_scraping.log 2>&1 &
- /epub: EPUB processing utilities
- /news: News source configurations
- /scraping: Web scraping tools
- /text_vector_db: Vector database operations
- /book_processor_db: Database operations
- /ollama_apis: AI text processing
Sometimes embeddings glitch and the vector comes back as all 0s or all 1s. Deleting these rows afterward with a SQL script doesn't work well (pgvector makes that hard), so instead I just retry the embedding until the result isn't all 1s or all 0s.
There's a lot of anger about how psycopg connection pools don't handle reconnection logic: https://stackoverflow.com/questions/64603192/psycopg2-pool-crashes-when-the-thread-pool-runs-out
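One common workaround is to wrap the pool and rebuild it when a checked-out connection turns out to be stale. This is a generic sketch, not psycopg-specific: `make_pool` is a factory you'd supply (e.g. a lambda building `psycopg2.pool.SimpleConnectionPool`), and `ping` stands in for a liveness probe (with psycopg you would execute `SELECT 1` on a cursor instead):

```python
class ReconnectingPool:
    """Generic sketch: wrap a connection pool and rebuild it when a
    checked-out connection fails a liveness probe."""

    def __init__(self, make_pool):
        self._make_pool = make_pool
        self._pool = make_pool()

    def getconn(self):
        """Check out a connection; rebuild the pool if it is stale."""
        conn = self._pool.getconn()
        try:
            conn.ping()  # liveness probe (assumed; psycopg would run SELECT 1)
            return conn
        except Exception:
            self._pool.closeall()           # drop every stale connection
            self._pool = self._make_pool()  # reconnect from scratch
            return self._pool.getconn()
```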
- Clone the repository
- Create and activate a virtual environment
- Install dependencies:
pip install -r requirements.txt
- Copy `.env.sample` to `.env` and configure your environment variables