A modern, feature-rich CLI tool for managing a curated collection of research papers with beautiful terminal output, interactive visualizations, and comprehensive testing.
MyBible is a comprehensive bibliography management system designed to help researchers organize, track, and analyze their research paper collections. This repository contains a curated list of important research papers (primarily from AI research) with tools to:
- ✨ Add papers from arXiv or manually with beautiful CLI prompts
- 🔗 Add repositories from GitHub and Hugging Face
- 📊 Generate markdown tables for easy navigation and sharing
- 📖 Export to BibTeX for use in LaTeX documents
- 🕸️ Visualize citation networks with interactive HTML graphs
- ✅ Track duplicates with intelligent detection
- 🧪 Comprehensive test coverage for reliability
The papers are organized into categories based on their topics, with each entry including title, authors, journal, publication year, and DOI links for easy access.
# Clone and set up the environment
git clone git@github.com:art-test-stack/MyBible.git
cd MyBible
uv sync
mybib add-arxiv <arxiv_url> --category <category_name>
Example:
mybib add-arxiv https://arxiv.org/abs/2401.00001 --category "Machine Learning"
mybib add --title "Attention is all you need" --category "Machine Learning"
mybib add-repo <repo_url> --category <category_name>
Examples:
mybib add-repo https://github.com/openai/whisper --category "Speech"
mybib add-repo https://huggingface.co/google/gemma-2-2b --category "LLM"
mybib add --title "<title>" --authors "<author1>, <author2>, ..." \
  --journal "<journal>" --year <year> --doi "<doi>" --category <category>
mybib markdown --file references.csv --output references.md
mybib bibtex --file references.csv --output references.bib
mybib sync-bibtex --file references.csv
This stores BibTeX entries in files under bibtex_entries/ (configurable), and keeps only paths in references.csv.
mybib graph --file references.csv --output citation_graph.html
All commands feature:
- Colored output: Success (✓), error (✗), warning (⚠), and info (ℹ) messages
- Progress indicators: Smooth animations when fetching from APIs
- Confirmation prompts: Safe defaults for destructive actions
- Beautiful tables: Rich formatting for better readability
Example of adding a paper:
Title: Attention is all you need
Authors: Vaswani et al.
Journal: NeurIPS
Year: 2017
DOI: 10.48550/arXiv.1706.03762
Add 'Attention is all you need' to category 'Machine Learning'? [y/N]: y
✓ Reference added successfully to category 'Machine Learning'
Automatically generate organized markdown tables from your bibliography:
- Tables organized by category
- Columns: Title, Authors, Journal, Year, DOI
- Clickable DOI links
- Proper author name formatting (et al. for 3+ authors)
- Sorted by category and publication year
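
The table layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in markdown.py: the function name `render_tables`, the `###` heading level, and the dictionary keys are assumptions made for the example.

```python
from collections import defaultdict

COLUMNS = ["Title", "Authors", "Journal", "Year", "DOI"]


def render_tables(references):
    """Render one markdown table per category, sorted by category, then year.

    `references` is a list of dicts with the keys in COLUMNS plus "Category"
    (a hypothetical in-memory shape chosen for this sketch).
    """
    by_category = defaultdict(list)
    for ref in references:
        by_category[ref["Category"]].append(ref)

    lines = []
    for category in sorted(by_category):
        lines.append(f"### {category}")
        lines.append("| " + " | ".join(COLUMNS) + " |")
        lines.append("|" + "---|" * len(COLUMNS))
        for ref in sorted(by_category[category], key=lambda r: int(r["Year"])):
            doi = ref["DOI"]
            # Clickable link through the doi.org resolver; empty cell if no DOI
            link = f"[{doi}](https://doi.org/{doi})" if doi else ""
            cells = [ref["Title"], ref["Authors"], ref["Journal"], str(ref["Year"]), link]
            lines.append("| " + " | ".join(cells) + " |")
        lines.append("")  # blank line ends the table
    return "\n".join(lines)
```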
Generate standard BibTeX files for LaTeX documents with properly formatted entries including:
- Author names
- Publication title
- Journal/Venue
- Publication year
- DOI formatting
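
As a rough sketch of what such an entry looks like, the function below formats one reference as a BibTeX `@article` record. It is not the code in bibtex.py: the name `to_bibtex`, the citation key argument, and the assumption that authors are stored as comma-separated "First Last" names are all illustrative. The one firm rule it relies on is that BibTeX separates authors with `and`.

```python
def to_bibtex(ref, key):
    """Format one reference dict as a BibTeX @article entry.

    Assumes ref["Authors"] holds comma-separated "First Last" names
    (an assumption of this sketch); BibTeX needs them joined with " and ".
    """
    authors = " and ".join(
        name.strip() for name in ref.get("Authors", "").split(",") if name.strip()
    )
    fields = {
        "author": authors,
        "title": ref.get("Title", ""),
        "journal": ref.get("Journal", ""),
        "year": str(ref.get("Year", "")),
        "doi": ref.get("DOI", ""),
    }
    # Skip empty fields so the entry stays clean
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items() if value
    )
    return f"@article{{{key},\n{body}\n}}"
```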
When available, MyBible now fetches and stores source BibTeX entries in files:
- add-arxiv: fetches citation BibTeX from arXiv
- add-repo: fetches citation BibTeX from repository README files (GitHub/Hugging Face)
- sync-bibtex: backfills missing BibTeX for references already stored in CSV
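
The idea of keeping each BibTeX entry in its own file and recording only the path can be sketched as follows. The helper name `store_bibtex` and the title-based file naming are hypothetical; the actual naming scheme used by MyBible may differ.

```python
import re
from pathlib import Path


def store_bibtex(entry, title, bibtex_dir="bibtex_entries"):
    """Write a BibTeX entry to its own file and return the path to record in the CSV."""
    # Hypothetical naming scheme: a lowercase slug derived from the title
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "untitled"
    directory = Path(bibtex_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{slug}.bib"
    path.write_text(entry, encoding="utf-8")
    return str(path)
```

The returned string is what would go in the BibTeXPath column, so the CSV stays small even when entries are long.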
Build interactive citation graphs showing how papers in your library cite each other:
- Network building: Queries Crossref API for citation relationships
- Interactive visualization: Zoom, pan, and drag nodes
- Physics simulation: Automatic layout using Barnes-Hut algorithm
- Metadata on hover: View paper details without clicking
Features:
- Directed graph representation (A → B means A cites B)
- Only includes edges between papers in your library
- Color-coded visualization
- Handles network errors gracefully with retry logic
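
The "only edges between papers in your library" rule can be illustrated with plain dictionaries. This sketch deliberately avoids networkx and any real API call so that it stays self-contained; `build_citation_graph` and the shape of `cited_by_paper` are assumptions for the example, not the interface of graph.py.

```python
def build_citation_graph(library, cited_by_paper):
    """Return {doi: set of cited DOIs}, restricted to papers in the library.

    library:        iterable of DOIs present in the bibliography.
    cited_by_paper: maps a DOI to the DOIs it cites (for example, parsed
                    from an API response; hypothetical input shape).
    An edge A -> B means A cites B.
    """
    normalized = {doi.strip().lower() for doi in library}
    graph = {doi: set() for doi in normalized}
    for source, targets in cited_by_paper.items():
        source = source.strip().lower()
        if source not in normalized:
            continue  # the citing paper is not in the library
        for target in targets:
            target = target.strip().lower()
            # Keep an edge only when both endpoints are in the library
            if target in normalized and target != source:
                graph[source].add(target)
    return graph
```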
Usage:
# Generate citation graph with verbose output
mybib graph --file references.csv --output my_citations.html --verbose
Built-in duplicate detection when adding new papers:
- DOI-based matching
- Case-insensitive comparison
- Whitespace normalization
- Prevents accidental duplicates in your bibliography
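
A minimal sketch of DOI-based matching with the normalization listed above. The function names are illustrative, and removing all whitespace (rather than only trimming) is an assumption of this example.

```python
import re


def normalize_doi(doi):
    """Lowercase a DOI and remove whitespace so equivalent DOIs compare equal."""
    return re.sub(r"\s+", "", doi).lower()


def is_duplicate(new_doi, existing_dois):
    """Return True if new_doi matches any existing DOI after normalization."""
    key = normalize_doi(new_doi)
    # An empty DOI carries no identity, so it never counts as a duplicate
    return bool(key) and key in {normalize_doi(d) for d in existing_dois}
```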
Authors Formatting
- Proper "FirstAuthor et al." format instead of just "al."
- Team name detection and display (K2 Team, DeepSeek-Ai, Mistral, etc.)
- Intelligently handles both individual and organizational authors
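
One possible way to implement this formatting is sketched below. The `TEAM_MARKERS` heuristic and the "surname is the last word" rule are assumptions made for the example; the detection logic in utils.py may work differently.

```python
# Hypothetical hints that an "author" is an organization rather than a person
TEAM_MARKERS = ("team", "-ai", "mistral")


def format_authors(authors):
    """Shorten a comma-separated author string for display in a table."""
    names = [a.strip() for a in authors.split(",") if a.strip()]
    if not names:
        return ""
    first = names[0]
    # Organizational authors are shown as-is
    if any(marker in first.lower() for marker in TEAM_MARKERS):
        return first
    # Assumes "First Last" ordering, so the surname is the last word
    surnames = [name.split()[-1] for name in names]
    if len(surnames) == 1:
        return surnames[0]
    if len(surnames) == 2:
        return f"{surnames[0]} and {surnames[1]}"
    return f"{surnames[0]} et al."
```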
ArxivID Precision
- Fixed float rounding errors (2405.10938 now displays correctly, not 2405.11)
- ArxivID stored as string to preserve full precision
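
The snippet below shows why an arXiv identifier is safer as a string. It is only a demonstration of the general problem, using the standard csv module and a made-up two-row file; it does not reproduce how MyBible originally produced the rounded value.

```python
import csv
import io

# A tiny in-memory CSV standing in for references.csv (made-up rows)
data = io.StringIO("Title,ArxivID\nPaper A,2405.10938\nPaper B,2401.00010\n")
rows = list(csv.DictReader(data))

for row in rows:
    as_string = row["ArxivID"]   # csv yields strings, so nothing is lost
    as_float = float(as_string)  # what a numeric column would hold instead
    # Columns: original string | two-decimal formatting | float converted back
    print(as_string, "|", f"{as_float:.2f}", "|", str(as_float))
```

Formatting the float with two decimals turns 2405.10938 into 2405.11, and a float round trip drops the trailing zero of 2401.00010, so either path yields an identifier that no longer resolves.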
Scholar Metadata Extraction
- Improved year extraction for Google Scholar articles (full 4-digit years)
- Better DOI extraction with intelligent fallback to Scholar IDs
- Enhanced regex patterns for robust metadata parsing
- ID-based categories: Each category assigned a unique ID with persistent mappings
- Case-insensitive normalization: "LLM Basics" and "llm basics" treated as same category
- Interactive selection: Choose categories by ID or create new ones on-the-fly
- Category persistence: All mappings stored in categories.json
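
The category behavior described above might be implemented along these lines. The class name `CategoryRegistry` and the JSON layout (a flat ID-to-name object) are assumptions for this sketch; the real categories.json format may differ.

```python
import json
from pathlib import Path


class CategoryRegistry:
    """ID-to-name category mappings persisted in a JSON file."""

    def __init__(self, path="categories.json"):
        self.path = Path(path)
        self.names = {}
        if self.path.exists():
            raw = json.loads(self.path.read_text(encoding="utf-8"))
            # JSON object keys are strings, so convert them back to int IDs
            self.names = {int(cat_id): name for cat_id, name in raw.items()}

    @staticmethod
    def _key(name):
        # Collapse whitespace and ignore case when comparing names
        return " ".join(name.split()).casefold()

    def get_or_create(self, name):
        """Return the ID of an existing category, or assign and save a new one."""
        key = self._key(name)
        for cat_id, existing in self.names.items():
            if self._key(existing) == key:
                return cat_id
        cat_id = max(self.names, default=0) + 1
        self.names[cat_id] = " ".join(name.split())
        self.path.write_text(json.dumps(self.names, indent=2), encoding="utf-8")
        return cat_id
```

With this design, "LLM Basics" and "llm basics" resolve to the same ID, and the first spelling entered is the one that is displayed.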
# Interactive category selection during add
mybib add-arxiv https://arxiv.org/abs/2301.00001
# Shows: Available categories: 1: Alignment, 2: Deep Learning, 3: Machine Learning
Scalable SQL database support for advanced features:
New Commands:
# Initialize database
mybib db-init --db-url sqlite:///bibliography.db
# Migrate existing CSV to database
mybib db-migrate --file references.csv --db-url sqlite:///bibliography.db
# Export database back to CSV
mybib db-export --output backup.csv --db-url sqlite:///bibliography.db
Features:
- SQLite default, supports any SQLAlchemy-compatible database (PostgreSQL, MySQL, etc.)
- Full referential integrity with foreign keys
- Indexed queries for common search patterns
- Non-destructive migration (export back to CSV anytime)
- Duplicate detection based on DOI
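
To make the foreign-key and DOI-uniqueness points concrete, here is a small sketch that uses the standard-library sqlite3 module instead of SQLAlchemy, so it runs without extra dependencies. The table and column names are hypothetical and are not taken from models.py.

```python
import sqlite3

# Hypothetical schema: names are chosen for this sketch only
SCHEMA = """
CREATE TABLE categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE refs (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    year        INTEGER,
    doi         TEXT UNIQUE,                       -- duplicate detection by DOI
    category_id INTEGER NOT NULL REFERENCES categories(id)
);
CREATE INDEX idx_refs_year ON refs(year);          -- index for a common query
"""


def open_db(path=":memory:"):
    """Open a SQLite database and create the sketch schema."""
    conn = sqlite3.connect(path)
    # SQLite enforces foreign keys only when this pragma is enabled
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Inserting a second row with the same DOI, or a row pointing at a category that does not exist, raises `sqlite3.IntegrityError`.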
Benefits:
- Foundation for advanced search and filtering
- Ready for future enhancements (tags, annotations, full-text search)
- Better performance with large reference collections
- API layer ready for remote access
MyBible/
├── pkg/mybib/ # Main package
│ ├── __init__.py
│ ├── cli.py # CLI command handlers
│ ├── storage.py # CSV storage operations
│ ├── arxiv.py # arXiv API integration
│ ├── scholar.py # Google Scholar integration
│ ├── citation.py # Citation BibTeX fetching and parsing
│ ├── metadata.py # Metadata management
│ ├── markdown.py # Markdown generation
│ ├── bibtex.py # BibTeX export
│ ├── graph.py # Citation graph features
│ ├── ui.py # Terminal UI utilities
│ ├── utils.py # Utility functions
│ ├── categories.py # Category management system
│ ├── models.py # SQLAlchemy ORM models
│ └── db_storage.py # Database storage adapter
├── tests/ # Test suite
│ ├── test_storage.py
│ ├── test_arxiv.py
│ ├── test_markdown.py
│ ├── test_metadata.py
│ ├── test_scholar.py
│ └── __init__.py
├── references.csv # Bibliography database (CSV)
├── categories.json # Category ID mappings
├── pyproject.toml # Project configuration
├── pytest.ini # Pytest configuration
├── IMPROVEMENTS_SUMMARY.md # Detailed changelog for v2.0
└── README.md # This file
- cli.py: Command-line interface with rich formatting and category prompts
- storage.py: CSV file handling with ArxivID support and duplicate detection
- arxiv.py: arXiv metadata fetching with error handling
- scholar.py: Google Scholar integration with improved metadata extraction
- citation.py: BibTeX extraction from arXiv and repository README files
- metadata.py: Reference metadata management
- markdown.py: Markdown table generation with category support and author formatting
- bibtex.py: BibTeX export functionality
- graph.py: Citation network building and visualization
- ui.py: Terminal UI components (colors, progress, confirmations)
- categories.py: Category management with ID-based persistence
- models.py: SQLAlchemy ORM models for database support
- db_storage.py: Database storage adapter with migration capabilities
- utils.py: Utility functions including enhanced author name formatting
Core dependencies (installed via uv sync):
- pandas: CSV data handling
- requests: HTTP requests for APIs
- rich: Beautiful terminal output
- networkx: Graph algorithms and data structures
- pyvis: Interactive network visualization
- sqlalchemy: ORM framework for database abstraction
Development dependencies:
- pytest: Testing framework
- pytest-cov: Code coverage reporting
> [!NOTE]
> See tests/README.md for details on the test suite and the modules it covers.
# View help
mybib --help
# Add reference from arXiv
mybib add-arxiv https://arxiv.org/abs/2301.00001 [--category <name>]
# Add reference from GitHub or Hugging Face repository
mybib add-repo <repo_url> [--category <name>]
# Add reference from Google Scholar (with interactive search)
mybib add-scholar --title "<article name>" [--category <name>]
# Add reference manually
mybib add --title "<title>" [--authors] [--journal] [--year] [--doi] [--category]
# Fetch and store missing BibTeX for existing references
mybib sync-bibtex [--file references.csv] [--force] [--bibtex-dir bibtex_entries]
# View help for specific commands
mybib add-arxiv --help
mybib add-repo --help
mybib add-scholar --help
mybib add --help
mybib sync-bibtex --help
# Generate markdown tables
mybib markdown --file references.csv --output references.md [--by-category]
# Generate BibTeX file
mybib bibtex --file references.csv --output references.bib
# Build citation network graph
mybib graph --file references.csv --output citation_graph.html [--verbose]
# Initialize database
mybib db-init [--db-url sqlite:///bibliography.db]
# Migrate CSV to database
mybib db-migrate --file references.csv [--db-url sqlite:///bibliography.db]
# Export database back to CSV
mybib db-export --output backup.csv [--db-url sqlite:///bibliography.db]
References are stored in references.csv with the following columns:
- Title: Paper title
- Authors: Author names (comma-separated)
- Journal: Publication venue
- Year: Publication year
- DOI: Digital Object Identifier
- Category: Research topic category
- Link: URL (optional)
- ArxivID: arXiv identifier (optional)
- BibTeX: Legacy inline BibTeX field (kept for backward compatibility)
- BibTeXPath: Path to BibTeX file stored on disk
Categories are managed in categories.json with ID-to-name mappings for case-insensitive organization.
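
Reading a file with these columns could look roughly like the sketch below, which uses the standard csv module. The helper name `load_references` is illustrative, and the assumption that the CSV header spells the column names exactly as listed above has not been checked against storage.py.

```python
import csv
import io

COLUMNS = [
    "Title", "Authors", "Journal", "Year", "DOI",
    "Category", "Link", "ArxivID", "BibTeX", "BibTeXPath",
]


def load_references(handle):
    """Read references from an open CSV file, keeping every value as a string.

    Columns missing from older files are filled with empty strings.
    """
    reader = csv.DictReader(handle)
    return [{col: (row.get(col) or "") for col in COLUMNS} for row in reader]


# Usage with an in-memory file that lacks the optional columns
sample = io.StringIO(
    "Title,Authors,Journal,Year,DOI,Category\n"
    "Mistral 7B,Jiang et al.,arXiv,2023,2310.06825,LLM\n"
)
references = load_references(sample)
print(references[0]["Title"], references[0]["BibTeXPath"] == "")
```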
Major improvements to data quality and scalability:
✨ Improvements:
- Auto format authors as "FirstAuthor et al." with team name detection
- Fixed ArxivID display precision (no more float rounding errors)
- Enhanced Scholar metadata extraction (full year extraction, better DOI finding)
- New category management system with persistent ID mappings
- Foundation for database support with SQLAlchemy ORM
New Features:
- Database initialization and migration commands
- CSV ↔ Database conversion tools
- Interactive category selection by ID during reference addition
See IMPROVEMENTS_SUMMARY.md for detailed technical documentation.
Initial release with CSV-based storage, arXiv/Scholar/manual entry, markdown/BibTeX export, and citation graph visualization.
Potential features enabled by v2.0 database foundation:
- Advanced search and filtering
- Paper summaries and reading notes
- Reading progress tracking
- Topic clustering visualization
- Export to other formats (RIS, Zotero)
- Full-text search capabilities
- Tag and annotation system
- API layer for remote access
Contributions are welcome! Feel free to:
- Improve the CLI interface
- Enhance visualization features
- Expand test coverage
- Report bugs or suggest improvements
- Inspired by my need for better bibliography management tools. After struggling with manual CSV files and clunky reference managers, I wanted a modern, customizable solution that fits my workflow. MyBible is the result of that vision. Alternatively, Paperlib seems to be a better tool for general use cases.
- I started this project with "traditional" coding practices, but from commit d8f992f onward I switched to "vibe coding" practices with Claude Haiku 4.5. As a result, I did not write most of the features myself.
- The project is still in its early stages, so there are many rough edges and missing features. It is mainly for my personal use, so it works best for computer science research. I am open to contributions and suggestions to make it better!
| Title | Author(s) | Journal | Year | DOI |
|---|---|---|---|---|
| Attention is all you need | Vaswani et al. | arXiv | 2017 | 1706.03762 |
| Shampoo: Preconditioned Stochastic Tensor Optimization | Gupta et al. | arXiv | 2018 | 1802.09568 |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Devlin et al. | arXiv | 2018 | 1810.04805 |
| Language models are unsupervised multitask learners | Radford et al. | OpenAI | 2019 | unsupervised-multitask |
| Language Models are Few-Shot Learners | Brown et al. | arXiv | 2020 | 2005.14165 |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | Katharopoulos et al. | arXiv | 2020 | 2006.16236 |
| Efficient Transformers: A Survey | Tay et al. | arXiv | 2020 | 2009.06732 |
| Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | Fedus et al. | JMLR | 2022 | 2101.03961 |
| RoFormer: Enhanced Transformer with Rotary Position Embedding | Su et al. | arXiv | 2021 | 2104.09864 |
| LoRA: Low-Rank Adaptation of Large Language Models | Hu et al. | ICLR | 2021 | 2106.09685 |
| Training Compute-Optimal Large Language Models | Hoffmann et al. | arXiv | 2022 | 2203.15556 |
| PaLM: Scaling Language Modeling with Pathways | Chowdhery et al. | arXiv | 2022 | 2204.02311 |
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Dao et al. | NeurIPS | 2022 | 2205.14135 |
| QLoRA: Efficient Finetuning of Quantized LLMs | Dettmers et al. | arXiv | 2023 | 2305.14314 |
| FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Dao | arXiv | 2023 | 2307.08691 |
| YaRN: Efficient Context Window Extension of Large Language Models | Peng et al. | arXiv | 2023 | 2309.00071 |
| Effective Long-Context Scaling of Foundation Models | Xiong et al. | arXiv | 2023 | 2309.16039 |
| Mistral 7B | Jiang et al. | arXiv | 2023 | 2310.06825 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Gu and Dao | NeurIPS | 2023 | 2312.00752 |
| How to Train Long-Context Language Models (Effectively) | Gao et al. | arXiv | 2024 | 2410.02660 |
| The Zamba2 Suite: Technical Report | Glorioso et al. | arXiv | 2024 | 2411.15242 |
| Muon is Scalable for LLM Training | Liu et al. | arXiv | 2025 | 2502.16982 |
| KIMI K2: OPEN AGENTIC INTELLIGENCE | Kimi Team | arXiv | 2025 | 2507.20534 |
| Title | Author(s) | Journal | Year | DOI |
|---|---|---|---|---|
| SQuAD: 100,000+ Questions for Machine Comprehension of Text | Rajpurkar et al. | arXiv | 2016 | 1606.05250 |
| StarCoder 2 and The Stack v2: The Next Generation | Lozhkov et al. | arXiv | 2024 | 2402.19173 |
| Title | Author(s) | Journal | Year | DOI |
|---|---|---|---|---|
| Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting | Lim et al. | arXiv | 2020 | 1912.09363 |
| N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting | Challu et al. | arXiv | 2022 | 2201.12886 |
| Title | Author(s) | Journal | Year | DOI |
|---|---|---|---|---|
| Denoising Diffusion Probabilistic Models | Ho et al. | NeurIPS | 2020 | 2006.11239 |
| Generative Diffusion Models on Graphs: Methods and Applications | Liu et al. | arXiv | 2023 | 2302.02591 |
| dLLM: Simple Diffusion Language Modeling | Zhou et al. | arXiv | 2026 | 2602.22661 |
| Title | Author(s) | Journal | Year | DOI |
|---|---|---|---|---|
| A tutorial on Energy-Based Learning | LeCun et al. | MIT Press | 2006 | eb-learning |
| Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One | Grathwohl et al. | arXiv | 2019 | 1912.03263 |
| How to Train Your Energy-Based Models | Song et al. | arXiv | 2021 | 2101.03288 |
| HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly | Yen et al. | arXiv | 2024 | 2410.02694 |
| Energy-Based Transformers are Scalable Learners and Thinkers | Gladstone et al. | arXiv | 2025 | 2507.02092 |
| Title | Author(s) | Journal | Year | DOI |
|---|---|---|---|---|
| The Tsetlin Machine -- A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic | Ole-Christoffer Granmo | arXiv | 2018 | 1804.01508 |
| Label-Critic Tsetlin Machine: A Novel Self-supervised Learning Scheme for Interpretable Clustering | Abouzeid et al. | IEEE | 2022 | 10.1109/ISTM54910.2022.00016 |