This package implements a complete Croissant dataset crawler system that:
- Discovers datasets from AI Institute portals
- Parses Croissant metadata automatically
- Integrates with your MCP server as a new tool
- Provides a beautiful web interface for browsing datasets
- croissant_crawler.py - Main crawler module with all crawling logic
- mcp_server_updates.py - Code updates needed for your MCP server
- web_interface_updates.py - Code updates needed for your web interface
- croissant_datasets.html - HTML template for dataset display
- IMPLEMENTATION_INSTRUCTIONS.md - Detailed implementation guide
- CROISSANT_CRAWLER_README.md - This summary file
- Copy files to your server
- Update your MCP server with the provided code
- Update your web interface with the provided code
- Restart services and visit /croissant_datasets
- AIFARMS Data Portal: data.aifarms.org
- CyVerse Sierra: sierra.cyverse.org/datasets
- AgAID GitHub: github.com/TrevorBuchanan/AgAIDResearch
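
To illustrate what discovery on a portal page can look like, here is a minimal sketch that pulls embedded JSON-LD blocks out of an HTML page and keeps those that describe a dataset. It assumes the portal embeds its Croissant metadata in `<script type="application/ld+json">` tags, which may not hold for every portal listed above. The names `JsonLdExtractor` and `find_dataset_metadata` are hypothetical and are not taken from croissant_crawler.py.

```python
import json
from html.parser import HTMLParser

# Assumption: dataset records are marked with one of these JSON-LD types.
DATASET_TYPES = ("sc:Dataset", "Dataset", "https://schema.org/Dataset")


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of aborting the crawl.


def find_dataset_metadata(html):
    """Return every embedded JSON-LD object whose @type marks it as a dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [
        doc for doc in parser.documents
        if isinstance(doc, dict) and doc.get("@type") in DATASET_TYPES
    ]
```

Fetching the page itself (for example with `urllib.request`) is left out so the sketch stays independent of any particular portal's layout.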
- Automatic dataset discovery from multiple portals
- Rich metadata parsing with fields and keywords
- Beautiful web interface for browsing datasets
- MCP server integration with new tool
- Confidence-scoring search compatibility
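
As a rough illustration of the metadata parsing listed above, the sketch below flattens a Croissant JSON-LD document into the name, description, keywords, and fields that a browsing page might show. It assumes the usual Croissant layout, where fields live under `recordSet[].field[]`. The function name `summarize_croissant` and the shape of the returned dictionary are hypothetical, not the crawler's actual output format.

```python
def summarize_croissant(metadata):
    """Flatten a Croissant JSON-LD dict into a small summary for display."""
    keywords = metadata.get("keywords", [])
    # Keywords may appear as one comma-separated string or as a list.
    if isinstance(keywords, str):
        keywords = [k.strip() for k in keywords.split(",") if k.strip()]

    fields = []
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            fields.append({
                "record_set": record_set.get("name"),
                "name": field.get("name"),
                "data_type": field.get("dataType"),
            })

    return {
        "name": metadata.get("name", ""),
        "description": metadata.get("description", ""),
        "keywords": keywords,
        "fields": fields,
    }


# Hypothetical input, made up for illustration only.
EXAMPLE = {
    "@type": "sc:Dataset",
    "name": "example-dataset",
    "description": "Illustrative record only.",
    "keywords": "agriculture, imagery",
    "recordSet": [
        {
            "name": "images",
            "field": [
                {"name": "path", "dataType": "sc:Text"},
                {"name": "label", "dataType": "sc:Text"},
            ],
        }
    ],
}

summary = summarize_croissant(EXAMPLE)
```

Missing keys fall back to empty values, so a partially filled record still yields a summary rather than raising an error.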
- Setup: 15-30 minutes
- Testing: 10-15 minutes
- Total: ~30-45 minutes
After implementation, you'll have:
- Automatic dataset discovery from AI Institute portals
- Rich metadata display showing all dataset information
- Beautiful web interface for browsing discovered datasets
- Integration with your existing confidence-scoring search system
Ready to implement? Follow the instructions in IMPLEMENTATION_INSTRUCTIONS.md! 🚀