feat(crawler): add support for Crawl4AI as alternative to Firecrawl #100

Status: Open. Wants to merge 2 commits into base: main.

brettdavies

This commit introduces a dynamic crawler selection system, allowing the deep research agent to use either Firecrawl or Crawl4AI as its web crawling backend. The changes maintain the existing functionality while adding flexibility in crawler choice.

Key Changes:

  • Remove static Firecrawl import and instance creation
  • Implement dynamic crawler selection based on CRAWLER environment variable
  • Add HTTP API integration for Crawl4AI with proper authentication and polling
  • Update environment variable handling for both crawlers
  • Standardize response format between both crawlers
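A minimal sketch of how selection on the `CRAWLER` variable could work, not the code in this PR. The function name `resolveCrawlerConfig`, the `CrawlerConfig` shape, the fallback to Firecrawl when `CRAWLER` is unset, and the `http://` scheme on the default endpoint are all assumptions made for illustration; only the variable names and the `localhost:11235` default come from the description.

```typescript
type CrawlerKind = "FIRECRAWL" | "CRAWL4AI";

// Hypothetical config shape; not taken from the PR's source.
interface CrawlerConfig {
  kind: CrawlerKind;
  baseUrl?: string;
  apiToken?: string;
}

// Reads the crawler choice from an env-like object so it can be tested
// without touching process.env.
function resolveCrawlerConfig(
  env: Record<string, string | undefined>,
): CrawlerConfig {
  // Assumption: default to Firecrawl when CRAWLER is not set.
  const raw = (env.CRAWLER ?? "FIRECRAWL").trim().toUpperCase();

  if (raw === "CRAWL4AI") {
    return {
      kind: "CRAWL4AI",
      // Assumption: plain HTTP on the documented default host and port.
      baseUrl: env.CRAWL4AI_BASE_URL ?? "http://localhost:11235",
      apiToken: env.CRAWL4AI_API_TOKEN,
    };
  }
  if (raw === "FIRECRAWL") {
    return { kind: "FIRECRAWL" };
  }
  // Fail loudly on typos instead of silently picking a backend.
  throw new Error(`Unsupported CRAWLER value: ${raw}`);
}
```

In the application this would be called once at startup, for example `resolveCrawlerConfig(process.env)`.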

Technical Details:

  • Add new environment variables:
    • CRAWLER: Toggle between "FIRECRAWL" and "CRAWL4AI"
    • CRAWL4AI_API_TOKEN: Authentication token for Crawl4AI
    • CRAWL4AI_BASE_URL: Optional custom endpoint (default: localhost:11235)
  • Implement polling mechanism for Crawl4AI's asynchronous API
  • Transform Crawl4AI responses to match Firecrawl's data structure
  • Add proper error handling and timeout management for HTTP requests
  • Update .env.example with comprehensive documentation
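To illustrate the response transformation mentioned above, here is a rough sketch that maps a list of crawled pages into a `{ success, data }` structure with `url` and `markdown` per document. The input type `Crawl4AIPage`, the output types, and every field name are assumptions for illustration; the actual shapes used by the PR and by either service may differ.

```typescript
// Hypothetical shape of one page returned by the Crawl4AI backend.
interface Crawl4AIPage {
  url: string;
  markdown?: string;
  success?: boolean;
}

// Hypothetical common shape the research code would consume.
interface NormalizedDocument {
  url: string;
  markdown: string;
}

interface NormalizedResponse {
  success: boolean;
  data: NormalizedDocument[];
}

// Drops failed or empty pages and trims the remaining markdown.
function normalizeCrawl4AI(pages: Crawl4AIPage[]): NormalizedResponse {
  const data: NormalizedDocument[] = [];
  for (const page of pages) {
    if (page.success === false) continue; // page explicitly marked as failed
    const markdown = (page.markdown ?? "").trim();
    if (markdown === "") continue; // nothing useful to pass downstream
    data.push({ url: page.url, markdown });
  }
  // Treat the whole response as successful if at least one page survived.
  return { success: data.length > 0, data };
}
```

Keeping the mapping in one function means the rest of the research pipeline never needs to know which backend produced the documents.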

The implementation ensures that the deep research functionality remains unchanged while providing users the flexibility to choose their preferred crawler backend. Error handling and timeout mechanisms have been carefully considered to maintain robustness regardless of the chosen crawler.
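The polling and timeout behaviour described above can be sketched as a small generic helper. This is an illustration under assumptions, not the PR's implementation: `pollUntilDone`, `TaskStatus`, and `PollOptions` are hypothetical names, and the status check is injected as a function so the sketch does not depend on any particular Crawl4AI endpoint.

```typescript
// Hypothetical status type; a real backend's response would be mapped into it.
type TaskStatus<T> =
  | { state: "pending" }
  | { state: "completed"; result: T }
  | { state: "failed"; error: string };

interface PollOptions {
  intervalMs: number; // delay between status checks
  timeoutMs: number; // overall budget for the task
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls checkStatus repeatedly until the task completes, fails,
// or the next wait would exceed the overall timeout.
async function pollUntilDone<T>(
  checkStatus: () => Promise<TaskStatus<T>>,
  options: PollOptions,
): Promise<T> {
  const deadline = Date.now() + options.timeoutMs;
  while (true) {
    const status = await checkStatus();
    if (status.state === "completed") return status.result;
    if (status.state === "failed") {
      throw new Error(`Crawl task failed: ${status.error}`);
    }
    if (Date.now() + options.intervalMs > deadline) {
      throw new Error(`Crawl task timed out after ${options.timeoutMs} ms`);
    }
    await sleep(options.intervalMs);
  }
}
```

A caller would wrap its HTTP status request in `checkStatus` and translate the service's reply into one of the three states.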

@brettdavies

@dzhng,

Adding the dynamic crawler selection logic introduces additional complexity to deep-research.ts. Depending on how you want to maintain the repo, this may run afoul of the single-responsibility principle (the S in SOLID), since the file would now handle crawler selection as well as research logic. If you agree, I can extract this functionality into a dedicated service/helper module (e.g., src/services/crawler.ts or src/helpers/crawler-factory.ts).

This refactor would:

  • Encapsulate crawler-specific logic and configuration
  • Provide a clean factory pattern for crawler instantiation
  • Make the main research logic more focused and testable
  • Simplify future additions of other crawler implementations

Let me know if you'd like me to make this change, and I'll resubmit the changes to this PR.
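For reference, the proposed factory could look roughly like the sketch below. Everything here is hypothetical: the `Crawler` interface, the `CrawlerRegistry` class, the `CrawlResult` shape, and the stub backends are placeholders for illustration and do not reflect existing code in the repo or the real client libraries.

```typescript
// Hypothetical common result shape shared by all backends.
interface CrawlDocument {
  url: string;
  markdown: string;
}

interface CrawlResult {
  data: CrawlDocument[];
}

// Hypothetical interface the research logic would depend on.
interface Crawler {
  readonly name: string;
  search(query: string): Promise<CrawlResult>;
}

type CrawlerFactory = () => Crawler;

// Maps a crawler name (case-insensitive) to a factory function.
class CrawlerRegistry {
  private readonly factories = new Map<string, CrawlerFactory>();

  register(name: string, factory: CrawlerFactory): void {
    this.factories.set(name.toUpperCase(), factory);
  }

  create(name: string): Crawler {
    const factory = this.factories.get(name.toUpperCase());
    if (!factory) {
      const known = Array.from(this.factories.keys()).join(", ");
      throw new Error(`Unknown crawler "${name}". Known crawlers: ${known}`);
    }
    return factory();
  }
}

// Stub backend standing in for a real Firecrawl or Crawl4AI client.
function makeStubCrawler(name: string): Crawler {
  return {
    name,
    async search(query: string): Promise<CrawlResult> {
      const url = `https://example.com/?q=${encodeURIComponent(query)}`;
      return { data: [{ url, markdown: `# Results for ${query}` }] };
    },
  };
}

const registry = new CrawlerRegistry();
registry.register("FIRECRAWL", () => makeStubCrawler("FIRECRAWL"));
registry.register("CRAWL4AI", () => makeStubCrawler("CRAWL4AI"));
```

With this layout, supporting another crawler would mean adding one `register` call and one implementation of `Crawler`, leaving deep-research.ts untouched.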

@brettdavies

Added a second commit to the PR. This commit enhances the README's markdown formatting to align with standard practices and improve readability across different markdown renderers.

Detailed Changes:

  • Standardize code block indentation throughout the document
    • Ensure all code blocks are properly nested under their sections
    • Add consistent 3-space indentation for code blocks within lists
  • Fix code block formatting
    • Add proper line breaks before and after code blocks
    • Ensure all bash commands are properly tagged with ```bash
  • Improve documentation clarity
    • Add backticks around URLs in configuration examples
    • Fix inconsistent spacing in list items

The changes maintain the same content while ensuring:

  • Consistent presentation across different markdown viewers
  • Proper nesting of code blocks within numbered lists
  • Clear visual hierarchy in the documentation
  • Better readability of configuration examples

These formatting improvements help maintain a professional documentation standard while making the README more accessible to new contributors.

@didlawowo

didlawowo commented Feb 19, 2025

What advantage do you get from using Crawl4AI instead of Firecrawl?

@lucasgreenwell

@didlawowo Price. No need to pay for Firecrawl as a separate service.
