Firecrawl

This service deploys Firecrawl, a web scraping and crawling API powered by Playwright and headless browsers.

Services

  • api: The main Firecrawl API server with integrated workers
  • redis: Redis for job queue and caching
  • playwright-service: Playwright service for browser automation
  • nuq-postgres: PostgreSQL database for queue management and data storage

Environment Variables

| Variable Name | Description | Default Value |
| --- | --- | --- |
| FIRECRAWL_VERSION | Firecrawl image version | latest |
| REDIS_VERSION | Redis image version | alpine |
| PLAYWRIGHT_VERSION | Playwright service version | latest |
| NUQ_POSTGRES_VERSION | NUQ PostgreSQL image version | latest |
| POSTGRES_USER | PostgreSQL username | postgres |
| POSTGRES_PASSWORD | PostgreSQL password | postgres |
| POSTGRES_DB | PostgreSQL database name | postgres |
| POSTGRES_PORT_OVERRIDE | PostgreSQL port mapping | 5432 |
| INTERNAL_PORT | Internal API port | 3002 |
| FIRECRAWL_PORT_OVERRIDE | External API port mapping | 3002 |
| EXTRACT_WORKER_PORT | Extract worker port | 3004 |
| WORKER_PORT | Worker port | 3005 |
| USE_DB_AUTHENTICATION | Enable database authentication | false |
| OPENAI_API_KEY | OpenAI API key for AI features (optional) | "" |
| OPENAI_BASE_URL | OpenAI API base URL (optional) | "" |
| MODEL_NAME | AI model name (optional) | "" |
| MODEL_EMBEDDING_NAME | Embedding model name (optional) | "" |
| OLLAMA_BASE_URL | Ollama base URL (optional) | "" |
| BULL_AUTH_KEY | Bull queue admin panel authentication key | @ |
| TEST_API_KEY | Test API key (optional) | "" |
| SLACK_WEBHOOK_URL | Slack webhook for notifications (optional) | "" |
| POSTHOG_API_KEY | PostHog API key (optional) | "" |
| POSTHOG_HOST | PostHog host (optional) | "" |
| SUPABASE_ANON_TOKEN | Supabase anonymous token (optional) | "" |
| SUPABASE_URL | Supabase URL (optional) | "" |
| SUPABASE_SERVICE_TOKEN | Supabase service token (optional) | "" |
| SELF_HOSTED_WEBHOOK_URL | Self-hosted webhook URL (optional) | "" |
| SERPER_API_KEY | Serper API key for search (optional) | "" |
| SEARCHAPI_API_KEY | SearchAPI key (optional) | "" |
| LOGGING_LEVEL | Logging level | info |
| PROXY_SERVER | Proxy server URL (optional) | "" |
| PROXY_USERNAME | Proxy username (optional) | "" |
| PROXY_PASSWORD | Proxy password (optional) | "" |
| BLOCK_MEDIA | Block media content | true |
| SEARXNG_ENDPOINT | SearXNG endpoint (optional) | "" |
| SEARXNG_ENGINES | SearXNG engines (optional) | "" |
| SEARXNG_CATEGORIES | SearXNG categories (optional) | "" |

Please modify the .env file as needed for your use case.
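
As an illustration, a minimal .env might override only the credentials and leave everything else at its default. The variable names come from the list above; the placeholder values are assumptions and should be replaced with your own.

```shell
# .env (sketch): only values that differ from the defaults need to be listed.
# Placeholder secrets below are hypothetical; substitute real ones.

# Host port on which the API is published (default shown)
FIRECRAWL_PORT_OVERRIDE=3002

# Replace the default postgres/postgres credentials
POSTGRES_PASSWORD=replace-with-a-strong-password

# Key that forms part of the Bull admin panel URL
BULL_AUTH_KEY=replace-with-a-long-random-string

# Keep media blocking enabled (default shown)
BLOCK_MEDIA=true
```

Any variable omitted from the file should fall back to the default listed above.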

Volumes

  • redis_data: Redis data storage for job queues and caching
  • postgres_data: PostgreSQL data storage for queue management and metadata

Usage

Start the Services

docker compose up -d
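
The containers may take a little while to become ready after startup. The following is a small sketch of a readiness check, assuming curl is installed and the API is published on the default port; the function name wait_for_api and its retry settings are illustrative, not part of Firecrawl.

```shell
# wait_for_api URL ATTEMPTS
# Polls URL until the server answers with any HTTP response, or gives up
# after ATTEMPTS tries (2 seconds apart). Returns 0 on success, 1 on timeout.
wait_for_api() {
  url="${1:-http://localhost:3002}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Without -f, curl exits 0 on any HTTP status, so this only tests reachability
    if curl -s -o /dev/null --max-time 2 "$url"; then
      echo "API is reachable at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "API did not respond at $url" >&2
  return 1
}

# Example: wait_for_api http://localhost:3002 30
```

If the check times out, `docker compose ps` shows the state of each container and `docker compose logs api` prints the API server output.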

Access the API

The Firecrawl API will be available at:

http://localhost:3002

Admin Panel

Access the Bull queue admin panel at:

http://localhost:3002/admin/@/queues

The @ in the path is the default BULL_AUTH_KEY value. If you changed the key, substitute your own value in its place.
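
To avoid typing the URL by hand, it can be assembled from the same variables used in the .env file. This is a sketch only: the helper name admin_url is hypothetical, and it assumes the key contains no characters that need URL encoding.

```shell
# Build the admin panel URL from BULL_AUTH_KEY and FIRECRAWL_PORT_OVERRIDE,
# falling back to the documented defaults (@ and 3002) when they are unset.
admin_url() {
  key="${BULL_AUTH_KEY:-@}"
  port="${FIRECRAWL_PORT_OVERRIDE:-3002}"
  echo "http://localhost:${port}/admin/${key}/queues"
}

# Example: BULL_AUTH_KEY=my-secret admin_url
```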

Example API Calls

Scrape a Single Page:

curl -X POST http://localhost:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

Crawl a Website:

curl -X POST http://localhost:3002/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100
  }'
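
Crawls run as background jobs, so the request above returns before the crawl finishes. The sketch below assumes the response is JSON containing an "id" field and that the job status can be read from GET /v1/crawl/<id>; verify both against the API version you deploy. The helper names (extract_id, start_crawl, crawl_status) and the BASE_URL variable are illustrative.

```shell
BASE_URL="${BASE_URL:-http://localhost:3002}"

# Pull the value of the "id" field out of a JSON response read from stdin.
# A simple sed pattern is used here to avoid requiring a JSON tool.
extract_id() {
  sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# start_crawl URL [LIMIT]: submit a crawl job and print its job ID.
start_crawl() {
  curl -s -X POST "$BASE_URL/v1/crawl" \
    -H "Content-Type: application/json" \
    -d "{\"url\": \"$1\", \"limit\": ${2:-10}}" | extract_id
}

# crawl_status JOB_ID: print the current status document for a job.
crawl_status() {
  curl -s "$BASE_URL/v1/crawl/$1"
}

# Example:
#   job_id=$(start_crawl https://example.com 10)
#   crawl_status "$job_id"
```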

Extract Structured Data:

curl -X POST http://localhost:3002/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"}
      }
    }
  }'

Features

  • Web Scraping: Extract clean content from any webpage
  • Web Crawling: Recursively crawl entire websites
  • JavaScript Rendering: Full support for dynamic JavaScript-rendered pages
  • Markdown Output: Clean markdown conversion of web content
  • Structured Data Extraction: Extract data using JSON schemas
  • Queue Management: Built-in job queue with Bull
  • Rate Limiting: Configurable rate limiting
  • Proxy Support: Optional proxy configuration for all requests
  • AI-Powered Features: Optional OpenAI or Ollama integration for advanced extraction

Architecture

This deployment uses the official Firecrawl architecture:

  • API Server: Handles HTTP requests and manages the job queue
  • Workers: Built into the main container; they process scraping jobs
  • PostgreSQL: Stores queue metadata and job information
  • Redis: Handles job queue and caching
  • Playwright Service: Provides browser automation capabilities

Notes

  • The service uses the official ghcr.io/firecrawl/firecrawl image
  • PostgreSQL uses the official ghcr.io/firecrawl/nuq-postgres image for queue management (NUQ - Not Quite Bull)
  • Redis handles job queuing and runs without a password by default, since it is only exposed on the private network
  • For production use, enable USE_DB_AUTHENTICATION and configure Supabase
  • The BULL_AUTH_KEY should be changed in production deployments
  • AI features require an OPENAI_API_KEY or OLLAMA_BASE_URL
  • All workers run inside the single API container in harness mode
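
Since the default BULL_AUTH_KEY is a single character, a random replacement is advisable for production. One possible way to generate one, assuming a Unix-like system with /dev/urandom (the function name generate_key is hypothetical):

```shell
# Print 32 random hexadecimal characters (16 bytes from /dev/urandom).
generate_key() {
  head -c 16 /dev/urandom | od -An -tx1 | tr -dc '0-9a-f'
}

# Example: append a fresh key to the .env file
#   echo "BULL_AUTH_KEY=$(generate_key)" >> .env
```

A hex string is URL-safe, so it can be used in the admin panel path without encoding.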

License

Firecrawl is licensed under the AGPL-3.0 License.