# Firecrawl

[English](./README.md) | [中文](./README.zh.md)

This service deploys Firecrawl, a web scraping and crawling API powered by Playwright and headless browsers.

## Services

- `api`: The main Firecrawl API server with integrated workers
- `redis`: Redis for job queue and caching
- `playwright-service`: Playwright service for browser automation
- `nuq-postgres`: PostgreSQL database for queue management and data storage

## Environment Variables

| Variable Name           | Description                                | Default Value |
| ----------------------- | ------------------------------------------ | ------------- |
| FIRECRAWL_VERSION       | Firecrawl image version                    | `latest`      |
| REDIS_VERSION           | Redis image version                        | `alpine`      |
| PLAYWRIGHT_VERSION      | Playwright service version                 | `latest`      |
| NUQ_POSTGRES_VERSION    | NUQ PostgreSQL image version               | `latest`      |
| POSTGRES_USER           | PostgreSQL username                        | `postgres`    |
| POSTGRES_PASSWORD       | PostgreSQL password                        | `postgres`    |
| POSTGRES_DB             | PostgreSQL database name                   | `postgres`    |
| POSTGRES_PORT_OVERRIDE  | PostgreSQL port mapping                    | `5432`        |
| INTERNAL_PORT           | Internal API port                          | `3002`        |
| FIRECRAWL_PORT_OVERRIDE | External API port mapping                  | `3002`        |
| EXTRACT_WORKER_PORT     | Extract worker port                        | `3004`        |
| WORKER_PORT             | Worker port                                | `3005`        |
| USE_DB_AUTHENTICATION   | Enable database authentication             | `false`       |
| OPENAI_API_KEY          | OpenAI API key for AI features (optional)  | `""`          |
| OPENAI_BASE_URL         | OpenAI API base URL (optional)             | `""`          |
| MODEL_NAME              | AI model name (optional)                   | `""`          |
| MODEL_EMBEDDING_NAME    | Embedding model name (optional)            | `""`          |
| OLLAMA_BASE_URL         | Ollama base URL (optional)                 | `""`          |
| BULL_AUTH_KEY           | Bull queue admin panel authentication key  | `@`           |
| TEST_API_KEY            | Test API key (optional)                    | `""`          |
| SLACK_WEBHOOK_URL       | Slack webhook for notifications (optional) | `""`          |
| POSTHOG_API_KEY         | PostHog API key (optional)                 | `""`          |
| POSTHOG_HOST            | PostHog host (optional)                    | `""`          |
| SUPABASE_ANON_TOKEN     | Supabase anonymous token (optional)        | `""`          |
| SUPABASE_URL            | Supabase URL (optional)                    | `""`          |
| SUPABASE_SERVICE_TOKEN  | Supabase service token (optional)          | `""`          |
| SELF_HOSTED_WEBHOOK_URL | Self-hosted webhook URL (optional)         | `""`          |
| SERPER_API_KEY          | Serper API key for search (optional)       | `""`          |
| SEARCHAPI_API_KEY       | SearchAPI key (optional)                   | `""`          |
| LOGGING_LEVEL           | Logging level                              | `info`        |
| PROXY_SERVER            | Proxy server URL (optional)                | `""`          |
| PROXY_USERNAME          | Proxy username (optional)                  | `""`          |
| PROXY_PASSWORD          | Proxy password (optional)                  | `""`          |
| BLOCK_MEDIA             | Block media content                        | `true`        |
| SEARXNG_ENDPOINT        | SearXNG endpoint (optional)                | `""`          |
| SEARXNG_ENGINES         | SearXNG engines (optional)                 | `""`          |
| SEARXNG_CATEGORIES      | SearXNG categories (optional)              | `""`          |

Please modify the `.env` file as needed for your use case.
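
As a starting point, a minimal `.env` might pin image versions and override the defaults you care about. Every variable name below comes from the table above; the values are only illustrative:

```bash
# Example .env — values are illustrative; see the table above for all options
FIRECRAWL_VERSION=latest
POSTGRES_USER=postgres
POSTGRES_PASSWORD=change-me
POSTGRES_DB=postgres
FIRECRAWL_PORT_OVERRIDE=3002
BULL_AUTH_KEY=change-me
USE_DB_AUTHENTICATION=false
LOGGING_LEVEL=info
```

Any variable you leave out falls back to the default listed in the table.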

## Volumes

- `redis_data`: Redis data storage for job queues and caching
- `postgres_data`: PostgreSQL data storage for queue management and metadata

## Usage

### Start the Services

```bash
docker compose up -d
```

### Access the API

The Firecrawl API will be available at:

```text
http://localhost:3002
```

### Admin Panel

Access the Bull queue admin panel at:

```text
http://localhost:3002/admin/@/queues
```

Replace `@` in the URL with your `BULL_AUTH_KEY` value if you have changed it.

### Example API Calls

**Scrape a Single Page:**

```bash
curl -X POST http://localhost:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```
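
The scrape endpoint responds with JSON; in Firecrawl's v1 format the page content sits under `data.markdown`. A small sketch of pulling that field out with Python's standard library — the inline `RESPONSE` is a made-up sample that mirrors the response shape, so the snippet runs without the service:

```bash
# Parse a scrape response; in practice, pipe the curl output in here
# instead of using this inline sample.
RESPONSE='{"success": true, "data": {"markdown": "# Example Domain"}}'
echo "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["data"]["markdown"])'
```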

**Crawl a Website:**

```bash
curl -X POST http://localhost:3002/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100
  }'
```
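
Crawls run asynchronously: in Firecrawl's v1 API the POST returns a job `id`, and progress is read back from `GET /v1/crawl/<id>`. A sketch of extracting the id and building the status URL — the inline response is a fabricated sample standing in for real curl output:

```bash
# A crawl request returns a job id (inline sample stands in for curl output).
CRAWL_RESPONSE='{"success": true, "id": "sample-job-id"}'
JOB_ID=$(echo "$CRAWL_RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["id"])')
# Poll this URL until the job reports completion:
echo "http://localhost:3002/v1/crawl/$JOB_ID"
```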

**Extract Structured Data:**

```bash
curl -X POST http://localhost:3002/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"}
      }
    }
  }'
```

## Features

- **Web Scraping**: Extract clean content from any webpage
- **Web Crawling**: Recursively crawl entire websites
- **JavaScript Rendering**: Full support for dynamic JavaScript-rendered pages
- **Markdown Output**: Clean markdown conversion of web content
- **Structured Data Extraction**: Extract data using JSON schemas
- **Queue Management**: Built-in job queue with Bull
- **Rate Limiting**: Configurable rate limiting
- **Proxy Support**: Optional proxy configuration for all requests
- **AI-Powered Features**: Optional OpenAI integration for advanced extraction

## Architecture

This deployment uses the official Firecrawl architecture:

- **API Server**: Handles HTTP requests and manages the job queue
- **Workers**: Run inside the main API container and process scraping jobs
- **PostgreSQL**: Stores queue metadata and job information
- **Redis**: Handles the job queue and caching
- **Playwright Service**: Provides browser automation capabilities

## Notes

- The service uses the official `ghcr.io/firecrawl/firecrawl` image
- PostgreSQL uses the official `ghcr.io/firecrawl/nuq-postgres` image for queue management (NUQ, "Not Quite Bull")
- Redis runs without a password by default (it is reachable only on the private network)
- For production use, enable `USE_DB_AUTHENTICATION` and configure Supabase
- Change `BULL_AUTH_KEY` in production deployments
- AI features require an `OPENAI_API_KEY` or an `OLLAMA_BASE_URL`
- All workers run within the single API container using harness mode

## License

Firecrawl is licensed under the AGPL-3.0 License.