Files
compose-anything/src/firecrawl/README.md

164 lines
6.6 KiB
Markdown

# Firecrawl
[English](./README.md) | [中文](./README.zh.md)
This service deploys Firecrawl, a web scraping and crawling API powered by Playwright and headless browsers.
## Services
- `api`: The main Firecrawl API server with integrated workers
- `redis`: Redis for job queue and caching
- `playwright-service`: Playwright service for browser automation
- `nuq-postgres`: PostgreSQL database for queue management and data storage
## Environment Variables
| Variable Name | Description | Default Value |
| ----------------------- | ------------------------------------------ | ------------- |
| FIRECRAWL_VERSION | Firecrawl image version | `latest` |
| REDIS_VERSION | Redis image version | `alpine` |
| PLAYWRIGHT_VERSION | Playwright service version | `latest` |
| NUQ_POSTGRES_VERSION | NUQ PostgreSQL image version | `latest` |
| POSTGRES_USER | PostgreSQL username | `postgres` |
| POSTGRES_PASSWORD | PostgreSQL password | `postgres` |
| POSTGRES_DB | PostgreSQL database name | `postgres` |
| POSTGRES_PORT_OVERRIDE | PostgreSQL port mapping | `5432` |
| INTERNAL_PORT | Internal API port | `3002` |
| FIRECRAWL_PORT_OVERRIDE | External API port mapping | `3002` |
| EXTRACT_WORKER_PORT | Extract worker port | `3004` |
| WORKER_PORT | Worker port | `3005` |
| USE_DB_AUTHENTICATION | Enable database authentication | `false` |
| OPENAI_API_KEY | OpenAI API key for AI features (optional) | `""` |
| OPENAI_BASE_URL | OpenAI API base URL (optional) | `""` |
| MODEL_NAME | AI model name (optional) | `""` |
| MODEL_EMBEDDING_NAME | Embedding model name (optional) | `""` |
| OLLAMA_BASE_URL | Ollama base URL (optional) | `""` |
| BULL_AUTH_KEY | Bull queue admin panel authentication key | `@` |
| TEST_API_KEY | Test API key (optional) | `""` |
| SLACK_WEBHOOK_URL | Slack webhook for notifications (optional) | `""` |
| POSTHOG_API_KEY | PostHog API key (optional) | `""` |
| POSTHOG_HOST | PostHog host (optional) | `""` |
| SUPABASE_ANON_TOKEN | Supabase anonymous token (optional) | `""` |
| SUPABASE_URL | Supabase URL (optional) | `""` |
| SUPABASE_SERVICE_TOKEN | Supabase service token (optional) | `""` |
| SELF_HOSTED_WEBHOOK_URL | Self-hosted webhook URL (optional) | `""` |
| SERPER_API_KEY | Serper API key for search (optional) | `""` |
| SEARCHAPI_API_KEY | SearchAPI key (optional) | `""` |
| LOGGING_LEVEL | Logging level | `info` |
| PROXY_SERVER | Proxy server URL (optional) | `""` |
| PROXY_USERNAME | Proxy username (optional) | `""` |
| PROXY_PASSWORD | Proxy password (optional) | `""` |
| BLOCK_MEDIA | Block media content | `true` |
| SEARXNG_ENDPOINT | SearXNG endpoint (optional) | `""` |
| SEARXNG_ENGINES | SearXNG engines (optional) | `""` |
| SEARXNG_CATEGORIES | SearXNG categories (optional) | `""` |
Please modify the `.env` file as needed for your use case.
## Volumes
- `redis_data`: Redis data storage for job queues and caching
- `postgres_data`: PostgreSQL data storage for queue management and metadata
## Usage
### Start the Services
```bash
docker compose up -d
```
### Access the API
The Firecrawl API will be available at:
```text
http://localhost:3002
```
### Admin Panel
Access the Bull queue admin panel at:
```text
http://localhost:3002/admin/@/queues
```
Replace `@` with your `BULL_AUTH_KEY` value if changed.
### Example API Calls
**Scrape a Single Page:**
```bash
curl -X POST http://localhost:3002/v1/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com"
}'
```
**Crawl a Website:**
```bash
curl -X POST http://localhost:3002/v1/crawl \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"limit": 100
}'
```
**Extract Structured Data:**
```bash
curl -X POST http://localhost:3002/v1/extract \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"description": {"type": "string"}
}
}
}'
```
## Features
- **Web Scraping**: Extract clean content from any webpage
- **Web Crawling**: Recursively crawl entire websites
- **JavaScript Rendering**: Full support for dynamic JavaScript-rendered pages
- **Markdown Output**: Clean markdown conversion of web content
- **Structured Data Extraction**: Extract data using JSON schemas
- **Queue Management**: Built-in job queue with Bull
- **Rate Limiting**: Configurable rate limiting
- **Proxy Support**: Optional proxy configuration for all requests
- **AI-Powered Features**: Optional OpenAI integration for advanced extraction
## Architecture
This deployment uses the official Firecrawl architecture:
- **API Server**: Handles HTTP requests and manages the job queue
- **Workers**: Built into the main container, processes scraping jobs
- **PostgreSQL**: Stores queue metadata and job information
- **Redis**: Handles job queue and caching
- **Playwright Service**: Provides browser automation capabilities
## Notes
- The service uses the official `ghcr.io/firecrawl/firecrawl` image
- PostgreSQL uses the official `ghcr.io/firecrawl/nuq-postgres` image for queue management (NUQ - Not Quite Bull)
- Redis is used for job queuing without password by default (runs on private network)
- For production use, enable `USE_DB_AUTHENTICATION` and configure Supabase
- The `BULL_AUTH_KEY` should be changed in production deployments
- AI features require an `OPENAI_API_KEY` or `OLLAMA_BASE_URL`
- All workers run within the single API container using the harness mode
## License
Firecrawl is licensed under the AGPL-3.0 License.