Firecrawl
This service deploys Firecrawl, a web scraping and crawling API powered by Playwright and headless browsers.
Services
- api: The main Firecrawl API server with integrated workers
- redis: Redis for job queue and caching
- playwright-service: Playwright service for browser automation
- nuq-postgres: PostgreSQL database for queue management and data storage
Environment Variables
| Variable Name | Description | Default Value |
|---|---|---|
| FIRECRAWL_VERSION | Firecrawl image version | latest |
| REDIS_VERSION | Redis image version | alpine |
| PLAYWRIGHT_VERSION | Playwright service version | latest |
| NUQ_POSTGRES_VERSION | NUQ PostgreSQL image version | latest |
| POSTGRES_USER | PostgreSQL username | postgres |
| POSTGRES_PASSWORD | PostgreSQL password | postgres |
| POSTGRES_DB | PostgreSQL database name | postgres |
| POSTGRES_PORT_OVERRIDE | PostgreSQL port mapping | 5432 |
| INTERNAL_PORT | Internal API port | 3002 |
| FIRECRAWL_PORT_OVERRIDE | External API port mapping | 3002 |
| EXTRACT_WORKER_PORT | Extract worker port | 3004 |
| WORKER_PORT | Worker port | 3005 |
| USE_DB_AUTHENTICATION | Enable database authentication | false |
| OPENAI_API_KEY | OpenAI API key for AI features (optional) | "" |
| OPENAI_BASE_URL | OpenAI API base URL (optional) | "" |
| MODEL_NAME | AI model name (optional) | "" |
| MODEL_EMBEDDING_NAME | Embedding model name (optional) | "" |
| OLLAMA_BASE_URL | Ollama base URL (optional) | "" |
| BULL_AUTH_KEY | Bull queue admin panel authentication key | @ |
| TEST_API_KEY | Test API key (optional) | "" |
| SLACK_WEBHOOK_URL | Slack webhook for notifications (optional) | "" |
| POSTHOG_API_KEY | PostHog API key (optional) | "" |
| POSTHOG_HOST | PostHog host (optional) | "" |
| SUPABASE_ANON_TOKEN | Supabase anonymous token (optional) | "" |
| SUPABASE_URL | Supabase URL (optional) | "" |
| SUPABASE_SERVICE_TOKEN | Supabase service token (optional) | "" |
| SELF_HOSTED_WEBHOOK_URL | Self-hosted webhook URL (optional) | "" |
| SERPER_API_KEY | Serper API key for search (optional) | "" |
| SEARCHAPI_API_KEY | SearchAPI key (optional) | "" |
| LOGGING_LEVEL | Logging level | info |
| PROXY_SERVER | Proxy server URL (optional) | "" |
| PROXY_USERNAME | Proxy username (optional) | "" |
| PROXY_PASSWORD | Proxy password (optional) | "" |
| BLOCK_MEDIA | Block media content | true |
| SEARXNG_ENDPOINT | SearXNG endpoint (optional) | "" |
| SEARXNG_ENGINES | SearXNG engines (optional) | "" |
| SEARXNG_CATEGORIES | SearXNG categories (optional) | "" |
Please modify the .env file as needed for your use case.
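As a starting point, a minimal .env sketch might override only a handful of the variables listed above. The values shown are placeholders rather than recommendations, and anything left out falls back to the defaults in the table.

```shell
# .env (placeholder values; every variable name comes from the table above)
FIRECRAWL_VERSION=latest
FIRECRAWL_PORT_OVERRIDE=3002
POSTGRES_PASSWORD=change-me
BULL_AUTH_KEY=change-me
LOGGING_LEVEL=info
# Optional AI features: uncomment and fill in one of the following
# OPENAI_API_KEY=
# OLLAMA_BASE_URL=
```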
Volumes
- redis_data: Redis data storage for job queues and caching
- postgres_data: PostgreSQL data storage for queue management and metadata
Usage
Start the Services
docker compose up -d
Access the API
The Firecrawl API will be available at:
http://localhost:3002
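The containers can take a little while to accept connections after startup. The sketch below polls the API URL until it answers. wait_for_api is a hypothetical helper, not something Firecrawl ships, and it assumes curl is installed on the host.

```shell
# Poll the API until it answers or the attempt budget runs out.
# wait_for_api is a hypothetical helper, not part of Firecrawl.
wait_for_api() {
  url="${1:-http://localhost:3002}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Without -f, curl exits 0 on any HTTP response (even 404),
    # which is enough to show the server is accepting connections.
    if curl -s -o /dev/null "$url"; then
      echo "API is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "API did not answer at $url after $attempts attempts" >&2
  return 1
}
```

For example, `docker compose up -d && wait_for_api` blocks until the API responds or roughly a minute has passed.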
Admin Panel
Access the Bull queue admin panel at:
http://localhost:3002/admin/@/queues
If you changed BULL_AUTH_KEY, replace @ in the URL with your value.
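The admin URL can also be built from the same variables used in .env, which avoids editing it by hand. firecrawl_admin_url is a hypothetical helper name; the defaults mirror the table above (port 3002, key @).

```shell
# Build the Bull admin panel URL from the .env variables.
# firecrawl_admin_url is a hypothetical helper, not part of Firecrawl.
firecrawl_admin_url() {
  port="${FIRECRAWL_PORT_OVERRIDE:-3002}"
  key="${BULL_AUTH_KEY:-@}"
  printf 'http://localhost:%s/admin/%s/queues\n' "$port" "$key"
}
```

A key containing characters such as / or ? would need URL-encoding, which this sketch does not attempt.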
Example API Calls
Scrape a Single Page:
curl -X POST http://localhost:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
Crawl a Website:
curl -X POST http://localhost:3002/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100
  }'
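A crawl is typically queued rather than answered inline, so a client usually has to poll for the result. The sketch below assumes the v1 behaviour where POST /v1/crawl returns a JSON body with an "id" field and GET /v1/crawl/<id> reports a "status" field that stays "scraping" while the job runs. Verify both against your Firecrawl version. FIRECRAWL_URL, json_field, crawl_start, and crawl_wait are hypothetical names introduced only for illustration.

```shell
# Assumptions: the "id" and "status" response fields and the "scraping"
# in-progress value follow Firecrawl's v1 API; check your deployed version.
FIRECRAWL_URL="${FIRECRAWL_URL:-http://localhost:3002}"  # hypothetical variable

# Print the first string value stored under key $1 in JSON read from stdin.
# A crude jq substitute that is good enough for flat, compact responses.
json_field() {
  grep -o "\"$1\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" | head -n 1 | cut -d'"' -f4
}

# Start a crawl of URL $1 with page limit $2 (default 100); print the job id.
crawl_start() {
  curl -s -X POST "$FIRECRAWL_URL/v1/crawl" \
    -H "Content-Type: application/json" \
    -d "{\"url\": \"$1\", \"limit\": ${2:-100}}" | json_field id
}

# Poll job $1 up to $2 times (default 60) and print its final status.
crawl_wait() {
  tries=0
  while [ "$tries" -lt "${2:-60}" ]; do
    status=$(curl -s "$FIRECRAWL_URL/v1/crawl/$1" | json_field status)
    case "$status" in
      scraping) sleep 2 ;;
      *) printf '%s\n' "${status:-unknown}"; return 0 ;;
    esac
    tries=$((tries + 1))
  done
  echo "timeout"
  return 1
}
```

Usage would look like `id=$(crawl_start https://example.com 100) && crawl_wait "$id"`. For anything beyond a quick check, a real JSON parser such as jq is the safer choice.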
Extract Structured Data:
curl -X POST http://localhost:3002/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"}
      }
    }
  }'
Features
- Web Scraping: Extract clean content from any webpage
- Web Crawling: Recursively crawl entire websites
- JavaScript Rendering: Full support for dynamic JavaScript-rendered pages
- Markdown Output: Clean markdown conversion of web content
- Structured Data Extraction: Extract data using JSON schemas
- Queue Management: Built-in job queue with Bull
- Rate Limiting: Configurable rate limiting
- Proxy Support: Optional proxy configuration for all requests
- AI-Powered Features: Optional OpenAI or Ollama integration for advanced extraction
Architecture
This deployment uses the official Firecrawl architecture:
- API Server: Handles HTTP requests and manages the job queue
- Workers: Built into the main container; they process scraping jobs
- PostgreSQL: Stores queue metadata and job information
- Redis: Handles job queue and caching
- Playwright Service: Provides browser automation capabilities
Notes
- The service uses the official ghcr.io/firecrawl/firecrawl image
- PostgreSQL uses the official ghcr.io/firecrawl/nuq-postgres image for queue management (NUQ - Not Quite Bull)
- Redis handles job queuing without a password by default (it runs on the private network)
- For production use, enable USE_DB_AUTHENTICATION and configure Supabase
- The BULL_AUTH_KEY should be changed in production deployments
- AI features require an OPENAI_API_KEY or OLLAMA_BASE_URL
- All workers run within the single API container using the harness mode
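One way to produce a replacement BULL_AUTH_KEY for production is to read random bytes from the operating system. generate_bull_auth_key is a hypothetical helper; it assumes /dev/urandom is available and emits hexadecimal only, so the key can be placed in the admin URL path without encoding.

```shell
# Print a 64-character hexadecimal key built from 32 random bytes.
# generate_bull_auth_key is a hypothetical helper, not part of Firecrawl.
generate_bull_auth_key() {
  od -An -N32 -tx1 /dev/urandom | tr -d ' \n'
  echo
}
```

The output can be appended to .env with `echo "BULL_AUTH_KEY=$(generate_bull_auth_key)" >> .env`.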
License
Firecrawl is licensed under the AGPL-3.0 License.