feat: add portkey-gateway/libreoffice/jodconverter/bolt-diy
@@ -6,39 +6,66 @@ This service deploys Firecrawl, a web scraping and crawling API powered by Playw

## Services

-- `firecrawl`: The main Firecrawl API server.
-- `redis`: Redis for job queue and caching.
-- `playwright`: Playwright service for browser automation.
+- `api`: The main Firecrawl API server with integrated workers
+- `redis`: Redis for job queue and caching
+- `playwright-service`: Playwright service for browser automation
+- `nuq-postgres`: PostgreSQL database for queue management and data storage

## Environment Variables

-| Variable Name                         | Description                         | Default Value  |
-| ------------------------------------- | ----------------------------------- | -------------- |
-| FIRECRAWL_VERSION                     | Firecrawl image version             | `v1.16.0`      |
-| REDIS_VERSION                         | Redis image version                 | `7.4.2-alpine` |
-| PLAYWRIGHT_VERSION                    | Playwright service version          | `latest`       |
-| REDIS_PASSWORD                        | Redis password                      | `firecrawl`    |
-| NUM_WORKERS_PER_QUEUE                 | Number of workers per queue         | `8`            |
-| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE   | Token bucket size for rate limiting | `20`           |
-| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL | Token refill rate per second        | `1`            |
-| PROXY_SERVER                          | Proxy server URL (optional)         | `""`           |
-| PROXY_USERNAME                        | Proxy username (optional)           | `""`           |
-| PROXY_PASSWORD                        | Proxy password (optional)           | `""`           |
-| BLOCK_MEDIA                           | Block media content                 | `true`         |
-| FIRECRAWL_PORT_OVERRIDE               | Firecrawl API port                  | `3002`         |
+| Variable Name           | Description                                | Default Value |
+| ----------------------- | ------------------------------------------ | ------------- |
+| FIRECRAWL_VERSION       | Firecrawl image version                    | `latest`      |
+| REDIS_VERSION           | Redis image version                        | `alpine`      |
+| PLAYWRIGHT_VERSION      | Playwright service version                 | `latest`      |
+| NUQ_POSTGRES_VERSION    | NUQ PostgreSQL image version               | `latest`      |
+| POSTGRES_USER           | PostgreSQL username                        | `postgres`    |
+| POSTGRES_PASSWORD       | PostgreSQL password                        | `postgres`    |
+| POSTGRES_DB             | PostgreSQL database name                   | `postgres`    |
+| POSTGRES_PORT_OVERRIDE  | PostgreSQL port mapping                    | `5432`        |
+| INTERNAL_PORT           | Internal API port                          | `3002`        |
+| FIRECRAWL_PORT_OVERRIDE | External API port mapping                  | `3002`        |
+| EXTRACT_WORKER_PORT     | Extract worker port                        | `3004`        |
+| WORKER_PORT             | Worker port                                | `3005`        |
+| USE_DB_AUTHENTICATION   | Enable database authentication             | `false`       |
+| OPENAI_API_KEY          | OpenAI API key for AI features (optional)  | `""`          |
+| OPENAI_BASE_URL         | OpenAI API base URL (optional)             | `""`          |
+| MODEL_NAME              | AI model name (optional)                   | `""`          |
+| MODEL_EMBEDDING_NAME    | Embedding model name (optional)            | `""`          |
+| OLLAMA_BASE_URL         | Ollama base URL (optional)                 | `""`          |
+| BULL_AUTH_KEY           | Bull queue admin panel authentication key  | `@`           |
+| TEST_API_KEY            | Test API key (optional)                    | `""`          |
+| SLACK_WEBHOOK_URL       | Slack webhook for notifications (optional) | `""`          |
+| POSTHOG_API_KEY         | PostHog API key (optional)                 | `""`          |
+| POSTHOG_HOST            | PostHog host (optional)                    | `""`          |
+| SUPABASE_ANON_TOKEN     | Supabase anonymous token (optional)        | `""`          |
+| SUPABASE_URL            | Supabase URL (optional)                    | `""`          |
+| SUPABASE_SERVICE_TOKEN  | Supabase service token (optional)          | `""`          |
+| SELF_HOSTED_WEBHOOK_URL | Self-hosted webhook URL (optional)         | `""`          |
+| SERPER_API_KEY          | Serper API key for search (optional)       | `""`          |
+| SEARCHAPI_API_KEY       | SearchAPI key (optional)                   | `""`          |
+| LOGGING_LEVEL           | Logging level                              | `info`        |
+| PROXY_SERVER            | Proxy server URL (optional)                | `""`          |
+| PROXY_USERNAME          | Proxy username (optional)                  | `""`          |
+| PROXY_PASSWORD          | Proxy password (optional)                  | `""`          |
+| BLOCK_MEDIA             | Block media content                        | `true`        |
+| SEARXNG_ENDPOINT        | SearXNG endpoint (optional)                | `""`          |
+| SEARXNG_ENGINES         | SearXNG engines (optional)                 | `""`          |
+| SEARXNG_CATEGORIES      | SearXNG categories (optional)              | `""`          |

Please modify the `.env` file as needed for your use case.
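
As a rough illustration of such a modification, a `.env` file might override a few of the variables listed in the table. This is only a sketch: every value below is a placeholder chosen for the example, not a recommendation from the project.

```bash
# .env (placeholder values; adjust to your deployment)

# Expose the API on host port 8080 instead of the default 3002
FIRECRAWL_PORT_OVERRIDE=8080

# Replace the default "@" key that guards the Bull admin panel
BULL_AUTH_KEY=change-me

# Replace the default PostgreSQL password
POSTGRES_PASSWORD=change-me-too
```

Variables left out of the file keep the defaults shown in the table.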

## Volumes

-- `redis_data`: Redis data storage for job queues and caching.
+- `redis_data`: Redis data storage for job queues and caching
+- `postgres_data`: PostgreSQL data storage for queue management and metadata

## Usage

### Start the Services

```bash
-docker-compose up -d
+docker compose up -d
```

### Access the API
@@ -49,12 +76,22 @@ The Firecrawl API will be available at:
http://localhost:3002
```

+### Admin Panel
+
+Access the Bull queue admin panel at:
+
+```text
+http://localhost:3002/admin/@/queues
+```
+
+Replace `@` with your `BULL_AUTH_KEY` value if changed.
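
The substitution described above can be sketched as a small shell helper. The function name `admin_url` is hypothetical; it assumes the stack is reachable on `localhost` and that the key contains no characters that need URL encoding. Both variables fall back to the defaults from the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the Bull admin panel URL from the documented variables.
# FIRECRAWL_PORT_OVERRIDE defaults to 3002 and BULL_AUTH_KEY to "@", as in the README.
admin_url() {
  local port="${FIRECRAWL_PORT_OVERRIDE:-3002}"
  local key="${BULL_AUTH_KEY:-@}"
  printf 'http://localhost:%s/admin/%s/queues\n' "$port" "$key"
}

# Print the URL for the current environment.
admin_url
```

Running it with `BULL_AUTH_KEY` exported in the shell prints the URL that matches your configured key.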

### Example API Calls

**Scrape a Single Page:**

```bash
-curl -X POST http://localhost:3002/v0/scrape \
+curl -X POST http://localhost:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
@@ -64,12 +101,27 @@ curl -X POST http://localhost:3002/v0/scrape \
**Crawl a Website:**

```bash
-curl -X POST http://localhost:3002/v0/crawl \
+curl -X POST http://localhost:3002/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
-    "crawlerOptions": {
-      "limit": 100
+    "limit": 100
  }'
```

+**Extract Structured Data:**
+
+```bash
+curl -X POST http://localhost:3002/v1/extract \
+  -H "Content-Type: application/json" \
+  -d '{
+    "urls": ["https://example.com"],
+    "schema": {
+      "type": "object",
+      "properties": {
+        "title": {"type": "string"},
+        "description": {"type": "string"}
+      }
+    }
+  }'
+```
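
The example calls all share one shape: a `POST` with a JSON body to `/v1/<endpoint>`. A minimal sketch of a wrapper for that pattern follows. The names `firecrawl_post` and `FIRECRAWL_URL` are assumptions introduced for this example and are not part of Firecrawl or this README; the base URL falls back to the documented `http://localhost:3002`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: POST a JSON payload to a Firecrawl v1 endpoint.
# FIRECRAWL_URL is an assumed variable name; it defaults to the documented address.
firecrawl_post() {
  local endpoint="$1" payload="$2"
  curl -sS -X POST "${FIRECRAWL_URL:-http://localhost:3002}/v1/${endpoint}" \
    -H "Content-Type: application/json" \
    -d "${payload}"
}

# Usage (requires the services to be running):
#   firecrawl_post scrape '{"url": "https://example.com"}'
#   firecrawl_post crawl  '{"url": "https://example.com", "limit": 100}'
```

Keeping the base URL in one variable means a changed `FIRECRAWL_PORT_OVERRIDE` only has to be reflected in one place.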
@@ -80,16 +132,31 @@ curl -X POST http://localhost:3002/v0/crawl \
- **Web Crawling**: Recursively crawl entire websites
- **JavaScript Rendering**: Full support for dynamic JavaScript-rendered pages
- **Markdown Output**: Clean markdown conversion of web content
-- **Rate Limiting**: Built-in rate limiting to prevent abuse
+- **Structured Data Extraction**: Extract data using JSON schemas
+- **Queue Management**: Built-in job queue with Bull
+- **Rate Limiting**: Configurable rate limiting
- **Proxy Support**: Optional proxy configuration for all requests
+- **AI-Powered Features**: Optional OpenAI integration for advanced extraction
+
+## Architecture
+
+This deployment uses the official Firecrawl architecture:
+
+- **API Server**: Handles HTTP requests and manages the job queue
+- **Workers**: Built into the main container, processes scraping jobs
+- **PostgreSQL**: Stores queue metadata and job information
+- **Redis**: Handles job queue and caching
+- **Playwright Service**: Provides browser automation capabilities

## Notes

-- The service uses Playwright for browser automation, supporting complex web pages
-- Redis is used for job queuing and caching
-- Rate limiting is configurable via environment variables
-- For production use, consider scaling the number of workers
-- BLOCK_MEDIA can reduce memory usage by blocking images/videos
+- The service uses the official `ghcr.io/firecrawl/firecrawl` image
+- PostgreSQL uses the official `ghcr.io/firecrawl/nuq-postgres` image for queue management (NUQ - Not Quite Bull)
+- Redis is used for job queuing without password by default (runs on private network)
+- For production use, enable `USE_DB_AUTHENTICATION` and configure Supabase
+- The `BULL_AUTH_KEY` should be changed in production deployments
+- AI features require an `OPENAI_API_KEY` or `OLLAMA_BASE_URL`
+- All workers run within the single API container using the harness mode
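
For the note about changing `BULL_AUTH_KEY`, one possible way to produce a non-default value is sketched below. The function name `gen_key` is hypothetical, and the 32-character hex format is an assumption of this example rather than a documented Firecrawl requirement; it relies only on `/dev/urandom`, `head`, `od`, and `tr`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: emit 16 random bytes as a 32-character lowercase hex string.
gen_key() {
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Print a line suitable for pasting into the .env file.
echo "BULL_AUTH_KEY=$(gen_key)"
```

After updating `.env`, the admin panel path uses the new key in place of `@`.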

## License
