feat: add more
25
src/firecrawl/.env.example
Normal file
@@ -0,0 +1,25 @@
# Firecrawl version
FIRECRAWL_VERSION="v1.16.0"

# Redis version
REDIS_VERSION="7.4.2-alpine"

# Playwright version
PLAYWRIGHT_VERSION="latest"

# Redis configuration
REDIS_PASSWORD="firecrawl"

# Firecrawl configuration
NUM_WORKERS_PER_QUEUE=8
SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE=20
SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL=1

# Playwright configuration (optional)
PROXY_SERVER=""
PROXY_USERNAME=""
PROXY_PASSWORD=""
BLOCK_MEDIA="true"

# Port overrides
FIRECRAWL_PORT_OVERRIDE=3002
96
src/firecrawl/README.md
Normal file
@@ -0,0 +1,96 @@
# Firecrawl

[English](./README.md) | [中文](./README.zh.md)

This service deploys Firecrawl, a web scraping and crawling API powered by Playwright and headless browsers.

## Services

- `firecrawl`: The main Firecrawl API server.
- `redis`: Redis for job queue and caching.
- `playwright`: Playwright service for browser automation.

## Environment Variables

| Variable Name                         | Description                         | Default Value  |
| ------------------------------------- | ----------------------------------- | -------------- |
| FIRECRAWL_VERSION                     | Firecrawl image version             | `v1.16.0`      |
| REDIS_VERSION                         | Redis image version                 | `7.4.2-alpine` |
| PLAYWRIGHT_VERSION                    | Playwright service version          | `latest`       |
| REDIS_PASSWORD                        | Redis password                      | `firecrawl`    |
| NUM_WORKERS_PER_QUEUE                 | Number of workers per queue         | `8`            |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE   | Token bucket size for rate limiting | `20`           |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL | Token refill rate per second        | `1`            |
| PROXY_SERVER                          | Proxy server URL (optional)         | `""`           |
| PROXY_USERNAME                        | Proxy username (optional)           | `""`           |
| PROXY_PASSWORD                        | Proxy password (optional)           | `""`           |
| BLOCK_MEDIA                           | Block media content                 | `true`         |
| FIRECRAWL_PORT_OVERRIDE               | Firecrawl API port                  | `3002`         |

Please modify the `.env` file as needed for your use case.
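
The defaults in the table are applied through `${VAR:-default}` fallbacks in the compose file, and the same expansion syntax exists in the shell. As a minimal sketch of how an unset variable falls back to its default, the hypothetical helper `firecrawl_base_url` below (not part of Firecrawl) resolves the API address from `FIRECRAWL_PORT_OVERRIDE`:

```bash
# Hypothetical helper: print the base URL of the Firecrawl API, using the
# same ${VAR:-default} fallback that docker-compose.yaml uses for the port.
firecrawl_base_url() {
  printf 'http://localhost:%s\n' "${FIRECRAWL_PORT_OVERRIDE:-3002}"
}

# With the variable unset (or empty), the default port 3002 is used.
unset FIRECRAWL_PORT_OVERRIDE
firecrawl_base_url

# With the variable set, the override wins. The subshell keeps the
# assignment from leaking into the rest of the script.
(FIRECRAWL_PORT_OVERRIDE=8080; firecrawl_base_url)
```

Under these assumptions the first call prints `http://localhost:3002` and the second prints `http://localhost:8080`.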

## Volumes

- `redis_data`: Redis data storage for job queues and caching.

## Usage

### Start the Services

```bash
docker-compose up -d
```

### Access the API

The Firecrawl API will be available at:

```text
http://localhost:3002
```

### Example API Calls

**Scrape a Single Page:**

```bash
curl -X POST http://localhost:3002/v0/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```

**Crawl a Website:**

```bash
curl -X POST http://localhost:3002/v0/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "crawlerOptions": {
      "limit": 100
    }
  }'
```
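
When the crawl call is scripted, it can help to build the request body separately from the `curl` invocation. The sketch below is illustrative only: `crawl_payload` and `submit_crawl` are hypothetical helper names, the endpoint path and body shape are copied from the examples above, and the payload builder assumes the URL contains no characters that need JSON escaping.

```bash
# Base URL of the API; falls back to port 3002 when no override is set.
FIRECRAWL_URL="http://localhost:${FIRECRAWL_PORT_OVERRIDE:-3002}"

# Hypothetical helper: print the JSON body for a crawl request.
# $1 = start URL, $2 = page limit (defaults to 100 when omitted or empty).
# Assumes the URL needs no JSON escaping.
crawl_payload() {
  printf '{"url": "%s", "crawlerOptions": {"limit": %d}}\n' "$1" "${2:-100}"
}

# Hypothetical helper: submit the crawl. Requires the stack to be running,
# so it is defined here but not called.
submit_crawl() {
  curl -sS -X POST "$FIRECRAWL_URL/v0/crawl" \
    -H "Content-Type: application/json" \
    -d "$(crawl_payload "$1" "$2")"
}

# Inspect the body without sending anything.
crawl_payload "https://example.com" 25
```

Printing the payload first makes it easy to check the JSON before any request reaches the service.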

## Features

- **Web Scraping**: Extract clean content from any webpage
- **Web Crawling**: Recursively crawl entire websites
- **JavaScript Rendering**: Full support for dynamic JavaScript-rendered pages
- **Markdown Output**: Clean markdown conversion of web content
- **Rate Limiting**: Built-in rate limiting to prevent abuse
- **Proxy Support**: Optional proxy configuration for all requests

## Notes

- The service uses Playwright for browser automation, which lets it handle complex, JavaScript-heavy pages
- Redis is used for job queuing and caching
- Rate limiting is configurable through `SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE` and `SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL`
- For production use, consider raising `NUM_WORKERS_PER_QUEUE` to scale the number of workers
- Setting `BLOCK_MEDIA` to `true` can reduce memory usage by blocking images and videos
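
To get a feel for the rate-limit settings, the following sketch estimates how long a batch of scrape requests would take to clear. It assumes a standard token-bucket model (the bucket starts full, each request consumes one token, and tokens are refilled at a constant rate per second); Firecrawl's actual limiter may behave differently. `min_wait_seconds` is a hypothetical helper, not part of Firecrawl.

```bash
# Hypothetical helper under an assumed standard token-bucket model.
# Usage: min_wait_seconds REQUESTS BUCKET_SIZE REFILL_PER_SECOND
min_wait_seconds() {
  requests=$1
  size=$2
  refill=$3
  if [ "$requests" -le "$size" ]; then
    # The whole batch fits in the initial burst.
    echo 0
  else
    # Requests beyond the burst wait for refills; round up to whole seconds.
    echo $(( (requests - size + refill - 1) / refill ))
  fi
}

# 50 requests with the defaults from .env.example (size 20, refill 1 per second)
min_wait_seconds 50 20 1
```

With the default values this prints `30`: the first 20 requests use the initial burst, and the remaining 30 wait for one token per second.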

## License

Firecrawl is licensed under the AGPL-3.0 License.
96
src/firecrawl/README.zh.md
Normal file
@@ -0,0 +1,96 @@
# Firecrawl

[English](./README.md) | [中文](./README.zh.md)

This service deploys Firecrawl, a web scraping and crawling API powered by Playwright and headless browsers.

## Services

- `firecrawl`: The main Firecrawl API server.
- `redis`: Redis for the job queue and caching.
- `playwright`: Playwright service for browser automation.

## Environment Variables

| Variable Name                         | Description                          | Default Value  |
| ------------------------------------- | ------------------------------------ | -------------- |
| FIRECRAWL_VERSION                     | Firecrawl image version              | `v1.16.0`      |
| REDIS_VERSION                         | Redis image version                  | `7.4.2-alpine` |
| PLAYWRIGHT_VERSION                    | Playwright service version           | `latest`       |
| REDIS_PASSWORD                        | Redis password                       | `firecrawl`    |
| NUM_WORKERS_PER_QUEUE                 | Number of worker processes per queue | `8`            |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE   | Token bucket size for rate limiting  | `20`           |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL | Token refill rate per second         | `1`            |
| PROXY_SERVER                          | Proxy server URL (optional)          | `""`           |
| PROXY_USERNAME                        | Proxy username (optional)            | `""`           |
| PROXY_PASSWORD                        | Proxy password (optional)            | `""`           |
| BLOCK_MEDIA                           | Block media content                  | `true`         |
| FIRECRAWL_PORT_OVERRIDE               | Firecrawl API port                   | `3002`         |

Modify the `.env` file to suit your actual needs.

## Volumes

- `redis_data`: Redis data storage for job queues and caching.

## Usage

### Start the Services

```bash
docker-compose up -d
```

### Access the API

The Firecrawl API can be accessed at the following address:

```text
http://localhost:3002
```

### Example API Calls

**Scrape a Single Page:**

```bash
curl -X POST http://localhost:3002/v0/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```

**Crawl a Website:**

```bash
curl -X POST http://localhost:3002/v0/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "crawlerOptions": {
      "limit": 100
    }
  }'
```
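
## Features

- **Web Scraping**: Extract clean content from any webpage
- **Website Crawling**: Recursively crawl entire websites
- **JavaScript Rendering**: Full support for dynamic JavaScript-rendered pages
- **Markdown Output**: Clean conversion of web content to markdown
- **Rate Limiting**: Built-in rate limiting to prevent abuse
- **Proxy Support**: Optional proxy configuration for all requests

## Notes

- The service uses Playwright for browser automation, supporting complex web pages
- Redis is used for job queuing and caching
- Rate limiting is configurable via environment variables
- For production environments, consider scaling the number of worker processes
- BLOCK_MEDIA can reduce memory usage by blocking images and videos

## License

Firecrawl is licensed under the AGPL-3.0 License.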
75
src/firecrawl/docker-compose.yaml
Normal file
@@ -0,0 +1,75 @@
x-default: &default
  restart: unless-stopped
  volumes:
    - &localtime /etc/localtime:/etc/localtime:ro
    - &timezone /etc/timezone:/etc/timezone:ro
  logging:
    driver: json-file
    options:
      max-size: 100m

services:
  firecrawl:
    <<: *default
    image: mendableai/firecrawl:${FIRECRAWL_VERSION:-v1.16.0}
    container_name: firecrawl
    ports:
      - "${FIRECRAWL_PORT_OVERRIDE:-3002}:3002"
    environment:
      REDIS_URL: redis://:${REDIS_PASSWORD:-firecrawl}@redis:6379
      PLAYWRIGHT_MICROSERVICE_URL: http://playwright:3000
      PORT: 3002
      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE:-8}
      SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE: ${SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE:-20}
      SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL: ${SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL:-1}
    depends_on:
      - redis
      - playwright
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G

  redis:
    <<: *default
    image: redis:${REDIS_VERSION:-7.4.2-alpine}
    container_name: firecrawl-redis
    command: redis-server --requirepass ${REDIS_PASSWORD:-firecrawl} --appendonly yes
    volumes:
      - *localtime
      - *timezone
      - redis_data:/data
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M

  playwright:
    <<: *default
    image: mendableai/firecrawl-playwright:${PLAYWRIGHT_VERSION:-latest}
    container_name: firecrawl-playwright
    environment:
      PORT: 3000
      PROXY_SERVER: ${PROXY_SERVER:-}
      PROXY_USERNAME: ${PROXY_USERNAME:-}
      PROXY_PASSWORD: ${PROXY_PASSWORD:-}
      BLOCK_MEDIA: ${BLOCK_MEDIA:-true}
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G

volumes:
  redis_data: