feat: add more

Sun-ZhenXing
2025-10-06 21:48:39 +08:00
parent f330e00fa0
commit 3c609b5989
120 changed files with 7698 additions and 59 deletions

@@ -0,0 +1,25 @@
# Firecrawl version
FIRECRAWL_VERSION="v1.16.0"

# Redis version
REDIS_VERSION="7.4.2-alpine"

# Playwright version
PLAYWRIGHT_VERSION="latest"

# Redis configuration
REDIS_PASSWORD="firecrawl"

# Firecrawl configuration
NUM_WORKERS_PER_QUEUE=8
SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE=20
SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL=1

# Playwright configuration (optional)
PROXY_SERVER=""
PROXY_USERNAME=""
PROXY_PASSWORD=""
BLOCK_MEDIA="true"

# Port overrides
FIRECRAWL_PORT_OVERRIDE=3002

96
src/firecrawl/README.md Normal file

@@ -0,0 +1,96 @@
# Firecrawl

[English](./README.md) | [中文](./README.zh.md)

This service deploys Firecrawl, a web scraping and crawling API powered by Playwright and headless browsers.

## Services

- `firecrawl`: The main Firecrawl API server.
- `redis`: Redis for the job queue and caching.
- `playwright`: Playwright service for browser automation.

## Environment Variables

| Variable Name                         | Description                         | Default Value  |
| ------------------------------------- | ----------------------------------- | -------------- |
| FIRECRAWL_VERSION                     | Firecrawl image version             | `v1.16.0`      |
| REDIS_VERSION                         | Redis image version                 | `7.4.2-alpine` |
| PLAYWRIGHT_VERSION                    | Playwright service version          | `latest`       |
| REDIS_PASSWORD                        | Redis password                      | `firecrawl`    |
| NUM_WORKERS_PER_QUEUE                 | Number of workers per queue         | `8`            |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE   | Token bucket size for rate limiting | `20`           |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL | Token refill rate per second        | `1`            |
| PROXY_SERVER                          | Proxy server URL (optional)         | `""`           |
| PROXY_USERNAME                        | Proxy username (optional)           | `""`           |
| PROXY_PASSWORD                        | Proxy password (optional)           | `""`           |
| BLOCK_MEDIA                           | Block images/videos in Playwright   | `true`         |
| FIRECRAWL_PORT_OVERRIDE               | Host port for the Firecrawl API     | `3002`         |

Adjust the values in the `.env` file to suit your deployment.
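To make the two rate-limit variables more concrete, here is a rough sketch of generic token-bucket arithmetic using the defaults from the table. It assumes the textbook model (a full bucket absorbs a burst, after which requests proceed at the refill rate); Firecrawl's actual limiter may differ in detail, and the function name `seconds_to_send` is hypothetical.

```bash
#!/usr/bin/env bash
# Generic token-bucket estimate (an assumption, not Firecrawl's code):
# a full bucket of BUCKET_SIZE tokens absorbs a burst, and every request
# beyond that waits for tokens refilled at BUCKET_REFILL per second.
BUCKET_SIZE="${SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE:-20}"
BUCKET_REFILL="${SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL:-1}"

# seconds_to_send N: minimum whole seconds needed to issue N requests
# when starting with a full bucket.
seconds_to_send() {
  local requests="$1"
  local excess=$(( requests - BUCKET_SIZE ))
  if [ "$excess" -le 0 ]; then
    echo 0
  else
    # Ceiling division: BUCKET_REFILL tokens become available each second.
    echo $(( (excess + BUCKET_REFILL - 1) / BUCKET_REFILL ))
  fi
}

seconds_to_send 20   # fits in the initial burst: prints 0 with the defaults
seconds_to_send 50   # 30 requests wait for refills: prints 30 with the defaults
```

Under this model, raising the bucket size allows larger bursts, while raising the refill rate increases the sustained request rate.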
## Volumes

- `redis_data`: Redis data storage for job queues and caching.

## Usage

### Start the Services

```bash
docker-compose up -d
```
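The containers may need a few seconds before they accept connections. The sketch below shows one way to wait for the API port; the helper name `wait_for` and the retry counts are illustrative choices, not part of Firecrawl or Docker.

```bash
#!/usr/bin/env bash
# wait_for ATTEMPTS DELAY CMD...: retry CMD until it succeeds,
# sleeping DELAY seconds after each failed try; returns 1 if every try fails.
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Example (assumes the default port 3002): without -f, curl exits 0 as soon
# as the server answers, whatever HTTP status it returns.
# wait_for 30 2 curl -s -o /dev/null http://localhost:3002 && echo "API is up"
```

Passing the probe as a command keeps the helper reusable, for example with a Redis or Playwright check instead of `curl`.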
### Access the API

By default, the Firecrawl API is available at:

```text
http://localhost:3002
```
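If `FIRECRAWL_PORT_OVERRIDE` is changed, the address changes with it. A small sketch of deriving the base URL with the same `${VAR:-default}` fallback that the compose file uses; the function name `firecrawl_base_url` is hypothetical, and `localhost` assumes the API is called from the Docker host.

```bash
#!/usr/bin/env bash
# Print the API base URL, falling back to port 3002 when
# FIRECRAWL_PORT_OVERRIDE is unset or empty.
firecrawl_base_url() {
  printf 'http://localhost:%s\n' "${FIRECRAWL_PORT_OVERRIDE:-3002}"
}

firecrawl_base_url
```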
### Example API Calls

**Scrape a Single Page:**

```bash
curl -X POST http://localhost:3002/v0/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```

**Crawl a Website:**

```bash
curl -X POST http://localhost:3002/v0/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "crawlerOptions": {
      "limit": 100
    }
  }'
```
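The crawl call can be wrapped in a small script so that the target URL and page limit become parameters. This is only a sketch: `crawl_payload` is a hypothetical helper, it assumes the URL contains no double quotes or backslashes (no JSON escaping is done), and the request shape simply mirrors the crawl example in this section.

```bash
#!/usr/bin/env bash
# crawl_payload URL [LIMIT]: print a JSON body for the crawl endpoint.
# LIMIT defaults to 100, matching the example request.
crawl_payload() {
  local url="$1" limit="${2:-100}"
  printf '{"url": "%s", "crawlerOptions": {"limit": %d}}\n' "$url" "$limit"
}

# Usage (not run here): POST the payload to a running instance.
# curl -X POST http://localhost:3002/v0/crawl \
#   -H "Content-Type: application/json" \
#   -d "$(crawl_payload https://example.com 10)"
```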
## Features

- **Web Scraping**: Extract clean content from any webpage
- **Web Crawling**: Recursively crawl entire websites
- **JavaScript Rendering**: Full support for dynamic JavaScript-rendered pages
- **Markdown Output**: Clean markdown conversion of web content
- **Rate Limiting**: Built-in rate limiting to prevent abuse
- **Proxy Support**: Optional proxy configuration for all requests

## Notes

- The service uses Playwright for browser automation, so it can handle complex, script-heavy pages
- Redis is used for job queuing and caching
- Rate limiting is configurable via the `SCRAPE_RATE_LIMIT_TOKEN_BUCKET_*` environment variables
- For production use, consider raising `NUM_WORKERS_PER_QUEUE` to scale the number of workers
- `BLOCK_MEDIA` can reduce memory usage by blocking images and videos

## License

Firecrawl is licensed under the AGPL-3.0 License.

@@ -0,0 +1,96 @@
# Firecrawl

[English](./README.md) | [中文](./README.zh.md)

This service deploys Firecrawl, a web scraping and crawling API powered by Playwright and headless browsers.

## Services

- `firecrawl`: The main Firecrawl API server.
- `redis`: Redis for the job queue and caching.
- `playwright`: Playwright service for browser automation.

## Environment Variables

| Variable Name                         | Description                         | Default Value  |
| ------------------------------------- | ----------------------------------- | -------------- |
| FIRECRAWL_VERSION                     | Firecrawl image version             | `v1.16.0`      |
| REDIS_VERSION                         | Redis image version                 | `7.4.2-alpine` |
| PLAYWRIGHT_VERSION                    | Playwright service version          | `latest`       |
| REDIS_PASSWORD                        | Redis password                      | `firecrawl`    |
| NUM_WORKERS_PER_QUEUE                 | Number of worker processes per queue | `8`           |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE   | Token bucket size for rate limiting | `20`           |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL | Token refill rate per second        | `1`            |
| PROXY_SERVER                          | Proxy server URL (optional)         | `""`           |
| PROXY_USERNAME                        | Proxy username (optional)           | `""`           |
| PROXY_PASSWORD                        | Proxy password (optional)           | `""`           |
| BLOCK_MEDIA                           | Block media content                 | `true`         |
| FIRECRAWL_PORT_OVERRIDE               | Firecrawl API port                  | `3002`         |

Modify the `.env` file according to your actual needs.
## Volumes

- `redis_data`: Redis data storage for the job queue and cache.

## Usage

### Start the Services

```bash
docker-compose up -d
```

### Access the API

The Firecrawl API can be reached at the following address:

```text
http://localhost:3002
```

### Example API Calls

**Scrape a single page:**

```bash
curl -X POST http://localhost:3002/v0/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```

**Crawl a website:**

```bash
curl -X POST http://localhost:3002/v0/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "crawlerOptions": {
      "limit": 100
    }
  }'
```
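## Features

- **Web Scraping**: Extract clean content from any webpage
- **Website Crawling**: Recursively crawl entire websites
- **JavaScript Rendering**: Full support for dynamically rendered JavaScript pages
- **Markdown Output**: Cleanly convert web content to markdown
- **Rate Limiting**: Built-in rate limiting to prevent abuse
- **Proxy Support**: Optional proxy configuration for all requests

## Notes

- The service uses Playwright for browser automation and supports complex web pages
- Redis is used for the job queue and caching
- Rate limiting can be configured through environment variables
- For production environments, consider increasing the number of worker processes
- BLOCK_MEDIA can reduce memory usage by blocking images/videos

## License

Firecrawl is licensed under AGPL-3.0.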

@@ -0,0 +1,75 @@
x-default: &default
  restart: unless-stopped
  volumes:
    - &localtime /etc/localtime:/etc/localtime:ro
    - &timezone /etc/timezone:/etc/timezone:ro
  logging:
    driver: json-file
    options:
      max-size: 100m

services:
  firecrawl:
    <<: *default
    image: mendableai/firecrawl:${FIRECRAWL_VERSION:-v1.16.0}
    container_name: firecrawl
    ports:
      - "${FIRECRAWL_PORT_OVERRIDE:-3002}:3002"
    environment:
      REDIS_URL: redis://:${REDIS_PASSWORD:-firecrawl}@redis:6379
      PLAYWRIGHT_MICROSERVICE_URL: http://playwright:3000
      PORT: 3002
      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE:-8}
      SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE: ${SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE:-20}
      SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL: ${SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL:-1}
    depends_on:
      - redis
      - playwright
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G

  redis:
    <<: *default
    image: redis:${REDIS_VERSION:-7.4.2-alpine}
    container_name: firecrawl-redis
    command: redis-server --requirepass ${REDIS_PASSWORD:-firecrawl} --appendonly yes
    volumes:
      - *localtime
      - *timezone
      - redis_data:/data
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 512M
        reservations:
          cpus: '0.5'
          memory: 256M

  playwright:
    <<: *default
    image: mendableai/firecrawl-playwright:${PLAYWRIGHT_VERSION:-latest}
    container_name: firecrawl-playwright
    environment:
      PORT: 3000
      PROXY_SERVER: ${PROXY_SERVER:-}
      PROXY_USERNAME: ${PROXY_USERNAME:-}
      PROXY_PASSWORD: ${PROXY_PASSWORD:-}
      BLOCK_MEDIA: ${BLOCK_MEDIA:-true}
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G

volumes:
  redis_data: