feat: add more

Sun-ZhenXing
2025-10-06 21:48:39 +08:00
parent f330e00fa0
commit 3c609b5989
120 changed files with 7698 additions and 59 deletions

src/firecrawl/README.md Normal file

@@ -0,0 +1,96 @@
# Firecrawl
[English](./README.md) | [中文](./README.zh.md)

This service deploys Firecrawl, a web scraping and crawling API that renders pages with Playwright-driven headless browsers.
## Services
- `firecrawl`: The main Firecrawl API server.
- `redis`: Redis for job queue and caching.
- `playwright`: Playwright service for browser automation.
## Environment Variables
| Variable Name | Description | Default Value |
| ------------------------------------- | ----------------------------------- | -------------- |
| FIRECRAWL_VERSION | Firecrawl image version | `v1.16.0` |
| REDIS_VERSION | Redis image version | `7.4.2-alpine` |
| PLAYWRIGHT_VERSION | Playwright service version | `latest` |
| REDIS_PASSWORD | Redis password | `firecrawl` |
| NUM_WORKERS_PER_QUEUE | Number of workers per queue | `8` |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE | Token bucket size for rate limiting | `20` |
| SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL | Token refill rate per second | `1` |
| PROXY_SERVER | Proxy server URL (optional) | `""` |
| PROXY_USERNAME | Proxy username (optional) | `""` |
| PROXY_PASSWORD | Proxy password (optional) | `""` |
| BLOCK_MEDIA | Block media content | `true` |
| FIRECRAWL_PORT_OVERRIDE               | Firecrawl API port                  | `3002`         |

Please modify the `.env` file as needed for your use case.
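
As a rough illustration of such overrides, a `.env` file might look like the sketch below. The variable names come from the table above; every value is an illustrative assumption, and `proxy.example.com` is a hypothetical host.

```bash
# .env (illustrative overrides; variables left out keep the defaults listed above)
FIRECRAWL_VERSION=v1.16.0
# Replace the default Redis password (example value only)
REDIS_PASSWORD=change-me
# Fewer workers for a small host (assumed value, tune for your hardware)
NUM_WORKERS_PER_QUEUE=4
# Serve the API on a different port (assumed value)
FIRECRAWL_PORT_OVERRIDE=8080
# Hypothetical proxy endpoint; leave empty to run without a proxy
PROXY_SERVER=http://proxy.example.com:8080
```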
## Volumes
- `redis_data`: Redis data storage for job queues and caching.
## Usage
### Start the Services
```bash
docker compose up -d
```
### Access the API
The Firecrawl API will be available at:
```text
http://localhost:3002
```
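
If `FIRECRAWL_PORT_OVERRIDE` is changed, the address above presumably changes with it. A small sketch, assuming the variable sets the port exposed on `localhost`; the helper name `firecrawl_base_url` is hypothetical and not part of Firecrawl:

```bash
# Print the API base URL, falling back to the documented default port 3002.
# Assumption: FIRECRAWL_PORT_OVERRIDE controls the port exposed on localhost.
firecrawl_base_url() {
  echo "http://localhost:${FIRECRAWL_PORT_OVERRIDE:-3002}"
}
```

Scripts can then call `"$(firecrawl_base_url)/v0/scrape"` instead of hard-coding the port.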
### Example API Calls
**Scrape a Single Page:**
```bash
curl -X POST http://localhost:3002/v0/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
```
**Crawl a Website:**
```bash
curl -X POST http://localhost:3002/v0/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "crawlerOptions": {
      "limit": 100
    }
  }'
```
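
For repeated calls, the requests above can be wrapped in small shell helpers. This is only a sketch: `FIRECRAWL_URL`, `crawl_payload`, and `firecrawl_post` are hypothetical names introduced here, and the request body simply mirrors the crawl example above.

```bash
# Hypothetical helper functions; they are not part of Firecrawl itself.
FIRECRAWL_URL="${FIRECRAWL_URL:-http://localhost:3002}"

# Build the JSON body for a crawl request (limit defaults to 100).
# Assumption: the URL contains no double quotes or backslashes,
# so no JSON escaping is performed.
crawl_payload() {
  local url="$1"
  local limit="${2:-100}"
  printf '{"url": "%s", "crawlerOptions": {"limit": %d}}\n' "$url" "$limit"
}

# POST a JSON body to an API path.
# --fail makes curl exit non-zero when the server returns an HTTP error.
firecrawl_post() {
  local path="$1"
  local body="$2"
  curl --fail --silent --show-error -X POST "${FIRECRAWL_URL}${path}" \
    -H "Content-Type: application/json" \
    -d "$body"
}
```

With the services running, `firecrawl_post /v0/crawl "$(crawl_payload https://example.com 10)"` would send the same kind of request as the crawl example, limited to 10 pages.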
## Features
- **Web Scraping**: Extract clean content from any webpage
- **Web Crawling**: Recursively crawl entire websites
- **JavaScript Rendering**: Full support for dynamic JavaScript-rendered pages
- **Markdown Output**: Clean markdown conversion of web content
- **Rate Limiting**: Built-in rate limiting to prevent abuse
- **Proxy Support**: Optional proxy configuration for all requests
## Notes
- The service uses Playwright for browser automation, supporting complex web pages
- Redis is used for job queuing and caching
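- Rate limiting is configurable via `SCRAPE_RATE_LIMIT_TOKEN_BUCKET_SIZE` and `SCRAPE_RATE_LIMIT_TOKEN_BUCKET_REFILL`
- For production use, consider scaling the number of workers (`NUM_WORKERS_PER_QUEUE`)
- Setting `BLOCK_MEDIA` to `true` can reduce memory usage by blocking images and videos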
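
To give a feel for how the two rate-limit settings interact, here is a generic token bucket simulation. It is a sketch of the general technique only, not Firecrawl's actual implementation; the function name `simulate_bucket` and the traffic pattern (a fixed burst arriving every second) are assumptions made for illustration.

```bash
# Generic token bucket simulation (not Firecrawl's actual implementation).
# Usage: simulate_bucket SIZE REFILL BURST SECONDS
# Each second BURST requests arrive; prints how many are allowed per second.
simulate_bucket() {
  local size="$1"
  local refill="$2"
  local burst="$3"
  local seconds="$4"
  local tokens="$size"   # the bucket starts full
  local t=0
  local allowed
  local out=""
  while [ "$t" -lt "$seconds" ]; do
    if [ "$t" -gt 0 ]; then
      # Add the per-second refill, capped at the bucket size.
      tokens=$((tokens + refill))
      if [ "$tokens" -gt "$size" ]; then tokens="$size"; fi
    fi
    # A request is allowed only while a token is available.
    if [ "$burst" -lt "$tokens" ]; then allowed="$burst"; else allowed="$tokens"; fi
    tokens=$((tokens - allowed))
    out="${out}${allowed} "
    t=$((t + 1))
  done
  echo "${out% }"
}
```

With the default values from the table (`20` and `1`), `simulate_bucket 20 1 25 3` prints `20 1 1`: the first burst drains the bucket, after which only about one request per second gets through.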
## License
Firecrawl is licensed under the AGPL-3.0 License.