refactor: ./apps/*
13
apps/easy-dataset/.env.example
Normal file
@@ -0,0 +1,13 @@
# Easy Dataset Configuration

# Image version
# Find latest releases at: https://github.com/ConardLi/easy-dataset/releases
EASY_DATASET_VERSION=1.5.1

# Port configuration
# Override the host port for the web interface
EASY_DATASET_PORT_OVERRIDE=1717

# Timezone configuration
# Set your timezone (e.g., UTC, Asia/Shanghai, America/New_York)
TZ=UTC
161
apps/easy-dataset/README.md
Normal file
@@ -0,0 +1,161 @@
# Easy Dataset

[English](./README.md) | [中文](./README.zh.md)

This service deploys Easy Dataset, a powerful tool for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

## Services

- `easy-dataset`: The main Easy Dataset application server with a built-in SQLite database.

## Environment Variables

| Variable Name              | Description                         | Default Value |
| -------------------------- | ----------------------------------- | ------------- |
| EASY_DATASET_VERSION       | Easy Dataset image version          | `1.5.1`       |
| EASY_DATASET_PORT_OVERRIDE | Host port mapping for web interface | `1717`        |
| TZ                         | System timezone                     | `UTC`         |

All variables are optional. To override a default, copy `.env.example` to `.env` and adjust the values for your use case.
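
As a minimal sketch of that step, the helper below sets or updates a single key in a `.env` file. The name `set_env_var` is hypothetical and not part of this repository; the sketch assumes a shell with `awk`, `grep`, and `mktemp` available.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository):
#   set_env_var FILE KEY VALUE
# Replaces KEY=... in FILE if the key exists, otherwise appends it.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"
  if grep -q "^${key}=" "$file"; then
    tmp="$(mktemp)"
    # Rewrite only the matching line; every other line is copied unchanged.
    awk -F= -v k="$key" -v v="$value" \
      '$1 == k { print k "=" v; next } { print }' "$file" > "$tmp"
    mv "$tmp" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example usage:
#   cp .env.example .env
#   set_env_var .env TZ Asia/Shanghai
#   set_env_var .env EASY_DATASET_PORT_OVERRIDE 8080
```

Comment lines and blank lines in the file are left untouched, so the template's explanatory comments survive repeated edits.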

## Volumes

- `easy_dataset_db`: A named volume for storing the SQLite database and uploaded files.
- `easy_dataset_prisma`: (Optional) A named volume for Prisma database files; it is commented out in `docker-compose.yaml` by default.

## Getting Started

### Quick Start (Recommended)

1. (Optional) Create a `.env` file to customize settings:

   ```env
   EASY_DATASET_VERSION=1.5.1
   EASY_DATASET_PORT_OVERRIDE=1717
   TZ=Asia/Shanghai
   ```

2. Start the service:

   ```bash
   docker compose up -d
   ```

3. Access Easy Dataset at `http://localhost:1717` (or the host port set in `EASY_DATASET_PORT_OVERRIDE`).
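
The web interface may not respond immediately after `docker compose up -d` returns (the Compose healthcheck allows a 30-second start period). A minimal sketch for waiting until it answers, assuming `curl` is installed on the host; `wait_for_http` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it returns a successful HTTP status or attempts run out.
wait_for_http() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP error responses (>= 400).
    if curl --silent --fail --output /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage:
#   wait_for_http http://localhost:1717 && echo "Easy Dataset is up"
```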

### With Prisma Database Mount (Advanced)

If you need to mount the Prisma database files:

1. Initialize the database first:

   ```bash
   # Clone the repository and initialize the database
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
   npm install
   npm run db:push
   ```

2. Uncomment the Prisma volume mount in `docker-compose.yaml`, along with the `easy_dataset_prisma` entry under the top-level `volumes:` key:

   ```yaml
   volumes:
     - easy_dataset_db:/app/local-db
     - easy_dataset_prisma:/app/prisma # Uncomment this line
   ```

3. Start the service:

   ```bash
   docker compose up -d
   ```

## Features

- **Intelligent Document Processing**: Supports PDF, Markdown, DOCX, and more
- **Smart Text Splitting**: Multiple algorithms with customizable segmentation
- **Question Generation**: Automatically extracts relevant questions from text
- **Domain Labels**: Builds global domain labels for the dataset
- **Answer Generation**: Uses LLM APIs to generate comprehensive answers and Chain of Thought (CoT)
- **Flexible Editing**: Edit questions, answers, and datasets at any stage
- **Multiple Export Formats**: Alpaca, ShareGPT, multilingual-thinking (JSON/JSONL)
- **Wide Model Support**: Compatible with LLM APIs that follow the OpenAI format

## Usage Workflow

1. **Create a Project**: Set up a new project with LLM API configuration
2. **Upload Documents**: Add your domain-specific files (PDF, Markdown, etc.)
3. **Text Splitting**: Review and adjust the automatically split text segments
4. **Generate Questions**: Generate questions in batches from text blocks
5. **Create Datasets**: Generate answers using the configured LLM
6. **Export**: Export datasets in your preferred format

## Default Credentials

Easy Dataset does not require authentication by default. Access control should be implemented at the infrastructure level (e.g., reverse proxy, firewall rules).

## Resource Limits

The service is configured with the following resource reservations and limits:

- **CPU**: 0.5 cores reserved, 2.0 cores maximum
- **Memory**: 1 GB reserved, 4 GB maximum

These limits can be adjusted in `docker-compose.yaml` based on your workload requirements.
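
The Compose file reads these values from environment variables with defaults (`EASY_DATASET_CPU_LIMIT`, `EASY_DATASET_MEMORY_LIMIT`, `EASY_DATASET_CPU_RESERVATION`, `EASY_DATASET_MEMORY_RESERVATION`), so they can also be set in `.env` without editing the YAML. A sketch with illustrative values; the numbers below are assumptions, not recommendations:

```bash
# Append illustrative overrides to .env in the current directory.
# If these keys already exist in the file, edit them in place instead.
cat >> .env <<'EOF'
EASY_DATASET_CPU_LIMIT=1.0
EASY_DATASET_MEMORY_LIMIT=2G
EASY_DATASET_CPU_RESERVATION=0.25
EASY_DATASET_MEMORY_RESERVATION=512M
EOF

# To confirm the values are picked up, render the resolved configuration:
#   docker compose config
# Then recreate the container so the new limits take effect:
#   docker compose up -d
```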

## Security Considerations

- **Data Privacy**: All data processing happens locally
- **API Keys**: Store LLM API keys securely within the application
- **Access Control**: Implement network-level access restrictions as needed
- **Updates**: Regularly update to the latest version for security patches

## Documentation

- Official Documentation: [https://docs.easy-dataset.com/](https://docs.easy-dataset.com/)
- GitHub Repository: [https://github.com/ConardLi/easy-dataset](https://github.com/ConardLi/easy-dataset)
- Video Tutorial: [Bilibili](https://www.bilibili.com/video/BV1y8QpYGE57/)
- Research Paper: [arXiv:2507.04009](https://arxiv.org/abs/2507.04009v1)

## Troubleshooting

### Container Won't Start

- Check logs: `docker compose logs easy-dataset`
- Verify that host port 1717 (or your `EASY_DATASET_PORT_OVERRIDE` value) is not already in use
- Ensure sufficient system resources
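
The checks above can be sketched in shell. `port_in_use` and `container_health` are hypothetical helper names; the first relies on Bash's `/dev/tcp` pseudo-device (a Bash feature, not available in every shell), and the second assumes the service is named `easy-dataset` as in `docker-compose.yaml`.

```bash
#!/usr/bin/env bash
# Returns 0 if something accepts TCP connections on 127.0.0.1:PORT.
port_in_use() {
  local port="$1"
  # The subshell keeps the temporary file descriptor from leaking.
  (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null
}

# Prints the health status reported by the Compose healthcheck
# (starting, healthy, or unhealthy).
container_health() {
  docker inspect --format '{{.State.Health.Status}}' \
    "$(docker compose ps -q easy-dataset)"
}

# Example usage (run port_in_use before starting the stack):
#   port_in_use 1717 && echo "port 1717 is already taken"
#   container_health
```

Note that `port_in_use` only detects listeners reachable on the loopback address, and it also returns success once Easy Dataset itself is running on that port.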

### Database Issues

- For SQLite issues, remove and recreate the volume. This deletes all data stored in the volume, including the database and uploaded files:

  ```bash
  docker compose down -v
  docker compose up -d
  ```
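
Because `docker compose down -v` removes the named volumes along with the containers, it may be worth archiving the data first. A sketch, assuming the `alpine` image can be pulled and that the volume carries Compose's default project prefix (for example `easy-dataset_easy_dataset_db`; check `docker volume ls` for the actual name). `backup_archive_name` and `backup_volume` are hypothetical helpers:

```bash
#!/usr/bin/env bash
# Builds a timestamped archive file name for a volume.
backup_archive_name() {
  printf '%s-%s.tar.gz\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# backup_volume VOLUME [DEST_DIR]
# Archives the volume contents into DEST_DIR (default: current directory).
backup_volume() {
  local volume="$1" dest_dir="${2:-$PWD}" archive
  archive="$(backup_archive_name "$volume")"
  # Mount the volume read-only and write the archive to the host directory.
  docker run --rm \
    -v "${volume}:/data:ro" \
    -v "${dest_dir}:/backup" \
    alpine tar czf "/backup/${archive}" -C /data .
}

# Example usage (volume name is an assumption; verify with `docker volume ls`):
#   docker compose stop
#   backup_volume easy-dataset_easy_dataset_db
```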

### Permission Errors

- Ensure the container has write access to mounted volumes
- Check Docker volume permissions

## License

Easy Dataset is licensed under AGPL 3.0. See the [LICENSE](https://github.com/ConardLi/easy-dataset/blob/main/LICENSE) file for details.

## Citation

If this work is helpful, please cite:

```bibtex
@misc{miao2025easydataset,
      title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
      author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
      year={2025},
      eprint={2507.04009},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04009}
}
```
145
apps/easy-dataset/README.zh.md
Normal file
@@ -0,0 +1,145 @@
# Easy Dataset

[English](./README.md) | [中文](./README.zh.md)

This service deploys Easy Dataset, a powerful tool for creating fine-tuning datasets for large language models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

## Services

- `easy-dataset`: The main application server, with a built-in SQLite database.

## Environment Variables

| Variable Name              | Description                            | Default Value |
| -------------------------- | -------------------------------------- | ------------- |
| EASY_DATASET_VERSION       | Easy Dataset image version             | `1.5.1`       |
| EASY_DATASET_PORT_OVERRIDE | Host port mapping for the web interface | `1717`       |
| TZ                         | System timezone                        | `UTC`         |

Create a `.env` file and modify it to suit your use case.

## Volumes

- `easy_dataset_db`: A named volume for storing the SQLite database and uploaded files.
- `easy_dataset_prisma`: (Optional) A named volume for Prisma database files, if needed.

## Getting Started

### Quick Start (Recommended)

1. (Optional) Create a `.env` file to customize settings:

   ```env
   EASY_DATASET_VERSION=1.5.1
   EASY_DATASET_PORT_OVERRIDE=1717
   TZ=Asia/Shanghai
   ```

2. Start the service:

   ```bash
   docker compose up -d
   ```

3. Access Easy Dataset: `http://localhost:1717`
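
### With Prisma Database Mount (Advanced)

If you need to mount the Prisma database files:

1. Initialize the database first:

   ```bash
   # Clone the repository and initialize the database
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
   npm install
   npm run db:push
   ```

2. Uncomment the Prisma volume mount in `docker-compose.yaml`, along with the `easy_dataset_prisma` entry under the top-level `volumes:` key:

   ```yaml
   volumes:
     - easy_dataset_db:/app/local-db
     - easy_dataset_prisma:/app/prisma # Uncomment this line
   ```

3. Start the service:

   ```bash
   docker compose up -d
   ```

## Features

- **Intelligent Document Processing**: Supports PDF, Markdown, DOCX, and other formats
- **Smart Text Splitting**: Multiple algorithms with customizable segmentation
- **Question Generation**: Automatically extracts relevant questions from text
- **Domain Labels**: Builds global domain labels for the dataset
- **Answer Generation**: Uses LLM APIs to generate comprehensive answers and Chain of Thought (CoT)
- **Flexible Editing**: Edit questions, answers, and datasets at any stage
- **Multiple Export Formats**: Alpaca, ShareGPT, multilingual-thinking (JSON/JSONL)
- **Wide Model Support**: Compatible with LLM APIs that follow the OpenAI format

## Usage Workflow

1. **Create a Project**: Set up a new project and configure the LLM API
2. **Upload Documents**: Add your domain-specific files (PDF, Markdown, etc.)
3. **Text Splitting**: Review and adjust the automatically split text segments
4. **Generate Questions**: Generate questions in batches from text blocks
5. **Create Datasets**: Generate answers using the configured LLM
6. **Export**: Export datasets in your preferred format

## Default Credentials

Easy Dataset does not require authentication by default. Access control should be implemented at the infrastructure level (e.g., reverse proxy, firewall rules).

## Resource Limits

The service is configured with the following resource reservations and limits:

- **CPU**: 0.5 cores reserved, 2.0 cores maximum
- **Memory**: 1 GB reserved, 4 GB maximum

These limits can be adjusted in `docker-compose.yaml` based on your workload requirements.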

## Security Considerations

- **Data Privacy**: All data processing happens locally
- **API Keys**: Store LLM API keys securely within the application
- **Access Control**: Implement network-level access restrictions as needed
- **Updates**: Regularly update to the latest version for security patches

## Documentation

- Official Documentation: [https://docs.easy-dataset.com/](https://docs.easy-dataset.com/)
- GitHub Repository: [https://github.com/ConardLi/easy-dataset](https://github.com/ConardLi/easy-dataset)
- Video Tutorial: [Bilibili](https://www.bilibili.com/video/BV1y8QpYGE57/)
- Research Paper: [arXiv:2507.04009](https://arxiv.org/abs/2507.04009v1)

## Troubleshooting

### Container Won't Start

- Check logs: `docker compose logs easy-dataset`
- Verify that port 1717 is not already in use
- Ensure sufficient system resources

### Database Issues

- For SQLite issues, remove and recreate the volume. This deletes all data stored in the volume:

  ```bash
  docker compose down -v
  docker compose up -d
  ```

### Permission Errors

- Ensure the container has write access to mounted volumes
- Check Docker volume permissions

## License

Easy Dataset is licensed under AGPL 3.0. See the [LICENSE](https://github.com/ConardLi/easy-dataset/blob/main/LICENSE) file for details.
40
apps/easy-dataset/docker-compose.yaml
Normal file
@@ -0,0 +1,40 @@
x-defaults: &defaults
  restart: unless-stopped
  logging:
    driver: json-file
    options:
      max-size: 100m
      max-file: "3"

services:
  easy-dataset:
    <<: *defaults
    image: ${GHCR_IO_REGISTRY:-ghcr.io}/conardli/easy-dataset:${EASY_DATASET_VERSION:-1.5.1}
    ports:
      - "${EASY_DATASET_PORT_OVERRIDE:-1717}:1717"
    volumes:
      - easy_dataset_db:/app/local-db
      # Uncomment the following line if you need to mount Prisma database files
      # Note: You need to run 'npm run db:push' first to initialize the database
      # - easy_dataset_prisma:/app/prisma
    environment:
      - TZ=${TZ:-UTC}
      - NODE_ENV=production
    deploy:
      resources:
        limits:
          cpus: ${EASY_DATASET_CPU_LIMIT:-2.0}
          memory: ${EASY_DATASET_MEMORY_LIMIT:-4G}
        reservations:
          cpus: ${EASY_DATASET_CPU_RESERVATION:-0.5}
          memory: ${EASY_DATASET_MEMORY_RESERVATION:-1G}
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:1717"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

volumes:
  easy_dataset_db:
  # easy_dataset_prisma: