Easy Dataset

This service deploys Easy Dataset, a powerful tool for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

Services

easy-dataset: The main Easy Dataset application server with built-in SQLite database.

Environment Variables

| Variable Name | Description | Default Value |
| --- | --- | --- |
| EASY_DATASET_VERSION | Easy Dataset image version | `1.5.1` |
| EASY_DATASET_PORT_OVERRIDE | Host port mapping for web interface | `1717` |
| TZ | System timezone | `UTC` |

Please create a .env file and modify it as needed for your use case.
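
Docker Compose substitutes variables from `.env` using shell-style `${NAME:-default}` syntax. Assuming this service's docker-compose.yaml declares its defaults that way (the file is not shown here, so this is an assumption), the following plain-shell sketch shows how one override and the documented defaults would resolve:

```shell
# Assumption: docker-compose.yaml references each variable as ${NAME:-default}.
# The same expansion in plain sh shows what each value resolves to.
unset EASY_DATASET_VERSION EASY_DATASET_PORT_OVERRIDE TZ

# Example override, as a line in .env would set it
EASY_DATASET_PORT_OVERRIDE=8080

resolved_version="${EASY_DATASET_VERSION:-1.5.1}"
resolved_port="${EASY_DATASET_PORT_OVERRIDE:-1717}"
resolved_tz="${TZ:-UTC}"

# prints: version=1.5.1 port=8080 tz=UTC
echo "version=$resolved_version port=$resolved_port tz=$resolved_tz"
```

Running `docker compose config` in the service directory prints the configuration with the real substitutions applied, which is a quick way to confirm what a `.env` change actually does.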

Volumes

easy_dataset_db: A named volume for storing the SQLite database and uploaded files.
easy_dataset_prisma: (Optional) A named volume for Prisma database files if needed.

Getting Started

Quick Start (Recommended)

(Optional) Create a .env file to customize settings:

```
EASY_DATASET_VERSION=1.5.1
EASY_DATASET_PORT_OVERRIDE=1717
TZ=Asia/Shanghai
```

Start the service:
```
docker compose up -d
```
Access Easy Dataset at http://localhost:1717
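
The container may take a few seconds to start serving requests after `docker compose up -d` returns. A minimal sketch for waiting until the web interface answers, assuming `curl` is installed on the host; `wait_for_http` is a hypothetical helper, not something shipped with Easy Dataset:

```shell
# Poll a URL until it answers successfully or the attempts run out.
# wait_for_http is a hypothetical helper name used only for this sketch.
wait_for_http() {
  url="$1"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: silent, -o /dev/null: discard body
    if curl -fs -o /dev/null "$url"; then
      echo "ready: $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}

# Usage (not executed here):
#   wait_for_http "http://localhost:${EASY_DATASET_PORT_OVERRIDE:-1717}"
```

If the function reports that the service is not ready, the container logs are the next place to look.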

With Prisma Database Mount (Advanced)

If you need to mount the Prisma database files:

Initialize the database first:

```
# Clone the repository and initialize database
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
npm install
npm run db:push
```

Uncomment the Prisma volume mount in docker-compose.yaml:

```
volumes:
  - easy_dataset_db:/app/local-db
  - easy_dataset_prisma:/app/prisma  # Uncomment this line
```

Start the service:
```
docker compose up -d
```

Features

Intelligent Document Processing: Supports PDF, Markdown, DOCX, and more
Smart Text Splitting: Multiple algorithms with customizable segmentation
Question Generation: Automatically extracts relevant questions from text
Domain Labels: Builds global domain labels for the dataset based on an overall understanding of the content
Answer Generation: Uses LLM APIs to generate comprehensive answers and Chain of Thought (COT)
Flexible Editing: Edit questions, answers, and datasets at any stage
Multiple Export Formats: Alpaca, ShareGPT, multilingual-thinking (JSON/JSONL)
Wide Model Support: Compatible with all LLM APIs following OpenAI format

Usage Workflow

Create a Project: Set up a new project with LLM API configuration
Upload Documents: Add your domain-specific files (PDF, Markdown, etc.)
Text Splitting: Review and adjust automatically split text segments
Generate Questions: Batch-generate questions from text blocks
Create Datasets: Generate answers using the configured LLM
Export: Export datasets in your preferred format

Default Credentials

Easy Dataset does not require authentication by default. Access control should be implemented at the infrastructure level (e.g., reverse proxy, firewall rules).

Resource Limits

The service is configured with the following resource limits:

CPU: 0.5-2.0 cores
Memory: 1-4 GB

These limits can be adjusted in docker-compose.yaml based on your workload requirements.
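
Before changing the limits, it can help to see how close the running container is to them. A minimal sketch, assuming it is run from the directory containing docker-compose.yaml; `show_usage` is a hypothetical helper, and `easy-dataset` is the service name listed under Services:

```shell
# show_usage is a hypothetical helper name used only for this sketch.
show_usage() {
  service="${1:-easy-dataset}"
  # Resolve the service to its container ID; empty if it is not running
  container_id="$(docker compose ps -q "$service")"
  if [ -z "$container_id" ]; then
    echo "service $service is not running" >&2
    return 1
  fi
  # One-shot snapshot; MemUsage is reported as "used / limit"
  docker stats --no-stream \
    --format '{{.Name}}  cpu={{.CPUPerc}}  mem={{.MemUsage}}' \
    "$container_id"
}

# Usage (not executed here):
#   show_usage
```

Memory usage that stays near the limit during document processing suggests raising the memory limit before raising CPU.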

Security Considerations

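Data Privacy: Documents and generated datasets are stored and processed locally; text used for question and answer generation is sent to the LLM API you configure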
API Keys: Store LLM API keys securely within the application
Access Control: Implement network-level access restrictions as needed
Updates: Regularly update to the latest version for security patches

Documentation

Official Documentation: https://docs.easy-dataset.com/
GitHub Repository: https://github.com/ConardLi/easy-dataset
Video Tutorial: Bilibili
Research Paper: arXiv:2507.04009

Troubleshooting

Container Won't Start

Check logs: `docker compose logs easy-dataset`
Verify port 1717 is not already in use
Ensure sufficient system resources

Database Issues

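For SQLite issues, remove and recreate the volume. Note that `down -v` deletes every named volume declared in the compose file, including the SQLite database and uploaded files, so back up anything you need first: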
```
docker compose down -v
docker compose up -d
```
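
Because `docker compose down -v` removes the named volumes along with the containers, archiving the data volume beforehand keeps a way back. A minimal sketch using a throwaway `alpine` container; `backup_volume` is a hypothetical helper, and the volume name must be the full name shown by `docker volume ls` (Compose usually prefixes it with the project name, so it may differ from `easy_dataset_db`):

```shell
# backup_volume is a hypothetical helper name used only for this sketch.
# $1: full volume name as listed by `docker volume ls`
# $2: host directory for the archive (defaults to the current directory)
backup_volume() {
  volume="$1"
  out_dir="${2:-$PWD}"
  archive="$volume-$(date +%Y%m%d-%H%M%S).tar.gz"
  # Mount the volume read-only and write a tarball into the host directory
  docker run --rm \
    -v "$volume":/data:ro \
    -v "$out_dir":/backup \
    alpine tar czf "/backup/$archive" -C /data . &&
    echo "$out_dir/$archive"
}

# Usage (not executed here; the volume name is an example):
#   backup_volume easy-dataset_easy_dataset_db ~/backups
```

Stopping the service before taking the archive avoids copying the SQLite file while it is being written.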

Permission Errors

Ensure the container has write access to mounted volumes
Check Docker volume permissions

License

Easy Dataset is licensed under AGPL 3.0. See the LICENSE file for details.

Citation

If this work is helpful, please cite:

```
@misc{miao2025easydataset,
  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
  year={2025},
  eprint={2507.04009},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04009}
}
```
