Easy Dataset
This service deploys Easy Dataset, a powerful tool for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.
Services
easy-dataset: The main Easy Dataset application server with built-in SQLite database.
Environment Variables
| Variable Name | Description | Default Value |
|---|---|---|
| EASY_DATASET_VERSION | Easy Dataset image version | 1.5.1 |
| EASY_DATASET_PORT_OVERRIDE | Host port mapping for web interface | 1717 |
| TZ | System timezone | UTC |
Please create a .env file and modify it as needed for your use case.
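Compose files commonly apply defaults like those in the table through `${VAR:-default}` interpolation. The sketch below mirrors that rule in plain shell so you can preview which values a given environment would produce. It assumes that interpolation style is what the compose file uses, and the function name `effective_settings` is hypothetical, not part of Easy Dataset.

```shell
# Hypothetical helper: print the settings that would apply, falling back to
# the documented defaults whenever a variable is unset or empty.
# The body runs in a subshell so it does not leak variables to the caller.
effective_settings() (
  version="${EASY_DATASET_VERSION:-1.5.1}"
  port="${EASY_DATASET_PORT_OVERRIDE:-1717}"
  tz="${TZ:-UTC}"
  printf 'version=%s port=%s tz=%s\n' "$version" "$port" "$tz"
)

effective_settings
```

Running it with `EASY_DATASET_PORT_OVERRIDE=8080` set in the shell shows the override taking effect while the other values keep their defaults.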
Volumes
easy_dataset_db: A named volume for storing the SQLite database and uploaded files.
easy_dataset_prisma: (Optional) A named volume for Prisma database files if needed.
Getting Started
Quick Start (Recommended)
- (Optional) Create a .env file to customize settings:
      EASY_DATASET_VERSION=1.5.1
      EASY_DATASET_PORT_OVERRIDE=1717
      TZ=Asia/Shanghai
- Start the service:
      docker compose up -d
- Access Easy Dataset at http://localhost:1717
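The container may take a few seconds to start answering after `docker compose up -d` returns. As a minimal sketch, assuming `curl` is installed and the web interface responds on the mapped port, the hypothetical helper `wait_for_http` below polls a URL until it answers or the attempts run out. The attempt count and delay are arbitrary illustrative defaults.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_http() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: fail on HTTP errors (>= 400), -s: quiet, -o: discard the body
    if curl -fs -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_http http://localhost:1717 && echo "Easy Dataset is up"` prints the message once the service responds, and returns a non-zero status if it never does.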
With Prisma Database Mount (Advanced)
If you need to mount the Prisma database files:
- Initialize the database first:
      # Clone the repository and initialize the database
      git clone https://github.com/ConardLi/easy-dataset.git
      cd easy-dataset
      npm install
      npm run db:push
- Uncomment the Prisma volume mount in docker-compose.yaml:
      volumes:
        - easy_dataset_db:/app/local-db
        - easy_dataset_prisma:/app/prisma  # Uncomment this line
- Start the service:
      docker compose up -d
Features
- Intelligent Document Processing: Supports PDF, Markdown, DOCX, and more
- Smart Text Splitting: Multiple algorithms with customizable segmentation
- Question Generation: Automatically extracts relevant questions from text
- Domain Labels: Builds global domain labels with understanding capabilities
- Answer Generation: Uses LLM APIs to generate comprehensive answers and Chain of Thought (COT)
- Flexible Editing: Edit questions, answers, and datasets at any stage
- Multiple Export Formats: Alpaca, ShareGPT, multilingual-thinking (JSON/JSONL)
- Wide Model Support: Compatible with any LLM API that follows the OpenAI format
Usage Workflow
- Create a Project: Set up a new project with LLM API configuration
- Upload Documents: Add your domain-specific files (PDF, Markdown, etc.)
- Text Splitting: Review and adjust automatically split text segments
- Generate Questions: Generate questions in batches from text blocks
- Create Datasets: Generate answers using the configured LLM
- Export: Export datasets in your preferred format
Default Credentials
Easy Dataset does not require authentication by default. Access control should be implemented at the infrastructure level (e.g., reverse proxy, firewall rules).
Resource Limits
The service is configured with the following resource limits:
- CPU: 0.5-2.0 cores
- Memory: 1-4 GB
These limits can be adjusted in docker-compose.yaml based on your workload requirements.
Security Considerations
- Data Privacy: Files and the database stay on your own host; text used for question and answer generation is sent to whichever LLM API you configure
- API Keys: Store LLM API keys securely within the application
- Access Control: Implement network-level access restrictions as needed
- Updates: Regularly update to the latest version for security patches
Documentation
- Official Documentation: https://docs.easy-dataset.com/
- GitHub Repository: https://github.com/ConardLi/easy-dataset
- Video Tutorial: Bilibili
- Research Paper: arXiv:2507.04009
Troubleshooting
Container Won't Start
- Check logs:
      docker compose logs easy-dataset
- Verify port 1717 is not already in use
- Ensure sufficient system resources
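One way to check the port is sketched below. It assumes `lsof` is installed; the helper name `port_in_use` is hypothetical. If `lsof` is missing, or the listener belongs to another user and you are not root, the check reports the port as free, so treat a "free" result as a hint, not proof.

```shell
# Hypothetical helper: succeeds if some process is listening on the TCP port.
# -n and -P skip host and port name lookups; -sTCP:LISTEN keeps listeners only.
port_in_use() {
  lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

if port_in_use 1717; then
  echo "Port 1717 is busy; set EASY_DATASET_PORT_OVERRIDE to a free port."
else
  echo "Port 1717 looks free."
fi
```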
Database Issues
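- For SQLite issues, remove and recreate the volume. The -v flag deletes the project's named volumes, so the database and uploaded files are lost; back up anything you need first:
      docker compose down -v
      docker compose up -d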
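Because removing the volume discards the database and uploads, it can be worth archiving it beforehand. The following is a sketch under stated assumptions, not an official procedure: `backup_volume` is a hypothetical helper, the `alpine` image is just a convenient small image that ships `tar`, and the real volume name is usually prefixed with your Compose project name (check `docker volume ls`).

```shell
# Hypothetical helper: archive a named Docker volume into the current directory.
# Usage: backup_volume VOLUME_NAME
backup_volume() {
  if [ -z "${1:-}" ]; then
    echo "usage: backup_volume VOLUME_NAME" >&2
    return 2
  fi
  # Mount the volume read-only and tar its contents from a throwaway container.
  docker run --rm \
    -v "$1":/data:ro \
    -v "$(pwd)":/backup \
    alpine tar czf "/backup/$1.tgz" -C /data .
}
```

For a project named `myproject` (a placeholder), `backup_volume myproject_easy_dataset_db` would write `myproject_easy_dataset_db.tgz` to the current directory.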
Permission Errors
- Ensure the container has write access to mounted volumes
- Check Docker volume permissions
License
Easy Dataset is licensed under AGPL 3.0. See the LICENSE file for details.
Citation
If this work is helpful, please cite:
@misc{miao2025easydataset,
title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
year={2025},
eprint={2507.04009},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.04009}
}