amarnath3003/Dataset-Creator-App



Dataset Lab

A powerful, file-based dataset engineering system for creating, refining, and exporting high-quality instruction-style QA datasets.


Features • Quick Start • Overview • Screenshots • Troubleshooting


✨ Features

  • 🚀 Fully Automated Setup: One-command installer configures everything (Python venv, pip, npm) automatically.
  • 🛡️ Bulletproof CLI Runner: Custom-built terminal interface that gracefully handles older Windows terminals with safe unicode fallbacks—no more UnicodeEncodeError crashes.
  • 🧠 Local & Cloud LLMs: Seamlessly use local models via Ollama or cloud models via OpenAI/Anthropic APIs.
  • 📂 File-based Engineering: Ingest, chunk, generate, and refine high-quality datasets directly from your local documents.
  • 🌐 Cross-Platform: Works flawlessly on Windows, macOS, and Linux.

🚀 Quick Start

Dataset Lab is designed to be as easy to start as possible. Choose your OS below:

Windows

:: 1. Clone the repository
git clone https://github.com/amarnath123456789/Dataset-Creator-App.git
cd Dataset-Creator-App

:: 2. Run the quick-start batch file
start.bat

Tip: You can also simply double-click start.bat from your File Explorer to install dependencies and boot the servers without touching a terminal!

macOS / Linux

# 1. Clone the repository
git clone https://github.com/amarnath123456789/Dataset-Creator-App.git
cd Dataset-Creator-App

# 2. Run the quick-start shell script
bash start.sh

That's it! The app will open automatically in your browser at http://localhost:5173 🎉.


🖥️ CLI Reference

If you prefer to run components manually or view logs, use the built-in Python CLI runner.

python datasetlab.py <command>

| Command | Description |
|---------|-------------|
| start   | Starts the backend and frontend servers. |
| stop    | Gracefully stops all running servers. |
| status  | Shows live status and PIDs for running servers. |
| open    | Opens the Dataset Lab dashboard in your default browser. |
| logs    | Tails the recent console output from the servers. |

📖 Project Overview

Creating a high-quality dataset is a multi-step pipeline. Dataset Lab streamlines this process into four intuitive stages:

  1. Upload & Chunking: Ingest source documents (.pdf, .txt, .docx) and split them into context-aware chunks.
  2. Generation: Leverage local LLMs (via Ollama) or Cloud APIs to automatically generate Question-Answer pairs based on the chunks.
  3. Refinement: Clean, filter, edit, and format the generated records in an easy-to-use tabular UI.
  4. Export: Export the finalized dataset to .json, .jsonl, or .csv, perfectly formatted for supervised fine-tuning.
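
The first and last stages above can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual implementation: it assumes chunk size and overlap are measured in characters, and the function names `chunk_text` and `export_jsonl` as well as the `instruction`/`output` record fields are hypothetical.

```python
import json
from pathlib import Path


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list:
    """Split text into windows of `size` characters; neighbours share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
    return chunks


def export_jsonl(records: list, path: Path) -> None:
    """Write one JSON object per line, which is the .jsonl layout."""
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    chunks = chunk_text("lorem ipsum " * 200)
    # Hypothetical record shape; a real run would fill these from LLM output.
    records = [{"instruction": f"Question about chunk {i}", "output": c[:40]}
               for i, c in enumerate(chunks)]
    export_jsonl(records, Path("dataset.jsonl"))
```

A plain sliding window like this ignores sentence and paragraph boundaries, so it is only a rough stand-in for context-aware chunking.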

💻 System Requirements

| Requirement | Details |
|-------------|---------|
| OS          | Windows 10/11, macOS (M1/M2/Intel), Linux |
| Python      | 3.9+ |
| Node.js     | 18+ |
| RAM         | 8GB minimum (16GB+ recommended if running local LLMs) |
| Disk        | 2GB+ for the app (additional 4–10GB per local model) |
| Ollama      | Optional; required only for offline local LLMs |

ℹ️ Note: The install.py script automatically checks these requirements and will warn you if anything is missing.


🤖 Using Local LLMs (Ollama)

Dataset Lab has native integration with Ollama for 100% offline dataset generation.

  1. Install Ollama from ollama.com.
  2. Download and start a model from your terminal (ollama run pulls the model on first use):
    ollama run llama3
  3. Dataset Lab will automatically detect your local Ollama instance running at http://localhost:11434.
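
To confirm that an Ollama instance is reachable before starting a generation run, you can query its HTTP API yourself. The sketch below calls Ollama's `GET /api/tags` endpoint, which lists locally installed models. The helper names are hypothetical, and this is not necessarily how Dataset Lab performs its own detection.

```python
import json
import urllib.request
from typing import List, Optional

OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL,
                      timeout: float = 2.0) -> Optional[List[str]]:
    """Return installed model names, or None if Ollama cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):  # connection refused, timeout, or invalid JSON
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama is not reachable at", OLLAMA_URL)
    else:
        print("Installed models:", models or "none yet")
```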

⚙️ Configuration (Environment Variables)

The installer generates a .env file for you automatically. You can edit it manually at dataset-lab/.env if you need to tweak defaults or add API keys:

# Document Processing Defaults
DEFAULT_CHUNK_SIZE=800
DEFAULT_CHUNK_OVERLAP=100
DEFAULT_SIMILARITY_THRESHOLD=0.92

# Cloud LLM API Keys (Optional — Only needed for Cloud generation)
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
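
If you edit these values by hand, a quick sanity check can catch typos before a long run. The following sketch parses `KEY=VALUE` lines and validates the three processing defaults. The helper names are hypothetical, and the rules it enforces (overlap smaller than chunk size, threshold between 0 and 1) are assumptions about how the values are interpreted, not documented behavior.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def chunking_settings(env: dict) -> tuple:
    """Return (chunk_size, chunk_overlap, similarity_threshold) with basic checks."""
    size = int(env.get("DEFAULT_CHUNK_SIZE", "800"))
    overlap = int(env.get("DEFAULT_CHUNK_OVERLAP", "100"))
    threshold = float(env.get("DEFAULT_SIMILARITY_THRESHOLD", "0.92"))
    if not 0 <= overlap < size:
        raise ValueError("DEFAULT_CHUNK_OVERLAP must be >= 0 and below DEFAULT_CHUNK_SIZE")
    if not 0.0 <= threshold <= 1.0:  # assumed range for a similarity score
        raise ValueError("DEFAULT_SIMILARITY_THRESHOLD must be between 0 and 1")
    return size, overlap, threshold


if __name__ == "__main__":
    env_path = Path("dataset-lab/.env")
    if env_path.is_file():
        print(chunking_settings(parse_env(env_path.read_text(encoding="utf-8"))))
```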

🎞️ Screenshots

Click to view Dashboard Screenshots

🏗️ Manual Setup (Advanced)


If you prefer to bypass the installer and set up the servers manually:

# Backend setup (activate the venv with the line that matches your OS)
cd dataset-lab
python -m venv .venv
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # macOS/Linux
pip install -r backend/requirements.txt

# Frontend setup
cd frontend
npm install
cd ..

# Run backend (Terminal 1, from dataset-lab/ with the venv activated)
python -m backend.main

# Run frontend (Terminal 2, from dataset-lab/frontend/)
npm run dev

📁 Folder Structure

Dataset-Creator-App/
├── install.py          ← One-command installer
├── datasetlab.py       ← CLI runner (start/stop/status/open/logs)
├── start.bat           ← Windows double-click starter
├── start.sh            ← macOS/Linux shell starter
└── dataset-lab/
    ├── backend/        ← FastAPI backend (LLM engines, API endpoints)
    ├── frontend/       ← React / Vite frontend
    ├── projects/       ← Generated project data
    ├── .venv/          ← Python virtual environment (Created by installer)
    ├── .logs/          ← Server logs (Created on start)
    └── .env            ← Global config (Created by installer)

🚑 Troubleshooting & Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| python install.py fails | Missing dependencies | Ensure Python 3.9+ and Node 18+ are installed and added to your system PATH. |
| Pipeline stuck in "Running" | Ghost processes | Delete the .running or .stop files inside dataset-lab/projects/<project>/. |
| Cannot connect to Ollama | Engine not running | Run ollama run llama3 in your terminal and verify it serves at http://localhost:11434. |
| Port 8000/5173 in use | Conflicting process | Stop the conflicting process or change the port in backend/main.py and vite.config.js. |
| ModuleNotFoundError | Missing Python package | Run python install.py again to reinstall dependencies. |
| npm error: … | Missing Node modules | Run python install.py again to reinstall frontend packages. |
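
Two of the fixes above are easy to script. The sketch below checks whether a TCP port already has a listener and removes leftover .running/.stop marker files from a project folder. The helper names are hypothetical, the port check covers only the IPv4 loopback address, and the project name in the usage section is a placeholder.

```python
import socket
from pathlib import Path


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def clear_stale_markers(project_dir: Path) -> list:
    """Delete .running/.stop files in a project folder; return the names removed."""
    removed = []
    for name in (".running", ".stop"):
        marker = project_dir / name
        if marker.is_file():
            marker.unlink()
            removed.append(name)
    return removed


if __name__ == "__main__":
    for port in (8000, 5173):
        print(port, "in use" if port_in_use(port) else "free")
    # Placeholder project name; only run this when no pipeline is active.
    print(clear_stale_markers(Path("dataset-lab/projects/my-project")))
```

Only clear the marker files once you are sure no pipeline is still running for that project.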

If the backend or frontend crashes unexpectedly, run python datasetlab.py logs to see what went wrong.


🤝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature/awesome-feature
  3. Commit your changes with clear messages.
  4. Push to the branch and open a Pull Request.

Thank you for using Dataset Lab! Found a bug? Open an issue.

About

Create datasets, fine-tune models, and test models, all in one place. Everything simplified.
